
GenAI on the Edge Forum: LLM Pipelines: Seamless Integration on Embedded Devices


tinyML Foundation


LLM Pipelines: Seamless Integration on Embedded Devices
Enzo RUEDAS, AI Engineering Student, NXP Semiconductors
Large Language Models (LLMs) and, more broadly, Generative Artificial Intelligence have gained increasing prominence in the AI landscape. Initiatives such as Hugging Face and libraries such as GGML have played a crucial role in making LLMs accessible and easier to develop with. Nevertheless, deploying such models on embedded devices remains extremely challenging, given their inherent constraints on computational power and memory. NXP's LLM Pipelines project aims to enhance the user experience with LLMs on embedded devices, making deployment more accessible and improving human-machine interaction.
This presentation details our solutions for improving LLM porting through quantization and fine-tuning. In particular, our experiments focus on high-end NXP MPUs, such as:
- i.MX 8M Plus, featuring a quad-core Arm Cortex-A53 processor and a Neural Processing Unit (NPU)
- i.MX 93, featuring a dual-core Arm Cortex-A55 processor and an NPU
- i.MX 95, featuring a six-core Arm Cortex-A55 processor and an NPU
When deploying AI models in resource-constrained environments, machine learning quantization techniques offer several significant benefits, including reductions in model size and memory footprint as well as faster execution. However, most integer quantization techniques can cause significant accuracy drops, especially in auto-regressive models. The LLM Pipelines project features advanced quantization algorithms, encompassing model compression, dynamic quantization, and the latest post-training static quantization techniques. Our presentation will focus on comparing these different approaches.
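To make the distinction concrete, below is a minimal sketch of post-training dynamic quantization using PyTorch's stock quantize_dynamic API. This is not NXP's LLM Pipelines tooling (which the abstract does not show); the model choice is an arbitrary small causal LM used purely for illustration.

```python
import torch
from torch.ao.quantization import quantize_dynamic
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model: any small causal LM with nn.Linear layers works here.
model_id = "facebook/opt-125m"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).eval()

# Dynamic quantization: Linear weights are converted to int8 offline, while
# activations are quantized on the fly at inference time, so no calibration
# dataset is needed. Post-training *static* quantization additionally fixes
# activation scales offline from calibration data, one of the trade-offs
# the talk compares.
qmodel = quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)

# Sanity check: the quantized model still produces next-token logits.
inputs = tokenizer("The NPU on the i.MX 93 accelerates", return_tensors="pt")
with torch.inference_mode():
    logits = qmodel(**inputs).logits
print(logits.shape)  # (1, sequence_length, vocab_size)
```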
On the other hand, most use cases for embedded LLMs require specialization, either to limit computational cost and usage or to mitigate hallucinations and biases. For example, a car assistant should focus on assisting the driver with vehicle-related tasks, avoiding unrelated topics like politics. Using Retrieval-Augmented Generation (RAG), we explore various fine-tuning scenarios for the smart assistant, drawing on user-manual knowledge or even interacting with machine sensors. This presentation will address several RAG-related challenges, including constraining the input prompt to meet hardware requirements and handling out-of-topic queries.
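As an illustration of those two challenges, here is a minimal, self-contained sketch of a RAG prompt builder with a hard token budget and an off-topic rejection threshold. Everything in it is an assumption for demonstration, not the pipeline from the talk: the manual snippets are invented, the word-overlap scorer stands in for embedding similarity, and the budget and threshold values are arbitrary.

```python
MAX_PROMPT_TOKENS = 512  # assumed hardware-driven context budget
MIN_RELEVANCE = 0.2      # assumed threshold below which a query is off-topic

# Stand-in knowledge base: chunks of a hypothetical vehicle user manual.
MANUAL_CHUNKS = [
    "To adjust the side mirrors, use the control pad on the driver door.",
    "Tire pressure should be checked monthly and kept at 2.4 bar.",
    "Adaptive cruise control is enabled with the stalk behind the wheel.",
]

def score(query: str, chunk: str) -> float:
    """Naive word-overlap relevance; a real pipeline would use embeddings."""
    q, c = set(query.lower().split()), set(chunk.lower().split())
    return len(q & c) / max(len(q), 1)

def build_prompt(query: str) -> str | None:
    ranked = sorted(MANUAL_CHUNKS, key=lambda ch: score(query, ch), reverse=True)
    if score(query, ranked[0]) < MIN_RELEVANCE:
        return None  # out-of-topic query: refuse rather than hallucinate
    # Pack the best-ranked chunks until the token budget is exhausted;
    # a whitespace split stands in for the model's real tokenizer.
    context, used = [], 0
    for chunk in ranked:
        n = len(chunk.split())
        if used + n > MAX_PROMPT_TOKENS:
            break
        context.append(chunk)
        used += n
    return ("Answer using only this excerpt from the user manual:\n"
            + "\n".join(context)
            + f"\n\nQuestion: {query}\nAnswer:")

print(build_prompt("How do I adjust the side mirrors?"))  # builds a prompt
print(build_prompt("What do you think about politics?"))  # None (off-topic)
```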
