LLM Pipelines: Seamless Integration on Embedded Devices
Enzo RUEDAS, AI Engineering Student, NXP Semiconductors
Large Language Models (LLMs) and broader Generative Artificial Intelligence have gained increasing prominence in the AI landscape. Various initiatives, including Hugging Face and libraries such as GGML, have played a crucial role in facilitating the accessibility and development of LLMs. Nevertheless, deploying such models on embedded devices remains extremely challenging, given the inherent constraints of computational power and memory. NXP’s LLM Pipelines project aims to enhance user experience with LLMs on embedded devices, making deployment more accessible and improving human-machine interactions.
This presentation details our solutions for improving LLM porting through quantization and fine-tuning. In particular, our experiments focus on high-end NXP MPUs, such as:
- i.MX 8M Plus featuring a 4x Arm Cortex-A53 Processor and a Neural Processing Unit (NPU)
- i.MX 93 featuring a 2x Arm Cortex-A55 Processor and an NPU
- i.MX 95 featuring a 6x Arm Cortex-A55 Processor and an NPU
When deploying AI models in resource-constrained environments, machine learning quantization techniques offer several significant benefits, including reductions in model size and memory footprint, as well as faster execution time. However, most integer quantization techniques can result in significant accuracy drops, especially in auto-regressive models. The LLM Pipelines project features advanced quantization algorithms, encompassing model compression, dynamic quantization, and the latest post-training static quantization techniques. Our presentation will focus on comparing these different approaches.
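To illustrate the size/accuracy trade-off the talk compares, here is a minimal sketch of symmetric per-tensor int8 post-training quantization. The helper names are illustrative only; real deployments rely on framework tooling (e.g. GGML or ONNX quantizers), not this toy code.

```python
def quantize_int8(weights):
    """Map float weights to int8 using a single symmetric scale.

    Each float is replaced by a 1-byte integer (a 4x size reduction
    versus float32), at the cost of a small rounding error.
    """
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the int8 values."""
    return [v * scale for v in q]

# Toy weight tensor: quantize, dequantize, and measure the worst-case error.
weights = [0.42, -1.27, 0.05, 0.9981, -0.3]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
```

Dynamic quantization follows the same idea but computes activation scales at runtime, while static post-training quantization fixes them in advance using calibration data; the latter removes runtime overhead but is where the accuracy drops mentioned above tend to appear.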
On the other hand, most use-cases for embedded LLMs necessitate specialization, either to limit computational costs and usage or to mitigate hallucinations and biases. For example, a car assistant should focus on assisting the driver with vehicle-related tasks, avoiding unrelated topics like politics. Using Retrieval Augmented Generation (RAG), we explore various fine-tuning scenarios for the smart assistant, utilizing user manual knowledge or even interacting with machine sensors. This presentation will address different RAG-related challenges, including constraining the input prompt to meet hardware requirements and handling out-of-topic queries.
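The two RAG challenges named above can be sketched in a few lines: rank user-manual chunks by relevance, reject out-of-topic queries below a threshold, and trim the retrieved context to a fixed budget before prompting. The chunks, scoring function, and thresholds below are illustrative assumptions, not NXP's implementation (which would use embedding-based retrieval rather than word overlap).

```python
# Hypothetical user-manual snippets standing in for a real chunked document.
MANUAL_CHUNKS = [
    "To adjust the climate control, use the dial on the center console.",
    "The lane assist feature can be toggled from the settings menu.",
    "Tire pressure warnings appear on the dashboard display.",
]

def score(query, chunk):
    """Crude relevance score: fraction of query words found in the chunk."""
    q_words = set(query.lower().split())
    c_words = set(chunk.lower().split())
    return len(q_words & c_words) / len(q_words)

def build_prompt(query, budget_words=20, min_score=0.2):
    """Retrieve the best chunk, filter off-topic queries, cap prompt length."""
    best = max(MANUAL_CHUNKS, key=lambda ch: score(query, ch))
    if score(query, best) < min_score:
        return None  # out-of-topic: refuse rather than risk hallucinating
    context = " ".join(best.split()[:budget_words])  # enforce prompt budget
    return f"Context: {context}\nQuestion: {query}"
```

An on-topic query ("how do I adjust the climate control") retrieves the climate-control chunk, while an off-topic one ("what do you think about politics") scores below the threshold and returns `None`, letting the assistant decline gracefully instead of generating an unconstrained answer.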