LLM Prompt FORMATS make or break your LLM and RAG

5,190 views

code_your_own_AI

5 months ago

LLM Prompt formatting essentially concerns the way in which input data or questions are structured and presented to LLMs or VLMs. The sensitivity of LLMs to prompt formatting is a complex phenomenon. Subtle changes in the format of the prompts, such as variations in phrasing, the use of different punctuation marks, alterations in the layout, or even the presence or absence of specific keywords, can significantly influence the model's response. This sensitivity extends to the nuances of language and structure, making it a crucial aspect for developers and users of these models to consider.
The impact of prompt formatting on LLM performance cannot be overstated. A prompt formatted one way might elicit a highly accurate and relevant response from a model, while a slight alteration in formatting could lead to a response that is off-target or less precise. This variability poses a significant challenge, particularly in applications where the accuracy and consistency of the model's output are paramount.
One of the ongoing challenges in this field is the lack of standardization in prompt formatting. Given the diversity of LLM applications and the varying nature of tasks, there is no universal prompt format that guarantees optimal performance across all scenarios. This lack of standardization means that developers and users must often experiment with different prompt structures to find the most effective format for their specific application, which can be a time-consuming and resource-intensive process.
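As a minimal sketch of such an experiment, one can grid-search over format variants and score the answers; the query_model helper, the format variants, and the exact-match scoring below are illustrative assumptions, not a fixed recipe:

from itertools import product

def query_model(prompt: str) -> str:
    # Placeholder: swap in a real call to whatever LLM client you use.
    return "Paris"

separators = [": ", ":\n", " - ", "\n\n"]
casings = [str.lower, str.upper, str.title]
question = "What is the capital of France?"
expected = "Paris"

results = {}
for sep, case in product(separators, casings):
    # Build one format variant, e.g. "QUESTION: ... ANSWER:" vs "question - ..."
    prompt = f"{case('Question')}{sep}{question}{sep}{case('Answer')}{sep}"
    answer = query_model(prompt)
    results[(sep, case.__name__)] = expected.lower() in answer.lower()

for variant, correct in sorted(results.items()):
    print(variant, "->", "ok" if correct else "miss")

In practice the score would come from a task-specific benchmark rather than a single question, but even a toy grid like this makes the format sensitivity measurable.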
The issue of prompt formatting has wide-ranging implications for the development and practical application of LLMs. It necessitates a deep understanding of the model's internal workings, including how it processes and interprets different types of input. Addressing this challenge is not only crucial for enhancing the efficiency and effectiveness of these models but also for ensuring their reliability and utility in real-world applications.
Consequently, the field of AI and natural language processing is actively engaged in research to address the challenges posed by prompt formatting. Efforts are being made to develop LLMs that are less sensitive to changes in prompt structure or to devise methodologies for creating the most effective prompts. This research involves exploring various linguistic, contextual, and structural factors and their impact on model performance.
In conclusion, while LLMs represent a significant stride forward in AI, the nuanced challenge of prompt formatting remains a critical area for ongoing research and development. Addressing this issue is key to unlocking the full potential of these sophisticated language models and ensuring their effective deployment across a wide range of applications. Prompt formatting will be as important as classical prompt engineering, since the optimal prompt format is specific to each individual LLM, and even to the particular fine-tuning methodology used for that LLM.
Prompt Format optimized RAG:
Not to mention the importance of prompt formats for RAG systems, where old-fashioned prompt templates and prompt formats that were never specifically optimized are still in operation (judging by open-source RAG code). Maybe also ask your professional, proprietary RAG provider about their specific prompt format optimizer and on which LLMs it has been evaluated, including the latest Mixture-of-Experts (MoE) systems or even the latest merged LLMs built with the classic MergeKit.py.
How does a merged LLM behave, given an untested prompt format optimization for merged LLMs / merged VLMs?
Your LLM RAG performance has the potential for a significant boost!
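As a rough sketch of what prompt-format-optimized RAG could look like in code, the template can be treated as a per-model, tunable parameter rather than a hard-coded string; the template texts and names below are assumptions for illustration:

# Sketch: per-model RAG prompt templates instead of one fixed string.
RAG_TEMPLATES = {
    "llama2-chat": (
        "[INST] Use the context below to answer.\n"
        "Context:\n{context}\n\nQuestion: {question} [/INST]"
    ),
    "plain": "Context:\n{context}\n\nQuestion: {question}\nAnswer:",
}

def build_rag_prompt(model_name: str, context: str, question: str) -> str:
    # Pick the format variant that was evaluated for this specific model.
    return RAG_TEMPLATES[model_name].format(context=context, question=question)

prompt = build_rag_prompt(
    "plain",
    context="Mixtral is a sparse Mixture-of-Experts model.",
    question="What kind of model is Mixtral?",
)
print(prompt)

The design point is that the template becomes something you can swap and benchmark per LLM, exactly like any other hyperparameter.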
Scientific literature (rights with the authors):
Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design or: How I Learned to Start Worrying about Prompt Formatting
arxiv.org/pdf/2310.11324.pdf
#airesearch
#aieducation
#formatting

Comments: 31
@BradleyKieser
@BradleyKieser 5 months ago
This has to be one of the most important videos for anyone using AIs as they are today. Who knew that changes to a prompt as tiny as a single space could affect the LLM's performance so much? Stunning discovery and really important information. There's a whole field of prompt engineering still to be discovered.
@vijaynadkarni
@vijaynadkarni 24 days ago
This is an incredible video. Entirely by accident I found that the way I was formatting my prompts was causing major variations in the nature of responses I was getting from the LLM or RAG model. But I wasn't clear at all on what types of formatting resulted in the different types of responses. This video confirms that the formatting does have a huge impact on the quality of responses one gets from the LLM and has saved me a great deal of experimentation. Thank you so much!
@BradleyKieser
@BradleyKieser 5 months ago
Cannot thank you enough for finding the failure points as well as the success points for prompts. This is incredibly useful information.
@code4AI
@code4AI 5 months ago
Glad it was helpful!
@Canna_Science_and_Technology
@Canna_Science_and_Technology 5 months ago
Wow! Thank you so much for this video. I built my company’s local RAG system with the thought of swapping out LLMs as they improve over time. I didn’t even think of this. I’ll start implementing this today.
@beddows
@beddows 5 months ago
Great video! I've been fine-tuning for use cases with structured inputs and outputs, and it works very well (at least for gpt-3.5-turbo). Even the wording of the system prompt can influence the output, as can making explicit associations between the system, user, and assistant prompts (in the training data).
@NitinKumar-xz8cz
@NitinKumar-xz8cz 5 months ago
This video is so timely! I have been experimenting with structured output generation and I experienced the same sensitivity. I just hacked my way to a prompt that gave me good performance. In fact, how you present the few-shot examples to the model has a huge impact on the model outputs. I have found (for llama2 and mistral models) that it works well to provide the few-shot examples in the following format:

[INST] Can you briefly explain the task described to you? [/INST] [INST][/INST]
[INST] Explain why your response is correct [/INST]
[INST] Are you ready to process another input? [/INST] Sure! I am waiting for the input
[INST] [/INST]

This essentially mimics a chain-of-thought process where the thoughts are sampled from the model that is going to be fine-tuned. This has given me relatively good success.
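As a hedged sketch of how such an [INST]-wrapped few-shot prompt could be assembled in Python (the format_turns helper and the turn contents are illustrative placeholders, not the commenter's actual data):

def format_turns(turns):
    # Each turn is (instruction, model_response); the response follows [/INST].
    return " ".join(
        f"[INST] {inst} [/INST] {resp}".strip() for inst, resp in turns
    )

few_shot = format_turns([
    ("Can you briefly explain the task described to you?",
     "I will label each input sentence with its sentiment."),
    ("Explain why your response is correct",
     "The label follows directly from the sentiment words in the sentence."),
    ("Are you ready to process another input?",
     "Sure! I am waiting for the input"),
])
print(few_shot)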
@tintintintin576
@tintintintin576 5 months ago
Interesting. Thank you so much for sharing the tips!
@tvwithtiffani
@tvwithtiffani 5 months ago
I really enjoyed your self-reflection at the end. It encourages me to keep track of all of my own experiments in a more efficient way.
@loicbaconnier9150
@loicbaconnier9150 5 months ago
Awesome! I also want to point out that:
- Maybe the lost-in-the-middle effect comes from training datasets where long texts always include an introduction and a conclusion. If the queries seen during training mostly relate to information in the intro or the conclusion, the development (in the middle) will not be relevant...
- To understand why it's not the training format that matters most, just look at what nobody ever looks at: the tokenizer and the token corpus.
- To conclude: just use special tokens for prompt training and you will improve your final model.
One last idea: we should build MoE with LoRAs, and the same for RAG, a kind of Mixture-of-RAG-experts.
Regards, Loic
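A minimal sketch of the special-token suggestion, using the Hugging Face transformers API; the small model and the token strings are assumptions purely for illustration, and the same pattern applies to Llama/Mistral-class models:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Small model for illustration; the token strings are made-up examples.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

num_added = tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<|prompt|>", "<|response|>"]}
)
# Grow the embedding matrix to cover the newly added token ids.
model.resize_token_embeddings(len(tokenizer))
print(f"Added {num_added} special tokens; vocab size is now {len(tokenizer)}")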
@AdamTwardoch
@AdamTwardoch 5 months ago
Pretty much every LLM API has a large set of parameters: temperature, max output length, top P, [top K], frequency penalty, presence penalty. Shrink-wrapped UIs like ChatGPT don't give access to these. The defaults also differ between APIs: sometimes temperature is set to 1, sometimes to 0.8. Some experiments I've done indicate that changing these parameters has a serious impact on the results, but I've hardly ever seen benchmarks, papers, or videos that discuss this. As far as I can tell, most LLM benchmarks only test the "default" settings. I'd love to see more in-depth experiments that compare models while varying these parameters. The community has been trying a lot of elaborate optimizations to get the most desired results out of LLMs, but my partial experiments suggest there's a fair bit of untapped potential in the model parameters.
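A small sketch of such a parameter sweep, assuming the OpenAI Python client; the model name, the prompt, and the parameter grid are illustrative choices:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
messages = [{"role": "user", "content": "Name three uses of a paperclip."}]

for temperature in (0.0, 0.8, 1.2):
    for top_p in (0.5, 1.0):
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",  # illustrative model choice
            messages=messages,
            temperature=temperature,
            top_p=top_p,
            max_tokens=100,
        )
        print(f"T={temperature} top_p={top_p}:",
              response.choices[0].message.content)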
@jimgsewell
@jimgsewell 5 months ago
Thank you for pointing out this important information. So, what we need is some sort of standardized tool that automates running through all the prompt variations and provides a score. An open source tool that we can run against whichever model we choose, and lets us know which format is best for that specific model. Oh, and I’m not smart enough to build such a tool.
@omountassir
@omountassir 5 months ago
Wow! Such a great discovery!
@markburton5318
@markburton5318 5 months ago
Thanks for highlighting this! I used a separator recently and performance dropped off. I assumed the separator was invalid or something and changed it, but I see now it was probably just the format. I suppose a simple space is just fewer tokens: a very common token, effectively carrying no information, and less likely to mess up next-word prediction.
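A quick way to check this intuition is to count the tokens each separator costs; a minimal sketch with a Hugging Face tokenizer, where the tokenizer choice and separator strings are illustrative:

from transformers import AutoTokenizer

# Illustrative tokenizer; results differ per model's token corpus.
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

for sep in [" ", " :: ", "\n---\n", " ||| "]:
    ids = tokenizer.encode(f"field_a{sep}field_b", add_special_tokens=False)
    print(repr(sep), "->", len(ids), "tokens")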
@tintintintin576
@tintintintin576 5 months ago
👏👏
@spirobel2.0
@spirobel2.0 5 months ago
That means we have the same problem for RAG: if the search results / whatever text is brought into the context are formatted differently, the performance will be affected. I wonder if this even applies to embeddings. Do you use embeddings or plain text for your RAG system?
@kamleshpaul414
@kamleshpaul414 5 months ago
For fine-tuning formats, is it good to use the tokenizer's chat template, or a custom format like the one you are showing?
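For reference, a minimal sketch of what using the tokenizer's built-in chat template looks like in transformers (requires transformers >= 4.34; the model name is an illustrative choice):

from transformers import AutoTokenizer

# Any chat-tuned model that ships a chat template works here.
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")

messages = [{"role": "user", "content": "What format does this model expect?"}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)  # reveals the exact [INST] ... [/INST] wrapping used in tuning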
@MadhavanSureshRobos
@MadhavanSureshRobos 5 months ago
I understand the importance of prompt engineering, especially for small LLMs. But it is my conviction that we'll soon have smaller LLMs outperforming, or at least performing as well as, today's large LLMs.
@avkashav
@avkashav 4 months ago
Do you think GPT-4 would be any better at handling arbitrary prompt formats?
@everyoneisodd
@everyoneisodd 5 months ago
Is this limited to RAG applications, or would it also apply to information-extraction tasks on unstructured data?
@wdonno
@wdonno 5 months ago
Do you plan to let us follow along as you fine-tune your models again? For the less skilled, like myself (I just pretend to be 'younger' to save face!), it is a long road to start from the beginning of your series and relearn all the evolutionary steps!
@heist7539
@heist7539 5 months ago
Hello there, Ben here. Hope I am not getting off topic, but could you make a video about neuro-symbolic programming, if it's within your scope? I only ask after seeing what the Rabbit R1 is capable of ("LAM, an AI OS"). Thank you in advance!
@justindressler5992
@justindressler5992 5 months ago
I've started to move away from instruct-based models; their output is too unpredictable. Of the instruct models, the Alpaca-tuned ones seem to be the most predictable. The problem is that instruct models can give the most creative results, so unless you're using a pre-trained model trained on the particular type of result you want (e.g. one tuned for programming or one for creative writing), you need to use instruct models to get dynamic results from the base model. The tricky bit is priming the model with a prompt that gets the result you want; this can be drastically different depending on the foundation model, e.g. for SOLAR vs. LLaMA the instruction prompts can be completely different. I think this is why instruct models are problematic, especially when considering MoE-style models like Mixtral. But it is cool that the formatting plays a big role; this explains a lot of why these models are so sensitive to the prompt. As for merged models, it is even harder to get predictable results from instruct tuning. These models probably need to be tuned over more epochs to build coherent results from prompt instructions, a little like TinyLLAMA.
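To make the "completely different prompts" point concrete, here is a sketch of the same instruction wrapped in two well-known public conventions, Alpaca-style and Llama-2 chat-style; the system text and instruction are placeholders:

# The same instruction in two common prompt formats.
instruction = "Summarize the following paragraph in one sentence."

alpaca_prompt = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    f"### Instruction:\n{instruction}\n\n### Response:\n"
)

llama2_chat_prompt = (
    "[INST] <<SYS>>\nYou are a helpful assistant.\n<</SYS>>\n\n"
    f"{instruction} [/INST]"
)

print(alpaca_prompt)
print(llama2_chat_prompt)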
@tintintintin576
@tintintintin576 5 months ago
I also noticed that RoBERTa outperformed Llama2 on a classification task. And RoBERTa didn't just win on accuracy: it was also better in training time and memory footprint.
@tintintintin576
@tintintintin576 5 months ago
There's a blog post on this by a respected researcher on the Hugging Face blog. I was intrigued, performed a few tests on different data, and damn, I was stunned!
@Steponlyone
@Steponlyone 5 months ago
Did you remove/unlist your ICL video?
@code4AI
@code4AI 5 months ago
Given the high variance in the results, I started to look for explanations before publishing it, and this video came first. Smile. But I plan to build my prompt optimizer for my LLMs and then maybe publish the performance difference between the standard prompt templates from RAG systems and prompt-format-optimized versions.
@Steponlyone
@Steponlyone 5 months ago
@@code4AI makes sense
@robertfontaine3650
@robertfontaine3650 5 months ago
Ow. But yes.