In-Context Learning: EXTREME vs Fine-Tuning, RAG

3,802 views

code_your_own_AI

2 months ago

Fill the complete context length with many-shot examples and evaluate the performance! Great new insights, although the extreme scaling happens on open-source LLMs with extended context length, not on the latest models with 1 million token context, like Gemini 1.5 Pro (see my new, upcoming video in the next few days).
A comprehensive analysis of in-context learning (ICL) when extended to long-context models, examining its performance and characteristics. A key finding is that ICL performance continues to improve with the inclusion of hundreds or even thousands of demonstrations, surpassing traditional fine-tuning methods in certain scenarios. This improvement is not merely additive but is driven largely by the model's ability to attend to more relevant examples during inference.
For datasets with extensive label spaces, the gains from increasing context length are particularly pronounced. Additionally, the study highlights that while retrieval methods show diminishing returns with extended contexts, a large, randomly selected set of demonstrations in long-context ICL remains surprisingly effective, suggesting that the sheer volume of context can sometimes compensate for the lack of carefully tuned example selection.
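To make the setup concrete, here is a minimal sketch of how such a many-shot prompt could be assembled: randomly sampled, shuffled demonstrations are packed into the context until a token budget is reached. The dataset format, the 4-characters-per-token estimate, and the budget value are illustrative assumptions, not the paper's actual code.

import random

def build_many_shot_prompt(demos, task_instruction, query, token_budget=120_000):
    """demos: list of {"text": ..., "label": ...} dicts (hypothetical format)."""
    random.shuffle(demos)              # random order; avoid grouping same-label examples
    parts = [task_instruction]
    used = len(task_instruction) // 4  # crude token estimate: ~4 characters per token
    for d in demos:
        block = f"Input: {d['text']}\nLabel: {d['label']}\n"
        cost = len(block) // 4
        if used + cost > token_budget:
            break                      # stop once the context budget is (nearly) full
        parts.append(block)
        used += cost
    parts.append(f"Input: {query}\nLabel:")
    return "\n".join(parts)

# Usage with hypothetical data:
# prompt = build_many_shot_prompt(train_examples,
#                                 "Classify the sentiment of each input.",
#                                 "The movie was a pleasant surprise.")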
Another critical insight from the study is the reduced sensitivity of long-context ICL to the order of examples, along with the negative impact of grouping same-label examples, which suggests that optimal performance relies on a diverse set of in-context demonstrations rather than clustered or ordered ones. The research also finds that the performance gains in long-context ICL are primarily attributable to the model's ability to reference relevant examples rather than to refining task-specific decision boundaries through extensive encoding. This finding is supported by experiments showing that performance saturates before reaching the maximum context length for many datasets, indicating that current models have not yet fully exploited the potential of long-context ICL.
Furthermore, long-context models exhibit robust performance across various datasets, maintaining accuracy while offering a promising alternative to traditional fine-tuning, especially when computational efficiency and rapid adaptability are paramount.
All rights w/ authors:
In-Context Learning with Long-Context Models: An In-Depth Exploration
arxiv.org/pdf/2405.00200
#airesearch
#ai
#newtechnology

Comments: 8
@DaveRetchless · 1 month ago
Your content is the best detailed explanation of new papers and topics as they evolve. Thank you for the exceptional work.
@code4AI · 1 month ago
Appreciate your comment!
@AaronALAI · 1 month ago
I built a rig at home that lets me run Mixtral 8x22B quantized to 8-bit using exllamav2. It has a 64k context length, and I've found that in-context learning works very well. This is just my subjective observation, but I have my setup so that new conversations on specific topics first incorporate a bunch of context. It's a small upfront time cost (about a minute on initial setup), but the model responds much better this way. Additionally, I've tried giving the model a bunch of context up front via RAG with similar results. I think in-context learning is going to dominate RAG and fine-tuning because of its simplicity, its dynamic nature, and because you need fewer resources to have a larger impact on the model output.
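A rough illustration of the setup described in this comment (not the commenter's actual code): a topic-specific context block is loaded once and prepended to every turn of a new conversation. The file path and the generate helper are hypothetical placeholders; model loading via exllamav2 is omitted.

TOPIC_CONTEXT = open("notes/topic_notes.md").read()   # hypothetical notes file, loaded once

SYSTEM = (
    "You are a helpful assistant. Use the reference material below when answering.\n\n"
    "### Reference material\n" + TOPIC_CONTEXT
)

history = []

def chat(user_msg, generate):
    """generate(prompt) -> str is assumed to wrap the local 64k-context model."""
    history.append(("user", user_msg))
    prompt = (
        SYSTEM + "\n\n"
        + "\n".join(f"{role}: {text}" for role, text in history)
        + "\nassistant:"
    )
    reply = generate(prompt)
    history.append(("assistant", reply))
    return reply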
@kevinpham6658 · 1 month ago
Really interesting approach. Using sglang's RadixAttention and the fork primitive right after all the ICL examples would make this strategy extremely viable and fast, because you only have to evaluate the examples once. Multiple forks == multi-LoRA, but without the hassle of fine-tuning?
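A hedged sketch of this idea using SGLang's frontend language as documented (sgl.function, fork, gen); exact signatures may differ between versions, so check the current docs. RadixAttention caches the shared many-shot prefix, so the ICL examples are prefilled once and reused by every fork and by later calls with the same prefix.

import sglang as sgl

@sgl.function
def icl_batch(s, icl_examples, questions):
    s += icl_examples                      # long many-shot prefix, prefilled only once
    forks = s.fork(len(questions))         # each fork reuses the cached prefix
    for f, q in zip(forks, questions):
        f += f"\nQ: {q}\nA:" + sgl.gen("answer", max_tokens=128)
    for i, f in enumerate(forks):
        s += f"\nAnswer {i + 1}: " + f["answer"]

# Usage, assuming a local SGLang server (endpoint and port are placeholders):
# sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))
# state = icl_batch.run(icl_examples=long_prompt, questions=["...", "..."])
# print(state.text())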
@frankschilder8974 · 1 month ago
Very nice summary. I liked in particular your insights at the end of the video. I'm wondering, however, about the additional cost of ICL+ for a production system compared to a fine-tuned model. It would be nice to see a chart comparing inference costs, answering the question of which approach would be more cost-effective in the long run, especially for high-throughput scenarios.
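A back-of-envelope way to frame that comparison, with purely illustrative numbers (the prices, token counts, and training cost below are made up): many-shot ICL pays for a long prompt on every request unless the prefix is cached, while a fine-tuned model pays a one-time training cost plus short per-request prompts.

def icl_cost(requests, prompt_tokens, output_tokens, price_per_1k_tokens, prefix_cached=False):
    # Long many-shot prompt is re-billed per request unless a prefix cache absorbs it.
    per_request = (0 if prefix_cached else prompt_tokens) + output_tokens
    return requests * per_request / 1000 * price_per_1k_tokens

def finetuned_cost(requests, prompt_tokens, output_tokens, price_per_1k_tokens, training_cost):
    # One-time training cost plus short per-request prompts.
    return training_cost + requests * (prompt_tokens + output_tokens) / 1000 * price_per_1k_tokens

# Example: 1M requests, 50k-token ICL prompt vs. 500-token fine-tuned prompt (made-up prices).
print(icl_cost(1_000_000, 50_000, 300, price_per_1k_tokens=0.001))
print(icl_cost(1_000_000, 50_000, 300, price_per_1k_tokens=0.001, prefix_cached=True))
print(finetuned_cost(1_000_000, 500, 300, price_per_1k_tokens=0.001, training_cost=2_000))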
@kevinpham6658 · 1 month ago
If you use sglang’s fork primitive, you cache the ICL tokens and get it for free on subsequent calls.
@covertassassin1885 · 1 month ago
@code4AI How could we apply this to longer problems? Having more examples where each example is long would fill up the context window very rapidly (e.g., summarization). How would you recommend we balance those? My ideas:
1. Simply use RAG-based ICL to retrieve the most relevant examples until the context is nearly filled.
2. If the output of the model is short but the input is long, show many examples of the LLM's output/answer and just omit the long input that would take up many tokens. There are a few reasons behind this: the answer is typically a condensed form of the information, so much of the example's utility is in the answer; the answer has the formatting the model should follow; and it prevents dilution of the context window. (If you fill a lot of the context window with tokens from the actual input of the problem, the model has fewer tokens of example answer text to pay attention to.)
3. Potentially the inverse of #2 could also be useful: if the output is long for a short input, e.g., writing a piece of code to solve a problem, then give the LLM multiple examples of the input so it knows the general types of things it should be solving for.
What are your thoughts on #2? I think it would still be very important to give a variety of examples to make sure you get a good distribution. However, maybe a 4th solution would be even better:
4. Hybrid ICL: Use RAG to retrieve a few full-length examples, then append many short examples (e.g., just the output if the input is long). This way, the model can look at a few full problems & solutions but then has many more examples of the answer to reference. If these output answers are in the form of chain-of-thought, they would be similar to what you referenced at the end with regard to many examples of reasoning.
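A rough sketch of idea #4 (Hybrid ICL) from the comment above: retrieve a few full-length input/output examples, then pack in many short output-only examples until the budget runs out. The retriever API, dataset fields, and token budget are hypothetical placeholders.

import random

def build_hybrid_prompt(query, retriever, demos, n_full=3, n_short=200, token_budget=100_000):
    parts, used = [], 0

    # A few full input+output examples, chosen by similarity to the query.
    for d in retriever.top_k(query, k=n_full):           # hypothetical retriever API
        block = f"### Example\nInput:\n{d['input']}\nOutput:\n{d['output']}\n"
        parts.append(block)
        used += len(block) // 4                          # crude token estimate

    # Many short, output-only examples to show format and reasoning style.
    random.shuffle(demos)
    for d in demos[:n_short]:
        block = f"### Reference output\n{d['output']}\n"
        if used + len(block) // 4 > token_budget:
            break
        parts.append(block)
        used += len(block) // 4

    parts.append(f"### Task\nInput:\n{query}\nOutput:\n")
    return "\n".join(parts)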
@RickySupriyadi · 1 month ago
In the end, they (Google and Anthropic) also learn from open source and from the research of the broader community to create their products.