New Trick for Fine-Tuning LLMs

2,714 views

code_your_own_AI

2 months ago

The study by Google Research investigates the impact of fine-tuning large language models (LLMs) with new factual knowledge and its potential to induce hallucinations. Specifically, it explores whether introducing new, previously unknown information during the fine-tuning process leads to the generation of factually incorrect responses by the models. The concern is that LLMs might learn to generate information that is not grounded in their pre-existing knowledge, increasing the likelihood of hallucinations.
A. Methodology
The researchers designed a controlled setup focused on closed-book question answering (QA) to study the effect of new knowledge. They varied the proportion of fine-tuning examples that introduced new knowledge (Unknown examples) versus those consistent with the model's pre-existing knowledge (Known examples). The methodology involved:
Dataset Construction: Using ENTITYQUESTIONS, which consists of factual triplets from Wikidata converted into QA pairs.
Categorization: Introducing a hierarchical system (SliCK) to classify fine-tuning examples into four categories based on the model's knowledge: HighlyKnown, MaybeKnown, WeaklyKnown, and Unknown (a rough sketch of this categorization follows the list below).
Evaluation: Measuring the model's performance on test sets with varying proportions of Unknown examples, and analyzing the impact on hallucinations and knowledge integration.
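As a rough illustration (not the paper's exact protocol), this SliCK-style bucketing can be approximated by querying the model several times per question and checking how often greedy decoding versus temperature sampling yields the correct answer. In the Python sketch below, generate_answer is a hypothetical placeholder for whatever inference call you use, few_shot_prompts is an assumed list of prompt prefixes, and is_correct is a simplified exact-match check.

def is_correct(prediction: str, gold: str) -> bool:
    # Simplified exact-match check on the answer string.
    return prediction.strip().lower() == gold.strip().lower()

def classify_example(model, question: str, gold: str,
                     few_shot_prompts: list, n_samples: int = 16) -> str:
    # generate_answer(model, prompt, temperature) -> str is assumed here, not a real API.
    # Several different few-shot prefixes are used so that greedy decoding can succeed
    # under one prefix and fail under another, which separates HighlyKnown from MaybeKnown.
    greedy = [generate_answer(model, p + question, temperature=0.0)
              for p in few_shot_prompts]
    sampled = [generate_answer(model, few_shot_prompts[0] + question, temperature=0.5)
               for _ in range(n_samples)]

    greedy_hits = [is_correct(a, gold) for a in greedy]
    sampled_hits = [is_correct(a, gold) for a in sampled]

    if all(greedy_hits):
        return "HighlyKnown"   # greedy decoding is always correct
    if any(greedy_hits):
        return "MaybeKnown"    # greedy decoding is sometimes correct
    if any(sampled_hits):
        return "WeaklyKnown"   # only temperature sampling ever recovers the answer
    return "Unknown"           # the model never produces the correct answer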
B. Main Results and Insights
The study yielded several significant findings:
Integration of New Knowledge: LLMs struggle to integrate new factual knowledge through fine-tuning. Unknown examples are learned significantly slower than Known examples, indicating difficulty in incorporating new information.
Induction of Hallucinations: As the model learns new knowledge through fine-tuning, there is a linear increase in its tendency to hallucinate. This suggests that exposure to new knowledge can indeed encourage the generation of factually incorrect responses.
Role of Early Stopping: Implementing early stopping during fine-tuning minimizes the risk of hallucinations. This approach prevents the model from overfitting to Unknown examples, which are primarily responsible for inducing hallucinations.
Importance of MaybeKnown Examples: Fine-tuning with a mix of Known categories, particularly MaybeKnown examples, enhances the model's ability to utilize its pre-existing knowledge effectively. This balanced approach yields better overall performance compared to fine-tuning solely on HighlyKnown examples.
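One way to act on these findings when assembling a fine-tuning set is to classify a candidate pool with a categorization like the one sketched above, then cap or drop the Unknown examples while keeping a mix dominated by the Known categories. The function below is only a sketch under those assumptions; classify_example is the hypothetical helper from the previous snippet, and the default ratios are illustrative, not values taken from the paper.

import random

def build_finetuning_set(model, qa_pairs, few_shot_prompts, max_unknown_fraction=0.0):
    # Bucket each (question, answer) pair by the model's prior knowledge of it.
    buckets = {"HighlyKnown": [], "MaybeKnown": [], "WeaklyKnown": [], "Unknown": []}
    for question, answer in qa_pairs:
        category = classify_example(model, question, answer, few_shot_prompts)
        buckets[category].append((question, answer))

    # Keep the Known categories (the study found MaybeKnown examples especially useful)
    # and cap or drop Unknown examples, which are the main drivers of hallucination.
    known = buckets["HighlyKnown"] + buckets["MaybeKnown"] + buckets["WeaklyKnown"]
    n_unknown = int(max_unknown_fraction * len(known))
    kept_unknown = random.sample(buckets["Unknown"],
                                 min(n_unknown, len(buckets["Unknown"])))

    dataset = known + kept_unknown
    random.shuffle(dataset)
    return dataset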
C. Insights
The study provides crucial insights into the fine-tuning process of LLMs:
Risk Management: Introducing new factual knowledge during fine-tuning carries the risk of increased hallucinations. To mitigate this, strategies such as early stopping and filtering out Unknown examples can be effective.
Knowledge Utilization: LLMs primarily acquire factual knowledge during pre-training, while fine-tuning is more effective for optimizing the use of this knowledge rather than integrating new facts.
Practical Implications: For practical applications, it is essential to carefully design fine-tuning datasets and monitor training dynamics to balance the benefits of new knowledge with the risks of hallucinations.
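For the training-dynamics side, the sketch below shows one minimal way early stopping could be wired into a fine-tuning loop, assuming a PyTorch-style model and two hypothetical helpers, train_one_epoch and dev_accuracy (closed-book QA accuracy on a held-out set of Known facts). The point is simply to stop before the slowly learned Unknown examples get memorized.

import copy

def finetune_with_early_stopping(model, train_set, dev_set, max_epochs=20, patience=2):
    best_acc, best_state, epochs_without_gain = 0.0, None, 0

    for epoch in range(max_epochs):
        train_one_epoch(model, train_set)    # hypothetical: one SFT pass over the data
        acc = dev_accuracy(model, dev_set)   # hypothetical: held-out closed-book QA accuracy

        if acc > best_acc:
            best_acc = acc
            best_state = copy.deepcopy(model.state_dict())
            epochs_without_gain = 0
        else:
            epochs_without_gain += 1

        # Unknown examples are fitted slowly and late in training; stopping here
        # keeps the model from memorizing them, which the study links to the
        # increase in hallucinations.
        if epochs_without_gain >= patience:
            break

    if best_state is not None:
        model.load_state_dict(best_state)    # restore the best checkpoint
    return model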
In summary, the study highlights the challenges and risks associated with fine-tuning LLMs on new knowledge, emphasizing the need for careful management of the fine-tuning process to maintain model accuracy and reliability.
[text generated by GPT4o]
All rights w/ authors:
arxiv.org/pdf/2405.05904
Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations?
#airesearch
#newtech
#insights

Comments: 25
@wdonno 2 months ago
Ha!!! You have always been telling us to blend 'new' data with some 'old' data when conducting SFT! Your intuition was spot on. You also always reminded us to mind the format of the new data, so that it matches the format of the original training data as closely as possible.
@borisguarisma8810 2 months ago
Wow! I need to watch this again while reading the paper...thank you!
@code4AI 2 months ago
Glad it was helpful!
@AdamBrusselback 2 months ago
This is interesting. When I've been doing SFT on new tasks, I was originally having problems getting models to learn the output with just an input + output example. I noticed much, much better performance on the final task when I augmented the training data to include answering questions about the input format, pulling out specific data points from it, generating intermediate representations, etc.
@code4AI 2 months ago
Great observation. However, with closed LLMs, or so-called "open" ones that offer no transparency about what their pre-training dataset included (global corporations being afraid of the legal implications of copyright violations)... we have no real chance of optimizing for a coherent fine-tuning dataset. Damn it.
@desmur36 2 months ago
If this holds, it implies we need to sequence our training data in a format that scaffolds the model from lightly known to known knowledge. Intuitively this makes sense. Most students learn through a process of building on known concepts that are easy to grasp, then expanding to more advanced topics using that base knowledge as a foundation. It also raises the question: what was the sequence in the pre-training dataset? Was that carefully curated? And how would one organize the internet from fundamental to advanced concepts? I think we got lucky with research papers, because they always follow this sequence from known to new knowledge.
@i_accept_all_cookies 2 months ago
I've been fine-tuning SLMs like TinyLlama, Phi 2, and Gemma 2b. This might explain some of the accuracy variance I've been seeing.
@milindgaharwar827 2 months ago
It seems generally reasonable that 'new data + conceptually related known data' should lead to fewer hallucinations - when compared to only new data or new data + conceptually unrelated known data. It would probably not make a big difference IF there were a mechanism in the model architecture itself to find common patterns in different learnt concepts. Do please share if you are aware of any such research direction.
@gileneusz 2 months ago
23:15 ICL is very compute-demanding... with longer prompts you will get slow prompt processing...
@code4AI 2 months ago
Not with parallel processing like ring attention.
@zekehobbs7738 2 months ago
At what volume of new tokens does this break down? I.e., 1k, 10k, 100k, 1M, 10M, etc.
@proterotype 2 months ago
I wonder if combining fine-tuning with RAG would solve this.
@code4AI 2 months ago
No. We need our fine-tuned LLMs within our active agents when complex external info (RAx) is returned to the LLM.
@proterotype 2 months ago
@@xspydazx Interesting stuff! I understand you may not have personally trained an LLM on embeddings, but have you used the type of workflow from your first, longer comment? If so, how well have you seen it work, that is to say, how accurate are the results of the method you outline there?
@gileneusz 2 months ago
I think you missed the point that this paper is about dense models like Llama 3, which is trained on a huge number of tokens; this will not appear as much for models that are not as dense as Llama 3.
@code4AI 2 months ago
Smile. The formulation "it will not appear as much..." is hopeful, but do we have any validated data on this?! Why should MoE be immune, and if "maybe", to what degree?
@gileneusz 2 months ago
@@code4AI I have no idea, all this fine-tuning stuff is just purely experimental. Which is good, we are still learning.
@kishoretvk 2 months ago
So, pre-training of an LLM? Can we do it with a 7B or 8B model? Can we further fine-tune a pre-trained LLM and avoid this?
@code4AI 2 months ago
Some argue that fine-tuning is just a kind of continued pre-training. IF we have an open-source LLM where we know all the pre-training datasets, formats, and complexities... then we might have a chance to create an additional coherent fine-tuning dataset. With closed LLMs, however... no chance.
@gileneusz 2 months ago
25:38 pre-training is too expensive, but if you split your knowledge across many AI models, you can train smaller models and it would be much cheaper...
@code4AI 2 months ago
Look at the performance of Snowflake Arctic 128x3.66B. Any questions left?
@gileneusz 2 months ago
@@code4AI That's the opposite end of the spectrum. Snowflake suffers because the 3.66B models are just undertrained there.
@propeacemindfortress 2 months ago
🤔