Better not Bigger: Distilling LLMs into Specialized Models

2,118 views

Snorkel AI

8 months ago

Jason Fries, a research scientist at Snorkel AI and Stanford University, discussed the challenges of deploying LLMs and presented two variations of one solution: distillation.
The first solution, called “distilling step-by-step,” emerged from a collaboration between researchers at Snorkel AI and Google Research. This approach prompts an LLM to answer a question along with the reasoning behind its answer. Data scientists then use both the answer and the rationale to train a smaller model. In experiments, this allowed researchers to train models on much less data while maintaining similar performance.
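For readers who want to see the shape of this, here is a minimal sketch of the multi-task training loop described above, assuming a Hugging Face-style seq2seq student; the model choice, task prefixes, and `RATIONALE_WEIGHT` are illustrative assumptions, not the exact setup from the talk:

```python
# A minimal sketch of the "distilling step-by-step" objective, assuming
# teacher-generated (answer, rationale) pairs are already available.
# Model choice, task prefixes, and RATIONALE_WEIGHT are illustrative.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
RATIONALE_WEIGHT = 0.5  # mixing weight between the label and rationale tasks

def train_step(question: str, answer: str, rationale: str) -> float:
    """One update: the student learns to predict the label AND the rationale."""
    losses = []
    for prefix, target in (("[label] ", answer), ("[rationale] ", rationale)):
        enc = tokenizer(prefix + question, return_tensors="pt", truncation=True)
        labels = tokenizer(target, return_tensors="pt", truncation=True).input_ids
        losses.append(model(**enc, labels=labels).loss)
    loss = losses[0] + RATIONALE_WEIGHT * losses[1]  # combined multi-task loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Both targets come from prompting the large teacher LLM, not from human labels.
train_step(
    question="Is this review positive or negative? 'The battery died in a week.'",
    answer="negative",
    rationale="A battery failing within a week signals strong dissatisfaction.",
)
```

Because the rationale task forces the student to model the teacher's reasoning, not just its final answers, the student typically needs far fewer examples to reach comparable accuracy.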
Jason also showed how the Snorkel Flow data development platform allows users to effectively distill the expertise of multiple LLMs into a deployable, small-format model.
More related videos: • Foundation Models: The...
More related videos: • Snorkel AI's 2023 Ente...
#airesearch #modeldistillation #largelanguagemodels

Comments: 12
@vivekpadman5248
1 month ago
Very nice, short, informative video. I'm looking to create a distilled model for reasoning tasks in games that could run locally. This will help 😊 thanks
@SnorkelAI
1 month ago
Glad it was helpful!
@riser9644
8 months ago
A link to the blog, code, or slides would be good.
@420_gunna
5 months ago
When you talk about distillation requiring large, unlabeled datasets... to be clear, for my understanding: it's not necessarily that the data is unlabeled; it's more that we don't care about the dataset's labels, and instead use the teacher model's output distribution as the replacement pseudolabel. I guess you COULD create a distilled model by training against some data distribution that the teacher wasn't itself trained on... but I can't imagine why you would want to do that 😄
@SnorkelAI
3 months ago
Sort of. Typically, you would use this for data that is, in fact, unlabeled: think sections of contracts or paragraphs from textbooks. You could also employ this approach for data whose labels don't fit your desired schema, in which case your statement that "we don't care about the dataset's labels" would be 100% correct. As for your second comment, there could be a number of reasons you might want to do that. Perhaps the teacher LLM does quite well on a particular labeling task when given a highly engineered prompt. This approach would let you transfer that performance into a smaller, cheaper model.
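To make that flow concrete, here is a minimal sketch of pseudolabeling unlabeled text with a prompt-engineered teacher; the `teacher_complete` callable, the label schema, and the prompt are hypothetical stand-ins for illustration, not Snorkel Flow's actual API:

```python
# A minimal pseudolabeling sketch: a teacher LLM labels raw, unlabeled text,
# and the (text, pseudolabel) pairs then train a small student model.
# `teacher_complete`, LABELS, and PROMPT are hypothetical, for illustration.
from typing import Callable

LABELS = ["indemnification", "termination", "confidentiality"]

PROMPT = (
    "Classify this contract clause as one of "
    + ", ".join(LABELS)
    + ".\nClause: {clause}\nAnswer:"
)

def pseudolabel(
    clauses: list[str], teacher_complete: Callable[[str], str]
) -> list[tuple[str, str]]:
    """Ask the (possibly prompt-engineered) teacher to label each clause."""
    dataset = []
    for clause in clauses:
        raw = teacher_complete(PROMPT.format(clause=clause)).strip().lower()
        if raw in LABELS:  # keep only outputs that fit the desired schema
            dataset.append((clause, raw))
    return dataset

# The resulting pairs can fine-tune any small student classifier
# (e.g. a DistilBERT head) at a fraction of the teacher's inference cost.
```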
@vivekpadman5248
1 month ago
Is this approach used at all three levels of training: base, instruct, and chat fine-tuning? And are there different things to consider for each?
@SnorkelAI
1 month ago
I'm not 100% clear on your question. Are you referring to pre-training, fine-tuning, and alignment? If so, this approach could be used for fine-tuning and/or alignment. It could also theoretically be used for pre-training, but I suspect that would yield poor results.
@vivekpadman5248
1 month ago
@SnorkelAI Yes, that was exactly my question, thanks 😊. I have one follow-up question: why do you think it would yield poorer results in the pre-training phase? Any insights on that? And in that case, what kind (size and architecture) of pre-trained student model should be used with a specific teacher LLM, or would anything work?
@SnorkelAI
1 month ago
Sorry for the slow reply here. YouTube didn't surface your reply comment the same way it did your initial comment. We're getting a bit outside the bounds of what can be reasonably answered within a YouTube comment, but I think we can reasonably say this: distilling a model means using its output to train a smaller model. For pre-training, that would mean generating an immense volume of raw outputs from the parent model. Several studies have shown that pre-training generative models on other models' generated output tends not to work well. We don't yet fully understand why, but we understand that it is a questionable practice at present.
@vivekpadman5248
1 month ago
@SnorkelAI No worries, getting such a nice detailed reply is all that matters. Ah, I understand it properly now. I also guess parameter-size limits will come into the picture if we use it for pre-training. Clean data plus synthetic data is available now anyway. Thanks again 😊🙏
@lionhuang9209
8 months ago
Where can we get the PPT?
@mechwarrior83
8 months ago
Please.