Knowledge Distillation in Deep Learning - Basics

18,059 views

Dingu Sagar

1 day ago

Here I try to explain the basic idea behind knowledge distillation and how the technique helps in compressing large deep learning models.
Part 2: • Knowledge Distillation...

Comments: 45
@mariama8157 · 1 year ago
Great Professor. Easy and high-level explanation.
@WALLACE9009 · 1 year ago
Amazing that this works at all.
@manub.n2451 · 1 year ago
Thanks Sagar for a brilliant explanation of the basics of KD.
@kristianmamforte4129 · 1 year ago
Wow, thanks for this video!
@gaurav230187 · 1 year ago
Well done, good and simple explanation.
@goelnikhils · 1 year ago
Amazing explanation of knowledge distillation.
@dhirajkumarsahu999 · 1 year ago
Thank you so much, you have earned my subscription.
@dingusagar · 1 year ago
Thanks. Will try to do more such videos.
@teay5767 · 5 months ago
Nice video, thanks for the help.
@tranvoquang1224 · 1 year ago
Thank you.
@mariama8157 · 1 year ago
Thank you so much. Please make more videos on machine learning.
@lazy.researcher · 1 year ago
Can you please explain the advantage of smoothing the logits using temperature? Why can't we just use the plain softmax outputs to compare the teacher and student models for the distillation loss?
@dingusagar · 1 year ago
Good question. One way to think about it is this: the teacher's probabilities over the different classes are semantically rich. They capture the data distribution and the relationships between classes, as explained with the animals example in the video. But those probabilities come from a softmax that was trained to match the one-hot labels of the correct class, so even though the class probabilities from the teacher's final layer carry rich information about the data distribution, the value for the correct class ends up very high while the other classes get very low probabilities. The signal is there, but very hard to see unless we amplify it. That is why we use softmax with temperature: it amplifies the probabilities of the remaining classes at the cost of bringing down the probability of the positive class a little (because the softmax outputs must sum to 1). This way the student can see these other probabilities more clearly and learn from them.
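To make this concrete, here is a minimal sketch of softmax with temperature (NumPy only, with hypothetical logits for the three animal classes); a higher temperature T spreads probability mass onto the non-target classes:

import numpy as np

def softmax_with_temperature(logits, T=1.0):
    # Divide the logits by T before exponentiating; a larger T gives a softer distribution.
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()                      # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Hypothetical teacher logits for [deer, horse, peacock]
logits = [6.0, 3.5, 0.5]
print(softmax_with_temperature(logits, T=1))  # ~[0.92, 0.08, 0.004] -> hard for the student to learn from
print(softmax_with_temperature(logits, T=4))  # ~[0.56, 0.30, 0.14]  -> class relationships become visible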
@TechNewsReviews · 8 months ago
The explanation looks good. However, many words are unclear because of bad sound quality. My suggestion is to use some AI-based audio enhancement tools to make the voice clearer and noise-free, then update the video. You will definitely get more views.
@dingusagar · 7 months ago
Thanks for the feedback. Yes, the audio is really bad. I am planning to re-record this and upload it soon.
@miriamsilverman5126 · 2 years ago
Great explanation!!! Thank you! I wish the sound was better... maybe you can record it again :)
@dingusagar · 2 years ago
Thanks :). Sorry about the bad sound quality. Will definitely work on it next time.
@ilhamafounnas8279 · 2 years ago
Looking forward to more information about KD. Thank you!
@dingusagar · 2 years ago
Glad to hear that. I was exploring KD in the NLP space and thought of creating a few videos around it. Let me know if there is any specific topic in KD, or in general, that you are looking forward to. If it overlaps with the things I am exploring, I would be happy to make videos around it.
@jatinsingh9062 · 2 years ago
Thanks!!
@Speedarion · 2 years ago
If the final layer has a sigmoid activation function, can the output of the sigmoid be used as the input to a softmax function with temperature?
@dingusagar · 2 years ago
Interesting idea. In theory we could define the loss function the way you said and training would still work, but practically I am not sure to what extent it would help; it is worth trying out. We would essentially be applying softmax twice. Here is an article on why you shouldn't do that in a normal NN setup: jamesmccaffrey.wordpress.com/2018/03/07/why-you-shouldnt-apply-softmax-twice-to-a-neural-network/ Intuitively, applying softmax twice makes the function smoother, which is what we are trying to achieve in the KD setup. But if the same effect can be achieved by tuning the temperature hyperparameter T of a softmax applied directly to the logits, that is a simpler approach from a training perspective. Nevertheless, it's an interesting idea to explore.
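For illustration, a small sketch (PyTorch, with a hypothetical tiny classifier) contrasting the usual route, temperature softmax applied directly to the pre-activation logits, with the idea discussed here of passing sigmoid outputs through a temperature softmax:

import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical tiny classifier: the last Linear layer produces the logits.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 3))

x = torch.randn(4, 16)       # dummy batch of 4 examples
logits = model(x)            # pre-activation outputs of the final layer
T = 4.0

# Usual KD route: temperature softmax directly on the logits.
soft_targets = F.softmax(logits / T, dim=-1)

# Idea from the comment above: sigmoid first, then temperature softmax.
# This squashes the values twice, which is what the linked article advises against.
soft_targets_via_sigmoid = F.softmax(torch.sigmoid(logits) / T, dim=-1)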
@Speedarion · 2 years ago
@@dingusagar Thanks for the reply. If the final layer is a fully connected layer followed by a sigmoid activation, then essentially the logits would be the inputs going into the sigmoid, right? I guess to perform KD, I would take these inputs and pass them to a softmax function with temperature.
@dingusagar · 2 years ago
@@Speedarion Yes, you are right. The logits are what comes out of the final layer before any activation is applied.
@andreisimion1636 · 1 year ago
For Loss 2, don't you want to do CrossEntropy(p(1), y_true), i.e. use the probabilities from the student without temperature scaling? Also, y_true is a one-hot vector, no? It seems like Loss 2 is a cross-entropy between two one-hot vectors, so I am unsure if this is right. Am I missing something?
@dingusagar · 1 year ago
Yes, correct, loss 2 is between two one-hot vectors. Cross-entropy is simply defined over two distributions; it does not require either of them to be one-hot, so what you suggested is also correct, I feel. It is just how the authors originally defined it, and different implementations can modify the loss based on what they find empirically more accurate. Having said that, one intuitive argument in favor of this approach is that loss 1 already uses the soft predictions, which help the student converge by learning the rich information about the classes from the teacher model. Loss 2 is then restricted to just getting the classification correct, which is expressed in one-hot format.
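For reference, a sketch of the two-part distillation loss in PyTorch. The temperature T and the weighting alpha are hypothetical hyperparameters, and the soft-target term is written with KL divergence, which many implementations use; it differs from the cross-entropy form only by a constant that does not affect the student's gradients:

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    # Loss 1: match the teacher's softened (temperature T) distribution.
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_soft_student = F.log_softmax(student_logits / T, dim=-1)
    loss1 = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * (T * T)

    # Loss 2: ordinary cross-entropy against the hard labels (class indices) at T = 1.
    loss2 = F.cross_entropy(student_logits, labels)

    return alpha * loss1 + (1 - alpha) * loss2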
@Jamboreeni · 11 months ago
Great video! Love how you simplified it so that even a novice like me understood it 😊😊😊 If possible, please use a better mic; the sound quality on this video was a little low and foggy.
@dingusagar · 11 months ago
Thanks. Glad to hear that. 😊 Yes, I will definitely work on the sound quality.
@lm_mage · 1 year ago
If the second argument of CrossEntropy() is the true labels, shouldn't Loss 1 be CrossEntropy(p,q) instead of CrossEntropy(q,p)?
@ruksharalam173 · 8 months ago
Would be great if you could please improve the audio.
@shipan5940 · 2 years ago
I have a probably stupid question: why don't we just directly train the Student model? Does having the pre-trained Teacher model make the Student model more accurate?
@dingusagar · 2 years ago
There is no such thing as a stupid question :) Let me try to answer as per my understanding; feel free to reply with further queries. In a simplified analogy, knowledge distillation is like a real-life teacher and student. If a student tries to learn a new subject from scratch all by herself, it takes a lot of time, whereas an intelligent teacher who has already done all the hard work of learning can skip the useless information and pass on rich, summarized information, so the student learns in less time. The trend today is that really large models, trained on huge amounts of data, tend to have better representational power and thus higher accuracy; that is why the big companies are in a constant race to build the next biggest model trained on bigger datasets. In our analogy, this is the teacher reading lots of books to really understand the subject. Since the teacher has a bigger brain (more layers), it can go through the huge datasets, learn the interesting patterns, and discard the useless ones. After this intensive learning is done, the teacher acts as a pretrained model. The output coming from the teacher model is very rich in information (see 1:53), which is why a student model with a smaller brain (fewer layers) is able to consume this rich information and learn in a shorter time.
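As a rough sketch of how this plays out in code (PyTorch, with hypothetical teacher, student, and train_loader objects, reusing the distillation_loss sketched earlier), one training step looks like this: the pretrained teacher is frozen and only supplies soft targets, and only the student's weights are updated:

import torch

# teacher, student, train_loader and distillation_loss are assumed from the earlier sketches.
teacher.eval()                                   # pretrained teacher stays frozen
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)

for images, labels in train_loader:
    with torch.no_grad():                        # no gradients flow into the teacher
        teacher_logits = teacher(images)
    student_logits = student(images)
    loss = distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5)
    optimizer.zero_grad()
    loss.backward()                              # gradients only for the student
    optimizer.step()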
@ThePaintingpeter · 1 year ago
Great video, but the sound could be improved.
@nayanshah6715 · 1 year ago
Why does Dingu sound like an Englishman? Or is it just me... Also, good content!!!
@prasanthnoelpanguluri7167 · 1 year ago
Can you share this PPT?
@dingusagar · 1 year ago
docs.google.com/presentation/d/1IkPeSGOcUSO_qyCwtrP9ZBMx-l2aBzj7FDqwPLK9Ekk/edit?usp=drivesdk
@prasanthnoelpanguluri7167 · 1 year ago
@@dingusagar Can you share the other file which talks about DistilBERT too?
@dingusagar · 1 year ago
@@prasanthnoelpanguluri7167 docs.google.com/presentation/d/1wU1ZVkgA-qU-5kkHqe824IVxsyLQEqqojOVaNK6Afv8/edit?usp=sharing
@lm_mage · 1 year ago
@@dingusagar You are a saint.
@terrortalkhorror · 2 months ago
If the model has just 1s and 0s in the actual labels, then you must have mistakenly said that the model predicts with 0.39 that it is a horse. Instead, it should be that the model thinks it's the deer with 0.39.
@terrortalkhorror · 2 months ago
But I must say your explanation is really good.
@dingusagar · 2 months ago
@@terrortalkhorror Thanks for the feedback. I am not sure I understood exactly what you pointed out. The predictions are made on the input image; the three images on the right are just for visualizing the classes. From the perspective of predicting the input image, the model thinks it is a deer, a horse, and a peacock with probabilities 0.6, 0.39, and 0.01 respectively, as mentioned in the slide. The audio quality is poor, and that could have created some confusion.
@terrortalkhorror · 2 months ago
@@dingusagar Yes, you are right. I just rewatched it and now it makes sense.
@terrortalkhorror · 2 months ago
I just sent a connection request on LinkedIn
@utkarshtiwari3696 · 6 months ago
Please use a better mic, or don't use a mic at all.
Quantization vs Pruning vs Distillation: Optimizing NNs for Inference
19:46
Knowledge Distillation in Deep Learning - DistilBERT Explained
7:21
Knowledge Distillation: A Good Teacher is Patient and Consistent
12:35
What is Knowledge Distillation? explained with example
8:45
Data Science in your pocket
2.6K views
Knowledge Distillation Explained with Keras Example | #MLConcepts
24:00
Rithesh Sreenivasan
3.8K views
Better not Bigger: Distilling LLMs into Specialized Models
16:49
Snorkel AI
2.1K views
Knowledge Distillation in Deep Neural Network
4:10
EscVM
3.4K views
Understanding Neural Networks and AI
9:21
IBM Technology
12K views