Batch Normalization - EXPLAINED!

102,256 views

CodeEmporium


What is Batch Normalization? Why is it important in Neural networks? We get into math details too. Code in references.
Follow me on M E D I U M: towardsdatascience.com/likeli...
REFERENCES
[1] 2015 paper that introduced Batch Normalization: arxiv.org/abs/1502.03167
[2] The paper that claims Batch Norm does NOT reduce internal covariate shift as claimed in [1]: arxiv.org/abs/1805.11604
[3] Using BN + Dropout: arxiv.org/abs/1905.05928
[4] Andrew Ng on why normalization speeds up training: www.coursera.org/lecture/deep...
[5] Ian Goodfellow on how Batch Normalization helps regularization: www.quora.com/Is-there-a-theo...
[6] Code Batch Normalization from scratch: kratzert.github.io/2016/02/12...

Comments: 127
@ssshukla26 4 years ago
Shouldn't gamma approximate the true variance of the neuron activation and beta approximate the true mean of the neuron activation? I am just confused...
@CodeEmporium 4 years ago
You're right. Misspoke there. Nice catch!
@ssshukla26 4 years ago
@@CodeEmporium Cool
@dhananjaysonawane1996 3 years ago
How is this approximation happening? And how do we use beta, gamma at test time? We have only one example at a time during testing.
@FMAdestroyer 2 years ago
@@dhananjaysonawane1996 In most frameworks, when you create a BN layer, the mean and variance (beta and gamma) are both learnable parameters, usually represented as the weight and bias of the layer. You can deduce that from the Torch BatchNorm2d layer's description below: "The mean and standard-deviation are calculated per-dimension over the mini-batches and γ and β are learnable parameter vectors of size C (where C is the input size)."
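A minimal PyTorch sketch of what the reply above describes (assuming torch is available; the layer size and variable names are illustrative, not from the video):

    import torch
    import torch.nn as nn

    bn = nn.BatchNorm1d(num_features=4)   # gamma -> bn.weight, beta -> bn.bias (both learnable)
    x = torch.randn(32, 4)                # a mini-batch of 32 examples, 4 features

    bn.train()
    y_train = bn(x)      # normalizes with batch statistics; also updates running_mean / running_var

    bn.eval()
    y_test = bn(x[:1])   # a single test example works: the stored running statistics are used instead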
@AndyLee-xq8wq 1 year ago
Thanks for clarification!
@efaustmann 4 years ago
Exactly what I was looking for. Very well researched and explained in a simple way with visualizations. Thank you very much!
@sumanthbalaji1768 4 years ago
Just found your channel and binged through all your videos, so here's a general review. As a student, I assure you your content is on point and goes in depth, unlike other channels that just skim the surface. Keep it up and don't be afraid to go more in depth on concepts. We love it. Keep it up, brother; you have earned a supporter till your channel's end.
@CodeEmporium 4 years ago
Thanks ma guy. I'll keep pushing up content. Good to know my audience loves the details ;)
@sumanthbalaji1768 4 years ago
@@CodeEmporium Damn, did not actually expect you to reply lol. Maybe let me throw a topic suggestion then: more NLP please; take a look at summarisation tasks as a topic. Would be damn interesting.
@jodumagpi 4 years ago
This is good! I think that giving an example as well as the use cases (advantages) before diving into the details always gets the job done.
@maxb5560 4 years ago
Love your videos. They help me a lot in understanding machine learning more and more.
@EB3103 3 years ago
The loss is not a function of the features but a function of the weights
@yeripark1135 2 years ago
I clearly understand the need for batch normalization and its advantages! Thanks!!
@balthiertsk8596 2 years ago
Hey man, thank you. I really appreciate this quality content!
@ultrasgreen1349 1 year ago
That's actually a very, very good and intuitive video. Honestly, thank you.
@ahmedshehata9522 2 years ago
You are really, really good, because you reference the papers and introduce the ideas.
@Slisus 2 years ago
Awesome video. I really like how you go into the actual papers behind it.
@CodeEmporium 2 years ago
Glad you liked this!
@parthshastri2451 3 years ago
Why did you plot the cost against height and age? Isn't it supposed to be a function of the weights in a neural network?
@hervebenganga8561 1 year ago
This is beautiful. Thank you
@ayandogra2952 3 years ago
Amazing work, really liked it.
@iliasaarab7922 3 years ago
Great explanation, thanks!
@chandnimaria9748 9 months ago
Just what I was looking for, thanks.
@SaifMohamed-de8uo 1 month ago
Great explanation thank you!
@MaralSheikhzadeh 2 years ago
Thanks, this video helped me understand BN better. And I liked your sense of humor; it made watching more fun. :)
@sriharihumbarwadi5981 4 years ago
Can you please make a video on how batch normalization and l1/l2 regularization interact with each other ?
@user-wf2fq2vn5m 3 years ago
Awesome explanation.
@dragonman101 3 years ago
Quick note: at 6:50 there should be brackets after the 1/3 (see below). Yours: 1/3 (4 - 5.33)^2 + (5 - 5.33)^2 + (7 - 5.33)^2. It should be: 1/3 [(4 - 5.33)^2 + (5 - 5.33)^2 + (7 - 5.33)^2].
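For reference, the corrected computation written out as the population variance of the mini-batch {4, 5, 7} (the ≈1.56 value is my own arithmetic, not from the video):

    \mu = \tfrac{1}{3}(4 + 5 + 7) \approx 5.33, \qquad
    \sigma^2 = \tfrac{1}{3}\left[(4 - 5.33)^2 + (5 - 5.33)^2 + (7 - 5.33)^2\right] \approx 1.56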
@oheldad 4 years ago
Hey there. I'm on my way to becoming a data scientist, and your videos help me a lot! Keep going; I'm sure I am not the only one you inspired :) thank you!!
@CodeEmporium 4 years ago
Awesome! Glad these videos help! Good luck with your Data science ventures :)
@ccuuttww 4 years ago
Your aim should not be to become a data scientist to fit other people's expectations; you should become a person who can deal with data and estimate any unknown parameter to your own standard.
@oheldad 4 years ago
@@ccuuttww I don't know why you decided that I'm fulfilling others' expectations; it's not true. I'm in the last semester of my electrical engineering degree and decided to change paths a little :)
@ccuuttww 4 years ago
Because most people think in the following pattern: finish every exam semester and graduate with good marks, mass-send CVs and try to get a job titled "Data Scientist", then try to fit what they learned at university to the job like a trained monkey. However, you are not dealing with a real-world situation; you are just trying to deal with your customer or your boss. Since this topic never has a standard answer, you can only define it for yourself, and your client only trusts your title. I feel this is really bad.
@ryanchen6147 2 years ago
At 3:27, I think your axes should be the *weight* for the height feature and the *weight* for the age feature if that is a contour plot of the cost function.
@mohameddjilani4109 1 year ago
Yes, that was an error that persisted over a long stretch of the video.
@strateeg32 2 years ago
Awesome thank you!
@angusbarr7952 4 years ago
Hey! Just cited you in my undergrad project because your example finally made me understand batch norm. Thanks a lot!
@CodeEmporium 4 years ago
Sweet! Glad it was helpful homie
@seyyedpooyahekmatiathar624 4 years ago
Subtracting the mean and dividing by std is standardization. Normalization is when you change the range of the dataset to be [0,1].
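A tiny NumPy illustration of that distinction (the example values are my own, not from the video):

    import numpy as np

    x = np.array([4.0, 5.0, 7.0])

    standardized = (x - x.mean()) / x.std()           # zero mean, unit variance (what batch norm computes)
    min_max = (x - x.min()) / (x.max() - x.min())     # rescaled into the [0, 1] range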
@uniquetobin4real 4 years ago
The best I have seen so far
@manthanladva6547 4 years ago
Thanks for the awesome video. Got many ideas about Batch Norm.
@God-vl5uz 1 month ago
Thank you!
@aminmw5258 1 year ago
Thank you bro.
@shaz-z506 4 years ago
Good video. Could you please make a video on capsule networks?
@hemaswaroop7970 4 years ago
Thanks, Man!
@superghettoindian01 1 year ago
I see you are checking all these comments, so I will try to comment on all the videos I watch going forward and on how I'm using them. Currently using this video as a supplement to Andrej Karpathy's makemore series, part 3. The other video has a more detailed implementation of batch normalization, but you do a great job of summarizing the key concepts. I hope one day you and Andrej can create a video together 😊.
@CodeEmporium 1 year ago
Thanks a ton for the comment. Honestly, any critical feedback is appreciated, so thank you. It would certainly be a privilege to collaborate with Andrej for sure. Maybe in the future :)
@thoughte2432 3 years ago
I found this a really good and intuitive explanation, thanks for that. But there was one thing that confused me: isn't the effect of batch normalization the smoothing of the loss function? I found it difficult to relate the loss function directly to the graph shown at 2:50.
@Paivren 11 months ago
yes, the graph is a bit weird in the sense that the loss function is not a function of the features but of the model parameters.
@enveraaa8414 3 years ago
Bro you have made the perfect video
@user-nx8ux5ls7q 2 years ago
Do we calculate the mean and SD across a mini-batch for a given neuron, or across all the neurons in a layer? Andrew Ng says it's across each layer. Thanks.
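For what it's worth, the 2015 paper [1] computes one mean and SD per neuron (feature), over the examples in the mini-batch. A small NumPy sketch with made-up shapes:

    import numpy as np

    activations = np.random.randn(32, 5)    # (batch_size, neurons in the layer)
    mean = activations.mean(axis=0)         # shape (5,): one mean per neuron, computed over the batch
    std = activations.std(axis=0)           # shape (5,): one SD per neuron, computed over the batch
    normalized = (activations - mean) / (std + 1e-5)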
@ccuuttww 4 years ago
I wonder whether it is suitable to use a population estimator? I think nowadays most machine learning learners/students/fans spend very little time on statistics. After several years of study, I find that model selection and statistical theory are the most important parts, especially Bayesian learning, the most underrated topic today.
@pranavjangir8338 3 years ago
Isn't Batch Normalization also used to counter the exploding gradient problem? Would have loved some explanation on that too.
@sanjaykrish8719 4 years ago
Fantastic explanation using contour plots.
@CodeEmporium 4 years ago
Thanks! Contour plots are the best!
@PavanTripathi-rj7bd 1 year ago
great explanation
@CodeEmporium 1 year ago
Thank you! Enjoy your stay on the channel :)
@JapiSandhu 2 years ago
this is a great video
@lamnguyentrong275 4 years ago
Wow, easy to understand, and a clear accent. Thank you, sir. You did a great job.
@erich_l4644 4 years ago
This was so well put together. Why fewer than 10k views? Oh... it's batch normalization.
@taghyeertaghyeer5974 1 year ago
Hello, thank you for your video. I am wondering about batch normalisation speeding up training: you showed at 2:42 the contour plot of the loss as a function of height and age. However, the loss function contours should be plotted against the weights (the optimization is performed in weight space, not the input space). In other words, why did you base your argument on the loss function with weight and height being the variables (they should be held constant during optimization)? Thank you! Lana
@marcinstrzesak346 9 months ago
For me, it also seemed quite confusing. I'm glad someone else noticed it too.
@atuldivekar 5 months ago
The contour plot is being shown as a function of height and age to show the dependence of the loss on the input distribution, not the weights
@sultanatasnimjahan5114 7 months ago
thanks
@kriz1718 4 years ago
Very helpful!!
@danieldeychakiwsky1928 3 years ago
Thanks for the video. I wanted to add that there's debate in the community over whether to normalize pre vs. post non-linearity within the layers, i.e., for a given neuron in some layer, do you normalize the result of the linear function that gets piped through non-linearity or do you pipe the linear combination through non-linearity and then apply normalization, in both cases, over the mini-batch.
@kennethleung4487 3 years ago
Here's what I found from MachineLearningMastery:
o Batch normalization may be used on the inputs to a layer before or after the activation function of the previous layer.
o It may be more appropriate after the activation function for S-shaped functions like the hyperbolic tangent and logistic function.
o It may be appropriate before the activation function for activations that may result in non-Gaussian distributions, like the rectified linear activation function, the modern default for most network types.
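A small PyTorch sketch of the two orderings being discussed (layer sizes are arbitrary; neither is presented here as the official recommendation):

    import torch.nn as nn

    bn_before_activation = nn.Sequential(
        nn.Linear(16, 32),
        nn.BatchNorm1d(32),   # normalize the pre-activations...
        nn.ReLU(),            # ...then apply the non-linearity
    )

    bn_after_activation = nn.Sequential(
        nn.Linear(16, 32),
        nn.ReLU(),            # apply the non-linearity first...
        nn.BatchNorm1d(32),   # ...then normalize its outputs
    )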
@priyankakaswan7528 3 years ago
The real magic starts at 6:07; this video was exactly what I needed.
@samratkorupolu 3 years ago
Wow, you explained it pretty clearly.
@mohammadkaramisheykhlan9 2 years ago
How can we use batch normalization on the test set?
@pranaysingh3950 2 years ago
Thanks!
@CodeEmporium 2 years ago
Welcome!
@luisfraga3281 3 years ago
Hello, I wonder: what if we don't normalize the image input data (RGB 0-255) and then we use batch normalization? Is it going to work smoothly, or is it going to mess up the learning?
@ajayvishwakarma6943 4 years ago
Thanks buddy
@abhishekp4818 4 years ago
@CodeEmporium, could you please tell me why we need to normalize the outputs of an activation function when they are already within a small range (for example, sigmoid ranges from 0 to 1)? And if we do normalize them, how do we compute and update its parameters during backpropagation? Please answer.
@boke6184 4 years ago
The activation function should be modifying the predictability of error or learning too.
@abheerchrome 3 years ago
Great video bro, keep it up.
@JapiSandhu 2 years ago
Can I add a Batch Normalization layer after an LSTM layer in PyTorch?
@SillyMakesVids 4 years ago
Sorry, but where did gamma and beta come from, and how are they used?
@nobelyhacker 2 years ago
Nice video, but I guess there is a little error at 6:57? You have to multiply the whole expression by 1/3, not only the first term.
@user-nx8ux5ls7q 2 years ago
Also, can someone say how to make gamma and beta learnable? Gamma can be thought of as an additional weight attached to the activation, but how about beta? How do you train that?
@SetoAjiNugroho 4 years ago
What about layer norm?
@novinnouri764 2 years ago
Thanks.
@PierreH1968 3 years ago
Great explanation, very helpful!
@elyasmoshirpanahi7184 1 year ago
Nice content
@CodeEmporium 1 year ago
Thanks so much
@themightyquinn100 1 year ago
Wasn't there an episode where Peter was playing against Larry Bird?
@mizzonimirko 1 year ago
I do not understand properly how this is going to be implemented. We actually perform those operations at the end of an epoch, right? At that point, the layer where I have applied it is normalized, right?
@rockzzstartzz2339 4 years ago
Why use beta and gamma?
@akhileshpandey123 3 years ago
Nice explanation :+1
@akremgomri9085 1 month ago
Very good explanation. However, there is something I didn't understand. Doesn't batch normalisation modify the input data so that m=0 and v=1, as explained in the beginning?? So how the heck did we move from normalisation being applied to the inputs to normalisation affecting the activation function? 😅😅
@ai__76 3 years ago
Nice animations
@CodeEmporium 3 years ago
Thank you
@SunnySingh-tp6nt 2 months ago
Can I get these slides?
@gyanendradas 4 years ago
Can you make a video on all types of pooling layers?
@CodeEmporium 4 years ago
Interesting. I'll look into this. Thanks for the idea
@anishjain8096 4 years ago
Hey brother, can you please tell me how on-the-fly data augmentation increases the image dataset? Every blog and video says it increases the data size, but how?
@CodeEmporium 4 years ago
For images, you would need to make minor distortions (rotation, crop, scale, blur) in an image such that the result is a realistic input. This way, you have more training data for your model to generalize
@boke6184 4 years ago
This is good for ghost box
@GauravSharma-ui4yd 4 years ago
Awesome, keep going like this
@CodeEmporium 4 years ago
Thanks for watching every video Gaurav :)
@sevfx 1 year ago
Great explanation, but missing parentheses at 6:52 :p
@lazarus8011 27 days ago
Good video; here's a comment for the algorithm.
@irodionzaytsev 2 years ago
The only difficult part of batch norm, namely the backprop, isn't explained.
@aaronk839 4 years ago
Good explanation until 7:17 after which, I think, you miss the point which makes the whole thing very confusing. You say: "Gamma should approximate to the true mean of the neuron activation and beta should approximate to the true variance of the neuron activation." Apart from the fact that this should be the other way around, as you acknowledge in the comments, you don't say what you mean by "true mean" and "true variance". I learned from Andrew Ng's video (kzfaq.info/get/bejne/qrR5o6iLsdzDlZs.html) that the actual reason for introducing two learnable parameters is that you actually don't necessarily want all batch data to be normalized to mean 0 and variance 1. Instead, shifting and scaling all normalized data at one neuron to obtain a different mean (beta) and variance (gamma) might be advantageous in order to exploit the non-linearity of your activation functions. Please don't skip over important parts like this one with sloppy explanations in future videos. This gives people the impression that they understand what's going on, when they actually don't.
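For context, the transform from the 2015 paper [1] for a single activation x over a mini-batch B (the identity-recovery remark below is the paper's own observation):

    \hat{x} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y = \gamma \hat{x} + \beta

    % Setting \gamma = \sqrt{\sigma_B^2 + \epsilon} and \beta = \mu_B recovers the identity mapping,
    % so the network is free to learn whatever mean and variance suit each activation.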
@dragonman101 3 years ago
Thank you very much for this explanation. The link and the correction are very helpful and do provide some clarity to a question I had. That being said, I don't think it's fair to call his explanation sloppy. He broke down complicated material in a fantastic and clear way for the most part. He even linked to research so we could do further reading, which is great because now I have a solid foundation to understand what I read in the papers. He should be encouraged to fix his few mistakes rather than slapped on the wrist.
@sachinkun21 2 years ago
Thanks a ton!! I was actually looking for this comment, as I had the same question as to why we even need to approximate!
@99dynasty 1 year ago
BatchNorm reparametrizes the underlying optimization problem to make it more stable (in the sense of loss Lipschitzness) and smooth (in the sense of “effective” β-smoothness of the loss). Not my words
@xuantungnguyen9719 3 years ago
good visualization
@CodeEmporium 3 years ago
Thanks a ton :)
@adosar7261 1 year ago
And why not just normalize the whole training set instead of using batch normalization?
@CodeEmporium 1 year ago
Batch normalization will normalize through different steps of the network. If we want to “normalize the whole training set”, we need to pass all training examples at once to the network as a single batch. This is what we see in “batch gradient descent”, but isn’t super common for large datasets because of memory constraints.
@sealivezentrum 3 years ago
fuck me, you explained way better than my prof did
@Acampandoconfrikis 3 years ago
Hey 🅱eter, did you make it to the NBA?
@eniolaajiboye4399 2 years ago
🤯
@its_azmii 4 years ago
Hey, can you link the graph that you used, please?
@SAINIVEDH 3 years ago
For RNNs, Batch Normalisation should be avoided; use Layer Normalisation instead.
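A short PyTorch sketch of the difference in normalization axes for sequence data (the shapes are made up for illustration):

    import torch
    import torch.nn as nn

    x = torch.randn(8, 10, 32)       # (batch, time steps, features), e.g. RNN outputs

    layer_norm = nn.LayerNorm(32)    # statistics over the feature axis, per example and time step
    y_ln = layer_norm(x)

    batch_norm = nn.BatchNorm1d(32)  # statistics over the batch (and time) axis, per feature
    y_bn = batch_norm(x.transpose(1, 2)).transpose(1, 2)   # BatchNorm1d expects (batch, features, time)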
@alexdalton4535 3 years ago
Why didn't Peter make it...
@CodeEmporium 3 years ago
Clearly the model was wrong
@nyri0 2 years ago
Your visualizations are misleading. Normalization doesn't turn the shape on the left into the circle seen on the right. It will be less elongated but still keep a diagonal ellipse shape.
@roeeorland 1 year ago
Peter is most definitely not 1.9 m. That's 6'3".
@rodi4850 4 years ago
Sorry to say, but a very poor video. The intro was way too long, and explaining the math and why BN works was left for 1-2 minutes.
@CodeEmporium 4 years ago
Thanks for watching till the end. I tried going for a layered approach to the explanation - get the big picture. Then the applications. Then details. I wasn't sure how much more math was necessary. This was the main math in the paper, so I thought that was adequate. Always open to suggestions if you have any. If you've looked at my recent videos, you can tell the delivery is not consistent. Trying to see what works
@PhilbertLin 4 years ago
I think the intro with the samples in the first few minutes was a little drawn out but the majority of the video spent on intuition and visuals without math was nice. Didn’t go through the paper so can’t comment on how much more math detail is needed.
@ahmedelsabagh6990 3 years ago
55555 you get it :) HaHa