Optimization for Deep Learning (Momentum, RMSprop, AdaGrad, Adam)

34,014 views

DeepBean

1 day ago

Here we cover six optimization schemes for deep neural networks: stochastic gradient descent (SGD), SGD with momentum, SGD with Nesterov momentum, AdaGrad, RMSprop, and Adam. A compact code sketch of these update rules follows the chapter list.
Chapters
---------------
Introduction 00:00
Brief refresher 00:27
Stochastic gradient descent (SGD) 03:16
SGD with momentum 05:01
SGD with Nesterov momentum 07:02
AdaGrad 09:46
RMSprop 12:20
Adam 13:23
SGD vs Adam 15:03
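For readers who want the update rules at a glance, here is a minimal NumPy sketch of the six optimizers covered above, written in their commonly published forms; the function names, state handling, and default hyperparameters (lr, beta, eps) are illustrative assumptions rather than the video's own notation.

```python
import numpy as np

def sgd(w, grad, lr=0.01):
    # Vanilla (stochastic) gradient descent: step directly against the gradient.
    return w - lr * grad

def sgd_momentum(w, v, grad, lr=0.01, beta=0.9):
    # Maintain an exponentially decaying velocity and step along it.
    v = beta * v - lr * grad
    return w + v, v

def sgd_nesterov(w, v, grad_fn, lr=0.01, beta=0.9):
    # Evaluate the gradient at the look-ahead point w + beta * v before stepping.
    v = beta * v - lr * grad_fn(w + beta * v)
    return w + v, v

def adagrad(w, s, grad, lr=0.01, eps=1e-8):
    # Accumulate squared gradients; per-parameter step sizes only ever shrink.
    s = s + grad ** 2
    return w - lr * grad / (np.sqrt(s) + eps), s

def rmsprop(w, s, grad, lr=0.001, beta=0.9, eps=1e-8):
    # Like AdaGrad, but the squared-gradient accumulator decays, so step sizes can recover.
    s = beta * s + (1 - beta) * grad ** 2
    return w - lr * grad / (np.sqrt(s) + eps), s

def adam(w, m, v, t, grad, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Bias-corrected first- and second-moment estimates (t is the 1-based step count).
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v
```

In a training loop, each function is called once per minibatch gradient, with the optimizer state (v, s, or m, v, t) carried over between steps.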

Comments: 33
@AkhilKrishnaatg 3 months ago
Beautifully explained. Thank you!
@rhugvedchaudhari4584 7 months ago
The best explanation I've seen till now!
@dongthinh2001 5 months ago
Clearly explained indeed! Great video!
@saqibsarwarkhan5549 1 month ago
That's a great video with clear explanations in such a short time. Thanks a lot.
@Justin-zw1hx 10 months ago
keep doing the awesome work, you deserve more subs
@idiosinkrazijske.rutine 10 months ago
Very nice explanation!
@markr9640 5 months ago
Fantastic video and graphics. Please find time to make more. Subscribed 👍
@luiskraker807 4 months ago
Many thanks, clear explanation!!!
@rasha8541 6 months ago
really well explained
@benwinstanleymusic 3 months ago
Great video thank you!
@physis6356 1 month ago
great video, thanks!
@makgaiduk 6 months ago
Well explained!
@zhang_han 8 months ago
The most mind-blowing thing in this video was what Cauchy did in 1847.
@TheTimtimtimtam 1 year ago
Thank you this is really well put together and presented !
@leohuang-sz2rf 2 months ago
I love your explanation
@tempetedecafe7416 5 months ago
Very good explanation! 15:03 Arguably, I would say that it's not the responsibility of the optimization algorithm to ensure good generalization. I feel like it would be more fair to judge optimizers only on their fit of the training data, and leave the responsibility of generalization out of their benchmark. In your example, I think it would be the responsibility of model architecture design to get rid of this sharp minimum (by having dropout, fewer parameters, etc...), rather than the responsibility of Adam not to fall inside of it.
@wishIKnewHowToLove 1 year ago
thank you so much :)
@MikeSieko17 3 months ago
Why didn't you explain the (1-\beta_1) term?
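For anyone else puzzled by that factor: in Adam the first-moment estimate is an exponential moving average of the gradients, and the (1-\beta_1) weight is what makes the EMA weights sum to 1-\beta_1^t, which the bias-correction step then divides out. In the standard notation of the original Adam paper (included here as context, not as the video's derivation):

```latex
% Adam's first moment as an EMA; the (1-\beta_1) factor normalizes the weights,
% and dividing by (1-\beta_1^t) removes the bias from initializing m_0 = 0.
m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t
    = (1-\beta_1) \sum_{i=1}^{t} \beta_1^{\,t-i}\, g_i,
\qquad
\hat{m}_t = \frac{m_t}{1-\beta_1^{\,t}}
```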
@wishIKnewHowToLove 1 year ago
Really? I didn't know SGD generalized better than Adam.
@deepbean 1 year ago
Thank you for your comments, Sebastian! This result doesn't seem completely clear-cut, so it may be open to refutation in some cases. For instance, one Medium article concludes that "fine-tuned Adam is always better than SGD, while there exists a performance gap between Adam and SGD when using default hyperparameters", which means the problem is one of hyperparameter optimization, which can be more difficult with Adam. Let me know what you think! medium.com/geekculture/a-2021-guide-to-improving-cnns-optimizers-adam-vs-sgd-495848ac6008
@wishIKnewHowToLove 1 year ago
@@deepbean it's sebastiEn, with an E. Learn how to read carefully :)
@deepbean 1 year ago
🤣
@deepbean 1 year ago
@@wishIKnewHowToLove my bad
@dgnu 11 months ago
@@wishIKnewHowToLove bruh cmon the man is being nice enough to u just by replying jesus
@donmiguel4848 3 months ago
Nesterov is silly. You have the gradient g(w(t)) because the weight w contributes, through the forward pass, to the neuron's activation and hence to the loss. You don't have the gradient g(w(t)+pV(t)), because no inference was calculated at that fictive weight position, so you have no information about what the loss contribution at that position would have been. It's PURE NONSENSE. But it only costs a few more calculations without doing much damage, so no one really seems to complain about it.
@Nerdimo 2 months ago
This does not make sense…at all. The intuition is that you’re making an educated guess for the gradient in the future; you’re already going to compute g(w(t) + pV(t)) anyway, so why not correct for that and move in that direction instead on the current step?
@donmiguel4848 2 months ago
@@Nerdimo Let's remember that the actual correct gradient of w is computed as the average gradient over ALL samples. So, for runtime-complexity reasons, we already make an "educated guess", or rather a stochastic approximation, with our per-sample or per-batch gradient by using a running or batch gradient. But those approximations are based on inference we have actually calculated. Adding to that uncertainty some guessing about what will happen in the future is not a correction based on facts; it's pure fiction. Of course, for every training process you will find a hyperparameter configuration with which this fiction is beneficial, just as you will find configurations with which it is not. But you get this knowledge only by experiment, instead of having an algorithm that is beneficial in general.
@Nerdimo 2 months ago
@@donmiguel4848 Starting to wonder if this is AI generated “pure fiction” 😂.
@Nerdimo 2 months ago
@@donmiguel4848 I understand your point; however, I think it's unfair to discount it as "fiction". My main argument is just that there are intuitions for why doing this could help take good steps toward the local minimum of the loss function.
@donmiguel4848 2 months ago
@@Nerdimo These "intuitions" are based on assumptions about the NN which don't match reality. We humans understand a hill and a sink, or a mountain and a canyon, and we assume the loss function is like that, but the real power of neural networks is the non-linearity of the activation and the flexibility of many interacting non-linear components. If our intuition matched what is actually going on in the NN, we could write an algorithm that would be much faster than the NN. But NNs are far more complex and beyond human imagination, so I think we have to be very careful with our assumptions and "intuitions", even though that seems "unfair".😉
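A practical footnote to the thread above: many implementations sidestep the "no inference at the fictive position" objection by storing the look-ahead variables themselves, so the gradient is always taken at the parameters that actually ran the forward pass. Below is a minimal sketch of that standard reparameterization; the variable names and hyperparameters are illustrative, not from the video.

```python
import numpy as np

def nesterov_step(phi, v, grad_fn, lr=0.01, mu=0.9):
    # One Nesterov step expressed on the look-ahead variables phi = w + mu * v.
    g = grad_fn(phi)                              # gradient at parameters that actually ran forward
    v_new = mu * v - lr * g                       # velocity update
    phi_new = phi - mu * v + (1.0 + mu) * v_new   # algebraically equal to w_new + mu * v_new
    return phi_new, v_new

# Tiny usage example on the quadratic f(w) = 0.5 * ||w||^2, whose gradient is w.
phi, v = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(100):
    phi, v = nesterov_step(phi, v, grad_fn=lambda w: w)
print(phi)  # approaches the minimum at the origin
```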
@Stopinvadingmyhardware 1 year ago
nom nom nom learn to program.