
The Unreasonable Effectiveness of Stochastic Gradient Descent (in 3 minutes)

63,856 views

Visually Explained

A day ago

Comments: 40
@dudelookslikealady12 2 years ago
Your saddle point animation took two seconds to illustrate why SGD might outperform vanilla GD. Amazing
@anikdas567 3 months ago
Very nice animations, and well explained. But just to be a bit technical, isn't what you described called "mini-batch gradient descent"? Because for stochastic gradient descent, don't we just use one training example per iteration? 😅😅
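For anyone weighing the distinction raised in this comment, here is a minimal NumPy sketch of the single-sample, mini-batch, and full-batch variants on a least-squares problem; the quadratic loss, the function names, and the hyperparameters are illustrative assumptions of mine, not code from the video.

import numpy as np

def gradient(w, X, y):
    # Gradient of the mean squared error 0.5 * mean((X @ w - y)**2)
    return X.T @ (X @ w - y) / len(y)

def sgd_step(w, X, y, lr, batch_size=1, rng=None):
    # batch_size = 1       -> "textbook" stochastic gradient descent
    # 1 < batch_size < n   -> mini-batch gradient descent
    # batch_size = len(y)  -> ordinary full-batch gradient descent
    rng = rng or np.random.default_rng()
    idx = rng.choice(len(y), size=batch_size, replace=False)
    return w - lr * gradient(w, X[idx], y[idx])

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=100)

w = np.zeros(3)
for _ in range(2000):
    w = sgd_step(w, X, y, lr=0.05, batch_size=8, rng=rng)  # mini-batch variant
print(w)  # should land close to w_true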
@hnbmm 8 months ago
2:22 and after is just magical. Thanks for the amazing video.
@PeppeMarino 2 years ago
Awesome explanation, better than many books
@josepht4799 2 years ago
Didn't expect to see Dota gameplay lol. Very useful video btw.
@kaynkayn9870 9 months ago
I love watching these videos when I just need a short refresher. Great content.
@ashimov1970 3 months ago
Brilliantly Genius!
@NikolajKuntner 2 years ago
I enjoy slow and sloppy the most.
@nathansmith8187 7 months ago
Came here to find this comment.
@evyats9127 2 years ago
Thanks a lot, this great short video closed that corner for me.
@MrKohlenstoff 6 months ago
These are very nice visualizations, and a great explanation of the fundamental idea of SGD. But I'm very skeptical of some of the intuitive-seeming explanations of why it's better than regular gradient descent. In particular, the saddle point example seems extremely constructed. It works with simple R²->R functions like the one we see, but even there only if the starting point (= model weights) is placed perfectly on the line, the probability of which is basically 0. Given that model weights are usually initialized randomly, and we're in R^n space with n >> 2, I doubt that such cases ever happen in actual deep learning.
Secondly, of course you can argue that SGD, due to its noisiness, may better escape local minima. But 1) do local minima actually exist in these extremely high-dimensional spaces? If you have a billion dimensions, it's exceedingly unlikely that the derivative of all of them is 0 at the same time, and it may be almost impossible to run into them with a discrete approach. I think all these R²->R visualizations build some very strong yet incorrect intuitions about what high-dimensional gradient descent actually looks like. And 2) it could, via the same process, also _miss_ a _global_ minimum that regular GD would find (or avoid a local minimum but never make it to any point that's better than the local minimum GD would have found - so avoiding it was then a _bad_ thing). Noise is not inherently good. We can construct specific examples where it happens to help, but we could just as well find many examples where GD would win against SGD, and in the end the reality is probably that the noise in SGD _hurts_ less than the gained performance helps, meaning overall it's the better option.
It kind of makes sense: evaluating all your training samples to compute the gradient has diminishing returns. So the first, say, 10% of samples are much more important than the last 10%. But if you always used the same 10% of samples, you would of course lose a lot of information overall. And if you chose some systematic process for how to select them, you might get strange biases. So naturally you pick them randomly. And hence, SGD.
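To make the saddle-point scenario from this comment concrete, here is a tiny sketch using f(x, y) = x² - y² (the function, step size, and noise scale are my own choices, purely for illustration): plain gradient descent started exactly on the y = 0 ridge never leaves it, while a noisy gradient estimate breaks the symmetry and slides off the saddle.

import numpy as np

def grad(p):
    x, y = p
    return np.array([2 * x, -2 * y])  # gradient of f(x, y) = x**2 - y**2

def descend(p, lr=0.1, noise=0.0, steps=200, seed=0):
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        p = p - lr * (grad(p) + noise * rng.normal(size=2))
    return p

start = np.array([1.0, 0.0])        # initialized exactly on the saddle's ridge
print(descend(start))               # y stays 0; the iterate converges to the saddle at (0, 0)
print(descend(start, noise=0.1))    # noise gives y a kick, |y| then grows, and the saddle is escaped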
@seasong7655 2 years ago
If the outcome of the SGD step is random, do you think it could be done multiple times so we could choose the best step?
@VisuallyExplained 2 years ago
Absolutely, this can help sometimes.
@cristian-bull 8 months ago
If you want to try N times per step to see which point "is better", that would require running N forward passes, N backward passes, N parameter updates, and N more forward passes to see which one gave the best result. Not only does the computation increase N times, you would also need N copies of the model (which can mean a lot of GPU memory), or keep a temporary copy of the model and do the N copies one at a time, which can also mean a lot of extra time. Not saying it's not "technically possible", but I doubt anyone would use that. I don't know what @VisuallyExplained was talking about saying *that* can help, like it's a common practice or something. Am I wrong, or am I missing something here?
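For what it's worth, a rough NumPy sketch of what such a "best-of-N" step might look like on a least-squares problem (the loss, the function names, and the idea of scoring candidates on the full data set are all my own assumptions); it makes the cost described above explicit: roughly N gradient evaluations plus N extra loss evaluations per update.

import numpy as np

def loss(w, X, y):
    return 0.5 * np.mean((X @ w - y) ** 2)

def grad(w, X, y):
    return X.T @ (X @ w - y) / len(y)

def best_of_n_sgd_step(w, X, y, lr, batch_size, n=4, rng=None):
    # Try n independent mini-batch steps from the same point and keep the
    # candidate with the lowest loss on the full data set.
    rng = rng or np.random.default_rng()
    best_w, best_loss = w, np.inf
    for _ in range(n):
        idx = rng.choice(len(y), size=batch_size, replace=False)
        candidate = w - lr * grad(w, X[idx], y[idx])
        candidate_loss = loss(candidate, X, y)
        if candidate_loss < best_loss:
            best_w, best_loss = candidate, candidate_loss
    return best_w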
@trendish9456 6 months ago
Watching these videos gives way better enjoyment than memes.
@jessielesbian6791 9 months ago
TinyGPT uses Adam (an SGD variant) with a small batch size for pre-training warmup, and a large batch size for fine-tuning.
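For readers who haven't met Adam: it keeps exponential moving averages of the gradient and of its elementwise square, and scales each step by them. A minimal sketch of a single update with the usual default hyperparameters (this is the standard textbook rule, not anything specific to TinyGPT):

import numpy as np

def adam_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # m, v: running averages of the gradient and its square; t: step count (1-based)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)   # bias correction for the early steps
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v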
@sinasec 1 year ago
Such great work. May I ask which software you used for the animation?
@DG123z 3 months ago
It's like being less restrictive keeps you from optimizing the wrong thing and getting stuck in the wrong valley (or hill, for evolution). Feels a lot like how I kept trying to optimize being a nice guy because there were some positive responses, and without some chaos I never would have seen another valley of being a bad boy, which has much less cost and better results.
@JuanCamiloAcostaArango 7 months ago
Finds better solutions? Isn't it just offering faster convergence?
@handlenull 2 years ago
Great channel. Thanks!
@sidhpandit5239 1 year ago
Amazing explanation.
@xuanthanhnguyen6741 1 year ago
nice explanation
@Gapi505 1 year ago
I'm trying to program my own neural network, but the training algorithms just won't get into my head. Thanks.
@chinokyou 2 years ago
good one
@EdeYOlorDSZs 2 years ago
You're awesome, subbed!
@igorg4129 1 year ago
Sorry, I don't get something. What do you take randomly: random observations? A random set of features? Or a random number of weights (= a random number of neurons)?
@fatihburakakcay5026 2 years ago
Perfect
@bennicholl7643 2 years ago
Stochastic gradient descent doesn't take some constant number of terms; it takes one training example at random, then performs the feed-forward and backpropagation with that one training example.
@VisuallyExplained 2 years ago
Sure, that's how it is usually defined. But in practice, it's way more common to pick a random mini-batch of size > 1 for training.
@bennicholl7643 2 years ago
@@VisuallyExplained Yes, but then that would be called mini-batch gradient descent, not stochastic gradient descent.
@nathanwycoff4627 1 year ago
@@bennicholl7643 In general optimization, SGD is defined in any situation where we have a random gradient; it doesn't even have to be a finite-sum problem. Restricting the term "stochastic gradient descent" to batch-size-1 approximations of finite-sum problems is terminology specific to machine learning.
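In that broader sense, the usual requirement is only that the step direction be an unbiased (noisy) estimate of the true gradient; in LaTeX, one common way to state it:

$$x_{k+1} = x_k - \alpha_k \, g_k, \qquad \mathbb{E}\left[g_k \mid x_k\right] = \nabla f(x_k),$$

where $g_k$ may come from a single sample, a mini-batch, or any other stochastic oracle.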
@tariq_dev3116 2 years ago
You are insane with those animations. Please tell me which software you use to make them.
@VisuallyExplained 2 years ago
Thanks Tariq! I use Blender3D for all of my 3D animations.
@tariq_dev3116 2 years ago
@@VisuallyExplained thank you 💜💜❤❤
@ianthehunter3532 11 months ago
How do you use emoji in Manim?
@chogy7875 1 year ago
Hello, I don't understand the meaning of 2:38 (I can follow the formula itself). Can you explain more?
@radhikadesai7781 1 year ago
It has something to do with momentum, I guess? Search KZfaq for "SGD with momentum". It is basically a math technique that smooths out the updates using a running average.
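For reference, one common form of the momentum (heavy-ball) update the reply is pointing at, written in LaTeX with $\beta$ as the smoothing factor (this is the standard rule, not necessarily the exact formula shown at 2:38):

$$v_{k+1} = \beta\, v_k + (1 - \beta)\, g_k, \qquad x_{k+1} = x_k - \alpha\, v_{k+1},$$

where $g_k$ is the current (possibly stochastic) gradient. The velocity $v$ is an exponentially weighted running average of past gradients, which damps the zig-zagging of raw (S)GD steps.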
@tsunningwah3471 1 month ago
Continuing to support; administrative procedure in progress.
@tsunningwah3471 28 days ago
summizaitok
@tsunningwah3471 28 days ago
hdj😂