iMAML: Meta-Learning with Implicit Gradients (Paper Explained)

22,839 views

Yannic Kilcher

Gradient-based meta-learning requires full backpropagation through the inner optimization procedure, which is a computational nightmare. This paper circumvents that and computes meta-gradients implicitly via the clever introduction of a quadratic regularizer.
OUTLINE:
0:00 - Intro
0:15 - What is Meta-Learning?
9:05 - MAML vs iMAML
16:35 - Problem Formulation
19:15 - Proximal Regularization
26:10 - Derivation of the Implicit Gradient
40:55 - Intuition why this works
43:20 - Full Algorithm
47:40 - Experiments
Paper: arxiv.org/abs/1909.04630
Blog Post: www.inference.vc/notes-on-ima...
Abstract:
A core capability of intelligent systems is the ability to quickly learn new tasks by drawing on prior experience. Gradient (or optimization) based meta-learning has recently emerged as an effective approach for few-shot learning. In this formulation, meta-parameters are learned in the outer loop, while task-specific models are learned in the inner-loop, by using only a small amount of data from the current task. A key challenge in scaling these approaches is the need to differentiate through the inner loop learning process, which can impose considerable computational and memory burdens. By drawing upon implicit differentiation, we develop the implicit MAML algorithm, which depends only on the solution to the inner level optimization and not the path taken by the inner loop optimizer. This effectively decouples the meta-gradient computation from the choice of inner loop optimizer. As a result, our approach is agnostic to the choice of inner loop optimizer and can gracefully handle many gradient steps without vanishing gradients or memory constraints. Theoretically, we prove that implicit MAML can compute accurate meta-gradients with a memory footprint that is, up to small constant factors, no more than that which is required to compute a single inner loop gradient and at no overall increase in the total computational cost. Experimentally, we show that these benefits of implicit MAML translate into empirical gains on few-shot image recognition benchmarks.
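
To make the abstract's implicit-differentiation idea concrete, here is a minimal numerical sketch (not the authors' code; the function name, the toy quadratic, and the dense solve are all illustrative) of the linear system iMAML solves to obtain the meta-gradient without backpropagating through the inner optimization path:

```python
import numpy as np

# Inner problem:       phi* = argmin_phi  L_train(phi) + (lam/2) * ||phi - theta||^2
# Implicit meta-grad:  dL_test/dtheta = (I + H_train(phi*)/lam)^{-1} @ grad_test(phi*)

def implicit_meta_grad(grad_test, hess_train, lam):
    """Solve (I + H/lam) v = grad_test instead of unrolling the inner loop."""
    n = grad_test.shape[0]
    A = np.eye(n) + hess_train / lam
    # The paper solves this system with conjugate gradient using only
    # Hessian-vector products; a dense solve is enough for illustration.
    return np.linalg.solve(A, grad_test)

# Toy example with an explicit Hessian of the inner training loss at phi*.
H_train = np.diag([1.0, 2.0, 3.0])
g_test = np.array([0.5, -1.0, 2.0])
print(implicit_meta_grad(g_test, H_train, lam=1.0))  # implicit meta-gradient
```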
Authors: Aravind Rajeswaran, Chelsea Finn, Sham Kakade, Sergey Levine
Links:
YouTube: / yannickilcher
Twitter: / ykilcher
BitChute: www.bitchute.com/channel/yann...
Minds: www.minds.com/ykilcher

Comments: 51
@waxwingvain 4 years ago
you have no idea how relevant this is for me now, I'm currently working on an NLP problem using maml, thanks!
@amitkumarsingh406 2 years ago
interesting. what is it about?
@nDrizza 4 years ago
Awesome explanation! I really like that you took enough time to explain the idea clearly instead of trying to shrink the explanation down to something like 30 minutes, which might not have been understandable.
@aasimbaig01 4 years ago
I learn new things every day from your videos!!
@JackSPk 4 years ago
Really enjoyed this one. Pretty good companion and intuitions for reading the paper (especially the "shacka da bomb" part).
@SCIISano 3 years ago
Ty for explaining the implicit Jacobian. This was exactly what I was looking for.
@anthonyrepetto3474 4 years ago
thank you for the details!
@leondawn3593 3 years ago
very clearly explained! great thanks!
@herp_derpingson 4 years ago
34:50 Mind blown. Great paper. Keep it coming!
39:40 What happens if the matrix is not invertible? Do we just discard that and try again?
41:50 This is kinda like the N-body problem but with SGD instead of gravity.
@YannicKilcher 4 years ago
I don't think that matrix is ever non-invertible in practice, because of the identity add. But if so, just take a pseudo inverse or something.
@arkasaha4412 4 years ago
This is one of your best videos! :)
@AshishMittal61 4 years ago
Great Video! Really helped with the intuition.
@zikunchen6303 4 years ago
daily uploads are amazing, i watch your videos instead of random memes now
@ernestkirstein6233 4 years ago
The last step that he wasn't explicit about at 39:13 was that dphi/dtheta + (1/lambda) * hessian * dphi/dtheta = I, so (I + (1/lambda) * hessian) * dphi/dtheta = I, and therefore I + (1/lambda) * hessian is the inverse of dphi/dtheta.
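
For reference, a worked version of the step this comment spells out (a reconstruction, assuming the paper's proximally regularized inner objective with regularization strength $\lambda$): differentiating the stationarity condition of the inner problem with respect to $\theta$ gives

$$\nabla \hat{\mathcal{L}}(\phi^*) + \lambda(\phi^* - \theta) = 0 \;\Rightarrow\; \nabla^2 \hat{\mathcal{L}}(\phi^*)\,\frac{d\phi^*}{d\theta} + \lambda\Big(\frac{d\phi^*}{d\theta} - I\Big) = 0 \;\Rightarrow\; \frac{d\phi^*}{d\theta} = \Big(I + \tfrac{1}{\lambda}\nabla^2 \hat{\mathcal{L}}(\phi^*)\Big)^{-1},$$

which is the inverse the comment identifies, and the implicit Jacobian (equation 6) discussed in the paper.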
@ernestkirstein6233 4 years ago
Another great video Yannic!
@tianyuez 4 years ago
Great video!
@S0ULTrinker 2 years ago
How do you backpropagate gradient through previous gradient steps, when you need multiple forward passes to get Theta for each of the K steps? 13:11
@JTMoustache 3 years ago
I missed this one before. This just highlights how useful it is to really master (convex) optimization when you want to be original in ML. Too bad I did not go to nerd school.
@ekjotnanda6832 3 years ago
Really good explanation 👍🏻
@marouanemaachou7875 4 years ago
Keep up the good work!!
@tusharprakash6235 2 months ago
In the inner loop, for more than one step, the gradients should be computed w.r.t. the initial parameters, right?
@YIsTheEarthRound 4 years ago
I'm new to MAML so maybe this is a naive question but I'm not sure I understand the motivation for MAML (versus standard multi-task learning). Why is it a good idea? More specifically, it seems that MAML is doing a multi-scale optimisation (one at the level of training data with \phi and one at the level of validation data with \theta), but why does this help with generalisation? Is there any intuition/theoretical work?
@YannicKilcher 4 years ago
The generalization would be across tasks. I.e. if a new (but similar) task comes along, you have good initial starting weights for fine-tuning that task.
@YIsTheEarthRound 4 years ago
@@YannicKilcher But why does it do better than 'standard' multi-task ML in which you keep the task-agnostic part of the network (from training these other tasks) and retrain the task-specific part for the new task? It seems like there's 2 parts to why MAML does so well -- (1) having learned representations from previous tasks (which the standard multi-task setting also leverages), and (2) using a validation set to learn this task-agnostic part. I was just wondering what role the second played and whether there was some intuition for why it makes sense.
@user-xy7tg7xc1d 4 years ago
Sanket Shah You can check out the new meta learning course by Chelsea Finn kzfaq.info/sun/PLoROMvodv4rMC6zfYmnD7UG3LVvwaITY5
@alexanderchebykin6448 4 years ago
You've mentioned that first-order MAML doesn't work well - AFAIK that's not true: in the original MAML paper they achieve the same (or better) results with it in comparison to normal MAML (see Table 1, bottom). This also holds for all the independent reproductions on github (or at least the ones I looked at).
@shijizhou5334 4 years ago
Thanks for correcting that, I was also confused about this question.
@jonathanballoch 3 years ago
if anything the plots show that FOMAML *does* work well, but much slower
@andreasv9472 4 years ago
Hi, interesting video! what is this parameter theta? is it the weights of the neural nets? or how many neurons there are? or is it something like learning rate, step-size, or something like that?
@YannicKilcher 4 years ago
yes, theta are the weights of the neural nets in this case
@brojo9152 3 years ago
Which software do you use to write things along with the paper?
@arindamsikdar5961 3 years ago
At 36:15 in your video, derive the whole equation (both sides) w.r.t. \phi and not \theta to get equation 6 in the paper.
@hiyamghannam1939 3 years ago
Hello, thank you so much!! Have you explained the original MAML paper?
@YannicKilcher 3 years ago
Not yet, unfortunately
@nbrpwng 4 years ago
Nice video, it reminds me of the e-maml paper I think you reviewed some time ago. Have you by chance considered making something like a channel discord server? Maybe it would be a nice thing for viewers to discuss papers or other topics in ML, although these comments sections are good too from what I’ve seen.
@YannicKilcher 4 years ago
Yes my worry is that there's not enough people to sustain that sort of thing.
@nbrpwng 4 years ago
Yannic Kilcher I’m not entirely sure about how many others would join, but I think maybe enough to keep it fairly active, at least enough to be a nice place to talk about papers or whatever sometimes. I’m in a few servers with just a few dozen active members and that seems to be enough for good daily interaction.
@freemind.d2714 3 years ago
Regularization is like turning Maximum Likelihood Estimation (MLE) into Maximum A Posteriori (MAP) estimation.
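
A small sketch of that correspondence as it applies here (an interpretation, assuming a Gaussian prior on the task parameters $\phi$ centered at the meta-parameters $\theta$):

$$\hat{\phi}_{\mathrm{MAP}} = \arg\max_{\phi}\; \log p(\mathcal{D}\mid\phi) + \log \mathcal{N}(\phi\mid\theta,\sigma^2 I) = \arg\min_{\phi}\; -\log p(\mathcal{D}\mid\phi) + \tfrac{1}{2\sigma^2}\lVert\phi-\theta\rVert^2,$$

i.e. the proximal term $\tfrac{\lambda}{2}\lVert\phi-\theta\rVert^2$ plays the role of the log-prior, with $\lambda = 1/\sigma^2$.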
@go00o87 4 years ago
hm... isn't grad_phi(phi)=dim(phi)? provided phi is a multidimensional vector, it shouldn't be 1. Granted it doesn't matter as it just rescales Lambda and that parameter is arbitrary anyways.
@herp_derpingson 4 years ago
I think you are confusing grad with Hessian. The grad operation on a tensor doesn't change its dimensions. For example, if we take phi = [f(x) = x], then grad_x [x] is equal to [1], or the identity matrix.
@benwaful 3 years ago
Isn't this just Reptile, but instead of using the minimum of ||phi' - theta|| as the update, you use it as a regularizer?
@YannicKilcher 3 years ago
sounds plausible, but I have never heard of reptile
@benwaful 3 years ago
@@YannicKilcher arxiv.org/abs/1803.02999
@UncoveredTruths 4 years ago
From experience, transfer learning doesn't work nearly as well as people make it out to for medical imagery.
@wizardOfRobots 3 years ago
please also upload 1080p
@marat61 4 years ago
38:36 it seems that there are mistakes in the expression you derived
@YannicKilcher 4 years ago
share them with us :)
@spaceisawesome1 4 years ago
You're working so hard. Please get some sleep or rest! haha