Deep Ensembles: A Loss Landscape Perspective (Paper Explained)

22,775 views

Yannic Kilcher


#ai #research #optimization
Deep Ensembles work surprisingly well for improving the generalization capabilities of deep neural networks. Notably, they outperform Bayesian neural networks, which are, in theory, doing the same thing. This paper investigates how Deep Ensembles are especially suited to capturing the non-convex loss landscape of neural networks.
OUTLINE:
0:00 - Intro & Overview
2:05 - Deep Ensembles
4:15 - The Solution Space of Deep Networks
7:30 - Bayesian Models
9:00 - The Ensemble Effect
10:25 - Experiment Setup
11:30 - Solution Equality While Training
19:40 - Tracking Multiple Trajectories
21:20 - Similarity of Independent Solutions
24:10 - Comparison to Baselines
30:10 - Weight Space Cross-Sections
35:55 - Diversity vs Accuracy
41:00 - Comparing Ensembling Methods
44:55 - Conclusion & Comments
Paper: arxiv.org/abs/1912.02757
Abstract:
Deep ensembles have been empirically shown to be a promising approach for improving accuracy, uncertainty and out-of-distribution robustness of deep learning models. While deep ensembles were theoretically motivated by the bootstrap, non-bootstrap ensembles trained with just random initialization also perform well in practice, which suggests that there could be other explanations for why deep ensembles work well. Bayesian neural networks, which learn distributions over the parameters of the network, are theoretically well-motivated by Bayesian principles, but do not perform as well as deep ensembles in practice, particularly under dataset shift. One possible explanation for this gap between theory and practice is that popular scalable variational Bayesian methods tend to focus on a single mode, whereas deep ensembles tend to explore diverse modes in function space. We investigate this hypothesis by building on recent work on understanding the loss landscape of neural networks and adding our own exploration to measure the similarity of functions in the space of predictions. Our results show that random initializations explore entirely different modes, while functions along an optimization trajectory or sampled from the subspace thereof cluster within a single mode predictions-wise, while often deviating significantly in the weight space. Developing the concept of the diversity--accuracy plane, we show that the decorrelation power of random initializations is unmatched by popular subspace sampling methods. Finally, we evaluate the relative effects of ensembling, subspace based methods and ensembles of subspace based methods, and the experimental results validate our hypothesis.
Authors: Stanislav Fort, Huiyi Hu, Balaji Lakshminarayanan
Links:
KZfaq: / yannickilcher
Twitter: / ykilcher
Discord: / discord
BitChute: www.bitchute.com/channel/yann...
Minds: www.minds.com/ykilcher

Comments: 104
@ayushthakur736 4 years ago
I loved this paper explanation. And the paper is so interesting. Will try it out. Thanks for the explanation.
@marc-andrepiche1809 3 years ago
I'm not convinced the final decision vectors are so different, since they can all be permuted. I am intrigued by the fact that they disagree, but we know from adversarial methods that it doesn't take much for a network to change its decision. Very intrigued, but not fully convinced 🤔
@first-thoughtgiver-of-will2456 3 years ago
This channel is life changing! Please keep up the amazing work!
@matthewtaylor7859 3 years ago
What a paper! Some really fascinating information there, and immensely thought provoking. Thanks a lot for the talk-through!
@herp_derpingson 4 years ago
0:00 Your pronunciation of names from all cultures is remarkably good. In fact, I would say you are the best among all the people I have seen so far. A Russian, a Chinese and an Indian walked into a bar, Yannic greeted all of them and they had a drink together. There is no joke here, go away.

24:00 It is interesting to see that ResNet has lower disagreement and higher accuracy. I wonder if disagreement is inversely correlated with accuracy.

32:50 I think one way of interpreting this is that both 3 * 5 and 5 * 3 give the same result, 15. So, although they are different in weight space, they are not different in solution space; thus, they have the same loss/accuracy. This is difficult to prove for neural networks with millions of parameters, but I would wager that something similar happens. I think this problem may disappear completely if we manage to find a way to make a neural network whose parameters are order independent.

44:00 I wonder what would happen if we sliced a ResNet lengthwise into maybe 2-5 trunks, so that the neurons in each layer are only connected to their own trunk. All trunks would have a common start and end. Would that outperform the regular ResNet? Technically, it is an ensemble, right? (A rough sketch of this idea follows below.)

I think authors should also start publishing a disagreement matrix in the future.
@jasdeepsinghgrover2470 4 years ago
24:00 I guess residual connections make the loss landscape more convex; that's why ResNet and DenseNet seem more similar. A similar idea was shown in a paper on visualizing the loss landscape of NNs. I guess it relates.
@snippletrap 4 years ago
It must be a Swiss thing -- as a small country in the middle of Europe they need to speak with all their neighbors.
@YannicKilcher 3 years ago
Thanks, that's a big compliment :) Yea I think there's definitely a link between disagreement and accuracy, but also the disagreement metric among different architectures is very shaky, because it's entirely unclear how to normalize it. In the paper, they do mention weight space symmetry and acknowledge it, but the effect here goes beyond that, otherwise there would be no disagreement, if I understand you correctly. Your idea is interesting, sounds a bit like AlexNet. The question here is how much of the ensemble-ness comes from the fact that you actually train the different parts separately from each other. I have no idea what would turn out, but it's an interesting question :)
@deepblender 4 years ago
Since the Lottery Ticket Hypothesis, I have been expecting a paper which shows that those ensembles can be joined into one neural network.
@NicheAsQuiche 3 years ago
Mean, mode, median: oh yeah, it's all coming together.
@thinknotclear 3 years ago
You are correct. "Training Independent Subnetworks for Robust Prediction" is under review at ICLR 2021.
@jcorey333 1 year ago
This is so cool! I love learning about this stuff
@Notshife 4 years ago
Nice when other people prove your theory on your behalf :) Thanks for the clear breakdown of this paper
@RAZZKIRAN 2 years ago
Thank you, great effort, great explanation.
@lucca1820 4 years ago
What an awesome paper and video.
@greencoder1594 3 years ago
TL;DR [45:14] Multiple random initializations lead to functionally different but equally accurate modes of the solution space, which can be combined into ensembles to pool their competence. (This works far better than building an ensemble from a single mode in solution space that has been perturbed multiple times to capture parts of its local neighborhood.)
@sillygoose2241 4 years ago
Love this paper and your explanation. It almost seems that if you train the model with different random initializations, on say a dataset of cat and dog pictures, one initialization will try to figure out the labels based on one set of features like eye shape, and another initialization will try to figure it out based on other features like hair texture. Completely different approaches to the problem, but I would expect different accuracies at the end. Incredible results; makes the cogs turn in my head as to what's going on.
@larrybird3729 3 years ago
25:26 That caught me off guard! LMAO!!!
@grantsmith3653 1 year ago
@23:44 the chart says that they disagree on 20% of the labels, but it doesn't say that those 20% are different elements of the dataset. For example, it could be the same 20% of the dataset, but training 1 guesses they're cats and training 2 says they're dogs. Of course, this has some diminishing returns because there are only 10 classes in CIFAR-10, but I think the point still holds. Also, I agree with the paper that this supports the idea that they're different functions, but I don't think it supports the claim that the 20% are totally different elements of the dataset. What do you think, @Yannic Kilcher?
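A quick way to separate the two readings of that chart is to compare the raw disagreement rate with how much the two models' error sets overlap. A minimal numpy sketch with made-up predictions (none of these numbers come from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical CIFAR-10-style test labels and predictions of two independently trained models.
n, n_classes = 10_000, 10
labels = rng.integers(0, n_classes, size=n)
preds_1 = np.where(rng.random(n) < 0.8, labels, rng.integers(0, n_classes, size=n))
preds_2 = np.where(rng.random(n) < 0.8, labels, rng.integers(0, n_classes, size=n))

disagreement = np.mean(preds_1 != preds_2)           # fraction of points predicted differently
err_1, err_2 = preds_1 != labels, preds_2 != labels  # error set of each model
error_overlap = np.mean(err_1 & err_2) / np.mean(err_1 | err_2)  # Jaccard overlap of the error sets

print(f"disagreement: {disagreement:.3f}, error-set overlap: {error_overlap:.3f}")
```

High disagreement together with high error-set overlap would match the "same hard 20%, different wrong guesses" reading; low overlap would mean the models fail on genuinely different examples.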
@anthonyrepetto3474 4 years ago
This is the kind of research that makes it all possible - makes me wonder about nested hierarchies of ensembles, a la Hinton's capsules?
@NeoShameMan 3 years ago
I would like to see the difference per layer. If the weights diverge at higher levels rather than lower ones, this could give a good insight into how to optimize: basically, train a network once, extract the stable layers, and retrain with divergence on the unstable layers. I always felt something akin to frequency separation would work.
@machinelearningdojowithtim2898 4 years ago
Fascinating insight into ensemble methods, and by extension lottery tickets etc. It rather begs the question though, how many basins/modes are there, and what do they most correspond to? It feels like we are exploring the surface of a distant and unknown planet, exciting times 😀
@jonassekamane 4 years ago
Super interesting video. I wonder whether the differently initialised models learn to "specialise" on a specific subset of classes, given that 1) each independent model performs about equally well, 2) the solutions don't overlap in terms of cosine similarity, and 3) the average/sum of multiple models improves accuracy. Could it be that one model specialises in predicting airplanes and ships, while another model specialises in predicting deer and dogs? This might explain why they perform about equally well on the CIFAR-10 test set, don't overlap in terms of cosine similarity, and why accuracy improves by simply averaging their predictions. Or, stated differently, whether the number of modes in the loss landscape corresponds to the number of ways of combining the classes. This is also somewhat related to your comment at 34:21 regarding under-parameterisation ("that no single model can look at both features at the same time") vs. an over-specified model ("too simple of a task" and that there are 500 different ways of solving it).

With many classes this is a difficult hypothesis to prove or disprove, since the combinatorics of N classes/labels gives 2^N-1 modes in the loss landscape. With CIFAR-10, although they check 25 independent solutions and find no overlapping cosine similarity, this might simply be due to there being 2^10-1 = 1023 different possible combinations of functions. If I were to test it, I would investigate a dataset with three classes: airplane, bird and cat (hereafter "A", "B", and "C"), which gives only 2^3-1 = 7 combinations. Then you could test whether you find a model that was good at predicting class A, one that was good at predicting class B, good at C, reasonable at A+B, reasonable at A+C, reasonable at B+C, and OK at predicting A+B+C. It would be disproven by checking 10-25 independent solutions and finding no overlap in the cosine similarity. On the other hand, if you do find overlap, this would indicate that the models become class-specific (or class-subset-specific) during training, and it might also explain why, with ensembling, we see decreasing marginal improvements in accuracy as the ensemble size grows.
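One way to probe this specialization hypothesis on an already-trained collection of models is to compare their per-class accuracy profiles: if the modes really correspond to class subsets, the profiles should differ far more across models than noise alone would explain. A rough numpy sketch with synthetic stand-in predictions (the model count and accuracy level are arbitrary assumptions):

```python
import numpy as np

def per_class_accuracy(preds, labels, n_classes=10):
    """Accuracy of one model, broken down by true class."""
    return np.array([np.mean(preds[labels == c] == c) for c in range(n_classes)])

rng = np.random.default_rng(1)
n, n_classes, n_models = 10_000, 10, 5
labels = rng.integers(0, n_classes, size=n)

# Hypothetical predictions from n_models independently trained models.
all_preds = [np.where(rng.random(n) < 0.8, labels, rng.integers(0, n_classes, size=n))
             for _ in range(n_models)]

profiles = np.stack([per_class_accuracy(p, labels, n_classes) for p in all_preds])
# If models specialized in disjoint class subsets, the rows of `profiles` would differ strongly;
# if they only disagree on borderline examples, the rows would be nearly identical.
print(np.round(profiles, 2))
print("per-class accuracy spread across models:", np.round(profiles.std(axis=0), 3))
```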
@YannicKilcher 3 years ago
Very nice thoughts, it sounds like an awesome project to look at! It gets further complicated by the numerous symmetries that exist in neural networks; it's very unclear whether cosine similarity is actually an accurate measure here.
@DCnegri 2 years ago
I wonder how good that measure of weight similarity really is
@judgeomega 4 years ago
I'm just spitballing here, but what if, towards the end of the NN, each of the nodes was treated a bit differently during training in an attempt to stimulate multi-modal solutions; perhaps only backpropagate on the node with the highest signal for the current training example.
@YannicKilcher 3 years ago
nice idea!
@harshraj7591 2 years ago
Thank you for the great explanation. I wonder if work has been done on ensembles for other tasks as well, e.g. segmentation, GANs, etc.?
@PhucLe-qs7nx 4 years ago
This paper basically confirms the well-known technique of ensembling from *actual* independent runs, rather than the many different papers describing tricks to build an ensemble from a single run, such as SWA/SWAG. They don't say it explicitly, but I guess this might imply a fundamental tradeoff in ensembling: you can either save some computation by single-run averaging, or reap the full benefit of ensembling from multiple runs.
@deepaksadulla8974 4 years ago
It seems to be a really good paper that views training through a totally different lens. Also, I have seen a few top Kaggle solutions with 8+ fold cross-validation ensembles, so that seems to work. A much needed break from SoTA papers indicating "I beat your model" :D
@yizhe7512 4 years ago
I almost lost hope in the SOTA ocean, then this paper kicks in...Thanks Google...Oh! Wait...
@oostopitre 4 years ago
Thanks for the video. As always, a great breakdown. It is interesting that the authors consider perturbing the weights slightly as a strategy too. That does not get you out of the current local minimum, right? So for ensembles to be effective, one would need learners that are as uncorrelated as possible (kinda captured by the independent-optima strategy). Maybe I need to read the paper first-hand too, to understand the details more :P
@HarisPoljo 4 years ago
I wonder how similar the weights of the first n layers (maybe n=5) of the different networks are. Do they capture the same low-level features? 🤔
@YannicKilcher 3 years ago
Good question.
@firedrive45 3 years ago
The cosine comparison is just a normalized dot product of all the final parameters, flattened into one vector per network. When two networks have similar weights, it is close to 1.
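Concretely, the weight-space measure boils down to flattening every parameter tensor of each network into one long vector and taking the cosine between the two vectors. A minimal PyTorch sketch (toy, untrained models used purely as stand-ins for the trained solutions compared in the paper):

```python
import torch
import torch.nn as nn

def flat_params(model: nn.Module) -> torch.Tensor:
    """Concatenate all of a model's parameters into a single flat vector."""
    return torch.cat([p.detach().reshape(-1) for p in model.parameters()])

# Two identically shaped, independently initialized toy networks.
net_a = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
net_b = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))

cos = torch.nn.functional.cosine_similarity(flat_params(net_a), flat_params(net_b), dim=0)
print(f"cosine similarity in weight space: {cos.item():.4f}")  # near 0 for independent inits
```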
@azai.mp4 3 years ago
It's even possible that they capture the same low-level features, but extract/represent them in different ways.
@petruericstavarache9464 3 years ago
Perhaps a bit late, but there is a paper on exactly this question: arxiv.org/pdf/1905.00414.pdf
@dermitdembrot3091 4 years ago
Very cool paper! Especially the insight that the local optima are very different in their predictions, and not just all failing on the same data points! 25 percent disagreement at 60 percent accuracy is not a giant effect though. With 40% of points misclassified, there is a lot of room for disagreement between two wrong solutions.
@dermitdembrot3091 4 years ago
And cosine similarity in parameter space seems like a useless measure to me, since just by reordering neurons you can permute weights without changing the function's behavior. This permutation is enough to make the parameters very cos-dissimilar. Consider e.g. a network with a 1x2 layer with weights (1, -3) and a 2x1 layer with weights (2, 1)^T (and no bias for simplicity). Thus the parameter vector is (1, -3, 2, 1). By switching the two hidden neurons we get (-3, 1, 1, 2), resulting in a cosine similarity of -2/15 ≈ -0.133, while the networks are equivalent!
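The numbers in that example check out. A small numpy sketch verifying both claims, identical function but non-aligned parameter vectors (the ReLU hidden activation is my assumption, since the comment does not specify one):

```python
import numpy as np

def tiny_net(w1, w2):
    """A 1 -> 2 -> 1 network with ReLU hidden units and no biases."""
    return lambda x: np.maximum(x * w1, 0.0) @ w2

w1, w2 = np.array([1.0, -3.0]), np.array([2.0, 1.0])  # original ordering
w1p, w2p = w1[::-1].copy(), w2[::-1].copy()            # the two hidden neurons swapped

f, g = tiny_net(w1, w2), tiny_net(w1p, w2p)
xs = np.linspace(-2, 2, 9)
print(all(f(x) == g(x) for x in xs))  # True: the two networks compute the same function

theta, theta_p = np.concatenate([w1, w2]), np.concatenate([w1p, w2p])  # (1,-3,2,1) vs (-3,1,1,2)
cos = theta @ theta_p / (np.linalg.norm(theta) * np.linalg.norm(theta_p))
print(round(cos, 3))  # -0.133, despite functional equivalence
```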
@herp_derpingson 4 years ago
@dermitdembrot3091 I wonder if we can make a neural network which is permutation invariant.
@eelcohoogendoorn8044 4 years ago
@dermitdembrot3091 Yeah, the cosine (dis)similarity between models doesn't say much; but the difference in classification does indicate that those differences are meaningful. Clearly these are not mere permutations, or anything very close to them.
@slackstation 4 years ago
Interesting. I wonder if people have taken various task types, explored the problem space, and figured out whether the landscape is different. Would the landscape (hills and valleys) of CIFAR vs MNIST be different? Are there characteristics or patterns that we can derive other than just computing the gradient and trying to go down into a local minimum? For instance, are all minima shaped the same? Can you derive the depth of a local minimum from its shape? Could you take advantage of that during training to abandon a training run and start from a new random spot? Could you, if you knew how large a valley (or the average valley in the landscape) was, use that to inform where you put a new starting point for your training? If the thesis is that all local minima sit at roughly the same level of loss, couldn't one randomly sample points until one is found within some tolerance of the global-minimum level, so as to save training time? This would be especially fruitful if you knew some characteristics of the average size of the valleys around local minima, and thus worthwhile if you were going to train ensembles anyway. As always, great work. Thank you for your insight. It's become a daily habit, checking out the Kilcher paper of the day.
@herp_derpingson 4 years ago
I don't think so. Non-convex optimization is NP-hard, after all. If there were an easy way, it would break all of computer science.
@eelcohoogendoorn8044 4 years ago
Yeah, it is somewhat problem dependent. Ensembles are particularly valuable in regression tasks with a known large nullspace. Image reconstruction tasks usually fall into this category: there is a family of interpretations consistent with the data, most variations being physically indistinguishable rather than just having similar loss, and we don't want to be fooled into believing just one of them. Neural networks, being typically overparameterized, intrinsically seem to have some of this null-spaceness within them though, regardless of what problem you are trying to solve with them.
@tristantrim2648 3 years ago
Look into this project if you haven't: losslandscape.com. They want to build a database of loss landscape visualizations.
@wentianbao5368 3 years ago
Looks like ensembling is still a good way to squeeze out a little more accuracy when a single model is already highly optimized.
@JoaoBarbosa-pq5pv 4 years ago
Great video and paper. A lot of insight! However, I was a bit disappointed by the ~5% improvement from combining 10+ models relative to a single one, given how different the individual solutions supposedly were (were they?). Makes me think the way the ensemble members were combined is not optimal.
@martindbp 4 years ago
This is a good case for the wisdom of crowds and democracy
@quantlfc 3 years ago
How do you get the initialization points to be so close for the t-SNE thing? I have been trying to implement it in PyTorch, but to no success.
@tbaaadw3375 4 years ago
It would be interesting to compare against the accuracy of a single network with the parameter count of the whole ensemble.
@eelcohoogendoorn8044 4 years ago
I think it's a good question. The benefits provided by ensembles apply to both over- and under-parameterized models to some extent, though more so for the typically overparameterized case, I would imagine. It would be good to address that question head-on experimentally. Still, the equi-parameter single big model isn't going to outcompete the ensemble, unless we are talking about a severely underparameterized scenario. These benefits are quantitatively different from simply having more parameters.
@chinbold 4 years ago
Will you do a live video of a paper explanation?
@hadiaghazadeh 1 year ago
Great paper and great explanation! Can I ask what software you used for reading and writing on this paper file?
@XetXetable 4 years ago
I'm not really convinced by the criticism of Bayesian NNs. The paper itself seems to indicate that the issues stem from using a Gaussian distribution, not from the need to approximate some prior, or whatever. It's unclear to me why approximating a prior should restrict us to one local minimum, while it's clear why using a Gaussian would do that. Intuitively, replacing the Gaussian with some multimodal distribution with N modes should perform similarly to an ensemble of N Gaussian NNs; in fact, the latter seems like it would be basically the same thing as a Bayesian NN that used Gaussian mixture distributions instead of single Gaussians. Though non-Gaussian methods are common in Bayesian machine learning, I can't think of any work that does this in the context of NNs; maybe this work provides adequate motivation to push in that direction.
@YannicKilcher 3 years ago
True, the problem here is that most of the Bayesian methods actually use Gaussians, because these are the only models that are computable in practice.
@drhilm 4 years ago
Does the brain also use this trick: creating an ensemble of thousands of not-so-deep learners, all of them trying to solve a similar problem? That way it can easily generalize between distant tasks, having a better sampling of the functional space. This is similar to the 'thousand brains theory' of Numenta, isn't it? What do you think?
@herp_derpingson 4 years ago
I was thinking the same thing. Neurons in the brain are only connected to nearby neurons. Thus, if we take clusters of far-off neurons, we can say that they act like an ensemble.
@sacramentofwilderness6656 4 years ago
Is there any kind of gauge invariance for neural networks? In the sense that we have to look not at particular assignments of the neuron weights, but at equivalence classes up to some transformations of the neurons?
@YannicKilcher 3 years ago
Good questions, there are definitely symmetries, but no good way so far of capturing all of them.
@Batsup1989 3 years ago
Very cool paper, I kind of want to do that plane plot from 2 solutions with some networks now :) In Fig. 3 I am not super happy about the measure they used for the disagreement. If your baseline accuracy is 64%, the fact that the solutions disagree on 25% of labels shows that the functions are different, but it does not convince me that they are in different modes, and it definitely does not suggest that picture of the first 10% of errors vs the last 10% of errors. The diversity measure in Fig. 6 seems a bit better to me, but still not ideal. I would be more interested in something like a full confusion matrix for the two solutions, or parts of it. To me, disagreement on examples that both networks get wrong is not interesting; cases where at least one of the networks gets it right are a lot more interesting, because those are the cases that at least have the potential to boost the ensemble performance.
@YannicKilcher 3 years ago
True, I agree there is a lot of room to improve on the sensibility of these measurements.
@mattiasfagerlund 4 years ago
Dude, you're on FIRE! I've used XGBoost ensembles to compete on Kaggle; there was no way to compete without them in the competitions I'm thinking of. But wasn't dropout supposed to make ensembles redundant? A network with dropout is, in effect, training continuously shifting ensembles?
@mattiasfagerlund 4 years ago
Ah, they mention dropout - good, at least I fed the algorithm!
@dermitdembrot3091 4 years ago
Dropout can be seen as a Bayesian method if applied during both training and testing. It probably suffers from the same single-mode problem.
@chandanakiti6359 4 years ago
I think dropout is only for regularizing the NN. It does NOT have an ensembling effect. That's because it is still a single weight vector converging to a local minimum; dropout just helps the convergence. Please correct me if I am wrong.
@eelcohoogendoorn8044 4 years ago
There are a lot of quasi-informed statements going on around dropout; it making ensembles redundant never made it past the wishful-thinking stage, as far as I can tell from my own experiences.
@bluel1ng 4 years ago
This time a "classic" from Dec 2019 ;-)... The consistent amount of disagreement in Fig. 3 (right) is very interesting. But the 15% prediction similarity in Fig. 5 (middle and right) seems very low to me (e.g. if they disagree on 85% of the predictions, how can the accuracy of those models be 80% at the optima, as shown in the left plot)? Would be great if somebody could give me a hint... Also, I am not convinced that different t-SNE projections prove a functional difference (e.g. symmetric weight solutions may also have a high distance). And just a thought regarding weight orthogonality: even normally distributed random vectors are nearly orthogonal (so probably also two different initializations). I will take a closer look at the paper; probably there are some details explained...
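On the last point, the near-orthogonality of independent Gaussian vectors in high dimensions is easy to check numerically; the expected cosine similarity scales like 1/sqrt(d). A quick sketch (the dimension is chosen arbitrarily, on the order of a small network's parameter count):

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 1_000_000

# Two independent standard-normal "initializations", flattened into vectors.
a, b = rng.standard_normal(dim), rng.standard_normal(dim)
cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print(f"cosine similarity: {cos:+.5f}")  # typically around 1/sqrt(dim) ~ 0.001 in magnitude
```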
@jasdeepsinghgrover2470 4 years ago
I guess they might have taken validation examples with very high noise or a very different set of vectors... One thing I have always observed and wondered about is that for classifiers to be accurate, they need not be the same or similar. Humans would agree on images of cats and dogs, but would probably never agree if they were given images of random patterns or pixels and had to label them as cats and dogs. This paper very much strengthens my belief, but I will read it first.
@bluel1ng 4 years ago
Regarding the t-SNE plots: I thought they were trajectories of the network weights... but they are mapped predictions of the networks, so the plot indeed shows functional differences...
@yerrenv.st.annaland2725 4 years ago
Please update us if you figure out the accuracy thing, it's bothering me too!
@eelcohoogendoorn8044 4 years ago
Seems to me the axis is the amount of disagreement, given that 0 is on the diagonal. So they disagree on 15%, which is consistent with 80% accuracy.
@bluel1ng 4 years ago
@eelcohoogendoorn8044 Thanks for the attempt, but in the paper it says "show function space similarity (defined as the fraction of points on which they agree on the class prediction) of the parameters along the path to optima 1 and 2"... this is also consistent with values approaching 1 in the direct neighborhood around the target optimum. In the supplementary material there are similar plots with a different color legend ranging from 0.4 to 1.04, with the description in the text "shows the similarity of these functions to their respective optima (in particular the fraction of labels predicted on which they differ divided by their error rate)". Maybe these are standard plots that can be found in other literature too... I have not yet had time to go through the references. If somebody is familiar with this topic, it would be great if she/he could give a short explanation or reference.
@paulcurry8383 3 years ago
Is there any known pattern in the distances between these loss-landscape minima? I'm curious whether this could provide a way to adjust the learning rate to jump to minima quicker.
@AbhishekSinghSambyal 2 years ago
Which app do you use to annotate papers? Thanks!
@tylertheeverlasting 4 years ago
One thing is that adding more and more models to a deep ensemble usually yields diminishing or no returns after a certain point, indicating that adding a new model after the 20th one might not contribute any new disagreement in predictions. Any thoughts on why that is? Maybe it's limited by the architecture + training method somehow.
@YannicKilcher 3 years ago
I think they will still have disagreements, but these are already covered by the other networks in the ensemble.
@tylertheeverlasting 3 years ago
@YannicKilcher I'm not sure I agree with this completely; there may be a large number of minima in weight space due to symmetries, but I'm not sure there are a large number of disagreements in the output space given a particular test set. The disagreement in output space is what gives better accuracy on the test set as more models are added.
@Gogargoat 4 years ago
Are the predictions of the ensemble models simply averaged? I would try to weight them in a way that maximizes the signal-to-noise ratio (or more likely the logarithm of it). I'm guessing it won't make a big difference with large ensembles, but it might help if you can only afford to train an ensemble of a few different initializations. Also, I'm slightly damp.
@eelcohoogendoorn8044 4 years ago
Trimmed mean is where it's at; pretty much the best of both worlds of a median and a mean. Yes, it looks simple and hacky, but just like training 10 models from scratch: show me something that works better :).
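For reference, the aggregation rules discussed in this thread differ by only a line each. A sketch with random stand-in softmax outputs (the ensemble size and trim fraction are arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical softmax outputs from an ensemble: shape [n_models, n_points, n_classes].
probs = rng.dirichlet(alpha=np.ones(10), size=(8, 1000))

mean_pred    = probs.mean(axis=0)
median_pred  = np.median(probs, axis=0)
trimmed_pred = stats.trim_mean(probs, proportiontocut=0.125, axis=0)  # drops highest and lowest member

labels_mean, labels_trimmed = mean_pred.argmax(-1), trimmed_pred.argmax(-1)
print("fraction of points where trimming changes the decision:",
      np.mean(labels_mean != labels_trimmed))
```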
@RobotProctor 3 years ago
Is it impossible to design networks with a globally convex loss space? The paper seems to say these networks have multiple local minima, but does this necessarily have to be so? I'm curious whether a small enough network on a small enough problem can have a globally convex loss landscape; then we could build larger networks out of these smaller networks, with some sort of Hebbian process controlling the connections between them.
@YannicKilcher 3 years ago
As soon as you add nonlinearities, the non-convexity appears, unfortunately.
@antonpershin998 1 year ago
Do we have a combination of this and the "Grokking: Generalization Beyond Overfitting" paper?
@prakharmishra8194 4 years ago
Isn't it costly to train the same network multiple times if the dataset is large? I read a paper titled "Snapshot Ensembles: Train 1, Get M for Free", where the authors propose training a network just once with an LR schedule and checkpointing the minima, later treating the checkpointed models as ensemble candidates. I see both of them talking along the same lines. Is there any comparison with the results from that paper as well?
@YannicKilcher 3 years ago
I don't think there is an explicit comparison, but I'd expect the snapshot ensembles to generally fall into the same mode.
@luke2642 4 years ago
If each network covers a subset of the data, could a preprocessing network classify a sample and choose which sub-network will do the best job of "actually" classifying it? I'm sure this isn't novel; what's it called?
@snippletrap 4 years ago
A multiplexer network? Probably better to learn a function that takes a weighted average of each sub-network's output, since that uses all of them together rather than choosing just one. Or better yet, learn an ensemble of such functions! Ha.
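What this reply describes is essentially a learned gating function over sub-network outputs, i.e. a mixture of experts (named explicitly further down the thread). A minimal PyTorch sketch (all sizes and the expert count are arbitrary):

```python
import torch
import torch.nn as nn

class GatedEnsemble(nn.Module):
    """Weights each expert's output by a learned, input-dependent gate."""
    def __init__(self, in_dim=32, n_classes=10, n_experts=4):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, n_classes))
             for _ in range(n_experts)])
        self.gate = nn.Linear(in_dim, n_experts)  # one weight per expert, per input

    def forward(self, x):
        weights = torch.softmax(self.gate(x), dim=-1)               # [batch, n_experts]
        outputs = torch.stack([e(x) for e in self.experts], dim=1)  # [batch, n_experts, n_classes]
        return (weights.unsqueeze(-1) * outputs).sum(dim=1)         # weighted average of the outputs

print(GatedEnsemble()(torch.randn(8, 32)).shape)  # torch.Size([8, 10])
```

Picking only the top-scoring expert instead of averaging recovers the hard "multiplexer" variant from the original question.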
@luke2642 4 years ago
@snippletrap Thanks! Looking at the MUXConv paper, that's spatial multiplexing, not quite what I was wondering about. I'll keep searching for multiplexing though. However, rather than putting all samples through all subnetworks, I'm imagining e.g. 10 small networks, each trained on 1/10 of the dataset. The partitioning of which samples end up in which 1/10th would also be learned, and then a controller network would also be learned! Strategies against overfitting would be needed.
@luke2642 4 years ago
"Run-time Deep Model Multiplexing" is what I'm reading now!
@luke2642 3 years ago
"PathNet: Evolution Channels Gradient Descent in Super Neural Networks" (arxiv.org/abs/1701.08734) seems relevant too!
@JakWho92 3 years ago
You should also search for “Mixture of experts”. It’s a general principle which has been applied to neural networks in the way you’re proposing.
@tristantrim2648 3 years ago
I didn't catch it: is there a specific name for multi-mode, as opposed to single-mode, ensembles?
@SimonJackson13 3 years ago
N part disjunction efficiency neuron clustering?
@TheGodSaw 2 years ago
Hey, great content as usual; however, I disagree with your interpretation. You are saying that since the parameter vectors of the models are so different, the models must perform differently on different examples, and as far as I understood, you take that as evidence that our intuition about easy and hard examples is incorrect. However, I don't think these two things are related. I think the big discovery that the models' parameters are so different is just a fact of the strangeness of high dimensions: if you pick any 2 random vectors in a high-dimensional space, they will be almost orthogonal to each other, so any difference in initial conditions will lead to quite different outcomes.
@jasdeepsinghgrover2470 4 years ago
Like no two humans are the same, no two randomly initialised classifiers are the same. (Generally)
@G12GilbertProduction 4 years ago
Bayesian... I hate this marchening pseudo-duonomial filtering networks using only the VGE small-memory encoder. Would you call it TensorFlow libraries it's too low for these 2D refractions?
@herp_derpingson 4 years ago
You sound like a GPT
@not_a_human_being 3 years ago
Won't it hugely depend on the data?
@YannicKilcher 3 years ago
Sure, I guess it always does.
@Kerrosene 4 years ago
25:20 😂😂
@rishikaushik8307 4 years ago
Talk about machines taking our jobs XD
@not_a_human_being 3 years ago
Reminds me of Kazanova and his stacking: analyticsweek.com/content/stacking-made-easy-an-introduction-to-stacknet-by-competitions-grandmaster-marios-michailidis-kazanova/
@grantsmith3653 1 year ago
Here's a talk by the author: kzfaq.info/get/bejne/qdqErcppzrvFiIE.html. Highly recommended, along with Yannic's explanation here.