Stanford Seminar - Information Theory of Deep Learning, Naftali Tishby

82,957 views

Stanford Online

EE380: Computer Systems Colloquium Seminar
Information Theory of Deep Learning
Speaker: Naftali Tishby, Computer Science, Hebrew University
I will present a novel, comprehensive theory of large-scale learning with Deep Neural Networks, based on the correspondence between Deep Learning and the Information Bottleneck framework. The new theory has the following components:
1. Rethinking learning theory: I will prove a new generalization bound, the input-compression bound, which shows that compression of the representation of the input variable is far more important for good generalization than the dimension of the network's hypothesis class, an ill-defined notion for deep learning.
2. I will prove that for large-scale Deep Neural Networks, the mutual information between the last hidden layer and the input and output variables provides a complete characterization of the sample complexity and accuracy of the network. This makes the Information Bottleneck bound for the problem the optimal trade-off between sample complexity and accuracy achievable with ANY learning algorithm.
3. I will show how Stochastic Gradient Descent, as used in Deep Learning, achieves this optimal bound. In that sense, Deep Learning is a method for solving the Information Bottleneck problem for large-scale supervised learning problems. The theory provides a new computational understanding of the benefit of the hidden layers, and gives concrete predictions for the structure of the layers of Deep Neural Networks and their design principles. These turn out to depend solely on the joint distribution of the input and output and on the sample size.
Based partly on works with Ravid Shwartz-Ziv and Noga Zaslavsky.
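To make the abstract's claims more concrete: the Information Bottleneck objective referenced in points 2 and 3 is the trade-off min over p(t|x) of I(X;T) - β·I(T;Y), and the talk's information-plane plots track the pair (I(X;T), I(T;Y)) for each layer over the course of training. The sketch below is a minimal, illustrative estimate of those two coordinates obtained by discretizing a layer's activations, in the spirit of the experiments in "Opening the Black Box of Deep Neural Networks via Information"; it is not the speaker's code, and the bin count, the hashing of binned rows into symbols, and all variable names are assumptions made for the example.

```python
# Illustrative sketch (assumptions noted above): estimate the information-plane
# coordinates I(X;T) and I(T;Y) for one hidden layer by binning its activations.
import numpy as np

def discretize(activations, n_bins=30):
    """Bin real-valued activations, then collapse each sample's binned row into one symbol."""
    edges = np.linspace(activations.min(), activations.max(), n_bins + 1)
    binned = np.digitize(activations, edges[1:-1])        # shape: (n_samples, n_units)
    return np.array([hash(row.tobytes()) for row in binned])

def entropy(symbols):
    """Plug-in (empirical) entropy in bits of a sample of discrete symbols."""
    _, counts = np.unique(symbols, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def mutual_information(a, b):
    """Empirical I(A;B) = H(A) + H(B) - H(A,B) for paired discrete samples."""
    joint = np.array([hash((int(x), int(y))) for x, y in zip(a, b)])
    return entropy(a) + entropy(b) - entropy(joint)

# Toy usage: x_ids indexes the distinct inputs, y holds the labels, and layer_act
# stands in for the recorded activations of one hidden layer on the same samples.
rng = np.random.default_rng(0)
n_samples, n_units = 512, 10
x_ids = np.arange(n_samples)                    # each input is its own symbol
y = rng.integers(0, 2, size=n_samples)          # binary labels
layer_act = rng.normal(size=(n_samples, n_units))

t = discretize(layer_act)
print("I(X;T) ~", mutual_information(x_ids, t), "bits")
print("I(T;Y) ~", mutual_information(t, y), "bits")
```

The sensitivity of such binned estimates to the choice of discretization is exactly the kind of issue raised in the critiques mentioned during the talk (around 29:00 and 55:00).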
About the Speaker:
Dr. Naftali Tishby is a professor of Computer Science and the incumbent of the Ruth and Stan Flinkman Chair for Brain Research at the Edmond and Lily Safra Center for Brain Science (ELSC) at the Hebrew University of Jerusalem. He is one of the leaders of machine learning research and computational neuroscience in Israel, and his numerous former students serve in key academic and industrial research positions all over the world.
Prof. Tishby was the founding chair of the new computer-engineering program and a director of the Leibniz Research Center in Computer Science at the Hebrew University.
Tishby received his PhD in theoretical physics from the Hebrew University in 1985 and was a research staff member at MIT and Bell Labs from 1985 to 1991. Prof. Tishby was also a visiting professor at Princeton NECI, the University of Pennsylvania, UCSB, and IBM Research.
His current research is at the interface between computer science, statistical physics, and computational neuroscience. He pioneered various applications of statistical physics and information theory in computational learning theory. More recently, he has been working on the foundations of biological information processing and the connections between dynamics and information. With his colleagues, he has introduced new theoretical frameworks for optimal adaptation and efficient information representation in biology, such as the Information Bottleneck method and the Minimum Information principle for neural coding.
For more information about this seminar and its speaker, you can visit ee380.stanford.edu/Abstracts/...
Support for the Stanford Colloquium on Computer Systems Seminar Series provided by the Stanford Computer Forum.
The Colloquium on Computer Systems Seminar Series (EE380) presents current research in the design, implementation, analysis, and use of computer systems. Topics range from integrated circuits to operating systems and programming languages. It is free and open to the public, with new lectures each week.
Learn more: bit.ly/WinYX5
#deeplearning

Comments: 29
@krasserkalle 5 years ago
This is my personal summary:
00:00:00 History of Deep Learning
00:07:30 "Ingredients" of the Talk
00:12:30 DNN and Information Theory
00:19:00 Information Plane Theorem
00:23:00 First Information Plane Visualization
00:29:00 Mention of Critics of the Method
00:32:00 Rethinking Learning Theory
00:37:00 "Instead of Quantizing the Hypothesis Class, let's Quantize the Input!"
00:43:00 The Information Bottleneck
00:47:30 Second Information Plane Visualization
00:50:00 Graphs for Mean and Variance of the Gradient
00:55:00 Second Mention of Critics of the Method
01:00:00 The Benefit of Hidden Layers
01:05:00 Separation of Labels by Layers (Visualization)
01:09:00 Summary of the Talk
01:12:30 Question about Optimization and Mutual Information
01:16:30 Question about Information Plane Theorem
01:19:30 Question about Number of Hidden Layers
01:22:00 Question about Mini-Batches
@clusteralgebra 5 years ago
Thank you!
@zhechengxu121 4 years ago
Bless your soul
@willjennings7191 4 years ago
I have used your personal summary as a template for a section of my personal notes. Thank you very much!
@paritoshkulkarni6354 2 years ago
RIP Naftali!
@nickybutton2736 3 years ago
Amazing talk, thank you!
@applecom1de509 5 years ago
Aah this is so relaxing.. Thank you!
@FlyingOctopus0 6 years ago
I wonder if, based on this, we can create better training algorithms. For example, the effectiveness of dropout may have a connection to this theory: dropout may introduce more randomness in the "diffusion" stage of training.
@phaZZi6461 4 years ago
1:22:31 - thesis statement about how to choose the mini-batch size
@paulcurry8383 2 years ago
Anybody know what a “pattern” is in information theory?
@jaimeziratearzate 7 months ago
Does anybody know how to show the part where the Gibbs distribution converges to the optimal IB bound? And what is the epsilon cover of a hypothesis class?
@alexkai3727 4 years ago
I read another paper, "On the Information Bottleneck Theory of Deep Learning", by Harvard researchers, published in 2018, and they hold a very different view. It seems it's still unclear how neural networks work.
@Checkedbox 3 years ago
Is that the one he mentions at ~29:00?
@zessazzenessa1345 6 years ago
"Learn to ignore irrelevant labels"... yes, intriguing.
@julianbuchel1087 5 years ago
When was this talk given? Has he published his paper yet? I found nothing online so far, but maybe I just didn't see it.
@Chr0nalis 5 years ago
1) "Deep Learning and the Information Bottleneck", 2) "Opening the Black Box of Deep Neural Networks via Information"
@alexanderkurz2409 5 months ago
11:30 "information measures are invariant to computational complexity"
@amirmn7 5 years ago
Can he use deep learning to fix the audio problems of this video?
@DheerajAeshdj 3 years ago
Probably not, because there are none.
@AZTECMAN 2 years ago
Seems like this was asked in jest, but it's actually a good question.
@dexterdev 2 years ago
23:04
@minhtoannguyen1862 2 years ago
44:25
@AlexCohnAtNetvision 2 years ago
such a loss… blessed be his memory
@hanchisun6164 1 year ago
This theory looks correct! When neural networks became popular, everybody in the scientific computing community eagerly wanted to describe them in their own language. Many achieved only limited success. I think the information-theoretic one makes the most sense, because it finds the simplicity of the information within the complexity of the data. It is like how humans think: we create abstract symbols that capture the essence of nature and conduct logical reasoning, which means the number of degrees of freedom behind the world should be small, since it is structured. Why did the ML community and industry not adopt this explanation?
@absolute___zero 4 years ago
Oooo! So it is SGD? If I hadn't listened to the Q&A session I wouldn't have understood it at all. Now I do. Well, with second-order algorithms (like Levenberg-Marquardt) you won't need all these floating balls to understand what's going on with your neurons. Gradient descent is the poor man's gold.
@binyuwang6563 6 years ago
If the theories are true, maybe we can compute the weights directly without iteratively learning them via gradient descent.
@zessazzenessa1345 6 years ago
Binyu Wang oh
@prem4708 5 years ago
How so?
@Daniel-ih4zh 1 year ago
I've been thinking about this a lot too. The weights are partly a function of the data, of course, and we also have things like the good regulator theorem that kind of points towards it. Also, a latent code and the learned parameters aren't distinguished in Bayesian model selection.