Comments
@Faustordz
@Faustordz 3 hours ago
Amazing value coming from Sam!
@ferencszalma7094
@ferencszalma7094 7 hours ago
0:20 Implicit regularization effect of the noise, cont'd
15:35 𝗜𝗺𝗽𝗹𝗶𝗰𝗶𝘁 𝗯𝗶𝗮𝘀 in a linear 𝗰𝗹𝗮𝘀𝘀𝗶𝗳𝗶𝗰𝗮𝘁𝗶𝗼𝗻 model with 𝗴𝗿𝗮𝗱𝗶𝗲𝗻𝘁 𝗳𝗹𝗼𝘄 (see also Lect 16 at 1:17:25, kzfaq.info/get/bejne/o86RoJeZrdbTZ3U.htmlsi=kTBsxPQ_W15lIB_R&t=4645); Theorem
18:20 Main intuition: in GD, the norm of the weights ‖w‖ and the margin γ both grow
33:30 We show that ẇ(t) is correlated with w​⃰, the correlation depends on γ̅ and the loss, and ẇ(t) is not too small compared to the loss
38:55 End of implicit regularization
44:55 𝗨𝗻𝘀𝘂𝗽𝗲𝗿𝘃𝗶𝘀𝗲𝗱 𝗹𝗲𝗮𝗿𝗻𝗶𝗻𝗴
 Classical theoretical approach (moment methods); modern methods with DL: self-training, contrastive learning, semi-supervised learning, unsupervised domain adaptation
 Setup - 𝗟𝗮𝘁𝗲𝗻𝘁 𝘃𝗮𝗿𝗶𝗮𝗯𝗹𝗲 𝗺𝗼𝗱𝗲𝗹: a distribution pᶿ parametrized by θ; given x⁽¹⁾,...,x⁽ⁿ⁾ ~ pᶿ, the goal is to recover pᶿ from the data
 Example: 𝗠𝗶𝘅𝘁𝘂𝗿𝗲 𝗼𝗳 (𝗸) 𝗚𝗮𝘂𝘀𝘀𝗶𝗮𝗻𝘀: θ=((μ₁,...,μₖ), (p₁,...,pₖ)), μᵢ∈ℝᵈ, p=(p₁,...,pₖ) a probability vector
 The μᵢ are the means of the components; the pᵢ are the probabilities of each Gaussian component, pᵢ∈ℝ, 0≤pᵢ≤1, ‖p‖₁=1
 Sample x ~ pᶿ by i ~ categorical(p) <- 𝗟𝗮𝘁𝗲𝗻𝘁 𝗩𝗮𝗿𝗶𝗮𝗯𝗹𝗲, then x ~ 𝒩(μᵢ,𝕀)
 Mixtures of Gaussians underlie HMM (Hidden Markov Models) and ICA (Independent Component Analysis)
50:00 Approach: 𝗠𝗼𝗺𝗲𝗻𝘁 𝗺𝗲𝘁𝗵𝗼𝗱: ➀ estimate moments using empirical samples, ➁ recover the parameters from the moments of X
 𝗠𝗶𝘅𝘁𝘂𝗿𝗲 𝗼𝗳 𝘁𝘄𝗼 𝗚𝗮𝘂𝘀𝘀𝗶𝗮𝗻𝘀: first moment M₁=𝔼[X]=0; second moment M₂=𝔼[XXᵀ]=μμᵀ+𝕀
 ➀ M̂₂ = 1/n 𝚺ⁿᵢ₌₁ x⁽ⁱ⁾x⁽ⁱ⁾ᵀ, ➁ recover μ from M̂₂ (infinite data in ➀)
 Another way to get μ, the 𝗦𝗽𝗲𝗰𝘁𝗿𝗮𝗹 𝗺𝗲𝘁𝗵𝗼𝗱: the top eigenvector of M₂ is μ̅≝μ/‖μ‖₂ and the top eigenvalue is ‖μ‖²₂+1; this algorithm needs to be robust to errors
1:05:20 𝗠𝗶𝘅𝘁𝘂𝗿𝗲 𝗼𝗳 𝘁𝗵𝗿𝗲𝗲 𝗚𝗮𝘂𝘀𝘀𝗶𝗮𝗻𝘀
1:09:45 𝗠𝗶𝘅𝘁𝘂𝗿𝗲 𝗼𝗳 (𝗸) 𝗚𝗮𝘂𝘀𝘀𝗶𝗮𝗻𝘀
1:14:25 Missing rotation information -> need M₃ to recover the μᵢ
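As a concrete illustration of the 50:00 moment method, here is a minimal numpy sketch (my own illustration with made-up constants, not code from the lecture); it assumes the symmetric mixture 0.5·𝒩(+μ,𝕀)+0.5·𝒩(−μ,𝕀), estimates M₂ from samples, and reads μ off the top eigenpair:
```python
import numpy as np

# Minimal sketch (my own illustration, made-up constants) of the moment/spectral method
# for the symmetric mixture 0.5*N(+mu, I) + 0.5*N(-mu, I), where M2 = E[x x^T] = mu mu^T + I.
rng = np.random.default_rng(0)
d, n = 10, 50_000
mu_true = np.zeros(d)
mu_true[0] = 3.0

signs = rng.choice([-1.0, 1.0], size=n)                  # the latent variable i
x = signs[:, None] * mu_true + rng.standard_normal((n, d))

# Step 1: estimate the second moment from empirical samples.
M2_hat = x.T @ x / n

# Step 2: recover mu (up to sign) from the top eigenpair:
# top eigenvector of M2 is mu/||mu||_2, top eigenvalue is ||mu||_2^2 + 1.
eigvals, eigvecs = np.linalg.eigh(M2_hat)                # eigenvalues in ascending order
mu_hat = np.sqrt(max(eigvals[-1] - 1.0, 0.0)) * eigvecs[:, -1]

print(np.linalg.norm(np.abs(mu_hat) - np.abs(mu_true)))  # small for large n
```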
@globalprosperity123
@globalprosperity123 11 hours ago
It is really a stain on a great university like Stanford to accept funding from for-profit defense contractors. Universities are places to promote peace.
@Aditya-ri7em
@Aditya-ri7em 12 hours ago
He came and started teaching like a teacher.
@zhuolin730
@zhuolin730 22 hours ago
Do we have the recording of Lecture 17, Discrete Latent Variable Models?
@CPTSMONSTER
@CPTSMONSTER 1 day ago
0:40 Close connection between score based models and DDPMs (denoising diffusion probabilistic models)
1:00 A score based model goes from noise to data by running Langevin dynamics chains. VAE perspective (DDPM): a fixed encoder (SDE) adds noise (Gaussian transition kernel) at each time step; the decoder is a joint distribution (parameterized in the reverse direction) over the same RVs; the sequence of decoders is also Gaussian and parameterized by neural networks (simple DDPM formula); train in the usual way (as in a VAE) by optimizing the evidence lower bound (minimize KL divergence); the ELBO is equivalent to a sum of denoising score matching objectives (learning the optimal decoder requires estimating the score of the noise-perturbed data density); learning the ELBO corresponds to learning a sequence of denoisers (noise conditional score based models)
4:55 The means of the decoders at optimality correspond to the score functions; the updates performed in DDPM are very similar to annealed Langevin dynamics
5:45 Diffusion version, infinite noise levels
9:15 The SDE describes how the RVs in the continuous diffusion model (the fine-discretization limit of the VAE) are related; it enables sampling
14:25 Reversing the SDE is a change of variables
19:30 Interpolation between two data sets requires gradients wrt the t's
19:45? Fokker-Planck equation; the gradient wrt t is completely determined by these objects
20:25 Discretizing the SDE is equivalent to Langevin dynamics or the sampling procedure of DDPM (follow the gradient and add noise at every step)
21:40 Get a generative model by learning the score functions (of the reverse SDE); the score functions are parameterized by neural networks (theta)
21:55? Same as DDPM, 1000 steps
23:15? Equivalence of Langevin, DDPM and diffusion generative modelling
24:40? DDPM/SDE numerical predictor is a Taylor expansion
25:40? The score based MCMC corrector uses Langevin dynamics to generate a sample at the corresponding density
27:15? A score based model uses the corrector without the predictor; DDPM uses the predictor without the corrector
27:50 The decoder is trying to invert the encoder and is defined as Gaussian (only the continuous-time limit after infinitely many steps yields a tight ELBO assuming Gaussian decoders)
29:05? The predictor takes one step, the corrector uses Langevin to generate a sample
34:50 Neural ODE
35:55 Reparametrizing the randomness into the initial condition and then transforming it deterministically (equivalent computation graph); variational inference backprop through the encoder is a stochastic computation
38:55 ODE formula (integral) to compute the probability density; the conversion to an ODE gives access to solver techniques to generate samples fast
40:10 DDPM as a VAE with a fixed encoder and the same dimension; latent diffusion first learns a VAE to map data to a lower dimensional space and then learns a diffusion model over that latent space
44:50? Compounding errors in the denoiser but not in the SDE
46:30 Maximum likelihood would differentiate through the ODE solver, very difficult and expensive
49:35 Scores and marginals are equivalent (SDE and ODE models) and always learned by score matching; at inference time samples are generated differently
58:40 Stable Diffusion uses a pretrained autoencoder, not trained end to end; only care about reconstruction (disregarding how close the latent distribution is to Gaussian) and getting a good autoencoder; keep the initial autoencoder fixed and train the diffusion model over the latent space
1:08:35 Score of the prior (unconditional score), likelihood (forward model/classifier) and normalization constant
1:09:55 Solve the SDE or ODE and follow the gradient of the prior plus the likelihood (controlled sampling); Langevin increases the likelihood of the image wrt the prior and makes sure the classifier predicts that image (changing the drift to push the samples towards specific classifications)
1:12:35 Classifier-free guidance: train two diffusion models on conditional and unconditional scores and take the difference
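To make the predictor-corrector discussion around 24:40-29:05 concrete, here is a rough numpy sketch of my own; it assumes a variance-exploding SDE convention, `score` stands in for a trained noise-conditional score model, and the schedule and snr constant are made up:
```python
import numpy as np

# Rough sketch (my own illustration) of predictor-corrector sampling for a
# variance-exploding SDE: the predictor takes one reverse-diffusion step, the
# corrector runs a few Langevin MCMC steps at the new noise level.
def predictor_corrector_sample(score, sigmas, shape, n_corrector=1, snr=0.16, rng=None):
    rng = rng or np.random.default_rng()
    x = sigmas[0] * rng.standard_normal(shape)               # start from (almost) pure noise
    for i in range(len(sigmas) - 1):
        s_cur, s_next = sigmas[i], sigmas[i + 1]
        g2 = s_cur**2 - s_next**2                            # per-step "diffusion" variance
        # Predictor: Euler-Maruyama-style step of the reverse SDE.
        x = x + g2 * score(x, s_cur) + np.sqrt(g2) * rng.standard_normal(shape)
        # Corrector: Langevin steps targeting the density at noise level s_next.
        for _ in range(n_corrector):
            eps = 2.0 * (snr * s_next) ** 2                  # heuristic step size
            x = x + eps * score(x, s_next) + np.sqrt(2.0 * eps) * rng.standard_normal(shape)
    return x

# Toy check: for data ~ N(0, I), the sigma-perturbed density is N(0, (1 + sigma^2) I),
# whose exact score is -x / (1 + sigma^2); samples should come out roughly standard normal.
score = lambda x, sigma: -x / (1.0 + sigma**2)
sigmas = np.geomspace(10.0, 0.01, 100)
samples = predictor_corrector_sample(score, sigmas, shape=(2000, 2), rng=np.random.default_rng(0))
print(samples.mean(axis=0), samples.std(axis=0))             # roughly 0 and 1
```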
@4th_wall511
@4th_wall511 1 day ago
43:24 To the student's point, isn't it the case that P(E|T) = P(T|E) iff P(E) = P(T) and P(E and T) ≠ 0?
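(A quick check of that claim, my own working rather than anything from the lecture: P(E|T) = P(E∩T)/P(T) and P(T|E) = P(E∩T)/P(E). If P(E∩T) ≠ 0, the two are equal exactly when P(E) = P(T); if P(E∩T) = 0, both sides are 0 no matter what P(E) and P(T) are. So the precise statement is: P(E|T) = P(T|E) iff P(E) = P(T) or P(E∩T) = 0.)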
@profnikki
@profnikki 1 day ago
Thank you! This was a great conversation to listen to - part of the fun was listening to two people with a long history. Very much appreciated. :)
@raul36
@raul36 1 day ago
If AI is aligned then it is not AGI.
@mixshare
@mixshare 1 day ago
Great Vid
@zyxbody
@zyxbody 1 day ago
I don't understand anything, but I like how these people teach. May everyone get to understand the concepts; that's my only prayer.
@ronus007
@ronus007 1 day ago
Fantastic talk. Asked an LLM to get some highlight ideas from the transcription:

Historical Context and Evolution:
1. Prehistoric Era:
• Early models like RNNs and LSTMs were good at encoding history but struggled with long sequences and context.
• Example: Predicting “French” in “I grew up in France. I speak fluent ___” is difficult for these models.
2. 2017: Attention Is All You Need:
• The landmark paper by Vaswani et al. introduced the transformer architecture.
• Focused on the self-attention mechanism, which allows models to process data more effectively.
3. 2018-2020: Expansion Beyond NLP:
• Transformers began being used in various fields beyond NLP, such as computer vision, biology, and robotics.
• Google’s quote on improved performance with transformers highlights their impact.
4. 2021-2022: Generative Era:
• Introduction of generative models like GPT, DALL-E, and Stable Diffusion.
• Increased capabilities in AI, with models scaling up significantly and being applied to more complex tasks.

Technical Deep Dive:
1. Self-Attention Mechanism:
• The self-attention mechanism allows models to weigh the importance of different parts of the input data.
• It computes the relevance of each word to every other word in a sentence, enabling better context understanding.
2. Multi-Headed Attention:
• Multi-headed attention involves running the attention mechanism in parallel multiple times, each with different weights.
• This allows the model to focus on different aspects of the data simultaneously.
3. Transformer Architecture:
• Consists of encoder and decoder layers that process input and output sequences.
• Each layer has self-attention and feed-forward neural network components.
4. Implementation Details:
• Use of embedding layers to convert input data into vectors.
• Positional encoding to maintain the order of the sequence.
• Application of residual connections and layer normalization to stabilize training.

Applications and Future Directions:
1. Current Applications:
• Computer Vision: Using Vision Transformers (ViTs) to process images by breaking them into patches.
• Speech Recognition: Models like Whisper use transformers to process mel-spectrograms of audio.
• Reinforcement Learning: Decision transformers model sequences of states, actions, and rewards.
• Biology: AlphaFold uses transformers to predict protein structures.
2. Future Potential:
• Video Understanding and Generation: Anticipated advances in models capable of processing and generating video content.
• Long Sequence Modeling: Improving transformers to handle longer sequences more efficiently.
• Domain-Specific Models: Development of specialized models like DoctorGPT or LawyerGPT trained on specific datasets.
• Generalized Agents: Creating models like Gato that can perform multiple tasks and handle various inputs.
3. Challenges and Innovations:
• External Memory: Enhancing models with long-term memory capabilities.
• Computational Complexity: Reducing the quadratic complexity of the attention mechanism.
• Controllability: Improving the ability to control and predict model outputs.
• Alignment with Human Brain: Researching how to align transformer models with human cognitive processes.
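To make the self-attention bullet concrete, here is a minimal numpy sketch (my own illustration, not code from the talk) of single-head scaled dot-product attention; the token count, model dimension and weights are made up:
```python
import numpy as np

# Minimal sketch (my own illustration, not code from the talk) of single-head scaled
# dot-product self-attention; the token count, model dimension and weights are made up.
def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                 # project tokens to queries/keys/values
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # relevance of every token to every other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ V                               # context-weighted mix of the values

rng = np.random.default_rng(0)
T, d = 5, 8                                          # 5 tokens, model dimension 8
X = rng.standard_normal((T, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)           # (5, 8)
```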
@robertwilsoniii2048
@robertwilsoniii2048 1 day ago
Something that always bothered me is that adding in random terms increases predictive power, holding sample size constant (scaling compute without increasing data size). The problem is that it decreases explanatory power and the ability to understand the individual contributions of each variable. It's like pop astrology, star signs -- Libra, Gemini, Leo, etc. -- adding extra variables scales compute and improves predictability, but does it add anything to clarity? I suppose that for making predictions clarity doesn't matter. That always annoyed me.
@lh7564
@lh7564 1 day ago
Asians look like kids until they hit 60.
@connorshowell6763
@connorshowell6763 1 day ago
The Excel comment is so true.
@inforoundup9826
@inforoundup9826 1 day ago
great talk
@ferencszalma7094
@ferencszalma7094 2 days ago
2:15 Setup: generic loss function g(θ), SGD (Stochastic Gradient Descent) with mean-zero noise
4:35 Warm-up
10:25 Solving the recurrence
16:05 t→∞
17:25 Interpretation
21:35 Ornstein-Uhlenbeck process
23:10 Multi-dimensional quadratic
31:00 Noise level in the i-th eigenvector dimension, iterate fluctuation
37:40 Covariance of the mini-batch gradient at the global min θ​⃰
41:00 Non-quadratic loss functions
49:40 Heuristic derivation: drop the third order term; Bach 2017, "Bridging the gap between constant step size SGD and Markov chains"; Glasgow, Yuan, Ma 2021
1:02:00 More complex case with stronger implicit regularization: both H and Σ are not full rank (partly coming from overparametrization)
1:08:30 Visualization of SGD in a valley in (K, K⊥)
1:14:55 xₜ, uₜ, rₜ
1:17:30 In the subspace K=span(H), contractions; in the K⊥ subspace
1:23:20 Informally, the bias direction: SGD with label noise converges to a stationary point of the regularized loss
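To make the warm-up concrete (quadratic loss plus mean-zero gradient noise, the Ornstein-Uhlenbeck picture at 21:35), here is a small numpy simulation of my own with made-up constants; it compares the empirical stationary variance of the SGD iterates with what the linear recurrence predicts:
```python
import numpy as np

# Small simulation (my own illustration, made-up constants) of SGD on the 1-D quadratic
# g(theta) = h * theta^2 / 2 with mean-zero gradient noise of variance sigma^2:
#   theta_{t+1} = theta_t - eta * (h * theta_t + noise_t).
# The iterates approach an Ornstein-Uhlenbeck-like stationary distribution whose variance
# for this linear recurrence is eta * sigma^2 / (2h - eta * h^2), roughly eta * sigma^2 / (2h).
rng = np.random.default_rng(0)
h, sigma, eta, T = 2.0, 1.0, 0.01, 200_000

theta, samples = 1.0, []
for t in range(T):
    noise = sigma * rng.standard_normal()
    theta -= eta * (h * theta + noise)
    if t > T // 2:                                # discard burn-in
        samples.append(theta)

print(np.var(samples))                            # empirical stationary variance
print(eta * sigma**2 / (2 * h - eta * h**2))      # prediction from the recurrence
```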
@Faustordz
@Faustordz 2 days ago
Very intriguing!
@CPTSMONSTER
@CPTSMONSTER 2 days ago
4:40 Estimating the noise (denoising) is equivalent to estimating the score of the noise-perturbed data distribution (sigma model). Knowing how to denoise is knowing which direction to perturb the image to increase the likelihood most rapidly.
5:50? Taylor approximation of the likelihood around each data point
12:45 Set of Gaussian conditional densities, encoder interpretation of the Markovian joint distribution
15:20 A typical VAE maps a data point through some neural network that gives the mean and standard-deviation parameters of the distribution over the latents. In diffusion, just add noise to the data; nothing is learned.
16:15 T times the dimension of the original data point; the mapping is not invertible
17:00 Closed form of the joint distribution (Gaussian)
18:20 Same way as generating training data for the denoising score matching procedure (as in perturbing the data samples for each sigma_i)
22:20? The process is invertible if the score is known; variationally learn an operator (ELBO) as a decoder instead of an invertible mapping (which doesn't need the score, but if Gaussian then the score is needed anyway)
23:05 The initial data is not required to be a mixture of Gaussians, but the model has to be continuous and the transition kernel has to be Gaussian; in latent diffusion models discrete data can be embedded in a continuous space
26:00 The exact denoising distribution (reverse kernel) is unknown, variational approximation
28:15 Similar to the VAE decoder, the reverse process is defined variationally through conditionals parameterized by neural networks
30:05 Alpha parameter: define a diffusion process such that when it runs for a sufficiently long time it reaches a steady state of pure noise
31:00 Langevin equivalence, variational training gives mu, which is the score
32:15? Langevin corrections on top of the vanilla reverse Gaussian kernel
32:30 The transition is Gaussian, therefore stochastic (same as the VAE decoder); the neural network parameters are deterministic
33:50 Flavor of VAE: a sequence of latent variables indexed by time; the encoder does not learn features; interpreted as a VAE with a fixed encoder that adds noise
35:00? VAE ELBO (reference formula in the previous lecture), the second term encourages high entropy
37:35? Hierarchical VAE evidence lower bound formula
38:40 In the usual VAE q is learnable, in diffusion q is fixed
40:00? The ELBO loss is equivalent to the denoising score matching loss; minimizing the negative ELBO or maximizing the lower bound on the average log-likelihood is exactly the same as estimating the scores of the noise-perturbed data distributions
41:40 In a score based model sample with Langevin, in a diffusion model sample from the decoder
41:50? Denoising diffusion probabilistic model training procedure, equivalence with the denoising score matching loss
43:20? The encoder is fixed; the decoder minimizes the KL divergence, or maximizes the ELBO, which is the same as inverting the generative process (it turns out this needs estimation of the scores)
44:10 Training both encoder and decoder results in better ELBOs but worse sample quality; equivalence to one step of Langevin; the score based model perspective as the limit of an infinite number of noise levels (tricky to get from the VAE perspective)
45:35 Optimizing likelihood does not correlate with sample quality
45:45 Even if the encoder is fixed and something simple like adding Gaussian noise, the inverse is nontrivial
47:15 Expensive computation graph, but trained incrementally layer by layer (locally) without having to look at the whole process
48:00 Efficient training process due to the structure of q, forward jumps
48:45 Solving the loss of the vanilla VAE yields the same loss function as in the diffusion model; the fixed-encoder loss function is the same as the denoising score matching loss
49:15 Equivalence to a VAE to sample without Langevin
50:45 A fixed encoder for a one-step VAE would be very complicated to invert; the diffusion model forms 1000 subproblems
51:30 The argument of epsilon_theta is the perturbed data point (a sample from q(x_t given x_0)); the architecture is the same as the noise conditional score model
51:50 Not learning decoders for every t, amortize the epsilon_theta network
52:40 U-Net for learning denoising, transformers also used
1:03:15? Training objective
1:05:00? A score based model fixes the errors of a basic numerical SDE solver by running Langevin for that time step
1:05:40 Transitions are Gaussian, marginals are not
1:05:50 DDPM is a particular type of discretization of the SDE
1:07:35 Converting the VAE (SDE) to a flow (ODE) with equivalent marginals; the ODE is a deterministic invertible mapping so likelihoods can be computed exactly (change of variables formula)
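As a small illustration of the fixed encoder discussed around 15:20-17:00 and 51:30, here is a numpy sketch of my own (the linear beta schedule and T = 1000 are the usual DDPM conventions, assumed here rather than taken from the lecture):
```python
import numpy as np

# Small sketch (my own illustration) of the fixed DDPM encoder: the closed-form Gaussian
# kernel q(x_t | x_0) = N(sqrt(alpha_bar_t) * x_0, (1 - alpha_bar_t) * I), and the "simple"
# training target: predict the noise eps that was added.
T = 1000
betas = np.linspace(1e-4, 0.02, T)                # assumed linear schedule
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def q_sample(x0, t, rng):
    """Jump straight from x_0 to x_t without simulating the intermediate steps."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps                               # eps is the regression target for eps_theta(x_t, t)

rng = np.random.default_rng(0)
x0 = rng.standard_normal(32)                      # a toy "data point"
x_t, eps = q_sample(x0, t=500, rng=rng)
# A denoiser eps_theta would be trained on || eps - eps_theta(x_t, t) ||^2, which
# (up to weighting) matches the denoising score matching objective discussed above.
print(x_t.shape, float(np.mean(eps**2)))
```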
@shahriarabid191
@shahriarabid191 2 days ago
Thank you so much. Grateful!
@the_master_of_cramp
@the_master_of_cramp 2 days ago
53:50 I don't see why this is not gradient descent. The target is fixed. It's just that we change it periodically. Still gradient descent and should work.
@chenmarkson7413
@chenmarkson7413 2 days ago
Another example of "The Bitter Lesson": AI research keeps advancing from human-knowledge-based methods (e.g. feature engineering) to automatic learning/search-based methods (e.g. representation learning).
@ferencszalma7094
@ferencszalma7094 2 days ago
0:20 Last time: implicit regularization of initialization. This time: ➀ implicit regularization of init (regression: minimum norm solution), ➁ classification (maximum margin solution)
1:50 ➀ More precise characterization of regularization by initialization
 Preparation: 𝗚𝗿𝗮𝗱𝗶𝗲𝗻𝘁 𝗳𝗹𝗼𝘄: GD with infinitesimal LR
 L(w) loss function; GD: w(t+1)=w(t)-η∇L(w(t)); scale time by η: w(t+η)=w(t)-η∇L(w(t))
 Continuous process: η→0, w(t+dt)=w(t)-dt ∇L(w(t))
 Differential equation: ẇ=∂w(t)/∂t=-∇L(w(t))
6:15 𝗠𝗼𝗱𝗲𝗹: 𝗾𝘂𝗮𝗱𝗿𝗮𝘁𝗶𝗰𝗮𝗹𝗹𝘆 𝗽𝗮𝗿𝗮𝗺𝗲𝘁𝗲𝗿𝗶𝘇𝗲𝗱 𝗹𝗶𝗻𝗲𝗮𝗿 𝗺𝗼𝗱𝗲𝗹 (variant of last lecture)
 Setup: model/hypothesis function, loss function
 𝗜𝗻𝗶𝘁𝗶𝗮𝗹𝗶𝘇𝗮𝘁𝗶𝗼𝗻: w₊(0)=w₋(0)=α𝟙⃗
13:40 𝗧𝗵𝗲𝗼𝗿𝗲𝗺 𝗳𝗼𝗿 α 𝗶𝗻𝗶𝘁𝗶𝗮𝗹𝗶𝘇𝗮𝘁𝗶𝗼𝗻 𝗶𝗻𝗱𝘂𝗰𝗶𝗻𝗴 𝗹₁ 𝗮𝗻𝗱 𝗹₂ 𝗿𝗲𝗴𝘂𝗹𝗮𝗿𝗶𝘇𝗮𝘁𝗶𝗼𝗻𝘀 𝗶𝗻 𝘀𝗶𝗺𝗽𝗹𝗲 𝗺𝗼𝗱𝗲𝗹:
 For any 0<α<∞, assuming that the solution converges to a feasible θ_α with Xθ_α=y⃗, then θ_α=argmin Q_α(θ) s.t. Xθ=y, with the complexity Q_α(θ)=𝚺ᵈᵢ₌₁ q(θᵢ/α²), q(z) = 2 - √(4+z²) + z arcsinh(z/2)
 𝗞𝗲𝗿𝗻𝗲𝗹 𝗮𝗻𝗱 𝗥𝗶𝗰𝗵 𝗥𝗲𝗴𝗶𝗺𝗲𝘀 𝗶𝗻 𝗢𝘃𝗲𝗿𝗽𝗮𝗿𝗮𝗺𝗲𝘁𝗿𝗶𝘇𝗲𝗱 𝗠𝗼𝗱𝗲𝗹𝘀 by Woodworth et al. 2020, arxiv.org/abs/2002.09277
18:10 Complexity measure for various α:
 α→∞: q(θᵢ/α²)≈..., Q_α(θ)∝‖θ‖₂², the min ℓ₂ norm solution in θ (ℓ₄ in w)
 α→0: q(θᵢ/α²)≈..., Q_α(θ)∝‖θ‖₁, the min ℓ₁ norm solution in θ (ℓ₂ in w)
 α∈(0,∞): interpolation between ℓ₁ and ℓ₂ regularization
23:11 Interpretation/discussion: kernel regime α→∞, rich regime α→0
34:00 Proof of Thm (𝗦𝗶𝗺𝗶𝗹𝗮𝗿 𝘁𝗼 𝗿𝗲𝗴𝗿𝗲𝘀𝘀𝗶𝗼𝗻 𝗽𝗿𝗼𝗼𝗳 in structure, not similar to last time in details)
 Step ➀ 𝗙𝗶𝗻𝗱 𝗶𝗻𝘃𝗮𝗿𝗶𝗮𝗻𝗰𝗲 maintained by the minimizer (was θ∈span{x⁽ⁱ⁾} for linear regression)
 Step ➁ 𝗖𝗵𝗮𝗿𝗮𝗰𝘁𝗲𝗿𝗶𝘇𝗲 𝘀𝗼𝗹𝘂𝘁𝗶𝗼𝗻 𝘂𝘀𝗶𝗻𝗴 𝗶𝗻𝘃𝗮𝗿𝗶𝗮𝗻𝗰𝗲 (nothing about population vs empirical)
 X̃=[X -X] ∈ℝⁿˣ²ᵈ, w(t)=[w₊(t), w₋(t)] ∈ℝ²ᵈ
 L(w(t))=1/2‖X̃(w(t)⊙w(t))-y⃗‖₂²
 ẇ(t)=-∇L(w(t))=-2X̃ᵀr(t)⊙w(t), so ẇ(t)/w(t)=-2X̃ᵀr(t), i.e. d ln(w(t))/dt=-2X̃ᵀr(t)
 r(t)=X̃(w(t)⊙w(t))-y⃗ is the residual vector
43:00 w(t)=w(0)⊙exp[-2X̃ᵀ∫₀ᵗ r(s)ds], w(0)=α𝟙⃗ ∈ℝ²ᵈ since w₊(0)=w₋(0)=α𝟙⃗ ∈ℝᵈ
 θ(t)=w₊(t)⊙w₊(t) - w₋(t)⊙w₋(t)=2α²sinh(-4Xᵀ∫₀ᵗ r(s)ds)
52:50 Program I
53:30 𝗞𝗞𝗧 𝗞𝗮𝗿𝘂𝘀𝗵-𝗞𝘂𝗵𝗻-𝗧𝘂𝗰𝗸𝗲𝗿 𝗰𝗼𝗻𝗱𝗶𝘁𝗶𝗼𝗻 𝗳𝗼𝗿 𝗼𝗽𝘁𝗶𝗺𝗮𝗹𝗶𝘁𝘆 𝗼𝗳 𝗰𝗼𝗻𝘀𝘁𝗿𝗮𝗶𝗻𝗲𝗱 𝗼𝗽𝘁𝗶𝗺𝗶𝘇𝗮𝘁𝗶𝗼𝗻 𝗽𝗿𝗼𝗯𝗹𝗲𝗺𝘀
 argmin Q(θ) s.t. Xθ=y, where Q(θ) is a convex function
 The KKT condition is: ∇Q(θ)=Xᵀv for some v∈ℝⁿ, and Xθ=y
 𝗢𝗽𝘁𝗶𝗺𝗮𝗹𝗶𝘁𝘆: 𝗻𝗼 𝗳𝗶𝗿𝘀𝘁 𝗼𝗿𝗱𝗲𝗿 𝗹𝗼𝗰𝗮𝗹 𝗶𝗺𝗽𝗿𝗼𝘃𝗲𝗺𝗲𝗻𝘁 𝘀𝗮𝘁𝗶𝘀𝗳𝘆𝗶𝗻𝗴 𝘁𝗵𝗲 𝗰𝗼𝗻𝘀𝘁𝗿𝗮𝗶𝗻𝘁
 Perturbation: Δθ ⊥ row-span(X), i.e. XΔθ=0, so θ+Δθ satisfies the constraint; there should be no increase in Q(θ) to first order: Q(θ+Δθ) ≈ Q(θ)+<Δθ,∇Q(θ)>+... (Taylor expansion)
 So Δθ and ∇Q(θ) must be orthogonal for any such Δθ, hence ∇Q(θ)∈row-span(X), i.e. ∇Q(θ)=Xᵀv
1:02:05 Use the KKT condition ∇Q(θ)=Xᵀv:
 θ_α=2α²sinh(-4Xᵀ∫₀^∞ r(s)ds)=2α²sinh(-4Xᵀv)
 Reverse-engineered Q(θ): ∇Q(θ_α)=arcsinh(θ_α/(2α²))=-4Xᵀv
 So θ_α satisfies the KKT condition, hence it's the global min
1:06:00 𝗖𝗹𝗮𝘀𝘀𝗶𝗳𝗶𝗰𝗮𝘁𝗶𝗼𝗻 𝗽𝗿𝗼𝗯𝗹𝗲𝗺 (𝘀𝗲𝗽𝗮𝗿𝗮𝗯𝗹𝗲 𝗱𝗮𝘁𝗮): GD -> max margin solution
 Setup: linear model, logistic loss function
1:11:35 Multiple global minima if the data are separable; Q: which direction of separating hyperplane is found by GD?
1:15:15 Def: margin, normalized margin, max margin
1:17:25 𝗧𝗵𝗲𝗼𝗿𝗲𝗺: 𝗚𝗙 𝗰𝗼𝗻𝘃𝗲𝗿𝗴𝗲𝘀 𝘁𝗼 𝘁𝗵𝗲 𝗱𝗶𝗿𝗲𝗰𝘁𝗶𝗼𝗻 𝗼𝗳 𝗺𝗮𝘅 𝗺𝗮𝗿𝗴𝗶𝗻
1:19:00 Intuition: ➀ ➁ ➂ ➃
1:25:25 𝗦𝗼𝗳𝘁𝗺𝗮𝘅, 𝗵𝗮𝗿𝗱𝗺𝗮𝘅
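To see the α→0 (rich/ℓ₁) regime in action, here is a small numpy experiment of my own (made-up sizes and constants, not code from the lecture) that runs plain GD on the w₊, w₋ parameterization from a small initialization and checks that it lands near a sparse ground truth:
```python
import numpy as np

# Small experiment (my own illustration, made-up sizes) with the quadratically
# parameterized linear model theta = w_plus**2 - w_minus**2 trained by plain GD
# from the tiny initialization w_plus(0) = w_minus(0) = alpha * 1. For small alpha
# the implicit bias should be close to the minimum-l1 interpolator, which here
# approximately recovers the sparse ground truth.
rng = np.random.default_rng(0)
n, d, alpha, eta, steps = 40, 100, 1e-3, 0.01, 50_000

theta_star = np.zeros(d)
theta_star[:3] = [1.0, 2.0, 0.5]                  # r-sparse ground truth
X = rng.standard_normal((n, d))
y = X @ theta_star

w_plus = np.full(d, alpha)
w_minus = np.full(d, alpha)
for _ in range(steps):
    theta = w_plus**2 - w_minus**2
    grad_theta = X.T @ (X @ theta - y) / n        # gradient of 1/(2n) ||X theta - y||^2
    w_plus -= eta * 2.0 * grad_theta * w_plus     # chain rule through theta = w+^2 - w-^2
    w_minus += eta * 2.0 * grad_theta * w_minus

theta = w_plus**2 - w_minus**2
print(np.round(theta[:6], 3))                     # first entries close to theta_star
print(np.linalg.norm(theta - theta_star))         # small overall error
```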
@bedtimestories1065
@bedtimestories1065 2 days ago
This is a non-answer. The real answer is math. Computer Science is about the science of computers. AI just uses computers as a tool for computation. You don't need to understand bits and bytes and how they work within a CPU register. You do, however, need to use a TON of math for AI.
@kingki1953
@kingki1953 2 days ago
Why is there no framework to make the process easy?
@kingki1953
@kingki1953 2 days ago
My lecturers at my university never explained these things; thanks for this free lecture.
@Ethan_here230
@Ethan_here230 2 days ago
No comments?
@CPTSMONSTER
@CPTSMONSTER 2 days ago
10:30? Shannon code interpretation
11:25 Training a generative model based on maximum likelihood is equivalent to optimizing compression; comparing models based on likelihoods is equivalent to comparing compression
22:30 A GAN generates samples but getting likelihoods is difficult, kernel density approximation
33:40? VAE, compute likelihoods by annealed importance sampling
43:25? Sharpness: entropy of the classifier; low entropy means the classifier puts its probability mass on one y, leading to high sharpness. Diversity: entropy of the marginal distribution of the classifier.
45:50 c(y) is the marginal distribution over predicted labels when fitting synthetic samples; one sample would yield low entropy (unhappy); ideally c(y) is uniform over the possible y; sharpness and diversity are competing
54:00 Notation: x and x' are not global variable names
55:20? Kernel Hilbert space mapping, comparing features is equivalent to comparing moments
1:05:20? VAEs have a reconstruction loss embedded in them, GANs don't
1:07:55 Impossible to measure disentanglement in unlabeled data
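For the sharpness/diversity discussion at 43:25-45:50, here is a tiny numpy sketch of my own (random classifier outputs stand in for a real classifier applied to generated samples) computing the two entropies being traded off:
```python
import numpy as np

# Tiny sketch (my own illustration) of the sharpness/diversity trade-off behind
# Inception-Score-style evaluation: sharpness wants each posterior p(y|x) to have low
# entropy; diversity wants the marginal c(y) = E_x[p(y|x)] to have high entropy.
def entropy(p, axis=-1, eps=1e-12):
    return -(p * np.log(p + eps)).sum(axis=axis)

def sharpness_and_diversity(class_probs):
    """class_probs: (num_samples, num_classes) classifier outputs on generated samples."""
    per_sample_entropy = entropy(class_probs).mean()       # low  => sharp
    marginal_entropy = entropy(class_probs.mean(axis=0))   # high => diverse
    return per_sample_entropy, marginal_entropy

rng = np.random.default_rng(0)
logits = 3.0 * rng.standard_normal((1000, 10))             # stand-in for classifier outputs
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
print(sharpness_and_diversity(probs))
```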
@susdoge3767
@susdoge3767 2 days ago
This is by far the best video on transformers I have seen. Kudos!
@user-uy4rx3hs3x
@user-uy4rx3hs3x 2 days ago
Where can we get the slides?
@mclovin6537
@mclovin6537 2 days ago
Let me try to explain this for everyone watching this video.

Let's say we have two wallets, Wallet A and Wallet B. Wallet A has a UTXO of 1 BTC and makes a transfer to Wallet B. Once the transfer is done, Wallet B has 0.3 BTC and Wallet A has 0.7 BTC. The UTXO of 1 BTC in Wallet A is now gone; it is spent, gone forever and ever and ever. Two new UTXOs are created: one in Wallet A for the remaining 0.7 BTC and one in Wallet B for the received 0.3 BTC.

Basically, a UTXO is a combination of things like the transaction ID, the amount, and so on. That identifier is there to show how much BTC you have available to send in the future. If you send all your BTC to another wallet, your old UTXO is gone and a new UTXO is created in the new wallet. If you send part of your BTC to another wallet, your old UTXO is gone and two new UTXOs are created: the amount in the new wallet, and the remainder in the old wallet.

Hope this helps. This was just how I understood it, so if I am wrong, someone please correct me.
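A tiny Python sketch of my own of the bookkeeping described above (illustrative only, nothing like real Bitcoin code; the UTXO fields and the spend helper are made up):
```python
from dataclasses import dataclass
from uuid import uuid4

# Toy sketch (my own illustration, nothing like real Bitcoin code) of the bookkeeping
# above: spending consumes the old output entirely and creates new ones (payment + change).
@dataclass
class UTXO:
    txid: str
    owner: str
    amount: float

def spend(utxo: UTXO, to: str, amount: float) -> list[UTXO]:
    assert amount <= utxo.amount, "cannot spend more than the UTXO holds"
    txid = uuid4().hex                           # old UTXO is gone; new ones belong to a new tx
    outputs = [UTXO(txid, to, amount)]
    change = utxo.amount - amount
    if change > 0:
        outputs.append(UTXO(txid, utxo.owner, change))   # change goes back to the sender
    return outputs

wallet_a = UTXO(txid="genesis", owner="A", amount=1.0)
print(spend(wallet_a, to="B", amount=0.3))       # 0.3 BTC UTXO for B, 0.7 BTC change for A
```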
@kebas239
@kebas239 3 days ago
He was so passionate at the beginning that they had to bleep everything out.
@MatijaGrcic
@MatijaGrcic 3 days ago
Amazing!
@mdrasheduzzaman7613
@mdrasheduzzaman7613 3 days ago
The expression at 27:27-27:29 was something else!😂😂
@heaptv2348
@heaptv2348 3 days ago
the incoming messages from its upstream neighbors *
@user-zr1yv8id9w
@user-zr1yv8id9w 3 days ago
As a beginner, what do you need before starting to watch this playlist? Does this lecture series cover all the basic concepts of ML?
@leodexter191
@leodexter191 3 days ago
Is this the last lecture of the machine learning course?
@CPTSMONSTER
@CPTSMONSTER 3 days ago
7:00 Sliced score matching is slower than denoising score matching (taking derivatives)
13:45 Denoising the data favors minimizing sigma, but the minimum sigma is not optimal for perturbing the data when sampling
27:15 Annealed Langevin, 1000 sigmas
38:50 Fokker-Planck PDE, interdependence of the scores; intractable, so treat the loss functions (scores) as independent
45:00? Weighted combination of denoising score matching losses: estimate the score of the perturbed data for each sigma_i, weighted combination of the estimated scores
48:15 As efficient as estimating a single non-conditional score network; joint estimation of the scores is amortized by a single score network
49:50? Smallest to largest noise during training, largest to smallest noise during inference (Langevin)
52:10? Notation: p_sigma_i is equivalent to the previous q (estimation of perturbed data)
57:20 Mixture denoising score matching is expensive at inference time (Langevin steps); deep computation graph which doesn't have to be unrolled at training time (no samples are generated during training)
1:07:00 The SDE describes the perturbation iterations over time
1:08:50 Inference time (largest to smallest noise) is described by the reverse SDE, which only depends on the score functions of the noise-perturbed data densities
1:12:00 Euler-Maruyama discretizes time to numerically solve the SDE
1:13:25 Numerically integrating the SDE that goes from noise to data
1:15:00? SDE and Langevin corrector
1:20:25 Infinitely deep computation graph (refer to 57:20)
1:21:45 Possible to convert the SDE model to a normalizing flow and get latent variables
1:22:00 The SDE can be described as an ODE with the same marginals
1:23:15 The machinery defines a continuous time normalizing flow where the invertible mapping is given by solving an ODE; paths of the solved ODE with different initial conditions can never cross (invertible, normalizing flow); the normalizing flow model is trained not by maximum likelihood but by score matching; a flow with infinite depth (likelihoods can be obtained)
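To make the denoising score matching objective concrete (the mixture-of-losses discussion around 45:00 and 52:10), here is a small numpy sketch of my own; the toy data, the sigma, and the σ²-weighting convention are assumptions for the illustration:
```python
import numpy as np

# Small sketch (my own illustration) of the denoising score matching objective:
# perturb x with Gaussian noise of scale sigma and regress the model's score at the
# perturbed point onto the score of the transition kernel, -(x_tilde - x) / sigma^2.
def dsm_loss(score_fn, x, sigma, rng):
    noise = rng.standard_normal(x.shape)
    x_tilde = x + sigma * noise
    target = -(x_tilde - x) / sigma**2                 # known in closed form (= -noise / sigma)
    diff = score_fn(x_tilde, sigma) - target
    return 0.5 * sigma**2 * np.mean(np.sum(diff**2, axis=-1))   # sigma^2 weighting across levels

# Toy check: for data ~ N(0, I), the perturbed density is N(0, (1 + sigma^2) I) with exact
# score -x / (1 + sigma^2). The DSM minimum is a nonzero constant, so the exact score should
# score lower than a wrong model (here: the all-zeros score), but not reach zero.
rng = np.random.default_rng(0)
x = rng.standard_normal((10_000, 2))
true_score = lambda x_tilde, sigma: -x_tilde / (1.0 + sigma**2)
zero_score = lambda x_tilde, sigma: np.zeros_like(x_tilde)
print(dsm_loss(true_score, x, sigma=0.5, rng=rng))     # smaller
print(dsm_loss(zero_score, x, sigma=0.5, rng=rng))     # larger
```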
@IOSALive
@IOSALive 3 days ago
Stanford Online, This is so fun! I'm happy I found your channel!
@not_amanullah
@not_amanullah 3 days ago
❤🤍
@susdoge3767
@susdoge3767 3 days ago
gold
@susdoge3767
@susdoge3767 3 days ago
this is a good one
@sapienspace8814
@sapienspace8814 3 days ago
@ 5:29 Very interesting. Even more interesting is how Yann LeCun seems to want to get rid of RL, while at the same time providing a blanket exception for it: "when your plan does not work", or if you are fighting a "ninja", and that it is too "dangerous".

Note that in the lawsuit between Elon Musk and OpenAI, a 2018 email revealed that their "core technology" is from the "90s".

RL was originally funded by the United States Air Force (USAF) under Klopf (who wrote the book "The Hedonistic Neuron") and who brought on Sutton and Barto. An Arizona State University (ASU) master's thesis student (whose advisor was Chinese-American) had "early private access" to the first edition of RL by Barto and Sutton, and used Fuzzy Logic combined with K-means clustering (a method of focusing attention on regions of interest in the state space), combining this with RL to automatically learn inference rules about the physics of the world, including how to balance an inverted pendulum as an adaptive control system. This thesis was published in 1997, but probably only a handful of people know about it. Note that Fuzzy Logic merges statistical mathematics and language.

One of Barto's students went on to work for Boston Dynamics, where the "Big Dog" robot that uses RL can stand itself back up if you kick it.
@glitchAI
@glitchAI 3 days ago
He speaks with so much bass that I have to ramp up my volume.
@susdoge3767
@susdoge3767 3 days ago
Guess I'm not really missing out on much by not being at Stanford.
@xupan8658
@xupan8658 3 days ago
There is a typo in the equality at 1:05:53 (applying Hölder's inequality). The outer exponent on the right-hand side should be 3, not 3/2. The same typo is also in the lecture notes (5.133).
@jens8486
@jens8486 3 days ago
I really like the "Attention" part of the lecture! Thank you Professor Manning!
@ferencszalma7094
@ferencszalma7094 3 days ago
0:00 Last time: linear regression, initialization β=0 ⇒ min norm solution β̂. This time: non-linear models, similar solution
1:30 Non-linear model fᵦ(x)=<β⊙β,x>, where ⊙ is the Hadamard product
 Linear in x∈ℝᵈ, non-linear in β∈ℝᵈ; the loss function is non-convex
 Ground truth y=<β​⃰⊙β​⃰,x>, where β​⃰ is r-sparse: ‖β​⃰‖₀≤r, n<d, n>poly(r)
 β​⃰=𝟙ₛ, where S⊆[d], |S|=r: r components of β​⃰ are 1, the others are zero
 x₁,...,xₙ ~ iid 𝒩(0,𝟙ᵈˣᵈ)
7:10 Classical ℓ₁ theory: use ℓ₁ regularization to leverage sparsity
 ➀ Lasso: fᶿ(x)=θᵀx, objective 1/n 𝚺ⁿᵢ₌₁(yᵢ-θᵀxᵢ)² + λ‖θ‖₁, n≥Ω(r)
 Classical theory: the objective recovers the ground truth θ​⃰=β​⃰⊙β​⃰ approximately
 ➁ Note: θ <-> β⊙β, ‖θ‖₁=‖β‖₂², so this is 1/n 𝚺ⁿᵢ₌₁(yᵢ-fᵦ(xᵢ))² + λ‖β‖₂²
11:24 Implicit regularization: use small init without explicit regularization ⇒ ➁
 L̂(β)=1/(4n) 𝚺ⁿᵢ₌₁(yᵢ-<β⊙β,xᵢ>)² empirical loss
 Optimizer: GD on L̂(β) with small initialization β⁰=α𝟙⃗ for some small α, βᵗ⁺¹=βᵗ-η∇ᵦL̂(βᵗ), converges to β​⃰
16:55 Interpretation: L̂(β) has many global minima because of overparametrization
 The runtime lower bound depends only on log(1/α); upper bound on runtime (pretty mild); α cannot be zero because α=0 is a saddle point
21:50 GD prefers the global min closest to the initialization
25:50 Some basic properties
27:35 Uniform convergence for r-sparse β
 With high probability over the choice of the x₁,...,xₙ, the (r,δ)-RIP condition (Restricted Isometry Property) holds:
 ➂ ∀v with ‖v‖₀≤r: (1-δ)‖v‖²₂ ≤ 1/n 𝚺ⁿᵢ₌₁<v,xᵢ>² ≤ (1+δ)‖v‖²₂
 i.e. restricted to sparse directions, (1-δ)𝕀 ≼ 1/n 𝚺ⁿᵢ₌₁ xᵢxᵢᵀ ≼ (1+δ)𝕀, the covariance is close to 𝕀
34:45 Uniform convergence for r-sparse β:
 L̂(β)=1/(4n) 𝚺ⁿᵢ₌₁<β⊙β-β​⃰⊙β​⃰,xᵢ>², where v=β⊙β-β​⃰⊙β​⃰
 ≈ 1/4 ‖β⊙β-β​⃰⊙β​⃰‖₂² = L(β)
 Uniform convergence of ∇L̂(β) for sparse β: ∇L̂(β)≈∇L(β)
 However ∃ dense β s.t. L̂(β)=0 but L(β)>>0
37:40 Main intuition: 𝓧ᵣ is the family of sparse vectors β, 𝓧ᵣ={β: ‖β‖₀≤r}
 The population loss GD trajectory never leaves 𝓧ᵣ and tends to β​⃰
 The empirical loss GD trajectory is close to the population loss GD trajectory, so it never leaves 𝓧ᵣ either
47:40 Analysis of the population trajectory L(β)
54:15 Case I
1:02:00 Case II
1:05:50 Analysis for L̂(β) with r=1
1:08:20 Proof idea: ➀ approximately sparse, ➁ the β trajectory never leaves 𝓧ᵣ significantly
@dkierans
@dkierans 4 days ago
Yeah, this is a pretty great talk. It is quite hard to figure out at what technical level to hit the widest audience. This is nice. Not as nice as those flaxen locks though.
@CPTSMONSTER
@CPTSMONSTER 4 days ago
2:00 Summary
8:00 EBM training: maximum likelihood training requires estimation of the partition function; contrastive divergence requires samples to be generated (MCMC Langevin with 1000 steps); instead minimize the Fisher divergence (score matching) rather than the KL divergence
19:15 EBMs parameterize a conservative vector field of gradients of an underlying scalar function; score based models generalize this to an arbitrary vector field. EBMs directly model the log-likelihood, score based models directly model the score (no functional parameters).
29:15? Backprops in an EBM: the derivative of f_theta is s_theta, then the Jacobian of s_theta
34:00 Fisher divergence between the model and the perturbed data
36:25 Noisy data q to remove the trace of the Jacobian from the calculations
39:30 Linearity of the gradient
42:15 Estimating the score of the noise-perturbed data density is equivalent to estimating the score of the transition kernel (Gaussian noise density)
44:10 The trace of the Jacobian is removed from the estimation; the loss function is a denoising objective
46:05 The sigma of the noise distribution should be as small as possible
48:55? Stein unbiased risk estimator trick: evaluating the quality of an estimator without knowing the ground truth; denoising objective
51:55 Denoising score matching: these two objectives are equivalent up to a constant; minimizing the bottom objective (denoising) is equivalent to minimizing the top objective (which estimates the score of the distribution convolved with Gaussian noise)
52:20? Individual conditionals
53:25 Reduced generative modelling to denoising
55:35? Tweedie's formula, alternative derivation of the denoising objective
58:25 Interpretation of the equations: the conditional on x is related to the joint density of x and the perturbed x; q_sigma is the integral of the joint density; Tweedie's formula expresses x in terms of the perturbed x with an optimal adjustment (the gradient of q_sigma, which relates to the density of x conditional on the perturbed x)
1:01:35? Jacobian-vector products, directional derivatives, efficient to estimate using backprop
1:04:20 Sliced score matching needs a single backprop (directional derivative); without slicing it needs d backprops
1:07:00 Sliced score matching is not done on perturbed data
1:12:00 Langevin MCMC, sampling with the score
1:14:35 Real world data tends to lie on a low dimensional manifold
1:21:00 Langevin mixes too slowly; the mixture weight disappears when taking gradients
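For the Langevin MCMC step at 1:12:00, here is a minimal numpy sketch of my own (a toy Gaussian target with a known score stands in for a learned score network):
```python
import numpy as np

# Minimal sketch (my own illustration) of unadjusted Langevin MCMC:
#   x_{k+1} = x_k + (eps / 2) * score(x_k) + sqrt(eps) * z_k,  z_k ~ N(0, I).
# With the true score of the target and a small step size, the samples approximately
# follow the target density. A toy Gaussian target stands in for a learned score model.
def langevin_sample(score, x0, eps=1e-2, n_steps=5000, rng=None):
    rng = rng or np.random.default_rng()
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        x = x + 0.5 * eps * score(x) + np.sqrt(eps) * rng.standard_normal(x.shape)
    return x

mu = np.array([2.0, -1.0])
score = lambda x: -(x - mu)                       # exact score of N(mu, I)
x0 = np.zeros((500, 2))                           # 500 chains run in parallel
samples = langevin_sample(score, x0, rng=np.random.default_rng(0))
print(samples.mean(axis=0))                       # approximately mu
```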