As a CS student at Tsinghua, I would say this is the best ML course you can find out there.
@kilianweinberger698 4 years ago
Thanks! Please send my warmest regards to Prof. Gao Huang.
@matthieulin335 4 years ago
@kilianweinberger698 Will do when this virus ends!
@user-me2bw6ir2i 1 year ago
I'm incredibly grateful for your intuitive explanation of SVM; it really helped me understand this topic.
@raviraja2691 3 years ago
I really want to put my laptop away... But I'm watching Prof Kilian's awesome lectures... So can't help it!
@rajeshs2840 4 years ago
Thank you, Prof. After your videos, I started loving ML.
@StevenSarasin 11 months ago
log(cosh(x)) is such a clever idea: asymptotically linear and locally (near x=0) quadratic, a smooth version exactly analogous to the Huber idea of mixing the L1 and L2 norms. Worth checking out the Taylor expansion, which can be thought of as a microscope for functions: it tells you what polynomial a function looks like close to a point (typically 0). You will get that log(cosh(x)) = 0.5*x^2 + O(x^4), i.e. quadratic near 0.
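For anyone who wants to check this numerically, here is a quick sketch (my own code, not from the lecture). The rewritten form |r| + log((1 + e^(-2|r|))/2) is algebraically equal to log(cosh(r)) but avoids overflowing cosh for large residuals:

```python
import math

def log_cosh_loss(r):
    # log(cosh(r)) for a residual r = h(x) - y, computed in a
    # numerically stable form: cosh(r) itself overflows for large |r|.
    a = abs(r)
    return a + math.log((1.0 + math.exp(-2.0 * a)) / 2.0)

# Near 0 it matches the quadratic 0.5 * r^2 ...
print(log_cosh_loss(0.01) - 0.5 * 0.01 ** 2)         # ~0 (difference is O(r^4))
# ... and for large |r| it approaches the line |r| - log(2).
print(log_cosh_loss(10.0) - (10.0 - math.log(2.0)))  # ~0
```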
@sansin-dev 4 years ago
Brilliant lecture.
@in100seconds5 4 years ago
Boy this is wonderful
@8943vivek 3 years ago
Wow! CRISP!
@JoaoVitorBRgomes 3 years ago
@kilian weinberger, in the first 20 minutes of the lecture you say the derivative of the squared loss is the mean. But shouldn't it be the bias and variance? Or the intercept or the weights?
@omarjaafor6646 2 years ago
Where were you all these years?
@jachawkvr 4 years ago
I loved your visualization of l1 and l2 regularization. I had seen these before but never really understood what they meant. I have a question here: how would we optimize the objective function while using l1 regularization? I think gradient descent would not work well, since the function is not differentiable at some very key points.
@kilianweinberger698 4 years ago
Yes, good point. SGD gets a little tricky. If you use the full gradient (summed over all samples) you can use sub-gradient descent. As long as you make sure you reduce your step size, it should converge nicely.
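A toy illustration of this suggestion (my own one-dimensional sketch, not the professor's code): full-batch sub-gradient descent on a lasso objective, taking the sub-gradient 0 at w = 0 and shrinking the step size over time:

```python
import math

def lasso_subgradient_descent(xs, ys, lam=0.1, steps=2000):
    # Minimize (1/n) * sum_i (w*x_i - y_i)^2 + lam * |w| for a scalar w.
    # |w| is not differentiable at w = 0, so we use a sub-gradient there (0)
    # and decay the step size like 1/sqrt(t) so the iterates settle down.
    n = len(xs)
    w = 0.0
    for t in range(1, steps + 1):
        grad = sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        sub = lam * ((w > 0) - (w < 0))   # sign(w), with 0 at w = 0
        w -= (0.1 / math.sqrt(t)) * (grad + sub)
    return w

xs = [0.5, 1.0, 1.5, 2.0]
ys = [2.0 * x for x in xs]            # data from y = 2x, no noise
w = lasso_subgradient_descent(xs, ys)
print(w)  # settles slightly below 2: the l1 penalty shrinks the weight
```

The decaying step size is the key point of the reply: with a constant step, the iterates keep bouncing around the kink of |w| instead of converging.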
@Theophila-FlyMoutain 4 months ago
Hi Professor, thank you for sharing the video. I am using Gaussian process regression (GPR) in the physics field. One thing I noticed is that even though specific loss functions exist for GPR, many people use root-mean-squared error as the loss function. Is there any rule for choosing the loss function and regularization?
@theflippedbit 4 years ago
Hi, Professor. I really like your way of explaining ML concepts. I wish there were assignments/quizzes on the related topics, where we could try out these learning algorithms and get more hands-on experience. I checked the course page but couldn't find any assignments.
@kilianweinberger698 4 years ago
Past 4780 exams are here: www.dropbox.com/s/zfr5w5bxxvizmnq/Kilian past Exams.zip?dl=0
Past 4780 homeworks are here: www.dropbox.com/s/tbxnjzk5w67u0sp/Homeworks.zip?dl=0
Unfortunately, I cannot hand out the programming assignments from the Cornell class. There is an online version of the class (with interactive programming assignments and all that stuff), but the university does charge tuition: www.ecornell.com/certificates/technology/machine-learning/
@danielrudnicki88 3 years ago
@kilianweinberger698 Is there a current version of the link with exams? The one above has unfortunately expired. Thanks for these amazing lectures :)
@vishnuvardhanchakka1308 3 years ago
Sir, in the Plots of Common Regression Loss Functions, the x-axis should be h(Xi) - Yi, but on the course page it shows h(Xi) * Yi.
@sekfook97 3 years ago
I'm starting to understand why we optimize wTw instead of just w: wTw is a scalar while w is a vector, and I guess a scalar value is much easier to use as a constraint. Also, constraining wTw would leave a bigger space in which to search for the optimal w.
@kilianweinberger698 3 years ago
It is tricky to optimize over a vector like w. Imagine w is two-dimensional: which vector is more optimal, [1,2], [2,1], or [4,0]? When you optimize w’w you get a scalar, for which minimization and maximization are well defined.
@sekfook97 3 years ago
@@kilianweinberger698 thanks for the detailed explanation!
@hdang1997 4 years ago
Is using MAP estimation synonymous with regularizing?
@kilianweinberger698 4 years ago
No, not exactly, but in many settings the resulting parameter estimate is identical to what you would obtain with a specific regularizer (depending on the prior). The idea of enforcing a prior can be viewed as a form of regularization.
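To make that correspondence concrete, here is the standard derivation sketch for linear regression (my notation, not from the thread: σ² is the Gaussian noise variance, τ² the prior variance):

```latex
% MAP estimate with Gaussian likelihood (noise variance \sigma^2)
% and Gaussian prior w \sim N(0, \tau^2 I):
\hat{w}_{\mathrm{MAP}}
  = \arg\max_{w} \; \log P(D \mid w) + \log P(w)
  = \arg\min_{w} \; \frac{1}{2\sigma^2} \sum_{i=1}^{n} \left( w^\top x_i - y_i \right)^2
    + \frac{1}{2\tau^2} \, \lVert w \rVert_2^2 .
```

Up to a constant factor, this is squared loss plus an l2 (ridge) penalty with strength proportional to σ²/τ²; a Laplace prior would similarly yield an l1 penalty.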
@bnglr 3 years ago
Does this have anything to do with ERM?
@aloysiusgunawan7709 2 years ago
Hello Prof, if the constraint is w1^2 + w2^2 ≤ B, does B correspond to the radius of the circle?
@kilianweinberger698 2 years ago
Yes, here B is the squared radius.
@smsubham342 1 year ago
Why does the squared loss estimate the mean while the absolute loss estimates the median? I googled this but found no clear answer.
@kilianweinberger698 11 months ago
You can derive it pretty easily if you let your classifier be a constant predictor. Let's call your prediction p. What minimizes 1/n \sum_{i=1}^n (p - y_i)^2? If you take the derivative and equate it to zero, you will see that the optimum is when p is the mean of all y_i. You can prove a similar result for the median if it is the absolute loss. Hope this helps.
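A brute-force check of this point (my own sketch): grid-search the best constant predictor under each loss and compare it to the mean and median of the data.

```python
def avg_squared_loss(p, ys):
    # Average squared loss of the constant prediction p.
    return sum((p - y) ** 2 for y in ys) / len(ys)

def avg_absolute_loss(p, ys):
    # Average absolute loss of the constant prediction p.
    return sum(abs(p - y) for y in ys) / len(ys)

ys = [1.0, 2.0, 3.0, 4.0, 10.0]        # note the outlier at 10
grid = [i / 100.0 for i in range(0, 1201)]

best_sq = min(grid, key=lambda p: avg_squared_loss(p, ys))
best_abs = min(grid, key=lambda p: avg_absolute_loss(p, ys))

print(best_sq)   # 4.0 -> the mean (pulled toward the outlier)
print(best_abs)  # 3.0 -> the median (robust to the outlier)
```

The outlier makes the difference visible: the squared-loss minimizer gets dragged toward 10, the absolute-loss minimizer does not.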
@rodas4yt137 3 years ago
Never seen 0 dislikes on a 10k-view video before, though.
@KulvinderSingh-pm7cr 5 years ago
I need a little help: I am studying learning theory and need some good-quality material for developing intuition about it. It would be greatly helpful if the professor or anyone could point me to some resources. Thanks a lot in advance.
@kokonanahji9062 5 years ago
This series might help: kzfaq.info/love/R4_akQ1HYMUcDszPQ6jh8Qplaylists
@xenonmob 3 years ago
put your laptops away?
@kareemjeiroudi1964 5 years ago
What the heck? That's like the first 10 minutes of my lecture in Theoretical Concepts of ML. I wish my lecture were that easy.