
XGBoost Part 3 (of 4): Mathematical Details

124,473 views

StatQuest with Josh Starmer

4 years ago

In this video we dive into the nitty-gritty details of the math behind XGBoost trees. We derive the equations for the Output Values from the leaves as well as the Similarity Score. Then we show how these general equations are customized for Regression or Classification by their respective Loss Functions. If you make it to the end, you will be approximately 22% smarter than you are now! :)
NOTE: This StatQuest assumes that you are already familiar with...
XGBoost Part 1: XGBoost Trees for Regression: • XGBoost Part 1 (of 4):...
XGBoost Part 2: XGBoost Trees for Classification: • XGBoost Part 2 (of 4):...
Gradient Boost Part 1: Regression Main Ideas: • Gradient Boost Part 1 ...
Gradient Boost Part 2: Regression Details: • Gradient Boost Part 2 ...
Gradient Boost Part 3: Classification Main Ideas: • Gradient Boost Part 3 ...
Gradient Boost Part 4: Classification Details: • Gradient Boost Part 4 ...
...and Ridge Regression: • Regularization Part 1:...
Also note, this StatQuest is based on the following sources:
The original XGBoost manuscript: arxiv.org/pdf/...
The original XGBoost presentation: homes.cs.washi...
And the XGBoost Documentation: xgboost.readth...
Last but not least, I want to extend a special thanks to Giuseppe Fasanella and Samuel Judge for thoughtful discussions and helping me understand the math.
For a complete index of all the StatQuest videos, check out:
statquest.org/...
If you'd like to support StatQuest, please consider...
Buying The StatQuest Illustrated Guide to Machine Learning!!!
PDF - statquest.gumr...
Paperback - www.amazon.com...
Kindle eBook - www.amazon.com...
Patreon: / statquest
...or...
KZfaq Membership: / @statquest
...a cool StatQuest t-shirt or sweatshirt:
shop.spreadshi...
...buying one or two of my songs (or go large and get a whole album!)
joshuastarmer....
...or just donating to StatQuest!
www.paypal.me/...
Lastly, if you want to keep up with me as I research and create new StatQuests, follow me on twitter:
/ joshuastarmer
Corrections:
1:16 The Lambda should be outside of the square brackets.
#statquest #xgboost
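
For readers who want to see the two formulas this video derives written out as code, here is a minimal sketch of the regression case (squared-error loss with an L2 penalty lambda, as in the video); the function names are my own and this is not the actual XGBoost implementation:

    # Similarity Score for a leaf: (sum of residuals)^2 / (number of residuals + lambda)
    def similarity_score(residuals, lam=1.0):
        return sum(residuals) ** 2 / (len(residuals) + lam)

    # Output Value for a leaf: (sum of residuals) / (number of residuals + lambda)
    def output_value(residuals, lam=1.0):
        return sum(residuals) / (len(residuals) + lam)

    # Gain for a candidate split = left similarity + right similarity - root similarity
    def gain(left_residuals, right_residuals, lam=1.0):
        root = similarity_score(left_residuals + right_residuals, lam)
        return (similarity_score(left_residuals, lam)
                + similarity_score(right_residuals, lam)
                - root)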

Comments: 296
@statquest
@statquest 4 жыл бұрын
Corrections: 1:16 The Lambda should be outside of the square brackets. Support StatQuest by buying my book The StatQuest Illustrated Guide to Machine Learning or a Study Guide or Merch!!! statquest.org/statquest-store/
@SuperJ98
@SuperJ98 Жыл бұрын
Hey Josh, thank you very much for all of your videos. They have been very helpful for my master's thesis. I know this is a small detail, but I think the Similarity Score at 23:05 of this video is negative and should have a minus sign before the 1/2. At least that's what I see on page 3 of the original paper.
@statquest
@statquest Жыл бұрын
@@SuperJ98 Yes, you are referring to equation 6 in the manuscript - that equation definitely has a minus sign in it. However, I was referring to the individual terms in equation #7 and in Algorithm #1 on that same page. Those terms are what I ended up calling "similarity scores" because of what they represented and how they were used to find the optimal split. That said, I should have been clearer in the video about what, exactly, I was referring to.
@ricclx7290
@ricclx7290 Жыл бұрын
Hello Josh, great explanation. One question I have: in neural nets we take derivatives for gradients during the epochs of the training process and do backpropagation, etc. From your explanation, I interpret that there is a loss to be calculated and minimized theoretically, but in practice the derivatives (gradients) are always that one equation of adding the residuals and dividing by the total. So should I conclude that there is no derivative calculation during training, and that we instead just use that one equation?
@statquest
@statquest Жыл бұрын
@@ricclx7290 Yes. Unlike neural networks, XGBoost uses the same overall design for every single model, so we only have to calculate the derivative once (on paper) and know that it will work for all of the models we create. In contrast, every neural network has a different design (different number of hidden layers, different loss functions, different number of weights, etc.) so we always have to calculate the gradient for each model.
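
To make that point concrete, here is a small sketch of what "calculating the derivative once, on paper" looks like for the two loss functions used in this video (my own code, not StatQuest's or the XGBoost library's):

    import numpy as np

    # Regression, 1/2 * (y - p)^2:  g_i = -(y - p) = p - y,  h_i = 1
    def grad_hess_squared_error(y, p):
        return p - y, np.ones_like(y)

    # Classification, log loss in terms of the predicted probability p:
    # g_i = -(y - p) = p - y,  h_i = p * (1 - p)
    def grad_hess_log_loss(y, p):
        return p - y, p * (1 - p)

These closed-form expressions are what gets plugged into the Similarity Score and Output Value formulas, so no numerical differentiation happens during training.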
@beautyisinmind2163
@beautyisinmind2163 3 жыл бұрын
I wish this channel lives 1000 years on YouTube.
@statquest
@statquest 3 жыл бұрын
Thank you!
@iraklimachabeli6659
@iraklimachabeli6659 3 жыл бұрын
This is a brilliant and very detailed explanation of the math behind XGBoost. I love that the notation uses minimal subscripts. I was scratching my head for a day after looking at the original paper by Chen and Guestrin. This video clearly laid out all the steps: the Taylor expansion of the loss function and then the gradient of the second-order approximation with respect to the tree's current prediction. Now it is so obvious that the gradient is with respect to the current prediction, but somehow it was not clear before.
@statquest
@statquest 3 жыл бұрын
Glad it was helpful!
@SophiaSLi
@SophiaSLi 3 жыл бұрын
Thank you so much for the excellent explanation and illustration, Josh!!! This is the best (clearest, best-organized, most comprehensible, most detailed) XGBoost lecture I've ever seen... I don't find myself needing to ask follow-up questions, as everything is explained so well!
@statquest
@statquest 3 жыл бұрын
Awesome, thank you!
@jingyang2865
@jingyang2865 Жыл бұрын
This is the best resource I can find online for explaining XGBoost! A million thanks to you!
@statquest
@statquest Жыл бұрын
Glad you think so!
@jiaqint961
@jiaqint961 6 ай бұрын
OMG... How you break down complicated concepts into simple ones is amazing. Thank you for the content.
@statquest
@statquest 6 ай бұрын
Thank you!
@mikhaeldito
@mikhaeldito 4 жыл бұрын
I couldn't give a lot but I am a proud patron of your work now! I hope others who are financially capable would also donate to StatQuest. BAM!
@statquest
@statquest 4 жыл бұрын
Thank you very much!!! Your patronage means a lot to me.
@damianos17xyz99
@damianos17xyz99 4 жыл бұрын
Oh, I just finished my project on XGBoost classification - I got the max score! I had watched the first two parts and they were really helpful, thanks! And now this part is here, yes! :-) What a helpful man!
@statquest
@statquest 4 жыл бұрын
Congratulations on your project! That is awesome! There is one more video after this one: Part 4: XGBoost Optimizations.
@damianos17xyz99
@damianos17xyz99 4 жыл бұрын
:-) :-) :-) ! :D 😝👍👍
@salhjasa
@salhjasa 3 жыл бұрын
This channel is awesome. After searching and searching for somewhere that explains this clearly, this is just perfect.
@statquest
@statquest 3 жыл бұрын
Thank you very much! :)
@user-fi2vi9lo2c
@user-fi2vi9lo2c 10 ай бұрын
This series about XGBoost is marvellous! Thanks!
@statquest
@statquest 10 ай бұрын
Thank you very much!
@sxfjohn
@sxfjohn 4 жыл бұрын
The most valuable and most easily explained treatment of the hard core of XGBoost. Thanks!
@statquest
@statquest 4 жыл бұрын
Thank you very much! :)
@damp8277
@damp8277 2 жыл бұрын
Watching this video with the original paper open is like deciphering forgotten texts. Thanks so much!
@statquest
@statquest 2 жыл бұрын
Glad it was helpful!
@angelineamber
@angelineamber 3 жыл бұрын
Hey, Josh! I really enjoy your videos and I could not express my gratitude enough!
@statquest
@statquest 3 жыл бұрын
Glad you like them!
@pedroramon3942
@pedroramon3942 Жыл бұрын
Thank you very much for explaining all this very hard math in the original article. I did all the calculations and now I can say I understand XGBoost in depth.
@statquest
@statquest Жыл бұрын
BAM! :)
@vinayak186f3
@vinayak186f3 4 жыл бұрын
I watch your videos, download the subtitles, and make notes from them. I'm really enjoying doing so. THANKS FOR EVERYTHING. 😊
@statquest
@statquest 4 жыл бұрын
BAM! :)
@junbinlin6764
@junbinlin6764 3 жыл бұрын
Your YouTube channel is amazing. Once I find a job related to data science after uni, I will donate fat stacks to this channel.
@statquest
@statquest 3 жыл бұрын
Triple bam! :)
@sayantandutta8353
@sayantandutta8353 4 жыл бұрын
I just completed the AdaBoost, Gradient Boost and XGBoost series; it was awesome. Thanks Josh for the awesome content!
@statquest
@statquest 4 жыл бұрын
Thank you very much!! You deserve a prize for getting through all those videos. :)
@sayantandutta8353
@sayantandutta8353 4 жыл бұрын
:-) :-)
@Vivekagrawal5800
@Vivekagrawal5800 2 жыл бұрын
Amazing video!! Makes the math of XGBoost super simple. Thank you for your efforts...
@statquest
@statquest 2 жыл бұрын
Thank you very much! :)
@rushilv4102
@rushilv4102 3 жыл бұрын
Your videos are really really helpful and easy to comprehend. Thank you so much!
@statquest
@statquest 3 жыл бұрын
Glad you like them!
@alex_zetsu
@alex_zetsu 4 жыл бұрын
I knew enough calculus to know what the second derivative with respect to Pi would be, but even though you spoke normally and I could see it coming, "the number one" seemed so funny after doing all that.
@statquest
@statquest 4 жыл бұрын
Ha! Yeah, isn't that funny? It all just boils down to the number 1. :)
@3Jkkk2
@3Jkkk2 2 жыл бұрын
Josh you are the best! I love your songs at the beginning
@statquest
@statquest 2 жыл бұрын
Thanks!
@user-fy4mu7tp6h
@user-fy4mu7tp6h 9 ай бұрын
Very nice explanation on the math. love it !
@statquest
@statquest 9 ай бұрын
Glad you liked it!
@auzaluis
@auzaluis Жыл бұрын
gosh!!! such a clean explanation!!!
@statquest
@statquest Жыл бұрын
Thanks!
@nielshenrikkrogh5195
@nielshenrikkrogh5195 6 ай бұрын
as always a very structured and easy to understand explanation......many thanks!!
@statquest
@statquest 6 ай бұрын
Glad you liked it!
@abhashpr
@abhashpr Жыл бұрын
Wonderful explanation... did not see this sort of thing anywhere else.
@statquest
@statquest Жыл бұрын
Thanks!
@CrazyProgrammer16
@CrazyProgrammer16 Жыл бұрын
Very well explained. Thank you.
@statquest
@statquest Жыл бұрын
Glad you liked it!
@sheenaphilip6444
@sheenaphilip6444 4 жыл бұрын
Thank you so much for this series of videos on XGBoost!! It has helped so much, especially in understanding the original paper on this, which can be very intimidating at first glance!
@statquest
@statquest 4 жыл бұрын
Thanks! :)
@palebluedot8733
@palebluedot8733 3 жыл бұрын
I can't get past the intro. It's so addictive and I'm not kidding lol.
@statquest
@statquest 3 жыл бұрын
BAM! :)
@henkhbit5748
@henkhbit5748 3 жыл бұрын
The math was, as always, elegantly explained. Analogous to your support vector machine math explanation using the Taylor series for the radial kernel.
@statquest
@statquest 3 жыл бұрын
Yes, the Taylor series shows up in a lot of places in machine learning. It's one of the "main ideas" behind how ML really works.
@Erosis
@Erosis 4 жыл бұрын
It's crazy to think a graduate student (Tianqi Chen) came up with this... Very impressive.
@statquest
@statquest 4 жыл бұрын
Agreed. It's super impressive.
@rameshh3821
@rameshh3821 10 күн бұрын
I understood Gradient Boosting well, but I found XGBoost challenging. I'm just making a summary of XGBoost. Please let me know if this is correct:
1) Take the initial prediction as 0.5 for both regression and classification and calculate the residuals.
2) Build a decision tree to predict the residuals. These decision trees are larger than stumps. The criterion for splitting a decision tree is Gain, which is the difference between the similarity scores of the child and parent nodes.
3) The loss function for XGBoost includes regularization, which is absent in Gradient Boosting. Once a decision tree is created, incorrect predictions, along with some correct predictions, are sent to the next tree.
4) Once all the trees are built, the final prediction is given by: Initial prediction + alpha * Prediction from Decision Tree(1) + alpha * Prediction from Decision Tree(2) + ... + alpha * Prediction from Decision Tree(n), where n is the total number of trees and alpha is the learning rate.
Please let me know if any changes are needed.
@statquest
@statquest 10 күн бұрын
That seems good to me.
@rameshh3821
@rameshh3821 10 күн бұрын
​I have a quick question about regression. If the initial prediction is 0.5 and then we compute 0.5+𝛼⋅𝑓(𝐷𝑇1) +𝛼⋅𝑓(𝐷𝑇2) +…+𝛼⋅𝑓(𝐷𝑇𝑛) to get the final output wouldn't the output be too small? Please confirm this. Additionally, for regression, shouldn't the initial prediction be the average of the target variable?
@statquest
@statquest 9 күн бұрын
@@rameshh3821 1) Why would it be too small? The output from each tree is related to the residuals, which should be relatively large if 0.5 is not a good initial guess. 2) When XGBoost first came out it used 0.5 as the initial prediction for everything. Since then I believe they now use the average (or at least have it as an option), but the original authors defended the original 0.5 guess because the first few trees would make up the difference, and usually people make a lot of trees, not just a few.
@rameshh3821
@rameshh3821 9 күн бұрын
​@@statquest Yes, I understand that we obtain the residual output from each tree. However, if we multiply this residual by a very small learning rate, wouldn’t the final output end up being a small number? Example let's say we have four trees - Output from tree 1 - 45 Output from tree 2 - 35 Output from tree 3 - (-20) Output from tree 4 - (-50) Let's assume alpha (learning rate) = 0.01 Now the net output result is equal to 0.5+(45*0.01) +(35*0.01) +(-20*0.01) +(-50*0.01) = 0.5+0.1 = 0.6 So how can this be a regression model output? Please clarify.
@statquest
@statquest 9 күн бұрын
@@rameshh3821 When you decide on a learning rate, you have to consider how many trees you want to build. If you have a small learning rate, then you should build a lot of trees. Building a lot of trees takes time, but using them for inference is fast since you can run them in parallel.
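
A toy sketch of the additive prediction being discussed in this thread (variable names are mine; eta is the learning rate):

    # Final prediction = initial prediction + eta * (sum of the leaf outputs
    # that each tree assigns to this observation)
    def predict_one(tree_outputs, initial_prediction=0.5, eta=0.3):
        pred = initial_prediction
        for out in tree_outputs:
            pred += eta * out
        return pred

With a small eta each individual contribution is small, but because each new tree is fit to the residuals that are still left over, the later trees keep pushing the prediction toward the target; you simply need more of them.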
@sayantanmazumdar9371
@sayantanmazumdar9371 2 жыл бұрын
If I am right here, then finding the output value is just like gradient descent on the loss function, like we do in neural networks.
@statquest
@statquest 2 жыл бұрын
It's a little different here, in that we have an exact formula for the output value and do not need to iterate to find it.
@knightedpanther
@knightedpanther Жыл бұрын
Thanks Josh. You are awesome. Please let me know if I got this right: for Gradient Boosting, we are fitting a regression tree, so the loss function is just the sum of squared residuals, and when deciding a split we just try to minimize it. For XGBoost, they modified the loss function by adding the regularization term, so when deciding a split we could just try minimizing this new loss function. However, they decided to flip it for clarity (or for other purposes, like maximization instead of minimization, which we don't know), called it similarity, and we try to maximize it when deciding a split.
@statquest
@statquest Жыл бұрын
Both methods have two different loss functions, depending on whether they are performing regression or classification. Since you are interested in these details, I would strongly recommend that you watch all 4 gradient boost videos and all 4 of the xgboost videos very carefully. You can find them here: statquest.org/video-index/
@knightedpanther
@knightedpanther Жыл бұрын
@@statquest Hi Josh, thank you. I have already watched the videos. After your comment, I looked up my notes which I made while watching them. For Gradient Boosting, even though the loss functions are different (Sum of Squared Residuals for regression and log loss for classification), when we are fitting an individual tree for both cases, we try to minimize the sum of squared residuals when deciding a split. But the output value for both cases are different. For regression case, it is just the mean of the residuals in that leaf but for classification, it is sum of residuals divided by sum of pi(1-pi) for all observations in that leaf. For Extreme Boosting tree, the split condition is also different for regression and classification. The definition of similarity score changes. For regression it is sum of residuals squared divided by number of residuals + lambda. For classification, it is sum of residuals squared divided by sum of pi(1-pi) for all terms + lambda. The output values are also different just like Gradient Boosting. Now My question is why don't we change the split condition in gradient boosting for classification like it is done in Extreme Gradient Boosting?
@knightedpanther
@knightedpanther Жыл бұрын
Referring to this video: kzfaq.info/get/bejne/idqHjJiCvLO8c6s.html&ab_channel=StatQuestwithJoshStarmer... for gradient boosting..If we put the calculated value of gamma back in the loss function equation, we will get something like sum of squared residuals for all observations divided by sum of p(1-p) for all observations. Why don't we use this as the split criteria for gradient boosting classification like we do in XGBoost?
@statquest
@statquest Жыл бұрын
@@knightedpanther Gradient boosting came first. XGBoost improved on it. If you want what XGBoost offers, just use it instead.
@knightedpanther
@knightedpanther Жыл бұрын
@@statquest Thanks Josh. I was just trying to understand if there was a mathematical or logical reasoning behind what these two algorithms were doing that I missed.
@laveenabachani
@laveenabachani 2 жыл бұрын
Amazing! The human race thanks you for making this vdo.
@statquest
@statquest 2 жыл бұрын
Thank you! :)
@aditya4974
@aditya4974 4 жыл бұрын
Triple Bam with part 3!! Thank you so much.
@statquest
@statquest 4 жыл бұрын
Thanks! :)
@jingzhouzhao8609
@jingzhouzhao8609 7 ай бұрын
Merry Christmas Josh, 😊 Just a quick observation: at 11:00, I noticed that p_i represents the previous predicted value, therefore, p_i-1 might be a better notation to denote this.
@statquest
@statquest 7 ай бұрын
I'm just trying to be consistent with the notation used in the original manuscript.
@Cathy55Ms
@Cathy55Ms 2 жыл бұрын
Great tutorial materials to whom need the fundamental idea of those methods! Do you plan to publish videos on ligntGBM and catGBM too?
@statquest
@statquest 2 жыл бұрын
I hope so!
@RishabhJain-u9r
@RishabhJain-u9r 3 күн бұрын
can you possibly proof-read this, please. Step 1: Calculate a structure score for all three nodes in the stump. Structure score is given by sum of square of residuals of the observations divided by the number of observations, plus a factor called 𝛾 which is used to avoid overfitting. (As 𝛾 increases, the effect of the score of each single tree decreases in getting the outcome of the model.) Step 2: Calculating the Gain, which is the difference between the above structure score for the parent node and the sum of the structure scores for child nodes. Step 3: If the gain is positive, then the above split is made. The true derivation of this structure score is quite interesting and can be found in the original paper by University of Washington researchers.
@VBHVSAXENA82
@VBHVSAXENA82 4 жыл бұрын
Great video! Thanks Josh
@statquest
@statquest 4 жыл бұрын
Thanks! :)
@dr.kingschultz
@dr.kingschultz Жыл бұрын
your videos are awesome
@statquest
@statquest Жыл бұрын
Thank you so much 😀!
@bktsys
@bktsys 4 жыл бұрын
Keep going the Quest!!!
@statquest
@statquest 4 жыл бұрын
Hooray! :)
@mengzhou193
@mengzhou193 2 жыл бұрын
Hi Josh! Amazing videos! I have one question at 6:39, you replace p_i with (initial prediction+output value), but according to part 1&2, I think it should be (initial prediction+eta*output value), am I right about this?
@statquest
@statquest 2 жыл бұрын
To keep that math simple, just let eta = 1.
@TennyZ-mw2jb
@TennyZ-mw2jb Жыл бұрын
@@statquest Thanks for your clarification. I guess you may need to mention it in the video next time you simplify something, because I got confused in this part too.
@kunlunliu1746
@kunlunliu1746 4 жыл бұрын
Hi Josh, great videos, learned a ton. Are you gonna talk about the other parts of XGBoost, like quantile? Looking forward it!
@statquest
@statquest 4 жыл бұрын
Yes, that comes up in Part 4, which should be out soon.
@anunaysanganal
@anunaysanganal Жыл бұрын
Thank you for this great tutorial! I had a question regarding the similarity score; why do we need a similarity score in the first place? Why can't we just use a normal decision tree with MSE as a splitting criterion like in GBT?
@statquest
@statquest Жыл бұрын
I think the main reason is that the similarity score can easily incorporate regularization penalties.
@anunaysanganal
@anunaysanganal Жыл бұрын
@@statquest Got it! Thank you so much!
@knightedpanther
@knightedpanther Жыл бұрын
I had similar doubt. Please correct me if I am wrong. This is what I gathered from the video: For Gradient Boosting, we are fitting a regression tree so the loss function is just sum of squared residuals. When deciding a split we just try to minimize the sum of squared residuals. For XGboosting they modified the loss function by adding the regularization term. So when deciding a split, we can just try minimizing this new loss function. However they decided to flip it for clarity (or other purposes like maximization instead of minimization which we don't know) and called it similarity and we try to maximize it when deciding a split.
@ayenewyihune
@ayenewyihune Жыл бұрын
Super clear
@statquest
@statquest Жыл бұрын
Thanks!
@sureshparit2988
@sureshparit2988 4 жыл бұрын
Thanks Josh! Could you please make a video on LightGBM or share the difference between LightGBM and XGBoost?
@statquest
@statquest 4 жыл бұрын
It's on the to-do list.
@suzyzang1659
@suzyzang1659 4 жыл бұрын
I was waiting for this for a very long time, cannot wait to learn!! May I please know when the part 4 will come out? Can you help to introduce how to realize XGBoost in R or Python? Thank you!!
@statquest
@statquest 4 жыл бұрын
Part 4 should be out soon - earlier for you, since you support StatQuest and get early access. I'll also do a video on getting XGBoost running in R or Python.
@suzyzang1659
@suzyzang1659 4 жыл бұрын
@@statquest Thank you! Hurray!
@rodriguechidiac8648
@rodriguechidiac8648 4 жыл бұрын
@@statquest Can you add to that part a grid search as well once you do the video? Thanks a lot, awesome videos.
@suzyzang1659
@suzyzang1659 4 жыл бұрын
@@statquest Can you please help explain how XGBoost deals with missing values in R or Python? I was running an XGBoost model but the program cannot continue if there is a missing value in my data set. Thank you!
@statquest
@statquest 4 жыл бұрын
@@suzyzang1659 Wow, that is strange. In theory XGBoost should work with missing data just fine. Hmmm....
@chrischu2476
@chrischu2476 2 жыл бұрын
This is the best educational channel that I've ever seen. There seems to be a little problem at 18:02, when you convert L(yi, pi) to L(yi, log(odds)i). I thought pi is equal to e^log(odds) / (1 + e^log(odds)). Please tell me if I am wrong or misunderstand something. Thanks a lot.
@statquest
@statquest 2 жыл бұрын
This is explained in the video Gradient Boost Part 4 here: kzfaq.info/get/bejne/idqHjJiCvLO8c6s.html NOTE: There is a slight typo in that explanation: log(p) - log(1-p) is not equal to log(p)/log(1-p) but equal to log(p/(1-p)). In other words, the result log(p) - log(1-p) = log(odds) is correct, and thus the error does not propagate beyond its short, but embarrassing, moment.
@chrischu2476
@chrischu2476 2 жыл бұрын
Thanks a lot. You've explained very well in the Gradient Boost Part 4. I can understand how -[yi log(pi) + (1-yi) log(1-p)] converted to -yi log(odds ) + log(1+e^log(odds)) (right of the equal sign) in 18:02, but why L(yi, pi) is equal to L(yi, log(odds)I) (left of the equal sign)? Thanks for your patience to reply me.
@statquest
@statquest 2 жыл бұрын
@@chrischu2476 If we look at the right sides, of both equations, we have a function of 'p' and we have a function of 'log(odds)'. As we saw in the other video, the right hand sides are equal to each other. So, the left hand sides just show how those functions are parameterized. One is a function of 'p' and the other is a function of 'log(odds)'.
@chrischu2476
@chrischu2476 2 жыл бұрын
@@statquest Oh...I got it. Thank you again for everything you've done.
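
For anyone following this thread, here is a quick numerical check (my own sketch) that the log loss written in terms of p and the same loss written in terms of log(odds) really are the same function, where p = e^log(odds) / (1 + e^log(odds)):

    import numpy as np

    def loss_in_p(y, p):
        return -(y * np.log(p) + (1 - y) * np.log(1 - p))

    def loss_in_log_odds(y, log_odds):
        return -y * log_odds + np.log(1 + np.exp(log_odds))

    log_odds = 0.7
    p = np.exp(log_odds) / (1 + np.exp(log_odds))
    print(np.isclose(loss_in_p(1, p), loss_in_log_odds(1, log_odds)))  # True
    print(np.isclose(loss_in_p(0, p), loss_in_log_odds(0, log_odds)))  # True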
@hoomankashfi1282
@hoomankashfi1282 Жыл бұрын
You did a great job with this quest. Could you please make another quest describing how XGBoost handles multi-class classification tasks? There are several strategies in sklearn, but understanding them is another issue. Good luck!
@statquest
@statquest Жыл бұрын
Thanks! I'll keep that in mind.
@niguan7776
@niguan7776 Ай бұрын
Great job Josh! The most wonderful video among all the others I can find. Just a question about the output value for each node: why is the regularized loss function l(y, p0+O) instead of l(y, p0+eta*O) when doing the 2nd order Taylor approximation? I agree that if you set eta equal to 1, the output value will be sum of residuals/(number of residuals + lambda), but if I take eta into account, the output value is actually eta*sum of residuals/(number of residuals*squared eta + lambda), if I am correct.
@statquest
@statquest Ай бұрын
What time point, minutes and seconds, are you asking about specifically?
@niguan7776
@niguan7776 Ай бұрын
@@statquest it’s at 6:42:)
@statquest
@statquest Ай бұрын
@@niguan7776 If the question is why I left eta out of the regularized loss function, it is because it was also omitted from the derivations in the original manuscript: arxiv.org/pdf/1603.02754
@xiaoyuchen3112
@xiaoyuchen3112 4 жыл бұрын
Fantastic video! I have a small question: if we calculate the similarity based on the gradient and second-order gradient, how can these similarities be additive? That is to say, why can we add the similarities in different leaves and compare them with the similarity in the root?
@statquest
@statquest 4 жыл бұрын
The formula for calculating similarity scores is just a scoring function. For more details, see: arxiv.org/pdf/1603.02754.pdf
@iraklisalia9102
@iraklisalia9102 3 жыл бұрын
Thank you Josh for the great explanation! I was confused at the part where Cover equaled denominator minus lambda as I thought we were supposed to subtract Lambda and got confused in the last video, but here it clicked that you meant minus as in without lambda :D I'm super stoked about Taylor series as it seems like quite an important part in ML, any chances that you will do Taylor series clearly explained video in a near future? :)
@statquest
@statquest 3 жыл бұрын
I hope so! It's one of the keys to understanding how a lot of things work in machine learning.
@nishalc
@nishalc 2 жыл бұрын
Thanks for the great video. I'm wondering how other regression methods such as poisson, gamma and tweedie relate to what is shown in the video here. I imagine the outputs of the trees in these cases are similar to the case of regression, as we are estimating the expected value of the distribution in question. On the other hand, the loss function would be the negative log likelihood for the distribution in question. If anyone has any details of how these methods work it would be much appreciated!
@statquest
@statquest 2 жыл бұрын
In the context of "xgboost" and pretty much all other machine learning methods, the word "regression" doesn't refer to linear regression specifically, but simply to any method that predicts a continuous value. So I'm not sure it makes sense to compare this to Poisson regression specifically, other than to say that XGBoost's "regression" does not depend on any specific distribution.
@nishalc
@nishalc 2 жыл бұрын
@@statquest thanks for the reply! So with these methods would xgboost simply use the negative log likelihood of the distribution in question as the loss function and take the derivative to be the output of each tree?
@statquest
@statquest 2 жыл бұрын
@@nishalc XGBoost does not use a distribution.
@nishalc
@nishalc 2 жыл бұрын
@@statquest hmm in that case how do these specific (gamma/poisson/tweedie) regressions work?
@statquest
@statquest 2 жыл бұрын
@@nishalc en.wikipedia.org/wiki/Poisson_regression
@tulanezhu
@tulanezhu 3 жыл бұрын
Really helped me a lot understanding the math behind XGB. This is awesome! For regression, you said XGB used 2nd order Taylor approximation to derive the leaf output, while general gradient boost use 1st order Taylor. From what I understand other than the lambda regularization term, they just end up with the same answer, which is sum of residuals/number of residuals in that leaf node, right?
@statquest
@statquest 3 жыл бұрын
That's what I recall. The difference in the taylor expansion is due to the regularization term.
@iOSGamingDynasties
@iOSGamingDynasties 3 жыл бұрын
I am learning XGBoost and this has helped me greatly! So thank you Josh. One question, at 7:18, in the loss function, the term p_i^0 is the total value from previous trees? That being said, p_2^0 would be initial value 0.5 + eta * (output value of the leaf from the first tree), am I right?
@statquest
@statquest 3 жыл бұрын
I believe you are correct.
@iOSGamingDynasties
@iOSGamingDynasties 3 жыл бұрын
@@statquest Yay! Thank you :)
@praveerparmar8157
@praveerparmar8157 3 жыл бұрын
Thank God you skipped the fun parts 😅😅. They were already much fun in the Gradient Boost video 😁😁
@statquest
@statquest 3 жыл бұрын
bam! :)
@RishabhJain-u9r
@RishabhJain-u9r Күн бұрын
Why would XGBoost have tree depth as a hyper-parameter with a default value of 6 when all we use is Stumps with a depth of 1!
@statquest
@statquest Күн бұрын
The stumps are used to just provide examples of how the math is done. Usually you would use larger trees.
@Maepai_connect
@Maepai_connect 2 ай бұрын
Love the channel always! QQ - why is the initial prediction 0.5 and not an average of all observations? 0.5 could be too far fetched with continuous data for regressions.
@statquest
@statquest 2 ай бұрын
That was just the default they set it to when XGBoost first came out. The reasoning was that the first few trees would significantly improve the estimate.
@Maepai_connect
@Maepai_connect Ай бұрын
@@statquest thank you for answering! Does it now use average?
@statquest
@statquest Ай бұрын
@@Maepai_connect I believe it does now, but it might also be configurable.
@christianrange8987
@christianrange8987 Жыл бұрын
Great video!! Very helpful for my current bachelor's thesis!🙏 Since I want to use the formulas for the Similarity Score and Gain in my thesis, how can I reference them? Do you know if there is any official literature, like a book or paper, where they are mentioned, or do I have to show the whole math in my thesis to get from Tianqi Chen's formulas to the Similarity Score?
@statquest
@statquest Жыл бұрын
You can cite the original manuscript: arxiv.org/pdf/1603.02754.pdf
@Gabbosauro
@Gabbosauro 3 жыл бұрын
Looks like we just apply the L2 Ridge reg param but what about the L1 Lasso regularization parameter? Where is it applied in the algorithm?
@statquest
@statquest 3 жыл бұрын
The original manuscript only includes the L2 penalty. However, presumably the L1 penalty is included in an elastic-net style.
@RishabhJain-u9r
@RishabhJain-u9r 3 күн бұрын
Hey Josh, is the similarity score essentially the irregularity score from the original paper, with a negative sign in front of it? Thanks!
@statquest
@statquest 3 күн бұрын
I'm not sure what you are referring to as the irregularity score, but the similarity score in the video refers each term that includes summations in equation 7 on page 3 of the original manuscript. Although they refer to that equation as L_split, in the algorithm sections of the manuscript they call it "gain". To see "gain" in action, see: kzfaq.info/get/bejne/hdp0a9qHxqzRZnk.html
@wenzhongzhao627
@wenzhongzhao627 4 жыл бұрын
Thanks Josh for the great series of ML videos. They are really "clearly explained". I have a question regarding the calculation of g_i and h_i for the XGBoost classification case, where you used log(odds) as the variable to take the first/second derivatives. However, you used p_i as the variable to perform the Taylor expansion. Will that cause any issue? I assume that in the classification case you have to use log(odds) to perform the Taylor expansion and variable update instead of p_i, as in the regression case.
@statquest
@statquest 4 жыл бұрын
If you want details about how we can work with both probabilities and logs without causing problems, check out the video: Gradient Boost Part 4, Classification Details: kzfaq.info/get/bejne/idqHjJiCvLO8c6s.html
@ParepalliKoushik
@ParepalliKoushik 3 жыл бұрын
Thanks for the detailed explanation Josh. Why doesn't XGBoost need feature scaling, even though it uses gradients?
@statquest
@statquest 3 жыл бұрын
Because it is based on trees.
@cici412
@cici412 2 жыл бұрын
Thanks for the video. I have one question that I'm struggling with: at 7:58, why is the new predicted value not (0.5 + learning rate X Output Value)? Why is the learning rate omitted when computing the new predicted value?
@statquest
@statquest 2 жыл бұрын
If we included the learning rate at this point, then the optimal output value would end up being scaled to compensate for its effect. By omitting it at this stage, we can scale the output value (make it smaller) to prevent overfitting later on.
@cici412
@cici412 2 жыл бұрын
@@statquest Thank you for the reply! appreciate it.
@Stoic_might
@Stoic_might 2 жыл бұрын
How many decision trees should there be in our XGBoost algorithm? And how do we calculate this?
@statquest
@statquest 2 жыл бұрын
I answer this question in this video: kzfaq.info/get/bejne/fdh6g5x3sbyXdnk.html
@pratt3000
@pratt3000 8 ай бұрын
I understand the derivation of the Similarity Score but didn't quite get the reasoning behind flipping the parabola and taking the y coordinate. Could someone explain?
@statquest
@statquest 8 ай бұрын
You know, I never really understood that either. So, let me know if you figure something out.
@ilia8265
@ilia8265 2 жыл бұрын
Can we have a study guide for XGBoost plz plz plz plz 😅
@statquest
@statquest 2 жыл бұрын
I'll keep that in mind for sure.
@maddoo23
@maddoo23 2 жыл бұрын
At 5:27, wouldn't gamma also decide whether a node gets built (going by the original paper; I'm not able to post the link)? You wouldn't have to prune a node if you don't build it.
@statquest
@statquest 2 жыл бұрын
If you look at equation 6 in the original paper, it shows, in theory, how 'T' could be used to build the optimal tree. However, that equation isn't actually used because it would require enumerating every single tree to find the best one. So, instead, we use a greedy algorithm and equation 7, which is the formula that is used in practice for evaluating the split candidates, and equation 7 does not include 'T'. Now, the reason we don't prune as we go is that when using the greedy algorithm, we can't know if a future split will improve the trees performance significantly. So we build all the branches first.
@maddoo23
@maddoo23 2 жыл бұрын
@@statquest 'T' is not there in eq 7, but 'gamma' is there in equation 7 (deciding whether or not to split). For positive gamma, it always encourages pruning. I couldn't find anything in the paper about not using 'gamma' to build the tree because it might lead to counterproductive pruning in the greedy approach. However, I agree with your point that 'gamma' should be avoided while building the tree. Thanks!
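
Since this thread is about gamma, here is a rough sketch of the bottom-up pruning rule described in Part 1 (my own code, not the library's): a branch is pruned when its Gain minus gamma is negative, working from the lowest branch up, and pruning stops once a branch survives.

    def prune_bottom_up(gains_from_leaf_to_root, gamma):
        # gains_from_leaf_to_root: Gain of each branch, lowest branch first
        kept = []
        for gain_value in gains_from_leaf_to_root:
            if not kept and gain_value - gamma < 0:
                continue                 # prune this branch and look at its parent
            kept.append(gain_value)      # once a branch survives, everything above it stays
        return kept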
@tingtingli8904
@tingtingli8904 4 жыл бұрын
Thank you so much for the videos. And I have a question: the pruning can be done after the tree is built - if the difference between the gain and gamma is negative, we can remove the branch. Could you explain this? Can we reach this conclusion from the math details? Thank you.
@statquest
@statquest 4 жыл бұрын
I'm pretty sure I show this in the video. What time point are you asking about?
@ineedtodothingsandstuff9022
@ineedtodothingsandstuff9022 Жыл бұрын
Hello, thanks for the video. Just one question: are the splits done based on the residuals all the time, or on the gradients? For instance, if I use a different loss function, the gradient might have a different calculation. In this case, do we still use residuals to do the splits or do we use the respective gradients from the given loss function? Thanks a lot!
@statquest
@statquest Жыл бұрын
I believe you would use the gradients.
@benjaminlu9886
@benjaminlu9886 4 жыл бұрын
Hi Josh, what is Ovalue? Is that the result that the tree outputs? Wouldn't then the output value be 0.5/predicted drug effectiveness in the first example? Or is the output value a hyper parameter that is used in regularization? Also, BIG thanks for all these videos!
@statquest
@statquest 4 жыл бұрын
At 5:54 I say that Ovalue is the "output value for a leaf". Each leaf in a tree has its own "output value". For more details about what that means, check out XGBoost Parts 1 ( kzfaq.info/get/bejne/hdp0a9qHxqzRZnk.html ) and 2 ( kzfaq.info/get/bejne/bshhfah128vSgYk.html ), as well as the series of videos on Gradient Boost ( kzfaq.info/get/bejne/aalzZ7Fl35mrepc.html )
@karannchew2534
@karannchew2534 3 жыл бұрын
21:50 Why does the highest point of the parabola give the Similarity Score please? What exactly is the Similarity Score definition?
@statquest
@statquest 3 жыл бұрын
We derive this starting at 22:18
@mostafakhalid8332
@mostafakhalid8332 3 ай бұрын
Is the second-order Taylor polynomial used only to simplify the math? Is there another objective?
@statquest
@statquest 3 ай бұрын
It makes the unsolvable non-linear equation solvable.
@yujiezhao9825
@yujiezhao9825 3 жыл бұрын
What is the difference between the method in this video and gradient boost (classification & regression)? Does the only difference lie in the penalty term? Is the gradient boost (classification & regression) you introduced previously the same as GBDT?
@statquest
@statquest 3 жыл бұрын
Gradient Boost (GBDT) uses standard, off the shelf decision trees that minimize GINI impurity or the sum of squared residuals. XGBoost uses a completely different type of tree that is based on the math in this video. This redesign allows the tree building to be highly optimized. Furthermore, XGBoost includes a lot of other fancy optimizations explained here: kzfaq.info/get/bejne/pbiifsiGqKvGoWw.html
@knightedpanther
@knightedpanther Жыл бұрын
@@statquest Hi Josh, as per the videos and your replies to queries on other videos, Gradient Boosted trees only use the sum of squared residuals for both regression and classification?
@zahrahsharif8431
@zahrahsharif8431 4 жыл бұрын
Maybe this is basic, but the Hessian matrix is a matrix of what, exactly? Is it the partial second-order derivatives of the loss function with respect to what - just the log odds? Just trying to see the bigger picture here, applying it to the training data.
@zahrahsharif8431
@zahrahsharif8431 4 жыл бұрын
Also you are looking at one feature in this example. If we have say 20 features how would the above be different??
@statquest
@statquest 4 жыл бұрын
If you have multiple features, you calculate the gain for all of them. The Hessian is the partial second order derivatives. In the case of XGBoost, those derivatives can be with respect to the log(odds) (for classification) or the predicted values (in regression).
@davidd2702
@davidd2702 2 жыл бұрын
Thank you for your fabulous video! I enjoy it and understand it well! Could you tell me whether the output from the XGB classifier gives 'confidence' in a specific output (allowing you to assign a class)? Is this functionally equivalent to the statistical probability of an event occurring?
@statquest
@statquest 2 жыл бұрын
Although the output from XGBoost can be a probability, I don't think there is a lot of statistical theory behind it.
@ShubhanshuAnand
@ShubhanshuAnand 4 жыл бұрын
The expression for the output value looks very similar to the GBDT gamma with an L2 regularizer. We can always use the first-order derivative with SGD optimization to get the minimum, as we do in other optimization problems, so why use the Taylor expansion? Does the Taylor expansion give faster convergence?
@statquest
@statquest 4 жыл бұрын
SGD only takes a step towards the optimal solution. The Taylor expansion *is* (an approximation of) the optimal solution. The difference is subtle, but important in this case.
@ranerev2480
@ranerev2480 4 жыл бұрын
thanks for the video! could you explain why the y-axis coordinate for maximum value of the "flipped" parabola is actually the similarity score?
@statquest
@statquest 4 жыл бұрын
We could make the low point be the "inverse-similarity score", but flipping it over and using a maximum value instead of a minimum value makes my head hurt less. In other words, it's just defined that way.
@ranerev2480
@ranerev2480 4 жыл бұрын
@@statquest so the similarity score is just defined as the minimum of the loss function? (Which is the value of the loss function at the output value?)
@statquest
@statquest 4 жыл бұрын
@@ranerev2480 The similarity is the maximum of -1*the loss function.
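
To spell out the algebra behind the flip this thread is asking about, here is a small sympy sketch (my own, with G = the sum of the g_i and H = the sum of the h_i in a leaf): minimizing the second-order approximation of the loss gives the Output Value, and -1 times the loss at that minimum gives the Similarity Score (up to a factor of 1/2 that is common to every term, so it can be dropped when comparing splits).

    import sympy as sp

    G, H, lam, O = sp.symbols('G H lam O')
    L = G * O + sp.Rational(1, 2) * (H + lam) * O**2   # 2nd-order approximation, constants dropped
    O_star = sp.solve(sp.diff(L, O), O)[0]             # -G / (H + lam), the Output Value
    print(O_star)
    print(sp.simplify(-L.subs(O, O_star)))             # G**2 / (2*(H + lam)), the flipped minimum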
@robertocorti4859
@robertocorti4859 2 жыл бұрын
Hi! Amazing video, thank you for this great content! I have a question; maybe it's stupid, but it's just to make everything clear: after you proved that the optimal output value is computed by minimizing the loss + L2 penalty, approximating this function with a 2nd-order Taylor approximation, I still don't get the next step, where you improve the prediction by creating nodes such that the gain in terms of similarity is bigger. Of course I know that by building such a tree you would improve the optimal output value, since the first guess comes from a 2nd-order approximation, but I still don't get how you prove this mathematically. Thank you again!
@statquest
@statquest 2 жыл бұрын
What time point, minutes and seconds, are you asking about?
@robertocorti4859
@robertocorti4859 2 жыл бұрын
@@statquest at minute 20:09 when you said that we need to derive the equations of the similarity score so we can grow the tree. Maybe I didn't explain well, my question is related on how the loss optimization is connected with the tree growth algorithm based on gain in similarity that you explained in Part 1 and Part 2. Is this a procedure that helps us to refine the optimal output value guess (done by minimizing the 2nd order approximation of the loss function ) ?
@statquest
@statquest 2 жыл бұрын
@@robertocorti4859 The derivation of the similarity scores that we do in this video results in the similarity scores that I introduced in parts 1 and 2. In those videos, I just said, "Hey! This is the similarity score that we are going to use!". In contrast, in this video, I'm saying "Hey! Remember those similarity scores? Well, this is where they come from."
@midhileshmomidi2434
@midhileshmomidi2434 3 жыл бұрын
So to get the output value and similarity score, this huge amount of calculation (double derivatives) is required. No wonder XGBoost takes a lot of training time. One doubt, Josh: while running the model to calculate the output value and similarity score, does it just calculate the formulae, or does it go through this whole huge process?
@statquest
@statquest 3 жыл бұрын
The whole process described in this video derives the final formulas that XGBoost uses. Once derived, only the final formulas are used. So the computation is quite fast. To see the final formulas in action, see: kzfaq.info/get/bejne/hdp0a9qHxqzRZnk.html and kzfaq.info/get/bejne/bshhfah128vSgYk.html
@Thamizhadi
@Thamizhadi 2 жыл бұрын
Hello Josh, thank you for demystifying the loss function of the XGBoost regression model. I have a small doubt: where is the regularisation term related to the L1 penalty (lasso)? Could you provide a related reference including this term?
@statquest
@statquest 2 жыл бұрын
The original manuscript only includes the L2 penalty, so that is all I have covered.
@Thamizhadi
@Thamizhadi 2 жыл бұрын
@@statquest oh okay . thank you
@rahelehmirhashemi5213
@rahelehmirhashemi5213 4 жыл бұрын
love you man!!! :D
@statquest
@statquest 4 жыл бұрын
Thank you! Part 4, which covers XGBoot's optimizations, will be available for early access in one week.
@FedorT54
@FedorT54 Жыл бұрын
Hi! Can you please explain to me: for a model with number of trees = 1 and depth = 1, is the similarity score its log loss minimum value?
@statquest
@statquest Жыл бұрын
You can get a sense of how this would work by watching my video on Gradient Boost that uses the log likelihood as the loss function: kzfaq.info/get/bejne/idqHjJiCvLO8c6s.html
@mamk687
@mamk687 4 жыл бұрын
Thank you for the great video as usual! I am currently confused by the differences between word definitions such as method, algorithm, technique, and approach. Could you tell me how you use those words appropriately and give me some references I can see?
@statquest
@statquest 4 жыл бұрын
"Method" and "technique" are very closely related and in conversation can be used interchangeably. "Algorithm", is like "method" and "technique", but it is specifically applied to the steps required to make a computer program work.
@mamk687
@mamk687 4 жыл бұрын
@@statquest Thank you for the reply! I am looking forward to seeing the next video!
@statquest
@statquest 4 жыл бұрын
@@mamk687 The new video is available right now to channel members and patreon supporters. It will be available to everyone else in a week or two.
@amitbisht5445
@amitbisht5445 3 жыл бұрын
Hi @JoshStarmer, could you please help me understand how taking the second-order gradient in the Taylor series helped in reducing the loss function?
@statquest
@statquest 3 жыл бұрын
At 10:58 I say that we use the Taylor Series to approximate the function we want to optimize because the Taylor Series simplifies the math.
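
To see what that simplification buys, here is a quick numerical sketch (my own example values) comparing the true log loss, written as a function of the log(odds), to its second-order Taylor approximation around the current prediction:

    import numpy as np

    y, log_odds, O = 1.0, 0.3, 0.2           # current prediction and a candidate output value
    p = np.exp(log_odds) / (1 + np.exp(log_odds))
    loss = lambda lo: -y * lo + np.log(1 + np.exp(lo))

    g, h = p - y, p * (1 - p)                 # first and second derivatives at log_odds
    taylor = loss(log_odds) + g * O + 0.5 * h * O**2

    print(loss(log_odds + O), taylor)         # the two numbers are very close

The approximation is a simple quadratic in O, which is why the optimal output value can be written down in closed form instead of being searched for.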
@mohammadelghandour1614
@mohammadelghandour1614 2 жыл бұрын
At 18:06, how did (1-yi) log(1-Pi) end up like this: log(1+e^(log(odds)))?
@statquest
@statquest 2 жыл бұрын
See: kzfaq.info/get/bejne/idqHjJiCvLO8c6s.html
@rrrprogram8667
@rrrprogram8667 4 жыл бұрын
MEGAA BAAMMMMMM is backkk...
@statquest
@statquest 4 жыл бұрын
Hooray!!! :)
@karthikeyapervela3230
@karthikeyapervela3230 Жыл бұрын
@statquest I am trying to work out a problem on pen and paper, but with 4 features instead of 1. Once the split is made on 1 feature, does it proceed to another feature? What happens next?
@statquest
@statquest Жыл бұрын
At each potential branch, each feature and threshold for that feature are tested to find the best one.
@rubyjiang8836
@rubyjiang8836 3 жыл бұрын
cool~~~
@statquest
@statquest 3 жыл бұрын
Bam! :)
@jhlee8796
@jhlee8796 4 жыл бұрын
Thanks for great lecture. Where can I get your beamer pdf file?
@vinodananddixit7267
@vinodananddixit7267 4 жыл бұрын
Hi, at 18:17 I can see that you have converted pi = e^log(odds)/(1 + e^log(odds)). Can you please let me know how it has been converted? I am stuck at this point. Any reference/help would be appreciated.
@statquest
@statquest 4 жыл бұрын
kzfaq.info/get/bejne/eMx7lNGdlse3d2Q.html
@user-ns2en2gs5h
@user-ns2en2gs5h 10 ай бұрын
100 years later, AI will come to this channel to learn what their great-grandfather looks like 😊
@statquest
@statquest 10 ай бұрын
Ha! BAM! :)
@Stoic_might
@Stoic_might 2 жыл бұрын
How many decision trees should there be in our XGBoost algorithm?
@statquest
@statquest 2 жыл бұрын
I answer this question in this video: kzfaq.info/get/bejne/fdh6g5x3sbyXdnk.html
@Stoic_might
@Stoic_might 2 жыл бұрын
@@statquest ok thank you
@strzl5930
@strzl5930 3 жыл бұрын
Are the output values for the trees that are denoted as O in this video equivalent to the output values denoted as gamma in the gradient boosting videos?
@statquest
@statquest 3 жыл бұрын
No. XGBoost includes regularization in the calculation of the output values. Regular gradient boost does not.
@lucaslai6782
@lucaslai6782 4 жыл бұрын
Hello Josh, why must -(yi - pi) be negative? (yi - pi) can be smaller than 0, right? And the negative of a negative is positive, right? Am I missing something? Thank you.
@statquest
@statquest 4 жыл бұрын
Can you specify the time point in the video (minutes and seconds) that you are asking about?
@lucaslai6782
@lucaslai6782 4 жыл бұрын
@@statquest Hello Josh, at 15:33, gi = -(yi - pi). In other words, gi is the negative residual.
@statquest
@statquest 4 жыл бұрын
That's just how the math works out. The first derivative (the gradient) is the negative residual. Ultimately, when we plug data into this equation, we can get positive and negative values - in this case, those values will cancel each other out, the numerator will be small, and thus the output value and the similarity score will be small.
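
A tiny numerical illustration of that cancellation (my own numbers):

    # Residuals with mixed signs cancel in the numerator of the Similarity Score,
    # so a leaf that mixes them gets a low score; residuals that agree do not cancel.
    lam = 0
    mixed   = [7.5, -7.5]
    similar = [7.5, 6.5]
    print(sum(mixed) ** 2 / (len(mixed) + lam))      # 0.0
    print(sum(similar) ** 2 / (len(similar) + lam))  # 98.0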
@venkateshmunagala205
@venkateshmunagala205 2 жыл бұрын
Can you please help me understand why we multiplied the equation by negative one (-1) to get the similarity score, which makes the parabola inverted? @ timestamp 21:29
@statquest
@statquest 2 жыл бұрын
I'm not sure I understand your question. We multiply the equation by -1 to flip it over, so that the problem becomes a maximization problem, rather than a minimization problem.
@venkateshmunagala205
@venkateshmunagala205 2 жыл бұрын
@@statquest Thanks for the reply. I need to know the specific reason for flipping it to make it a maximisation problem. Btw, I bought your book but I don't see GBDT and XGBoost in it.
@statquest
@statquest 2 жыл бұрын
@@venkateshmunagala205 To be honest, you might be better off asking the guy who invented XGBoost. I can only see what he did and can only guess about the exact reasons. Perhaps he wanted to call the splitting criterion "gain", and in that case it makes sense to make it something we maximize.
@venkateshmunagala205
@venkateshmunagala205 2 жыл бұрын
@@statquest BAM Thank you Josh.
@abhzz3371
@abhzz3371 5 ай бұрын
7:43, how did you get 104.4? I'm getting 103.... could anyone explain?
@statquest
@statquest 5 ай бұрын
That's a typo! It should be 103. When I did the math I forgot to subtract 0.5 from each residual.
@MrDiego163
@MrDiego163 4 жыл бұрын
Great video! Could you make one about the Naive Bayes classifier? :)
@statquest
@statquest 4 жыл бұрын
Yes, I should have one in the next month or so.
@MrDiego163
@MrDiego163 4 жыл бұрын
@@statquest Looking forward to it!
@hampirpunah2783
@hampirpunah2783 3 жыл бұрын
I have a question: I did not find your formula in the XGBoost Tianqi Chen paper. Can you explain the original XGBoost formula?
@statquest
@statquest 3 жыл бұрын
Which formula in the Tianqi Chen paper are you asking about? Most of them are in this video.
@hampirpunah2783
@hampirpunah2783 3 жыл бұрын
@@statquest In the Tianqi Chen paper, formula number (2); and then your similarity score and output value formulas - I can't find those formulas there.
@statquest
@statquest 3 жыл бұрын
@@hampirpunah2783 Equation 2 refers to a theoretical situation that can not actually be solved. It assumes that we can find a globally optimal solution and, to quote from the manuscript: "The tree ensemble model in Equation 2 includes functions as parameters that cannot be optimized..." Thus, we approximate equation 2 by building trees in an additive manner (i.e. boosting) and this results in equation 3, which is the equation that XGBoost is based on. Thus, in order to explain XGBoost, I start with equation 3. Also, the similarity score in my video is equation 4 in the manuscript and the output value is equation 5 in the manuscript.
@helenjude3746
@helenjude3746 4 жыл бұрын
I would like to point out that the hessian in XGBoost for Multiclass Softmax is not exactly pi(1-pi). It is actually twice that. See source code: github.com/dmlc/xgboost/blob/master/src/objective/multiclass_obj.cc See here: github.com/dmlc/xgboost/issues/1825 github.com/dmlc/xgboost/issues/638
@statquest
@statquest 4 жыл бұрын
Thanks for the clarification. In this video we're only talking about boolean classification, as described at 1:29