Logistic Regression with Variable Selection and Categorical Data Analysis in R

Views: 13,420

3 years ago

Correction for a mistake made at 23:45. I stated that "for every 10 years older, the odds of death increase by 43% while controlling for all other predictors in the model". That statement is incorrect because I multiplied the percent change in the odds of death for a one-unit increase in age (4.3%) by 10 to get 43%. The incorrect calculation was 10 * [(exp(.042)-1)*100]. I needed instead to multiply the log-odds coefficient by 10 before exponentiating. The correct calculation is (exp(.042*10)-1)*100, which is about 53%. Therefore, for every 10 years older, the odds of death increase by about 53% while controlling for all other predictors in the model. Thank you Utku Pamuksuz for spotting that.
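
A minimal sketch of the two calculations from the correction above, using the rounded coefficient 0.042 quoted in the text. The variable names are illustrative. With the rounded value the corrected figure comes out near 52%; the 53% quoted above presumably reflects the unrounded coefficient from the fitted model, which is an assumption here.

```r
# Log-odds coefficient for age, rounded as quoted in the description.
b_age <- 0.042

# Percent change in the odds of death for a 1-year increase in age.
pct_1yr <- (exp(b_age) - 1) * 100

# Incorrect: scaling the 1-year percent change linearly by 10.
pct_10yr_wrong <- 10 * pct_1yr

# Correct: scale the coefficient on the log-odds scale, then exponentiate.
pct_10yr_right <- (exp(b_age * 10) - 1) * 100

# Side-by-side comparison, rounded to one decimal place.
print(round(c(one_year = pct_1yr,
              ten_year_wrong = pct_10yr_wrong,
              ten_year_right = pct_10yr_right), 1))
```

Odds ratios compound multiplicatively, so the 10-year odds ratio equals the 1-year odds ratio raised to the 10th power, not 10 times the 1-year percent change.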
Consultation: statsguidetree@gmail.com
This video is a tutorial on how to conduct some categorical analyses using RStudio.
At 3:37 of the video, I meant to say: "Though it was not, let's just say it was *dependent (i.e., significant)". Also, my pronunciation of creatinine was a bit off; it is krē-ˈa-tə-ˌnēn.
The analyses reviewed are the chi-squared test of independence, Fisher's exact test, and logistic regression. Effect size with Cramér's V for the chi-squared test of independence is covered. Variable selection (i.e., model shrinkage) with stepwise regression and the bootstrap, and multicollinearity detection with the Variance Inflation Factor (VIF) for logistic regression models, are also covered. The dataset used is from Kaggle and contains patients with heart failure.
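
A rough sketch of the first three analyses listed above (chi-squared test, Fisher's exact test, and Cramér's V) in base R. The data are simulated and the column names `smoking` and `death` are hypothetical stand-ins; this is not the video's code or the Kaggle file.

```r
set.seed(1)

# Simulated 2x2 data (assumption: NOT the Kaggle heart failure file).
n <- 300
smoking <- factor(sample(c("no", "yes"), n, replace = TRUE))
death <- factor(rbinom(n, 1, ifelse(smoking == "yes", 0.40, 0.25)),
                levels = 0:1, labels = c("no", "yes"))

tab <- table(smoking, death)

# Chi-squared test of independence (no continuity correction).
chi <- chisq.test(tab, correct = FALSE)

# Fisher's exact test, often used when expected cell counts are small.
fis <- fisher.test(tab)

# Cramer's V = sqrt(chi-squared / (N * (min(rows, cols) - 1))).
cramers_v <- sqrt(unname(chi$statistic) / (sum(tab) * (min(dim(tab)) - 1)))

print(tab)
print(c(chi_p = chi$p.value, fisher_p = fis$p.value, cramers_v = cramers_v))
```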

Comments: 37

@rockleeroy 3 years ago

@Adelphos0101 1 year ago

Very helpful, especially the logistic regression section. Thank you.

@Vivian-ve1qt 2 years ago

Excellent explanation! It is exactly what I needed! It will help me to finish my certification project. Subscribed.

@rockleeroy 2 years ago

Thank you so much, I am glad you found it useful.

@padynz9869 2 years ago

Very logical and lucid explanation. Thank you very much.

@rockleeroy 2 years ago

Thanks so much, I am glad you liked it.

@ModupehB 3 years ago

Amazing tutorial. Thank You.

@rockleeroy 3 years ago

Thank you so much for the kind words, I am glad you found it useful.

@laxmanbisht2638 3 years ago

nicely explained.

@rockleeroy 3 years ago

Thank you for the compliment.

@humbertocardenas2096 3 years ago

Best half an hour invested!

@rockleeroy 3 years ago

Those are some very kind words. It is very much appreciated.

@whx2044 3 years ago

Thank you for teaching, very helpful! One more question: may I use stepwise selection according to p-value instead of AIC?

@rockleeroy 3 years ago

I am glad you found it useful. A p-value is specific to each independent variable (IV) in your model, and the significance of an IV can change depending on what else is included in the model. Stepwise AIC, on the other hand, considers the overall model and how that overall model changes when one variable at a time is removed. One simple alternative is to fit a model with all the IVs (a full model) and then keep only the significant IVs for the final model. But again, the issue is that which IVs have a significant p-value changes based on what you include in the model. Using stepwise AIC may result in a more parsimonious model (i.e., a model that contains the fewest IVs without compromising the overall fit). Model shrinkage approaches other than stepwise are generally preferred, which is why I include a bootstrap approach along with it. Let me know if this did not answer your question.
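
A hedged sketch of the contrast described in this reply, on simulated data (the predictors `x1` to `x3` are hypothetical). It places per-variable Wald p-values next to the AIC of each single-term deletion, then runs AIC-based backward selection with `stepAIC()` from the MASS package.

```r
library(MASS)  # provides stepAIC()

set.seed(42)

# Simulated data (assumption): x3 has no true effect on the outcome.
n  <- 400
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- rbinom(n, 1, plogis(-0.5 + 1.0 * x1 + 0.5 * x2))
dat <- data.frame(y, x1, x2, x3)

full <- glm(y ~ x1 + x2 + x3, data = dat, family = binomial)

# Wald p-values: one per predictor, conditional on the other predictors.
wald_p <- coef(summary(full))[, "Pr(>|z|)"]

# Single-term deletions: AIC of the model without each predictor,
# plus a likelihood-ratio test p-value for that deletion.
del <- drop1(full, test = "Chisq")

# AIC-based backward stepwise selection on the overall model.
sel <- stepAIC(full, direction = "backward", trace = FALSE)

print(round(wald_p, 4))
print(del)
print(formula(sel))
```

`drop1()` shows both criteria in one table, which makes it easier to see where a p-value rule and an AIC rule would agree or disagree for a given model.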

@whx2044 3 years ago

@rockleeroy Thanks so much!

@yousif533 2 years ago

Thank you for this great video. Could you please share the code and data?

@sashaewing1498 2 years ago

Thanks for this video! Can I use this same code for running a stepwise linear regression?

@rockleeroy 2 years ago

Yes, you can use the stepAIC() function for linear regression models fitted with lm() as well.
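
As a small illustration of that reply, here is a sketch of `stepAIC()` applied to an `lm()` fit. The data are simulated and the variable names are hypothetical.

```r
library(MASS)  # provides stepAIC()

set.seed(7)

# Simulated data (assumption): x3 is pure noise.
n  <- 200
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- 2 + 1.5 * x1 - 0.8 * x2 + rnorm(n)
dat <- data.frame(y, x1, x2, x3)

# Full linear model, then AIC-based stepwise selection in both directions.
full_lm <- lm(y ~ x1 + x2 + x3, data = dat)
sel_lm  <- stepAIC(full_lm, direction = "both", trace = FALSE)

print(formula(sel_lm))
```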

@sashaewing1498 2 years ago

@rockleeroy Great! Thanks for the reply!

@sophielong8937 2 years ago

How would you interpret the coefficients for a categorical variable in a logistic regression? I could not see this in your model.

@rockleeroy 2 years ago

Great question. For a categorical variable it would be very similar, but instead of talking in terms of units you would compare the category to some baseline category. For example, let's say there is a binary variable 'previous_heart_issue' (1=yes, 0=no) and the percent change in the odds was 4.3 for the term 'previous_heart_issue_yes'. To interpret the coefficient in a sentence you could say: the odds of death are 4.3% greater for patients with previous heart issues compared to patients with no previous heart issues, while controlling for all other predictors in the model. You do not have to calculate the percent; you can just use the odds ratio. I believe it was 1.04, so you can say: the odds of death are 1.04 times as high for patients with previous heart issues as for patients with no previous heart issues, while controlling for all other predictors in the model.
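
A minimal sketch of that interpretation, assuming a hypothetical factor `previous_heart_issue` with "no" as the baseline level. The data are simulated, so the resulting odds ratio will not match the 1.04 used in the example above.

```r
set.seed(3)

# Simulated data (assumption); "no" is the first level, so it is the baseline.
n <- 500
previous_heart_issue <- factor(sample(c("no", "yes"), n, replace = TRUE),
                               levels = c("no", "yes"))
age   <- round(runif(n, 40, 90))
death <- rbinom(n, 1, plogis(-4 + 0.04 * age + 0.5 * (previous_heart_issue == "yes")))
dat   <- data.frame(death, age, previous_heart_issue)

fit <- glm(death ~ age + previous_heart_issue, data = dat, family = binomial)

# Exponentiated coefficients are odds ratios.
odds_ratios <- exp(coef(fit))

# Odds ratio for "yes" relative to the baseline "no", holding age fixed.
or_yes <- odds_ratios["previous_heart_issueyes"]

# The same quantity expressed as a percent change in the odds.
pct_yes <- (or_yes - 1) * 100

print(round(c(odds_ratio = unname(or_yes), percent_change = unname(pct_yes)), 2))
```

R names the coefficient by pasting the variable name and the non-baseline level together, which is why the term reads `previous_heart_issueyes`.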

@MHRAJAI 1 year ago

For how many features does logistic regression work well? I have over 300 features. Does logistic regression work, or would you suggest another model? Thank you.

@rockleeroy 11 months ago

I do not see much of a limit; your run time will just be longer the more features you have. You may want to review your data for similar features, i.e., is there a cluster of features in your dataset that all provide the same information?
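
One possible way to look for such clusters of similar features is a pairwise correlation screen. The sketch below uses simulated numeric features, and the 0.9 cutoff is an arbitrary assumption rather than a recommendation from the video.

```r
set.seed(5)

# Simulated features (assumption): a and b carry nearly the same information.
n    <- 200
base <- rnorm(n)
feat <- data.frame(a = base + rnorm(n, sd = 0.1),
                   b = base + rnorm(n, sd = 0.1),
                   c = rnorm(n),
                   d = rnorm(n))

# Pairwise correlations; keep only the lower triangle so each pair appears once.
cm <- cor(feat)
cm[upper.tri(cm, diag = TRUE)] <- NA

# Pairs whose absolute correlation exceeds the (arbitrary) cutoff.
high <- which(abs(cm) > 0.9, arr.ind = TRUE)
similar_pairs <- data.frame(feature_1 = rownames(cm)[high[, "row"]],
                            feature_2 = colnames(cm)[high[, "col"]],
                            r = cm[high])

print(similar_pairs)
```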

@utkupasha 2 years ago

Great video. However, I think one statement needs checking. I am not sure we can say "for every 10 years older the odds of death increase by 43%". The sigmoid function is not linear, so we can't simply multiply 4.3 by 10. It depends on the X value. A one-unit increment (in our case, one year older) corresponds to an increment of beta times mu(1-mu) in the estimated probability at that specific x point.

@rockleeroy 2 years ago

That is correct, I just noticed the mistake. For that part, I used the calculation 10 * [(exp(.042)-1)*100]; it should have been (exp(.042*10)-1)*100. The answer should be about 53%, not 43%. I will add the correction to the description and the R code.
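
The two points in this exchange can be sketched side by side: on the odds scale a 10-year increase multiplies the odds by the same factor at every age, while the change in probability depends on the starting age. The intercept `b0` below is a made-up value for illustration; only the age coefficient 0.042 comes from the thread.

```r
b0    <- -4      # hypothetical intercept (assumption, not from the video)
b_age <- 0.042   # age coefficient as quoted in the thread

ages <- c(40, 60, 80)

# Predicted probability and odds at each age, and 10 years later.
p      <- plogis(b0 + b_age * ages)
p10    <- plogis(b0 + b_age * (ages + 10))
odds   <- p / (1 - p)
odds10 <- p10 / (1 - p10)

# The odds ratio for +10 years is constant: exp(10 * b_age).
odds_ratio_10yr <- odds10 / odds

# The change in probability for +10 years differs by starting age.
prob_change_10yr <- p10 - p

# Instantaneous slope of the probability per year: beta * mu * (1 - mu).
slope_per_year <- b_age * p * (1 - p)

print(data.frame(age = ages,
                 odds_ratio_10yr = round(odds_ratio_10yr, 4),
                 prob_change_10yr = round(prob_change_10yr, 4),
                 slope_per_year = round(slope_per_year, 5)))
```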

@jolima2045 7 months ago

Please, how do I do stepwise selection with glmer? Is there a package?

@manishadinesh2797 7 months ago

Can you help me interpret the interaction term in this model? logit(DEATH_EVENT)=−1.698+0.0385×age+0.8267×serum_creatinine−0.0006520×ejection_fraction×time

@rockleeroy 7 months ago

Generally, the interaction term means that the effect of ejection fraction on death is conditional on the value of time, controlling for the other variables in the model. When you include interactions, it is often a good idea to also include the main effect of each variable in the model. In addition, to make the model easier to interpret, you can center each variable before multiplying them together to form the interaction. Here is a good resource on working with interactions that goes into more detail: www.google.com/url?sa=t&source=web&rct=j&opi=89978449&url=www3.nd.edu/~rwilliam/stats2/l55.pdf&ved=2ahUKEwjGvaaX5vSCAxWgmIkEHah6AfcQFnoECCUQAQ&usg=AOvVaw3KaKU8apAO-VaPq4RXqmYS
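
A rough sketch of both suggestions. The first part uses the interaction coefficient from the question to show how the odds ratio for a one-unit increase in ejection fraction changes with time; the time values are arbitrary. The second part fits a model with main effects and centered variables on simulated data, so its estimates are illustrative only.

```r
# Part 1: interaction coefficient as quoted in the question.
b_int <- -0.0006520

# Hypothetical follow-up times at which to evaluate the effect.
time_vals <- c(30, 100, 250)

# With no main effect for ejection_fraction in that equation, a one-unit
# increase changes the log-odds by b_int * time, so the odds ratio is:
or_ef_per_unit <- exp(b_int * time_vals)
print(round(setNames(or_ef_per_unit, paste0("time_", time_vals)), 4))

# Part 2: main effects plus interaction, with centered variables.
set.seed(21)
n   <- 300
dat <- data.frame(ejection_fraction = runif(n, 20, 70),
                  time = runif(n, 4, 285))          # simulated (assumption)
dat$death <- rbinom(n, 1, plogis(1 - 0.03 * dat$ejection_fraction - 0.01 * dat$time))

# Centering: subtract the sample mean from each variable.
dat$ef_c   <- dat$ejection_fraction - mean(dat$ejection_fraction)
dat$time_c <- dat$time - mean(dat$time)

# ef_c * time_c expands to ef_c + time_c + ef_c:time_c.
fit_int <- glm(death ~ ef_c * time_c, data = dat, family = binomial)
print(names(coef(fit_int)))
```

After centering, the `ef_c` main effect can be read as the effect of ejection fraction at the average value of time rather than at time zero.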

@kirtansuvarna 2 years ago

Can you make a video on finding the best predictive model using RStudio? For example, if 70 percent of the data is used to build the model, we need to predict the remaining 30% and find the best model.

@rockleeroy 2 years ago

If I am not mistaken, I believe you are referring to model evaluation/validation. I do have a sort of part 2 to this tutorial video, where I go over a number of validation techniques (e.g., test/train, k-fold CV, etc.) for the model developed on this dataset. Here is the link: kzfaq.info/get/bejne/qr9ka6VnysXDmGQ.html
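
For orientation, a minimal sketch of the 70/30 test/train idea mentioned here, using simulated data with hypothetical columns `age` and `creat`. It is not the code from the linked video, and the 0.5 classification cutoff is an assumption.

```r
set.seed(11)

# Simulated data (assumption): two predictors and a binary outcome.
n   <- 300
dat <- data.frame(age = round(runif(n, 40, 90)),
                  creat = rlnorm(n, 0, 0.4))
dat$death <- rbinom(n, 1, plogis(-5 + 0.05 * dat$age + 1 * dat$creat))

# Randomly assign 70% of rows to training; the rest form the test set.
train_idx <- sample(seq_len(n), size = floor(0.7 * n))
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

# Fit on the training data only.
fit <- glm(death ~ age + creat, data = train, family = binomial)

# Predict probabilities for the held-out rows and classify at 0.5.
p_hat <- predict(fit, newdata = test, type = "response")
pred  <- as.integer(p_hat > 0.5)

# Proportion of held-out rows classified correctly.
accuracy <- mean(pred == test$death)
print(accuracy)
```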

@kirtansuvarna 2 years ago

@rockleeroy Thank you for replying. Do you have any video on propensity scores with logistic regression?

@rockleeroy 2 years ago

@kirtansuvarna I haven't done any yet, but I was thinking of doing a tutorial on propensity scores in the very near future.

@adriansoto2107 3 years ago

Many many thanks for introducing me to bootStepAIC::boot.stepAIC!!! I have two quick (perhaps dumb) questions. How large is each bootstrapped sample? Can you change it? If you cannot, and if it replicates the sample size used for fitting the model, shouldn't you sample a data set for training the model and then bootstrap from the original data set? I hope I explained myself clearly. Cheers!

@rockleeroy 3 years ago

Great questions! The size of each bootstrapped sample is the same as your sample. I do not believe you can adjust the size of the sample using that function, though I could be mistaken. For your second question, what you are referring to are data splitting methods (they go by many names) designed to estimate the overall accuracy of the model. These methods fit a model on one subset of the data and test it against another subset that the model has not seen. Some examples include test/train splitting, k-fold cross-validation, and out-of-bag bootstrap. For out-of-bag bootstrap, the model fitted on each bootstrap sample is tested against the data that were not selected into that sample. However, boot.stepAIC is not so much interested in the accuracy of the overall model; instead it attempts a diagnostic of what the model is composed of. I only showed one method to mitigate the inconsistency problems of stepwise regression as a model shrinkage method (i.e., developing a parsimonious model). The inconsistency problem is that stepwise regression may include variables that likely should not be in the model and exclude ones that should. This bootstrap approach only outlines how often each variable would be included. But to answer your question, you should use some data splitting method to evaluate the overall model accuracy. In the near future, I will try to do a part 2 that includes a review of some data splitting techniques to evaluate overall model accuracy.
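
To make the idea of "how often each variable would be included" concrete, here is a hand-rolled sketch: resample the data with replacement at the original sample size, rerun `stepAIC()` on each resample, and tabulate how often each predictor survives. This is a simplified stand-in written for illustration, not the bootStepAIC package or its `boot.stepAIC` function; the data, variable names, and the choice of 50 resamples are all assumptions.

```r
library(MASS)  # provides stepAIC()

set.seed(123)

# Simulated data (assumption): x3 has no true effect.
n   <- 300
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
dat$y <- rbinom(n, 1, plogis(-0.5 + 1.2 * dat$x1 + 0.4 * dat$x2))

B    <- 50                       # number of bootstrap resamples (arbitrary)
vars <- c("x1", "x2", "x3")
included <- matrix(FALSE, nrow = B, ncol = length(vars),
                   dimnames = list(NULL, vars))

for (b in seq_len(B)) {
  # Resample rows with replacement; each resample has the original size n.
  boot_dat <- dat[sample(n, n, replace = TRUE), ]

  # Refit the full model and rerun backward AIC selection on the resample.
  full <- glm(y ~ x1 + x2 + x3, data = boot_dat, family = binomial)
  sel  <- stepAIC(full, direction = "backward", trace = FALSE)

  # Record which predictors remain in the selected model.
  included[b, ] <- vars %in% attr(terms(sel), "term.labels")
}

# Percentage of resamples in which each predictor was retained.
inclusion_pct <- colMeans(included) * 100
print(inclusion_pct)
```

A predictor that is retained in nearly every resample is a more stable choice than one that is kept only about half the time, which is the kind of diagnostic the reply describes.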