How to perform clustering in R with the k-means algorithm

How to perform clustering in R with the k-means algorithm - R for Data Science

31,572 views

This video shows how to perform clustering with the k-means algorithm in R. k-means is an unsupervised learning technique for clustering.
With k-means we create groups, or clusters, such that observations within the same cluster are as similar as possible while observations in different clusters are as dissimilar as possible.
In the video you can find the most relevant R commands for creating clusters with k-means for a sample dataset.
You'll also learn about interpretation of the k-means results and how to create visualizations to explore the data against principal components and against the original variables.
Get access to download the scripts and data from GoogleDrive: dataninjas.ck....
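
The workflow described above can be sketched in base R roughly as follows. This is only a minimal sketch, not the script from the video: the built-in iris data is an assumed stand-in for the sample dataset, and the choice of 3 clusters is an assumption for illustration. The elbow plot is computed by hand here; the factoextra function fviz_nbclust mentioned in the comments draws a similar plot.

```r
# Stand-in data: the four numeric columns of the built-in iris dataset.
x <- scale(iris[, 1:4])          # standardize so each variable has mean 0, sd 1

# Elbow method: total within-cluster sum of squares for k = 1..8.
set.seed(123)                    # k-means starts from random centers
wss <- sapply(1:8, function(k) kmeans(x, centers = k, nstart = 25)$tot.withinss)
plot(1:8, wss, type = "b", xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")

# Fit the final model with the k suggested by the elbow (3 is assumed here).
km <- kmeans(x, centers = 3, nstart = 25)
print(km$size)                   # observations per cluster
print(km$centers)                # cluster means on the standardized scale

# Explore the clusters against the first two principal components.
pca <- prcomp(x)
plot(pca$x[, 1:2], col = km$cluster, pch = 19,
     main = "k-means clusters on PC1 and PC2")
```

The number of clusters is read off the elbow plot as the point where adding another cluster stops reducing the within-cluster sum of squares by much.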

Comments: 49

@data.ninjas 2 years ago

Get access to download the scripts and data from GoogleDrive: dataninjas.ck.page/yt-files

@DaliaAboelmakarm-un9ee 24 days ago

Many thanks for this sufficient illustration, really thanks

@data.ninjas 23 days ago

You're very welcome, thank you for watching my video

@juanbautista6766 2 years ago

Wow. Great tutorial. Have seen many videos for generating “elbow plot”, but using the factoextra package as you noted here is GOLDEN! Thanks!!

@data.ninjas 2 years ago

Thank you very much for your kind message! Yes, the factoextra package makes it easy to create an elbow plot. Glad to hear you find the video helpful. Kind regards

@gabrielp.40 3 months ago

You are a lifesaver, thank you so much for the tutorial!

@data.ninjas 3 months ago

You're very welcome! Thank you for watching my video

@amiyabasak7096 10 months ago

I have gained a comprehensive understanding of this topic, and sir, your explanations have been exceedingly clear to me.

@data.ninjas 10 months ago

Thank you very much for your kind message. I'm happy to hear that you find my video helpful. Best regards

@snehaj3378 10 months ago

You have no idea how much you helped me... God bless!!

@data.ninjas 10 months ago

You're very welcome. Glad to know you find the video helpful. Kind regards

@AchiragChiragg 7 months ago

Thank you for making this video! It was very informative and helpful

@data.ninjas 7 months ago

Glad to hear you found the video helpful! Thanks for your kind comment

@JorgeRodriguez-mp1mt 2 years ago

Following your contributions, greetings from Mexico

@data.ninjas 2 years ago

Thank you very much. Best regards

@tesfayewoldesemayate4506 5 months ago

What a nice video, wonderful!

@data.ninjas 5 months ago

Thanks for your kind comment on my video! Best regards

@thelightofgod9151 2 years ago

Wow. Very clear and precise. Thanks

@data.ninjas 2 years ago

Thanks for your kind comment

@johneagle4384 2 years ago

Thank you for the video, and also thank you for the scripts!

@data.ninjas 2 years ago

You're very welcome! Thank you for watching and for commenting on my video

@lehoangucduy1425 10 months ago

Why choose a centers value of 3 in the kmeans function? Please explain and help me

@letsfly8654 5 months ago

fviz_nbclust(data,kmeans,method='wss' is not working. Why?

@Pooh991 2 years ago

Great video, I learned a lot from it, especially in regards to the methods for choosing the optimal number of clusters. Quick question though: the clusters overlap in your plot, but I don't think that they are supposed to overlap in the k-means method. Do you have any insight on this?

@data.ninjas 2 years ago

Thanks for your kind comment. The clusters were created using 6 variables. The plots only show 2 variables at a time (2-dimensional plots), so some overlap can be seen. If it were possible to create a 6-dimensional plot, there would be no overlap
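
The point made in this reply can be checked with a small sketch. It is an illustration under assumptions, not the analysis from the video: the built-in iris data (4 variables) stands in for the 6-variable dataset, and 3 clusters are assumed. The code measures distances to the cluster centers using all variables, then draws the clusters on only two of them.

```r
x <- scale(iris[, 1:4])     # stand-in data with 4 variables
set.seed(123)
km <- kmeans(x, centers = 3, nstart = 25, iter.max = 50)

# Squared Euclidean distance from every observation to every cluster center,
# computed in the full variable space (not in a 2-D projection).
d2 <- sapply(1:3, function(j) rowSums(sweep(x, 2, km$centers[j, ])^2))
nearest <- max.col(-d2, ties.method = "first")  # index of the closest center

# Share of observations whose assigned cluster is also the closest center.
print(mean(nearest == km$cluster))

# The same clusters drawn on only two of the variables can look overlapping.
plot(x[, 1:2], col = km$cluster, pch = 19)
```

Because each observation is assigned using all variables at once, the regions are separated in the full space even when a 2-variable scatter plot shows the colors mixing.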

@aysegulgunduz4292 1 year ago

Hi, how can I find this data on the internet? Or how can I get access to an explanation of the dataset?

@rafipermana7734 5 months ago

When I execute fviz_nbclust, this happens: Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1) In addition: Warning messages: 1: In stats::dist(x) : NAs introduced by coercion 2: In storage.mode(x)

@data.ninjas 5 months ago

It may be because of NAs. kmeans cannot handle data that has NA values. See: stackoverflow.com/questions/36469671/error-in-do-onenmeth-na-nan-inf-in-foreign-function-call-arg-1
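
A minimal sketch of one way to clean the data before clustering is shown below. The data frame df and its columns are hypothetical, made up for illustration. The warning "NAs introduced by coercion" usually suggests that a non-numeric column (such as a text ID) was passed in, so the sketch keeps only numeric columns and then drops rows with missing values.

```r
# Hypothetical data frame with a text column and a missing value.
df <- data.frame(
  id     = paste0("s", 1:8),
  height = c(1.2, 1.4, NA, 3.1, 3.3, 3.0, 5.2, 5.1),
  weight = c(10, 11, 12, 30, 31, 29, 50, 52)
)

print(colSums(is.na(df)))               # count NAs per column
num <- df[, sapply(df, is.numeric)]     # keep numeric columns only
num <- na.omit(num)                     # drop rows that contain NA
x   <- scale(num)                       # standardize before clustering

set.seed(1)
km <- kmeans(x, centers = 3, nstart = 10)
print(km$cluster)
```

Dropping rows is the simplest option; whether it is appropriate depends on how many rows are lost and why the values are missing.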

@what2605 28 days ago

That one sample, no. 79, made me feel very unsatisfied...

@Lilian.Chidinma.Nwafor 2 months ago

Thank you sir. Can k-means be applied to analysis with Likert scale data?

@data.ninjas 2 months ago

You're welcome. You may need to do some data preprocessing to apply k-means to an analysis with Likert scale data. First apply one-hot encoding so that each response/category becomes a binary variable (0 or 1), and then normalize the data to have a mean of 0 and a standard deviation of 1. However, note that k-means clustering uses Euclidean distance and assumes that distances between points are meaningful and comparable. This may not be appropriate for Likert scale data, since such data is ordinal and the distances between responses may not be consistent. You may therefore consider alternative clustering techniques that are better suited to ordinal data, such as hierarchical clustering or model-based clustering approaches
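
The preprocessing steps named in this reply (one-hot encoding, then normalization) could look roughly like this in base R. The survey data, the item names q1 and q2, and the choice of 3 clusters are all hypothetical, and the caveat about Euclidean distance on ordinal data still applies to the result.

```r
set.seed(42)
# Hypothetical survey: 100 respondents, two 5-point Likert items.
likert <- data.frame(
  q1 = factor(sample(1:5, 100, replace = TRUE), levels = 1:5),
  q2 = factor(sample(1:5, 100, replace = TRUE), levels = 1:5)
)

# One-hot encode: one 0/1 column per item and response level.
one_hot <- do.call(cbind, lapply(names(likert), function(item) {
  lv <- levels(likert[[item]])
  m  <- sapply(lv, function(l) as.integer(likert[[item]] == l))
  colnames(m) <- paste(item, lv, sep = "_")
  m
}))

x  <- scale(one_hot)                    # mean 0, sd 1 per column
km <- kmeans(x, centers = 3, nstart = 25)
print(table(km$cluster))
```

Note that scale() produces NaN for a column that is constant, so a response level that nobody selected would need to be dropped before this step.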

@Lilian.Chidinma.Nwafor 2 months ago

@data.ninjas Thank you. I think hierarchical will be good

@anteachmad 1 year ago

Does cluster analysis have to start with a multicollinearity test?

@data.ninjas 1 year ago

No, it does not. Multicollinearity does not directly influence the cluster analysis results

@vishalisharma3883 6 months ago

Why is my mutate function not working?

@HarpreetKaur-bx1ej 2 years ago

Hi, I have a question: "Perform a cluster analysis for 20 randomly selected Swiss bank notes." What is 20 in this case?

@data.ninjas 2 years ago

Hi. That question is not clear. It may mean that you should select 20 observations (rows) randomly from a given dataset and perform a cluster analysis on them, or it may mean something else

@HarpreetKaur-bx1ej 2 years ago

@data.ninjas Here is the full question. What is 20? Cluster analysis for 20 randomly selected data points of the Swiss bank dataset, with the following requirements:
1. Set pseudo random numbers for 20 randomly selected data points
2. Write about accuracy, missing values and outliers
3. What is the rationale for selecting k-means clustering and a distance function?
4. Interpret and comment on the clustering output
5. Is the cluster analysis technique used for the dataset good? Use cluster evaluation
6. Visualize the 20 selected data points by plotting the result of principal components

@data.ninjas 2 years ago

@HarpreetKaur-bx1ej The first interpretation was correct. Select 20 rows (data points) from the dataset randomly

@HarpreetKaur-bx1ej 2 years ago

@data.ninjas Does it mean I have to take nstart=20?

@HarpreetKaur-bx1ej 2 years ago

Can you please help me with this question, as I am stuck on it
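
The interpretation given in this thread (select 20 rows at random) could be sketched as follows. The Swiss bank notes data is not included here, so the built-in iris data is used as an assumed stand-in, and the seed value and the 2 clusters are arbitrary choices for illustration. The nstart argument of kmeans() is a separate setting and is unrelated to the 20 selected rows.

```r
set.seed(2024)                          # fixes the pseudo-random numbers
dat  <- iris[, 1:4]                     # stand-in for the bank notes data
rows <- sample(nrow(dat), 20)           # 20 row indices, without replacement
sub  <- scale(dat[rows, ])              # the 20 selected observations

# nstart is a different setting: the number of random starting
# configurations kmeans() tries. It is not the sample size.
km <- kmeans(sub, centers = 2, nstart = 25)
print(km$cluster)

# Plot the 20 points on the first two principal components.
pca <- prcomp(sub)
plot(pca$x[, 1:2], col = km$cluster, pch = 19)
```

Setting the seed before sample() makes the same 20 rows come out on every run, which is what "set pseudo random numbers" in the assignment appears to ask for.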

@kharankumarr2119 2 years ago

Is this the CURE algorithm?

@data.ninjas 2 years ago

The kmeans() function in R uses the Hartigan-Wong algorithm by default. Other options are the Lloyd, Forgy and MacQueen algorithms
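
The algorithm options listed in this reply are selected through the algorithm argument of kmeans(). A minimal sketch, assuming the built-in iris data as stand-in data and 3 clusters:

```r
x <- scale(iris[, 1:4])                 # stand-in data
algos <- c("Hartigan-Wong", "Lloyd", "Forgy", "MacQueen")

set.seed(123)
fits <- lapply(algos, function(a) {
  kmeans(x, centers = 3, nstart = 25, iter.max = 50, algorithm = a)
})
names(fits) <- algos

# Compare the total within-cluster sum of squares reached by each algorithm.
print(sapply(fits, function(f) f$tot.withinss))
```

iter.max is raised from its default of 10 here because the Lloyd and Forgy options can need more iterations to converge.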

@kharankumarr2119 2 years ago

@data.ninjas Sir, now I need R programming code for the CURE algorithm

@kharankumarr2119 2 years ago

Can you please give me your mail id?

@data.ninjas 2 years ago

@kharankumarr2119 There may not be an implementation of the CURE algorithm in R yet (or at least I have not found one). There is a Python implementation of CURE (github.com/annoviko/pyclustering). You may run CURE in Python, or you may use the reticulate package to work with Python in R (rstudio.github.io/reticulate/)

@kharankumarr2119 2 years ago

@data.ninjas Sir, it is a project that we have to do in R programming. I am a data analytics student at psgcas