Finding an outlier in a dataset using Python

Views: 186,856

5 years ago

In this video we will see how to find outliers in a dataset using Python.
ref: #medium articles
#Outlierdetection
github url: github.com/krishnaik06/Findin...
Data Science Projects playlist: • Generative Adversarial...
NLP playlist: • Natural Language Proce...
Statistics Playlist: • Population vs Sample i...
Feature Engineering playlist: • Feature Engineering in...
Computer Vision playlist: • OpenCV Installation | ...
Data Science Interview Question playlist: • Complete Life Cycle of...
You can buy my book on Finance with Machine Learning and Deep Learning from the below url
amazon url: www.amazon.in/Hands-Python-Fi...

Comments: 118

@doubando 6 months ago

Amazing Krish, now I understand the concept of outliers, thanks

@yuktikhantwal2342 4 years ago

great video sir. great content, and explained in the cleanest way possible. thanks

@yourkarma7012 3 years ago

Clustering techniques are also widely used in industry to detect outliers, especially the isolation forest algorithm.

@srijeetful 2 years ago

Very clear and crisp explanation, loved it

@satheeshswaminathan2328 4 years ago

Hi Krish, Thank you so much for the tutorial, Very clear and crisp explanation, loved it :)

@shujashakir9952 1 year ago

The tutorial offers a lucid explanation of the complex problem of outliers. It is well presented, with examples that made it easier to follow. However, threshold = 3 isn't working for me. I modified it to threshold = 3+std to make it work properly. Moreover, declaring outliers = [ ] outside the function causes problems if you want to use this function on another dataset in the same notebook. So declaring the outliers list inside the function would be a better approach, I think.
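A minimal sketch of the z-score check with the result list created inside the function, as suggested in the comment above. The function name, the sample data, and the use of the standard-library `statistics` module are assumptions for illustration, not the notebook's actual code. The population standard deviation is used, which is also the default of `numpy.std`.

```python
from statistics import mean, pstdev


def detect_outliers_zscore(data, threshold=3.0):
    """Return the values whose absolute z-score exceeds `threshold`."""
    outliers = []  # created on every call, so results never leak between datasets
    mu = mean(data)
    sigma = pstdev(data)  # population standard deviation
    if sigma == 0:
        return outliers  # all values identical: nothing to flag
    for value in data:
        z = (value - mu) / sigma
        if abs(z) > threshold:
            outliers.append(value)  # append the value itself
    return outliers


# Hypothetical sample: 20 ordinary values plus one extreme value.
sample = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11,
          12, 10, 11, 13, 12, 11, 10, 12, 13, 11, 100]
print(detect_outliers_zscore(sample))
```

One thing worth knowing when a threshold of 3 seems not to work: with the population standard deviation, no value in a sample of n points can have an absolute z-score above sqrt(n - 1), so the check can never fire on 10 or fewer values.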

@AmitSharma-po1zb 3 years ago

Superb explanation...in very simple way..

@samarendrapradhan5067 4 years ago

Sir, please help. I have a dataset with 10 features, each with a date for a particular index. How can I detect and see outliers when they occur for an index in one or more features? I have 4000 fixed indexes, and the feature values are updated for each date. Thanks.

@gyapti-fctfinder3336 3 years ago

Nice content, and you explained it very well. Thank you so much.

@nabilahhannani2326 4 years ago

I've applied both methods to my dataset, but I got different results from them. Which one should I choose? Is it possible for them to give different results?

@Ashokkumar-sc3vt 5 years ago

Hi Krish, well explained. can you please post a video on how to equate the outliers using any dataset. Thanks in advance.

@aashaygoel7338 3 years ago

During an ML project I came across a scenario where, after splitting the dataset with train_test_split, the test set contained some categories in a categorical column that were not present in the train set when label encoding it. Can you please explain what to do in this type of scenario, and also whether outliers should be detected before the train/test split or after? I have seen that you explain each topic in detail. Please help me with this scenario.

@sanathdas4071 4 years ago

Sir, can you please tell me the difference between anomalies and outliers? I am confused about these two. Please answer me, sir.

@dhivya_animal_lover 4 years ago

Hi Sir, a small doubt about the part of the video where you talk about the standard normal distribution. You said the graph shows the standard normal distribution, but then you said that when data falls below or beyond the 3rd standard deviation, you will not consider it. Kindly clarify.

@deeptijoshi377 3 years ago

What will we do in case when outliers are not following gaussian distribution and outlier is present in between the data distribution but not at the extremes

@dineshlakshitha7309 3 years ago

amazing video, super explanation

@saniyamanchekar9978 4 years ago

How can I find outliers when there are many columns in a large dataset?

@sekharpink 5 years ago

Hi Krish I like ur videos alot..very informative..Could you please put videos related to word2vec models like skipgram, CBOW, gensim, glove.. Thanks in advance.

@adityapradhan8474 1 month ago

Thank you so much sir, I understood everything

@jatingupta4026 3 years ago

how to remove those values that are more than the upper bound and lower than the lower bound values respectively? Please tell that too sir

@meghnasingh9941 4 years ago

great explanation, kudos !

@The.Data.Scientist 11 months ago

Nice work mate. I also tried something similar but with Upper and Lower Bound on the Return

@karishmaqweera3869 4 years ago

Sir, Are you having handwritten notes of whatever you taught in ML course videos?Please share them Sir.

@adarshrai22 2 years ago

@krish naik how to remove outliers from non-normal distributed dataset?

@smalirizvi8026 2 years ago

I have a couple of questions. 1. Is it always better to remove the outliers, or could that be a big mistake as well? You gave the example of a fraudulent transaction. An outlier is indeed a hint that the transaction was fraudulent. If I remove all those transactions in the first place, how am I going to achieve my results? 2. You did not explain how to perform outlier checks on a multivariate dataset, for example the IRIS dataset. I have seen a couple of videos here and there, but no proper approach emerges. What is the proper way to identify outliers in multivariate datasets? Thanks

@muhammadmuneebkhanafridi154 4 years ago

Very well explained.

@mohanadjibory2191 2 years ago

Thanks. I wonder how to detect outliers in a NumPy ndarray, I mean an array of shape n by m. You explained it for a 1D array; what about 2D?

@manavagarwal9763 8 months ago

where can i get this jupyter notebook for revision

@aws384 4 years ago

great video and really it is inspiring

@rizkamilandgamilenio9806 1 year ago

Is there any condition under which one method is better to use than the other?

@prateeksmithpatra5796 3 years ago

In outliers.append(y), y is not defined, so how did it compile for you?

@ryando4556 4 months ago

Well explained, would be great if you can add some plot for visualization.

@subhamasthan7294 4 years ago

Hi Krish thank you so much for a nice video can you pls share the link of nxt video where you applied these techniques on kaggle dataset ?

@AbhishekMishra-mq4jw 3 years ago

what to do with natural outliers? the outliers which are expected to be there which are not because of any artificial errors

@shadrul2783 4 years ago

Here is the correction lower bound = q1 - 1.5*IQR and upper bound = q3 + 1.5*IQR

@rohankupate5917 1 year ago

You mean it's a mistake in the video?

@Kishor_D7 6 months ago

Yes bro, check statistics playlist by krish naik.
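The corrected bounds from this thread can be sketched as follows. The function names and sample data are illustrative assumptions; `statistics.quantiles` with `method="inclusive"` is used here because it interpolates the same way as NumPy's default percentile calculation.

```python
from statistics import quantiles


def iqr_bounds(data, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1  # inter-quartile range
    return q1 - k * iqr, q3 + k * iqr


def detect_outliers_iqr(data, k=1.5):
    """Return the values that fall outside the IQR fences."""
    lower, upper = iqr_bounds(data, k)
    return [x for x in data if x < lower or x > upper]


# Hypothetical sample: Q1 = 3.25, Q3 = 7.75, IQR = 4.5.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]
print(iqr_bounds(sample))           # (-3.5, 14.5)
print(detect_outliers_iqr(sample))  # [50]
```

The multiplier 1.5 is the conventional choice; it is exposed as the parameter `k` so a stricter or looser fence can be tried.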

@dhirendrajha9667 5 years ago

Hi, Krish, well explained, can you build one video on rasa chatbot.

@BAIBHAVPATHYBEE 1 year ago

for z score how did you know the threshold value ???

@PratapO7O1 3 years ago

14:06 Here it is a single-dimension df. How do we sort a multidimensional df? We can't sort all rows at once; we need to specify one row or two. How do we do it with a multidimensional df? Thank you

@otroleonarbe 2 years ago

thanks for sharing this video. One correction: in the loop it should be outliers.append(i), not outliers.append(y)

@dikshadhiman2474 3 years ago

Thank you sir for this content.

@mridulagarwal5881 4 years ago

You have explained things well. Just one correction - it's inter-quartile range and not inter-quantile range.

@FaraazKhanfz 3 years ago

It's Inter Quartile Range

@nosseibagacem9014 1 year ago

Hello sir, i hope you are doing well, i was hoping if you can help me with OD, I'm doing a thesis on the subject and i'm very new to python and programming, i hope to hear from you and thank you in advance.

@pratikramteke3274 3 years ago

How to find outliers in multiple linear regression?

@rushikeshbulbule8120 4 years ago

Excellent👍👏😆

@terwasevictorsesugh3902 1 year ago

What if the data does not follow a normal distribution?

@jakekiddall5108 2 years ago

Are there any anomaly detection videos that don't use credit card fraud as an example???

@mdazizulislam9653 4 years ago

Any suggestions for multivariate outliers with mixed variables (continuous and categorical)?

@bonishagarwal9315 4 years ago

In case of categorical data, it will be better to find the outlier using a scatter plot as sir explained.

@muditmathur465 1 year ago

Why do we use 1.5 times IQR? Can we take any other number?

@magicmushroom9670 3 years ago

Every single KZfaq channel explains this from a univariate perspective. Can you please explain it for the multivariate case? There is very little material about that on the internet.

@amitsawant4961 2 years ago

insightful for me

@bhagyaraj5506 4 years ago

In the z-score method the threshold value is given as 3. Is the threshold nothing but the 3rd standard deviation?

@mohitjoshi4209 4 years ago

yes you're correct

@sheetalyoutub 2 years ago

Very helpful !

@sakhawathossain3812 1 year ago

Very helpful...

@jorgeeg2668 2 years ago

How do I detect outliers as a function of datetime?

@Getrocknete_Kotze_Schlabbern 1 month ago

i dont understand why we compute 1.5 * iqr , what does this 1.5 mean where do you get this number?

@RahulKumar-hj8qk 4 years ago

If we have more than one feature and we remove the outliers from one of them, won't that affect the other features?

@bonishagarwal9315 4 years ago

You need to remove the whole sample of that outlier because if you remove only the outlier from one feature, it results in an empty space leading to inaccurate predictions. Eg. if you have Age, Height, and Weight as your input features and u find an outlier in your Age column, you need to remove the whole sample of that particular outlier i.e. remove the complete row of that outlier. Hope I have answered your question.
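A small sketch of dropping the complete row, as described in the reply above. The rows, the column names, and the helper `drop_outlier_rows` are hypothetical, and the IQR fences are just one possible way to decide which age counts as an outlier.

```python
from statistics import quantiles

# Hypothetical dataset: one dict per sample (row).
ages = [25, 30, 28, 35, 32, 27, 31, 29, 33, 240]
heights = [170, 165, 180, 175, 160, 172, 168, 177, 169, 171]
weights = [65, 58, 80, 77, 54, 70, 63, 75, 66, 68]
rows = [{"age": a, "height": h, "weight": w}
        for a, h, w in zip(ages, heights, weights)]


def drop_outlier_rows(rows, column, k=1.5):
    """Keep only the rows whose value in `column` lies inside the IQR fences."""
    values = [row[column] for row in rows]
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # The whole row is kept or dropped, so the other features stay aligned.
    return [row for row in rows if lower <= row[column] <= upper]


cleaned = drop_outlier_rows(rows, "age")
print(len(rows), len(cleaned))  # 10 9
```

Because each row is removed as a unit, the height and weight recorded for the age-240 sample disappear with it and no gaps are left in the other columns.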

@satyanarayanajammala5129 5 years ago

excellent

@vamsinadh100 3 years ago

13:57 Correction: lower bound = Q1 - IQR*1.5, upper bound = Q3 + IQR*1.5

@aggreykip2006 1 year ago

can you use Upper bound in a histogram as a max value?

@shishirdixit5996 4 years ago

Sir once we have detected these outliers using z score method and if they are too many outliers how can we drop those outliers

@RwSkipper007 4 years ago

You can use the .difference() method to do that. If A and B are two sets, you can calculate the difference as A.difference(B), which is equivalent to A - B. Similarly, B - A = B.difference(A). Hope this helps
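A short illustration of the set-difference suggestion, next to a list-based alternative. The data and the already-detected `outliers` list are made up for the example. A set drops duplicates and ordering, so the list comprehension may be the safer choice when repeated values matter.

```python
data = [10, 12, 12, 11, 100, 13]
outliers = [100]  # assumed to come from an earlier detection step

# Set difference: concise, but duplicates and the original order are lost.
unique_inliers = set(data).difference(outliers)

# List comprehension: keeps duplicates and the original order.
outlier_set = set(outliers)
inliers = [x for x in data if x not in outlier_set]

print(sorted(unique_inliers))  # [10, 11, 12, 13]
print(inliers)                 # [10, 12, 12, 11, 13]
```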

@ga43ga54 5 years ago

Please talk about data strategy

@mithunkumar7063 5 years ago

Thank you

@parikshitgupta343 3 years ago

How is the lower bound, which you said is q1*1.5, greater than the lower quartile, which you said is q1? The lower bound seems like something that should be less than the lower quartile.

@niveshtayal979 4 years ago

Hi Krish, thanks for the excellent explanation. But if we get some outliers in a feature, should we remove the records containing them (in which case we lose some data)? If not, how can we handle outliers? Please cover this portion also :)

@amanpreetsinghgulati2475 2 years ago

Capping (winsorization) is another way to deal with outliers: the extreme values are replaced with values within the range, so no data is lost.
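A minimal sketch of capping, assuming the IQR fences are used as the limits. This is one variant; winsorization in the strict sense usually caps at chosen percentiles instead. The function name and data are illustrative.

```python
from statistics import quantiles


def cap_outliers(data, k=1.5):
    """Clip every value to the range [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Values inside the fences are unchanged; values outside move to the fence.
    return [min(max(x, lower), upper) for x in data]


sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]
capped = cap_outliers(sample)
print(capped)  # [1, 2, 3, 4, 5, 6, 7, 8, 9, 14.5]
```

The row count is unchanged, which is the advantage over dropping rows; the cost is that the capped values no longer reflect what was actually recorded.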

@arjyabasu1311 4 years ago

Sir, shouldn't the threshold value be 3*std and not just 3? The rule is that a data point is considered an outlier if it falls outside the 3rd standard deviation, not just outside the value 3.

@jondoe3693 4 years ago

Do you mean when z score = 3? Then it is correct to use threshold of 3 because you have standardized the data and standard deviation of z scored values is 1 and its mean is 0.
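The point in the reply above can be checked numerically. In this sketch (the sample values are made up), the z-scored data has mean 0 and standard deviation 1, so comparing |z| with 3 is the same test as comparing the raw distance from the mean with 3 times the standard deviation.

```python
import math
from statistics import mean, pstdev

data = [2, 4, 4, 4, 5, 5, 7, 9]
mu, sigma = mean(data), pstdev(data)  # 5 and 2.0 for this sample

z_scores = [(x - mu) / sigma for x in data]
print(mean(z_scores), pstdev(z_scores))  # 0.0 1.0

# The two formulations flag exactly the same points.
threshold = 3
by_z = [x for x in data if abs((x - mu) / sigma) > threshold]
by_raw = [x for x in data if abs(x - mu) > threshold * sigma]
print(by_z == by_raw)  # True
```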

@yomeshyadav3407 3 years ago

sir, I have a doubt, threshold is nothing but 3rd standard deviation as you said so it must be 3 * sigma but here you have taken the threshold as 3 can you please clarify this

@somomitachattopadhyay2846 11 months ago

yes thats because here in standard normal distribution the standard deviation is considered to be having the value 1 , sigma = 1

@abdulaziz-lh3nb 2 years ago

what if I have a lot of outliers in the dataset (around 27%), how to handle that?

@newbie8051 1 year ago

If I were you, I would go for missing value treatment first, then try to go with outlier treatment, also if I had to deal with such high % of outliers, my first thought would be treat them like normal data points, as deleting outliers would lead to loss of too-many data points. Can you share how you solved the problem ?

@aakashsinghrawat3313 4 years ago

Sir, in a dataset like bank loan prediction, what if a credit score is outside its range (300-850)? Will such values be considered outliers? If yes, how should they be handled? Fellow viewers are welcome to help, please.

@rachittoshniwal 3 years ago

If the range itself is 300-850 and you are having values above or below that range, then that is a data error, and you can drop them unless you can devise a way to find the real value

@vishalb1204 5 years ago

Can you please enable English subtitle?

@kaka83185 3 years ago

Just a correction, when calculating z-score , you are doing subtraction of i to an array, you should enumerate on datasets and then subset i from the current index of mean and std.

@karimdandachi9200 1 year ago

mean and std are not arrays... the mean of a list of values is a single value and so is the standard deviation

@aparnashrivastava5837 3 years ago

Thanks

@কোরআন-শিখি 4 years ago

can u do a ransac

@cliffkwok 5 years ago

Hi Krish, I just ordered your finance book in Amazon, which is the newest one in whole amazon about python in finance, will you do more video on finance?

@krishnaik06 5 years ago

Thanks Kwok for buying my book...yes I will be uploading more videos on finance.

@varunchandrappa5123 3 years ago

@@krishnaik06 Hands-On Python for Finance is out of stock..Please let us know when it will be available for sale

@ksoftqatutorials9251 5 years ago

I have been following your videos and I have learnt many things Krish Naik. Could you please tell me have you written any Datascience and machine learning books. I would like to buy your books and follow your videos to clinch Datascience job as soon as possible.

@krishnaik06 5 years ago

Hi Kiran, I have written a book on finance with ML and DL

@ksoftqatutorials9251 5 years ago

@@krishnaik06 could you please share the link,so that I would buy that book..looking forward to more videos.

@chandrasekharpoluboyina8865 3 years ago

tell us about robust outlier

@aayushijain2160 4 years ago

Sir I understood that how to identify outliers using Z-score and IQR but can you tell us how to fix them like either we should drop that column or what else we should do to remove that outlier from the dataset????

@farazmev3430 4 years ago

drop rows or replace them (mean,mode,median)

@mashirnizami134 4 years ago

Gr8

@jayantdikshit4181 3 years ago

Hi Krish thanks for making such an amazing content. I have a query at 09:35. As you have mentioned that we can find outliers using scatter plots. But how can we find outliers if we do have multiple features(more than 2 features)? Your views/response on this would be much appreciated. Thanks in advance.

@rachittoshniwal 3 years ago

You can try with any two random features from your data. You'll either see most values following a trend with a few outliers, or you'll see most values cluster in one place with a few outliers. Or maybe something else too!

@sanjaysanjay862 2 years ago

yes, you can do it by plotting each feature with the target.

@raghavgirigiri1 3 years ago

Krish i just wanna make a small correction, while saying "less than 2" OR "less than 3" say "10% of the data (or whatever the data is) fall below 2 or 3"....otherwise it's great, Good job !!

@chandrasekharpoluboyina8865 3 years ago

Generally we remove this noise, But for fraud detection and identifying a rare disease outliers will be helpful, in such cases how to handle or use them instead of removing them.

@muhammadyazidbaihaqi1479 2 years ago

Why does your video have no subtitles? Please add them, thanks

@deepquest 2 years ago

Hi Krish, How can we identify root cause of an outlier?

@newbie8051 1 year ago

Due to human error in data entry/recording or maybe due to some error/bug in the Data Pipeline

@NickolayGrin 5 years ago

Using the mean is OK, but it is not the best idea for outlier detection. Median-based methods are usually more robust.
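One common median-based approach is the modified z-score built on the median absolute deviation (MAD). The sketch below is an illustration of that idea, not something shown in the video; the constant 0.6745 and the cutoff 3.5 are the values conventionally quoted for this method, and the function name and data are assumptions.

```python
from statistics import median


def detect_outliers_mad(data, threshold=3.5):
    """Flag values whose modified z-score (median/MAD based) exceeds `threshold`."""
    med = median(data)
    mad = median(abs(x - med) for x in data)  # median absolute deviation
    if mad == 0:
        return []  # more than half the values are identical; the score is undefined
    return [x for x in data if abs(0.6745 * (x - med) / mad) > threshold]


# Hypothetical sample of only 9 values with one extreme point.
sample = [10, 12, 11, 13, 12, 11, 10, 12, 100]
print(detect_outliers_mad(sample))  # [100]
```

The median and MAD are barely moved by the extreme value itself, whereas the mean and standard deviation are both pulled toward it. In a sample this small, a mean-based z-score cannot exceed sqrt(n - 1), about 2.83 here, so a threshold of 3 would miss the same point.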

@hritwijkamble9988 1 year ago

Why threshold = 3?

@Blodia1990 1 month ago

It represents the quartile

@ganeshkumarpatel 4 years ago

Why do such calculations and looping to find outliers? Just apply standard scaling and create a new conditional dataframe of the scaled data containing the values more than 3 std away. Those are the outliers, aren't they?