Semiconductor Test Result Prediction (Imbalanced Classes) - Data Every Day

1,571 views

Gabriel Atkin


Hi guys, welcome back to Data Every Day!
On today's episode, we are looking at a dataset of semiconductors in production and trying to predict whether a given semiconductor will pass or fail a quality assurance test. We will be using a logistic regression model to make our predictions.
Here is a link to the Kaggle dataset:
www.kaggle.com...
And here is a link to my notebook from the video:
www.kaggle.com...
Thanks so much for watching! If you enjoyed today's episode, be sure to subscribe and hit the bell for more content!
See you all tomorrow! :)
----------
Patreon: / gcdatkin
LinkedIn: / gcdatkin
Twitter: / gcdatkin

Comments: 8
@atharvadumbre 3 years ago
Is it necessary to have 50% of both target values? Can't we just oversample the minority class and undersample the majority class so that the target column ends up in a 70:30 or 60:40 ratio? I think that would give better results. Correct me if I'm wrong, I don't have much practical experience 😅 Btw, love your videos, I've started watching every upload ❤️
@gcdatkin 3 years ago
I highly recommend trying it out and seeing! It may be so. The results should dictate which approach to use. In theory, we want both classes to have equal representation so that the model is used to seeing both kinds of training examples in equal quantity, but theory does not always rule. Practice will reveal the truth.
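As a sketch of the idea being discussed, here is one minimal way to oversample a minority class to an arbitrary target ratio like 70:30. The function name, column name, and toy data are invented for illustration; this is not the notebook's code, just a hedged demonstration of the approach:

```python
import numpy as np
import pandas as pd

def oversample_to_ratio(df, target_col, minority_label, ratio, seed=0):
    """Randomly oversample (with replacement) the minority class until
    it makes up `ratio` of the dataset (e.g. 0.3 for a 70:30 split)."""
    rng = np.random.default_rng(seed)
    minority = df[df[target_col] == minority_label]
    majority = df[df[target_col] != minority_label]
    # Solve n_minority / (n_minority + n_majority) = ratio for n_minority.
    n_needed = int(round(ratio * len(majority) / (1 - ratio)))
    extra = n_needed - len(minority)
    if extra > 0:
        # Draw `extra` additional minority rows with replacement.
        sampled = minority.iloc[rng.integers(0, len(minority), extra)]
        df = pd.concat([majority, minority, sampled], ignore_index=True)
    return df

# Toy frame: 90 passes (0), 10 fails (1) -> rebalance fails to ~30%.
toy = pd.DataFrame({"Pass/Fail": [0] * 90 + [1] * 10})
balanced = oversample_to_ratio(toy, "Pass/Fail", 1, 0.3)
```

Libraries such as imbalanced-learn offer the same idea ready-made (e.g. a `sampling_strategy` parameter on its samplers), but the hand-rolled version above makes the ratio arithmetic explicit.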
@nayansarma840 3 years ago
To identify the columns with only a single value, would it have been easier just to check the variance of each column? A variance of 0 indicates only a single value in that column. Also, scikit-learn provides a function to automate this: sklearn.feature_selection.VarianceThreshold(). However, using this function could mess up the column indexes.
@gcdatkin 3 years ago
Very true! That's a good idea, and definitely easier than typing out that whole dictionary comprehension. Thanks for the tip! :) I would avoid using VarianceThreshold unless performing feature selection. What you could do instead is get the variance of each column with df.var() and then check which are equal to zero:
single_valued_columns = df.columns[df.var() == 0]
df = df.drop(single_valued_columns, axis=1)
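For completeness, those two lines can be run end to end on a toy frame (the column names and values here are invented for illustration; in the video the frame would be the semiconductor dataset):

```python
import pandas as pd

# Toy frame: column 'b' holds a single repeated value, so its variance is 0.
df = pd.DataFrame({"a": [1, 2, 3], "b": [7, 7, 7], "c": [4.0, 5.0, 6.5]})

# Variance of 0 <=> the column contains only one distinct value.
single_valued_columns = df.columns[df.var() == 0]
df = df.drop(single_valued_columns, axis=1)
```

After this, only the informative columns 'a' and 'c' remain. Note that df.var() assumes numeric columns; constant string columns would need a separate check such as df.nunique() == 1.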
@priyation 3 years ago
If you were to find out which features are most important to predict fails, how would you do it?
@gcdatkin 3 years ago
Great question! There are many ways to do this. One way is to use a model with interpretability built in. For example, with logistic regression, you get to see the actual feature contributions just by looking at the weights learned by the model (since there is only one weight per feature). Another interpretable model would be the decision tree. Once you've built the tree, you can look at its structure to see how the model arrives at its predictions. Another way to gauge feature importance is to use explanation methods such as LIME or Shapley values. LIME (Local Interpretable Model-agnostic Explanations) gives you a sense of how your model makes its predictions by building a linear (interpretable) model that approximates it locally. Shapley values measure the marginal contribution of each feature with respect to the final output. If any of this is confusing, please let me know! :)
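The first approach mentioned, reading logistic regression weights as feature contributions, can be sketched as follows. The synthetic data is made up for illustration (feature 0 drives the label, feature 1 is pure noise); this is not the video's dataset:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: feature 0 determines the label, feature 1 is noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = (X[:, 0] > 0).astype(int)

# Scale features first so the coefficient magnitudes are comparable.
X_scaled = StandardScaler().fit_transform(X)
model = LogisticRegression().fit(X_scaled, y)

# One weight per feature: larger |weight| -> larger contribution.
importance = np.abs(model.coef_[0])
```

Here `importance[0]` comes out much larger than `importance[1]`, matching how the data was generated. Scaling matters: without it, a feature measured in large units can get a small coefficient even when it is highly predictive.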
@goldenmikeLeKing 3 years ago
@gcdatkin How would you decide which of those methods to use?
@gcdatkin 3 years ago
Well, it depends on what you are looking for in a model. If accuracy is extremely important to you, you may be better off using a non-interpretable model, because highly interpretable models generally trade away some accuracy relative to other models. For example, neural networks and random forests can be very accurate, but it is difficult, if not impossible, to interpret their results directly; in that case, explanation methods would better suit your needs. If accuracy is secondary and not as important as interpretability, then opting for a simple linear/logistic regression or a decision tree might be your best option.