Practical 2
AIM :
Data Pre-processing (Feature Selection/Elimination) tasks using Python.
THEORY:
Feature Selection:
Feature selection is the process of automatically or manually selecting the features that contribute most to the prediction variable or output you are interested in. Irrelevant features in your data can decrease the accuracy of your models and cause them to learn from noise rather than signal.
Why feature selection?
- Reduces Overfitting: Less redundant data means less opportunity to make decisions based on noise.
- Improves Accuracy: Less misleading data means modeling accuracy improves.
- Reduces Training Time: Fewer features mean lower algorithm complexity, so models train faster.
Different methods of Feature Selection/Elimination:
- Variance Threshold – Variance threshold is a simple baseline approach to feature selection. It removes all features whose variance does not meet some threshold. By default, it removes all zero-variance features, i.e. features that have the same value in all samples (a minimal sketch appears after this list).
- Univariate feature selection – Univariate feature selection examines each feature individually to determine the strength of the relationship of the feature with the response variable.
- Recursive Feature Elimination – Recursive feature elimination (RFE) is a feature selection method that fits a model and removes the weakest feature (or features) until the specified number of features is reached.
- PCA – Principal component analysis (PCA) is, strictly speaking, a feature-extraction rather than a feature-selection method: instead of keeping a subset of the original variables, it projects them onto a smaller set of orthogonal components chosen to preserve as much of the information (variance) in the complete data as possible.
- Correlation – A good feature is highly correlated with the class yet not redundant with any other relevant feature. Correlation-based feature selection therefore has two stages: selecting features that are relevant to the class, then identifying redundant features and eliminating them from the original dataset (see the second sketch after this list).
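As a minimal sketch of the variance-threshold baseline (the toy matrix and the 0.1 cutoff here are illustrative assumptions, not values from this practical):

import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy feature matrix: the second column is constant (zero variance)
X = np.array([[0.0, 1.0, 2.1],
              [0.5, 1.0, 3.0],
              [1.0, 1.0, 0.9],
              [1.5, 1.0, 2.2]])

# Drop features whose variance is below 0.1; the default threshold (0.0)
# removes only constant features
selector = VarianceThreshold(threshold=0.1)
X_reduced = selector.fit_transform(X)

print(selector.variances_)     # per-feature variances
print(selector.get_support())  # boolean mask of retained features
print(X_reduced.shape)         # (4, 2): the constant column is gone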
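And a hedged sketch of a simple correlation-based redundancy filter; the 0.9 cutoff and the pandas-based approach are assumptions for illustration, not the only way to implement the two stages described above:

import numpy as np
import pandas as pd

def drop_highly_correlated(df, cutoff=0.9):
    """Drop one feature from each pair whose absolute Pearson
    correlation exceeds `cutoff` (illustrative redundancy filter)."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > cutoff).any()]
    return df.drop(columns=to_drop)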
Dataset Description:
This dataset was created to identify a voice as male or female based upon acoustic properties of the voice and speech. It consists of 3,168 recorded voice samples collected from male and female speakers.
Task 1: Univariate feature selection
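A minimal sketch of Task 1 with scikit-learn's SelectKBest follows; the file name voice.csv, the column name label, and k=10 are assumptions about the dataset layout rather than requirements of the practical:

import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

# Assumed layout: acoustic features plus a 'label' column (male/female)
df = pd.read_csv('voice.csv')   # hypothetical file name
X = df.drop(columns=['label'])
y = df['label']

# Score each feature independently with the ANOVA F-test, keep the top 10
selector = SelectKBest(score_func=f_classif, k=10)
X_new = selector.fit_transform(X, y)

print(X.columns[selector.get_support()].tolist())  # surviving features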
Task 2: Recursive Feature Elimination
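For Task 2, a hedged sketch using scikit-learn's RFE wrapped around a logistic-regression estimator (the estimator choice, voice.csv, label, and n_features_to_select=10 are all illustrative assumptions):

import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

df = pd.read_csv('voice.csv')   # hypothetical file name
X = df.drop(columns=['label'])
y = df['label']

# Repeatedly fit the model and discard the weakest feature until
# only n_features_to_select features remain
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=10)
rfe.fit(X, y)

print(X.columns[rfe.support_].tolist())  # selected features
print(rfe.ranking_)  # 1 = selected; larger ranks were eliminated earlier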
Task 3: Principal component analysis
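And for Task 3, a minimal PCA sketch (standardizing first because PCA is scale-sensitive; voice.csv, label, and the 95% variance target are assumptions):

import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

df = pd.read_csv('voice.csv')   # hypothetical file name
X = df.drop(columns=['label'])

# Standardize, then keep enough components to explain ~95% of the variance
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X_scaled)

print(pca.n_components_)                    # components retained
print(pca.explained_variance_ratio_.sum())  # total variance explained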
QUESTION/ANSWER:
1. What is the impact on model performance for selecting only correlated features?
Once a correlation is known, it can be used to make predictions: when we know a score on one measure, we can predict a related measure more accurately. The stronger the relationship between the variables, the more accurate the prediction, so selecting only features that are strongly correlated with the target tends to improve model performance.
2. Amongst all methods, which method avoids overfitting and improves model performance?
Recursive Feature Elimination. RFE is a greedy optimization algorithm that aims to find the best-performing feature subset; by repeatedly discarding the weakest features, it avoids overfitting and improves model performance.
REFERENCES:
- https://medium.com/analytics-vidhya/feature-selection-using-scikit-learn-5b4362e0c19b
- https://machinelearningmastery.com/rfe-feature-selection-in-python/
- https://towardsdatascience.com/pca-using-python-scikit-learn-e653f8989e60
- https://towardsdatascience.com/feature-selection-using-python-for-classification-problem-b5f00a1c7028
- https://www.analyticsvidhya.com/blog/2016/01/guide-data-exploration/
CODE FOR REFERENCE:
https://github.com/Meghanshi999/17IT087_DATASCIENCE/blob/main/87_DS_PRAC2.ipynb