DropConstantFeatures
API Reference
- class feature_engine.selection.DropConstantFeatures(variables=None, tol=1, missing_values='raise')
Drop constant and quasi-constant variables from a dataframe. Constant variables show the same value across all the observations in the dataset. Quasi-constant variables show the same value in almost all the observations in the dataset.
By default, DropConstantFeatures() drops only constant variables. This transformer works with both numerical and categorical variables. The user can indicate a list of variables to examine. Alternatively, the transformer will evaluate all the variables in the dataset.
The transformer will first identify and store the constant and quasi-constant variables. Next, the transformer will drop these variables from a dataframe.
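The two steps described above can be sketched with plain pandas. This is a minimal illustration, not the library's implementation: the helper name `find_quasi_constant` is hypothetical, the toy dataframe has no missing values, and the comparison against `tol` is assumed to be inclusive so that `tol=1` flags fully constant columns.

```python
import pandas as pd


def find_quasi_constant(df: pd.DataFrame, tol: float = 1.0) -> list:
    """Hypothetical helper: list columns whose most frequent value
    covers at least a fraction `tol` of the rows."""
    to_drop = []
    for col in df.columns:
        # Count of the most frequent value, as a fraction of all rows.
        share = df[col].value_counts().max() / len(df)
        if share >= tol:  # assumed inclusive, so tol=1 flags constant columns
            to_drop.append(col)
    return to_drop


df = pd.DataFrame({
    "a": [1, 1, 1, 1, 1],            # constant
    "b": ["x", "x", "x", "x", "y"],  # same value in 80% of rows
    "c": [1, 2, 3, 4, 5],            # all values differ
})

constant_only = find_quasi_constant(df)             # step 1 with tol=1
quasi_constant = find_quasi_constant(df, tol=0.75)  # step 1 with tol=0.75
reduced = df.drop(columns=quasi_constant)           # step 2: drop them
```

With `tol=1` only column `a` is flagged. Lowering the threshold to 0.75 also flags `b`, leaving `c` as the only remaining column.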
- Parameters
- variables: list, default=None
The list of variables to evaluate. If None, the transformer will evaluate all variables in the dataset.
- tol: float or int, default=1
Threshold to detect constant and quasi-constant features. Variables showing the same value in a proportion of observations greater than or equal to tol will be considered constant or quasi-constant and dropped. If tol=1, the transformer removes only constant variables. Otherwise, it also removes quasi-constant variables.
- missing_values: str, default='raise'
Whether to raise an error on missing values, ignore them, or include them as an additional value of the variable when determining if the feature is constant or quasi-constant. Takes values 'raise', 'ignore', 'include'.
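The effect of the three missing_values options can be illustrated with a small pandas sketch. The helper `predominant_share` is hypothetical and only mimics the documented behaviour. In particular, dividing by the total number of rows in every mode is an assumption that mirrors the `value_counts() / len(X_train)` computation used in the example on this page.

```python
import numpy as np
import pandas as pd


def predominant_share(s: pd.Series, missing_values: str = "raise") -> float:
    """Hypothetical helper: fraction of rows taken by the most frequent value."""
    if missing_values == "raise" and s.isna().any():
        raise ValueError("The variable contains missing values.")
    # 'include' counts NaN as one more value; 'ignore' leaves NaN uncounted.
    counts = s.value_counts(dropna=(missing_values != "include"))
    # Assumption: the denominator is always the total number of rows.
    return float(counts.max() / len(s))


s = pd.Series(["a", "a", np.nan, np.nan, np.nan])

share_ignore = predominant_share(s, "ignore")    # "a" fills 2 of 5 rows
share_include = predominant_share(s, "include")  # NaN fills 3 of 5 rows

try:
    predominant_share(s, "raise")
    raised = False
except ValueError:
    raised = True
```

Under these assumptions, a threshold of `tol=0.5` would flag this variable with 'include' but not with 'ignore', so the choice of option can change which features are dropped.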
Attributes
features_to_drop_:
List with constant and quasi-constant features.
variables_:
The variables to consider for the feature selection.
n_features_in_:
The number of features in the train set used in fit.
See also
sklearn.feature_selection.VarianceThreshold
Notes
This transformer is similar in concept to the VarianceThreshold from Scikit-learn, but it evaluates the frequency of the most common value instead of the variance.
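The difference between the two criteria can be seen on a toy dataframe. The sketch below uses plain pandas as a stand-in for both approaches (`DataFrame.var()` for a variance-based criterion, and the share of the most frequent value for a frequency-based one). The column names and the data are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    # Same value in 99% of rows, but one extreme value inflates the variance.
    "spiky": [0] * 99 + [1000],
    # Every value is different, but all values are tiny, so variance is low.
    "tiny": [i / 1000 for i in range(100)],
})

# Variance-based view (sample variance per column).
variances = df.var()

# Frequency-based view: share of rows taken by the most frequent value.
shares = df.apply(lambda s: s.value_counts().max() / len(s))
```

Here `spiky` has a large variance but a predominant-value share of 0.99, while `tiny` has a very small variance but a share of only 0.01. A variance threshold would tend to keep `spiky` and discard `tiny`, whereas a frequency threshold such as `tol=0.95` would do the opposite.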
Methods
fit:
Find constant and quasi-constant features.
transform:
Remove constant and quasi-constant features.
fit_transform:
Fit to the data. Then transform it.
Example
DropConstantFeatures() drops constant and quasi-constant variables from a dataframe. By default, it drops only constant variables. This transformer works with both numerical and categorical variables.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from feature_engine.selection import DropConstantFeatures

# Load dataset
def load_titanic():
    data = pd.read_csv('https://www.openml.org/data/get_csv/16826755/phpMYEkMl')
    # missing values are encoded as '?' in this file
    data = data.replace('?', np.nan)
    # keep only the first character of the cabin (NaN becomes 'n')
    data['cabin'] = data['cabin'].astype(str).str[0]
    data['pclass'] = data['pclass'].astype('O')
    # assign the result instead of calling fillna in place on a column
    data['embarked'] = data['embarked'].fillna('C')
    return data

# load data as pandas dataframe
data = load_titanic()

# Separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['survived', 'name', 'ticket'], axis=1),
    data['survived'], test_size=0.3, random_state=0)

# set up the transformer
transformer = DropConstantFeatures(tol=0.7, missing_values='ignore')

# fit the transformer
transformer.fit(X_train)

# transform the data
train_t = transformer.transform(X_train)

# features identified as constant or quasi-constant
transformer.features_to_drop_
['parch', 'cabin', 'embarked']
The following code snippets show that, for the variables parch and embarked, more than 70% of the observations share the same value:
X_train['embarked'].value_counts() / len(X_train)
S 0.711790
C 0.197598
Q 0.090611
Name: embarked, dtype: float64
71% of the passengers embarked in S.
X_train['parch'].value_counts() / len(X_train)
0 0.771834
1 0.125546
2 0.086245
3 0.005459
4 0.004367
5 0.003275
6 0.002183
9 0.001092
Name: parch, dtype: float64
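77% of the passengers travelled with no parents or children. Because the most frequent value in each of these features covers more than 70% of the observations (tol=0.7), they were deemed quasi-constant and removed.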