Univariate feature selection is a method employed in machine learning for identifying the most pertinent features within a dataset. It evaluates each feature individually using a designated statistical measure or scoring function. The objective of univariate feature selection is to identify the features that demonstrate the strongest relationship with the target variable, without considering potential interactions or dependencies between the features. The process typically involves the following steps:
Feature scoring: Each feature is individually evaluated using statistical measures such as correlation, mutual information, chi-squared test, or ANOVA. These measures quantify the relationship or dependence between each feature and the target variable.
Ranking or thresholding: The features are ranked by their scores, typically from highest to lowest. Alternatively, a predetermined threshold value can be established, and features with scores surpassing the threshold are considered for selection.
Feature subset selection: The top-ranked features, or those exceeding the threshold, are chosen as the subset of relevant features, while the remaining features are discarded. A minimal sketch of these three steps is shown below.
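The following is a minimal sketch of the three steps, scoring each feature with scikit-learn's f_regression (an F-test for regression targets) and then ranking and selecting by hand with NumPy; the cutoff k = 3 is an arbitrary choice for illustration:

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import f_regression

X, y = load_diabetes(return_X_y=True)

# Step 1 (feature scoring): compute an F-statistic for each feature on its own
scores, p_values = f_regression(X, y)

# Step 2 (ranking): order the feature indices from highest to lowest score
ranking = np.argsort(scores)[::-1]

# Step 3 (subset selection): keep the top k features and discard the rest
k = 3  # arbitrary cutoff for illustration
top_k = np.sort(ranking[:k])
X_selected = X[:, top_k]
print("Selected feature indices:", top_k)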
The following are the advantages of using univariate selection:
Dimensionality reduction: Univariate feature selection effectively reduces the dimensionality of the dataset by selecting only the most informative features. This simplifies models, improves computational efficiency, and reduces the risk of overfitting.
Interpretability: Univariate feature selection prioritizes individual feature relevance, enhancing interpretability. The selected features are understandable and explainable, promoting model transparency and trust.
Preprocessing: Univariate feature selection can be used as a preprocessing step to identify a subset of features for further analysis or model development. It helps identify important input variables that significantly contribute to the target variable (see the pipeline sketch after this list).
Baseline method: Univariate feature selection serves as a straightforward baseline or initial approach for feature selection. It provides a basis for comparison with more advanced techniques, such as recursive feature elimination or feature importance from ensemble models.
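To illustrate the preprocessing point above, the sketch below chains SelectKBest with a simple model inside a scikit-learn Pipeline; LinearRegression and k=4 are arbitrary choices here, not a prescribed setup:

from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Univariate feature selection as a preprocessing step before the model
pipeline = Pipeline([
    ("select", SelectKBest(score_func=f_regression, k=4)),
    ("model", LinearRegression()),
])
pipeline.fit(X_train, y_train)
print("R^2 on the test set:", pipeline.score(X_test, y_test))

Fitting the selector inside the pipeline ensures the feature scores are computed only from the training split, which avoids leaking information from the test set.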
Here’s an example code snippet that demonstrates how to perform univariate feature selection:
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectKBest, f_regression

# Load the diabetes dataset
data = load_diabetes()
X, y = data.data, data.target

# Print the dataset information
print("Dataset:")
print("Features:")
print(data.feature_names)
print("\nTarget:")
print("Disease progression one year after baseline")

# Perform univariate feature selection
selector = SelectKBest(score_func=f_regression, k=4)  # Select top 4 features
X_selected = selector.fit_transform(X, y)

# Get the indices of the selected features
selected_feature_indices = selector.get_support(indices=True)

# Print the selected features
print("\nSelected Features:")
selected_features = [data.feature_names[i] for i in selected_feature_indices]
print(selected_features)
The code above is explained in detail below:
Lines 1–2: These lines import the necessary modules from scikit-learn. The load_diabetes function is used to load the diabetes dataset, while SelectKBest and f_regression are used for univariate feature selection.
Line 5: Here, we load the diabetes dataset.
Line 6: Here, we assign the features X and the target variable y to separate variables.
Lines 9–13: Here, we simply print the dataset information.
Line 16: Here, we create a SelectKBest object for univariate feature selection. The parameter k=4 specifies that we want to select the top 4 features.
Line 17: Here, we perform the univariate feature selection. It fits the selector to the data X and target variable y using the fit_transform method, which returns the transformed dataset X_selected containing only the selected features.
Line 20: Here, we retrieve the indices of the selected features using the get_support method of the selector object.
Lines 23–25: Here, we print the selected features.
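If you also want to inspect the raw scores behind the ranking, the fitted SelectKBest object exposes them through its scores_ attribute (and the corresponding p-values through pvalues_). A small follow-on sketch, assuming it runs after the code above:

# Pair each feature name with its univariate F-score, highest first
feature_scores = sorted(
    zip(data.feature_names, selector.scores_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in feature_scores:
    print(f"{name}: {score:.2f}")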
Univariate feature selection is a valuable technique in machine learning for identifying essential features in a dataset. By evaluating individual features using statistical measures and selecting those with the highest relevance to the target variable, this method offers several advantages. It reduces dimensionality, enhances interpretability, aids in preprocessing, and serves as a baseline approach for feature selection.