Classification is a supervised machine learning task in which data is categorized into classes.
In binary classification, the data is divided into two classes. However, far less data may be available for one class than the other, resulting in an imbalanced dataset.
The disadvantage of this is that the model learns mostly from the majority class and largely ignores the minority one. Although the model might report high accuracy, its ability to classify the minority class is impaired; for example, on a 95/5 split, a model that always predicts the majority class scores 95% accuracy while never detecting a single minority sample.
One solution to class imbalance is to assign class weights. Weights ensure that the model pays more attention to the underlying patterns of the minority class and therefore reduce misclassification errors.
The `class_weight` parameter

Most algorithms have a built-in parameter called `class_weight` that can be used to offset the class imbalance. Logistic regression is one such example. By default, this parameter is set to `None`, but it can also take a dictionary or the string `'balanced'`.

When set to `'balanced'`, the values of `y` (target) are used to automatically adjust weights inversely proportional to class frequencies in the input data as `n_samples / (n_classes * np.bincount(y))`
Where:

- `n_samples` is the total number of rows in the dataset.
- `n_classes` is the number of classes in the dataset.
- `np.bincount(y)` is the count of each class in the dataset.

For a dataset with 1000 rows and 2 classes consisting of 100 and 900 samples for the minority and majority class respectively, the weights assigned will be as follows:
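Working this out for the 100/900 example (a quick sketch applying the formula above):

```python
import numpy as np

y = np.array([0] * 900 + [1] * 100)  # 900 majority, 100 minority samples
n_samples, n_classes = len(y), 2

# n_samples / (n_classes * np.bincount(y))
weights = n_samples / (n_classes * np.bincount(y))
print(weights)  # [0.5555... 5.] -> majority class ~0.56, minority class 5.0
```

The minority class receives roughly nine times the weight of the majority class, exactly offsetting the 9:1 imbalance.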
Weights can also be assigned manually, especially if the results of the `balanced` heuristic are unsatisfactory. The first snippet below trains a model with `class_weight='balanced'`; the second computes and passes a custom weights dictionary.
```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
import numpy as np
import pandas as pd
from sklearn.metrics import f1_score, accuracy_score, confusion_matrix

# Generate an imbalanced dataset: ~99.5% majority class, ~0.5% minority class
X, y = make_classification(n_samples=100000, n_features=2, n_informative=2,
                           n_redundant=0, n_repeated=0, n_classes=2,
                           n_clusters_per_class=1, weights=[0.995, 0.005],
                           class_sep=0.5, random_state=42)

# Convert the data from a NumPy array to a pandas DataFrame
df = pd.DataFrame({'feature1': X[:, 0], 'feature2': X[:, 1], 'target': y})
print(round(df.target.value_counts(normalize=True) * 100, 1))

X = df.drop(columns='target')
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)
print(X_train.shape, y_train.shape)

# Train a logistic regression model with automatically balanced class weights
model_1 = LogisticRegression(class_weight='balanced')
model_1.fit(X_train, y_train)
print(f'Accuracy score on balanced weights: {model_1.score(X_test, y_test)*100:.1f}%')
print(f'F1 score on balanced weights: {f1_score(y_test, model_1.predict(X_test)):.3f}')
conf_matrix_1 = confusion_matrix(y_test, model_1.predict(X_test))
print(conf_matrix_1)
```
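The weights that `'balanced'` derives can also be inspected directly with scikit-learn's `compute_class_weight` utility, which implements the same formula. A short sketch, reusing `y_train` from the code above:

```python
from sklearn.utils.class_weight import compute_class_weight
import numpy as np

balanced = compute_class_weight(class_weight='balanced',
                                classes=np.unique(y_train),
                                y=y_train)
print(dict(zip(np.unique(y_train), balanced)))
# With a ~99.5/0.5 split, the minority class weight is roughly 100,
# while the majority class weight is close to 0.5.
```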
The helper below assigns each class a weight of `log(mu * total / count)`, floored at 1, so frequent classes keep a weight of 1 while rare classes receive progressively larger weights:

```python
def class_weight(labels_dict, mu=0.15):
    # Smoothed manual weights: log(mu * total / count), floored at 1
    total = sum(labels_dict.values())
    weights = dict()
    for label, count in labels_dict.items():
        score = np.log((mu * total) / float(count))
        # Scores below 1 (common classes) are floored at 1
        weights[label] = score if score > 1 else 1
    return weights

labels_dict = y.value_counts().to_dict()
weights = class_weight(labels_dict)
print('labels dictionary: ', labels_dict)
print('weights: ', weights)

# Train with the manually computed weights passed as a dictionary
model = LogisticRegression(class_weight=weights)
model.fit(X_train, y_train)
print(f'Model score for manual weights: {model.score(X_test, y_test)*100:.1f}%')
print(f'F1 score for manual weights: {f1_score(y_test, model.predict(X_test)):.2f}')
conf_matrix = confusion_matrix(y_test, model.predict(X_test))
print(conf_matrix)
```
The code above generates an imbalanced dataset with two classes using `make_classification`, converts the data into a DataFrame, splits it into training and testing sets, and trains models with different class weights.
The first model uses `balanced` class weights and obtains 80% accuracy but a low `f1_score` of 0.057.
The second model passes a dictionary as the weight parameter and obtains 99% accuracy and an improved `f1_score` of 0.39. The closer the `f1_score` is to 1, the better the model.
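For intuition, `f1_score` is the harmonic mean of precision and recall, so it penalizes a model that detects the minority class poorly even when overall accuracy looks high. A small sketch computing it from a confusion matrix (the matrix values below are hypothetical, for illustration only):

```python
import numpy as np

# Hypothetical 2x2 confusion matrix in scikit-learn's layout:
# [[TN, FP],
#  [FN, TP]]  (rows = actual class, columns = predicted class)
cm = np.array([[15900, 4000],
               [10, 90]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)   # fraction of predicted positives that are correct
recall = tp / (tp + fn)      # fraction of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)
print(f'precision={precision:.3f}, recall={recall:.3f}, f1={f1:.3f}')
```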
Note: There are other ways of dealing with class imbalance, such as oversampling, undersampling, and data augmentation; the snippet below sketches simple random oversampling.
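A minimal random-oversampling sketch using `sklearn.utils.resample`, assuming the `df` DataFrame generated in the earlier example (class `1` is the minority):

```python
from sklearn.utils import resample
import pandas as pd

majority = df[df.target == 0]
minority = df[df.target == 1]

# Randomly duplicate minority rows (with replacement) until the classes match
minority_upsampled = resample(minority, replace=True,
                              n_samples=len(majority), random_state=42)
df_balanced = pd.concat([majority, minority_upsampled])
print(df_balanced.target.value_counts())  # both classes now equally sized
```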