Scikit-learn is a Python library for machine learning whose algorithms we can use for various tasks, including classification, regression, clustering, and model selection. In this Answer, our main focus will be on the classification domain of scikit-learn.
Classification is a fundamental ML task that involves assigning predefined labels or categories to input data based on patterns or features. The main goal here is to create a mapping between input features and the corresponding target labels.
The algorithm we'll discuss in this Answer, KNeighborsClassifier, is a classification algorithm.
KNeighborsClassifier
KNeighborsClassifier is an algorithm that categorizes a data point according to the labels found among that point's nearest data points, or neighbors. Let's take the already labeled data points below.
Now, if a new data point, highlighted using the arrow, is to be added to the set, its neighbors will be explored first. Based on the above example, it is highly likely that it will be assigned the beige class.
Imagine we have a dataset of stars in the night sky, each represented by features such as brightness, distance from Earth, and other spectral characteristics. Further, each star has a label indicating its type, e.g., main sequence, red giant, or white dwarf.
Suppose we want to classify a newly discovered star based on its features. For this, we can use the KNeighborsClassifier algorithm. How? Let's see.
First, we'll gather a training dataset with labeled stars, including their features and corresponding types.
A crucial step is to select a value for k, representing the number of nearest neighbors to consider. Let's say we set it to five.
When a new star is observed and its features are measured, the algorithm calculates the distances between this star and all the already existing stars based on their feature values. It then identifies the five nearest neighbors in the training data.
To see KNeighborsClassifier in action, let's check the labels of these five nearest neighbors. Assuming four are labeled as main sequence stars and one as a red giant, we can begin our prediction phase.
Note: Based on this, the algorithm will predict that the new star is likely a main sequence star.
Taking a bird's-eye view of the scenario, we can say that stars with similar features, like brightness, tend to belong to the same type. By finding the nearest labeled stars in the training data and letting their labels decide the new star's type, the algorithm uses patterns observed in the known stars to classify the unknown one.
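To make the star scenario concrete, here is a minimal sketch in scikit-learn. The feature values and star labels below are invented purely for illustration; only the KNeighborsClassifier API itself comes from the library.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical training data: [brightness, distance from Earth]
# All numbers and labels are made up for illustration.
X_train = np.array([
    [0.80, 1.2], [0.70, 1.0], [0.90, 1.1], [0.60, 1.3],   # main sequence
    [0.30, 4.0], [0.20, 4.5], [0.25, 3.8],                # red giant
    [0.05, 0.5], [0.04, 0.6], [0.06, 0.4],                # white dwarf
])
y_train = np.array(["main sequence"] * 4 + ["red giant"] * 3 + ["white dwarf"] * 3)

# k = 5: the five nearest labeled stars vote on the new star's type
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# A newly observed star with measured features
new_star = [[0.75, 1.15]]
print(knn.predict(new_star))  # ['main sequence']: 4 of its 5 nearest neighbors are main sequence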
Equal number of neighbors?

What if the vote among the k nearest neighbors turns out to be tied? A common remedy is to pick an odd value of k, or to weight each neighbor's vote by its distance so that closer neighbors count more.
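As a sketch of the distance-weighting idea: the toy coordinates below are invented, while weights='distance' is a real KNeighborsClassifier parameter.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy data: with k=4, the query point picks up 2 neighbors of each class
X = np.array([[0.0, 1.0], [0.0, -1.0], [2.0, 1.0], [2.0, -1.0]])
y = np.array([0, 0, 1, 1])

# weights='distance' makes closer neighbors count more,
# so the 2-vs-2 tie in raw counts is broken by proximity
knn = KNeighborsClassifier(n_neighbors=4, weights="distance")
knn.fit(X, y)
print(knn.predict([[0.5, 0.0]]))  # closer to the two class-0 points -> [0]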
To demonstrate the usage of this algorithm, let's take the following code. This code trains a KNeighborsClassifier on a generated dataset and then visualizes the classification using a scatter plot and a pseudocolor plot.
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier

X,y = make_blobs(n_samples = 550, centers = 4, random_state = 50)

knn = KNeighborsClassifier(n_neighbors = 5)

knn.fit(X, y)

h = 0.05

minimumX = X[:, 0].min() - 1
maximumX = X[:, 0].max() + 1
minimumY = X[:, 1].min() - 1
maximumY = X[:, 1].max() + 1

xx, yy = np.meshgrid(np.arange(minimumX, maximumX, h), np.arange(minimumY, maximumY, h))

Z = knn.predict(np.c_[xx.ravel(), yy.ravel()])

cmapOne = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF', '#FFCCAA'])
cmapTwo = ListedColormap(['#FF0000', '#00FF00', '#0000FF', '#FF8800'])

Z = Z.reshape(xx.shape)
plt.figure()
plt.pcolormesh(xx, yy, Z, cmap=cmapOne)

plt.scatter(X[:, 0], X[:, 1], c=y, cmap=cmapTwo, edgecolor = 'k', s = 20)

plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())

plt.title("KNeighborsClassifier")

plt.show()
Note: Click the "Run" button to see the code in action. You can also experiment with the code to understand it better!
Lines 1–2: Prior to writing our main code, we import the necessary libraries. We import numpy as np, which provides numerical operations, and matplotlib.pyplot as plt, which is a plotting library.
Line 3: We import the ListedColormap class from the matplotlib.colors module. This class is used to create custom colormaps for our plot.
Line 4: We import the make_blobs function from the sklearn.datasets module. It's a built-in utility that generates a synthetic clustered dataset for us. The following snippet shows what this function returns.
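Here is a quick sketch inspecting the generated data; the shapes follow directly from the arguments we pass, while the actual coordinates depend on the random state.

import numpy as np
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=550, centers=4, random_state=50)
print(X.shape)       # (550, 2): 550 samples with 2 features each
print(y.shape)       # (550,): one cluster label per sample
print(np.unique(y))  # [0 1 2 3]: four centers give four classes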
Line 5: We import the KNeighborsClassifier class from the sklearn.neighbors module for classification tasks.
Line 7: We generate a dataset using the make_blobs function. We set the number of samples n_samples to 550, the number of centers centers to 4, and the random state random_state to 50. The generated dataset is assigned to the variables X and y, representing the input features and the corresponding labels, respectively.
Line 9: We create an instance of the KNeighborsClassifier class called knn. Here we can customize the number of neighbors n_neighbors according to our own needs. Let's say we set it to 5.
Line 11: We fit the model using the fit method of the knn object so that the KNeighborsClassifier model is trained on the dataset (X, y).
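Once the model is fitted, it can classify new points immediately. A minimal sketch, assuming the same X and y as above; the query coordinates are arbitrary.

from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier

X, y = make_blobs(n_samples=550, centers=4, random_state=50)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X, y)

# Arbitrary query points; each gets the majority label of its 5 nearest neighbors
new_points = [[0.0, 0.0], [5.0, -2.0]]
print(knn.predict(new_points))        # one predicted class label per point
print(knn.predict_proba(new_points))  # per-class fraction of the 5 neighbor votes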
Line 13: Moving on to the data visualization part, we set the value of h to 0.05. This variable defines the step size for the meshgrid.
Lines 15–18: We calculate the minimum and maximum values for the x and y coordinates of our plot, extending one unit below the dataset's minimum and one unit above its maximum. These are saved in minimumX, maximumX, minimumY, and maximumY.
Line 20: We use the np.meshgrid function to create a grid of coordinates. For this purpose, we pass the range of x-coordinates from minimumX to maximumX with a step size of h, and the range of y-coordinates from minimumY to maximumY with a step size of h. The resulting grid is stored in xx and yy.
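If np.meshgrid feels abstract, here is a tiny sketch with a coarse grid (step size 1 instead of h) showing what the two returned arrays hold:

import numpy as np

xx, yy = np.meshgrid(np.arange(0, 3, 1), np.arange(0, 2, 1))
print(xx)  # [[0 1 2]
           #  [0 1 2]] - the x-coordinate of every grid point
print(yy)  # [[0 0 0]
           #  [1 1 1]] - the y-coordinate of every grid point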
Line 22: We use the predict method of knn to classify each point on the meshgrid. The meshgrid coordinates are flattened with ravel, stacked into two columns with np.c_, and passed as input. The predicted labels are then stored in the variable Z.
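The flattening step can be sketched on the same tiny grid: ravel unrolls each coordinate array, and np.c_ pairs them into the two-column format that predict expects.

import numpy as np

xx, yy = np.meshgrid(np.arange(0, 3, 1), np.arange(0, 2, 1))
grid_points = np.c_[xx.ravel(), yy.ravel()]
print(grid_points)  # [[0 0]
                    #  [1 0]
                    #  [2 0]
                    #  [0 1]
                    #  [1 1]
                    #  [2 1]] - one (x, y) row per grid point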
Line 24: We create a ListedColormap object called cmapOne for the different classes.
Line 25: For our data points, we create cmapTwo with different colors.
Line 27: We reshape the predicted labels Z to match the shape of the meshgrid xx.
Line 28: We create a new figure for the plot using plt.figure().
Line 29: We use the plt.pcolormesh function to create a pseudocolor plot of the predicted labels Z on the meshgrid. For customizations, we pass the meshgrid coordinates xx and yy, the predicted labels Z, and the colormap cmapOne as parameters.
Line 31: We use the plt.scatter function to create a scatter plot of the input features X. For customizations, we pass the x-coordinates X[:, 0], the y-coordinates X[:, 1], the labels y, the colormap cmapTwo, the edgecolor as 'k', i.e., black, and the marker size s as 20.
Lines 33–34: We set the x-axis limits of the plot to the minimum and maximum values of xx, and the y-axis limits to those of yy.
Line 36: We set the plot title to "KNeighborsClassifier".
Line 38: Yay, our plot is now ready! We finally display it using the plt.show() function.
Our model is now trained to check the 5 closest neighbors and classify the given data points accordingly. That's the magic of the KNeighborsClassifier model!
Considering the above graph, if a point were to have three green neighbors as the closest ones, it would be colored green too.
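If we also want a number for how well the trained model generalizes, we can hold out part of the data before fitting and measure accuracy on it. A minimal sketch, with an arbitrary 80/20 split:

from sklearn.datasets import make_blobs
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_blobs(n_samples=550, centers=4, random_state=50)

# Hold out 20% of the points so evaluation uses data the model has not seen
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=50)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print(accuracy_score(y_test, knn.predict(X_test)))  # fraction of correct test predictions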
Test your KNeighborsClassifier knowledge!

Q: How do we specify the neighbor count to be considered?
A: We make use of n_neighbors.

Q: If a data point has 3 yellow neighbors and 2 blue neighbors, which class will this new point be assigned: the yellow class or the blue class?
A: The yellow class, since the majority of its nearest neighbors are yellow.