K-means clustering in R using the nstart parameter

To recap, the k-means algorithm is a clustering method that partitions n data points into k clusters such that each data point belongs to the cluster whose center is nearest to it.

Note: To learn more about the k-means algorithm, check out this Answer.

In this Answer, we’ll focus on implementing the k-means algorithm in R using the nstart parameter. The steps to do this are explained below. Let’s get started!

Step 1: Initialize the dataset

Firstly, we will generate some data points to make a sample dataset from which we can perform k-means clustering.

# Sample data being generated
set.seed(123)
data <- data.frame(
  x = rnorm(100, mean = c(1, 4), sd = c(1, 1)),
  y = rnorm(100, mean = c(1, 4), sd = c(1, 1))
)
# Printing the data points
data

We will use the rnorm method to generate the data points. It creates 100 normally distributed random numbers along each of the x and y dimensions. Because the mean vector c(1, 4) is recycled, the draws alternate between a distribution centered at 1 and one centered at 4, each with a standard deviation of 1, which gives the dataset two loose groups.
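We can verify this recycling behavior directly. The snippet below is an illustrative check (not part of the original example): odd-indexed draws should center near 1 and even-indexed draws near 4.

```r
set.seed(123)
x <- rnorm(100, mean = c(1, 4), sd = c(1, 1))

# mean = c(1, 4) is recycled: positions 1, 3, 5, ... use mean 1,
# while positions 2, 4, 6, ... use mean 4
mean(x[seq(1, 100, by = 2)])  # close to 1
mean(x[seq(2, 100, by = 2)])  # close to 4
```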

Step 2: Perform k-means clustering

Next, we will run the k-means algorithm with the nstart parameter.

# Perform k-means clustering with k = 4
kmeans_result <- kmeans(data, centers = 4, nstart = 10)

Here, the centers = 4 parameter specifies that four clusters will be formed from the dataset created in the previous step. The nstart = 10 parameter is more significant.
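The object returned by kmeans() carries more than just the cluster labels. A quick way to inspect the fit, shown here on the same data as in step 1, is to look at a few of its components:

```r
set.seed(123)
data <- data.frame(
  x = rnorm(100, mean = c(1, 4), sd = c(1, 1)),
  y = rnorm(100, mean = c(1, 4), sd = c(1, 1))
)
kmeans_result <- kmeans(data, centers = 4, nstart = 10)

kmeans_result$size         # how many points fell into each cluster
kmeans_result$centers      # coordinates of the four cluster centers
kmeans_result$tot.withinss # total within-cluster sum of squares (WCSS)
```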

Why use nstart?

In R's kmeans() function, the nstart parameter specifies how many random sets of initial cluster centers to try. The clustering process is run once per initialization, and the solution with the lowest total within-cluster sum of squares (WCSS) is kept, which helps the algorithm avoid poor local optima.

Note: The default value for nstart is 1, meaning the algorithm is run only once with a single set of initial cluster centers. We can increase nstart to a larger number (e.g., 10, 20, etc.) to improve the chances of finding a better clustering solution.
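One way to see the effect is to compare a single random start with many. The snippet below is a sketch: because starts are random, a single-start fit can land in a poor local optimum, while the multi-start fit keeps the best of 25 runs, so its WCSS is typically no worse.

```r
set.seed(123)
data <- data.frame(
  x = rnorm(100, mean = c(1, 4), sd = c(1, 1)),
  y = rnorm(100, mean = c(1, 4), sd = c(1, 1))
)

fit_single <- kmeans(data, centers = 4, nstart = 1)   # one random start
fit_multi  <- kmeans(data, centers = 4, nstart = 25)  # best of 25 starts

fit_single$tot.withinss
fit_multi$tot.withinss  # typically no larger than the single-start WCSS
```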

Step 3: Plot the clustered data

Finally, we will plot the clustered data using the ggplot2 package, which is used for creating data visualizations, such as plots and graphs:

# Load the necessary libraries
library(ggplot2)
# Original data with an additional column 'cluster' is combined with the result of k-means
clustered_data <- cbind(data, cluster = as.factor(kmeans_result$cluster))

# Create a scatter plot
plot <- ggplot(clustered_data, aes(x = x, y = y, color = cluster)) +
  geom_point(size = 3) +
  labs(title = "K-means Clustering") +
  theme_minimal()

Let’s explain the code below:

  • Line 4: Here, we combine the original data with an additional column 'cluster' that contains the cluster assignments obtained from the k-means clustering result. We use the as.factor() method to convert the cluster assignments to factors so that ggplot2 treats them as categorical data.

  • Lines 7–10: Here, we use the ggplot() function to create a scatter plot. The aes() function defines the aesthetics: the x and y aesthetics are mapped to the columns x and y of clustered_data, and the color of the points is determined by the 'cluster' column. The geom_point() function adds a layer of points to the plot with a size of 3, labs(title = "K-means Clustering") sets the plot's title, and theme_minimal() applies a minimalistic theme that adjusts the plot's appearance.
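
Note that building the ggplot object does not render it on its own inside a script. The follow-up below, which rebuilds the full pipeline so it is self-contained, displays the plot explicitly and saves it to disk; the filename "kmeans_clusters.png" is just an example:

```r
library(ggplot2)

# Rebuild the data and clustering from the earlier steps
set.seed(123)
data <- data.frame(
  x = rnorm(100, mean = c(1, 4), sd = c(1, 1)),
  y = rnorm(100, mean = c(1, 4), sd = c(1, 1))
)
kmeans_result <- kmeans(data, centers = 4, nstart = 10)
clustered_data <- cbind(data, cluster = as.factor(kmeans_result$cluster))

plot <- ggplot(clustered_data, aes(x = x, y = y, color = cluster)) +
  geom_point(size = 3) +
  labs(title = "K-means Clustering") +
  theme_minimal()

print(plot)  # explicit print is needed inside scripts, loops, and functions
ggsave("kmeans_clusters.png", plot = plot, width = 6, height = 4, dpi = 150)
```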

Conclusion

We can perform k-means clustering on a random dataset and visualize the results with a scatter plot that uses a different color for each cluster. Adjusting the nstart parameter mitigates the sensitivity of k-means to its initial cluster centers, resulting in more reliable and stable clustering outcomes. Tuning this parameter can be crucial, especially when dealing with complex datasets, and leads to a better exploration and understanding of the data given to us.
