To recap, the k-means algorithm is a clustering method that partitions a dataset into k clusters, assigning each data point to the cluster whose center is nearest.
Note: To learn more about the k-means algorithm, check out this Answer.
In this Answer, we’ll focus on implementing the k-means algorithm in R using the nstart
parameter. The steps to do this are explained below. Let’s get started!
Firstly, we will generate some data points to make a sample dataset from which we can perform k-means clustering.
# Sample data being generated
set.seed(123)
data <- data.frame(x = rnorm(100, mean = c(1, 4), sd = c(1, 1)),
                   y = rnorm(100, mean = c(1, 4), sd = c(1, 1)))

# Printing the data points
data
We will use the rnorm
method to generate the data points. Each call creates 100 normally distributed random numbers, with means of 1 and 4 and standard deviations of 1 and 1, respectively.
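Because rnorm() recycles the mean vector, the odd-indexed draws are centered around 1 and the even-indexed draws around 4, which is what gives the sample its two loose groups. The quick check below is an illustrative snippet (not part of the original code) that reuses the data frame generated above:
# Look at the first few generated points
head(data)

# Odd-indexed draws come from the mean-1 distribution,
# even-indexed draws from the mean-4 distribution (vector recycling)
mean(data$x[seq(1, 100, by = 2)])   # roughly 1
mean(data$x[seq(2, 100, by = 2)])   # roughly 4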
Next, we will run the k-means clustering algorithm with the nstart
parameter.
# Perform k-means clustering with k = 4
kmeans_result <- kmeans(data, centers = 4, nstart = 10)
Here, the centers = 4
parameter means that four clusters will be formed from the dataset created in the previous step. The nstart = 10
parameter is the more significant of the two.
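Before looking at nstart more closely, it can help to inspect what kmeans() returns. The short snippet below is an added illustration that reuses the kmeans_result object from above and prints a few of the standard components of the fitted object:
# Coordinates of the four fitted cluster centers
kmeans_result$centers

# Number of points assigned to each cluster
kmeans_result$size

# Total within-cluster sum of squares (lower means tighter clusters)
kmeans_result$tot.withinss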
In k-means clustering in R, the nstart
parameter specifies the number of initial random cluster assignments to be tried. It is used to find the best initial set of cluster centers by repeatedly running the clustering process with different initializations and selecting the one with the lowest total within-cluster sum of squares (WCSS).
Note: The default value for nstart
is 1, meaning the algorithm is run only once with a single set of initial cluster centers. We can increase nstart
to a larger number (e.g., 10, 20, etc.) to improve the chances of finding a better clustering solution.
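As a rough illustration of the effect, the sketch below (an added example that assumes the same data object) reruns kmeans() with nstart = 1 and nstart = 25 and compares the total WCSS; the multi-start run keeps the best of its initializations, so its WCSS is typically no larger:
# Single random initialization
set.seed(42)
single_start <- kmeans(data, centers = 4, nstart = 1)

# 25 random initializations; the run with the lowest WCSS is kept
set.seed(42)
multi_start <- kmeans(data, centers = 4, nstart = 25)

# Compare the total within-cluster sum of squares
single_start$tot.withinss
multi_start$tot.withinss   # usually no larger than the single-start value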
Finally, we will plot the clustered data using the ggplot2
package, which is used for creating data visualizations, such as plots and graphs:
# Load the necessary libraries
library(ggplot2)
# Original data with an additional column 'cluster' is combined with the result of k-means
clustered_data <- cbind(data, cluster = as.factor(kmeans_result$cluster))

# Create a scatter plot
plot <- ggplot(clustered_data, aes(x = x, y = y, color = cluster)) +
  geom_point(size = 3) +
  labs(title = "K-means Clustering") +
  theme_minimal()
The code above is explained below:
Line 4: Here, we combine the original data with an additional cluster column that contains the cluster assignments obtained from the k-means clustering result. We use the as.factor()
method to convert the cluster assignments to factors so that they are handled as categorical data.
Lines 7–10: Here, we use the ggplot()
function to create a scatter plot, where the aes()
function defines the aesthetics, mapping the x and y columns of clustered_data
to the axes. The color of the points is determined by the cluster
column. The geom_point()
function adds a layer of points to the plot, with a specified size of 3. The labs(title = "K-means Clustering")
function sets the title of the plot to "K-means Clustering"
. Lastly, theme_minimal()
applies a minimalistic theme to the plot, adjusting its appearance.
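Note that the code above only builds the plot object; it still needs to be printed (or saved) to be displayed. A minimal way to do that, assuming the plot object from the previous step, is:
# Display the plot in the active graphics device
print(plot)

# Optionally save it to a file (the file name here is just an example)
ggsave("kmeans_clustering.png", plot = plot, width = 6, height = 4)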
We have performed k-means clustering on a randomly generated dataset and created a scatter plot that visualizes the clustering results with a different color for each cluster. Adjusting the nstart
parameter mitigates the sensitivity of k-means to the initial cluster centers, resulting in more reliable and stable clustering outcomes. Fine-tuning this parameter can be crucial, especially when dealing with complex datasets, and leads to a better exploration and understanding of the data.