K-means data clustering

Using k-means algorithm to cluster data

This tutorial explores the use of the k-means algorithm to cluster data. K-means clustering is widely used for unsupervised learning tasks. The algorithm uses features to divide the data into K groups whose members are most closely related to one another. These groups are found by minimizing the within-cluster sum-of-squares. Because the task is unsupervised, there is no target variable Y; instead, the k-means algorithm assigns a cluster number to each observation. This tutorial examines how to:

Load data from a dataset using loadd.
Visualize a 2D dataset to identify the number of clusters.
Fit a k-means model to a dataset using kmeansFit.
Plot clustered data using plotClasses.
Add centroids to the plotted data.
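
The within-cluster sum-of-squares mentioned above can be written out explicitly. The notation here is introduced only for illustration and does not come from the GAUSS documentation: C_k is the set of observations assigned to cluster k, x_i is the feature vector of observation i, and mu_k is the centroid (mean) of cluster k.

```latex
\min_{C_1,\dots,C_K} \; \sum_{k=1}^{K} \sum_{i \in C_k} \lVert x_i - \mu_k \rVert^2,
\qquad
\mu_k = \frac{1}{|C_k|} \sum_{i \in C_k} x_i
```

Each observation contributes its squared distance to the centroid of the cluster it belongs to, so tighter clusters give a smaller total.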

Load data

The data for this tutorial is stored in the file kmeans_data.csv. This tutorial uses loadd to load the dataset into GAUSS prior to fitting the model. The function loadd uses the GAUSS formula string format, which allows data to be loaded and transformed in a single line. Detailed information on using formula strings is available in the formula string tutorials.

The formula string syntax in loadd uses two specifications:

The dataset specification
A formula that specifies how to load the data. The formula is optional if the complete dataset is to be loaded.

new;
cls;
library gml;
rndseed 234234;

//Load k-means example dataset
x = loadd(getGAUSSHome() $+ "pkgs/gml/examples/kmeans_data.csv");
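
As a rough sketch of the two-part formula string syntax described above, the lines below pass a formula as the second input to loadd. This assumes the "." shorthand for "all variables" behaves in loadd as described in the formula string tutorials. The column names x1 and x2 are hypothetical, so that line is left commented out.

```gauss
//Path to the same file used above
fname = getGAUSSHome() $+ "pkgs/gml/examples/kmeans_data.csv";

//Assumption: "." in the formula position requests every variable in the file
x_all = loadd(fname, ".");

//Hypothetical column names: replace x1 and x2 with names that exist in your file
//x_sub = loadd(fname, "x1 + x2");

//Confirm the number of rows and columns that were loaded
print rows(x_all)~cols(x_all);
```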

Visualize the data

The kmeansFit function in GAUSS requires the number of clusters as a user input. Visualizing the data is one helpful step towards choosing an appropriate number of clusters. Since we only need a quick look at the data for model setup, the plotScatter function can be used with its default format settings:

//View plot to get idea of clusters
plotScatter(x[.,1], x[.,2]);

The resulting plot shows three clear clusters and suggests that we should use k = 3 for fitting our k-means model.

Fitting the k-means model

The k-means model is fit using the GAUSS procedure kmeansFit. The kmeansFit procedure takes two required inputs: a feature matrix and the number of clusters. In addition, a kmeansControl structure may optionally be included to specify model parameters.

The kmeansFit procedure returns all output in a kmeansModel structure. An instance of the kmeansModel structure must be declared prior to calling kmeansFit. Each instance of the kmeansModel structure contains the following members:

Member	Description
centroids	kxP matrix, containing the centroids with the lowest intra-cluster sum of squares.
assignments	Nx1 matrix, containing the centroid assignment for the corresponding observation of the input matrix.
totalSS	Scalar, sum, over all observations, of the squared differences of each observation from the overall mean.
clusterSS	Scalar, sum of squared differences between each observation and its assigned centroid.
elapsedIters	Scalar, the number of iterations taken by the 'start' with the lowest 'clusterSS'.

The code below fits a k-means model with three clusters to the data matrix, x:

//Step One: Declare kmeansModel struct
struct kmeansModel mdl;

//Step Two: Fit kmeans model
//Number of clusters suggested by the scatter plot
n_clusters = 3;
mdl = kmeansFit(x, n_clusters);
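
Once the model has been fit, the members listed in the table can be inspected directly. The lines below are a minimal sketch, not part of the original example. They assume that mdl.assignments holds integer labels 1 through k that index the rows of mdl.centroids, and the names c, a, and dev are introduced here for illustration.

```gauss
//Centroid coordinates, one row per cluster
print "Centroids:";
print mdl.centroids;

//Share of the total sum of squares that remains within clusters
print "clusterSS / totalSS:";
print mdl.clusterSS / mdl.totalSS;

//Recompute clusterSS from its definition
//Assumption: assignments are row indices (1..k) into the centroid matrix
c = mdl.centroids;
a = mdl.assignments;
dev = x - c[a, .];
print "Recomputed clusterSS:";
print sumc(sumc(dev.^2));
```

If the assumption about the assignment labels holds, the recomputed value should agree with mdl.clusterSS up to rounding.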

Plotting the assigned classes

The GAUSS plotClasses function provides a convenient tool for plotting the assigned clusters. It produces a 2-D scatter plot of the data matrix with each class plotted in a different color. The procedure requires two inputs: a two-column data matrix, x, and a vector of class labels, labels. The label vector may be either a string array or a numeric vector. Finally, the plot can be formatted by including an optional plotControl structure.
To start, let's set up the plotControl structure to add a title to our graph and to turn off the plot grid. This is done in four steps:

Declare an instance of the plotControl structure.
Fill the structure with the default settings for a scatter plot using plotGetDefaults.
Use plotSetTitle to specify the wording, font, font size, and font color for the graph title.
Use plotSetGrid to turn the grid off.

//Declare plotControl structure
struct plotControl myPlot;
myPlot = plotGetDefaults("scatter");

//Set up title
plotSetTitle(&myPlot, "K-means Clustering", "Arial", 16, "Black");

//Turn grid off
plotSetGrid(&myPlot, "off");

Next, we will plot the class assignments found using kmeansFit. These are stored in the kmeansModel member mdl.assignments:

//Step Four: Plot results
plotClasses(x, mdl.assignments, myPlot );
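
Since the label vector may also be a string array, the numeric assignments can be turned into descriptive labels before plotting. This is only a sketch: it assumes that ntos converts each element of a numeric vector to a string and that $+ prepends the text to every element. The name labels is introduced here for illustration.

```gauss
//Build string labels such as "Cluster 1" from the numeric assignments
labels = "Cluster " $+ ntos(mdl.assignments);

//Plot using the string labels instead of the numeric vector
plotClasses(x, labels, myPlot);
```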

The plot shows the same scatter points as our initial plot of the data. However, the plot now shows three clusters, plotted in red, green, and blue.

Adding centroids

This graph is helpful, but we may also want to see the centroids used to determine the clusters. To do this, we will write our own procedure built around the GAUSS plotAddScatter procedure. Our procedure will format and add the centroids. User-defined procedures always start with proc(number of returns) and end with endp. Any values returned from a procedure should be placed inside the statement retp(returns):

proc(1) = myNewProc(inputs);

  ...
  ...
  retp(myOutput);
endp;
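
To make the template above concrete, here is a small sketch of a procedure with one return. The procedure clusterSizes and the assignment vector a are made up for illustration and are not part of the GAUSS library; the procedure counts how many observations carry each cluster label.

```gauss
//Hypothetical helper: count the observations assigned to each of k clusters
proc (1) = clusterSizes(assignments, k);
    local sizes, i;

    sizes = zeros(k, 1);

    for i(1, k, 1);
        //Sum a 0/1 indicator of membership in cluster i
        sizes[i] = sumc(assignments .== i);
    endfor;

    retp(sizes);
endp;

//Example call with a small made-up assignment vector
a = { 1, 2, 2, 3, 2 };
print clusterSizes(a, 3);
```

For this made-up vector the counts are 1, 3, and 1. The same call could be made with mdl.assignments and the number of clusters used in the fit.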

Our procedure will take two inputs, both centroid vectors: the first and second coordinates of each centroid.

proc(0) = plotAddCentroids(centroid1, centroid2);

    //Set up plot format
    struct plotControl myPlot2;
    myPlot2 = plotGetDefaults("scatter");

    //Set line style
    plotSetLineStyle(&myPlot2, 1);

    //Set marker symbol
    plotSetLineSymbol(&myPlot2, 0);

    //Set marker color
    plotSetLineColor(&myPlot2, "black");

    //Add the centroids passed in as inputs to the existing plot
    plotAddScatter(myPlot2, centroid1, centroid2);

endp;

Once we have written our procedure, it can be called in the same way as any built-in GAUSS procedure:

//Add centroids
plotAddCentroids(mdl.centroids[.,1], mdl.centroids[.,2]);