## Using the k-means algorithm to cluster data

This tutorial explores the use of the k-means algorithm to cluster data. K-means clustering is widely used for unsupervised learning tasks. The algorithm uses features to divide data into K groups whose members are most closely related. These groups are found by minimizing the within-cluster sum of squares. Instead of requiring a target variable Y, the k-means algorithm assigns a cluster number to each observation. This tutorial examines how to:

- Load data from a dataset using `loadd`.
- Visualize a 2D dataset to identify the number of clusters.
- Fit a k-means model to a dataset using `kmeansFit`.
- Plot clustered data using `plotClasses`.
- Add centroids to the plotted data.
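
The within-cluster sum of squares that k-means minimizes can be written out explicitly. The notation is an assumption introduced here for illustration: $x_i$ is the $i$-th observation, $C_k$ is the set of observations assigned to cluster $k$, and $\mu_k$ is the centroid (mean) of that cluster.

$$
\min_{C_1,\dots,C_K} \; \sum_{k=1}^{K} \sum_{i \in C_k} \lVert x_i - \mu_k \rVert^2,
\qquad
\mu_k = \frac{1}{|C_k|} \sum_{i \in C_k} x_i
$$

Each observation contributes its squared distance to the centroid of its own cluster, so tighter clusters give a smaller total.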

### Load data

The data for this tutorial is stored in the file kmeans_data.csv. This tutorial uses `loadd` to load the dataset into GAUSS prior to fitting the model. The `loadd` function uses the **GAUSS** formula string format, which allows data to be loaded and transformed in a single line. Detailed information on using formula strings is available in the formula string tutorials.

The formula string syntax in `loadd` uses two specifications:

- The dataset specification.
- A formula that specifies how to load the data. The formula is optional if the complete dataset is to be loaded.

```
new;
cls;
library gml;
rndseed 234234;
//Load the k-means example dataset
x = loadd(getGAUSSHome() $+ "pkgs/gml/examples/kmeans_data.csv");
```
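
Before plotting, it can help to confirm what was loaded. The check below is a minimal sketch that assumes `x` was loaded as above and contains at least five rows; it uses only the built-in `rows` and `cols` functions, and the printed labels are illustrative.

```
// Confirm the size of the loaded data
print "Observations:" rows(x);
print "Features:    " cols(x);

// Preview the first five observations
print x[1:5,.];
```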

### Visualize the data

The `kmeansFit` function in GAUSS requires the number of clusters as a user input. Visualizing the data can be a helpful step towards choosing the correct number of clusters. Since we are looking for a quick visualization of the data for model setup, the `plotScatter` function can be used with default format settings:

```
//View plot to get idea of clusters
plotScatter(x[.,1], x[.,2]);
```

The resulting plot shows three clear clusters and suggests that we should use k = 3 for fitting our k-means model.

### Fitting the k-means model

The k-means model is fit using the **GAUSS** procedure `kmeansFit`. The `kmeansFit` procedure takes two required inputs: a feature matrix and the number of clusters. In addition, a `kmeansControl` structure may optionally be included to specify model parameters.

The `kmeansFit` procedure returns all output in a `kmeansModel` structure. An instance of the `kmeansModel` structure must be declared prior to calling `kmeansFit`. Each instance of the `kmeansModel` structure contains the following members:

| Member | Description |
|---|---|
| centroids | k x P matrix, containing the centroids with the lowest intra-cluster sum of squares. |
| assignments | N x 1 matrix, containing the centroid assignment for the corresponding observation of the input matrix. |
| totalSS | Scalar, sum, over all observations, of the squared differences of each observation from the overall mean. |
| clusterSS | Scalar, sum of squared differences between each observation and its assigned centroid. |
| elapsedIters | Scalar, the number of iterations taken by the 'start' with the lowest 'clusterSS'. |

The code below fits the k-means model to the data matrix, *x*:

```
//Number of clusters suggested by the scatter plot
n_clusters = 3;
//Step One: Declare kmeansModel struct
struct kmeansModel mdl;
//Step Two: Fit kmeans model
mdl = kmeansFit(x, n_clusters);
```
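
The members described in the table can be cross-checked against one another. The sketch below recomputes the within-cluster sum of squares from `mdl.centroids` and `mdl.assignments` and compares it with `mdl.clusterSS`. It assumes that the labels in `mdl.assignments` run from 1 to k and that every cluster has at least one observation; the variable names `k`, `wss`, `xi`, and `ratio` are introduced here for illustration.

```
// Recompute the within-cluster sum of squares from the fitted model.
// Assumes cluster labels in mdl.assignments run from 1 to k.
k = rows(mdl.centroids);
wss = 0;

for i(1, k, 1);
    // Rows of x assigned to cluster i
    xi = selif(x, mdl.assignments .== i);

    // Add the squared distances from centroid i
    wss = wss + sumc(sumc((xi - mdl.centroids[i,.]).^2));
endfor;

print "Recomputed within-cluster SS:" wss;
print "Reported clusterSS:          " mdl.clusterSS;

// Share of the total variation that remains within clusters
ratio = mdl.clusterSS / mdl.totalSS;
print "clusterSS / totalSS:         " ratio;
```

If the two sums of squares agree up to rounding, the labeling assumption holds. A small ratio indicates that most of the variation in the data lies between clusters rather than within them.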

### Plotting the assigned classes

The GAUSS `plotClasses` function provides a convenient tool for plotting the assigned clusters. The `plotClasses` function produces a 2-D scatter plot of the data matrix with each class plotted in a different color. The procedure requires two inputs: a data matrix with two columns, *x*, and a vector of class labels, *labels*. The label vector may be either a string array or a numeric vector. Finally, the plot can be formatted by including an optional `plotControl` structure.

To start, let's set up the `plotControl` structure to add a title to our graph and to turn off the grid on the plot. This is done in four steps:

- Declare an instance of the `plotControl` structure.
- Fill the structure with the default settings for a scatter plot using `plotGetDefaults`.
- Use `plotSetTitle` to specify the wording, font, font size, and font color for the graph title.
- Use `plotSetGrid` to turn the grid off.

```
//Declare plotControl structure
struct plotControl myPlot;
myPlot = plotGetDefaults("scatter");
//Set up title
plotSetTitle(&myPlot, "K-means Clustering", "Arial", 16, "Black");
//Turn grid off
plotSetGrid(&myPlot, "off");
```

Next, we will plot the class assignments found using `kmeansFit`. These are stored in the `kmeansModel` member *mdl.assignments*:

```
//Plot the cluster assignments
plotClasses(x, mdl.assignments, myPlot);
```
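
As a complement to the plot, the size of each cluster can be tabulated from *mdl.assignments*. This is a minimal sketch that assumes the labels run from 1 to k; the variable name `n_i` is illustrative.

```
// Count the observations assigned to each cluster.
// Assumes labels in mdl.assignments run from 1 to k.
for i(1, rows(mdl.centroids), 1);
    n_i = sumc(mdl.assignments .== i);
    print "Cluster" i "size:" n_i;
endfor;
```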

The plot shows the same scatter points as our initial plot of the data. However, the plot now shows three clusters, plotted in red, green, and blue.

### Adding centroids

This graph is helpful, but we may also be interested in seeing the centroids used to determine the clusters. To do this, we will write our own procedure built around the GAUSS `plotAddScatter` procedure. Our procedure will format and add the centroids. User-defined procedures always start with `proc(number returns)` and end with `endp`. Any returns from the procedure should be placed within the statement `retp(returns)`:

```
proc(1) = myNewProc(inputs);
...
...
retp(myOutput);
endp;
```
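
To make the template concrete, here is a small self-contained procedure with one return. The name `sqDist` and its inputs are hypothetical and chosen only for illustration; it computes the squared Euclidean distance between two row vectors, which is the quantity k-means sums over observations.

```
// Hypothetical helper: squared Euclidean distance between two row vectors
proc (1) = sqDist(a, b);
    local d;

    // Element-wise difference between the two vectors
    d = a - b;

    // Sum the squared differences across the columns
    retp(sumc((d.^2)'));
endp;

// Distance between (1, 2) and (4, 6): 9 + 16 = 25
print sqDist(1~2, 4~6);
```

The `local` statement keeps `d` private to the procedure so that it does not overwrite a global variable of the same name.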

Our procedure will take two inputs, the first and second columns of the centroid matrix:

```
proc(0) = plotAddCentroids(centroid1, centroid2);
    //Set up plot format
    struct plotControl myPlot2;
    myPlot2 = plotGetDefaults("scatter");
    //Set fill on marker
    plotSetLineStyle(&myPlot2, 1);
    //Set marker symbol (see plotSetLineSymbol for the symbol codes)
    plotSetLineSymbol(&myPlot2, 0);
    //Set marker color
    plotSetLineColor(&myPlot2, "black");
    //Add the centroids passed in as inputs to the existing plot
    plotAddScatter(myPlot2, centroid1, centroid2);
endp;
```

Once written, our procedure can be called in the same way as any built-in GAUSS procedure:

```
//Add centroids
plotAddCentroids(mdl.centroids[.,1], mdl.centroids[.,2]);
```