Intuition Building: A Toy Problem for K-Means Clustering.

[Note: I've made a Jupyter Notebook (Python) for this so that you can mess around with a few of these ideas yourself. The figures come from this notebook.]

Suppose we have a few "blobs" of points that we'd like to classify as a few different clusters.

Here, we can see that there are either two or three blobs that seem separated from each other. For the sake of this post, let's assume there are three blobs: upper-left, upper-right, and bottom.
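(If you'd like to generate a dataset like this outside the notebook, here's one hedged way to do it with sklearn's `make_blobs`; the notebook's actual data generation, number of points, spread, and seed may differ.)

```python
# A sketch of generating toy "blob" data; parameters here are illustrative,
# not necessarily what the notebook uses.
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=0)
# X is a (300, 2) array of points; the true labels are ignored, since
# clustering is unsupervised.
```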

In order to create three different clusters, we're going to go through a relatively nice algorithm that uses the ideas behind Voronoi Diagrams, so if you're not familiar with those diagrams then go back to that post, read it, then come on back. We'll wait.

⚫ ⚫ ⚫ ⚫

K-Means Clustering Algorithm.

The gist of the algorithm for $$k$$-means is the following:

  1. First, initialize $$k$$ random center points. (Try to make these points relatively spread out but near your clusters.)
  2. For each center point $$c_{0}$$:
    1. Look at all of the datapoints in its Voronoi cell (the points closest to $$c_{0}$$ and to no other center point); call this set $$X_{0}$$. If any center's cell is empty, start the algorithm over by re-initializing the centers.
    2. Compute the mean value of each coordinate over all of the points in $$X_{0}$$, and create a new point from these coordinate-wise means; call it $$c_{1}$$.
    3. Remove the old center $$c_{0}$$ and let $$c_{1}$$ be the new center in its place.
  3. Once this is done, we have a new plot with the same data points but different centers. Calculate the distance between each old center and its corresponding new center. If the centers haven't moved much*, you're done; otherwise, go back to step 2 above.

The asterisk is there because we have to decide what counts as "close enough." I'll usually use the sum of the squares of the Euclidean distances between the old and new centers, \[\sum_{i = 1}^{k}\operatorname{dist}(c^{(i)}_{j-1}, c^{(i)}_{j})^{2},\] where $$c^{(i)}_{j}$$ is the value of the $$i$$-th center at the $$j$$-th iteration and $$\operatorname{dist}()$$ is the Euclidean distance function. Visually, though, it should be relatively clear when the centers are "done moving."
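To make the steps concrete, here's a minimal NumPy sketch of the loop above, using the sum-of-squares stopping rule just described. The function name `kmeans`, the tolerance `tol`, and the choice of initializing from $$k$$ random data points are my own for illustration; the notebook's implementation may differ.

```python
import numpy as np

def kmeans(X, k, tol=1e-4, max_iter=100, seed=None):
    """Bare-bones k-means following the steps above.

    X: (n_points, n_features) array.
    Returns the final centers, each point's cluster label, and the
    sum-of-squares shift of the centers at every iteration.
    """
    rng = np.random.default_rng(seed)
    restart = True
    while restart:
        restart = False
        # Step 1: pick k distinct data points as the initial centers.
        centers = X[rng.choice(len(X), size=k, replace=False)]
        shifts = []
        for _ in range(max_iter):
            # Step 2.1: assign each point to its nearest center (its Voronoi cell).
            dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            if len(np.unique(labels)) < k:
                restart = True   # an empty cell: re-initialize and start over
                break
            # Step 2.2: each new center is the coordinate-wise mean of its cell.
            new_centers = np.array([X[labels == i].mean(axis=0) for i in range(k)])
            # Step 3: sum of squared distances between old and new centers.
            shift = float(np.sum((new_centers - centers) ** 2))
            shifts.append(shift)
            centers = new_centers
            if shift < tol:      # "close enough": the centers barely moved
                break
    return centers, labels, shifts
```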

⚫ ⚫ ⚫ ⚫

An Example.

Let's show an example of what happens, using the same data as above. Because it's tricky to plot the Voronoi structure in Bokeh as of this writing (and I've tried; see the notebook above), I'll simply color-code the points to show which center they are currently closest to. Here are the initial centers:

The larger dots are the centers, and the datapoints are color-coded to show which Voronoi cell they're in; the purple dots, for example, are closest to the purple center.

Now, let's compute the new centers.

The centers moved fairly quickly to where we thought they ought to go! Woah. Note here that the sum of squares of the distances between the old and new centers was 24.86. Let's try one more iteration.

The sum of squares this time was 1. The centers moved significantly less than before, which usually means the algorithm is quickly converging to a solution.

Now, let me just show you a plot of the result after 10 iterations,

And here's a plot of the sum of squares for the first 10 iterations,

As you can see, this went down rather quickly. Awesome.
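For completeness, here's roughly how such a curve could be produced with the `kmeans` sketch and the toy `X` from the snippets above. The post's figures were made with Bokeh; matplotlib is used here only to keep the snippet short, so treat this as illustrative rather than a reproduction of the figure.

```python
import matplotlib.pyplot as plt

# Run the sketch from earlier on the toy blobs and plot the per-iteration
# sum-of-squares shift of the centers.
centers, labels, shifts = kmeans(X, k=3, seed=0)

plt.plot(range(1, len(shifts) + 1), shifts, marker="o")
plt.xlabel("iteration")
plt.ylabel("sum of squared center shifts")
plt.show()
```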

⚫ ⚫ ⚫ ⚫

Let's try one more example, but we'll do this one fast: six blobs, six centers. Initially it looks like this:

And here's the final picture as well as the sum-of-squares curve. Pretty fast.

⚫ ⚫ ⚫ ⚫

Some Notes.

  • I've included in the notebook how we can easily do this in a few steps using Python's sklearn library (a minimal version appears right after this list). Play with it yourself!
  • In this post we only used two dimensions (so we could plot everything on the xy-plane), but we don't rely on this anywhere; the algorithm is general enough to handle any finite number of features!
  • When we placed a black dot in the middle between the two larger groups, some of you may have thought, "Why bother classifying this? It's an outlier! It's a weirdo point." And that's true: it was a strange point. Two things about this.
    1. First, it's not always possible to throw out outliers from a dataset: if you're working with real human data, for example, every element may need a classification. Instead, outliers may need to be investigated in the raw data itself, or possibly after a significant number of additional datapoints have been collected. It may be that this outlier is just a strange point, or it may belong to a cluster we did not realize was there.
    2. Second, if a point is close to the decision boundary of two or more clusters (like the point above), then it may ultimately not matter how this point is classified: it is roughly equally near to each cluster, so any of them may represent it appropriately. In cases where an important distinction must be made between two clusters (e.g., patients with a fatal disease who must be treated versus patients who don't have the disease), a different metric may be useful to weight distance more heavily toward one group or the other, shifting the Type 1 and Type 2 error rates to a point where they are tolerable.
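As the first note above says, sklearn handles all of this for you. A minimal sketch, assuming the toy data `X` from the earlier snippets (the notebook's exact calls and arguments may differ):

```python
from sklearn.cluster import KMeans

# Fit k-means with three clusters; n_init controls how many random
# re-initializations sklearn tries before keeping the best result.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(km.cluster_centers_)   # final centers
print(km.labels_[:10])       # cluster assignment of the first ten points
```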