Intuition Building: Another Toy Problem for K-Means Clustering with more features!

[Note: I've made a Jupyter Notebook (Python) for this so that you can mess around with a few of these ideas yourself. The figures come from this notebook.]

⚫ ⚫ ⚫ ⚫

In the last post about K-Means we talked about very small data: only about a hundred points with two features. This time let's talk about data which is a bit larger, with around ten or so features. There's quite a bit of difference here: we can no longer visualize the clusters, so we need to be a bit more clever to check our work.

⚫ ⚫ ⚫ ⚫

Let's suppose that someone gave us a pile of emails, each of which was either about Christianity or about hockey. Could we figure out a way to cluster them?

Note, there are a number of ways to do this problem. We're going to try a simple method: thinking of some common words that would be in each kind of email, then trying to classify the emails by checking whether those words are contained in the email or not. Given what we've already done in the last post, most of the heavy lifting will be to put the data in the proper format to use our sklearn tools.

I've put the Jupyter Notebook above in the post, but I'll copy some of the lines here for ease of viewing.

⚫ ⚫ ⚫ ⚫

import numpy as np

from sklearn.cluster import KMeans
from sklearn.datasets import fetch_20newsgroups

# Get the data.
cats = ['soc.religion.christian', 'rec.sport.hockey']
newsgroups_train = fetch_20newsgroups(subset='train', categories=cats)

train_data = newsgroups_train.data
train_target = newsgroups_train.target

We've just imported our data (note that this is actually BB data, but who knows what a BB is anymore?) and some libraries to help us. We won't use train_target until the end of the post where we evaluate our method.

Now the "semi-supervised" part. I picked a few words from each topic that I thought might be in the emails and put them into a features list. Then, for each email, I built a vector corresponding to that word list, with a 1 in a coordinate if the email contained the corresponding word and a 0 if it did not.

# Our features will be a yes or no depending on whether a word is in the email.
feature_names = np.array(["christian", "god", "exist", "atheis", "bible", "jesus", "lord", "hockey", "season", "team", "puck", "stick", "coach", "game"])

class_vectors = [[1 if feature in x.lower() else 0
                  for feature in feature_names] for x in train_data[:1000]]

I've truncated the data to 1000 points for simplicity, but you can include more data. Note that I've also converted each email to lowercase before checking for the words. In a future post, when we do a bit more digging into this kind of thing, we'll also want to eliminate things like punctuation, as well as look for things like bigrams (two words next to each other).
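To make the vector construction concrete, here is a tiny sketch on three made-up strings (the toy emails and the shortened word list are mine, not from the newsgroup data). It uses the same substring test as above, so "atheis" matches "atheist".

```python
# Toy illustration of the 0/1 word-presence vectors (made-up emails).
feature_names = ["god", "atheis", "hockey", "team"]

toy_emails = [
    "God and the atheist debate",
    "The hockey TEAM won again",
    "Lunch at noon?",
]

# One row per email, one column per word: 1 if the word appears, else 0.
toy_vectors = [
    [1 if feature in email.lower() else 0 for feature in feature_names]
    for email in toy_emails
]

for email, vector in zip(toy_emails, toy_vectors):
    print(vector, email)
```

Because this is a plain substring test, it also fires inside longer words: "lord" would match "landlord", for instance. That is one reason to clean up the text more carefully later on.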

Now, it's time to classify. We have vectors which give us points. These points aren't in the $$xy$$-plane anymore but in a higher-dimensional space called $${\mathbb R}^{14}$$ (the 14 being the number of features), which is impossible to visualize even if you close your eyes and think real hard.

km = KMeans(n_clusters=2).fit(class_vectors)

center_strengths = np.round(km.cluster_centers_, decimals=1)

This gives us the centers of the two clusters we created with KMeans. Printing this out, we get that the first center is at \[(0.7,0.8,0.2,0.1,0.3,0.4,0,0,0,0,0,0,0)\] which tells us that the features, in order of significance for this cluster, were: "god", "christian", "jesus", "bible", "exist", "atheis" (this last one matches both "atheist" and "atheism"). This is most likely the Christian emails cluster. Note that the larger the value in a coordinate, the larger the fraction of emails in that cluster containing that word. If one of the coordinates were 1, this would mean that every email in the cluster contained that particular word (why?). Similarly, \[(0,0,0,0,0,0,0,0.2,0.4,0.1,0,0.1,0.4)\] shows us that the second cluster was dominated by "team" and "game". This makes sense.
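One way to see why the center coordinates behave like this (a toy sketch with made-up vectors, not the newsgroup data): a K-Means center is the coordinate-wise mean of the points in its cluster, and the mean of a column of 0s and 1s is just the fraction of 1s in it.

```python
import numpy as np

# Made-up 0/1 vectors for four emails that all landed in the same cluster.
# Columns stand for three hypothetical words.
cluster_vectors = np.array([
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 0],
    [1, 0, 1],
])

# The cluster center is the mean of its members, column by column.
center = cluster_vectors.mean(axis=0)
print(center)  # fraction of emails in the cluster containing each word
```

The first word appears in all four emails, so its coordinate is 1; the second appears in half of them and the third in a quarter.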

⚫ ⚫ ⚫ ⚫

Okay, so we classified a few things, but it's sometimes nice to kill off features which aren't doing any heavy lifting. For example, let's eliminate any variable which had a value of less than 0.4 in both of the centers.
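Here is a minimal sketch of that pruning step, assuming the rule is "keep a word if at least one center gives it 0.4 or more". The centers and vectors below are small made-up stand-ins, and `keep`, `slim_feature_names`, and `slim_vectors` are names introduced here for illustration; the notebook may do this differently.

```python
import numpy as np

# Made-up stand-ins for feature_names, the rounded centers, and the data.
feature_names = np.array(["christian", "god", "exist", "team", "puck", "game"])
centers = np.array([
    [0.7, 0.8, 0.2, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.4, 0.1, 0.4],
])
class_vectors = np.array([
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [1, 0, 1, 0, 0, 0],
])

# Keep a feature if its largest value across the centers is at least 0.4.
keep = centers.max(axis=0) >= 0.4

slim_feature_names = feature_names[keep]
slim_vectors = class_vectors[:, keep]

print(slim_feature_names)
print(slim_vectors)
```

With the real data, the same mask would be built from `km.cluster_centers_` and applied to `class_vectors`, and the slimmed model refit with something like `km_slim = KMeans(n_clusters=2).fit(slim_vectors)`.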

As you might expect, this gives us a fit similar to what we had before, but we now have fewer features to worry about, which can save valuable time if you're testing hundreds of thousands of variables! Here, it doesn't make too big of a difference, but it's a handy habit to get into.

⚫ ⚫ ⚫ ⚫

We can check to see how well our clustering algorithm did by looking at the confusion matrix. Remember, we have the target values for these emails, so we can check to see if we assigned the individual emails to the correct cluster. Let's do this for the slimmed-down feature set, whose classifier I've called km_slim:

import numpy as np
import pandas as pd

# The truncation used when building class_vectors.
number_of_datapoints = 1000

# In train_target, the value 0 is for hockey.  We'll align our labels in the same way.
if km_slim.cluster_centers_[0][0] > 0.5:  # if the 0th cluster is Christianity,
    our_target = 1 - km_slim.labels_  # switch 0 to 1 and 1 to 0.
else:  # the 0th cluster is already hockey.
    our_target = km_slim.labels_


def confusion_matrix_2(target, our_values):
    """ Calculates the confusion matrix for 2 clusters. """
    class_masks = [target == 0, target == 1]

    # all elts which are both 0.
    both_0 = np.sum(our_values[class_masks[0]] == 0)
    # all elts which are both 1.
    both_1 = np.sum(our_values[class_masks[1]] == 1)
    # target is 0, we get 1.
    target_0_us_1 = np.sum(our_values[class_masks[0]] == 1)
    # target is 1, we get 0.
    target_1_us_0 = np.sum(our_values[class_masks[1]] == 0)

    # Rows are our values, columns are the targets.
    conf_array = [[both_0, target_1_us_0],
                  [target_0_us_1, both_1]]

    return pd.DataFrame(conf_array,
                        columns=["Target=0", "Target=1"],
                        index=["Our_Values=0", "Our_Values=1"])

confusion_matrix_2(train_target[:number_of_datapoints], our_target)

Our results look like this:

\[\begin{array}{|l|l|l|} \hline & Target=0 & Target=1 \\ \hline OurValues=0 & 491 & 109 \\ \hline OurValues=1 & 7 & 393 \\ \hline \end{array}\]
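As a quick sanity check on the table, the overall accuracy and the per-topic hit rates can be read off with a little arithmetic. The variable names are mine; the four counts are the ones in the table (target 0 is hockey, target 1 is Christianity).

```python
# Counts copied from the confusion matrix above.
hockey_as_hockey = 491        # Target=0, Our_Values=0
christian_as_hockey = 109     # Target=1, Our_Values=0
hockey_as_christian = 7       # Target=0, Our_Values=1
christian_as_christian = 393  # Target=1, Our_Values=1

total = (hockey_as_hockey + christian_as_hockey
         + hockey_as_christian + christian_as_christian)

# Fraction of all emails placed in the right cluster.
accuracy = (hockey_as_hockey + christian_as_christian) / total

# Fraction of each topic's emails placed in the right cluster.
hockey_rate = hockey_as_hockey / (hockey_as_hockey + hockey_as_christian)
christian_rate = christian_as_christian / (christian_as_christian + christian_as_hockey)

print(total, round(accuracy, 3), round(hockey_rate, 3), round(christian_rate, 3))
```

That works out to about 88% overall: roughly 99% of the hockey emails end up in the right cluster, but only about 78% of the Christian ones.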

Strange. We classified pretty well, except there's a large number of target = 1 (Christian-topic) items which we classified as 0. Let's find one of these and see why. Here's the main text of one of them.

Subject: Greek Wordprocessor/Database.

Hi there,

Does anyone know about any greek database/word processor that can do things like count occurrences of a word, letter et al?
I'm posting this up for a friend who studies greek.

Thanks,
Nico.

P.S.Can you email as I seldom look into usenet nowadays.

"Call unto me and I will answer you and show thee great and unsearchable things you do not know."  Jeremiah 33:3 

As you can see, none of our feature words came up in this email, so the classifier had to make a choice and chose wrong. We could do a few things to fix this: use more features (add words), put weights on certain words (they're all currently weighted exactly the same, though I would expect "god" to be used significantly more than "exist" in a Christian email), or use collections of words instead of just a single word.
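To see why an email with none of the words lands in the hockey cluster, here is a small sketch with made-up centers (illustrative values, not the fitted `km_slim` centers; the five column names are an assumption too). K-Means assigns a point to its nearest center, and an all-zero vector is nearest to whichever center has the smaller entries overall.

```python
import numpy as np

# Hypothetical, illustrative centers (not the fitted km_slim values).
# Columns: "christian", "god", "jesus", "team", "game".
centers = np.array([
    [0.0, 0.0, 0.0, 0.4, 0.4],  # hockey-like cluster (label 0)
    [0.7, 0.8, 0.4, 0.0, 0.0],  # Christianity-like cluster (label 1)
])

# An email containing none of the feature words is the all-zero vector.
email_vector = np.zeros(5)

# Euclidean distance from the email to each center.
distances = np.linalg.norm(centers - email_vector, axis=1)
assigned = int(np.argmin(distances))

print(distances.round(3), assigned)
```

In this sketch the hockey-like center sits closer to the origin, so a wordless email is pulled there regardless of its actual topic.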

For now, this isn't a terrible classifier given that we used a total of six words. In a future post, we'll talk about how we can do some of the things above to do a little bit better than this.