[Note: I've made a Jupyter Notebook (Python) for this so that you can mess around with a few of these ideas yourself. The figures come from this notebook.]
⚫ ⚫ ⚫ ⚫
Suppose you have two kinds of people: people who like to stand by windows and people who like to stand in the middle of the room. Suppose you have a large room. Let's pretend it looks something like this:
We have some peach-colored points in the middle and blue-colored points on the outside. Suppose further that we are a pickpocket and we know that with window-people (blue points) are real easy to pickpocket. A new person comes into the room that we want to pickpocket — but how will we know if he's a window person or not?
The intuition is that he will stand on the "outside" by everyone else. But what if he's not exactly by the windows? What if he's represented by this black point:
One thing we can do is see who he's closest to and then rank him like that. This is called 1-Nearest Neighbor classification. Let's do exactly this! Below, I've drawn a line between him and his nearest neighbor.
Cool. The closest neighbor is blue, so we should make the black point into a blue point. But — just to be sure, let's look at a few more neighbors.
For two nearest neighbors, we get a tie! Uh-oh. That's no good.
In general, we avoid using even numbers with KNN because of this sort of thing; we don't want to end up having to pick randomly between two classes that win. Let's try 3 and 5 and see what we get.
Looks like the center points are winning in this case. Let's try a crazy one; something like 15.
We see that there's still a huge number of points from the center closer to our new point. The 1NN was a little misleading, then! It's probably more reasonable to color this new point peach, the color of the center points.
In general, it's not an easy problem to think about what $$k$$ values are appropriate. This is something we may talk about in a later post. Either way, this is the gist of how one does nearest-neighbor classification. In the Jupyter notebook (linked above) we also include a way to easily predict new points using Sklearn.
Homework: You can fairly easily extend this to datasets where you have more than two classes (colors). Modify the code so that you have three classes instead of two.