The knn Supervised Machine Learning Algorithm
Introduction
The supervised machine learning algorithm knn (short for k-nearest neighbors) is a classification algorithm that makes mathematically precise the idea that "birds of a feather flock together." As with any supervised machine learning algorithm, a knn model is "trained" on a set of data in which the classifications (or "labels" or "answers") are known, and the model can then be used to predict the classifications of new data where they are unknown.
The knn algorithm can be trained on data with any number of numerical explanatory variables and a single categorical response variable with any number of levels.
The algorithm works by plotting the explanatory variables of the training data in a Euclidean explanatory space.
An unclassified point is then classified by taking a majority vote of its k nearest neighbors in the training data. The number k is known as a hyper-parameter of the model and is selected by the researcher; typical selections are k = 1 or k = 3.
Ties can be handled in a number of ways, including leaving the point unclassified or increasing or decreasing the hyper-parameter k. There is no widespread agreement on how to handle ties, and different implementations of knn resolve them differently.
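To make the procedure concrete, here is a minimal sketch of the algorithm in Python. The function name, the toy data, and the tie-handling choice (returning no classification) are illustrative assumptions for this page, not the applet's actual implementation.

```python
from collections import Counter
import math

def knn_classify(train_points, train_labels, new_point, k=3):
    """Classify new_point by majority vote of its k nearest training points.

    Returns None on a tied vote (one of several possible tie-handling choices).
    """
    # Euclidean distance from new_point to every training point;
    # math.dist works in any number of dimensions.
    distances = sorted(
        (math.dist(p, new_point), label)
        for p, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    ranked = votes.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # tie among the k nearest neighbors
    return ranked[0][0]

# Toy training data: (x, y) explanatory points and a three-level response
points = [(25, 7), (30, 6), (55, 3), (62, 5), (8, 3), (10, 2)]
labels = ["Type 1", "Type 1", "Type 2", "Type 2", "Type 3", "Type 3"]

print(knn_classify(points, labels, (58, 4), k=3))  # majority vote -> 'Type 2'
```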
The performance of a knn model is not usually assessed with p-values, as is common for regression models. Instead, knn model skill is judged by the proportion of predictions the model gets correct on unseen "testing" data (i.e., data for which the classifications are known but on which the model was not trained). Higher proportions of correct predictions are better, and they are usually compared against what would be expected if classifications were made at random. It is also common to use confusion matrices. This page does not cover model skill or confusion matrices; check back later for a separate tutorial on those concepts.
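As a sketch of that idea, continuing with the hypothetical knn_classify function above, the proportion correct on a testing set is just the fraction of test points whose predicted label matches the known label. With three equally common classes, random guessing would score about 1/3, so that is the baseline to beat.

```python
def proportion_correct(train_points, train_labels, test_points, test_labels, k=3):
    """Fraction of test points classified correctly; ties count as incorrect."""
    hits = sum(
        knn_classify(train_points, train_labels, p, k) == label
        for p, label in zip(test_points, test_labels)
    )
    return hits / len(test_points)
```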
An Interactive Example
The applet below helps you explore knn in action. The plotted dots represent a training data set with two numerical explanatory variables, and a single three-level categorical response variable.
The two numerical explanatory variables range from about 20 to 80 in x and 0 to 9 in y. The plane the dots are plotted in is called the explanatory space; in this example, it is a standard two-dimensional Euclidean plane. Additional explanatory variables would increase the dimension of the explanatory space.
The single categorical response variable is plotted as the color of the dots. There are three levels of the response variable: Type 1 is red, Type 2 is green, and Type 3 is blue.
The black dot located at (8.7, 3.3) with the ellipse around it is an unclassified point waiting to be classified by you using the knn algorithm. What appears to be an ellipse is actually a Euclidean circle whose radius is controlled by the second black dot on the circle. The circle appears elliptical because the scales of the x and y axes are different; if the axes were scaled the same (1:1, x:y), the circle would look like a circle.
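A quick back-of-the-envelope calculation shows how strong that visual distortion is. Assuming, for illustration, a square plotting window that spans 20 to 80 in x and 0 to 9 in y (roughly this applet's ranges), a circle of radius 2 in data units is drawn far taller than it is wide:

```python
# Illustrative axis ranges, assumed to be drawn in a square window
x_span = 80 - 20   # horizontal data range
y_span = 9 - 0     # vertical data range
r = 2.0            # circle radius in data units

frac_of_width = r / x_span    # on-screen horizontal extent of the circle
frac_of_height = r / y_span   # on-screen vertical extent of the circle
print(frac_of_height / frac_of_width)  # ~6.7: drawn about 6.7x taller than wide
```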
To use knn to classify the black dot at (8.7, 3.3), increase the radius of the circle until the circle "captures" points from the training data set. The count of the training data captured inside the circle is tracked for you on the right. As you slowly expand the circle, you'll notice that a Type 2 point and a Type 3 point enter the circle at the same radius, so neither a k = 1 nor a k = 2 classification of (8.7, 3.3) is possible because of the tie. However, as the radius of the circle increases, a second Type 3 point is captured, resulting in a knn (k = 3) classification of the black point as Type 3.
After this task, you can move the unclassified point to a new location -- try (60, 4) -- and increase the radius of the circle to explore that point's k nearest neighbors. For instance, at (60, 4), the first three captured points are all Type 2, indicating that (60, 4) would be knn classified as Type 2 with k = 1, 2, or 3.
Play around! You can't break anything. If you get lost, you can always reset the app by pressing the circular arrow button in the top right or by simply refreshing the page.
Exercises:
Use the applet above to classify each of the following points using first k = 1 and then k = 3.
1. (50,3)
2. (52,8)
3. (20,7)
4. (25,7)
5. (30,1)
6. (30,4)
7. (30,6)
Additional Discussion Topics:
While you're classifying, also contemplate or discuss these questions:
1. Describe in plain English the main classification regions for both k = 1 and k = 3. (Hint: Use language like "when x is greater than/less than/between __ and y is greater than/less than/between __, then the k = __ classification is Type ___" that a human could easily understand.)
2. Are there any trouble areas where you get different results for k = 1 and k = 3?
3. Would using k > 3 improve classification?
4. Would you increase or decrease the scale of the vertical axis to make the "circle" look like a circle?
5. Any ideas what this dataset represents?
Additional Information about the Example
To answer the final discussion question, and in case you are curious: the example data above are the results of a survey of people's favorite Star Wars trilogy in the Skywalker Saga.
The horizontal axis is the respondent's age in years. The vertical axis is the number of Skywalker Saga films the respondent has seen (0 up to all 9 films). The color of the dot represents the respondent's self-reported favorite trilogy: Type 1 is the prequel trilogy (Episodes 1-3, 1999-2005), Type 2 is the original trilogy (Episodes 4-6, 1977-1983), and Type 3 is the sequel trilogy (Episodes 7-9, 2015-2019).
The applet below is updated with labeled axes and an expanded legend. Color coding is also added to the unclassified point and circle once a majority of nearest neighbors has been captured, making the classification easier to see without paying close attention to the counts in the table.