A simple guide on how to get started with KNN in Python.
Easy to understand and easy to implement, the k-nearest neighbors (KNN) algorithm is one of the most widely used classification algorithms out there.
KNN is non-parametric, and the math behind it is simple to understand, which makes the algorithm easy to interpret and explain. But that's not all.
KNN is also famous for:
- Being robust; for instance, classes don't have to be linearly separable.
- Having few hyperparameters to tune to find the best model.
- Making no assumptions about the underlying data distribution.
All of the above, combined with KNN's simplicity, makes it a good choice when working with classification problems.
In this article, I will go over the math that makes up the KNN algorithm along with an implementation in Python. So without further ado, let's get started.
The math behind KNN
“A data point is known by the company it keeps”
K-nearest neighbors works on the concept of Euclidean distance. The concept is simple and straightforward:
the nearest data points "categorize" the test point. For example:
Let’s say I have two classes, Class A and Class B.
I would like to find out which class my test point (the red star in the picture) belongs to.
I analyze the three "closest" data points to my red star test point and count which class secures the majority among those three.
The class that secures the majority is deemed the class of our red star test point, which is Class B in the case of the three closest neighbors.
In the case of six nearest neighbors (K=6), we see that our red star is classified as Class A.
This tells us that the number of closest data points or “neighbors” plays a crucial role in determining which Class our test point belongs to.
Now that we have seen how data points are classified, let's look at the math behind it.
We know that the "nearest neighbors" are the points that help us classify our test point, but how do we decide which points are "nearest"? How do we measure the distance between the test point and the data points?
We turn to Euclidean distance.
Euclidean distance helps us measure how far our test point is from each data point. The mathematical representation of Euclidean distance is:

d(X, Y) = √( Σᵢ (Xᵢ − Yᵢ)² )

Here, X is our test point, Y is a data point, and i indexes the features, or "columns". Let's solve an example problem so we may better understand Euclidean distance.
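As a quick sketch before the worked example, the formula can be written as a small Python function; the two sample points below are made up purely for illustration:

```python
import math

def euclidean_distance(x, y):
    """Euclidean distance between two points given as equal-length feature sequences."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

# Two made-up points with two features each: distance is sqrt(3^2 + 4^2) = 5.0
print(euclidean_distance((8, 4), (5, 8)))
```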
I have a dataset that shows whether a student passed or failed based on two final exam scores.
Let's say I would like to know whether a particular student, who scored 8 on the Maths exam and 4 on CS, passed or failed.
We will use the Euclidean distance concept to make our prediction.
Now that we have found our Euclidean distance values, we can make our classification.
I will first look at only one "nearest neighbor", the data point with the smallest Euclidean distance (K=1).
By looking at only one data point, we can see that our test point will be classified as Failed. (Note: the point with Euclidean distance = 0 is the test point itself, so it is not considered.)
Let's look at K=3, the three nearest data points.
When we look at the three nearest points, we see more students who passed than failed, so our test point will be classified as Pass in this case.
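The worked example above can be sketched in plain Python. The exam scores and labels below are made-up stand-ins for the article's table, chosen so that K=1 and K=3 disagree in the same way:

```python
import math

# Hypothetical training data: (maths score, cs score) -> outcome
students = [
    ((7, 4), 'Fail'),
    ((8, 6), 'Pass'),
    ((6, 5), 'Pass'),
    ((3, 2), 'Fail'),
    ((9, 9), 'Pass'),
]
test_point = (8, 4)  # the student we want to classify

def knn_classify(point, data, k):
    # Sort training points by Euclidean distance to the test point
    by_distance = sorted(data, key=lambda item: math.dist(point, item[0]))
    # Majority vote among the k nearest neighbours
    labels = [label for _, label in by_distance[:k]]
    return max(set(labels), key=labels.count)

print(knn_classify(test_point, students, 1))  # 'Fail' -- the single nearest point failed
print(knn_classify(test_point, students, 3))  # 'Pass' -- two of the three nearest passed
```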
The value of K does affect our classification results, which is why it is imperative to understand how a change in the value of K alters our model. Let's take a look.
As can be seen from the visuals above, when the value of K is low, our model is highly flexible and tries to classify each and every training point as well as it can.
That may sound like a good thing, but when such models are tested on unseen data, they tend to perform poorly because the model has "overfit" the training data.
It fits the training data too closely but fails to generalize to unseen data.
On the other hand, when the value of K is high, the model has low flexibility and a tendency to "underfit": it does not overfit, and its performance on unseen data is generally more stable, though it may miss real patterns.
This trade-off between flexibility and stability is called the Bias-Variance Tradeoff and is an integral part of machine learning.
We will always be looking for a value of K that gives us a balance of both.
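One way to see the tradeoff in practice is to compare training and test accuracy for a very small and a fairly large K. Everything below is an illustrative sketch on a synthetic, noisy dataset, not the article's data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary problem with 20% label noise, so memorizing the
# training set does not generalize to unseen points
X, y = make_classification(n_samples=500, n_features=5, flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scores = {}
for k in (1, 25):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    scores[k] = (knn.score(X_train, y_train), knn.score(X_test, y_test))
    print(f"K={k}: train={scores[k][0]:.2f}, test={scores[k][1]:.2f}")
```

K=1 memorizes the training set perfectly (training accuracy 1.0) yet typically generalizes worse on noisy data; K=25 fits the training set imperfectly but varies less from sample to sample.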
Implementation in Python
Now that we have looked at the theory behind KNN, it's time to implement it in Python.
For this part, I will go over a binary classification problem with multiple independent variables.
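The dataset itself isn't shown here; as a stand-in so the snippets below can be run end to end, a synthetic DataFrame with anonymous features and a 'TARGET CLASS' column (the column names are my own) can be generated:

```python
import pandas as pd
from sklearn.datasets import make_classification

# Synthetic stand-in: 1000 rows, 10 numeric features, binary target
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
df = pd.DataFrame(X, columns=[f'FEAT_{i}' for i in range(10)])
df['TARGET CLASS'] = y
print(df.shape)  # (1000, 11)
```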
The next step is to standardize our variables. We do this because variables with larger ranges dominate the distance calculation. So let's say we have a variable that ranges in the millions in a dataset where the other variables range from 1 to 10; that one variable will dominate the Euclidean distances, bias our model, and give us misleading results.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(df.drop('TARGET CLASS',axis=1))  # learn each feature's mean and standard deviation
scaled_features = scaler.transform(df.drop('TARGET CLASS',axis=1))
df_feat = pd.DataFrame(scaled_features,columns=df.columns[:-1])
Running the above code gives me:
Next, we will split the data into training and testing sets.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(scaled_features, df['TARGET CLASS'],
                                                    test_size=0.3)  # split ratio not given in the original; 30% is a common choice
After defining my training and testing datasets, I will import the KNN module, fit it on the training data, and make predictions on the test set.
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)  # fit the model before predicting
pred = knn.predict(X_test)
Let’s see how our model performed.
from sklearn.metrics import classification_report,confusion_matrix
print(confusion_matrix(y_test, pred))
print(classification_report(y_test, pred))
Our model performed reasonably well, but I will go one step further.
I have already discussed how the value of K can influence our predictions by making our model more or less flexible. So is there an optimal value of K that gives us the best model possible? Let's check it out.
Choosing a K value
We will look at values of K from 1 to 40.
import numpy as np
import matplotlib.pyplot as plt

error_rate = []

# Try every K from 1 to 39 and record the test-set error rate (will take some time)
for i in range(1,40):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train, y_train)
    pred_i = knn.predict(X_test)
    error_rate.append(np.mean(pred_i != y_test))

plt.figure(figsize=(10,6))
plt.plot(range(1,40), error_rate, color='blue', linestyle='dashed', marker='o')
plt.title('Error Rate vs. K Value')
The graph above shows that the error rate is lowest around K=35. Let's see how that affects our model's performance.
Initially, with K=1, our model's performance was:
With K=35, our model performed better, with a 3 percent increase in accuracy.
This was a brief introduction to k-nearest neighbors in Python. I hope you liked it.
Feel free to share this article; that helps me out a lot. Thank you!
Reference: Jose Portillo, Python for Data Science, https://www.udemy.com/course/python-for-data-science-and-machine-learning-bootcamp/