This blog post provides a detailed explanation of the Nearest Neighbor algorithms, specifically focusing on k-Nearest Neighbors (KNN), as covered in predictive modeling for data mining.
Introduction to Nearest Neighbor: Instance-Based Learning
The Nearest Neighbor method is a discriminative classification algorithm that is non-parametric and instance-based. Unlike parametric models (e.g., linear regression) that learn explicit parameters, Nearest Neighbor is non-parametric, meaning it does not assume a fixed form for the underlying data distribution. It is instance-based because it stores all training examples and defers processing until prediction time.
All data points are represented in a p-dimensional feature space, where each instance is a vector \(\mathbf{x}_i = [x_{i1}, x_{i2}, \dots, x_{ip}]\) with a corresponding discrete class label \(y_i\).
A useful analogy is seeking advice from friends in similar situations: you base your decision on those most similar to you.
Learning vs. Prediction in Nearest Neighbor
Learning Phase: The algorithm simply memorizes (stores) the entire training dataset \(( \mathbf{x}_1, y_1 ), (\mathbf{x}_2, y_2), \dots, (\mathbf{x}_n, y_n)\). No explicit model is built.
Prediction Phase: For a new instance \(\mathbf{x}_j\), the algorithm finds “nearby” training examples based on a distance metric and predicts the label based on those neighbors.
From 1-Nearest Neighbor (1NN) to k-Nearest Neighbors (KNN)
1-Nearest Neighbor Algorithm:
To predict the class for a new instance \(\mathbf{x}_j\):
- Compute the distance \(d(\mathbf{x}_i, \mathbf{x}_j)\) for all training instances \(i\).
- Find the training instance \(\mathbf{x}_i\) that minimizes this distance.
- Assign \(\hat{y}_j = y_i\) (the label of the closest neighbor).
The core idea is that similar instances (close in feature space) should have similar labels.
k-Nearest Neighbors (KNN): Generalizes 1NN by considering the K closest neighbors and aggregating their labels, typically via majority vote.
Formally: \(\hat{y}_j = g(\{y_i \mid \mathbf{x}_i \text{ among K nearest to } \mathbf{x}_j\})\), where \(g\) is often the mode (most frequent label).
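The prediction rule above can be sketched in a few lines. This is a minimal illustration (toy data and function names are mine, not from the post): compute all distances, take the K smallest, and return the mode of the neighbors' labels.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the label of x_new by majority vote among its k nearest neighbors."""
    # Euclidean distances from x_new to every stored training instance
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Indices of the k closest training points
    nearest = np.argsort(dists)[:k]
    # Aggregation g(): the mode (most frequent label) among those neighbors
    return Counter(y_train[i] for i in nearest).most_common(1)[0][0]

# Toy data: two well-separated clusters
X_train = np.array([[1.0, 1.0], [1.5, 1.2], [5.0, 5.0], [5.2, 4.8]])
y_train = np.array(["blue", "blue", "red", "red"])

knn_predict(X_train, y_train, np.array([1.2, 1.1]), k=3)  # two blues outvote one red
```

With k=1 this reduces to the 1NN rule: the single closest training point decides the label.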
In the classic illustration of KNN classification, a green test point sits among blue and red training points, with two circles marking candidate neighborhoods: for small K (the inner circle), one class dominates; for larger K (the outer circle), the other class may win the vote.
Decision Boundaries in Nearest Neighbor
1NN Decision Boundary: The feature space is divided into Voronoi cells, where each cell consists of all points closer to a particular training instance than to any other. The decision boundary is the Voronoi tessellation, with each cell labeled by its training point’s class.

Effect of K on Decision Boundaries:
As K increases, boundaries become smoother:
- Low K (e.g., 1): Jagged, complex boundaries sensitive to noise.
- Medium K: Balanced regions.
- High K: Very smooth, potentially missing small clusters.


Key Design Choices in KNN
- K (number of neighbors): Typically small (< 10) and often odd to avoid vote ties; chosen via validation.
- Distance Metric \(d()\): Commonly Euclidean distance \(d(\mathbf{x}_i, \mathbf{x}_j) = \sqrt{\sum_{m=1}^p (x_{im} - x_{jm})^2}\).
- Aggregation Function \(g()\): Majority vote for classification.
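These three design choices map directly onto library parameters. As a sketch (the post does not mention scikit-learn; it is used here purely for illustration), K, \(d()\), and \(g()\) correspond to `n_neighbors`, `metric`, and `weights` in `KNeighborsClassifier`:

```python
from sklearn.neighbors import KNeighborsClassifier

# Toy training set (illustrative data, not from the post)
X_train = [[1.0, 1.0], [1.5, 1.2], [5.0, 5.0], [5.2, 4.8]]
y_train = ["blue", "blue", "red", "red"]

# K -> n_neighbors, d() -> metric, g() -> weights ("uniform" = plain majority vote)
clf = KNeighborsClassifier(n_neighbors=3, metric="euclidean", weights="uniform")
clf.fit(X_train, y_train)
label = clf.predict([[1.2, 1.1]])[0]  # vote among the 3 nearest training points
```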
Choosing Optimal K: Cross-Validation
1NN always achieves 0% training error (each training point is its own nearest neighbor, assuming no duplicate points with conflicting labels), but this is a symptom of overfitting rather than good generalization.
Use k-fold cross-validation to estimate generalization error for different K. Plot training error, CV error, and (if available) test error vs. K.
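The selection procedure can be sketched as follows (synthetic data and candidate-K grid are illustrative assumptions; scikit-learn is used for brevity): run k-fold CV for each candidate K and keep the K with the best cross-validated accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data (shapes are illustrative)
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# Estimate generalization accuracy for each candidate K with 5-fold CV
cv_scores = {}
for k in [1, 3, 5, 7, 9]:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    cv_scores[k] = scores.mean()

# The K that maximizes CV accuracy (equivalently, minimizes CV error)
best_k = max(cv_scores, key=cv_scores.get)
```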


Key insights:
- Small K: Overfitting (low training error, high test error).
- Large K: Underfitting (higher errors overall).
- Optimal K: Minimizes CV error (proxy for test error).
Strengths and Weaknesses of Nearest Neighbor
Strengths:
- Simple and easy to implement.
- Efficient learning (just store data).
Weaknesses:
- Inefficient prediction: Each query requires computing distances to all n training points, i.e., O(np) work per query; spatial indexes such as k-d trees or ball trees can speed this up in low dimensions.
- Curse of Dimensionality: In high dimensions, distances become less meaningful; need exponentially more data for reliable neighbors.
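The curse of dimensionality can be demonstrated empirically. In this sketch (the experiment design is mine, not from the post), we measure how much the farthest point differs from the nearest for uniform random data: as p grows, the contrast collapses, so "nearest" loses its meaning.

```python
import numpy as np

rng = np.random.default_rng(0)

def distance_contrast(p, n=1000):
    """Relative gap between farthest and nearest distances from the origin
    for n uniform random points in the unit hypercube [0, 1]^p."""
    X = rng.random((n, p))
    d = np.sqrt((X ** 2).sum(axis=1))
    return (d.max() - d.min()) / d.min()

# In 2 dimensions the nearest point is far closer than the farthest;
# in 500 dimensions all points are nearly equidistant from the query.
low, high = distance_contrast(2), distance_contrast(500)
```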

Practice Example
Classify the point (3, 4) using K = 3. The original notes' labels are not reproduced here; for illustration, assume (1, 2) and (4, 6) are blue, and (2, 5) and (5, 3) are red.
Distances (Euclidean):
- To (1,2): \(\sqrt{(3-1)^2 + (4-2)^2} = \sqrt{8} \approx 2.83\)
- To (2,5): \(\sqrt{(3-2)^2 + (4-5)^2} = \sqrt{2} \approx 1.41\)
- To (5,3): \(\sqrt{(3-5)^2 + (4-3)^2} = \sqrt{5} \approx 2.24\)
- To (4,6): \(\sqrt{(3-4)^2 + (4-6)^2} = \sqrt{5} \approx 2.24\)
Three nearest neighbors: (2, 5) at 1.41, then (5, 3) and (4, 6) tied at 2.24. The majority vote among their labels yields red.
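The worked distances can be checked numerically with a few lines of standard-library Python (labels are omitted here, since only the distance computations are being verified):

```python
import math

# Recompute the distances from the worked example
test_point = (3, 4)
train = [(1, 2), (2, 5), (5, 3), (4, 6)]
dists = [math.dist(test_point, p) for p in train]  # sqrt(8), sqrt(2), sqrt(5), sqrt(5)

# Indices of the three nearest training points: (2, 5), (5, 3), (4, 6)
nearest3 = sorted(range(len(train)), key=dists.__getitem__)[:3]
```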

This concludes the detailed overview of Nearest Neighbor methods. These concepts form a foundation for lazy learning in machine learning.