Nearest Neighbor Methods

This blog post provides a detailed explanation of nearest neighbor algorithms, focusing on k-Nearest Neighbors (KNN), as covered in predictive modeling for data mining.

Introduction to Nearest Neighbor: Instance-Based Learning

The Nearest Neighbor method is a discriminative, non-parametric, instance-based classification algorithm. Unlike parametric models (e.g., linear regression), which learn a fixed set of parameters, it assumes no fixed form for the underlying data distribution. It is instance-based because it stores all training examples and defers computation until prediction time.

All data points are represented in a p-dimensional feature space, where each instance is a vector \(\mathbf{x}_i = [x_{i1}, x_{i2}, \dots, x_{ip}]\) with a corresponding discrete class label \(y_i\).

A useful analogy is seeking advice from friends in similar situations: you base your decision on those most similar to you.

Learning vs. Prediction in Nearest Neighbor

Learning Phase: The algorithm simply memorizes (stores) the entire training dataset \(( \mathbf{x}_1, y_1 ), (\mathbf{x}_2, y_2), \dots, (\mathbf{x}_n, y_n)\). No explicit model is built.

Prediction Phase: For a new instance \(\mathbf{x}_j\), the algorithm finds “nearby” training examples based on a distance metric and predicts the label based on those neighbors.

From 1-Nearest Neighbor (1NN) to k-Nearest Neighbors (KNN)

1-Nearest Neighbor Algorithm:

To predict the class for a new instance \(\mathbf{x}_j\):

  1. Compute the distance \(d(\mathbf{x}_i, \mathbf{x}_j)\) for all training instances \(i\).
  2. Find the training instance \(\mathbf{x}_i\) that minimizes this distance.
  3. Assign \(\hat{y}_j = y_i\) (the label of the closest neighbor).

The core idea is that similar instances (close in feature space) should have similar labels.
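The three-step procedure above can be sketched in a few lines of NumPy; the training points and labels here are purely illustrative:

```python
import numpy as np

def predict_1nn(X_train, y_train, x_new):
    """Predict the label of x_new as that of its single nearest training point."""
    # Euclidean distance from x_new to every stored training instance
    dists = np.linalg.norm(X_train - x_new, axis=1)
    return y_train[np.argmin(dists)]

# Toy data (illustrative): two clusters in 2-D
X_train = np.array([[0.0, 0.0], [0.5, 0.5], [5.0, 5.0], [5.5, 4.5]])
y_train = np.array(["blue", "blue", "red", "red"])

print(predict_1nn(X_train, y_train, np.array([0.2, 0.1])))  # → blue
print(predict_1nn(X_train, y_train, np.array([5.2, 5.1])))  # → red
```

Note that "learning" here is nothing more than keeping `X_train` and `y_train` in memory; all the work happens at prediction time.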

k-Nearest Neighbors (KNN): Generalizes 1NN by considering the K closest neighbors and aggregating their labels, typically via majority vote.

Formally: \(\hat{y}_j = g(\{y_i \mid \mathbf{x}_i \text{ among K nearest to } \mathbf{x}_j\})\), where \(g\) is often the mode (most frequent label).
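A minimal KNN sketch with \(g\) implemented as the mode of the neighbor labels (the toy data is illustrative):

```python
from collections import Counter
import numpy as np

def predict_knn(X_train, y_train, x_new, k=3):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    dists = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(dists)[:k]             # indices of the k closest neighbors
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]           # g(.): most frequent neighbor label

# Toy data (illustrative)
X_train = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0], [5.0, 7.0], [3.5, 4.5]])
y_train = np.array(["A", "A", "B", "B", "B"])

print(predict_knn(X_train, y_train, np.array([3.2, 4.2]), k=3))  # → B
```

With k=1 this reduces exactly to the 1NN rule above.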

Here is a classic illustration of KNN classification:

KNN example with shapes and a test point (green) classified differently for K=3 and K=5

Another KNN illustration showing varying K

In these diagrams, a green test point is classified based on surrounding blue and red points. For small K, one class dominates the inner circle; for larger K, the other class may win.

Decision Boundaries in Nearest Neighbor

1NN Decision Boundary: The feature space is divided into Voronoi cells, where each cell consists of all points closer to a particular training instance than to any other. The decision boundary is the Voronoi tessellation, with each cell labeled by its training point’s class.

Voronoi diagram illustrating 1NN decision boundaries

Effect of K on Decision Boundaries:

As K increases, boundaries become smoother:

  • Low K (e.g., 1): Jagged, complex boundaries sensitive to noise.
  • Medium K: Balanced regions.
  • High K: Very smooth, potentially missing small clusters.

Decision boundaries for varying K in KNN

Comparison of decision boundaries for different K values

Key Design Choices in KNN

  1. K (number of neighbors): Typically a small value (often < 10), usually odd to avoid tied votes, chosen via validation.
  2. Distance Metric \(d()\): Commonly Euclidean distance \(d(\mathbf{x}_i, \mathbf{x}_j) = \sqrt{\sum_{m=1}^p (x_{im} - x_{jm})^2}\).
  3. Aggregation Function \(g()\): Majority vote for classification.

Choosing Optimal K: Cross-Validation

1NN always achieves 0% training error (each point is closest to itself), but this overfits.

Use k-fold cross-validation to estimate generalization error for different K. Plot training error, CV error, and (if available) test error vs. K.

Error curves for KNN: training (low for small K), CV, and test error

Optimal K selection via validation error minimum

Key insights:

  • Small K: Overfitting (low training error, high test error).
  • Large K: Underfitting (higher errors overall).
  • Optimal K: Minimizes CV error (proxy for test error).
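The selection procedure above can be sketched with scikit-learn (assuming it is available; the synthetic dataset stands in for a real training set):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data (illustrative stand-in for a real dataset)
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# Estimate generalization error for each candidate K via 5-fold cross-validation
cv_error = {}
for k in [1, 3, 5, 7, 9]:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    cv_error[k] = 1 - scores.mean()  # mean CV misclassification error

best_k = min(cv_error, key=cv_error.get)
print("best K:", best_k, "CV error:", round(cv_error[best_k], 3))
```

The K minimizing the cross-validation error is then used to fit the final model on all training data.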

Strengths and Weaknesses of Nearest Neighbor

Strengths:

  • Simple and easy to implement.
  • Efficient learning (just store data).

Weaknesses:

  • Inefficient prediction: Each query computes distances to all n training points, costing O(np) for p features, unless an index structure such as a k-d tree is used.
  • Curse of Dimensionality: In high dimensions, distances become less meaningful; need exponentially more data for reliable neighbors.

Illustration of curse of dimensionality
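One facet of the curse of dimensionality, distance concentration, shows up in a quick simulation (uniform random data, purely illustrative): as the dimension grows, the nearest and farthest points become nearly equidistant from a query, so "nearest" carries less information.

```python
import numpy as np

rng = np.random.default_rng(0)
ratios = {}
for p in [2, 10, 100, 1000]:
    X = rng.random((500, p))          # 500 points uniform in the p-dim unit hypercube
    q = rng.random(p)                 # a random query point
    d = np.linalg.norm(X - q, axis=1)
    ratios[p] = d.min() / d.max()     # nearest-to-farthest distance ratio
    print(p, round(ratios[p], 3))
# The ratio climbs toward 1 as p grows: all neighbors look about equally far away.
```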

Practice Example

Classify the point (3,4) using K=3. Suppose the four training points and their labels are (1,2) blue, (2,5) red, (5,3) red, and (4,6) blue (labels assumed for illustration, as in the notes).

Distances (Euclidean):

  • To (1,2): \(\sqrt{(3-1)^2 + (4-2)^2} = \sqrt{8} \approx 2.83\)
  • To (2,5): \(\sqrt{(3-2)^2 + (4-5)^2} = \sqrt{2} \approx 1.41\)
  • To (5,3): \(\sqrt{(3-5)^2 + (4-3)^2} = \sqrt{5} \approx 2.24\)
  • To (4,6): \(\sqrt{(3-4)^2 + (4-6)^2} = \sqrt{5} \approx 2.24\)

Three nearest: (2,5) at 1.41, then (5,3) and (4,6) at 2.24 each. Their labels (red, red, blue) give a majority vote of red.
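The hand-computed distances can be checked quickly in Python:

```python
import math

points = [(1, 2), (2, 5), (5, 3), (4, 6)]
query = (3, 4)

# Euclidean distance from the query to each training point, sorted ascending
dists = sorted(math.dist(query, p) for p in points)
print([round(d, 2) for d in dists])  # → [1.41, 2.24, 2.24, 2.83]
```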

Example 2D points for KNN calculation

This concludes the detailed overview of Nearest Neighbor methods. These concepts form a foundation for lazy learning in machine learning.