k-Nearest Neighbors

A supervised learning algorithm that classifies a data point based on the class labels of its nearest neighbors in the feature space. For a query point $x$, it finds the $k$ points in the training set with the smallest distance to $x$ (typically Euclidean distance) and assigns the most frequent class among these $k$ neighbors. Ties can be broken randomly, by distance weighting, by class frequency in the training set, etc.

Formally, given training data $\{(x_i, y_i)\}_{i=1}^{n}$ and a distance metric $d$, the predicted class is:

$$\hat{y} = \underset{c}{\arg\max} \sum_{i \in N_k(x)} \mathbb{1}(y_i = c)$$

where $N_k(x)$ are the indices of the $k$ nearest neighbors of $x$ under $d$.

The algorithm makes no assumptions about the underlying data distribution - it simply memorizes all training points and performs local majority voting during prediction. This non-parametric nature means k-NN can capture arbitrarily complex decision boundaries, but at the cost of storing the entire training set and potentially slow inference.
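The "memorize and vote" idea above fits in a few lines. A minimal sketch (the helper name `knn_predict` and the toy data are my own, not from the note):

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (feature_vector, label) pairs.
    """
    # Compute the Euclidean distance from the query to every training point.
    dists = sorted((math.dist(x, query), y) for x, y in train)
    # Vote over the labels of the k closest points.
    top_k = [y for _, y in dists[:k]]
    return Counter(top_k).most_common(1)[0][0]

train = [((0.0, 0.0), "A"), ((0.1, 0.2), "A"),
         ((1.0, 1.0), "B"), ((0.9, 1.1), "B")]
print(knn_predict(train, (0.2, 0.1), k=3))  # "A": two A's and one B among the 3 nearest
```

Note that "training" is just storing the list; all the work happens at prediction time, which is why inference over a large training set is slow without index structures.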

Key properties and considerations

Choice of $k$: Small $k$ leads to flexible but noisy boundaries (high variance), while large $k$ produces smoother boundaries but may miss local patterns (high bias).
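The variance effect is easy to demonstrate: a single mislabeled training point flips the prediction when $k=1$ but is outvoted for larger $k$. A toy illustration (the data and helper `vote` are invented for this example):

```python
import math
from collections import Counter

def vote(train, query, k):
    # Majority label among the k nearest training points (Euclidean distance).
    dists = sorted((math.dist(x, query), y) for x, y in train)
    return Counter(y for _, y in dists[:k]).most_common(1)[0][0]

# A tight cluster of "A" points with one mislabeled "B" in its midst.
train = [((0.0, 0.0), "A"), ((0.2, 0.0), "A"), ((0.0, 0.2), "A"),
         ((0.2, 0.2), "A"), ((0.1, 0.1), "B"),   # noise point
         ((2.0, 2.0), "B"), ((2.2, 2.0), "B")]

query = (0.12, 0.1)
print(vote(train, query, k=1))  # "B": the single nearest neighbor is the noise point
print(vote(train, query, k=5))  # "A": the larger neighborhood outvotes the noise
```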

Distance metric: While Euclidean distance is common, the choice dramatically affects performance. Cosine distance works well for high-dimensional sparse data, Hamming distance for categorical features, and Mahalanobis distance can account for feature correlations.
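For reference, the first three of these metrics are simple to write down (a sketch with made-up example inputs; real code would use a library such as `scipy.spatial.distance`):

```python
import math

def euclidean(a, b):
    # Straight-line distance between numeric feature vectors.
    return math.dist(a, b)

def cosine_distance(a, b):
    # 1 - cosine similarity: small when vectors point the same way,
    # regardless of magnitude.
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (math.hypot(*a) * math.hypot(*b))

def hamming(a, b):
    # Number of positions where categorical features disagree.
    return sum(x != y for x, y in zip(a, b))

print(euclidean((0, 0), (3, 4)))                    # 5.0
print(round(cosine_distance((1, 0), (10, 0)), 6))   # 0.0 (same direction)
print(hamming(("red", "S"), ("red", "M")))          # 1
```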

Curse of dimensionality: In high dimensions, all points become roughly equidistant, making the notion of "nearest" neighbors meaningless. Feature selection or dimensionality reduction often becomes necessary.
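The equidistance effect can be seen directly: for uniform random points, the ratio of the farthest to the nearest distance from a fixed reference shrinks toward 1 as dimension grows. A quick numerical check (the helper name and parameters are illustrative):

```python
import math
import random

def distance_spread(dim, n_points=200, seed=0):
    # Ratio of the farthest to the nearest distance from the origin
    # among uniform random points in [0, 1]^dim.
    rng = random.Random(seed)
    pts = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    dists = [math.sqrt(sum(x * x for x in p)) for p in pts]
    return max(dists) / min(dists)

print(distance_spread(2))     # large spread: "nearest" is meaningful
print(distance_spread(1000))  # ratio close to 1: all points roughly equidistant
```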

For regression tasks, k-NN averages the target values of neighbors rather than voting. Extensions include weighted k-NN (giving closer neighbors more influence) and approximate nearest neighbor methods for faster search in large datasets.
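Both the regression variant and inverse-distance weighting are small changes to the voting code. A sketch combining them (function name, data, and the `eps` guard are my own choices):

```python
import math

def knn_regress(train, query, k=3, weighted=False):
    """Predict a numeric target as the (optionally distance-weighted)
    average of the k nearest neighbors' targets."""
    dists = sorted((math.dist(x, query), y) for x, y in train)[:k]
    if weighted:
        # Inverse-distance weighting: closer neighbors count more.
        eps = 1e-12  # avoids division by zero for exact feature matches
        weights = [1.0 / (d + eps) for d, _ in dists]
        return sum(w * y for w, (_, y) in zip(weights, dists)) / sum(weights)
    return sum(y for _, y in dists) / k

train = [((0.0,), 1.0), ((1.0,), 2.0), ((2.0,), 3.0), ((3.0,), 4.0)]
print(knn_regress(train, (1.1,), k=2))                 # plain mean of targets 2.0 and 3.0
print(knn_regress(train, (1.1,), k=2, weighted=True))  # pulled toward the closer target 2.0
```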
