This blog post provides a detailed explanation of two key concepts in machine learning: Logistic Regression and Support Vector Machines (SVMs), both fundamental algorithms for classification tasks.
1. Logistic Regression
Logistic Regression is a probabilistic classification model used primarily for binary classification problems. Unlike linear regression, which predicts continuous values, logistic regression outputs the probability that a given input belongs to a particular class (typically the positive class, labeled as y=1). This probability is constrained to the interval [0, 1].
1.1 From Probability to Odds to Log-Odds
Directly modeling probability p (where 0 ≤ p ≤ 1) is challenging because linear models can produce values outside this range. To address this, we transform the probability:
- Odds: Defined as p / (1 – p), odds range from 0 to ∞.
- Log-Odds (Logit): The natural logarithm of odds, log(p / (1 – p)), which ranges from -∞ to +∞. This unrestricted range allows us to model it as a linear function of the features.
Figure 1: Illustration of the transformation from probability to odds and log-odds.
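The probability-to-odds-to-log-odds chain above can be sketched in a few lines of Python (the sample probabilities are illustrative, not from the post):

```python
import numpy as np

def odds(p):
    """Convert a probability in (0, 1) to odds in (0, inf)."""
    return p / (1 - p)

def log_odds(p):
    """Convert a probability to log-odds (logit), which ranges over all reals."""
    return np.log(odds(p))

# p = 0.5 corresponds to odds of 1 and log-odds of 0 (the "even" point)
print(odds(0.5))      # 1.0
print(log_odds(0.5))  # 0.0
print(log_odds(0.9))  # positive: the event is more likely than not
print(log_odds(0.1))  # negative, symmetric to the case above
```

Note how the logit is symmetric around p = 0.5, which is why it is a natural target for a linear model.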
1.2 The Logistic (Sigmoid) Function
By setting the log-odds equal to a linear combination of features, we derive the logistic function:
$$ p(y=1 \mid \mathbf{x}) = \sigma(\mathbf{w}^T \mathbf{x} + w_0) = \frac{1}{1 + e^{-(\mathbf{w}^T \mathbf{x} + w_0)}} $$
Here, \(\sigma(z)\) is the sigmoid function, which maps any real number z to the open interval (0, 1), producing an S-shaped curve.
Figure 2: The sigmoid function mapping real values to probabilities in [0,1].
Figure 3: Additional visualization of the sigmoid curve.
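A minimal implementation of the sigmoid makes its squashing behavior concrete (the evaluation points below are illustrative):

```python
import numpy as np

def sigmoid(z):
    """Sigmoid (logistic) function: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# The sigmoid is exactly 0.5 at z = 0 and saturates toward 0 and 1
print(sigmoid(0.0))    # 0.5
print(sigmoid(10.0))   # close to 1
print(sigmoid(-10.0))  # close to 0
```

In production code one would typically use a numerically stable variant (or a library routine) to avoid overflow in `exp(-z)` for large negative `z`, but this direct form matches the formula above.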
1.3 Handling Categorical Variables
Real-world datasets often include categorical features:
- Ordinal variables (ordered categories, e.g., grades A-F): Map to ordered numbers (A=4, B=3, etc.).
- Nominal variables (unordered, e.g., eye color): Use one-hot encoding to create binary vectors, avoiding implicit ordering.
Example: Eye colors {blue, green, brown} → blue: [1, 0, 0], green: [0, 1, 0], brown: [0, 0, 1]. (Often drop one category to avoid redundancy.)
Figure 4: Example of one-hot encoding for categorical variables.
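The eye-color example above can be encoded by hand in a few lines (a sketch; real pipelines would use a library encoder):

```python
def one_hot(values, categories):
    """One-hot encode a list of nominal values given a fixed category order."""
    index = {c: i for i, c in enumerate(categories)}
    return [[1 if index[v] == j else 0 for j in range(len(categories))]
            for v in values]

colors = ["blue", "green", "brown"]
print(one_hot(["green", "blue"], colors))  # [[0, 1, 0], [1, 0, 0]]
```

Dropping one column (e.g. keeping only the `blue` and `green` indicators) removes the linear redundancy mentioned above, since the dropped category is implied when all remaining indicators are zero.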
1.4 Training Logistic Regression: Maximum Likelihood Estimation
Parameters \(\mathbf{w}\) and \(w_0\) are learned by maximizing the likelihood of observing the training data. The log-likelihood function is:
$$ L(\mathbf{w}, w_0) = \sum_{i=1}^n \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right] $$
where \( p_i = \sigma(\mathbf{w}^T \mathbf{x}_i + w_0) \).
This is equivalent to minimizing the negative log-likelihood (cross-entropy loss). Unlike linear regression, no closed-form solution exists, so optimization uses iterative methods like gradient descent. Because the log-likelihood is concave (equivalently, the cross-entropy loss is convex), these methods converge to the global optimum.
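The training loop can be sketched as gradient ascent on the log-likelihood. This is a minimal illustration with a hypothetical 1-D toy dataset and hand-picked learning rate, not a production implementation:

```python
import numpy as np

def train_logreg(X, y, lr=0.1, n_iter=2000):
    """Fit logistic regression by gradient ascent on the mean log-likelihood
    (equivalently, gradient descent on the cross-entropy loss)."""
    n, d = X.shape
    w, w0 = np.zeros(d), 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ w + w0)))  # predicted P(y=1 | x)
        grad_w = X.T @ (y - p) / n               # gradient of mean log-likelihood
        grad_w0 = np.mean(y - p)
        w += lr * grad_w                          # step uphill on the likelihood
        w0 += lr * grad_w0
    return w, w0

# Toy separable data: class 1 for larger x values (illustrative only)
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
w, w0 = train_logreg(X, y)
p = 1.0 / (1.0 + np.exp(-(X @ w + w0)))
print((p > 0.5).astype(int))  # should match y on this toy set
```

The gradient `X.T @ (y - p)` follows directly from differentiating the log-likelihood above: each example pushes the weights in proportion to its prediction error.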
2. Support Vector Machines (SVMs)
Support Vector Machines are discriminative classifiers that directly learn a decision boundary, outputting class labels (+1 or -1) without probabilities.
2.1 Linear SVM Decision Boundary
The decision rule is:
$$ \hat{y} = \operatorname{sign}(\mathbf{w}^T \mathbf{x} + b) $$
This defines a hyperplane \(\mathbf{w}^T \mathbf{x} + b = 0\).
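The decision rule is trivially short in code. The weights and points below are made up for illustration; in practice they would come from the optimization described next:

```python
import numpy as np

def svm_predict(X, w, b):
    """Linear SVM decision rule: sign of the signed value w^T x + b."""
    return np.sign(X @ w + b)

# Hypothetical learned parameters for a 2-D problem
w = np.array([1.0, -1.0])
b = 0.0
X = np.array([[2.0, 1.0], [1.0, 2.0]])
print(svm_predict(X, w, b))  # one point on each side of the hyperplane
```

Note the contrast with logistic regression: the output is a hard label in {+1, -1}, with no probability attached.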
2.2 Choosing the Optimal Boundary: Maximum Margin
Multiple hyperplanes may separate the data perfectly. SVM selects the one maximizing the margin—the distance to the nearest points (support vectors)—for better generalization and robustness to noise.
Figure 5: Comparison of possible separating hyperplanes; the maximum margin one is most robust.
Figure 6: Illustration highlighting the maximum margin principle.
Figure 7: SVM hyperplane with maximum margin and highlighted support vectors.
Figure 8: Detailed view of the maximum margin hyperplane and support vectors.
2.3 Margin Calculation and Optimization
The margin boundaries are the parallel hyperplanes \(\mathbf{w}^T \mathbf{x} + b = 1\) (positive side) and \(\mathbf{w}^T \mathbf{x} + b = -1\) (negative side), on which the support vectors lie.
The distance from a point to the hyperplane is \( \frac{|\mathbf{w}^T \mathbf{x} + b|}{||\mathbf{w}||} \). For support vectors, this is \( \frac{1}{||\mathbf{w}||} \), so total margin is \( \frac{2}{||\mathbf{w}||} \).
SVM maximizes the margin by minimizing \( ||\mathbf{w}|| \) (or \( \frac{1}{2} ||\mathbf{w}||^2 \)) subject to:
$$ y_i (\mathbf{w}^T \mathbf{x}_i + b) \geq 1 \quad \forall i $$
Figure 9: Diagram showing margin calculation and distance to hyperplane.
This constrained optimization yields a robust classifier focused on the most critical data points (support vectors).
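The margin arithmetic above can be checked numerically. The weight vector below is a made-up example chosen so the norm is easy to verify by hand:

```python
import numpy as np

def margin_width(w):
    """Total geometric margin 2 / ||w|| of a hard-margin linear SVM."""
    return 2.0 / np.linalg.norm(w)

def distance_to_hyperplane(x, w, b):
    """Distance from point x to the hyperplane w^T x + b = 0."""
    return abs(w @ x + b) / np.linalg.norm(w)

w = np.array([3.0, 4.0])   # ||w|| = 5, so the margin is 2/5
b = -5.0
print(margin_width(w))                                      # 0.4
print(distance_to_hyperplane(np.array([3.0, 4.0]), w, b))   # |25 - 5| / 5 = 4.0
```

This also makes the optimization objective intuitive: shrinking \(||\mathbf{w}||\) directly widens the margin, while the constraints keep every training point on the correct side of its margin boundary.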
Conclusion
Logistic Regression provides probabilistic outputs ideal for interpreting confidence, while SVMs excel in finding robust boundaries for high-dimensional data. Both are cornerstone algorithms in supervised learning.