Logistic Regression, Support Vector Machines, and Kernel SVM

Part 1: Logistic Regression

1.1 Introduction to Logistic Regression

Logistic regression is a probabilistic model for binary classification. Unlike linear regression, which predicts continuous values, logistic regression outputs probabilities between 0 and 1.

The core idea is to apply the sigmoid (logistic) function to a linear score:

\[ \sigma(z) = \frac{1}{1 + e^{-z}} \]

where \( z = \mathbf{w}^T \mathbf{x} + b \).

The model predicts:

\[ P(y=1 \mid \mathbf{x}) = \sigma(\mathbf{w}^T \mathbf{x} + b) \]

1.2 The Sigmoid Function

The sigmoid function maps any real number to (0, 1):

  x     σ(x)
-10    0.0000
 -8    0.0003
 -6    0.0025
 -4    0.0180
 -2    0.1192
  0    0.5000
  2    0.8808
  4    0.9820
  6    0.9975
  8    0.9997
 10    1.0000
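
The table values can be reproduced in a few lines of Python (a minimal sketch; the helper name `sigmoid` is ours):

```python
import math

def sigmoid(z: float) -> float:
    """Logistic sigmoid: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Reproduce the table above
for z in range(-10, 11, 2):
    print(f"{z:>4} {sigmoid(z):.4f}")
```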

[Figure: logistic sigmoid function curve]

[Figure: sigmoid function and its derivative]

1.3 Learning: Maximum Likelihood Estimation

We maximize the log-likelihood:

\[ \ell(\mathbf{w}) = \sum_{i} \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right] \]

Equivalently, minimize the cross-entropy loss:

\[ J(\mathbf{w}) = \sum_{i} \left[ -y_i (\mathbf{w}^T \mathbf{x}_i) + \log(1 + e^{\mathbf{w}^T \mathbf{x}_i}) \right] \]

The gradient is:

\[ \nabla J = \sum_{i} (\hat{y}_i - y_i) \mathbf{x}_i \]

Update using gradient descent:

\[ \mathbf{w} \leftarrow \mathbf{w} - \eta \nabla J \]
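
The update rule can be sketched as a plain-Python batch training loop (illustrative only; the toy data, learning rate, and epoch count are our choices):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, eta=0.1, epochs=1000):
    """Batch gradient descent on the cross-entropy loss.
    X: feature vectors with the bias folded in as a leading 1;
    y: 0/1 labels."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            y_hat = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)))
            for j, xj in enumerate(xi):
                grad[j] += (y_hat - yi) * xj   # (y_hat - y) * x term
        w = [wj - eta * gj for wj, gj in zip(w, grad)]
    return w

# Toy 1-D data: label is 1 when the feature exceeds 1.5
X = [[1, 0], [1, 1], [1, 2], [1, 3]]   # leading 1 = bias term
y = [0, 0, 1, 1]
w = train_logistic(X, y)
```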

1.4 Regularization

To prevent overfitting, add an L2 penalty to the loss:

\[ J(\mathbf{w}) = \dots + \frac{\lambda}{2} \|\mathbf{w}\|^2 \]

1.5 Example: Computer Purchase Prediction

Features (binary indicators): Age>40, High Income, Student, Fair Credit, preceded by a constant 1 for the bias term.

Learned weights: w = [-5, 1.2, 3, -2, 0.7], where the first entry is the bias b = -5.

For a new customer [1, 0, 1, 0, 0] (high income, all other indicators 0): score = -5 + 3 = -2, so P(buy=1) = σ(-2) ≈ 0.119 → predict no.
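
The arithmetic checks out (a quick verification; the leading 1 in the feature vector pairs with the bias weight):

```python
import math

w = [-5, 1.2, 3, -2, 0.7]   # bias first, then the four feature weights
x = [1, 0, 1, 0, 0]         # constant 1, Age>40=0, HighIncome=1, Student=0, FairCredit=0

score = sum(wi * xi for wi, xi in zip(w, x))   # -2.0
p_buy = 1.0 / (1.0 + math.exp(-score))         # ≈ 0.119, below 0.5 → predict no
```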

Part 2: Support Vector Machines (SVM)

2.1 Maximizing the Margin

SVM finds the hyperplane that maximizes the margin between classes.

[Figure: SVM margin diagram]

[Figure: SVM with support vectors highlighted]

2.2 Hard-Margin SVM

For separable data:

Constraints: \( y_i (\mathbf{w}^T \mathbf{x}_i + b) \geq 1 \)

Objective: minimize \( \frac{1}{2} \|\mathbf{w}\|^2 \)

Margin width: \( \frac{2}{\|\mathbf{w}\|} \)
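
As a quick numeric sanity check (toy numbers of our choosing): for w = (3, 4), the margin width is 2/‖w‖ = 2/5 = 0.4.

```python
import math

w = (3.0, 4.0)            # toy weight vector
norm_w = math.hypot(*w)   # ||w|| = 5
margin = 2.0 / norm_w     # 2 / ||w|| = 0.4
```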

2.3 Soft-Margin SVM

Introduce slack variables \( \xi_i \geq 0 \), relaxing the constraints to \( y_i (\mathbf{w}^T \mathbf{x}_i + b) \geq 1 - \xi_i \):

Objective: minimize \( \frac{1}{2} \|\mathbf{w}\|^2 + C \sum_i \xi_i \)

Equivalent unconstrained: hinge loss

\[ L_{\text{hinge}} = \max(0, 1 - y_i f(\mathbf{x}_i)) \]

f(x) (for y = 1)   Hinge loss
  -2               3
  -1               2
   0               1
   0.5             0.5
   1               0
   1.5             0
   2               0
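
The table values follow directly from the definition (a small sketch):

```python
def hinge_loss(f, y=1):
    """max(0, 1 - y * f(x)) for a single example."""
    return max(0.0, 1.0 - y * f)

# Reproduce the table above for a positive example (y = 1)
for f in (-2, -1, 0, 0.5, 1, 1.5, 2):
    print(f, hinge_loss(f))
```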

[Figure: hinge loss graph]

[Figure: hinge vs. logistic loss comparison]

2.4 Support Vectors

Only points with \( y_i f(\mathbf{x}_i) \leq 1 \) (on the margin, inside it, or misclassified) influence the model; all other training points incur zero hinge loss and drop out of the solution.

Part 3: Kernel SVM

3.1 Non-Linear Data

Linear boundaries fail for complex patterns, e.g., circular data.

Solution: Map to higher dimensions using φ(x).

[Figure: kernel trick applied to circular data]

[Figure: non-linear SVM decision boundary example]

3.2 The Kernel Trick

Compute dot products in feature space without explicit mapping:

\[ k(\mathbf{x}, \mathbf{y}) = \phi(\mathbf{x})^T \phi(\mathbf{y}) \]
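
To see the trick concretely, take the homogeneous quadratic kernel \( k(\mathbf{x}, \mathbf{y}) = (\mathbf{x}^T \mathbf{y})^2 \) in two dimensions; its explicit feature map is \( \phi(\mathbf{x}) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2) \), and the two computations agree (the kernel and test points here are our illustrative choices):

```python
import math

def k_quad(x, y):
    """Quadratic kernel: (x . y)^2, computed without any feature map."""
    return (x[0]*y[0] + x[1]*y[1]) ** 2

def phi(x):
    """Explicit feature map for the quadratic kernel in 2-D."""
    return (x[0]**2, math.sqrt(2) * x[0] * x[1], x[1]**2)

x, y = (1.0, 2.0), (3.0, 4.0)
lhs = k_quad(x, y)                               # (1*3 + 2*4)^2 = 121
rhs = sum(a*b for a, b in zip(phi(x), phi(y)))   # same value via phi
```

The kernel evaluation costs one 2-D dot product, while the explicit route requires building 3-D feature vectors first; for higher-degree kernels the explicit feature space grows combinatorially, which is exactly what the trick avoids.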

3.3 Common Kernels

  • Linear: \( k(\mathbf{x}, \mathbf{y}) = \mathbf{x}^T \mathbf{y} \)
  • Polynomial: \( k(\mathbf{x}, \mathbf{y}) = (r + \mathbf{x}^T \mathbf{y})^d \)
  • RBF/Gaussian: \( k(\mathbf{x}, \mathbf{y}) = \exp(-\gamma \|\mathbf{x} - \mathbf{y}\|^2) \)

RBF similarity (γ=0.5):

Distance   Kernel value
   0         1.0000
   1         0.6065
   2         0.1353
   3         0.0111
   5         0.0000
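
The decay in the table is just \( \exp(-\gamma d^2) \) with γ = 0.5:

```python
import math

gamma = 0.5
# Reproduce the RBF similarity table above
for d in (0, 1, 2, 3, 5):
    print(d, f"{math.exp(-gamma * d * d):.4f}")
```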

[Figure: RBF kernel decision boundary]

[Figure: Gaussian kernel example]

3.4 Prediction

\[ f(\mathbf{x}) = \operatorname{sign}\left( \sum_{i \in \text{SV}} \alpha_i y_i \, k(\mathbf{x}_i, \mathbf{x}) + b \right) \]
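
The decision function can be sketched directly from the formula (the support vectors, coefficients α, and b below are made-up illustrative values, not the result of any fit):

```python
import math

def rbf(x, y, gamma=0.5):
    """Gaussian (RBF) kernel."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def predict(x, support_vectors, alphas, labels, b, kernel=rbf):
    """f(x) = sign(sum_i alpha_i * y_i * k(x_i, x) + b)."""
    score = sum(a * y * kernel(sv, x)
                for sv, a, y in zip(support_vectors, alphas, labels)) + b
    return 1 if score >= 0 else -1

# Hypothetical support set: one positive point at the origin,
# one negative point at (3, 3)
svs    = [(0.0, 0.0), (3.0, 3.0)]
alphas = [1.0, 1.0]
labels = [+1, -1]
b      = 0.0

print(predict((0.5, 0.5), svs, alphas, labels, b))   # near the positive SV
```

Note that prediction cost scales with the number of support vectors, since the kernel must be evaluated against each one.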

Part 4: Comparison and Practical Advice

  • Logistic Regression: probabilistic outputs, interpretable coefficients, fast to train.
  • Linear SVM: maximizes the margin; the solution is sparse (only support vectors matter).
  • Kernel SVM: handles non-linear boundaries; powerful but expensive on large datasets.

Workflow: Start with logistic regression; escalate to a kernel SVM only if a linear model underfits. Tune hyperparameters (e.g., λ, C, γ) via cross-validation.

This guide provides a thorough foundation. Experiment with implementations in libraries like scikit-learn.