Part 1: Logistic Regression
1.1 Introduction to Logistic Regression
Logistic regression is a probabilistic model for binary classification. Unlike linear regression, which predicts continuous values, logistic regression outputs probabilities between 0 and 1.
The core idea is to apply the sigmoid (logistic) function to a linear score:
\[ \sigma(z) = \frac{1}{1 + e^{-z}} \]
where \( z = \mathbf{w}^T \mathbf{x} + b \).
The model predicts:
\[ P(y=1 \mid \mathbf{x}) = \sigma(\mathbf{w}^T \mathbf{x} + b) \]
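The model above can be sketched in a few lines of Python; `sigmoid` and `predict_proba` are illustrative names, not from any particular library:

```python
import math

def sigmoid(z):
    """Squash a real-valued score into a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(w, b, x):
    """P(y=1 | x) = sigmoid(w^T x + b) for a single example."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(z)
```

A score of 0 maps to probability 0.5, the natural decision threshold.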
1.2 The Sigmoid Function
The sigmoid function maps any real number to (0, 1):
| x | σ(x) |
|---|---|
| -10 | 0.0000 |
| -8 | 0.0003 |
| -6 | 0.0025 |
| -4 | 0.0180 |
| -2 | 0.1192 |
| 0 | 0.5000 |
| 2 | 0.8808 |
| 4 | 0.9820 |
| 6 | 0.9975 |
| 8 | 0.9997 |
| 10 | 1.0000 |

1.3 Learning: Maximum Likelihood Estimation
We maximize the log-likelihood, where \( \hat{y}_i = \sigma(\mathbf{w}^T \mathbf{x}_i + b) \):
\[ \ell(\mathbf{w}) = \sum_{i} \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right] \]
Equivalently, absorbing the bias \( b \) into \( \mathbf{w} \) via a constant feature, minimize the cross-entropy loss:
\[ J(\mathbf{w}) = \sum_{i} \left[ -y_i (\mathbf{w}^T \mathbf{x}_i) + \log(1 + e^{\mathbf{w}^T \mathbf{x}_i}) \right] \]
The gradient is:
\[ \nabla J = \sum_{i} (\hat{y}_i - y_i) \mathbf{x}_i \]
Update using gradient descent:
\[ \mathbf{w} \leftarrow \mathbf{w} - \eta \nabla J \]
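The update rule above can be sketched as a small NumPy training loop; `train_logreg` is an illustrative name, and averaging the gradient over the batch is a stability choice, not from the text:

```python
import numpy as np

def train_logreg(X, y, lr=0.1, epochs=200):
    """Batch gradient descent on the cross-entropy loss.

    X: (n, d) feature matrix; y: (n,) labels in {0, 1}.
    """
    n, d = X.shape
    Xb = np.hstack([np.ones((n, 1)), X])  # constant column absorbs the bias b
    w = np.zeros(d + 1)
    for _ in range(epochs):
        y_hat = 1.0 / (1.0 + np.exp(-(Xb @ w)))  # sigmoid predictions
        grad = Xb.T @ (y_hat - y)                # sum_i (y_hat_i - y_i) x_i
        w -= lr * grad / n                       # averaged step for stability
    return w
```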
1.4 Regularization
To prevent overfitting, add an L2 penalty to the loss:
\[ J(\mathbf{w}) = \dots + \frac{\lambda}{2} \|\mathbf{w}\|^2 \]
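The penalty changes the gradient by a single extra term; a minimal sketch (the function name is illustrative, and in practice the bias weight is usually excluded from the penalty):

```python
import numpy as np

def l2_regularized_grad(X, y, w, lam):
    """Gradient of the cross-entropy loss plus (lam/2) * ||w||^2;
    the penalty contributes the extra term lam * w."""
    y_hat = 1.0 / (1.0 + np.exp(-(X @ w)))
    return X.T @ (y_hat - y) + lam * w
```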
1.5 Example: Computer Purchase Prediction
Features: Age>40, High Income, Student, Fair Credit, preceded by a constant bias feature.
Learned weights: w = [-5, 1.2, 3, -2, 0.7], where the first entry is the bias.
For a new customer [1, 0, 1, 0, 0] (bias plus High Income only): score = -5 + 3 = -2, so P(buy=1) = σ(-2) ≈ 0.119 → predict no.
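The arithmetic of this worked example is easy to check directly:

```python
import math

# Weights from the text: [bias, Age>40, High Income, Student, Fair Credit]
w = [-5, 1.2, 3, -2, 0.7]
x = [1, 0, 1, 0, 0]  # constant bias feature plus High Income only
score = sum(wi * xi for wi, xi in zip(w, x))  # -5 + 3 = -2
p = 1.0 / (1.0 + math.exp(-score))            # sigma(-2), roughly 0.119
```

Since p is below 0.5, the model predicts "no purchase".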
Part 2: Support Vector Machines (SVM)
2.1 Maximizing the Margin
SVM finds the hyperplane that maximizes the margin between classes.


2.2 Hard-Margin SVM
For separable data:
Constraints: \( y_i (\mathbf{w}^T \mathbf{x}_i + b) \geq 1 \)
Objective: minimize \( \frac{1}{2} \|\mathbf{w}\|^2 \)
Margin width: \( \frac{2}{\|\mathbf{w}\|} \)
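A quick sketch tying the constraints and the margin width together (`hard_margin_check` is an illustrative helper, not a library function):

```python
import numpy as np

def hard_margin_check(w, b, X, y):
    """Return the margin width 2/||w|| if every constraint
    y_i (w^T x_i + b) >= 1 holds, else raise ValueError."""
    if not np.all(y * (X @ w + b) >= 1 - 1e-9):
        raise ValueError("data violates the hard-margin constraints")
    return 2.0 / np.linalg.norm(w)
```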
2.3 Soft-Margin SVM
Introduce slack variables \( \xi_i \geq 0 \), relaxing the constraints to \( y_i (\mathbf{w}^T \mathbf{x}_i + b) \geq 1 - \xi_i \).
Objective: minimize \( \frac{1}{2} \|\mathbf{w}\|^2 + C \sum_i \xi_i \)
Equivalently, minimize the unconstrained hinge-loss objective, with \( f(\mathbf{x}_i) = \mathbf{w}^T \mathbf{x}_i + b \):
\[ L_{\text{hinge}} = \max(0, 1 - y_i f(\mathbf{x}_i)) \]
| f(x) (for y=1) | Hinge Loss |
|---|---|
| -2 | 3 |
| -1 | 2 |
| 0 | 1 |
| 0.5 | 0.5 |
| 1 | 0 |
| 1.5 | 0 |
| 2 | 0 |
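The table above follows directly from the definition; a one-line sketch:

```python
def hinge(y, f):
    """Hinge loss max(0, 1 - y * f) for a label y in {-1, +1} and score f."""
    return max(0.0, 1.0 - y * f)
```

Scores at or beyond the margin (y * f >= 1) incur zero loss.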


2.4 Support Vectors
Only points with \( y_i f(\mathbf{x}_i) \leq 1 \) (on the margin, inside it, or misclassified) influence the model; these are the support vectors.
Part 3: Kernel SVM
3.1 Non-Linear Data
Linear boundaries fail for complex patterns, e.g., circular data.
Solution: Map to higher dimensions using φ(x).


3.2 The Kernel Trick
Compute dot products in feature space without explicit mapping:
\[ k(\mathbf{x}, \mathbf{y}) = \phi(\mathbf{x})^T \phi(\mathbf{y}) \]
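A concrete illustration of the trick: for 2-D inputs, the explicit map below (an illustrative choice) reproduces the degree-2 polynomial kernel \( (\mathbf{x}^T \mathbf{y})^2 \) exactly, yet the kernel never builds the 3-D feature vector:

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for 2-D inputs; its dot products
    equal the polynomial kernel (x^T y)^2."""
    x1, x2 = x
    return np.array([x1 * x1, np.sqrt(2.0) * x1 * x2, x2 * x2])

def poly2_kernel(x, y):
    """The same quantity computed without constructing phi."""
    return float(np.dot(x, y)) ** 2

x, y = np.array([1.0, 2.0]), np.array([3.0, 1.0])
same = np.isclose(float(np.dot(phi(x), phi(y))), poly2_kernel(x, y))
```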
3.3 Common Kernels
- Linear: \( k(\mathbf{x}, \mathbf{y}) = \mathbf{x}^T \mathbf{y} \)
- Polynomial: \( k(\mathbf{x}, \mathbf{y}) = (r + \mathbf{x}^T \mathbf{y})^d \)
- RBF/Gaussian: \( k(\mathbf{x}, \mathbf{y}) = \exp(-\gamma \|\mathbf{x} - \mathbf{y}\|^2) \)
RBF similarity (γ=0.5):
| Distance | Kernel Value |
|---|---|
| 0 | 1.0000 |
| 1 | 0.6065 |
| 2 | 0.1353 |
| 3 | 0.0111 |
| 5 | 0.0000 |
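The values in the table can be reproduced with a direct implementation of the RBF formula (`rbf_kernel` is an illustrative name):

```python
import numpy as np

def rbf_kernel(x, y, gamma=0.5):
    """Gaussian/RBF kernel exp(-gamma * ||x - y||^2)."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.exp(-gamma * np.dot(diff, diff)))
```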


3.4 Prediction
\[ f(\mathbf{x}) = \operatorname{sign}\left( \sum_{i \in \mathrm{SV}} \alpha_i y_i k(\mathbf{x}_i, \mathbf{x}) + b \right) \]
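The prediction rule is a kernel-weighted sum over the support vectors; a minimal sketch, where `sv_alpha_y` is assumed to hold the precomputed products \( \alpha_i y_i \):

```python
def kernel_svm_predict(x, sv_x, sv_alpha_y, b, kernel):
    """sign(sum_i alpha_i * y_i * k(x_i, x) + b) over the support vectors."""
    score = sum(ay * kernel(xi, x) for ay, xi in zip(sv_alpha_y, sv_x)) + b
    return 1 if score >= 0 else -1

# Toy check with a linear kernel and two 1-D support vectors
dot = lambda a, b: sum(ai * bi for ai, bi in zip(a, b))
label = kernel_svm_predict([2.0], [[1.0], [-1.0]], [0.5, -0.5], 0.0, dot)
```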
Part 4: Comparison and Practical Advice
- Logistic Regression: Probabilistic, interpretable, fast.
- Linear SVM: Margin maximization, sparse.
- Kernel SVM: Non-linear, powerful but expensive.
Workflow: Start with logistic regression, escalate to kernel SVM if needed. Tune hyperparameters via cross-validation.
This guide provides a thorough foundation. Experiment with implementations in libraries like scikit-learn.
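The suggested workflow can be sketched with scikit-learn; the synthetic dataset and the hyperparameter grids below are illustrative choices, not recommendations:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)

# Baseline: logistic regression, tuning the inverse regularization strength C
logreg = make_pipeline(StandardScaler(), LogisticRegression())
logreg_cv = GridSearchCV(logreg, {"logisticregression__C": [0.01, 0.1, 1, 10]}, cv=5)
logreg_cv.fit(X, y)

# Escalation: RBF-kernel SVM, tuning C and gamma
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
svm_cv = GridSearchCV(svm, {"svc__C": [0.1, 1, 10], "svc__gamma": [0.01, 0.1, 1]}, cv=5)
svm_cv.fit(X, y)
```

Only move to the kernel SVM if its cross-validated score clearly beats the logistic baseline.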