This blog post provides a detailed explanation of optimization techniques in machine learning. It is intended for college students and explains concepts step-by-step with mathematical notation, worked examples, and illustrative diagrams.
1. Why Optimization Matters in Machine Learning
In machine learning, training a model means finding the model that best explains a dataset. This is fundamentally an optimization problem: searching a space of candidate models to optimize a scoring function, such as maximizing log-likelihood or minimizing error (e.g., 0/1 loss).
Optimization techniques allow us to efficiently navigate this search space, whether discrete or continuous.
2. Fundamentals of Optimization in Model Learning
2.1 The Model Space
The model space M consists of all possible models {M₁, M₂, …, Mₖ}. Models differ in:
- Parameters: Values within a fixed structure (e.g., weights w in logistic regression).
- Structure: Architecture (e.g., different tree structures in decision trees).
The goal is to find parameters and/or structure that optimize the scoring function on training data.
2.2 Two Categories of Optimization Problems
| Category | Combinatorial Optimization | Smooth Optimization |
|---|---|---|
| Model Space | Finite or countably infinite (discrete) | Uncountable (continuous) |
| Scoring Function | Discrete | Continuous and differentiable |
| Examples | Decision tree structure search | Naive Bayes parameter estimation |
| Approach | Heuristic search (e.g., greedy) | Gradient-based methods |
Analogy: Combinatorial is like “picking the best from a list”; smooth is like “tuning a dial to the perfect position.”
3. Combinatorial Optimization
Combinatorial optimization deals with discrete spaces, often too large for exhaustive enumeration due to combinatorial explosion.
3.1 State Space Representation
We model the problem as a state space graph:
- State (s): A candidate model.
- Actions(s): Operations transforming one state to another (e.g., adding a split in a decision tree).
- Result(s, a): New state after action a.
- Score(s): Quality of the model (e.g., accuracy).
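The four ingredients above can be sketched as a minimal Python interface. The names (`State`, `actions`, `result`, `score`) mirror the definitions in the text; the toy split actions and quality values are purely illustrative, not a fixed API:

```python
from dataclasses import dataclass
from typing import Iterable

@dataclass(frozen=True)
class State:
    """A candidate model, here just an opaque label plus a quality value."""
    name: str
    quality: float

def actions(s: State) -> Iterable[str]:
    """Operations applicable in state s (e.g., feature splits)."""
    return [f"split-{i}" for i in range(2)]

def result(s: State, a: str) -> State:
    """New state reached by applying action a (toy successor rule)."""
    return State(f"{s.name}/{a}", s.quality + 0.1)

def score(s: State) -> float:
    """Quality of the model represented by s (e.g., accuracy)."""
    return s.quality
```

Any concrete search algorithm (exhaustive, greedy, beam) can then be written against this interface without knowing what the states actually are.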

[Figure: illustration of a state-space search tree.]
3.2 Decision Tree Example
For binary classification with features like Age, Income, Job:
- Initial state: Single leaf with majority class.
- Actions: Split on a feature.
- Successors: Trees with one more split.
The space grows exponentially with depth.
3.3 Exhaustive vs. Heuristic Search
Exhaustive Search (DFS/BFS): Guarantees global optimum but intractable for large spaces.
Greedy Search (Heuristic):
- Start with initial model (e.g., single leaf).
- Generate successors.
- Choose the best-scoring successor.
- Repeat until no improvement.
Greedy never backtracks, making it efficient but potentially suboptimal.
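The loop above can be sketched in Python; the integer states, ±1 successors, and quadratic score are illustrative stand-ins for trees, splits, and accuracy:

```python
def greedy_search(initial, successors, score):
    """Greedy hill-climbing: repeatedly move to the best-scoring
    successor; stop as soon as no successor improves on the current state."""
    current = initial
    while True:
        candidates = successors(current)
        if not candidates:
            return current
        best = max(candidates, key=score)
        if score(best) <= score(current):
            return current  # no improvement: greedy never backtracks
        current = best

# Toy example: states are integers, successors step by ±1,
# and the score has a single peak at 5.
best_state = greedy_search(
    0,
    successors=lambda s: [s - 1, s + 1],
    score=lambda s: -(s - 5) ** 2,
)
```

On this toy landscape greedy climbs 0 → 1 → … → 5 and stops; with multiple peaks it could just as easily stall on a local one.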

[Figure: example of greedy decision tree building.]
Standard decision tree algorithms (e.g., ID3, C4.5) use greedy search with information gain or Gini impurity.
3.4 Other Heuristics
- Beam Search: Keeps top-k promising paths (vs. greedy’s single path).
- Simulated Annealing: Allows occasional worse moves to escape local optima.
- Random Restarts: Run greedy multiple times and pick the best.
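Beam search can be sketched against the same abstraction; keeping the best state seen overall (rather than only the final beam) is one common variant, and `k`, `steps`, and the toy problem below are illustrative choices:

```python
def beam_search(initial, successors, score, k=3, steps=10):
    """Keep the top-k states at each step (vs. greedy's single path),
    returning the best state seen overall."""
    beam = [initial]
    best = initial
    for _ in range(steps):
        candidates = [c for s in beam for c in successors(s)]
        if not candidates:
            break
        beam = sorted(candidates, key=score, reverse=True)[:k]
        best = max([best] + beam, key=score)
    return best

# Same toy landscape as before: integer states, score peaking at 5.
found = beam_search(0, lambda s: [s - 1, s + 1],
                    lambda s: -(s - 5) ** 2, k=2, steps=10)
```

With k = 1 this degenerates to greedy search; larger k trades computation for a wider exploration of the space.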

[Figure: comparison of beam search and greedy search.]
4. Smooth Optimization & Convex Functions
Smooth optimization applies to continuous, differentiable objectives.
4.1 Convex Sets and Functions
Convex Set: For any x, y in C and θ ∈ [0,1], θx + (1-θ)y ∈ C.

Convex Function: f(θx + (1-θ)y) ≤ θf(x) + (1-θ)f(y).
Second-derivative test (for twice-differentiable f): f''(x) ≥ 0 everywhere.

Key property: Any local minimum is global.
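The defining inequality can be spot-checked numerically. This is a sketch for intuition only: random sampling can refute convexity by finding a violation, but can never prove it; the function names and bounds are illustrative:

```python
import random

def is_convex_on_samples(f, lo, hi, trials=1000, tol=1e-9):
    """Spot-check f(tx + (1-t)y) <= t f(x) + (1-t) f(y)
    at random points x, y in [lo, hi] and random t in [0, 1]."""
    for _ in range(trials):
        x, y = random.uniform(lo, hi), random.uniform(lo, hi)
        t = random.random()
        if f(t * x + (1 - t) * y) > t * f(x) + (1 - t) * f(y) + tol:
            return False  # found a violation: f is not convex
    return True

# x^2 passes the check; -x^2 (concave) fails it almost immediately.
```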
5. Unconstrained Convex Optimization
Minimize f(x) where f is convex.
At optimum x*, ∇f(x*) = 0.
Example: f(x) = 2x² + 3x + 1
f'(x) = 4x + 3 = 0 → x* = -3/4
f''(x) = 4 > 0 (minimum).
6. Constrained Convex Optimization & Lagrange Multipliers
Minimize f₀(x) subject to inequality constraints f_i(x) ≤ 0 and equality constraints h_i(x) = 0.
Lagrangian: ℒ(x, λ, ν) = f₀(x) + ∑ λ_i f_i(x) + ∑ ν_i h_i(x), with λ_i ≥ 0.

λ represents the sensitivity of the objective to constraint relaxation.
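A minimal equality-constrained example (chosen here for illustration, not taken from the lecture) shows the mechanics: minimize x² + y² subject to x + y = 1.

```latex
\mathcal{L}(x, y, \nu) = x^2 + y^2 + \nu\,(x + y - 1)
% Stationarity:
\frac{\partial \mathcal{L}}{\partial x} = 2x + \nu = 0, \qquad
\frac{\partial \mathcal{L}}{\partial y} = 2y + \nu = 0
% hence x = y; the constraint x + y = 1 then gives
x^* = y^* = \tfrac{1}{2}, \qquad \nu^* = -1
```

Setting the gradient of ℒ to zero in both x and the multiplier recovers the constrained optimum directly.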
7. Naive Bayes Learning with Lagrange Multipliers
Maximize the log-likelihood ℓ(θ|D) subject to the normalization constraints (each probability distribution sums to 1).
Using Lagrange multipliers yields the empirical frequencies:
p_l = N_l / N (class priors)
q_{ljk} = N_{ljk} / N_l (feature conditionals)
This provides a rigorous justification for the counting-based estimates.
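The multiplier calculation for the class prior can be sketched as follows (the conditional q_{ljk} case is analogous); here N_l is the count of class l and N = ∑_l N_l:

```latex
\mathcal{L}(p, \nu) = \sum_{l} N_l \log p_l + \nu \Bigl(1 - \sum_{l} p_l\Bigr)
% Stationarity in each p_l:
\frac{\partial \mathcal{L}}{\partial p_l} = \frac{N_l}{p_l} - \nu = 0
\;\Rightarrow\; p_l = \frac{N_l}{\nu}
% Enforcing \sum_l p_l = 1 gives \nu = \sum_l N_l = N, hence
p_l = \frac{N_l}{N}
```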
8. Gradient Descent Algorithm
Iterative method: w_{t+1} = w_t − η ∇f(w_t)
η: learning rate.

Converges to the global minimum for convex functions, provided the learning rate η is suitably chosen (too large diverges, too small is slow).
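A minimal sketch of the update rule, reusing the convex example f(x) = 2x² + 3x + 1 from Section 5 (so ∇f(w) = 4w + 3, and the iterates should approach x* = −3/4):

```python
def gradient_descent(grad, w0, lr=0.1, steps=200):
    """Iterate w_{t+1} = w_t - lr * grad(w_t) for a fixed number of steps."""
    w = w0
    for _ in range(steps):
        w = w - lr * grad(w)
    return w

# Minimize f(x) = 2x^2 + 3x + 1 via its gradient f'(x) = 4x + 3.
x_star = gradient_descent(lambda w: 4 * w + 3, w0=0.0)
print(x_star)  # ≈ -0.75
```

Each update is a contraction here (w ↦ 0.6w − 0.3), so the iterates converge geometrically to the fixed point −0.75; with η above 0.5 the same iteration would diverge.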
9. Non-Convex Optimization
Multiple local minima; solution depends on initialization.

Strategies: Multiple random restarts.
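The random-restart strategy can be sketched as follows; the two-basin toy objective f(x) = x⁴ − 2x² + 0.5x is an assumption for illustration (its deeper minimum lies near x ≈ −1.05):

```python
import random

def local_minimize(grad, f, x0, lr=0.01, steps=2000):
    """Plain gradient descent from a single starting point."""
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x, f(x)

def random_restarts(grad, f, lo, hi, restarts=20, seed=0):
    """Run gradient descent from several random initializations in
    [lo, hi] and keep the lowest minimum found."""
    rng = random.Random(seed)
    return min(
        (local_minimize(grad, f, rng.uniform(lo, hi)) for _ in range(restarts)),
        key=lambda pair: pair[1],
    )

# Non-convex toy objective with two basins of different depth.
f = lambda x: x**4 - 2 * x**2 + 0.5 * x
grad = lambda x: 4 * x**3 - 4 * x + 0.5
x_best, f_best = random_restarts(grad, f, -2.0, 2.0)
```

A single run started in the shallow right basin would stop at the worse local minimum; restarting from many points makes finding the deeper basin very likely, though still not guaranteed.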
10. Summary and Key Takeaways
- Machine learning is optimization: combinatorial (discrete search) or smooth (gradient-based).
- Convex problems guarantee global optima.
- Greedy search powers decision trees.
- Gradient descent is foundational, with caveats in non-convex cases (e.g., neural networks).
This lecture bridges theoretical optimization to practical ML algorithms.