This blog post provides a detailed explanation of optimization techniques in machine learning. It is intended for college students and explains concepts step-by-step with mathematical notation, worked examples, and illustrative diagrams.
1. Why Optimization Matters in Machine Learning
In machine learning, training a model means finding the model that best explains a dataset. This is fundamentally an optimization problem: searching a space of candidate models to optimize a scoring function, such as maximizing log-likelihood or minimizing error (e.g., 0/1 loss).
Optimization techniques allow us to efficiently navigate this search space, whether discrete or continuous.
2. Fundamentals of Optimization in Model Learning
2.1 The Model Space
The model space M consists of all possible models {M₁, M₂, …, Mₖ}. Models differ in:
- Parameters: Values within a fixed structure (e.g., weights w in logistic regression).
- Structure: Architecture (e.g., different tree structures in decision trees).
The goal is to find parameters and/or structure that optimize the scoring function on training data.
2.2 Two Categories of Optimization Problems
| Category | Combinatorial Optimization | Smooth Optimization |
|---|---|---|
| Model Space | Finite or countably infinite (discrete) | Uncountable (continuous) |
| Scoring Function | Discrete | Continuous and differentiable |
| Examples | Decision tree structure search | Naive Bayes parameter estimation |
| Approach | Heuristic search (e.g., greedy) | Gradient-based methods |
Analogy: Combinatorial is like “picking the best from a list”; smooth is like “tuning a dial to the perfect position.”
3. Combinatorial Optimization
Combinatorial optimization deals with discrete spaces, often too large for exhaustive enumeration due to combinatorial explosion.
3.1 State Space Representation
We model the problem as a state space graph:
- State (s): A candidate model.
- Actions(s): Operations transforming one state to another (e.g., adding a split in a decision tree).
- Result(s, a): New state after action a.
- Score(s): Quality of the model (e.g., accuracy).
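The four ingredients above can be sketched as a minimal Python interface. The names (`State`, `actions`, `result`, `score`) mirror the definitions in the text; the toy split actions and quality values are purely illustrative, not a fixed API:

```python
from dataclasses import dataclass
from typing import Iterable

@dataclass(frozen=True)
class State:
    """A candidate model, here just an opaque label plus a quality value."""
    name: str
    quality: float

def actions(s: State) -> Iterable[str]:
    """Operations applicable in state s (e.g., feature splits)."""
    return [f"split-{i}" for i in range(2)]

def result(s: State, a: str) -> State:
    """New state reached by applying action a (toy successor rule)."""
    return State(f"{s.name}/{a}", s.quality + 0.1)

def score(s: State) -> float:
    """Quality of the model represented by s (e.g., accuracy)."""
    return s.quality
```

Any concrete search algorithm (exhaustive, greedy, beam) can then be written against this interface without knowing what the states actually are.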

[Figure: illustration of a state-space search tree.]
3.2 Decision Tree Example
For binary classification with features like Age, Income, Job:
- Initial state: Single leaf with majority class.
- Actions: Split on a feature.
- Successors: Trees with one more split.
The space grows exponentially with depth.
3.3 Exhaustive vs. Heuristic Search
Exhaustive Search (DFS/BFS): Guarantees global optimum but intractable for large spaces.
Greedy Search (Heuristic):
- Start with initial model (e.g., single leaf).
- Generate successors.
- Choose the best-scoring successor.
- Repeat until no improvement.
Greedy never backtracks, making it efficient but potentially suboptimal.
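The loop above can be sketched in Python; the integer states, ±1 successors, and quadratic score are illustrative stand-ins for trees, splits, and accuracy:

```python
def greedy_search(initial, successors, score):
    """Greedy hill-climbing: repeatedly move to the best-scoring
    successor; stop as soon as no successor improves on the current state."""
    current = initial
    while True:
        candidates = successors(current)
        if not candidates:
            return current
        best = max(candidates, key=score)
        if score(best) <= score(current):
            return current  # no improvement: greedy never backtracks
        current = best

# Toy example: states are integers, successors step by ±1,
# and the score has a single peak at 5.
best_state = greedy_search(
    0,
    successors=lambda s: [s - 1, s + 1],
    score=lambda s: -(s - 5) ** 2,
)
```

On this toy landscape greedy climbs 0 → 1 → … → 5 and stops; with multiple peaks it could just as easily stall on a local one.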

[Figure: example of greedy decision tree building.]
Standard decision tree algorithms (e.g., ID3, C4.5) use greedy search with information gain or Gini impurity.
3.4 Other Heuristics
- Beam Search: Keeps top-k promising paths (vs. greedy’s single path).
- Simulated Annealing: Allows occasional worse moves to escape local optima.
- Random Restarts: Run greedy multiple times and pick the best.
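Beam search can be sketched against the same abstraction; keeping the best state seen overall (rather than only the final beam) is one common variant, and `k`, `steps`, and the toy problem below are illustrative choices:

```python
def beam_search(initial, successors, score, k=3, steps=10):
    """Keep the top-k states at each step (vs. greedy's single path),
    returning the best state seen overall."""
    beam = [initial]
    best = initial
    for _ in range(steps):
        candidates = [c for s in beam for c in successors(s)]
        if not candidates:
            break
        beam = sorted(candidates, key=score, reverse=True)[:k]
        best = max([best] + beam, key=score)
    return best

# Same toy landscape as before: integer states, score peaking at 5.
found = beam_search(0, lambda s: [s - 1, s + 1],
                    lambda s: -(s - 5) ** 2, k=2, steps=10)
```

With k = 1 this degenerates to greedy search; larger k trades computation for a wider exploration of the space.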

[Figure: comparison of beam search and greedy search.]
4. Smooth Optimization & Convex Functions
Smooth optimization applies to continuous, differentiable objectives.
4.1 Convex Sets and Functions
Convex Set: For any x, y in C and θ ∈ [0,1], θx + (1-θ)y ∈ C.

Convex Function: f(θx + (1-θ)y) ≤ θf(x) + (1-θ)f(y).
Second-derivative test (for twice-differentiable f): f''(x) ≥ 0 everywhere.

Key property: Any local minimum is global.
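The defining inequality can be spot-checked numerically. This is a sketch for intuition only: random sampling can refute convexity by finding a violation, but can never prove it; the function names and bounds are illustrative:

```python
import random

def is_convex_on_samples(f, lo, hi, trials=1000, tol=1e-9):
    """Spot-check f(tx + (1-t)y) <= t f(x) + (1-t) f(y)
    at random points x, y in [lo, hi] and random t in [0, 1]."""
    for _ in range(trials):
        x, y = random.uniform(lo, hi), random.uniform(lo, hi)
        t = random.random()
        if f(t * x + (1 - t) * y) > t * f(x) + (1 - t) * f(y) + tol:
            return False  # found a violation: f is not convex
    return True

# x^2 passes the check; -x^2 (concave) fails it almost immediately.
```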
5. Unconstrained Convex Optimization
Minimize f(x) where f is convex.
At optimum x*, ∇f(x*) = 0.
Example: f(x) = 2x² + 3x + 1
f'(x) = 4x + 3 = 0 → x* = -3/4
f''(x) = 4 > 0 (minimum).
6. Constrained Convex Optimization & Lagrange Multipliers
Minimize f₀(x) subject to inequality constraints f_i(x) ≤ 0 and equality constraints h_i(x) = 0.
Lagrangian: ℒ(x, λ, ν) = f₀(x) + ∑ λ_i f_i(x) + ∑ ν_i h_i(x), with λ_i ≥ 0.

λ represents the sensitivity of the objective to constraint relaxation.
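A minimal equality-constrained example (chosen here for illustration, not taken from the lecture) shows the mechanics: minimize x² + y² subject to x + y = 1.

```latex
\mathcal{L}(x, y, \nu) = x^2 + y^2 + \nu\,(x + y - 1)
% Stationarity:
\frac{\partial \mathcal{L}}{\partial x} = 2x + \nu = 0, \qquad
\frac{\partial \mathcal{L}}{\partial y} = 2y + \nu = 0
% hence x = y; the constraint x + y = 1 then gives
x^* = y^* = \tfrac{1}{2}, \qquad \nu^* = -1
```

Setting the gradient of ℒ to zero in both x and the multiplier recovers the constrained optimum directly.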
7. Naive Bayes Learning with Lagrange Multipliers
Maximize the log-likelihood ℓ(θ|D) subject to the normalization constraints (each probability distribution sums to 1).
Using Lagrange multipliers yields the empirical frequencies:
p_l = N_l / N (class priors)
q_{ljk} = N_{ljk} / N_l (feature conditionals)
This provides a rigorous justification for the counting-based estimates.
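The multiplier calculation for the class prior can be sketched as follows (the conditional q_{ljk} case is analogous); here N_l is the count of class l and N = ∑_l N_l:

```latex
\mathcal{L}(p, \nu) = \sum_{l} N_l \log p_l + \nu \Bigl(1 - \sum_{l} p_l\Bigr)
% Stationarity in each p_l:
\frac{\partial \mathcal{L}}{\partial p_l} = \frac{N_l}{p_l} - \nu = 0
\;\Rightarrow\; p_l = \frac{N_l}{\nu}
% Enforcing \sum_l p_l = 1 gives \nu = \sum_l N_l = N, hence
p_l = \frac{N_l}{N}
```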
8. Gradient Descent Algorithm
Iterative method: w_{t+1} = w_t − η ∇f(w_t)
η: learning rate.

Converges to the global minimum for convex functions, provided the learning rate η is suitably chosen (too large diverges, too small is slow).
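A minimal sketch of the update rule, reusing the convex example f(x) = 2x² + 3x + 1 from Section 5 (so ∇f(w) = 4w + 3, and the iterates should approach x* = −3/4):

```python
def gradient_descent(grad, w0, lr=0.1, steps=200):
    """Iterate w_{t+1} = w_t - lr * grad(w_t) for a fixed number of steps."""
    w = w0
    for _ in range(steps):
        w = w - lr * grad(w)
    return w

# Minimize f(x) = 2x^2 + 3x + 1 via its gradient f'(x) = 4x + 3.
x_star = gradient_descent(lambda w: 4 * w + 3, w0=0.0)
print(x_star)  # ≈ -0.75
```

Each update is a contraction here (w ↦ 0.6w − 0.3), so the iterates converge geometrically to the fixed point −0.75; with η above 0.5 the same iteration would diverge.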
9. Non-Convex Optimization
Multiple local minima; solution depends on initialization.

Strategies: Multiple random restarts.
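The random-restart strategy can be sketched as follows; the two-basin toy objective f(x) = x⁴ − 2x² + 0.5x is an assumption for illustration (its deeper minimum lies near x ≈ −1.05):

```python
import random

def local_minimize(grad, f, x0, lr=0.01, steps=2000):
    """Plain gradient descent from a single starting point."""
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x, f(x)

def random_restarts(grad, f, lo, hi, restarts=20, seed=0):
    """Run gradient descent from several random initializations in
    [lo, hi] and keep the lowest minimum found."""
    rng = random.Random(seed)
    return min(
        (local_minimize(grad, f, rng.uniform(lo, hi)) for _ in range(restarts)),
        key=lambda pair: pair[1],
    )

# Non-convex toy objective with two basins of different depth.
f = lambda x: x**4 - 2 * x**2 + 0.5 * x
grad = lambda x: 4 * x**3 - 4 * x + 0.5
x_best, f_best = random_restarts(grad, f, -2.0, 2.0)
```

A single run started in the shallow right basin would stop at the worse local minimum; restarting from many points makes finding the deeper basin very likely, though still not guaranteed.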
10. Summary and Key Takeaways
- Machine learning is optimization: combinatorial (discrete search) or smooth (gradient-based).
- Convex problems guarantee global optima.
- Greedy search powers decision trees.
- Gradient descent is foundational, with caveats in non-convex cases (e.g., neural networks).
This lecture bridges theoretical optimization to practical ML algorithms.