This post explains predictive modeling as a data-mining workflow composed of four essential elements:
(i) task specification, (ii) knowledge representation, (iii) learning (scoring + search), and (iv) prediction/evaluation.
- 1. Introduction
- 2. The Four Components of a Predictive Modeling Algorithm
- 3. Task Specification
- 4. Knowledge Representation (Model Families)
- 5. Learning: Model Space, Scoring Functions, and Search
- 6. Prediction and Evaluation
- 7. Overfitting
- 8. Dealing with Overfitting
- 9. Worked Examples (End-to-End)
1. Introduction
Predictive modeling is a core topic in data mining because it formalizes the problem of learning a mapping from observed attributes
to an output (class label or numeric value), and then using that learned mapping to make accurate predictions on new, unseen instances.
The lecture frames predictive modeling as learning patterns from existing data to generalize beyond the training sample.
Central abstraction. There exists an unknown “true” function (or rule) mapping inputs to outputs:
$$ y = f(x) $$
We observe data and construct an approximation $\hat{f}(x;\theta)$ parameterized by $\theta$ (weights, tree structure, probabilities, etc.).
In practice, “the function” is not a single deterministic mapping when noise is present. A more realistic view is
$y = f(x) + \varepsilon$ (regression) or $P(Y\mid X=x)$ (classification), but the lecture’s $y=f(x)$ is a useful starting point for the learning objective.
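The additive-noise view can be made concrete with a tiny simulation. The following sketch is illustrative only: the true function `f`, the noise scale, and the sample size are all assumptions chosen for the demo, and a least-squares fit stands in for the generic learner $\hat{f}(x;\theta)$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "true" function f (unknown in practice)
def f(x):
    return 2.0 * x + 1.0

# Observations follow y = f(x) + eps: the signal plus irreducible noise
n = 200
x = rng.uniform(-1.0, 1.0, size=n)
eps = rng.normal(scale=0.1, size=n)
y = f(x) + eps

# A learner can only approximate f from the noisy sample;
# here, ordinary least squares recovers the slope and intercept.
slope, intercept = np.polyfit(x, y, deg=1)
```

Because of the noise term, even the best learner recovers `f` only approximately; the estimates get closer to the true parameters as $n$ grows.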
2. The Four Components of a Predictive Modeling Algorithm
The lecture states that any data-mining algorithm for predictive modeling can be decomposed into four essential components: (1) task specification,
(2) knowledge representation, (3) learning technique, and (4) prediction/evaluation (and interpretation).
In workflow terms: specify the task, choose a model representation, learn parameters/structure via scoring + search, then evaluate on unseen data.
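The "learning = scoring + search" idea can be sketched with the simplest possible model family: a one-parameter threshold classifier. Everything below is illustrative (the data, the threshold 0.6, and grid search as the search procedure are assumptions, not the lecture's example); the point is that learning reduces to scoring each candidate model and searching for the best one.

```python
import numpy as np

# Toy data: the hypothetical true rule is "class 1 iff x >= 0.6"
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, size=100)
y = (x >= 0.6).astype(int)

# Representation: a one-parameter family, predict 1 iff x >= theta
# Scoring function: training accuracy of the candidate theta
def score(theta):
    return np.mean((x >= theta).astype(int) == y)

# Search: exhaustively evaluate a grid of candidates, keep the best
candidates = np.linspace(0.0, 1.0, 101)
best_theta = max(candidates, key=score)
```

Richer model families (trees, linear models, networks) replace the grid with structured search (greedy splitting, gradient descent), but the decomposition into representation, scoring, and search is the same.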
3. Task Specification
3.1 Data representation
The lecture uses a tabular $n \times p$ matrix: $n$ data points and $p$ dimensions. Each instance is represented as a pair
$\langle y^{(i)}, x^{(i)} \rangle$, where $x^{(i)}$ is the attribute vector and $y^{(i)}$ is the class label.
Notation.
$$
\mathcal{D} = \{(x^{(i)}, y^{(i)})\}_{i=1}^{n}, \qquad x^{(i)} \in \mathbb{R}^{p-1}, \quad y^{(i)} \in \mathcal{Y}
$$
For classification, $\mathcal{Y}$ is a finite set of categories (e.g., $\{0,1\}$). For regression, $y^{(i)} \in \mathbb{R}$.
The objective is to approximate the unknown function $f$ by selecting a hypothesis $\hat{f}(x;\theta)$ from a model family and learning $\theta$
from the available sample.
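The notation maps directly onto array code. A minimal sketch, with toy values standing in for a real table: the $n \times p$ matrix holds $p-1$ attribute columns plus one label column, so each row is the pair $\langle x^{(i)}, y^{(i)} \rangle$.

```python
import numpy as np

# An n x p table (toy values); the last column is the label y
table = np.array([
    [5.1, 3.5, 0],
    [4.9, 3.0, 0],
    [6.2, 3.4, 1],
    [5.9, 3.0, 1],
])

X = table[:, :-1]   # attribute vectors x^(i), shape (n, p-1)
y = table[:, -1]    # labels y^(i), shape (n,)

n, p = table.shape
```

This is why the attribute vectors live in $\mathbb{R}^{p-1}$: one of the $p$ columns is consumed by the label.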
3.2 Classification vs. regression
The lecture distinguishes classification (categorical $y$) from regression (real-valued $y$).
| Task | Output space | Typical examples | Common metrics |
|---|---|---|---|
| Classification | $y \in \{1,\dots,K\}$ or $\{0,1\}$ | Spam detection, disease status, loan approve/reject | Accuracy, F1, ROC-AUC, log-loss |
| Regression | $y \in \mathbb{R}$ | House prices, temperature, demand forecasting | MSE, MAE, $R^2$ |
The lecture emphasizes classification as the primary focus of the course.
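The metric columns in the table above are all one-liners. A quick sketch with made-up predictions, computing accuracy for the classification case and MSE/MAE for the regression case directly from their definitions:

```python
import numpy as np

# Classification: fraction of predictions matching the true labels
y_true_cls = np.array([1, 0, 1, 1])
y_pred_cls = np.array([1, 0, 0, 1])
accuracy = np.mean(y_true_cls == y_pred_cls)        # 3 of 4 correct

# Regression: average squared / absolute error of the predictions
y_true_reg = np.array([3.0, 5.0, 2.0])
y_pred_reg = np.array([2.5, 5.0, 3.0])
mse = np.mean((y_true_reg - y_pred_reg) ** 2)
mae = np.mean(np.abs(y_true_reg - y_pred_reg))
```

MSE penalizes large errors quadratically, while MAE weighs all errors linearly; which one matters depends on the task specification.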
3.3 Output types for classification: labels, rankings, probabilities
The lecture describes three progressively richer output formats: (1) class labels, (2) rankings, and (3) probabilities $p(y\mid x)$.
These differ in what the model must compute and what decisions can be made downstream.
- Labels: $\hat{y} \in \mathcal{Y}$
- Rankings: ordering by score $s(x)$
- Probabilities: $\hat{p}(y\mid x)$
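One toy model can produce all three output formats. The logistic score below is an illustrative assumption (the weights are made up); the point is that the probability output is the richest, the ranking only needs a monotone score, and the hard label throws the most information away.

```python
import math

def prob(x):
    """Richest output: an estimate of P(y=1 | x) via a logistic score."""
    return 1.0 / (1.0 + math.exp(-(2.0 * x - 1.0)))

def score(x):
    """Ranking output: any score monotone in prob(x) orders instances identically."""
    return 2.0 * x - 1.0

def label(x, threshold=0.5):
    """Poorest output: collapse the probability to a hard 0/1 label."""
    return 1 if prob(x) >= threshold else 0

# Ranking three instances from most to least likely positive
xs = [0.1, 0.9, 0.5]
ranking = sorted(xs, key=score, reverse=True)
```

Note the one-way street: probabilities yield rankings and labels for free, but a label-only model cannot be turned back into calibrated probabilities.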
Decision-making difference. When probabilities are available, decisions can incorporate costs:
$$
\hat{y}(x)=\arg\min_{c\in\mathcal{Y}} \sum_{y\in\mathcal{Y}} \mathrm{Cost}(c,y)\, \hat{p}(y\mid x)
$$
This is a standard formulation of Bayes risk minimization; it explains why probability outputs are operationally more flexible than labels.
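The cost-sensitive rule is short enough to implement directly. A minimal sketch for the binary case, with a hypothetical asymmetric cost matrix (the values 10.0 and 1.0 are illustrative assumptions): `cost[(c, y)]` is the cost of predicting class `c` when the truth is `y`, and the decision minimizes the expected cost under $\hat{p}(y\mid x)$.

```python
# Hypothetical cost matrix: missing a positive (predict 0, truth 1)
# is ten times as costly as a false alarm (predict 1, truth 0)
cost = {
    (0, 0): 0.0, (0, 1): 10.0,
    (1, 0): 1.0, (1, 1): 0.0,
}

def bayes_decision(p1):
    """Pick the class minimizing expected cost, given P(y=1|x) = p1."""
    p = {0: 1.0 - p1, 1: p1}
    risks = {c: sum(cost[(c, y)] * p[y] for y in (0, 1)) for c in (0, 1)}
    return min(risks, key=risks.get)
```

With these costs the decision threshold moves well below 0.5: at $\hat{p}(1\mid x)=0.2$ the rule already predicts class 1, whereas a plain label output thresholded at 0.5 would predict 0. This is exactly the operational flexibility that label-only outputs lack.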