Foundations of Data Mining

The Data Mining Process Overview

The full data mining process includes several stages:

  • Data Selection: Choosing relevant data sources.
  • Data Preprocessing: Cleaning, transforming, and preparing data (handling missing values, outliers, etc.).
  • Data Mining: Applying algorithms to extract patterns/models (the focus of most courses).
  • Interpretation/Evaluation: Analyzing results and validating them.

While the full process is important, this lecture emphasizes the Data Mining step. A common standard model is CRISP-DM (Cross-Industry Standard Process for Data Mining), which visualizes the iterative nature:

CRISP-DM Process Diagram: six iterative phases (Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment)

Four Key Elements of Data Mining Algorithms

Every data mining algorithm revolves around four interconnected elements:

  1. Task Specification: What are we trying to achieve?
  2. Knowledge Representation: How do we represent the discovered knowledge?
  3. Learning Technique: How do we search for and score the best model?
  4. Prediction and/or Interpretation: How do we use and understand the results?

1. Task Specification

This defines the goal of the analysis. There are four classical types of data mining tasks:

A. Exploratory Data Analysis (EDA)

Goal: Explore data without a specific hypothesis to summarize characteristics. Mainly uses visualization.

Example: Restaurant Tip Analysis

  • Basic histogram of tips: Shows mode around $2, right-skewed distribution (few high tips).
  • Fine-grained histogram: Spikes at whole/half dollars (people tip in 50-cent increments).
  • Scatter plot (bill vs. tip): Linear relationship, ~16% average tip rate.
  • Segmented scatter plots: tips from parties with a female bill-payer show less variance; parties with smokers show more.

Example of Right-Skewed Histogram
Scatter Plot of Bill vs. Tip Amount

B. Predictive Modeling (Supervised Learning)

Goal: Build a model to predict a target variable from features.

  • Classification: Discrete target (e.g., spam/not spam).
  • Regression: Continuous target (e.g., house price).

Examples:

  • Zestimate: Predict house price from features.
  • Loan default: Use income, criminal record, employment.

Decision Tree for Loan Approval
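A decision tree of the kind shown can be written directly as nested if-then logic. The thresholds and feature names below are invented for illustration, not the tree from the lecture:

```python
def approve_loan(income, criminal_record, employed):
    """Hypothetical loan-approval tree: each if-branch is one split."""
    if criminal_record:
        return "no"
    if income > 70_000:
        return "yes"
    return "yes" if employed else "no"

print(approve_loan(income=80_000, criminal_record=False, employed=True))
print(approve_loan(income=40_000, criminal_record=False, employed=False))
```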

C. Descriptive Modeling (Unsupervised Learning)

Goal: Summarize data structure without a target (e.g., clustering, density estimation).

Example: Video scene clustering using RGB values → Groups like “foosball table” (orange) vs. “bookshelf” (green).

Clustering Example
Distribution-Based Clustering
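The scene-clustering idea can be sketched with a minimal k-means loop on RGB triples. The pixel values below are invented to mimic the orange/green example; a real implementation would use a library routine:

```python
def kmeans(points, k=2, iters=20):
    """Minimal k-means sketch: assign each point to its nearest centroid,
    then move each centroid to the mean of its assigned points."""
    # Deterministic init for the sketch: first and last points as seeds.
    centroids = [points[0], points[-1]]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[c])))
            clusters[i].append(p)
        centroids = [tuple(sum(v) / len(cl) for v in zip(*cl)) if cl
                     else centroids[i]
                     for i, cl in enumerate(clusters)]
    return centroids, clusters

# Toy RGB pixels: orange-ish (foosball table) vs. green-ish (bookshelf).
orange = [(230, 120, 30), (240, 130, 40), (220, 110, 25)]
green  = [(40, 160, 50), (50, 170, 60), (35, 150, 45)]
centroids, clusters = kmeans(orange + green)
print(centroids)
```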

D. Pattern Discovery

Goal: Find local patterns in subsets (e.g., association rules).

Key difference: Local (subsets) vs. global models.

Example: Beer → Diapers (the rule holds in a subset of transactions, not across the whole dataset).

Market Basket Analysis Diagram
Beer and Diapers Association Rule
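Association rules like Beer → Diapers are scored by support and confidence, which can be computed directly. The transaction baskets below are made up for illustration:

```python
# Invented market-basket transactions.
transactions = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"milk", "bread"},
    {"beer", "chips"},
    {"diapers", "milk"},
]

def support(itemset):
    """Fraction of transactions containing every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """P(rhs in basket | lhs in basket)."""
    return support(lhs | rhs) / support(lhs)

print(support({"beer", "diapers"}))
print(confidence({"beer"}, {"diapers"}))
```

Note that the rule is local: its confidence is computed only over baskets containing the left-hand side, which is exactly the "subsets vs. global model" distinction above.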

Task classification examples:

  • Sales forecast: Predictive
  • Customer segmentation: Descriptive
  • Pregnant customer prediction: Predictive
  • Beer & Diapers: Pattern Discovery

2. Knowledge Representation

Defines the format of models/patterns (the “hypothesis space”).

  • Predictive: If-then rules, decision trees, linear/logistic regression.
  • Example Rule: If income > $70k AND no criminal record → loan = yes
  • Logistic Regression: log(P(Y=1|x)/(1-P(Y=1|x))) = β₀ + β·x
  • Linear Regression: y = β₀ + β₁x₁ + ...
  • Descriptive: Mixture models: f(x) = Σ w_k f_k(x; θ_k)

Gaussian Mixture Model Diagram

  • Pattern: Association rules: X → Y
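The logistic regression representation above maps log-odds to a probability; a sketch of evaluating it, with coefficients invented for illustration:

```python
import math

def predict_proba(x, b0, b):
    """P(Y=1|x) from the log-odds model: log(p/(1-p)) = b0 + b . x."""
    z = b0 + sum(bi * xi for bi, xi in zip(b, x))
    return 1.0 / (1.0 + math.exp(-z))

# With these made-up coefficients, z = -1 + 0.5*1 + 0.25*2 = 0,
# so the predicted probability is exactly 0.5.
p = predict_proba(x=[1.0, 2.0], b0=-1.0, b=[0.5, 0.25])
print(round(p, 3))
```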

3. Learning Technique

Combines model space, scoring, and search.

  • Model Space: Parameters (e.g., thresholds) and structure (e.g., which features).
  • Scoring: Error rate (classification), squared error (regression), likelihood (descriptive).
  • Search: Optimization (parameters), heuristics (structure).

Example: 1D classification with threshold.

Rule: If x > t then + else -

Search all t, pick lowest error.

1D Classification Boundary Example

Warning: The learned threshold depends on the sample data; small or biased samples can lead to poor thresholds.
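The three learning-technique components show up explicitly in this example: the model space is the threshold t, the score is error rate, and the search is exhaustive over candidate thresholds. A sketch on an invented 1D sample:

```python
# Invented labeled sample: (x value, class label).
data = [(1.0, "-"), (1.5, "-"), (2.0, "-"),
        (2.5, "+"), (3.0, "+"), (3.5, "+")]

def error_rate(t):
    """Score: fraction misclassified by 'if x > t then + else -'."""
    wrong = sum(("+" if x > t else "-") != y for x, y in data)
    return wrong / len(data)

# Search: try a threshold at each observed point, keep the best score.
candidates = [x for x, _ in data]
best_t = min(candidates, key=error_rate)
print(best_t, error_rate(best_t))
```

On a cleanly separable sample like this one the best threshold reaches zero error; with overlapping classes the same search would simply return the threshold with the fewest mistakes.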

4. Prediction and Interpretation

Prediction: Apply model to new data.

Interpretation: Assess the statistical significance, novelty, and interestingness of the results.

Full Example: Spam Detection

  • Task: Classification
  • Features: Word frequencies
  • Model: If %george < 0.6 AND %you > 1.5 → spam
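The spam rule above translates directly into code. The word-frequency values passed in below are invented for illustration:

```python
def is_spam(pct_george, pct_you):
    """The rule from the example: spam if %george < 0.6 AND %you > 1.5."""
    return pct_george < 0.6 and pct_you > 1.5

print(is_spam(pct_george=0.0, pct_you=2.3))
print(is_spam(pct_george=1.2, pct_you=2.3))
```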