Machine Learning Algorithms Using Scikit-Learn
Reference: Documentation for all the functions used can be found at https://scikit-learn.org/stable/
This notebook introduces the scikit-learn library, which implements many widely used machine learning
algorithms. This notebook contains the following section:
(1) Regression
Each section has its own data preparation step. For regression, we use the Boston housing prices dataset.
Sections
1. Regression Data Preparation
In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import random
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
# Import only the names this notebook uses instead of wildcard imports.
from sklearn.metrics import r2_score
from sklearn.datasets import load_boston
# Set the random seeds for reproducibility.
random.seed(0)
np.random.seed(0)
In [2]:
# Load the Boston housing prices dataset.
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older release.
boston = load_boston()
In [3]:
# 'boston' is a bunch object (similar to a dictionary). The important attributes are 'data' and 'target'.
# By convention, X always denotes the features, and y always denotes the label / target.
X, y = boston['data'], boston['target']
In [4]:
# It is good practice to check the shapes of the features and the target(s).
print(X.shape, y.shape)
(506, 13) (506,)
In [5]:
# Split the data into train (80%) and test (20%) sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
print("Train Set : ", X_train.shape, y_train.shape)
print("Test Set : ", X_test.shape, y_test.shape)
Train Set :  (404, 13) (404,)
Test Set :  (102, 13) (102,)
In [6]:
# It is generally good practice to standardize the features (zero mean, unit variance per feature).
# Create a standard scaler object.
scaler = StandardScaler()
# Fit the scaler on the training data only, then apply the same transformation to the test data.
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
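The scaling step above can be checked by hand. The following is a minimal sketch on a small made-up array (the values and the names `X_train_demo`, `X_test_demo` and `scaler_demo` are illustrative assumptions, not part of the Boston data). It shows that the scaler learns a per-feature mean and standard deviation from the training set and reuses those same statistics on the test set.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Small made-up feature matrices (illustrative values only).
X_train_demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])
X_test_demo = np.array([[2.0, 30.0]])

scaler_demo = StandardScaler()
# fit_transform learns mean_ and scale_ from the training data and applies them.
X_train_scaled = scaler_demo.fit_transform(X_train_demo)
# transform reuses the training statistics; nothing is learned from the test data.
X_test_scaled = scaler_demo.transform(X_test_demo)

# The same transformation written out explicitly.
manual_test = (X_test_demo - scaler_demo.mean_) / scaler_demo.scale_

print(X_train_scaled.mean(axis=0))  # approximately 0 for every column
print(X_train_scaled.std(axis=0))   # approximately 1 for every column
print(np.allclose(X_test_scaled, manual_test))
```

Fitting the scaler on the training set only keeps information about the test set from leaking into the preprocessing.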
2. Regression-Based Algorithms
2.1 Linear Regression
In [7]:
# Import the linear regression model.
from sklearn.linear_model import LinearRegression
In [8]:
# Create an object.
linReg = LinearRegression()
# Print all the parameters of the model.
print (linReg)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
In [9]:
# Fit the linear regressor to the training data.
linReg.fit(X_train, y_train)
Out[9]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
In [10]:
# Now that the model is trained, you can test it on the test data.
modelPreds = linReg.predict(X_test)
In [11]:
# This prints the coefficient of each feature in the data.
linReg.coef_
Out[11]:
array([-0.97082019, 1.05714873, 0.03831099, 0.59450642, -1.8551476 , 2.57321942, -0.08761547, -2.88094259, 2.11224542, -1.87533131, -2.29276735, 0.71817947, -3.59245482])
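To make the role of the coefficients concrete, here is a small sketch on synthetic data (the data and the names `X_demo`, `y_demo` and `model` are assumptions made for illustration). It shows that a fitted linear regressor's prediction is the dot product of the features with `coef_`, plus `intercept_`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y depends linearly on the first two of three features, plus noise.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(50, 3))
y_demo = 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1] + 0.5 + rng.normal(scale=0.1, size=50)

model = LinearRegression().fit(X_demo, y_demo)

# Reproduce predict() by hand: one coefficient per feature, plus the intercept.
manual_preds = X_demo @ model.coef_ + model.intercept_

print(model.coef_.shape)  # one coefficient per feature
print(np.allclose(manual_preds, model.predict(X_demo)))
```

Because the notebook standardizes the features before fitting, the magnitudes of its coefficients can be compared across features on a common scale.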
In [12]:
"""
To see how good the model is, we use a variety of metrics. The R2 score is a commonly used metric for regression
use cases. You'll come across it in module 2, and it'll be explained in class. Its maximum value is 1, and it can
be negative when the model does worse than always predicting the mean of the target. The closer the score is to 1,
the better the model is.
"""
print(r2_score(y_test, modelPreds))
0.5892223849182512
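As a rough illustration of what `r2_score` computes, the sketch below evaluates R2 by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. The arrays and the helper `manual_r2` are hypothetical and exist only for this example. It also shows that the score becomes negative when predictions are worse than always predicting the mean.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up targets and two sets of predictions (illustrative values only).
y_true = np.array([3.0, 5.0, 7.0, 9.0])
good_preds = np.array([2.5, 5.5, 7.0, 9.5])
bad_preds = np.array([9.0, 7.0, 5.0, 3.0])


def manual_r2(y_true, y_pred):
    """Hypothetical helper: R2 = 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)         # squared error of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # squared error of predicting the mean
    return 1.0 - ss_res / ss_tot


print(manual_r2(y_true, good_preds))  # close to 1: good fit
print(manual_r2(y_true, bad_preds))   # negative: worse than predicting the mean
print(np.isclose(manual_r2(y_true, good_preds), r2_score(y_true, good_preds)))
```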
2.2 Decision Tree Regression
In [13]:
# Import the model.
from sklearn.tree import DecisionTreeRegressor
In [14]:
# Create an object.
decTreeReg = DecisionTreeRegressor()
# Print all the parameters of the model.
print (decTreeReg)
DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, presort=False, random_state=None, splitter='best')
In [15]:
# Fit the model to the training data.
decTreeReg.fit(X_train, y_train)
Out[15]:
DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, presort=False, random_state=None, splitter='best')
In [16]:
# Now that the model is trained, you can test it on the test data.
modelPreds = decTreeReg.predict(X_test)
In [17]:
# Understand the performance of the model.
print (r2_score(y_test, modelPreds))
0.5863275009269561
2.3 Ridge Regression
In [18]:
# Import the model.
from sklearn.linear_model import Ridge
In [19]:
# Create an object.
ridgeReg = Ridge()
# Print all the parameters of the model.
print (ridgeReg)
Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None, normalize=False, random_state=None, solver='auto', tol=0.001)
In [20]:
# Fit the model to the training data.
ridgeReg.fit(X_train, y_train)
Out[20]:
Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None, normalize=False, random_state=None, solver='auto', tol=0.001)
In [21]:
# Now that the model is trained, you can test it on the test data.
modelPreds = ridgeReg.predict(X_test)
In [22]:
# Understand the performance of the model.
print (r2_score(y_test, modelPreds))
0.5881400471345534
2.4 Lasso Regression
In [23]:
# Import the model.
from sklearn.linear_model import Lasso
In [24]:
# Create an object.
lassoReg = Lasso()
# Print all the parameters of the model.
print (lassoReg)
Lasso(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=1000, normalize=False, positive=False, precompute=False, random_state=None, selection='cyclic', tol=0.0001, warm_start=False)
In [25]:
# Fit the model to the training data.
lassoReg.fit(X_train, y_train)
Out[25]:
Lasso(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=1000, normalize=False, positive=False, precompute=False, random_state=None, selection='cyclic', tol=0.0001, warm_start=False)
In [26]:
# Now that the model is trained, you can test it on the test data.
modelPreds = lassoReg.predict(X_test)
In [27]:
# Understand the performance of the model.
print (r2_score(y_test, modelPreds))
0.5069663003862215
2.5 Support Vector Regression
In [28]:
# Import the model.
from sklearn.svm import SVR
In [29]:
# Create an object.
svr = SVR()
# Print all the parameters of the model.
print (svr)
SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='auto_deprecated', kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)
In [30]:
# Fit the model to the training data.
svr.fit(X_train, y_train)
Out[30]:
SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='auto_deprecated', kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)
In [31]:
# Now that the model is trained, you can test it on the test data.
modelPreds = svr.predict(X_test)
In [32]:
# Understand the performance of the model.
print (r2_score(y_test, modelPreds))
0.4957469419124395
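All five models follow the same create, fit, predict, score pattern, so the workflow can be condensed into a loop. The sketch below is one way to do that. As an assumption made so that it runs on recent scikit-learn releases, it uses the bundled `load_diabetes` dataset rather than the Boston data, so its scores are not comparable to the ones above. The `random_state` values and variable names are also illustrative choices.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Assumption: a bundled regression dataset stands in for the Boston data.
X_d, y_d = load_diabetes(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_d, y_d, test_size=0.20, random_state=0)

# Standardize using statistics learned from the training split only.
scaler_d = StandardScaler()
X_tr = scaler_d.fit_transform(X_tr)
X_te = scaler_d.transform(X_te)

# The same five regressors as in the notebook, with default hyperparameters.
models = {
    "Linear": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=0),
    "Ridge": Ridge(),
    "Lasso": Lasso(),
    "SVR": SVR(),
}

scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)                               # train on the training split
    scores[name] = r2_score(y_te, model.predict(X_te))  # evaluate on the test split
    print(name, scores[name])
```

Collecting the scores in a dictionary makes it easy to compare models side by side or to add another regressor later.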