Machine Learning Algorithms using Scikit-Learn
Reference: documentation for all of the functions used below is available at https://scikit-learn.org/stable/
This notebook introduces the scikit-learn library, which implements many widely used machine learning algorithms. It contains the following section:
(1) Classification
For the classification component, we use the iris flowers dataset, which is bundled with scikit-learn.
Sections
1. Classification Data Preparation
In [150]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import random
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris
random.seed(0)
np.random.seed(0)
In [151]:
data = load_iris()
In [152]:
X, y = data['data'], data['target']
In [153]:
# Train/test split (75:25).
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.25, shuffle = True)
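Since no random_state is passed above, the exact split depends on the global NumPy seed set earlier. train_test_split also accepts a stratify argument that preserves the class proportions of y in both splits; a minimal sketch:

# Sketch only: a reproducible, class-balanced split. stratify and
# random_state are standard train_test_split parameters.
Xtr, Xte, ytr, yte = train_test_split(
    X, y, test_size = 0.25, shuffle = True, stratify = y, random_state = 0)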
In [154]:
# Scale the train and test data. (Mean - Var normalization)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
In [155]:
print ("Train Set : ", X_train.shape, y_train.shape)
print ("Test Set : ", X_test.shape, y_test.shape)
Train Set :  (112, 4) (112,)
Test Set :  (38, 4) (38,)
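As a quick sanity check, the standardized training features should have per-column mean close to 0 and standard deviation close to 1 (the test set only approximately so, since it reuses the training statistics); a short sketch:

# Per-feature mean and std of the standardized training data.
print (X_train.mean(axis=0).round(6))   # approximately [0, 0, 0, 0]
print (X_train.std(axis=0).round(6))    # approximately [1, 1, 1, 1]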
2. Classification Based Algorithms
2.1 Logistic Regression
In [156]:
# Import the logistic regression model.
from sklearn.linear_model import LogisticRegression
In [157]:
# Create an object. The solver and multi_class options are pinned to the
# current defaults ('liblinear' and 'ovr') to silence the FutureWarnings
# about defaults changing in scikit-learn 0.22, without changing behavior.
logReg = LogisticRegression(solver='liblinear', multi_class='ovr')
# Print all the parameters of the model.
print (logReg)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='ovr', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001,
                   verbose=0, warm_start=False)
In [158]:
# Fit the model to the training data.
logReg.fit(X_train, y_train)
Out[158]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='ovr', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001,
                   verbose=0, warm_start=False)
In [159]:
# Now that the model is trained, you can evaluate it on the test data.
modelPreds = logReg.predict(X_test)
In [160]:
# This prints the coefficient of each feature in the data.
# For a multiclass problem the shape is (n_classes, n_features);
# for binary classification the shape is (1, n_features).
logReg.coef_
Out[160]:
array([[-0.80694127,  1.30322683, -1.61610659, -1.43562319],
       [ 0.13105422, -1.22884593,  0.7674263 , -0.70455729],
       [ 0.15175132,  0.01109905,  1.54181019,  2.200317  ]])
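The rows of coef_ line up with the classes and the columns with the input features. A small sketch that labels them using the feature_names and target_names keys of the load_iris bunch:

# Label the coefficient matrix with class and feature names.
coefDf = pd.DataFrame(logReg.coef_,
                      columns=data['feature_names'],
                      index=data['target_names'])
print (coefDf)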
In [161]:
# Mean accuracy on the test data. In practice a perfect score is rarely achieved.
logReg.score(X_test, y_test)
Out[161]:
0.9473684210526315
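Accuracy alone does not show which classes get confused with which. A sketch using two standard helpers from sklearn.metrics:

from sklearn.metrics import confusion_matrix, classification_report

# Rows are true classes, columns are predicted classes.
print (confusion_matrix(y_test, modelPreds))
# Per-class precision, recall and F1.
print (classification_report(y_test, modelPreds,
                             target_names=data['target_names']))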
2.2 Naive Bayes
In [168]:
# Import the Gaussian Naive Bayes model.
from sklearn.naive_bayes import GaussianNB
In [169]:
# Create an object.
naiveBayes = GaussianNB()
# Print all the parameters of the model.
print (naiveBayes)
GaussianNB(priors=None, var_smoothing=1e-09)
In [170]:
# Fit the model to the training data.
naiveBayes.fit(X_train, y_train)
Out[170]:
GaussianNB(priors=None, var_smoothing=1e-09)
In [171]:
# Now that the model is trained, you can evaluate it on the test data.
modelPreds = naiveBayes.predict(X_test)
In [172]:
# Mean accuracy on the test data, computed manually;
# this is equivalent to accuracy_score(y_test, modelPreds).
np.sum((modelPreds == y_test)) / (y_test.shape[0])
Out[172]:
0.9473684210526315
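GaussianNB also exposes the class posterior probabilities behind its hard predictions via predict_proba; a short sketch:

# Posterior probability of each class for the first three test samples.
probs = naiveBayes.predict_proba(X_test[:3])
print (probs.round(3))
# predict() simply returns the class with the highest posterior.
print (probs.argmax(axis=1), naiveBayes.predict(X_test[:3]))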
2.3 Decision Tree Classifier
In [173]:
# Import the model.
from sklearn.tree import DecisionTreeClassifier
In [179]:
# Create an object.
# The depth of the tree is capped at 2, an arbitrary choice for illustration.
decTreeCla = DecisionTreeClassifier(max_depth = 2)
# Print all the parameters of the model.
print (decTreeCla)
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=2,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')
In [180]:
# Fit the model to the training data.
decTreeCla.fit(X_train, y_train)
Out[180]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=2,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')
In [181]:
# Now that the model is trained, you can evaluate it on the test data.
modelPreds = decTreeCla.predict(X_test)
In [182]:
# Mean accuracy on the test data, computed manually.
np.sum((modelPreds == y_test)) / (y_test.shape[0])
Out[182]:
0.9210526315789473
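With max_depth=2 the fitted tree is small enough to print as text. A sketch using sklearn.tree.export_text (available from scikit-learn 0.21); note the split thresholds are on the standardized feature values, since the tree was trained on scaled data:

from sklearn.tree import export_text

# Text rendering of the learned splits, labelled with the iris feature names.
print (export_text(decTreeCla, feature_names=data['feature_names']))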
2.4 Support Vector Classifier
In [183]:
# Import the model.
from sklearn.svm import SVC
In [184]:
# Create an object.
svc = SVC()
# Print all the parameters of the model.
print (svc)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
    kernel='rbf', max_iter=-1, probability=False, random_state=None,
    shrinking=True, tol=0.001, verbose=False)
In [185]:
# Fit the model to the training data.
svc.fit(X_train, y_train)
Out[185]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
    kernel='rbf', max_iter=-1, probability=False, random_state=None,
    shrinking=True, tol=0.001, verbose=False)
In [186]:
# Now that the model is trained, you can evaluate it on the test data.
modelPreds = svc.predict(X_test)
In [187]:
# Mean accuracy on the test data, computed manually.
np.sum((modelPreds == y_test)) / (y_test.shape[0])
Out[187]:
0.9736842105263158
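SVC defaults to the RBF kernel; other kernels can score quite differently on the same split. A minimal comparison sketch (kernel and gamma are standard SVC parameters; the exact scores depend on the random split):

# Compare a few common kernels on the same train/test split.
for kernel in ['linear', 'poly', 'rbf']:
    clf = SVC(kernel=kernel, gamma='auto')
    clf.fit(X_train, y_train)
    print (kernel, clf.score(X_test, y_test))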