Machine Learning Algorithms using Scikit-Learn
Reference: documentation for all of the functions used below is available at https://scikit-learn.org/stable/
This notebook introduces the scikit-learn library, which implements many widely used machine learning algorithms. It contains the following section:
(1) Classification
For the classification component, we use the iris flowers dataset, which is bundled with scikit-learn.
Sections
1. Classification Data Preparation
In [150]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import random
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris
random.seed(0)
np.random.seed(0)
In [151]:
data = load_iris()
In [152]:
X, y = data['data'], data['target']
In [153]:
# Train/test split (75:25).
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.25, shuffle = True)
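Since no random_state is passed above, the exact split depends on the global NumPy seed set earlier. train_test_split also accepts a stratify argument that preserves the class proportions of y in both splits; a minimal sketch:

# Sketch only: a reproducible, class-balanced split. stratify and
# random_state are standard train_test_split parameters.
Xtr, Xte, ytr, yte = train_test_split(
    X, y, test_size = 0.25, shuffle = True, stratify = y, random_state = 0)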
In [154]:
# Scale the train and test data. (Mean - Var normalization)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
In [155]:
print ("Train Set : ", X_train.shape, y_train.shape)
print ("Test Set : ", X_test.shape, y_test.shape)
Train Set :  (112, 4) (112,)
Test Set :  (38, 4) (38,)
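As a quick sanity check, the standardized training features should have per-column mean close to 0 and standard deviation close to 1 (the test set only approximately so, since it reuses the training statistics); a short sketch:

# Per-feature mean and std of the standardized training data.
print (X_train.mean(axis=0).round(6))   # approximately [0, 0, 0, 0]
print (X_train.std(axis=0).round(6))    # approximately [1, 1, 1, 1]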
2. Classification Based Algorithms
2.1 Logistic Regression
In [156]:
# Import the logistic regression model.
from sklearn.linear_model import LogisticRegression
In [157]:
# Create an object. The solver and multi_class options are pinned to the
# current defaults ('liblinear' and 'ovr') to silence the FutureWarnings
# about defaults changing in scikit-learn 0.22, without changing behavior.
logReg = LogisticRegression(solver='liblinear', multi_class='ovr')
# Print all the parameters of the model.
print (logReg)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='ovr', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001,
                   verbose=0, warm_start=False)
In [158]:
# Fit the model to the training data.
logReg.fit(X_train, y_train)
Out[158]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='ovr', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001,
                   verbose=0, warm_start=False)
In [159]:
# Now that the model is trained, you can evaluate it on the test data.
modelPreds = logReg.predict(X_test)
In [160]:
# This prints the coefficient of each feature in the data.
# For a multiclass problem the shape is (n_classes, n_features);
# for binary classification the shape is (1, n_features).
logReg.coef_
Out[160]:
array([[-0.80694127,  1.30322683, -1.61610659, -1.43562319],
       [ 0.13105422, -1.22884593,  0.7674263 , -0.70455729],
       [ 0.15175132,  0.01109905,  1.54181019,  2.200317  ]])
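The rows of coef_ line up with the classes and the columns with the input features. A small sketch that labels them using the feature_names and target_names keys of the load_iris bunch:

# Label the coefficient matrix with class and feature names.
coefDf = pd.DataFrame(logReg.coef_,
                      columns=data['feature_names'],
                      index=data['target_names'])
print (coefDf)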
In [161]:
# Mean accuracy on the test data. In practice a perfect score is rarely achieved.
logReg.score(X_test, y_test)
Out[161]:
0.9473684210526315
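Accuracy alone does not show which classes get confused with which. A sketch using two standard helpers from sklearn.metrics:

from sklearn.metrics import confusion_matrix, classification_report

# Rows are true classes, columns are predicted classes.
print (confusion_matrix(y_test, modelPreds))
# Per-class precision, recall and F1.
print (classification_report(y_test, modelPreds,
                             target_names=data['target_names']))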
2.2 Naive Bayes
In [168]:
# Import the Gaussian Naive Bayes model.
from sklearn.naive_bayes import GaussianNB
In [169]:
# Create an object.
naiveBayes = GaussianNB()
# Print all the parameters of the model.
print (naiveBayes)
GaussianNB(priors=None, var_smoothing=1e-09)
In [170]:
# Fit the model to the training data.
naiveBayes.fit(X_train, y_train)
Out[170]:
GaussianNB(priors=None, var_smoothing=1e-09)
In [171]:
# Now that the model is trained, you can evaluate it on the test data.
modelPreds = naiveBayes.predict(X_test)
In [172]:
# Mean accuracy on the test data, computed manually;
# this is equivalent to accuracy_score(y_test, modelPreds).
np.sum((modelPreds == y_test)) / (y_test.shape[0])
Out[172]:
0.9473684210526315
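GaussianNB also exposes the class posterior probabilities behind its hard predictions via predict_proba; a short sketch:

# Posterior probability of each class for the first three test samples.
probs = naiveBayes.predict_proba(X_test[:3])
print (probs.round(3))
# predict() simply returns the class with the highest posterior.
print (probs.argmax(axis=1), naiveBayes.predict(X_test[:3]))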
2.3 Decision Tree Classifier
In [173]:
# Import the model.
from sklearn.tree import DecisionTreeClassifier
In [179]:
# Create an object.
# The depth of the tree is capped at 2, an arbitrary choice for illustration.
decTreeCla = DecisionTreeClassifier(max_depth = 2)
# Print all the parameters of the model.
print (decTreeCla)
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=2,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')
In [180]:
# Fit the model to the training data.
decTreeCla.fit(X_train, y_train)
Out[180]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=2,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')
In [181]:
# Now that the model is trained, you can evaluate it on the test data.
modelPreds = decTreeCla.predict(X_test)
In [182]:
# Mean accuracy on the test data, computed manually.
np.sum((modelPreds == y_test)) / (y_test.shape[0])
Out[182]:
0.9210526315789473
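With max_depth=2 the fitted tree is small enough to print as text. A sketch using sklearn.tree.export_text (available from scikit-learn 0.21); note the split thresholds are on the standardized feature values, since the tree was trained on scaled data:

from sklearn.tree import export_text

# Text rendering of the learned splits, labelled with the iris feature names.
print (export_text(decTreeCla, feature_names=data['feature_names']))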
2.4 Support Vector Classifier
In [183]:
# Import the model.
from sklearn.svm import SVC
In [184]:
# Create an object.
svc = SVC()
# Print all the parameters of the model.
print (svc)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
    kernel='rbf', max_iter=-1, probability=False, random_state=None,
    shrinking=True, tol=0.001, verbose=False)
In [185]:
# Fit the model to the training data.
svc.fit(X_train, y_train)
Out[185]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
    kernel='rbf', max_iter=-1, probability=False, random_state=None,
    shrinking=True, tol=0.001, verbose=False)
In [186]:
# Now that the model is trained, you can evaluate it on the test data.
modelPreds = svc.predict(X_test)
In [187]:
# Mean accuracy on the test data, computed manually.
np.sum((modelPreds == y_test)) / (y_test.shape[0])
Out[187]:
0.9736842105263158
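SVC defaults to the RBF kernel; other kernels can score quite differently on the same split. A minimal comparison sketch (kernel and gamma are standard SVC parameters; the exact scores depend on the random split):

# Compare a few common kernels on the same train/test split.
for kernel in ['linear', 'poly', 'rbf']:
    clf = SVC(kernel=kernel, gamma='auto')
    clf.fit(X_train, y_train)
    print (kernel, clf.score(X_test, y_test))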