Machine Learning Overview

Machine Learning = building a model from example inputs to make data-driven predictions, versus following strictly static program instructions. Traditional programming contains explicit logic (if statements, loops, case statements, etc.) that the machine must follow to produce a result. Machine Learning does not encode that logic directly; instead it is based on data and a given algorithm, and with that algorithm it is able to perform data analysis and make predictions.

Two main types of ML; these address different types of problems:

  • Supervised
    • Divided into two main types:
      • Classification = hotdog or not hotdog
      • Regression = value prediction
    • Requires training data containing value being predicted. The data needs to be labeled.
    • Based on that data a model is created that can predict value in new data
    • Example
      • Home Prices Calculator
        • Data on number of rooms, bath, sqft, etc
        • Creates model where given data on a property, the home price can be calculated
      • Rekognition custom labels
  • Unsupervised
    • This is divided into these types:
      • Clustering = Identify clusters of similar data
      • Association = person who buys X also buys Y
      • Dimensionality Reduction = pre-processing the data to make the dataset more focused on its intended usage (similar to filtering)
    • Data does not contain cluster membership
    • Model provides access to data by cluster

Supervised models can be used for predictions; unsupervised models generally are not. Rather, unsupervised learning is used to find hidden patterns and get more transparency into the clusters within the data.
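
As a tiny illustrative sketch (not from the original notes), using scikit-learn, which is introduced later: the classifier needs labels y to learn from, while the clustering algorithm only gets the raw data and discovers groupings on its own. The data points below are made up.

from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans
import numpy as np

# made-up 2D data points
X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]])
y = np.array([0, 0, 1, 1])                    # labels exist -> supervised

clf = LogisticRegression().fit(X, y)          # learns from labeled examples
print(clf.predict([[1.2, 1.9]]))              # predicts a label for new data

km = KMeans(n_clusters=2, n_init=10).fit(X)   # no labels -> unsupervised
print(km.labels_)                             # cluster membership it discovered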

There is also a field of ‘semi-supervised’ learning where the dataset is partially labeled. For example, known parts of an image are labeled, but the image as a whole can still be analyzed as part of an unsupervised learning model.

 

Machine Learning Workflow

An orchestrated and repeatable pattern that systematically transforms and processes information to create prediction solutions. The workflow follows these steps:

  1. Ask the right question
  2. Prepare the data
  3. Select the algorithm
  4. Train the model
  5. Test the model

Some guidelines:

  • The early steps are most important
  • Expect to go backwards; later knowledge affects previous steps
  • Data is never as you need it
  • More data is better
  • Don't pursue a bad solution (re-evaluate, fix, or quit)

 

Data Science

The field of Data Science has many overlapping parts. Machine Learning sits at the intersection of Software Development and Math/Statistics. At the center, where software, math/statistics, and knowledge of the subject domain all overlap, is the “Unicorn”: someone who brings all three together.

 

Python

Generally Python 3.x is used for ML programming; Python 2.x has incompatibilities and is generally not used. Python is especially popular for ML because of its available libraries.

Some common libraries (a tiny usage sketch follows the list):

  • numpy – scientific computing
  • pandas – data frames
  • matplotlib – 2D plotting
  • scikit-learn
    • Algorithms
    • Pre-processing
    • Performance evaluation
    • more…
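
A minimal sketch (not from the original notes) showing what each of these libraries is typically used for; the column name and values are made up:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

values = np.random.rand(50)                      # numpy: fast numeric arrays
frame = pd.DataFrame({"value": values})          # pandas: tabular data frames
frame["value"].plot(title="matplotlib 2D plot")  # matplotlib: 2D plotting (via pandas)
plt.show()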

Jupyter Notebook (formerly IPython Notebook) is the common IDE used with Python. Project Jupyter is a non-profit, open-source project, born out of the IPython Project in 2014 as it evolved to support interactive data science and scientific computing across all programming languages. Jupyter will always be 100% open-source software, free for all to use and released under the liberal terms of the modified BSD license.

Jupyter is developed in the open on GitHub, through the consensus of the Jupyter community. For more information on our governance approach, please see our Governance Document.

All online and in-person interactions and communications directly related to the project are covered by the Jupyter Code of Conduct. This Code of Conduct sets expectations to enable a diverse community of users and contributors to participate in the project with respect and safety.

https://jupyter.org/about

Example

Below is an example of using ML to develop a solution. It follows the machine learning workflow.

Ask the Right Question

Need to define the end goal, the starting point, and how to get from the start to the goal. Define:

  • scope
  • target performance
  • context
  • how solution will be created

Example – we need a solution that can predict if a person will develop diabetes. Scope and Data Sources:

  • understand features in data
  • identify critical features
  • focus on the risk population
  • select data source
  • Use data from Pima Indian Diabetes study

Based on the above scope/data, our solution target gets rephrased to “Using Pima Indian Diabetes data, predict with 70% accuracy which people will develop diabetes”

 

Preparing the Data

Set up the Jupyter notebook and use the pandas library. Note that in SageMaker the data would normally come from S3, but for this example the dataset is small enough to read from local files.

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# do plotting inline instead of in a separate window
%matplotlib inline

%cd /home/ec2-user/SageMaker/data
%ls -alrt
/home/ec2-user/SageMaker/data
total 184
-rw-rw-r-- 1 ec2-user ec2-user   3865 Jan  8 23:06 pima-trained-model.pkl
-rw-rw-r-- 1 ec2-user ec2-user 110868 Jan  8 23:06 pima-data.xlsx
-rw-rw-r-- 1 ec2-user ec2-user  30020 Jan  8 23:06 pima-data.csv
-rw-rw-r-- 1 ec2-user ec2-user    239 Jan  8 23:06 pima-data-trunc.csv
-rw-rw-r-- 1 ec2-user ec2-user  23094 Jan  8 23:06 pima-data-orig.csv
drwxrwxr-x 2 ec2-user ec2-user   4096 Jan  8 23:06 ./
drwxr-xr-x 6 ec2-user ec2-user   4096 Jan  8 23:17 ../

df = pd.read_csv("/home/ec2-user/SageMaker/data/pima-data.csv")
df.head(5)
num_preg	glucose_conc	diastolic_bp	thickness	insulin	bmi	diab_pred	age	skin	diabetes
0	6	148	72	35	0	33.6	0.627	50	1.3790	True
1	1	85	66	29	0	26.6	0.351	31	1.1426	False
2	8	183	64	0	0	23.3	0.672	32	0.0000	True
3	1	89	66	23	94	28.1	0.167	21	0.9062	False
4	0	137	40	35	168	43.1	2.288	33	1.3790	True

diabetes_map = {True : 1, False : 0}
df['diabetes'] = df['diabetes'].map(diabetes_map)
df.head(5)
num_preg	glucose_conc	diastolic_bp	thickness	insulin	bmi	diab_pred	age	skin	diabetes
0	6	148	72	35	0	33.6	0.627	50	1.3790	1
1	1	85	66	29	0	26.6	0.351	31	1.1426	0
2	8	183	64	0	0	23.3	0.672	32	0.0000	1
3	1	89	66	23	94	28.1	0.167	21	0.9062	0
4	0	137	40	35	168	43.1	2.288	33	1.3790	1

 

Checking the True / False ratio:

num_true = len(df.loc[df['diabetes'] == True])
num_false = len(df.loc[df['diabetes'] == False])
print("Number of True: {0} ({1:2.2f}%)".format(num_true, (num_true/ (num_true + num_false)) * 100))
print("Number of False: {0} ({1:2.2f}%)".format(num_false, (num_false/ (num_true + num_false)) * 100))
Number of True: 268 (34.90%)
Number of False: 500 (65.10%)

Selecting the Algorithm

There are a lot of ML algorithms available to the public. To figure out which algorithm to use, some things to consider:

  • Compare decision factors
  • Differences of opinion about the factors
  • Learning type
  • Result type
  • Complexity
    • Ensemble algorithms = container algorithms made up of multiple child algorithms; these can be complex
  • Basic vs enhanced
    • Enhanced = variations of basic algorithms with improved performance and additional functionality, but also more complexity

For the diabetes example, our end result must be true or false, which is a binary classification. Therefore we can ignore regression and clustering algorithms. Candidates to consider (a small instantiation sketch follows this list):

  • Naive Bayes (used in this example)
    • Based on likelihood and probability
    • Correlations are learned from the data used for training
  • Logistic Regression
    • Measures the relationship between each feature and the outcome and assigns each a weight. The end result is a true or false prediction
  • Decision Tree
    • Uses a binary tree structure where the features determine which path to traverse through the tree. This is good where a sequence of features can lead to the final decision
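
As a quick sketch (not from the original notebook), the three candidates map directly to standard scikit-learn estimators; default settings are used here:

from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# the three candidate algorithms, with scikit-learn defaults
candidates = {
    "Naive Bayes": GaussianNB(),
    "Logistic Regression": LogisticRegression(),
    "Decision Tree": DecisionTreeClassifier(),
}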

 

Training the Model

Goal = letting specific data teach an ML algorithm to create a specific prediction model. It's possible we will need to retrain models with new data; new data generally means better predictions. We can also use new data to verify the model.

When training a model we need to split the data into the following:

  • Training Data ~ 70%
  • Test Data ~ 30%

With Python we use the ‘scikit-learn’ library to do the training and testing. It is designed to work with NumPy, SciPy and pandas, and it also contains tools for (a splitting sketch follows this list):

  • Data splitting
  • Pre-processing
  • Feature selection
  • Model training
  • Model tuning
  • A common interface across algorithms
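
A minimal splitting sketch, assuming the df loaded earlier. The feature/target column names come from pima-data.csv; the variable names and the random_state value are choices made here for illustration:

from sklearn.model_selection import train_test_split

# all columns except 'diabetes' are features; 'diabetes' is the value to predict
feature_col_names = ['num_preg', 'glucose_conc', 'diastolic_bp', 'thickness',
                     'insulin', 'bmi', 'diab_pred', 'age', 'skin']
predicted_class_name = 'diabetes'

x = df[feature_col_names].values
y = df[predicted_class_name].values

# 70% of rows for training, 30% for testing; random_state makes the split repeatable
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.30, random_state=42)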

Common problem: missing data. What can we do?

  • Ignore
  • Drop
  • Replace (impute)

In our example, rows with missing data made up about 50% of the dataset, so we don't want to ignore or drop them. Instead we need to replace the missing values with new values that correlate with the other data on that row. This requires a subject matter expert (SME) who understands the data. These choices are called imputing options.
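
A hedged sketch of the imputing and training steps, continuing from the split sketch above. Replacing zero values with the column mean is one possible imputing choice (an assumption here, not necessarily what an SME would pick); GaussianNB is the Naive Bayes variant assumed for the nb_model used in the next section:

from sklearn.impute import SimpleImputer
from sklearn.naive_bayes import GaussianNB

# treat 0 as "missing" and replace it with the mean of each feature column
fill_0 = SimpleImputer(missing_values=0, strategy="mean")
x_train = fill_0.fit_transform(x_train)
x_test = fill_0.transform(x_test)

# train the Naive Bayes model on the training split
nb_model = GaussianNB()
nb_model.fit(x_train, y_train.ravel())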

 

Testing Model for Accuracy

First we can run the predict method and evaluate the model's performance. We can do this for both the training and the testing data sets.

# predict values using the testing data
nb_predict_test = nb_model.predict(x_test)

from sklearn import metrics

# accuracy on the test data
print("Accuracy: {0:.4f}".format(metrics.accuracy_score(y_test, nb_predict_test)))

The confusion matrix can also be used to evaluate the accuracy of the algorithm.

print("Confusion Matrix")
print("{0}".format(metrics.confusion_matrix(y_test, nb_predict_test)))
print("")

print("Classification Report")
print(metrics.classification_report(y_test, nb_predict_test))

Some ways to improve the performance of the model:

  • adjust current algorithm
  • get more data or improve data
  • improve training
  • switch algorithms

 

Another option is to add the Random Forest algorithm, an ensemble algorithm that fits multiple decision trees on subsets of the data. This can help improve performance and reduce overfitting.
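
A minimal sketch, assuming the x_train/x_test split from earlier; n_estimators and random_state are illustrative values:

from sklearn.ensemble import RandomForestClassifier

rf_model = RandomForestClassifier(n_estimators=200, random_state=42)
rf_model.fit(x_train, y_train.ravel())

# evaluate on the test split, same as with Naive Bayes
rf_predict_test = rf_model.predict(x_test)
print("Accuracy: {0:.4f}".format(metrics.accuracy_score(y_test, rf_predict_test)))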

Overfitting is bad

It means the model is trying to be too perfect. The model creates its results strictly from the training dataset, and although it performs nearly perfectly on that dataset, it may perform badly on new data.

This can be adjusted using hyper-parameters. These can be used to dampen accuracy on the training data, which can improve performance on new data.
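
For example (illustrative values only, not tuned), a random forest can be constrained through its hyper-parameters so it cannot memorize the training data as easily:

from sklearn.ensemble import RandomForestClassifier

# shallower trees and larger leaves dampen the fit on the training data
rf_model = RandomForestClassifier(n_estimators=200, max_depth=4,
                                  min_samples_leaf=5, random_state=42)
rf_model.fit(x_train, y_train.ravel())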

Another algorithm to try is Logistic Regression.
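
A minimal sketch, again with illustrative hyper-parameter values; C is the inverse regularization strength, so a smaller C regularizes more and dampens the fit on the training data:

from sklearn.linear_model import LogisticRegression

lr_model = LogisticRegression(C=0.7, max_iter=1000, random_state=42)
lr_model.fit(x_train, y_train.ravel())

lr_predict_test = lr_model.predict(x_test)
print("Accuracy: {0:.4f}".format(metrics.accuracy_score(y_test, lr_predict_test)))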

 

Another option for improving performance is to break up the dataset into three parts (a split sketch follows this list):

  • Training – 70%
  • Validation – 10%
  • Testing – 20%
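
One way to get a 70/10/20 split is two calls to train_test_split; this sketch assumes the x and y arrays from the earlier splitting sketch:

from sklearn.model_selection import train_test_split

# first carve off 20% of the rows as the final test set
x_rest, x_test, y_rest, y_test = train_test_split(x, y, test_size=0.20, random_state=42)

# 12.5% of the remaining 80% equals 10% of the original data for validation
x_train, x_val, y_train, y_val = train_test_split(x_rest, y_rest, test_size=0.125, random_state=42)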

 

Another option is to use K-fold cross validation
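
A minimal sketch with scikit-learn's cross_val_score; 10 folds and the Naive Bayes estimator are illustrative choices:

from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# train and score on 10 different train/validation folds of the training data
scores = cross_val_score(GaussianNB(), x_train, y_train.ravel(), cv=10)
print("Mean accuracy across folds: {0:.4f}".format(scores.mean()))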

 

 

 
