Machine Learning = building a model from example inputs to make data-driven predictions, as opposed to following strictly static program instructions. Traditional programming encodes explicit logic (if statements, loops, case statements, etc.) that the machine executes. Machine Learning does not rely on that kind of hand-written logic; instead, it is based on data and a chosen algorithm, and with that algorithm it is able to perform data analysis.
There are two main types of ML, and they address different types of problems:
- Supervised
- Divided into two main types:
- Classification = hotdog or not hotdog
- Regression = value prediction
- Requires training data containing the value being predicted, i.e. the data needs to be labeled
- Based on that data a model is created that can predict the value for new data
- Example
- Home Prices Calculator
- Data on number of rooms, bath, sqft, etc
- Creates model where given data on a property, the home price can be calculated
- Rekognition custom labels
- Unsupervised
- This is divided into these types:
- Clustering = Identify clusters of similar data
- Association = person who buys X also buys Y
- Dimensionality Reduction = pre-processing the data to make the dataset more focused for its usage (like filtering)
- Data does not contain cluster membership
- Model provides access to data by cluster
Supervised models can be used for predictions; unsupervised models are generally not used for prediction. Rather, unsupervised learning is used to find hidden patterns and gain more insight into a cluster.
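To make the clustering idea concrete, below is a minimal sketch using scikit-learn's KMeans on a few made-up 2-D points (the data and the cluster count are hypothetical, not from these notes). The model assigns cluster membership without any labels being provided.

# minimal clustering sketch; the points and n_clusters are illustrative only
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
kmeans.fit(points)

print(kmeans.labels_)           # cluster membership discovered for each point
print(kmeans.cluster_centers_)  # coordinates of the cluster centers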
There is also a field of ‘semi-supervised’ learning, where the dataset is only partially labeled. For example, known parts of an image can be labeled, while the image as a whole can still be analyzed as part of the unsupervised learning model.
Machine Learning Workflow
An orchestrated and repeatable pattern that systematically transforms and processes information to create prediction solutions. The workflow follows these steps:
- Ask the right question
- Prepare the data
- Select the algorithm
- Train the model
- Test the model
Some guidelines:
- The early steps are most important
- Expect to go backwards; later knowledge affects previous steps
- Data is never as you need it
- More data is better
- Don't pursue a bad solution (re-evaluate, fix, or quit)
Data Science
The field of Data Science has many overlapping parts. Machine Learning in particular is a combination of Software Development and Math/Statistics, and at the center sits the “Unicorn”: someone who combines an understanding of software, math/statistics, and the subject matter in question.
Python
Generally Python 3.x is used for ML programming; Python 2.x has incompatibilities, so it is generally not used. Python is especially popular for ML due to its available libraries.
Some common libraries:
- numpy – scientific computing
- pandas – data frames
- matplotlib – 2D plotting
- scikit-learn
- Algorithms
- Pre-processing
- Performance evaluation
- more…
Jupyter Notebook (formerly IPython Notebook) is the common IDE used with Python. Project Jupyter is a non-profit, open-source project, born out of the IPython Project in 2014 as it evolved to support interactive data science and scientific computing across all programming languages. Jupyter is 100% open-source software, free for all to use and released under the liberal terms of the modified BSD license.
Example
Below is an example of using ML to develop a solution. It follows the machine learning workflow.
Ask the Right Question
Need to define the end goal, the starting point, and how to get from the start to the goal. Define:
- scope
- target performance
- context
- how solution will be created
Example – we need a solution that can predict whether a person will develop diabetes. Scope and data sources:
- understand features in data
- identify critical features
- focus on the risk population
- select data source
- Use data from Pima Indian Diabetes study
Based on the above scope and data, the solution statement gets rephrased to: “Using Pima Indian Diabetes data, predict with 70% accuracy which people will develop diabetes”
Preparing the Data
Set up the Jupyter notebook and use the pandas library. Note that in SageMaker the data would normally come from S3, but for the example below the dataset is small.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# do plotting inline instead of in a separate window
%matplotlib inline

%cd /home/ec2-user/SageMaker/data
%ls -alrt /home/ec2-user/SageMaker/data

total 184
-rw-rw-r-- 1 ec2-user ec2-user   3865 Jan  8 23:06 pima-trained-model.pkl
-rw-rw-r-- 1 ec2-user ec2-user 110868 Jan  8 23:06 pima-data.xlsx
-rw-rw-r-- 1 ec2-user ec2-user  30020 Jan  8 23:06 pima-data.csv
-rw-rw-r-- 1 ec2-user ec2-user    239 Jan  8 23:06 pima-data-trunc.csv
-rw-rw-r-- 1 ec2-user ec2-user  23094 Jan  8 23:06 pima-data-orig.csv
drwxrwxr-x 2 ec2-user ec2-user   4096 Jan  8 23:06 ./
drwxr-xr-x 6 ec2-user ec2-user   4096 Jan  8 23:17 ../

# load the dataset into a pandas data frame and look at the first rows
df = pd.read_csv("/home/ec2-user/SageMaker/data/pima-data.csv")
df.head(5)

  num_preg  glucose_conc  diastolic_bp  thickness  insulin   bmi  diab_pred  age    skin  diabetes
0        6           148            72         35        0  33.6      0.627   50  1.3790      True
1        1            85            66         29        0  26.6      0.351   31  1.1426     False
2        8           183            64          0        0  23.3      0.672   32  0.0000      True
3        1            89            66         23       94  28.1      0.167   21  0.9062     False
4        0           137            40         35      168  43.1      2.288   33  1.3790      True

# map the True/False labels to 1/0 so the algorithms work with numeric values
diabetes_map = {True: 1, False: 0}
df['diabetes'] = df['diabetes'].map(diabetes_map)
df.head(5)

  num_preg  glucose_conc  diastolic_bp  thickness  insulin   bmi  diab_pred  age    skin  diabetes
0        6           148            72         35        0  33.6      0.627   50  1.3790         1
1        1            85            66         29        0  26.6      0.351   31  1.1426         0
2        8           183            64          0        0  23.3      0.672   32  0.0000         1
3        1            89            66         23       94  28.1      0.167   21  0.9062         0
4        0           137            40         35      168  43.1      2.288   33  1.3790         1
Checking the True / False ratio:
num_true = len(df.loc[df['diabetes'] == True])
num_false = len(df.loc[df['diabetes'] == False])
print("Number of True: {0} ({1:2.2f}%)".format(num_true, (num_true / (num_true + num_false)) * 100))
print("Number of False: {0} ({1:2.2f}%)".format(num_false, (num_false / (num_true + num_false)) * 100))

Number of True: 268 (34.90%)
Number of False: 500 (65.10%)
Selecting the Algorithm
There are a lot of ML algorithms publicly available. Some things to consider when figuring out which algorithm to use:
- Compare Decision factors
- Difference of opinions about factors
- Learning type
- result
- complexity
- ensemble algorithms = a container algorithm made up of multiple child algorithms; these can be complex
- basic vs enhanced
- enhanced = a variation of a basic algorithm with improved performance and additional functionality, but also more complexity
For the diabetes example, the end result must be true or false, i.e. a binary classification. Therefore we can ignore algorithms that do not perform classification (such as regression or clustering algorithms). Candidates to consider:
- Naive Bayes (used in this example)
- based on likelihood and probability
- correlations are derived from the data used for training
- Logistic Regression
- Measures the relationship between each feature and assigns each a weight; the end result is true or false
- Decision Tree
- Uses a binary tree structure in which the features determine which path to traverse through the decision tree. This is good where a sequence of features can lead to the final decision
Training the Model
Goal = letting specific data teach an ML algorithm to create a specific prediction model. It is possible we will need to retrain models with new data; new data generally means better predictions. We can also use new data to verify the model.
When training a model we need to split the data into the following (a split sketch using scikit-learn appears after the library notes below):
- Training Data ~ 70%
- Test Data ~ 30%
With Python we use the ‘scikit-learn’ library to do the training and testing. It is designed to work with NumPy, SciPy, and pandas, and it also contains tools for:
- Data splitting
- pre-processing
- feature selection
- model training
- model tuning
- common interface across algorithms
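As an illustration of the data-splitting step, here is a hedged sketch of a 70/30 split using scikit-learn's train_test_split. It assumes the df frame and the column names from the pima-data.csv loaded earlier; the exact feature list used in the original notebook may differ.

from sklearn.model_selection import train_test_split

# candidate feature columns (assumed); 'diabetes' is the value being predicted
feature_cols = ['num_preg', 'glucose_conc', 'diastolic_bp', 'thickness',
                'insulin', 'bmi', 'diab_pred', 'age', 'skin']
x = df[feature_cols].values
y = df['diabetes'].values

# hold back 30% of the rows for testing; random_state makes the split repeatable
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.30, random_state=42)

print("Training rows: {0}".format(len(x_train)))
print("Testing rows: {0}".format(len(x_test)))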
Common problem: missing data. What can we do?
- Ignore
- Drop
- Replace (impute)
In our example, roughly 50% of the rows had missing data, so we don't want to ignore or drop them. Instead we need to replace each missing value with a new value that correlates with the other data in that row. Choosing how to do this, the so-called imputing options, requires an SME (subject matter expert) who understands the data.
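Below is a minimal sketch of the replace (impute) option followed by training the Naive Bayes model used in this example. It assumes the x_train/x_test split from the sketch above and uses scikit-learn's SimpleImputer and GaussianNB; in practice only the columns where a zero genuinely means "missing" would be imputed, which is where the SME's judgment comes in.

from sklearn.impute import SimpleImputer
from sklearn.naive_bayes import GaussianNB

# treat zeros as missing and replace them with the column mean (one imputing option)
imputer = SimpleImputer(missing_values=0, strategy='mean')
x_train = imputer.fit_transform(x_train)   # learn the column means on training data only
x_test = imputer.transform(x_test)         # reuse those means on the test data

# train the Naive Bayes model on the 70% training split
nb_model = GaussianNB()
nb_model.fit(x_train, y_train)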
Testing Model for Accuracy
First we can run the predict method to evaluate the model's performance. We can run this on both the training and testing data sets.
# predict values using the testing data
nb_predict_test = nb_model.predict(x_test)

from sklearn import metrics

# testing metrics
print("Accuracy: {0:.4f}".format(metrics.accuracy_score(y_test, nb_predict_test)))
The confusion matrix can also be used to evaluate the accuracy of the algorithm.
print("Confusion Matrix") print("{0}".format(metrics.confusion_matrix(y_test, nb_predict_test))) print("") print("Classification Report") print(metrics.classification_report(y_test, nb_predict_test))
Some ways to improve the performance of the model:
- adjust current algorithm
- get more data or improve data
- improve training
- switch algorithms
Adding the Random Forest algorithm, an ensemble algorithm that fits multiple trees with subsets of the data, can help improve performance and reduce overfitting.
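A hedged sketch of fitting a Random Forest on the same split; the parameters (library defaults plus a fixed random_state) are illustrative, not the settings from the original notebook.

from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics

rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(x_train, y_train)             # ensemble of trees fit on subsets of the data

rf_predict_test = rf_model.predict(x_test)
print("Accuracy: {0:.4f}".format(metrics.accuracy_score(y_test, rf_predict_test)))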
Overfitting is bad
This means the model is trying to be too perfect: it builds its results strictly on the training dataset, and although it performs nearly perfectly on that dataset, it may perform badly on new data.
This can be adjusted using hyper-parameters. These can be used to dampen accuracy on the training data, which can improve performance on new data.
Another algorithm to try is Logistic Regression.
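A minimal sketch of trying Logistic Regression on the same split. The regularization strength C is one of the hyper-parameters mentioned above: smaller values dampen the fit to the training data, which can help on new data. The value shown is illustrative only.

from sklearn.linear_model import LogisticRegression
from sklearn import metrics

lr_model = LogisticRegression(C=0.7, max_iter=1000, random_state=42)
lr_model.fit(x_train, y_train)

lr_predict_test = lr_model.predict(x_test)
print("Accuracy: {0:.4f}".format(metrics.accuracy_score(y_test, lr_predict_test)))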
Another option for improving performance is to break up the dataset into the following (see the sketch after this list):
- Training – 70%
- Validation – 10%
- Testing – 20%
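One hedged way to produce that three-way split is to call train_test_split twice, as sketched below; the fractions and variable names are illustrative.

from sklearn.model_selection import train_test_split

# first carve off 20% of the rows for final testing
x_rest, x_test, y_rest, y_test = train_test_split(x, y, test_size=0.20, random_state=42)

# then take 1/8 of the remaining 80% (10% of the total) for validation
x_train, x_val, y_train, y_val = train_test_split(x_rest, y_rest, test_size=0.125, random_state=42)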
Another option is to use K-fold cross validation
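A minimal sketch of K-fold cross validation (K=10 here, chosen arbitrarily) using scikit-learn's cross_val_score with the Naive Bayes model and the full x/y arrays from earlier; for simplicity it skips the imputing step.

from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# train and evaluate on 10 different train/test folds of the data
scores = cross_val_score(GaussianNB(), x, y, cv=10)
print("Fold accuracies: {0}".format(scores))
print("Mean accuracy: {0:.4f}".format(scores.mean()))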
Example Jupyter Notebook