AWS SageMaker Overview

Amazon SageMaker is a fully managed machine learning service. With SageMaker, data scientists and developers can quickly and easily build and train machine learning models, and then directly deploy them into a production-ready hosted environment. It provides an integrated Jupyter authoring notebook instance for easy access to your data sources for exploration and analysis, so you don’t have to manage servers. It also provides common machine learning algorithms that are optimized to run efficiently against extremely large data in a distributed environment. With native support for bring-your-own-algorithms and frameworks, SageMaker offers flexible distributed training options that adjust to your specific workflows.

 

Using Built-in Algorithms vs. Your Own Algorithms

Built-in Algorithms

  • Several ML algorithms are available for a variety of problem types
  • Ready to be used
  • Optimized for production
  • Examples:
    • Answers that fit into discrete categories
      • Linear learner and XGBoost
    • Answers that are quantitative
      • Linear learner and XGBoost
    • Answers that are discrete recommendations
      • Factorization machines
    • Identifying groups
      • K-Means algorithm
    • Simplify and better understand attributes of observations
      • Principal Component Analysis (PCA)
    • Classify images
      • Image classification algorithm
    • Neural machine translation
      • Sequence to sequence algorithm
    • Determining topics in a set of documents
      • Latent Dirichlet Allocation (LDA)
    • Determining topics in a set of documents using neural networks
      • Neural topic model (NTM)
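The list above can be read as a lookup from problem type to candidate algorithms. The sketch below is only an illustration of that mapping: the problem-type keys and the function name `suggest_algorithms` are made up for this example, and the algorithm identifier strings are assumed to match the names the SageMaker Python SDK uses for these built-in algorithms, so check them against the SDK documentation before relying on them.

```python
# Problem type -> built-in algorithms, following the list above.
# Keys are hypothetical labels; identifier strings are assumptions to verify.
BUILT_IN_ALGORITHMS = {
    "classification": ["linear-learner", "xgboost"],
    "regression": ["linear-learner", "xgboost"],
    "recommendation": ["factorization-machines"],
    "clustering": ["kmeans"],
    "dimensionality-reduction": ["pca"],
    "image-classification": ["image-classification"],
    "machine-translation": ["seq2seq"],
    "topic-modeling": ["lda", "ntm"],
}


def suggest_algorithms(problem_type):
    """Return the built-in algorithms listed for a problem type."""
    key = problem_type.strip().lower()
    if key not in BUILT_IN_ALGORITHMS:
        known = ", ".join(sorted(BUILT_IN_ALGORITHMS))
        raise ValueError(f"Unknown problem type {problem_type!r}; expected one of: {known}")
    # Return a copy so callers cannot modify the table by accident.
    return list(BUILT_IN_ALGORITHMS[key])
```

For example, `suggest_algorithms("topic-modeling")` returns both topic-model options, LDA and NTM, which leaves the final choice to the practitioner.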

 

Own Algorithms

  • Flexibility to use almost any algorithm code (provided as a Docker image)
  • Any implementation language
  • Any dependent libraries and frameworks
  • Can use the TensorFlow and Apache MXNet containers provided by AWS
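To give a feel for what the code inside such a Docker image looks like, here is a minimal sketch of a training entry point. It assumes the directory layout SageMaker mounts into training containers under /opt/ml (hyperparameters as a JSON file, one input directory per data channel, and a model directory whose contents are saved after training). The channel name "train", the "epochs" hyperparameter, and the placeholder "model" are hypothetical and stand in for a real algorithm.

```python
import json
from pathlib import Path


def train(base_dir="/opt/ml"):
    """Read hyperparameters and training data, then write a model artifact."""
    base = Path(base_dir)

    # Hyperparameters are delivered as a JSON file; values are strings.
    hp_file = base / "input" / "config" / "hyperparameters.json"
    hyperparameters = json.loads(hp_file.read_text()) if hp_file.exists() else {}
    epochs = int(hyperparameters.get("epochs", "1"))  # hypothetical hyperparameter

    # Each data channel gets its own directory; "train" is an assumed channel name.
    train_dir = base / "input" / "data" / "train"
    rows = []
    for csv_file in sorted(train_dir.glob("*.csv")):
        rows.extend(line for line in csv_file.read_text().splitlines() if line.strip())

    # Placeholder for real training: record what was seen.
    model = {"epochs": epochs, "rows_seen": len(rows)}

    # Whatever is written to the model directory becomes the model artifact.
    model_dir = base / "model"
    model_dir.mkdir(parents=True, exist_ok=True)
    artifact = model_dir / "model.json"
    artifact.write_text(json.dumps(model))
    return artifact
```

Taking the base directory as a parameter keeps the function testable outside a container; inside the image, the entry point would simply call `train()` with the default.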

 

SageMaker Machine Learning Environments

Amazon SageMaker supports the following machine learning environments.

  • Amazon SageMaker Studio: Lets you build, train, debug, deploy, and monitor your machine learning models.
  • RStudio on Amazon SageMaker: RStudio is an IDE for R, with a console, syntax-highlighting editor that supports direct code execution, and tools for plotting, history, debugging and workspace management.
  • Amazon SageMaker Studio Lab: Studio Lab is a free service that gives you access to AWS compute resources, in an environment based on open-source JupyterLab, without requiring an AWS account.
  • Amazon SageMaker Canvas: Gives you the ability to use machine learning to generate predictions without needing to code.
  • Amazon SageMaker geospatial: Gives you the ability to build, train, and deploy geospatial models.

To use these machine learning environments, except Studio Lab, you or your organization’s administrator must create an Amazon SageMaker Domain. Studio Lab has a separate onboarding process. An Amazon SageMaker Domain consists of an associated Amazon Elastic File System (Amazon EFS) volume; a list of authorized users; and a variety of security, application, policy, and Amazon Virtual Private Cloud (Amazon VPC) configurations. To use Amazon SageMaker Studio, Amazon SageMaker Studio notebooks, and RStudio, you must complete the Amazon SageMaker Domain onboarding process using the SageMaker console or the AWS CLI.

 

Provisioning and Managing Notebook Instances

One of the best ways for machine learning (ML) practitioners to use Amazon SageMaker is to train and deploy ML models using SageMaker notebook instances. The SageMaker notebook instances help create the environment by initiating Jupyter servers on Amazon Elastic Compute Cloud (Amazon EC2) and providing preconfigured kernels with the following packages: the Amazon SageMaker Python SDK, AWS SDK for Python (Boto3), AWS Command Line Interface (AWS CLI), Conda, Pandas, deep learning framework libraries, and other libraries for data science and machine learning.

An Amazon SageMaker notebook instance is a fully managed machine learning (ML) Amazon Elastic Compute Cloud (Amazon EC2) compute instance that runs the Jupyter Notebook App. You use the notebook instance to create and manage Jupyter notebooks for preprocessing data and to train and deploy machine learning models.
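Besides the console, a notebook instance can be provisioned programmatically. The following is a minimal sketch using the Boto3 SageMaker client's `create_notebook_instance` operation. The helper names, the default instance type and volume size, and the role ARN in the usage note are placeholders chosen for illustration; an actual call needs AWS credentials and an IAM role that SageMaker is allowed to assume.

```python
def notebook_instance_request(name, role_arn, instance_type="ml.t3.medium", volume_size_gb=5):
    """Build the request parameters for creating a notebook instance."""
    return {
        "NotebookInstanceName": name,
        "InstanceType": instance_type,   # assumed default, pick what fits your workload
        "RoleArn": role_arn,             # IAM role the notebook instance runs as
        "VolumeSizeInGB": volume_size_gb,
    }


def create_notebook_instance(request):
    """Send the request to SageMaker (requires boto3 and AWS credentials)."""
    import boto3  # imported here so the request builder works without boto3 installed

    client = boto3.client("sagemaker")
    return client.create_notebook_instance(**request)
```

A call would look like `create_notebook_instance(notebook_instance_request("demo-notebook", "arn:aws:iam::123456789012:role/ExampleRole"))`, where both the name and the ARN are made-up values. Keeping the request builder separate from the API call makes it easy to review the parameters before anything is provisioned.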

https://docs.aws.amazon.com/sagemaker/latest/dg/gs-setup-working-env.html

SageMaker XGBoost – The XGBoost model is adapted to the SageMaker environment and preconfigured as Docker containers. SageMaker provides a suite of built-in algorithms that are prepared for using SageMaker features. To learn more about what ML algorithms are adapted to SageMaker, see Choose an Algorithm and Use Amazon SageMaker Built-in Algorithms. For the SageMaker built-in algorithm API operations, see First-Party Algorithms in the Amazon SageMaker Python SDK.

 

Example

A typical process for machine learning is:

  • Get a dataset
  • Review the data
  • Split the data into training, validation, and test subsets
  • Train the model
    • Select the training algorithm (this example uses the XGBoost algorithm)
    • Run the trainer
  • Deploy the model
  • Evaluate the model
    • Use the NumPy library to prepare the dataset that predictions are run against
  • Cleanup
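As an illustration of the data-splitting step in the process above, here is a minimal sketch in plain Python. The 70/20/10 proportions, the fixed seed, and the function name `split_dataset` are assumptions made for this example, not values prescribed by SageMaker; the training, deployment, and evaluation steps would then run against the resulting subsets.

```python
import random


def split_dataset(rows, train_frac=0.7, validation_frac=0.2, seed=42):
    """Shuffle rows and split them into training, validation, and test subsets."""
    if not 0 < train_frac + validation_frac < 1:
        raise ValueError("train_frac + validation_frac must leave room for a test subset")

    shuffled = list(rows)                   # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)   # fixed seed keeps the split reproducible

    n_train = int(len(shuffled) * train_frac)
    n_validation = int(len(shuffled) * validation_frac)

    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_validation]
    test = shuffled[n_train + n_validation:]   # whatever remains
    return train, validation, test
```

Shuffling before splitting avoids subsets that merely reflect the order in which the data was collected, and holding the test subset back until the end keeps the final evaluation independent of training.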

SageMaker also provides sample notebooks that contain complete code walkthroughs. These walkthroughs show how to use SageMaker to perform common machine learning tasks. For more information, see Example Notebooks.

 

Keyboard Shortcuts

  • Shift+Enter = run the current cell and move to the next one
  • Shift+Tab = show a tooltip (signature and documentation) for the code at the cursor

 

References

https://docs.aws.amazon.com/sagemaker/

https://sagemaker.readthedocs.io/en/stable/overview.html#use-sagemaker-jumpstart-algorithms-with-pretrained-models

https://app.pluralsight.com/course-player?clipId=e6235da0-7868-4b9f-a2c4-ff69b8a9ad0a

https://www.kaggle.com/datasets/paultimothymooney/breast-histopathology-images

 
