AWS Certified Machine Learning Specialty Cheat Sheet

Ashank
5 min read · Apr 27, 2024


This guide offers a quick glance at key AWS services and machine learning concepts, ideal for those preparing for the exam or looking to understand AWS’s machine learning tools better. This article features summarized tables and essential points, serving as a supplementary resource rather than comprehensive prep material. Dive in to familiarize yourself with the foundational elements of AWS machine learning in a simple and accessible format.

Creating Data Repositories

Overview of major services for Data Engineering

Exploratory Data Analysis

Types of Distribution:

Types of Data:

Data Viz:

Data Cleaning:

Feature Engineering:

Feature Engineering for NLP

Amazon SageMaker Overview

  • Service Type: Fully managed
  • Stages: Build, train, deploy ML models
  • Integration: End-to-end ML lifecycle management
  • Algorithms: Built-in, AWS Marketplace, custom via Docker
  • Access Methods: AWS Console, API (Boto3), Python SDK, Jupyter Notebooks

Key Features

  • Built-in Algorithms: Immediate use for training
  • Custom Algorithms: Support via Docker containers
  • Deployment: One-click model deployment
  • Autopilot: Automated building, tuning, and deployment of models from tabular data
  • Ground Truth: Data labeling service using human annotators or Amazon Mechanical Turk
  • Data Wrangler: Visual data preparation and cleaning tool
  • Neo: Optimizes models for edge devices
  • Automatic Model Tuning: Automated hyperparameter tuning
  • Debugger: Real-time training process insights
  • Managed Spot Training: Up to 90% cost savings on training
  • Distributed Training: Supports frameworks like TensorFlow, PyTorch, MXNet

Hyperparameters and Training

  • Hyperparameters: Variables controlling model training, tunable via SageMaker
  • Automatic Model Tuning: Finds optimal model version within specified hyperparameter limits
  • Training Data Formats: CSV, protobuf RecordIO, JSON, LIBSVM, JPEG, PNG
  • Input Modes: File mode (dataset copied to the training instance before training starts), Pipe mode (data streamed directly from S3)

Deployment and Inference

  • Hosting Services: Persistent HTTPS endpoint for predictions
  • Batch Transform: Batch inferences without persistent endpoint
  • Inference Pipelines: Chain models for sequential inference processing
  • Real-Time and Batch Inference: Support through SageMaker endpoints

Optimization:

  • Convert training data to protobuf RecordIO format for Pipe mode efficiency.
  • Use Amazon FSx for Lustre to speed up File mode training jobs.

Amazon SageMaker Monitoring:

  • Publish SageMaker metrics to CloudWatch for CPU, memory utilization, and latency monitoring.
  • Send training metrics to CloudWatch to monitor model performance in real-time.
  • Use Amazon CloudTrail to detect unauthorized SageMaker API calls.

Amazon SageMaker Pricing:

  • Billing for building, training, and deploying ML models is by the second.
  • No minimum fees or upfront commitments.

Security and Access

  • Notebook Security: Configurable root access, IAM policies
  • VPC Support: Private VPC with S3 VPC endpoint required for S3 access
  • Lifecycle Configurations: Bash commands setup for notebook instances

Integration and Usage

  • Data Preprocessing: In-notebook visualization, feature engineering, data conversion
  • Algorithm Training: Docker images pulled from ECR; data from S3, EFS, or FSx for Lustre
  • Hyperparameter Tuning: Auto-tuning service, selection of performance metrics
  • Model Deployment: Setup model definitions, and endpoint configurations for deploying models

Machine Learning Concepts: 3 Categories

Supervised Learning

  • Uses pre-labeled data

Unsupervised Learning

  • Identifies groupings and clusters autonomously

Reinforcement Learning

  • Example: AI for video games
  • Operates on action/reward principle
  • Learns through trial and error
  • Uses states, actions for each state, and a value (Q)
  • Updates value Q to influence decision-making
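The update rule behind these bullets can be sketched in a few lines. This is a generic, minimal Q-learning illustration (not tied to any AWS service); the toy states, actions, and parameter values are chosen purely for demonstration.

```python
def q_learning_update(Q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """One Q-learning step: nudge Q(s, a) toward reward + discounted best future value."""
    best_next = max(Q[next_state].values())
    Q[state][action] += alpha * (reward + gamma * best_next - Q[state][action])

# Toy example: two states, two actions, all Q-values start at zero.
Q = {s: {a: 0.0 for a in ("left", "right")} for s in (0, 1)}
q_learning_update(Q, state=0, action="right", reward=1.0, next_state=1)
print(Q[0]["right"])  # 0.1 — one step of alpha * reward, since future values are still zero
```

Repeated over many trials, these small updates are what lets the agent learn by trial and error which action is most valuable in each state.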

Optimization: Gradient Descent

  • Objective: Find the minimum of the “sum of squares” error
  • Process: Follow the slope of the error surface, stepping toward a gradient of zero
  • Adapts step size as the gradient approaches zero
  • Challenges: can become trapped in local minima; the learning rate (step size) determines how quickly the minimum is reached
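The process above can be sketched as follows. This is a minimal one-dimensional illustration; the example function f(x) = (x − 3)² and the learning rate are arbitrary choices for demonstration.

```python
def gradient_descent(grad, x0, learning_rate=0.1, steps=100):
    """Repeatedly step opposite the gradient until (hopefully) near a minimum."""
    x = x0
    for _ in range(steps):
        x -= learning_rate * grad(x)
    return x

# Minimize the squared error f(x) = (x - 3)^2, whose gradient is 2(x - 3).
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
print(round(x_min, 4))  # converges to 3.0
```

Setting the learning rate too high makes the steps overshoot the minimum; too low, and convergence takes many more iterations — exactly the trade-off noted above.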

Regularization: Prevent Overfitting

Strategies to prevent model overfitting:

  • Reduce complexity (fewer neurons or layers)
  • Dropout: Randomly remove neurons during training
  • Early stopping: Cease training early (e.g., at epoch 6 instead of 10)
  • L1/L2 regularization: penalize large weights to reduce sensitivity to specific dimensions
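Dropout, one of the strategies listed above, is simple to sketch. This is a generic "inverted dropout" illustration (the function name, toy activations, and parameters are mine, not from any specific framework):

```python
import random

def dropout(activations, p=0.5, training=True, seed=None):
    """Inverted dropout: zero each unit with probability p and scale survivors
    by 1/(1-p), so expected activations are unchanged at inference time."""
    if not training:
        return list(activations)
    rng = random.Random(seed)
    return [0.0 if rng.random() < p else a / (1 - p) for a in activations]

acts = [1.0, 2.0, 3.0, 4.0]
dropped = dropout(acts, p=0.5, seed=0)  # each value is either zeroed or doubled
print(dropped)
```

Because random subsets of neurons are silenced on every training pass, the network cannot rely on any single neuron, which reduces overfitting.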

Types of Regularization:

L1 Regularization

  • Sum of absolute weights
  • Encourages feature selection, reduces dimensionality
  • Computationally intensive

L2 Regularization

  • Sum of squared weights
  • Shrinks weights toward zero, but none are reduced exactly to zero
  • Computationally efficient, suitable when all features are deemed important
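The two penalty terms differ only in how they measure the weights. A minimal sketch (the example weights and lambda are arbitrary):

```python
def l1_penalty(weights, lam):
    """L1: lambda times the sum of absolute weights."""
    return lam * sum(abs(w) for w in weights)

def l2_penalty(weights, lam):
    """L2: lambda times the sum of squared weights."""
    return lam * sum(w * w for w in weights)

weights = [0.5, -2.0, 0.0, 3.0]
pen1 = l1_penalty(weights, lam=0.1)  # 0.1 * (0.5 + 2.0 + 0.0 + 3.0) = 0.55
pen2 = l2_penalty(weights, lam=0.1)  # 0.1 * (0.25 + 4.0 + 0.0 + 9.0) = 1.325
print(pen1, pen2)
```

Because the L1 penalty grows linearly even for tiny weights, optimizing against it tends to push small weights all the way to zero (feature selection), whereas the L2 penalty's squared growth merely shrinks them.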

Algorithms and Use Cases

For Recommendation
For Prediction
For Classification
For Forecasting

ML Training methods

Performance Metrics

Hyperparameters

Hyperparameter optimization

ML Frameworks

Conclusion

Anticipate that most questions will be scenario-based, typically involving an XYZ task with various constraints such as budget, time, computational needs, or security considerations. Given this information, your task is to identify the optimal solution. Multiple options may appear suitable, as there are often several approaches to solving a given task (for example, ML problems can often be addressed using services like SageMaker or other AWS tools, or by manual scripting). Using the provided constraints, you must deduce the most logical answer. Scenarios may include issues such as model overfitting, outlier detection, or troubleshooting service malfunctions (requiring knowledge of service limitations, e.g., SageMaker's built-in algorithms expect CSV training data in S3 without a header row and with the target variable in the first column).

Some questions will pertain to confusion matrices, either requiring calculation or selecting the most appropriate metric for a given scenario. Additionally, expect inquiries regarding hyperparameter tuning. A general approach to these questions involves first determining whether the problem is classification, regression, or clustering, and then selecting the appropriate algorithm and hyperparameters accordingly.
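For the confusion-matrix questions, the calculations worth memorizing can be sketched as follows. This is a generic illustration; the counts in the example are made up.

```python
def metrics_from_confusion(tp, fp, fn, tn):
    """Derive common binary-classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)              # of predicted positives, how many were right
    recall = tp / (tp + fn)                 # of actual positives, how many were found
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy

# Example: 80 true positives, 20 false positives, 10 false negatives, 90 true negatives.
p, r, f1, acc = metrics_from_confusion(tp=80, fp=20, fn=10, tn=90)
print(round(p, 3), round(r, 3), round(f1, 3), round(acc, 3))  # 0.8 0.889 0.842 0.85
```

Choosing the right metric follows the scenario: recall when false negatives are costly (e.g., disease screening), precision when false positives are costly, and F1 when you need a balance on imbalanced data.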

All the Best!
