This guide offers a quick overview of key AWS services and machine learning concepts, ideal for those preparing for the exam or looking to better understand AWS’s machine learning tools. This article features summarized tables and essential points, serving as a supplementary resource rather than comprehensive prep material. Dive in to familiarize yourself with the foundational elements of AWS machine learning in a simple, accessible format.
Creating Data Repositories
Overview of major services for Data Engineering
Exploratory Data Analysis
Types of Distribution:
Types of Data:
Data Viz:
Data Cleaning:
Feature Engineering:
Amazon SageMaker Overview
- Service Type: Fully managed
- Stages: Build, train, deploy ML models
- Integration: End-to-end ML lifecycle management
- Algorithms: Built-in, AWS Marketplace, custom via Docker
- Access Methods: AWS Console, API (Boto3), Python SDK, Jupyter Notebooks
Key Features
- Built-in Algorithms: Immediate use for training
- Custom Algorithms: Support via Docker containers
- Deployment: One-click model deployment
- Autopilot: Automated building, tuning, and deploying of models from tabular data
- Ground Truth: Data labeling service using human annotators or Amazon Mechanical Turk
- Data Wrangler: Visual data preparation and cleaning tool
- Neo: Optimizes models for edge devices
- Automatic Model Tuning: Automated hyperparameter tuning
- Debugger: Real-time training process insights
- Managed Spot Training: Up to 90% cost savings on training
- Distributed Training: Supports frameworks like TensorFlow, PyTorch, MXNet
Hyperparameters and Training
- Hyperparameters: Variables controlling model training, tunable via SageMaker
- Automatic Model Tuning: Finds optimal model version within specified hyperparameter limits
- Training Data Formats: CSV, protobuf RecordIO, JSON, LIBSVM, JPEG, PNG
- Input Modes: File mode (downloads the full dataset to the training instance; used for incremental training), Pipe mode (streams data directly from S3)
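Automatic Model Tuning searches a hyperparameter range you specify for the values that optimize a chosen metric. The idea can be sketched in plain Python — this is a toy stand-in (a single learning-rate hyperparameter, random search, a quadratic loss), not the SageMaker API:

```python
import random

def train(lr, steps=50):
    """Toy 'training job': gradient descent on f(w) = (w - 3)^2."""
    w = 0.0
    for _ in range(steps):
        w -= lr * 2 * (w - 3)    # gradient of (w - 3)^2 is 2(w - 3)
    return (w - 3) ** 2          # final loss = the objective metric

random.seed(0)
# Random search within a specified hyperparameter range, mimicking what
# Automatic Model Tuning automates across many parallel training jobs.
trials = [random.uniform(0.001, 0.9) for _ in range(10)]
results = {lr: train(lr) for lr in trials}
best_lr = min(results, key=results.get)
print(f"best learning rate: {best_lr:.3f}, loss: {results[best_lr]:.6f}")
```

In SageMaker you would instead define the hyperparameter ranges and objective metric in a tuning job, and the service launches and compares the training jobs for you.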
Deployment and Inference
- Hosting Services: Persistent HTTPS endpoint for predictions
- Batch Transform: Batch inferences without persistent endpoint
- Inference Pipelines: Chain models for sequential inference processing
- Real-Time and Batch Inference: Support through SageMaker endpoints
Optimization:
- Convert training data to protobuf RecordIO format for Pipe mode efficiency.
- Use Amazon FSx for Lustre to speed up File mode training jobs.
Amazon SageMaker Monitoring:
- Publish SageMaker metrics to CloudWatch for CPU, memory utilization, and latency monitoring.
- Send training metrics to CloudWatch to monitor model performance in real-time.
- Use Amazon CloudTrail to detect unauthorized SageMaker API calls.
Amazon SageMaker Pricing:
- Billing for building, training, and deploying ML models is by the second.
- No minimum fees or upfront commitments.
Security and Access
- Notebook Security: Configurable root access, IAM policies
- VPC Support: Private VPC with S3 VPC endpoint required for S3 access
- Lifecycle Configurations: Bash commands setup for notebook instances
Integration and Usage
- Data Preprocessing: In-notebook visualization, feature engineering, data conversion
- Algorithm Training: Uses Docker containers stored in Amazon ECR; data from S3, EFS, or FSx for Lustre
- Hyperparameter Tuning: Auto-tuning service, selection of performance metrics
- Model Deployment: Set up model definitions and endpoint configurations for deploying models
Machine Learning Concepts: 3 Categories
Supervised Learning
- Uses pre-labeled data
Unsupervised Learning
- Identifies groupings and clusters autonomously
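To make "identifies clusters autonomously" concrete, here is a minimal 1-D k-means sketch in pure Python (illustrative only — the data and initialization are made up): no labels are provided, yet the algorithm recovers the two groups.

```python
def kmeans_1d(points, k=2, iters=20):
    """Tiny k-means on 1-D data: assign each point to its nearest
    center, then move each center to the mean of its cluster."""
    centers = points[:k]                       # naive initialization
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

data = [1.0, 1.2, 0.8, 9.0, 9.5, 8.7]   # two obvious groups, no labels
print(kmeans_1d(data))                   # centers land near 1.0 and 9.1
```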
Reinforcement Learning
- Example: AI for video games
- Operates on action/reward principle
- Learns through trial and error
- Uses states, actions for each state, and a value (Q)
- Updates value Q to influence decision-making
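The Q-update described above can be shown in one line of arithmetic. The states, actions, and values below are made up for illustration; the update rule itself is the standard Q-learning formula:

```python
# Q(s, a) <- Q(s, a) + alpha * (reward + gamma * max_a' Q(s', a') - Q(s, a))
alpha, gamma = 0.1, 0.9          # learning rate and discount factor
Q = {("s0", "right"): 0.0,       # hypothetical current Q-table
     ("s1", "right"): 5.0,
     ("s1", "left"): 2.0}

reward = 1.0                     # reward observed for taking "right" in s0
# The best value available from the next state s1 (the max_a' term):
best_next = max(Q[("s1", a)] for a in ("right", "left"))

Q[("s0", "right")] += alpha * (reward + gamma * best_next - Q[("s0", "right")])
print(Q[("s0", "right")])        # 0 + 0.1 * (1 + 0.9*5 - 0) = 0.55
```

Repeating this update over many trial-and-error episodes is what nudges the agent toward high-reward actions.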
Optimization: Gradient Descent
- Objective: Find the minimum of the “sum of squares” error
- Process: Follow the slope downhill, stepping until the gradient approaches zero
- Adapts step size as the gradient approaches zero
- Challenges: The potential to get trapped in local minima; the learning rate (step size) impacts how quickly the minimum is reached
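The steps above can be sketched on a one-parameter least-squares fit (the data points are made up): each step moves the weight against the gradient of the sum-of-squares error, and the steps naturally shrink as the gradient flattens near the minimum.

```python
# Fit y ≈ w * x by gradient descent on the sum-of-squares error.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]         # roughly y = 2x

w, lr = 0.0, 0.01                 # initial weight, learning rate (step size)
for _ in range(500):
    # d/dw sum((w*x - y)^2) = sum(2 * (w*x - y) * x)
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys))
    w -= lr * grad                # step downhill; near the minimum the
                                  # gradient (and hence the step) shrinks
print(round(w, 3))                # converges to ~1.99
```

A learning rate that is too small makes convergence slow; one that is too large overshoots and can diverge — the trade-off noted in the challenges above.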
Regularization: Prevent Overfitting
Strategies to prevent model overfitting:
- Reduce complexity (fewer neurons or layers)
- Dropout: Randomly remove neurons during training
- Early stopping: Cease training early (e.g., at epoch 6 instead of 10)
- Regularization: Add a penalty term that adjusts sensitivity to specific dimensions
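Dropout is the easiest of these to demonstrate. Below is a minimal sketch of inverted dropout on one layer's activations (the activation values are made up): each neuron is kept with probability `keep_prob`, and survivors are scaled up so the expected activation is unchanged.

```python
import random

def dropout(activations, keep_prob=0.5):
    """Randomly zero out neurons for one training pass (inverted dropout)."""
    out = []
    for a in activations:
        if random.random() < keep_prob:
            out.append(a / keep_prob)   # scale up surviving neurons
        else:
            out.append(0.0)             # neuron "removed" this pass
    return out

random.seed(42)
print(dropout([0.8, 0.3, 0.5, 0.9], keep_prob=0.5))
```

At inference time dropout is disabled and all neurons are used; the scaling during training is what keeps the two regimes consistent.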
Types of Regularization:
L1 Regularization
- Sum of absolute weights
- Encourages feature selection, reduces dimensionality
- Computationally intensive
L2 Regularization
- Sum of squared weights
- Shrinks all weights but reduces none of them to zero
- Computationally efficient, suitable when all features are deemed important
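The two penalty terms are simple to compute. A quick sketch with a made-up weight vector — note how zero weights contribute nothing to L1, which is why L1 pressure can zero out features entirely while L2 only shrinks them:

```python
# L1 penalty: lambda * sum of absolute weights
def l1_penalty(weights, lam=0.1):
    return lam * sum(abs(w) for w in weights)

# L2 penalty: lambda * sum of squared weights
def l2_penalty(weights, lam=0.1):
    return lam * sum(w * w for w in weights)

weights = [0.0, -2.0, 0.5, 0.0]   # two features already zeroed out
print(l1_penalty(weights))        # 0.1 * (0 + 2 + 0.5 + 0)  = 0.25
print(l2_penalty(weights))        # 0.1 * (0 + 4 + 0.25 + 0) = 0.425
```

The penalty is added to the training loss, so larger weights cost more; under L2 the penalty for a large weight grows quadratically, which is why it punishes outliers in the weight vector more aggressively.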
Algorithms and Use Cases
ML Training methods
Performance Metrics
Hyperparameters
Hyperparameter optimization
ML Frameworks
Conclusion
Anticipate that most questions will be scenario-based, typically involving some task with various constraints such as budget, time, computational needs, or security considerations. Given this information, your job is to identify the optimal solution. Multiple options may appear suitable, as there are often several approaches to solving a given task (for example, ML problems can often be addressed using services like SageMaker, other AWS tools, or manual scripting). Using the provided constraints, you must deduce the most logical answer. Scenarios may include issues such as model overfitting, outlier detection, or troubleshooting service malfunctions (requiring knowledge of service limitations, e.g., SageMaker’s requirements for headers and specific column formats in CSV files stored in S3).
Some questions will pertain to confusion matrices, either requiring calculation or selecting the most appropriate metric for a given scenario. Additionally, expect inquiries regarding hyperparameter tuning. A general approach to these questions involves first determining whether the problem is classification, regression, or clustering, and then selecting the appropriate algorithm and hyperparameters accordingly.
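For the confusion-matrix questions, it helps to have the metric formulas at your fingertips. A quick worked example (the counts are made up):

```python
# Binary confusion matrix: true/false positives and negatives.
tp, fp, fn, tn = 80, 10, 20, 90

accuracy  = (tp + tn) / (tp + fp + fn + tn)   # 170 / 200 = 0.85
precision = tp / (tp + fp)                    # 80 / 90  ≈ 0.889
recall    = tp / (tp + fn)                    # 80 / 100 = 0.8
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(round(accuracy, 3), round(precision, 3), round(recall, 3), round(f1, 3))
```

As a rule of thumb for "pick the right metric" questions: precision matters when false positives are costly, recall when false negatives are costly, and F1 when you need a balance of both.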
All the Best!