Questions and Answers
When deploying machine learning models at scale, which of the following challenges becomes particularly significant?
- Handling millions of users with millisecond latency and high uptime requirements. (correct)
- Ensuring model development aligns with initial specifications.
- Verifying that the basic deployment is working.
- Confirming the model can utilize cloud services.
What does operating machine learning models at scale primarily involve?
- Monitoring, debugging, and seamlessly updating models, often requiring collaboration between model developers and deployment teams. (correct)
- Limiting responsibilities to a single team to avoid communication overhead.
- Only focusing on model development and initial deployment.
- Reducing model updating frequency to minimize potential errors.
Why is understanding deployment important when developing machine learning models?
- It offers insights into model constraints and helps developers tailor models based on their intended use, such as online or batch predictions. (correct)
- It primarily helps in securing funding for the project.
- It simplifies the model development process, making it less complex.
- It ensures that models are developed according to academic standards.
What is a key difference between online (real-time) and batch prediction?
Select which of the following is typically considered a myth in machine learning deployment?
Why do machine learning models require continuous monitoring and updates post-deployment?
What is the significance of the trend toward continuous deployment in machine learning?
Why should machine learning engineers be concerned about scalability?
What does 'Batch Prediction' involve?
In the context of online prediction, what does latency sensitivity refer to?
In a food delivery platform, how might batch prediction be applied?
What is a primary constraint of batch prediction systems regarding adaptability?
What advancement is enhancing online prediction capabilities?
What is the role of streaming features in online prediction?
What is the significance of feature stores in integrating stream and batch processing?
What are the dual benefits of model compression?
What does the model compression technique of 'knowledge distillation' involve?
What is the primary goal of 'Low-Rank Factorization' as a model compression technique?
What is a key advantage of using WebAssembly (WASM) in web development?
Flashcards
Model Deployment
Moving a verified model from development to a production environment.
Challenges in Deployment
Monitoring performance, updating models, scaling to handle traffic, and ensuring reliability and uptime once the model is in production.
Deploying at Scale
Exposing an API endpoint to allow applications to request model predictions at scale.
Operating at Scale
ML Model Deployment
Batch Prediction
Online Prediction
Batch Features
Streaming Features
Low-Rank Factorization
Knowledge Distillation
Pruning
Quantization
Cloud Deployment
Edge Deployment
Support Dynamics
TPU
Computation Graphs
autoTVM
WebAssembly (WASM)
Study Notes
- The presentation is for SEAS 8500: Fundamentals of AI-Enabled Systems, Week 6, and covers model deployment and prediction services, presented by John Fossaceca.
- Slides are adapted from material in Designing Machine Learning Systems by Chip Huyen.
Agenda
- Topics include deploying your model, deployment myths, batch vs. online prediction, model compression methods, and ML on the cloud and on the edge.
- Model compression methods covered include low-rank factorization, knowledge distillation, pruning, and quantization.
Deploying Your Model
- Means moving a model from development to production
- "Production" means making the model accessible to end users
- Key steps:
- Containerize model and dependencies.
- Deploy container to cloud platform.
- Expose prediction API endpoint.
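As a minimal sketch of the last step, assuming a hypothetical stand-in model and FastAPI as the serving framework (the slides don't prescribe a specific one):

```python
# Minimal sketch of exposing a prediction API endpoint.
# DummyModel is a placeholder; in practice a trained artifact would be loaded at startup.
from fastapi import FastAPI
from pydantic import BaseModel

class DummyModel:
    def predict(self, rows):
        return [sum(r) for r in rows]  # stand-in for a real trained model

app = FastAPI()
model = DummyModel()

class PredictRequest(BaseModel):
    features: list[float]

@app.post("/predict")
def predict(req: PredictRequest):
    # Downstream apps POST feature values here and wait for the prediction.
    score = model.predict([req.features])[0]
    return {"prediction": float(score)}
```

Inside the container, a server such as uvicorn would run this app (for example `uvicorn serve:app --host 0.0.0.0 --port 8080`, assuming the file is named serve.py), and that port becomes the endpoint downstream applications call.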
Challenges in Deployment
- Monitoring performance
- Updating models
- Scaling to handle traffic
- Ensuring reliability and uptime
- Managing costs
- Securing access
- Compliance and regulations
- Understanding metrics
Deploying at Scale
- Exposing an API endpoint to receive model predictions
- Downstream apps then send their requests to this endpoint.
- Although basic deployment is straightforward, challenges arise with scale:
- Millions of users
- Milliseconds latency
- 99% uptime
Operating at Scale
- Monitoring and alerting for problems
- Debugging root causes when issues arise
- Updating models seamlessly
- Responsibility for these tasks varies:
- Often falls on the model developers themselves
- Can also fall to a separate deployment team, which brings:
- High communication overhead
- Slower model updating
- Harder debugging
ML Model Deployment
- Understanding deployment provides insights into model constraints
- Models should be tailored based on their intended use
- Two ways to provide predictions:
- Online (real-time)
- Batch
- Location impacts design:
- Device (edge)
- Cloud
Machine Learning Deployment Myths
- Common myths:
- It is easy and just requires calling a prediction function
- Models will work as well as they did in development
- You only need to deploy once
- Reality:
- It is more involved than calling a prediction function
- Performance degrades in production
- Models need continuous monitoring and updating
- Debunking myths helps set the right expectations
Myth 1: Only Deploying One or Two ML Models at a Time
- In academia, the focus is often on a single model
- Real applications rely on many models
- Different features need different models
- Separate models per country/region
- Other segmentation such as user types and languages.
- A ridesharing app is an example:
- Demand forecasting, ETA, pricing, fraud, and churn require models
- There are models for each country.
- Adds up to hundreds or thousands of models.
Reality: Many Models in Production
- Uber leverages thousands of models in production.
- Google has thousands of models concurrently training with billions of parameters.
- Booking.com has over 150 models.
- 41% of large companies have over 100 models in production
- Infrastructure should support many models in parallel
- It is no longer possible to think of deploying models in isolation.
Myth 2: If we don't do anything, model performance remains the same
- Software performance degrades over time ("bit rot").
- ML models also suffer from data distribution shift
- There are differences between training data and production data
- Model accuracy declines after deployment
- Ongoing monitoring and updating needed
- It is not possible to "set and forget" models in production.
Myth 3: You Won't Need To Update Your Models As Much
- Models only need infrequent updates, but this is untrue
- Model performance degrades over time
- It is important to update models as fast as possible
- DevOps best practices should be followed for frequent updates
- In 2015, Etsy deployed 50 times per day, Netflix thousands of times per day, and AWS every 11 seconds.
Reality: Update Models Continuously
- Many still only update monthly or quarterly
- But leaders do it much faster:
- Weibo updates some models every 10 minutes
- Alibaba, ByteDance (TikTok) iterate rapidly
- "Deploy models as fast as humanly possible"
- Trend toward continuous deployment for ML
Myth 4: Most ML Engineers Don't Need to Worry About Scale
- "Scale" varies, but often references hundreds of queries per second or millions of users per month.
- It is a misconception that only huge companies need to worry about scale
- Most ML jobs are at large companies:
- 50%+ of developers work at 100+ person companies
- ML roles likely similar
Preparing for Scale
- If seeking industry ML job, it is likely at 100+ person company
- ML systems need scalability
- Scale is no longer an exceptional case
- ML engineers should care about scale
- Apply scalable solutions upfront
- Hard to retrofit later
Batch Prediction vs. Online Prediction
- Key decision about how the model serves prediction is needed:
- Batch: predictions generated asynchronously, latency isn't critical, uses batch features
- Online (real-time): the user waits for the prediction, latency sensitive, can use streaming features
- The choice depends on user experience needs, infrastructure constraints, required throughput, and data dependencies.
Batch Prediction
- Predictions are generated periodically or on trigger
- They are stored and retrieved on demand.
- It is also called asynchronous or offline, where latency is not critical.
- It is common for internal analytics
- Examples: recommendations precomputed ahead of time, or user segmentation computed nightly
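A minimal sketch of such a periodic job, assuming hypothetical parquet files and column names and a generic model with a scikit-learn-style `predict`:

```python
# Nightly batch prediction sketch: precompute predictions for all users,
# store them, and let the application look them up on demand later.
import pandas as pd

def run_nightly_batch(model, users_path="users.parquet", out_path="recommendations.parquet"):
    users = pd.read_parquet(users_path)                         # accumulated historical data
    features = users[["feature_a", "feature_b"]]                # hypothetical batch features
    users["recommendation"] = model.predict(features)
    users[["user_id", "recommendation"]].to_parquet(out_path)   # retrieved on demand at serving time
```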
Online Prediction
- Predictions generated immediately on request.
- Also called on-demand, real-time, or synchronous.
- User waits for prediction
- Latency sensitive
- Requests via REST API (HTTP requests)
- Common for user-facing apps
Batch vs Online
- Batch prediction is periodic or asynchronous, optimized for high throughput
- Batch prediction is useful for processing accumulated data and generating results when they're not needed immediately (ex: recommender systems)
- Online prediction is synchronous with "low latency"
- Online predictions are generated as soon as requests come in
- Online prediction is useful when predictions are needed as soon as a data sample is generated (ex: fraud detection)
Batch vs. Streaming Features
- Batch features are computed from historical data
- Streaming features are computed from real-time data
- Batch prediction: only batch features
- Online prediction:
- Can use batch features
- Can use streaming features
- Example:
- Delivery time estimation
- Batch: restaurant's past prep time
- Streaming: current orders, delivery people available
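A sketch of the two feature types for this delivery-time example, with hypothetical event and column names:

```python
import pandas as pd

# Batch feature: a restaurant's mean prep time, computed offline from historical orders.
def batch_mean_prep_time(orders: pd.DataFrame) -> pd.Series:
    # orders has columns: restaurant_id, prep_minutes (historical data)
    return orders.groupby("restaurant_id")["prep_minutes"].mean()

# Streaming feature: how many orders each restaurant has open right now,
# updated incrementally as order events arrive in real time.
def current_open_orders(events, open_counts=None):
    open_counts = {} if open_counts is None else open_counts
    for e in events:  # e.g. {"restaurant_id": 7, "type": "order_placed"}
        delta = 1 if e["type"] == "order_placed" else -1
        open_counts[e["restaurant_id"]] = open_counts.get(e["restaurant_id"], 0) + delta
    return open_counts
```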
Streaming Prediction Architecture
- Can combine batch and streaming features
- Also called "streaming prediction"
- Batch features are retrieved from databases, data warehouses
- Streaming features are computed from real-time data
- Precomputing batch predictions for popular queries is an example of a hybrid approach
Online vs. Batch Prediction: A Closer Look
- Food delivery platforms are a practical application:
- Batch Prediction: Curates restaurant suggestions (due to vast restaurant options)
- Online Prediction: Recommends dishes once a specific restaurant is selected
- Debunking misconceptions:
- Common Perception: Online prediction might lag in cost & performance efficiency.
- Reality: Efficiency varies; see insights from "Batch vs. Stream Processing".
- Optimizing Resources:
- Online predictions are computed only for active users.
- Example: Out of 31 million Grubhub users in 2020, only about 622,000 placed daily orders. Predicting for all users would squander 98% of computational resources.
Online vs. Batch Prediction: Trade-offs
- Online Prediction – Instantaneous Insight:
- Intuitive for academicians and researchers.
- Prototyping Ease: Feed the model an input and receive an instantaneous prediction.
- Deployment Platforms: Often employed with cloud solutions like Amazon SageMaker or Google App Engine, which readily expose endpoints for real-time predictions.
- Batch Prediction – Predict Now, Use Later:
- Predictions are calculated beforehand and stored for later use.
- Efficiency at Scale: Enables processing massive datasets swiftly by leveraging distributed computation.
- Latency Advantages: With predictions already computed, retrieval is often quicker than real-time generation, especially for complex models.
Constraints of Batch Prediction
- Adaptability Hurdles: Batch systems struggle with swift adaptability to evolving user behaviors.
- Illustration: On platforms like Netflix, if one's recent viewings shift genres, recommendations remain static until the next batch computation.
- Prediction Prescience: Batch systems necessitate forecasting which predictions will be sought.
- Challenging for Dynamic Queries: For instance, real-time translation services can't predict every conceivable sentence or phrase.
- Urgency in Application: Scenarios where instantaneous reactions are imperative.
- Sectors like high-frequency trading, autonomous transport, and instant fraud detection require real-time decision-making capabilities.
Towards Enhanced Online Prediction
- Hardware Innovations & Algorithmic Progress: These twin drivers are making online predictions faster and more cost-effective, nudging it towards becoming an industry norm.
- The journey from batch to online requires:
- Strategic Corporate Investments: companies pivot toward online prediction to enhance user experience and decision accuracy.
- Counteracting Latency: the shift demands a real-time processing pipeline and rapid-response models.
Unifying Batch Pipeline & Streaming Pipeline
- Backdrop:
- Historically dominant tools like MapReduce and Spark enabled efficient periodic processing of large datasets. Early ML implementations used robust batch systems
- The Streaming Imperative:
- With the growing need for real-time responsiveness, streaming pipelines became indispensable. Stream caters to instantaneous data influxes, demanding its separate pipeline.
- A Concrete Example:
- In navigation applications like Google Maps, the batch role uses accumulated traffic data to predict general route timings
- The streaming role adjusts predictions in real time, accounting for sudden changes like accidents or road closures.
Challenges of Dichotomy
- Running separate pipelines can lead to divergent data interpretations and potential feature inconsistencies.
- Dual systems can strain resources and complicate updates or refinements.
A Detailed Look at Real-time Arrival Prediction in Navigation
- Design a dynamic model for accurate arrival time forecasting in platforms akin to Google Maps.
- Continual Prediction Adjustments: As a user travels, the model constantly refines its prediction based on real-time data.
- A key feature is vehicle speed analysis
- Definition: Measures the average speed of all cars on the user's ongoing route over a short period (last 5 minutes).
- Data Insight: Leverages comprehensive data from the previous month to train the model on broader traffic patterns.
- Batch Processing: For efficient feature computation across vast datasets, data is grouped and analyzed in chunks using dataframes.
Contrast in Data Processing
- During Model Training: Speed features are derived through a batch-oriented approach, processing large sets of historical data.
- Real-time Inference: As users navigate, the speed feature updates instantaneously, utilizing a streaming methodology complemented by a sliding window.
- Implication: This dual processing ensures the model is both grounded in historical data and responsive to immediate traffic changes
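A sketch of the same speed feature computed both ways, with hypothetical column names and the 5-minute window from the definition above:

```python
import pandas as pd
from collections import deque

# Batch (training): average speed per route over historical data, computed in bulk with dataframes.
def batch_avg_speed(history: pd.DataFrame) -> pd.Series:
    # history has columns: route_id, timestamp, speed_kmh
    return history.groupby("route_id")["speed_kmh"].mean()

# Streaming (inference): average speed on the user's route over the last 5 minutes,
# maintained incrementally with a sliding window as new speed readings arrive.
class SlidingWindowSpeed:
    def __init__(self, window_seconds=300):
        self.window = window_seconds
        self.readings = deque()  # (timestamp, speed) pairs

    def update(self, timestamp, speed):
        self.readings.append((timestamp, speed))
        while self.readings and timestamp - self.readings[0][0] > self.window:
            self.readings.popleft()
        return sum(s for _, s in self.readings) / len(self.readings)
```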
Avoiding Dual Pipeline Bugs
- It is key to avoid different features being extracted for training vs inference
- Changes to one pipeline must be replicated in others
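One common guard, sketched below with a hypothetical feature: define the feature logic once and import it from both pipelines so their outputs cannot drift apart.

```python
# Single source of truth for a feature, imported by both the training (batch)
# and inference (streaming) pipelines.
def avg_prep_time_feature(prep_minutes_values):
    values = [v for v in prep_minutes_values if v is not None and v >= 0]  # shared cleaning rules
    return sum(values) / len(values) if values else 0.0

# Training pipeline:  avg_prep_time_feature(historical_column)
# Inference pipeline: avg_prep_time_feature(recent_events_window)
```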
Deep Dive Into Integrating Stream & Batch Processing in ML
- Shift in the ML community towards the cohesive integration of stream and batch processing
- This merges the real-time reactivity of streaming with the in-depth analysis capabilities of batch processing.
- Uber and Weibo both transformed their infrastructure to bridge the gap between batch and stream processing by adopting advanced stream processors such as Apache Flink.
Consistency Across Features
- A Challenge: Avoiding discrepancies between features extracted during different processing modes.
- A Solution: Feature stores play a pivotal role, ensuring uniformity between batch features (used during training) and streaming features (used during real-time predictions).
- Advantages of Unified Processing:
- Combines the capacity to train models on extensive datasets with adaptability to real-time data variations, while mitigating data redundancy.
A data pipeline for ML systems that do online prediction
- Streaming data is ingested, processed, and stored, then flows to a data warehouse
- Research and development of ML models and labels draws on this data
- Labeling and feature engineering should produce equal results across research and production
- From the ML model, logs, predictions, and inputs flow onward to the application
Model Compression
- With Real-time ML, size and complexity of ML models can lead to undesirable latency.
- The Trade-off: larger models can offer better accuracy, but at the cost of slower inference speeds.
- Strategies to enhance speed involve inference optimization, hardware enhancements, and model compression.
- Model compression can have dual benefits; compressed models are smaller, but often provide quicker predictions because computational needs are reduced.
- Growing emphasis: model compression's momentum is evident in the 168 distinct open-source projects on the topic as of April 2022.
Model Compression Techniques
- Low-Rank Factorization:
- Simplifies weight matrices in the model.
- Streamlines operations by eliminating redundancies.
- Knowledge Distillation:
- A 'student' model learns from a larger 'teacher' model.
- Retains capabilities without the unnecessary bulk.
- Pruning:
- Analogous to trimming a tree.
- Removes unneeded weights/neurons, leaving a lean and efficient model.
- Quantization:
- Reduces numerical precision (e.g., from 32-bit to 16-bit numbers).
- Cuts memory and computational requirements.
Low-Rank Factorization
- High-Dimensional Tensors carry redundant information or capture noise which isn't significant to the model's prediction power
- Low-Rank Factorization - A method to simplify tensors by converting high-dimensional spaces into lower-dimensional ones without significant loss of information.
- Enhances the speed and efficiency of model inferences
- Reduces memory consumption, crucial for edge devices.
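A minimal sketch of the idea using truncated SVD on a single weight matrix (illustrative sizes; real methods also factor convolutional tensors):

```python
import numpy as np

# Approximate a dense weight matrix W with two low-rank factors A and B, so W ≈ A @ B.
def low_rank_approx(W, rank):
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    # Keep only the top-`rank` singular values/vectors.
    return U[:, :rank] * S[:rank], Vt[:rank, :]

W = np.random.randn(512, 512)            # hypothetical dense layer weights (262,144 parameters)
A, B = low_rank_approx(W, rank=32)       # 512*32 + 32*512 = 32,768 parameters instead
x = np.random.randn(512)
y_approx = A @ (B @ x)                   # two cheap matmuls replace one large one
```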
Mechanics Behind Low-Rank Factorization
- Over-parameterization: Refers to models having too many parameters which can lead to inefficiencies and overfitting.
- Compact Convolutional Filters:
- Offers an approach to reduce over-parameterization.
- A Transition from larger convolutions (like 3x3) to smaller ones (like 1x1) can achieve more compact model structures.
- A result: A drastic reduction in the number of model parameters without a correspondingly significant drop in model accuracy.
Case Studies - SqueezeNets & MobileNets
- SqueezeNet achieved similar performance to AlexNet on ImageNet but operates with 50x fewer parameters, using strategies like smaller convolution sizes.
- MobileNets: Break standard convolutions into depthwise and pointwise convolutions, achieving up to a nine-fold reduction in parameters.
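A sketch of the MobileNet-style building block in PyTorch, with illustrative channel sizes:

```python
import torch
import torch.nn as nn

# Depthwise-separable block: a per-channel 3x3 (depthwise) conv followed by a 1x1 (pointwise) conv.
class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

block = DepthwiseSeparableConv(64, 128)
y = block(torch.randn(1, 64, 32, 32))
# Weights (ignoring biases): 64*3*3 + 64*128 = 8,768 vs. 64*128*3*3 = 73,728 for a standard 3x3 conv.
```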
Knowledge Distillation
- A lighter model called the 'student' is trained to replicate the behavior of a more complex model called the 'teacher'.
- Mechanism: the student can be trained at the same time as the teacher or after the teacher has been trained
- Real-world Application: DistilBERT, which is a compressed version of the BERT model that offers a 40% reduction in size while retaining 97% of BERT's capabilities
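A sketch of a typical distillation objective in PyTorch; the temperature T and mixing weight alpha are illustrative hyperparameters, not values from the slides:

```python
import torch.nn.functional as F

# The student matches the teacher's softened outputs (KL term) while also
# learning from the hard labels (cross-entropy term).
def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```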
Advantages and Limitations of Knowledge Distillation
- Advantage: distillation isn't tied to specific model architectures
- Efficiency with pretrained teachers: less training time and less data are needed
- Limitation: dependence on a teacher model; if none exists, one must be trained first, adding training time and data requirements
- Because of this sensitivity and dependence on teacher models, knowledge distillation isn't as widely adopted in production
Pruning
- Pruning originated with decision trees, where non-essential sections are removed; it has been adapted to neural networks
- Types of pruning in neural networks:
- Node pruning: removes entire nodes from the network, which changes the architecture and reduces the total number of parameters
- Parameter pruning: targets and zeroes out the least important parameters, which reduces the effective number of parameters without changing the architecture
- Pruning can induce sparsity, which reduces storage requirements and enhances computational performance during inference, giving faster response times (see the sketch below)
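A minimal sketch of parameter (magnitude) pruning using PyTorch's pruning utility on a single linear layer:

```python
import torch
import torch.nn.utils.prune as prune

# Zero out the 50% of weights with the smallest magnitude, inducing sparsity.
layer = torch.nn.Linear(256, 128)
prune.l1_unstructured(layer, name="weight", amount=0.5)

sparsity = (layer.weight == 0).float().mean()
print(f"fraction of zeroed weights: {sparsity:.2f}")
```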
Effectiveness of Pruning techniques
- Pruning can introduce biases into the model, impacting its decision-making; there are debates about whether the value lies in the pruned architecture itself or in the inherited weights
- Some results suggest the pruned structure should undergo retraining
- Zhu et al.'s research: the pruned model outperformed its dense counterpart after retraining
- The ML community acknowledges pruning's effectiveness; however, more work is needed
Quantization
- Compresses machine learning models by representing model parameters with fewer bits, reducing memory usage and potentially speeding up inference.
- Default Representation:
- Most systems represent floats using 32 bits (single-precision floating point); a model with 100M parameters then takes up approximately 400 MB.
- Types of Quantization:
- Half precision uses 16 bits, so the same 100M-parameter model takes about 200 MB.
- Fixed-point representations suit edge devices with memory constraints; the extreme case is binary weight neural networks.
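A sketch of the memory arithmetic above and a naive post-training int8 quantization of one tensor; production frameworks add calibration and per-channel scales:

```python
import numpy as np

# Memory arithmetic for a hypothetical 100M-parameter model.
n_params = 100_000_000
print(n_params * 4 / 1e6, "MB at float32")   # ~400 MB
print(n_params * 2 / 1e6, "MB at float16")   # ~200 MB

# Naive symmetric int8 quantization of one weight tensor.
w = np.random.randn(1024, 1024).astype(np.float32)
scale = np.abs(w).max() / 127
w_int8 = np.round(w / scale).astype(np.int8)    # stored with 4x less memory
w_dequant = w_int8.astype(np.float32) * scale   # approximate reconstruction at inference
```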
Benefits of Quantization
- Space Efficiency: Reduced storage requirements.
- Computation Efficiency: Operations on smaller bit widths can be faster.
- Larger Batch Sizes: More data can be processed at once due to the reduced model size.
Quantization - Challenges and Modern Approaches
Smaller bit widths mean a more limited range of representable values and more rounding error, which must be managed. Quantization in practice:
- Quantization-Aware Training vs. Post-training Quantization
- Industry support: hardware from NVIDIA (Tensor Cores) and Google (TPUs), along with frameworks that simplify the quantization process
ML on the Cloud and on the Edge
- Cloud Deployment:
- Infrastructure: Large data centers with vast computational resources.
- Use Case: Ideal for heavy computational tasks, initial model training, and managing large datasets.
- Edge Deployment:
- Infrastructure: Consumer devices with varying computational capabilities.
Cloud Deployment
- Scalability: resources can be rapidly scaled up or down based on need.
- Resource Pooling: resources are shared among multiple tenants.
- Automated Maintenance: regular backups and upkeep are typically managed by the provider.
- Challenges include costs and data privacy concerns, since data leaves the user's device and could be exposed.
Edge Computing
- Versatile: can function in remote or challenging environments.
- Provides immediate, local processing, and data stays on the device, which helps keep it secure.
- Challenges include limited compute and memory resources and the need for more management and updates.
Hardware for Edge Computing
- Tailored Hardware Design: Manufacturers are designing hardware tailored for ML operations.
- Google: Tensor Processing Units designed for ML tasks.
The Challenge of Model Deployment Across Diverse Hardware
- Model heterogeneity and hardware heterogeneity make this hard; the gap is bridged through model quantization and hardware accelerators.
Compiling & Optimizing Models for Edge Devices
- Support Dynamics: A synergy between model frameworks and hardware is paramount.
- Framework-Hardware Relationship: The dynamic is analogous to a software application needing specific OS support.
Hardware Landscape
- CPUs are great for tasks requiring sequential processing and are known for scalar computation.
- GPUs are best suited for parallelizable tasks and operate on one-dimensional vectors.
- TPUs are designed for tensor computations using two-dimensional vectors.
- Operations like convolutions differ based on the hardware's computation primitive
Challenge
- Framework-to-Hardware Mapping: It is cumbersome because of hardware-specific optimizations, different memory hierarchies, and divergent computational primitives.
Bridging with Intermediate Representations (IRs)
- Benefits: Simplifies the support process, making new hardware integrations more feasible.
Lowering Process
- Lowering proceeds in stages; high-level IRs are computation graphs that describe the model and serve as a blueprint for optimization.
Computation Graphs and Their Significance in ML
- Also essential for backpropagation in neural networks: they help in understanding data flow during gradient computations and in managing memory
Model Optimization
- Purpose is faster inference
- Achieves cost-effectiveness by balancing performance and accuracy on platforms such as CPUs and GPUs
- Role of computation graphs: they enable hardware-specific graph optimizations, bridging high-level model definitions and low-level implementations
Techniques for Model Efficiency
- Data locality, vectorization, and parallelization: process more data and more operations at the same time, often via auto-vectorization (see the sketch below).
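A toy illustration of vectorization in NumPy: the element-by-element Python loop and the single vectorized expression compute the same result, but the vectorized form processes the arrays in bulk.

```python
import numpy as np

a = np.random.randn(100_000)
b = np.random.randn(100_000)

# Scalar: one element at a time.
out_loop = np.empty_like(a)
for i in range(len(a)):
    out_loop[i] = a[i] * b[i]

# Vectorized: the library applies the operation over the whole arrays at once.
out_vec = a * b

assert np.allclose(out_loop, out_vec)
```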
Advanced Optimization Techniques
- Loop tiling and loop/operator fusion restructure how an operator's loops run, improving data locality and reducing overhead for more efficient computation.
Local vs. Global Optimization & Challenges
- Local optimization targets specific sections or operators, while global (end-to-end) optimization considers the entire computation graph, applying transformations such as pruning or fusion.
Future of Model Optimization
- Emerging trends will continue to improve on today's optimization approaches as ML evolves and moves onto edge devices.
Using ML to optimize ML models
- ML can be used to explore the space of possible ways to execute a computation graph, replacing manually written expert heuristics.
- Limitations remain: the variety of models and hardware is large, and this kind of optimization may not be readily available for every setup.
Exploration
- Exploratory approaches show that not all execution paths lead to efficient processing; only a limited few do.
cuDNN and autoTVM
- Both aim to optimize how model operations run on the underlying hardware.
- The broad goal is to keep the GPUs busy and running efficiently while adapting to the actual data.
autoTVM Process Overview
- autoTVM starts from the overall computation graph.
- For each part of the graph, it searches for the best execution path it can find.
- The best paths found are then combined so the whole graph runs efficiently on the actual data.
Trade-offs
- Performance trade-off: time spent searching up front is exchanged for optimal end results and performance.
- Use it when the best results are worth the optimization cost; the gains won't always carry over across every software and hardware combination.
ML in Browsers
- Benefits: models run seamlessly across devices through the browser, which decouples model deployment from specific hardware.
Common Misconception - JavaScript
- Running ML in the browser is often equated with JavaScript (e.g., TensorFlow.js), but JavaScript has performance restrictions for heavy computation.
Introducing WebAssembly
- A stack-based virtual machine integrated into modern browsers, offering versatility.
The Promise of WASM
- Better performance than JavaScript for heavy computation, enabling complex functionality in the browser.
- Broad global browser support, and code from many frameworks can be compiled to WASM.
Limitations of WASM
- Technical and hardware limitations mean the in-browser experience may still not match native app experiences.
Challenges
- Key choices remain: online vs. batch prediction, and cloud vs. edge inference.
- The hardware revolution is helping solve these challenges, and newer hardware makes better continual monitoring possible.