Questions and Answers
In the context of machine learning model deployment, what does 'production' primarily signify?
- The stage where the model is rigorously tested for edge cases and failure scenarios.
- The repository where the final model weights and architecture are stored for future use.
- The environment where the model is actively serving predictions to end-users. (correct)
- The phase where the model's code is refactored for optimal performance and readability.
Which of the following is NOT typically considered a key step in deploying a machine learning model?
- Containerizing the model and its dependencies.
- Deploying the container to a cloud platform.
- Exposing a prediction API endpoint.
- Conducting extensive hyperparameter tuning. (correct)
Which of the following best describes a significant challenge encountered specifically when deploying machine learning models at scale?
- Securing funding for the machine learning project from stakeholders.
- Developing the model architecture that achieves state-of-the-art accuracy.
- Ensuring the initial model training is computationally efficient.
- Maintaining millisecond latency and high uptime with millions of users. (correct)
What is a primary implication of understanding deployment considerations early in the machine learning lifecycle?
Which of the following statements accurately represents a common myth about machine learning deployment?
Why is the myth 'You Only Deploy One or Two ML Models at a Time' considered inaccurate in many industry applications?
What is the primary reason for the phenomenon described as 'model performance degradation' or 'bit rot' in machine learning deployment?
What is the implication of the reality 'Update Models Continuously' for machine learning operations?
Why is 'scale' a significant consideration even for ML engineers not working at FAANG-level companies?
Which of the following is a crucial aspect of 'preparing for scale' in machine learning systems?
In 'Batch prediction', predictions are generated:
Which feature type is exclusively used in 'Batch Prediction'?
What is a key characteristic of 'Online prediction' that distinguishes it from 'Batch prediction'?
In a food delivery platform scenario, how would 'Batch prediction' be typically applied?
What is a primary constraint of 'Batch prediction' systems regarding adaptability to evolving user behaviors?
What is the 'streaming prediction architecture' primarily designed to facilitate?
Which of the following is a significant advantage of 'Online Prediction' in terms of user experience?
What is the primary purpose of 'Model Compression' techniques in machine learning?
Which model compression technique is analogous to 'trimming a tree', removing unnecessary parts to simplify the structure?
How does 'Low-Rank Factorization' achieve model compression?
What is the core principle behind 'Knowledge Distillation' as a model compression technique?
Which model compression technique directly reduces the numerical precision of model parameters, for example, from 32-bit to 16-bit?
What is a primary challenge associated with 'Quantization' as a model compression method?
In the context of ML deployment locations, 'Edge Deployment' is best suited for scenarios that require:
Which of the following is a key advantage of 'Cloud Deployment' for machine learning models?
What is a primary challenge associated with 'Cloud Deployment' concerning data?
Why is 'Hardware Heterogeneity' a significant challenge in ML model deployment?
What is the role of 'Intermediate Representations (IR)' in bridging the gap between ML frameworks and diverse hardware?
What is the primary purpose of 'Computation Graphs' in the context of model optimization?
Which level of optimization considers the entire model or computation graph, rather than specific sections?
What is 'Vectorization' in model optimization techniques primarily aimed at achieving?
What is the main goal of 'Loop Tiling' (Blocking) as an advanced optimization technique?
Which of the following best describes the 'autoTVM' approach to model optimization?
What is a key benefit of deploying ML models directly in web browsers?
What is WebAssembly (WASM) primarily designed to address in the context of web-based ML applications?
What is a primary limitation of WebAssembly (WASM) in the context of ML applications, compared to native applications?
In the 'Summary' slide, what is identified as a key challenge regarding 'Online Prediction'?
According to the 'Summary', what is a primary concern related to 'Cloud Inference'?
What is emphasized as crucial 'Beyond Deployment' in the 'Summary' slide?
In the context of model deployment, what does the term 'Hardware Revolution' refer to, as mentioned in the summary?
What is the MOST critical factor that determines the choice between batch and online prediction?
Which of the following is a KEY difference between batch and streaming features in machine learning?
In the context of a food delivery platform, how would batch prediction be MOST effectively utilized alongside online prediction to enhance the user experience?
What is a PRIMARY challenge in batch prediction systems that makes them less adaptable to evolving user behaviors and preferences?
What is the PRIMARY aim of the 'streaming prediction architecture,' and how does it differ from traditional batch or online prediction systems?
In what scenario would 'Cloud Deployment' be the MOST advantageous location for deploying ML models, considering both its strengths and weaknesses?
How does 'Hardware Heterogeneity' present a challenge in ML model deployment, and what strategies can be employed to mitigate this issue?
What is the role of 'Intermediate Representations (IR)' in the ML model deployment pipeline, and why are they considered essential for addressing hardware heterogeneity?
How do 'Computation Graphs' contribute to the process of model optimization, particularly concerning the identification and utilization of parallelizable tasks?
To achieve faster inference times and better utilization of hardware resources during model optimization, what key balance MUST be achieved, and why is it crucial?
What is the primary goal of 'Vectorization' as a model optimization technique, and how does it contribute to improving computational efficiency?
How does the advanced optimization technique of 'Loop Tiling' (Blocking) enhance the performance of machine learning models, especially in scenarios with limited cache memory?
Considering the trend toward using ML to optimize ML models, what inherent limitations of traditional, human-driven optimization approaches are addressed by ML-based methods?
Within the context of model optimization and the 'Innovators' Dilemma,' why might a novel, cutting-edge model architecture struggle to achieve optimal performance during initial deployment?
How does autoTVM determine the best execution paths for a given model, and what is the role of 'dynamic learning' in refining this process over time?
What are the PRIMARY benefits of deploying ML models directly in web browsers, and how does this deployment strategy impact the accessibility and compatibility of these models?
What is the PRIMARY purpose of employing WebAssembly (WASM) in web-based ML applications, and how does it attempt to overcome the inherent limitations of JavaScript in such contexts?
What key areas does research cover when a student thinks critically about writing a praxis?
When creating a statement of the problem, what should be avoided (and what included)?
What format should thesis statements follow?
What are the limitations of WebAssembly (WASM)?
In the context of deploying ML models, what critical trade-off must be carefully balanced to ensure effective performance?
How do hardware-specific optimizations impact the deployment of ML models across various platforms, and what challenges do they introduce?
What is the primary role of 'Intermediate Representations (IR)' in the context of diverse hardware deployment, and how do they facilitate the optimization process?
How does the strategic unification of batch and streaming pipelines address the challenges of real-time data processing and what benefits does it provide in the context of ML model deployment?
How does 'autoTVM' leverage machine learning to enhance model optimization, and what limitations of traditional optimization approaches does it address?
In what ways can integrating ML models in web browsers affect accessibility and compatibility, and what are some key mechanisms that enable this integration?
How does WebAssembly (WASM) enhance the execution of ML applications in web browsers, and what limitations persist despite its advantages?
What factors should be considered when selecting a model compression technique for deployment in resource-constrained edge devices?
What are the major challenges with batch prediction?
Which option should be focused on during the praxis?
Flashcards
What is deploying your model?
Moving a model from the development environment to a production environment.
What is 'production' in ML?
Making a model accessible to end-users.
What does containerizing a model mean?
Packaging the model and its dependencies into a self-contained unit.
What is deploying a container to a cloud platform?
Running the packaged model on cloud infrastructure so it can serve requests.
What does exposing a prediction API endpoint mean?
Providing an API (e.g., REST over HTTP) that downstream applications call to get predictions.
Key decision for Prediction Type?
Whether the model serves predictions in batch or online (real time).
What is Batch Prediction?
Predictions generated periodically or on a trigger, stored, and retrieved on demand; not latency-critical.
What is Online Prediction?
Predictions generated immediately on request; synchronous and latency-sensitive.
What is an important reality of ML models?
Performance degrades after deployment, so models need continuous monitoring and updating.
What are Batch Features?
Features computed from historical data.
What are Streaming Features?
Features computed from real-time data.
Common perception of Online prediction?
That it is costlier and less performant; in reality, efficiency varies.
What are Adaptability Hurdles?
The difficulty batch systems have adapting quickly to evolving user behaviors.
What is Low-Rank Factorization?
Replacing high-dimensional tensors with lower-dimensional ones to reduce parameters.
What is Knowledge Distillation?
Training a lighter 'student' model to replicate the behavior of a larger 'teacher' model.
What is Pruning?
Removing unneeded weights or neurons to simplify a model.
What is Quantization?
Representing model parameters with fewer bits to reduce memory and compute requirements.
What is Localized Processing?
Running inference on the local device (edge) instead of sending data to the cloud.
What is the elegance of IR?
A single intermediate layer removes the need for direct framework-to-hardware mappings, simplifying new integrations.
What are High-Level IRs?
Representations closest to the original model code, typically computation graphs.
What are Computation Graphs?
Graphs whose nodes are operations and whose arrows show the flow of data and gradients.
Purpose of Model Optimization?
Making inference faster and hardware utilization better while preserving accuracy.
What is the role of Optimization Engineers?
Hand-tuning how models run on specific hardware.
What is Spatial Locality?
Memory locations near recently accessed data are likely to be accessed soon; caches exploit this.
What is Temporal Locality?
Recently accessed data is likely to be accessed again soon.
What is SIMD?
Single Instruction, Multiple Data: one instruction applied to many data elements at once.
What is Fine-grained (Task Parallelism)?
Splitting work into many small tasks that run in parallel.
What is Coarse-grained (Data Parallelism)?
Splitting data into large chunks that are processed in parallel.
What is Loop Tiling?
Splitting loops into blocks sized to fit the cache, improving data locality.
What is Global Support?
Study Notes
- The presentation welcomes students to SEAS Online at George Washington University.
Audio Settings
- Students are instructed to mute their audio to eliminate background noise.
- To speak, students should click the hand icon and unmute the microphone when called upon.
- Students should remember to mute themselves after speaking.
Chat and Recordings
- Students can type questions in the chat.
- Recordings of each class session are provided for registered students' private use only.
- Releasing the class recordings is strictly prohibited.
Fundamentals of AI-Enabled Systems
- SEAS 8500 is a course at George Washington University
- Week 6 focuses on model deployment and prediction services.
- John Fossaceca is the instructor.
- The slides are adapted from material in Designing Machine Learning Systems by Chip Huyen.
Agenda
- Topics include deploying your model, ML deployment myths, batch vs. online prediction, model compression methods (low-rank factorization, knowledge distillation, pruning, quantization), and ML on the cloud and on the edge.
Deploying Your Model
- Deployment involves moving a model from development to production.
- "Production" means making the model accessible to end-users.
- Key steps include containerizing the model and its dependencies, deploying the container to a cloud platform, and exposing a prediction API endpoint.
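- A minimal sketch of the end point of those steps, assuming a FastAPI service and a scikit-learn model persisted as model.joblib (all names here are illustrative, not from the lecture):

```python
# Minimal prediction-service sketch (FastAPI; the model file name is hypothetical).
import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("model.joblib")  # load once at startup, not per request

class PredictRequest(BaseModel):
    features: list[float]

@app.post("/predict")
def predict(req: PredictRequest):
    # scikit-learn models expect a 2-D array: one row per sample
    prediction = model.predict([req.features])
    return {"prediction": prediction.tolist()}
```

- Containerizing then amounts to packaging this service and its dependencies into an image and running it on a cloud platform.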
Challenges in Deployment
- Challenges include monitoring performance, updating models, scaling to handle traffic, ensuring reliability and uptime, managing costs, securing access, compliance and regulations, and understanding metrics.
Deploying at Scale
- Scaling involves exposing an API endpoint for model predictions and having downstream apps send requests.
- Getting basic deployment to work is straightforward, but challenges arise with millions of users, millisecond latency requirements, and 99% uptime expectations.
Operating at Scale
- Operation at scale includes monitoring, alerting for problems, debugging root causes, and updating models seamlessly.
- Responsibilities can fall on model developers or a separate deployment team, involving high communication overhead, slower model updating, and harder debugging.
ML Model Deployment
- Understanding deployment gives insight into model constraints.
- Models should be tailored based on intended use.
- Predictions can be served online in real-time or in batch.
- Deployment location, whether device (edge) or cloud, impacts design.
Deployment Considerations
- Key considerations include common deployment myths, online vs. batch prediction, edge vs. cloud deployment, infrastructure considerations, user experience factors, and background knowledge gaps.
Machine Learning Deployment Myths
- Common myths are that deployment is easy, models perform the same as in development, and models only need to be deployed once.
- The reality is that deployment is more involved, performance degrades, and models require continuous monitoring and updating.
- Debunking these myths helps set realistic expectations.
Myth 1: Deploying One or Two Models
- Academia focuses on single models, but real applications use many.
- Real-world applications require different features, separate models per country/region, and other segmentation.
- Ridesharing apps use many models for demand forecasting, ETA, pricing, fraud, churn, and country-specific models.
- This can add up to hundreds or thousands of models.
Reality: Many Models
- Uber has thousands of models in production.
- Google trains thousands of models concurrently with billions of parameters.
- Booking.com has over 150 models.
- 41% of large companies have over 100 models in production.
- Infrastructure must support many models in parallel.
- Deploying only one model in isolation is no longer viable.
Myth 2: Consistent Performance
- Software performance degrades over time ("bit rot").
- ML models suffer from data distribution shift.
- Differences exist between training and production data.
- Model accuracy declines after deployment.
- Ongoing monitoring and updating is essential.
- "Set and forget" models are not viable in production.
Myth 3: Infrequent Updates
- Models require frequent updates because performance degrades over time.
- Models should be updated as fast as possible.
- DevOps best practices should be followed for frequent updates.
- Etsy deployed 50x per day, Netflix 1000s per day, and AWS every 11 seconds in 2015.
Reality: Continuous Updates
- Many companies update monthly or quarterly, but leaders update faster.
- Weibo updates certain models every 10 minutes.
- Alibaba, ByteDance (TikTok) iterate rapidly.
- There is a trend towards continuous deployment for Machine Learning.
Myth 4: Limited Scale
- Scale varies, often meaning hundreds of queries per second or millions of users per month.
- The misconception is that scale is only a concern for large companies.
- Most ML jobs are at large companies of 100+ employees, so a typical ML role is likely to face questions of scale.
Preparing for Scale
- Industry ML jobs are likely at 100+ person companies.
- ML systems need scalability.
- Scale is no longer an exception.
- ML engineers should care about scale and apply scalable solutions upfront, as retrofitting later is difficult.
Batch vs. Online
- Key decision is how the model serves predictions.
- Batch prediction generates predictions asynchronously and is not latency-critical, using batch features.
- Online, real-time prediction requires the user to wait, is latency-sensitive, and uses streaming features.
- Factors include user experience needs, infrastructure constraints, throughput required, and data dependencies.
Batch Prediction Details
- Batch Prediction generates predictions periodically or on a trigger
- It is stored and retrieved on demand
- Also known as asynchronous or offline prediction, it's not latency-critical
- Useful for internal analytics
- Recommendations and segmentation computed nightly are examples of batch processing.
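- A minimal sketch of such a batch job: predictions are precomputed for every user so that serving becomes a simple lookup (the dict stands in for a real database; the model and names are illustrative):

```python
# Batch prediction sketch: precompute for all users, store, retrieve on demand.

class DummyModel:
    def predict(self, rows):
        return [sum(r) for r in rows]  # stand-in for a real trained model

def run_batch_job(model, all_user_features):
    # One prediction per user, computed on a schedule (e.g., nightly)
    return {uid: model.predict([f])[0] for uid, f in all_user_features.items()}

store = run_batch_job(DummyModel(), {"u1": [0.1, 0.9], "u2": [0.4, 0.2]})
print(store["u1"])  # at request time, serving is a lookup, not a model call
```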
Online Prediction Details
- Online prediction generates predictions immediately on request.
- It is also called on-demand, real-time, or synchronous prediction.
- The user must wait for the prediction, so it is latency-sensitive.
- Requests are sent via REST API (HTTP requests); this is common for user-facing apps.
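- On the client side, an online prediction request might look like this hedged sketch (the endpoint URL and payload are hypothetical, matching the service sketched earlier):

```python
# Online prediction sketch: a synchronous REST call the user waits on.
import requests

resp = requests.post(
    "http://localhost:8000/predict",
    json={"features": [0.2, 1.5, 3.1]},
    timeout=0.5,  # latency-sensitive: fail fast rather than hang the UI
)
print(resp.json())
```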
Key Differences
- Batch prediction happens periodically, such as every four hours.
- Online prediction occurs as soon as requests come.
- Batch processing is good for accumulating data when you don't need immediate results, like recommender systems.
- Online prediction works where predictions are needed as soon as a data sample is generated (such as fraud detection), delivering results at low latency.
Batch vs. Streaming Data
- Batch features are computed from historical data.
- Streaming features are computed from real-time data.
- Batch prediction uses only batch features.
- Online prediction can use both batch and streaming features.
- Delivery time estimation is an example: the restaurant's past prep time is a batch feature, while the current number of orders and the availability of delivery people are streaming features.
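- A minimal sketch of assembling one feature vector from both kinds of features for the delivery-time example (plain dicts stand in for a feature store and a stream processor; all numbers are made up):

```python
# Combining batch and streaming features for a single online prediction.

# Batch features: precomputed from historical data (e.g., nightly)
batch_features = {"restaurant_42": {"mean_prep_minutes": 12.5}}

# Streaming features: computed from real-time events
streaming_features = {"restaurant_42": {"open_orders": 7, "available_couriers": 3}}

def build_feature_vector(restaurant_id: str) -> list[float]:
    b = batch_features[restaurant_id]
    s = streaming_features[restaurant_id]
    return [b["mean_prep_minutes"], float(s["open_orders"]), float(s["available_couriers"])]

print(build_feature_vector("restaurant_42"))  # fed to the model at request time
```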
Streaming Prediction Architecture
- Streaming Prediction Architecture can combine both batch and streaming features.
- Another term is Streaming Prediction
- Retrieves batch features from databases and data warehouses.
- Streaming features are computed from real-time data.
- Batch prediction suits popular queries, while online prediction suits the long tail of queries.
Online vs. Batch Prediction - Use Cases
- Practical application (food delivery platforms): restaurant suggestions are curated in batch, while dish recommendations are generated online once a specific restaurant is selected.
- Common perception: online prediction costs more and performs worse; in reality, efficiency varies.
- Out of 31 million Grubhub users, only 622,000 placed daily orders; predicting for all of them would squander 98% of computational resources.
From Batch to Online Prediction
- The appropriate type of prediction depends on the decisions being made and how users experience those decisions.
- Batch prediction and online prediction each have pros and cons, described below.
- The purpose is to see how each fits into the larger machine learning picture.
Online vs. Batch Prediction - Trade Offs
- Online prediction pros: instantaneous insight; intuitive for academics and researchers; fast prototyping.
- Online prediction trade-offs: cost and latency, though cloud solutions like Amazon SageMaker and Google App Engine readily expose endpoints for real-time predictions.
- Batch prediction pros: predict now, use later; efficient at scale; low-latency retrieval.
- Batch prediction trade-offs: predictions are computed beforehand, so they can go stale between runs, even though retrieval is often quicker than real-time computation.
Constraints of Batch Prediction
- Adaptability hurdles: batch systems struggle to adapt swiftly to evolving user behaviors; if a user's recent viewing shifts genres, recommendations may remain static.
- Challenges with dynamic queries: real-time translation services, for instance, cannot precompute every conceivable sentence or phrase.
- Instantaneous reactions are required in high-frequency trading and instant fraud detection.
Towards Enhanced Online Prediction
- Hardware and algorithmic innovations are making online prediction faster and more cost-effective on its journey to becoming the standard.
- Strategic corporate investments: many companies are pivoting towards online prediction to improve user experience and decision accuracy.
- Latency must be counteracted in real time with a real-time processing pipeline and rapid-response models.
Unifying Batch Pipeline & Streaming Pipeline
- Batch processing dates from the earlier days of computing, when historically dominant tools enabled efficient periodic processing of large datasets for machine learning.
- Real-time responsiveness now demands streaming as well, so organizations often run separate pipelines, and the streaming architecture has become indispensable.
- Navigation is a concrete example: Google Maps predicts route timings from historical data and adjusts them in real time as conditions change.
- Major challenge: running separate pipelines can lead to maintenance overhead and inconsistencies.
Dual Nature of Real-Time Processing
- A dynamic model of a platform's arrival estimates can be continually corrected along the way with real-time data.
- Feature capture must be considered when leveraging historical data for model training.
- The dual nature of driving speeds illustrates this: batch features capture historical patterns, while real-time features are critical for up-to-the-minute traffic.
- Implementations must watch for code bugs; sharing code, collaborating more, and testing more all help.
Integrating Stream & Batch Processing
- The goal is cohesive integration of stream and batch processing, using shared transformations to bridge the gap.
- Feature stores help avoid discrepancies by keeping features uniform across pipelines.
- Benefit: businesses can scale their ML models while remaining adaptive.
Data Pipelines
- Data pipelines are designed to stream data for research and are also used for inference.
Model Compression
- The goal is real-time machine learning in applications, but large models can be slow; the trade-off is between accuracy and speed.
- Strategies include faster hardware, model compression, and accepting some reduction in accuracy.
Model Compression Techniques
- Low-rank factorization (simplifies weight matrices)
- Knowledge distillation (a 'student' model learns from a 'teacher' model)
- Pruning (removes unneeded weights/neurons)
- Quantization (reduces numerical precision and memory requirements)
Low-Rank Factorization
- With high-dimensional tensors, it is important to filter out redundancy; the technique replaces them with lower-dimensional tensors.
- This helps with both speed and memory.
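- A hedged sketch of the idea using truncated SVD on a single weight matrix (numpy; the matrix size and rank are illustrative):

```python
# Low-rank factorization sketch: approximate W (n x n) by A (n x k) @ B (k x n).
import numpy as np

W = np.random.randn(512, 512)            # original dense weight matrix
U, S, Vt = np.linalg.svd(W, full_matrices=False)

k = 64                                    # chosen rank: compression vs. accuracy
A = U[:, :k] * S[:k]                      # 512 x 64
B = Vt[:k, :]                             # 64 x 512

# W @ x costs 512*512 multiplies; A @ (B @ x) costs only 2*512*64.
x = np.random.randn(512)
error = np.linalg.norm(W @ x - A @ (B @ x)) / np.linalg.norm(W @ x)
print(f"relative error at rank {k}: {error:.3f}")
```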
Factorization
- The idea is to avoid over-parameterization by using compact convolutional filters; replacing larger filters with smaller ones drastically reduces the number of parameters.
Case Studies: Models with Fewer Parameters
- SqueezeNet reaches AlexNet-level accuracy with far fewer parameters.
- MobileNets break standard convolutions into depthwise and pointwise convolutions.
Knowledge Distillation
- A lighter 'student' model is trained to replicate the behavior of a more complex 'teacher' model.
- Teacher and student can even be trained at the same time.
- For example, DistilBERT retains about 97% of BERT's language-understanding performance.
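- A minimal sketch of the standard distillation loss, blending soft teacher targets with hard labels (PyTorch; the temperature and weighting are illustrative hyperparameters):

```python
# Knowledge-distillation loss sketch.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: the student matches the teacher's softened distribution
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: ordinary cross-entropy on the true labels
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

student_logits, teacher_logits = torch.randn(4, 10), torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
print(distillation_loss(student_logits, teacher_logits, labels))
```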
Benefits and Drawbacks
- Distillation is flexible: the teacher and student can have different architectures. However, it depends on having a suitable teacher model and can perform poorly with certain model designs, which is why it is not more widely used.
Sparsity
- Sparsity is induced through parameter and node-wise pruning; in summary, pruning creates computational efficiency in model structures.
Pruning Effectiveness
- Pruning is effective, although it can induce biases; whether to prune individual weights or architectural elements is debatable.
- Models can also be retrained to recover accuracy lost to pruning, using various tests and methods.
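- PyTorch ships pruning utilities; a minimal sketch follows (the layer size and the 30% ratio are illustrative):

```python
# Magnitude-pruning sketch: zero out the smallest 30% of a layer's weights.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(128, 64)
prune.l1_unstructured(layer, name="weight", amount=0.3)

sparsity = (layer.weight == 0).float().mean().item()
print(f"sparsity: {sparsity:.0%}")

prune.remove(layer, "weight")  # make the pruning permanent (drop the mask)
```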
Quantization
- A method to compress ML models by representing parameters with fewer bits while still representing the data well; this reduces the memory required.
- Systems store parameters as 32-bit floats by default; quantized variants use fewer bits (e.g., 16-bit floats or 8-bit integers), saving power and increasing efficiency.
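- A minimal sketch of post-training dynamic quantization in PyTorch (the model is a throwaway example; qint8 stores the linear-layer weights as 8-bit integers):

```python
# Dynamic quantization sketch: 32-bit float weights -> 8-bit integer weights.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 10))
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)
x = torch.randn(1, 256)
print(quantized(x).shape)  # same interface, smaller weights
```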
Challenges in Quantization
- The problem is that fewer bits represent fewer distinct values, so precision can be lost at inference and during training; NVIDIA GPUs and Google TPUs, however, provide dedicated support that alleviates these concerns.
- Industry tooling must therefore let developers take advantage of quantization with minimal code, especially on mobile devices.
Edge versus Cloud
- The choice is where computation runs: cloud hardware or local machines. The difference shows up in performance as well as in internet-connectivity issues.
Cloud vs. edge advantages
- Cloud advantages: scalability, resource pooling, and maintenance.
- Edge advantage: data locality is essential.
Hardware at the Edge
- Tailored hardware design is essential, and manufacturers are looking ahead to ML workloads.
- Google's Tensor Processing Units, Apple's Neural Engine, and many startups all design with energy efficiency in mind.
Challenges
- The challenge across diverse hardware stems from differences in what each platform accelerates; the way forward is uniform representations paired with hardware-optimized backends.
Hardware and Software Mapping
- Software synergy is paramount; it is analogous to an application needing operating-system support, and the two should work in harmony.
- The challenge is getting different code components to work together in an efficient environment.
Process Bridging
- What's wrong with direct mapping? Multiple frameworks and many hardware backends make pairwise compatibility impractical.
- The solution is an intermediary between all frameworks and backends, which simplifies new integrations and makes them feasible.
- Models are lowered from high-level to low-level representations so the code runs natively on the hardware, going from abstract computation to the actual blueprints for execution.
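- A concrete instance of this bridging is exporting a model to ONNX, a widely used intermediate representation (the model here is a throwaway example):

```python
# IR sketch: export a PyTorch model to ONNX; backends compile it from there.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 8), nn.ReLU(), nn.Linear(8, 2))
dummy_input = torch.randn(1, 16)
torch.onnx.export(model, dummy_input, "model.onnx")
# Hardware-specific runtimes (ONNX Runtime, TensorRT, ...) consume this one
# representation instead of needing a direct mapping from every framework.
```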
Graphs and Visuals
- Computation graphs are drawn with arrows that help in understanding the flow of data and gradients.
Model Optimization
- It is about speed versus accuracy; achieving strong performance through the right balance is critical.
Bridging Code Components
- Code components must bridge together, and the network must be adapted to the target hardware.
Scalability
- To save power and gain speed, it is essential to vectorize and parallelize computation.
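- A minimal sketch of what vectorization buys (numpy; the array size is arbitrary):

```python
# Vectorization sketch: replace a Python-level loop with one array operation.
import numpy as np

x = np.random.randn(1_000_000)
w = np.random.randn(1_000_000)

# Scalar loop: one multiply per iteration, heavy interpreter overhead
total = 0.0
for i in range(len(x)):
    total += x[i] * w[i]

# Vectorized: a single SIMD-friendly call executed in optimized native code
total_vec = float(np.dot(x, w))
print(np.isclose(total, total_vec))
```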
Scaling Efficiency
- Fusing loops and operator sets together is a powerful optimization, avoiding redundant passes over the data.
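- Related to loop-level optimization, the lecture also covers loop tiling (blocking); a hedged sketch of its structure on a matrix multiply (pure numpy, tile size illustrative):

```python
# Loop-tiling sketch: process cache-sized blocks so operands stay cache-resident.
import numpy as np

def tiled_matmul(A, B, tile=64):
    n, k = A.shape
    _, m = B.shape
    C = np.zeros((n, m))
    for i in range(0, n, tile):
        for j in range(0, m, tile):
            for p in range(0, k, tile):
                # Each small block is reused while still hot in cache
                C[i:i+tile, j:j+tile] += A[i:i+tile, p:p+tile] @ B[p:p+tile, j:j+tile]
    return C

A, B = np.random.randn(256, 256), np.random.randn(256, 256)
print(np.allclose(tiled_matmul(A, B), A @ B))
```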
Scaling Efficiency Challenges
- Local, operator-level optimizations must be weighed against whole-graph scaling to avoid restrictions imposed by a poor architecture.
Optimization challenges
- Optimization is a complex task, especially for AI systems that may need to perform real-time processing.
Optimization
- Optimization has traditionally relied on human expertise.
- The issue is that hand-tuned libraries such as cuDNN optimize only parts of the computation, not entire models.
autoTVM
- autoTVM takes a broader scope, using machine learning to adapt its optimization strategies rather than relying on fixed, hand-written rules.
- It works on each subgraph, trying various candidate schedules and learning which execution paths make the code perform best.
Code Efficiency
- There are trade-offs to account for in compilation: the optimization search takes time up front, but the resulting code runs more efficiently.
ML in the Browser
- Having the model run in the browser gives ease of access and device independence.
- JavaScript libraries such as TensorFlow.js and Synaptic make this possible.
Code limitation - JavaScript
- JavaScript, although popular, does not run heavy numerical code effectively.
It's All About WASM Now
- WASM is designed as a fast, stack-based execution format that everyone can access.
WASM Benefits
- It runs code faster and works alongside components like JavaScript.
WASM Drawbacks
- It still has constraints, and some features are still being developed.
Putting It All Together
- Be strategic about deployment choices, and verify in practice that they work.