MLOps CI/CD - Data Versioning (PDF)
Jinen Daghrir
Summary
This document provides a comprehensive overview of MLOps practices, with a specific focus on Continuous Integration/Continuous Delivery (CI/CD) and data version control. It covers topics such as version control systems (e.g., Git), Data Version Control (DVC), data pipeline management, and the advantages of implementing CI/CD in machine learning workflows.
Full Transcript
MLOps: CI/CD Version Control Systems
Jinen Daghrir, PhD and data scientist @ Ubotica Technologies

MLOps Overview
- What’s CI/CD?
- Benefits of CI/CD
- Version Control in CI/CD
- Git: Version Control for code
- DVC: Data Version Control for Data/Models
- DVC mechanism
- DVC and Git

What’s CI/CD?

CI/CD is a method for delivering changes more frequently by automating the stages of development. In machine learning (ML), these stages differ from those of conventional software development: a model depends not only on the code but also on the data and hyperparameters, and deploying a model to production is more complex as well.

Continuous Integration: A development practice where developers integrate code changes into a shared repository frequently. Each integration is verified by automated builds and tests to catch issues early.

Continuous Delivery: Ensures that code changes are automatically prepared for deployment. Focuses on automating the release process to deliver applications reliably and frequently.

What’s CI/CD? Continuous Integration

Continuous Integration (CI) in machine learning ensures that every update to code or data automatically triggers the ML pipeline to rerun in a versioned and reproducible manner. This allows codebases to be shared across projects and teams. Each rerun may include training, testing, or generating reports, making it easy to compare different versions in production.

Examples of CI Workflows:
- Train and evaluate models for every commit to the repository.
- Compare experiment results for each Pull Request.
- Trigger periodic runs to update results or models.

Key Features of ML CI:
- Automatic validation of code format and data integrity (e.g., checking for NaN values or incorrect data types).
- Reproducible and shareable workflows for consistency across teams.

What’s CI/CD? Continuous Deployment

Continuous deployment is a method to automate the deployment of a new release to production, or to any other environment such as staging.
This practice makes it easier to receive user feedback, because changes ship faster and continuously, and to collect new data for retraining or for new models.

Key Features of ML CD:
- Automates deployment pipelines for efficiency and reliability.
- Ensures smooth updates to production environments with minimal downtime.

Examples of CD Workflows:
- Verify infrastructure requirements before deployment.
- Test model outputs with predefined inputs to ensure correctness.
- Perform load testing to evaluate model latency and scalability.

Benefits of CI/CD
- Improved Collaboration: Enables teams to work together by integrating code, data, and models in one workflow.
- Faster Iterations: Automates tasks like testing and deployment, speeding up updates.
- Enhanced Quality and Reliability: Detects errors quickly through automated testing so they can be fixed early.
- Reproducibility and Traceability: Tracks changes to ensure consistent and auditable results.
- Efficient Resource Utilization: Automates pipelines to save time and optimize resources.
- Scalable Deployment: Makes it easier to integrate updates into large production systems.
- Improved User Feedback and Adaptability: Delivers updates faster, enabling better feedback and adjustments.

Version Control in CI/CD

Role in CI/CD

Version control is an essential part of modern software development. It allows you to track changes, collaborate with team members, and maintain a history of your project’s evolution. Integrating version control into your CI/CD pipeline helps keep your code in a deployable state. It allows for continuous testing, building, and deployment, which means you can detect and fix issues early in the development cycle. This integration also promotes better collaboration among team members, as everyone has access to the latest code and can contribute without overwriting each other's work.

What’s Version Control?

Version control systems (VCS) are tools that help developers manage changes to source code over time. They keep track of every modification in a special kind of database. If a mistake is made, developers can go back and compare earlier versions of the code to help fix the mistake while minimizing disruption to all team members. Git is one of the most popular version control systems, widely used due to its distributed nature and powerful features.

Benefits of Version Control
- Improved Collaboration: Version control allows teams to work on the same codebase simultaneously without overwriting each other's changes.
- Enhanced Reliability: By tracking all changes, version control helps identify and fix issues early, providing a clear and organized change history for debugging.
- Faster Development: When integrated with CI/CD pipelines, version control triggers automated testing, building, and deployment, speeding up the development process.
- Reproducibility: Version control enables developers to recreate previous code versions, ensuring consistent results and simplifying debugging and audits.

Git: Version Control for code

Git is a powerful tool that helps developers track, manage, and collaborate on code changes, providing a clear history of modifications and supporting smooth teamwork.

Key Features of Git:
- Distributed System: Every developer has a full copy of the repository.
- Branching and Merging: Enables simultaneous work on multiple features.
- Strong Community and Integrations: Compatible with CI/CD tools like Jenkins, GitHub Actions, and GitLab CI.

The first step in setting up version control for CI/CD is to initialize a repository. This repository will serve as the central place where your code is stored and managed.
If you’re using Git, you can initialize a repository by navigating to your project directory and running:

    git init    # Initialize a repository

By committing your initial set of files, you create a baseline from which all future changes will be tracked:

    git add <file>             # Add a file to staging
    git commit -m "Message"    # Save changes
    git push                   # Push to remote repository

DVC: Data Version Control for Data/Models

Challenge in MLOps: Data is as important as code in ML projects, but it’s often too large to be managed effectively with traditional version control tools like Git.

Data Version Control: DVC is a tool for versioning large datasets and ML models. It integrates with Git to track data files and pipelines, and it stores data separately from code repositories, enabling efficient management. While Git tracks code, DVC tracks datasets, models, and experiments.

DVC mechanism
- Track Large Files: DVC tracks large data files by creating .dvc files, which store metadata, making it possible to version control large datasets efficiently without storing them directly in Git repositories.
- Separate Storage: DVC enables you to store large datasets and models in remote storage systems (e.g., Amazon S3, Google Cloud Storage, or Azure Blob Storage), reducing the burden on your Git repository and providing scalable storage.
- Version History: With DVC, you can version your datasets, allowing you to switch between different dataset versions easily, which is essential for reproducibility in machine learning experiments.
- Pipeline Management: DVC facilitates the creation of reproducible machine learning pipelines. You can track every stage of the pipeline (data preprocessing, training, etc.), ensuring consistency and traceability for experiments.

DVC and Git

Initialize DVC: Before using DVC, you need to initialize it in your project directory. This sets up the necessary DVC configuration files and prepares the project for data versioning.
    dvc init

Add a Data File to Track: Once DVC is initialized, you can track large data files by adding them to DVC. This creates a .dvc file containing metadata for the data file.

    dvc add data/raw_data.csv

Commit Changes in Git: After adding the data, commit the .dvc file to Git. This ensures that Git tracks changes in the metadata of the data file, while the actual data stays outside the Git repository.

    git add data/raw_data.csv.dvc data/.gitignore
    git commit -m "Add raw dataset"

Push Dataset to Remote Storage: To store the dataset in a remote location (e.g., S3), set up a remote storage location with DVC and push the data. This separates the data from the codebase but ensures it's still versioned and accessible.

    dvc remote add -d myremote s3://mybucket/path
    dvc push

Benefits of DVC
- Efficient Storage Management: DVC helps manage large datasets by keeping them out of Git and storing them in more scalable remote systems, so that your code repository remains lightweight.
- Easy Collaboration on Datasets: With DVC, teams can collaborate on datasets by tracking dataset versions, pushing and pulling data to/from remote storage, and making sure everyone uses the correct version of the data.
- Integration with CI/CD Pipelines: DVC fits into CI/CD pipelines so that dataset updates can trigger automatic pipeline runs, keeping the entire ML workflow consistent and reproducible.

By implementing DVC in your ML pipeline, you ensure that your code, models, and data are properly versioned, reproducible, and easy to manage, providing a robust foundation for collaboration and experimentation.

MLOps pipeline (end-to-end MLOps)

To be continued
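As a closing illustration of the DVC mechanism described above, the idea of versioning a small metadata file in place of a large data file can be sketched in a few lines of Python. This is only a rough sketch under stated assumptions: the helper names `file_md5` and `make_pointer` and the dictionary layout are hypothetical and are not DVC's API or its actual file format. The point is that a content hash changes whenever the data changes, so a tiny record is enough for Git to notice a new dataset version.

```python
import hashlib
import tempfile
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so a large dataset never has to fit in memory."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_pointer(path: Path) -> dict:
    """Build a small metadata record standing in for the data file.

    Hypothetical layout for illustration only, not DVC's real format.
    """
    return {
        "path": path.name,
        "size": path.stat().st_size,
        "md5": file_md5(path),
    }


with tempfile.TemporaryDirectory() as workdir:
    data_file = Path(workdir) / "raw_data.csv"

    # Version 1 of the dataset.
    data_file.write_bytes(b"id,value\n1,0.5\n2,0.7\n")
    pointer_v1 = make_pointer(data_file)

    # Version 2: one extra row, so the content hash must change.
    data_file.write_bytes(b"id,value\n1,0.5\n2,0.7\n3,0.9\n")
    pointer_v2 = make_pointer(data_file)

print(pointer_v1)
print(pointer_v2)
print("dataset changed:", pointer_v1["md5"] != pointer_v2["md5"])
```

Only the small record would be committed alongside the code, while the data itself would live in separate storage. This mirrors the division of labor between Git and DVC outlined in the slides.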