Unit 2 UNDERSTANDING THE ML PROBLEMS AND PIPELINES.pdf

Full Transcript

Unit 2 UNDERSTANDING THE ML PROBLEMS AND PIPELINES

PROBLEM STATEMENT: A data science team at a cargo shipping company in Finland has been assigned to save 20% of the costs for cargo operations at the port of Turku. This will be achieved by developing a machine learning (ML) solution to predict weather conditions at the port 4 hours in advance. Monitoring for possible rainy conditions is crucial because rain can disrupt operations involving human resources and transportation, thereby affecting supply chain operations at the port. The ML solution will help port authorities predict possible rain 4 hours in advance, saving 20% of costs and ensuring smooth supply chain operations.

The problem should be simplified and categorized using an appropriate approach. In the previous chapter, the categorization of a business problem for solving with ML was discussed. These principles will be applied to chart a clear roadmap for implementation. First, the type of model to be trained to yield maximum business value will be identified. Second, the right approach for MLOps implementation will be determined.

To decide on the type of model to train, the dataset available on GitHub (https://github.com/PacktPublishing/EngineeringMLOps) will be examined. A snapshot of the weather_dataset_raw.csv file is shown in Figure 3.1. The file is 10.7 MB in size and contains 96,453 rows in CSV format.

Figure 3.1 – Dataset snapshot

By assessing the data, the business problem can be categorized as follows:

Model type: To save 20% of operational costs at the port of Turku, a supervised learning model is required that classifies whether it will rain or not. The data is labeled, and the Weather condition column indicates whether an event has recorded rain, snow, or clear conditions. This can be relabeled as rain or no rain and used for binary classification, making it straightforward to solve the business problem with a supervised learning approach.
MLOps approach: Observations from the problem statement and data are:
o Data: The training data size is 10.7 MB, which is reasonably small (not considered big data).
o Operations: The task is to train, test, deploy, and monitor an ML model to forecast weather at the port of Turku every hour (4 hours in advance) when new data is recorded.
o Team size: The team consists of a small/medium group of data scientists, with no DevOps engineers.

Based on these facts, the operations can be categorized as small team ops; there is no need for big data processing, and the team is small and agile. Suitable tools will now be examined to implement the operations needed to solve the business problem.

To gain a holistic understanding of MLOps implementation, the business problem will be implemented using two different tools simultaneously:
o Azure Machine Learning (Microsoft Azure)
o MLflow (an open-source cloud and platform-agnostic tool)

These two tools will be used to see how things work from a pure cloud-based approach and from an open-source/cloud-agnostic approach. All the code and CI/CD operations will be managed and orchestrated using Azure DevOps, as shown in Figure 3.2.

Figure 3.2 – MLOps tools for the solution

The tools and resources needed to implement the solution for the business problem will now be set up. Python will be the primary programming language used, so having Python 3 installed on your Mac, Linux, or Windows OS is a prerequisite.

Setting up the resources and tools

If the tools are already installed and set up on your PC, you can skip this section. Otherwise, follow the detailed instructions to get them up and running.

Installing MLflow

MLflow, an open-source platform for managing the ML life cycle (including experimentation, reproducibility, deployment, and a central model registry), will be installed first.
To install MLflow, execute the following command in your terminal:

    pip3 install mlflow

After successful installation, test it by running the following command to start the MLflow tracking UI:

    mlflow ui

The MLflow tracking UI will run a server listening at port 5000 on your machine, and it will output a message like:

    [2021-03-11 14:34:23 +0200] [INFO] Starting gunicorn 20.0.4
    [2021-03-11 14:34:23 +0200] [INFO] Listening at: http://127.0.0.1:5000 (43819)
    [2021-03-11 14:34:23 +0200] [INFO] Using worker: sync
    [2021-03-11 14:34:23 +0200] [INFO] Booting worker with pid: 43821

Access and view the MLflow UI at http://localhost:5000. Once MLflow is successfully installed and the tracking UI is running, proceed to install the next tool.

Azure Machine Learning

Azure Machine Learning provides a cloud-based ML platform for training, deploying, and managing ML models. This service is available on Microsoft Azure, so creating a free subscription to Microsoft Azure with around $170 of credit is required to implement the solution: https://azure.microsoft.com/. After accessing/subscribing to Azure, the next step is setting up Azure Machine Learning.

Creating a resource group

A resource group, a collection of related resources for an Azure solution, needs to be created. It is a container that ties together all the resources related to a service or solution. Creating a resource group enables easy access and management of a solution.

1. Open the Azure portal.
2. Access the portal menu (go to the portal's home page if not there by default) and hover over the resource group icon in the navigation section. Click the Create button to create a new resource group:

Figure 3.3 – Creating a resource group

3. Create a resource group with the name of your choice (Learn_MLOps is recommended), as shown in Figure 3.3.
4. Select a region close to you to get the optimal performance and pricing.
For example, in Figure 3.3, a resource group named Learn_MLOps with region (Europe) North Europe is ready to be created. After you click the Review + Create button and Azure validates the request, the final Create button will appear. Press the final Create button to create the new resource group.

After reviewing and creating the resource group, all services related to the ML solution can be set up and managed within this resource group. The newly created resource group will be listed in the resource group list.

Creating an Azure Machine Learning workspace

An ML workspace, a central hub for tracking and managing ML training, deployment, and monitoring experiments, needs to be created. Go to the Azure portal menu, click on Create a resource, then search for Machine Learning and select it. The following screen will appear:

Figure 3.4 – Creating an Azure Machine Learning workspace

Name the workspace with the name of your choice (e.g., MLOps_WS in Figure 3.4). Select the resource group created earlier (Learn_MLOps in Figure 3.4). Finally, hit the Review + create button, then the final Create button on the new screen to create the Azure Machine Learning workspace.

After creating the Azure Machine Learning workspace (MLOps_WS), the Azure platform will deploy all the resources needed for this service, such as Blob Storage, Key Vault, and Application Insights. These resources will be consumed or used via the workspace and the SDK.

Installing Azure Machine Learning SDK

Install the Azure Machine Learning SDK, extensively used in the code to orchestrate the experiment, by running the following command in the terminal or command line on your PC:

    pip3 install --upgrade azureml-sdk

Azure DevOps

All source code and CI/CD-related operations will be managed and orchestrated using Azure DevOps. The code managed in the Azure DevOps repository will be used to train, deploy, and monitor ML models enabled by CI/CD pipelines.

Creating an Azure DevOps subscription:

1.
Create a free account at dev.azure.com using a pre-existing Microsoft or GitHub account. 2. Create a project named Learn_MLOps (make it public or private depending on your preference). 3. Go to the repos section. In the Import a repository section, press the Import button. 4. Import a repository from a public GitHub project from this repository: https://github.com/PacktPublishing/EngineeringMLOps (as shown in Figure 3.5): Figure 3.5 – Import the GitHub repository into the Azure DevOps project After importing the GitHub repository, files from the imported repository will be displayed. JupyterHub Lastly, an interactive data analysis and visualization tool (JupyterHub) will be needed to process data using the code. This common data science tool is widely used by data scientists to process data, visualize data, and train ML models. Installing JupyterHub: 1. Install JupyterHub via the command line on your PC: shell python3 -m pip install jupyterhub 2. Install Anaconda, which is needed as it installs dependencies, sets up environments, and services to support JupyterHub. Download and install Anaconda as per the detailed instructions here: https://docs.anaconda.com/anaconda/install/. 10 PRINCIPLES OF SOURCE CODE MANAGEMENT FOR ML Here are 10 principles that should be applied to ensure the quality, robustness, and scalability of code: Modularity: Modular code is preferred over a large single block. Reusability is encouraged and upgrading is facilitated by replacing required components. To avoid unnecessary complexity and repetition, this golden rule should be followed: Two or more ML components should be paired only when one uses the other. If none of them use each other, pairing should be avoided. An ML component that is not tightly paired with its environment can be more easily modified or replaced than a tightly paired component. Single Task Dedicated Functions: Functions are important building blocks of pipelines and systems and are used to perform specific tasks. 
Repetition of commands should be avoided and reusable code enabled. Complex sets of commands should be avoided for tasks. For readability and reusability, it is more efficient to have a single function dedicated to a single task rather than multiple tasks. Multiple functions should be preferred over one long and complex function. Structuring: Functions, classes, and statements should be structured in a readable, modular, and concise form. Errors such as Error 300 should be avoided. Structuring blocks of code and limiting the maximum levels of indentation for functions and classes can improve readability. Clean Code: If code needs to be explained, it is not considered good. Clean code is self-explanatory and focuses on high readability, optimal modularity, reusability, non- repeatability, and optimal performance. The cost of maintaining and upgrading ML pipelines is reduced by clean code. Efficient team performance and extension to other developers are enabled by clean code. Testing: Robustness of a system should be ensured through testing. Unit testing and acceptance testing are generally extended. Components of source code are tested for robustness with coerced data and usage methods in unit testing to determine if the component is fit for the production system. The overall system is tested in acceptance tests to ensure user requirements are met, and end-to-end business flows are verified in real-time scenarios. Testing is essential for efficient code performance: "if it isn't tested, it is broken." For more on unit testing, refer to the documentation: Version Control (Code, Data, and Models): Git is used for version control of code in ML systems. Version control ensures that all team members have access to up-to-date code and that code is not lost during hardware failures. A rule of working with Git should be to not break the master (branch). 
New features or improvements should be added in a feature branch and merged to the master branch when the code is working and reviewed. Branches should have short descriptive names, such as feature/label- encoder. Branch naming and approval guidelines should be communicated and agreed upon with the team to avoid complexity and unnecessary conflicts. Code reviews should be done with pull requests to the code repository. Code is best reviewed in small sets, usually less than 400 lines, often meaning one module or submodule at a time. Versioning of data is crucial for tracking which data was used for a particular version of code to generate a model. Reproducing models and compliance with business needs and laws can be enabled by versioning data. Backtracking to see the reason for certain actions taken by the ML system is possible. Similarly, versioning of models (artifacts) is important for tracking which model version generated specific results or actions. Parameters used for training a model version can also be tracked or logged. End-to-end traceability for model artifacts, data, and code can be enabled. Transparency and efficiency for developers and maintainers are enhanced by version control for code, data, and models. Logging: In production, a logger is useful for monitoring and identifying important information. Print statements are good for testing and debugging but are not ideal for production. The logger contains useful information, including system information, warnings, and errors, which are valuable for monitoring production systems. Error Handling: Handling edge cases, especially those hard to anticipate, is vital. Exceptions should be caught and handled even if not immediately necessary, as prevention is better than cure. Combining logging with exception handling can be an effective way to manage edge cases. Readability: Code readability enables information transfer, code efficiency, and maintainability. 
This can be achieved by following industry-standard coding practices such as PEP 8 or the JavaScript standard style. Readability is also improved by using docstrings. A docstring is a text written at the beginning of, for example, a function, describing what it does and what it takes as input. For simple explanations, a one-liner can be used:

    def swap(a, b):
        """Swaps the variables a and b. Returns the swapped variables"""
        return b, a

A longer docstring is needed for more complex functions, explaining arguments and returns:

    def function_with_types_in_docstring(param1, param2):
        """Example function with types documented in the docstring.

        PEP 484 type annotations are supported. If attribute, parameter, and
        return types are annotated according to PEP 484, they do not need to be
        included in the docstring:

        Args:
            param1 (int): The first parameter.
            param2 (str): The second parameter.

        Returns:
            bool: The return value. True for success, False otherwise.
        """

Commenting and Documenting: Commenting and documenting are vital for maintaining sustainable code. Explaining code clearly is not always possible, so comments can help prevent confusion and clarify the code. Comments can convey information such as copyright info, intent, code clarification, warnings, and elaboration. Detailed documentation of the system and modules can improve team efficiency and extend the code and assets to other developers. Open source tools for documenting APIs, such as Swagger and Read the Docs, can enable efficiency and standardize knowledge for developers.

Good Data for ML

Good ML models result from training on high-quality data. Before ML training, good-quality data is a prerequisite. Data should be processed to improve its quality. The quality of data should be determined based on the following five characteristics:

Accuracy: Accuracy is crucial for data quality, as inaccurate data can lead to poor ML model performance and real-life consequences.
To check data accuracy, confirm if the information represents a real-life situation. Completeness: Incomplete information is often unusable and can lead to incorrect outcomes when an ML model is trained on it. Checking the comprehensiveness of the data is essential. Reliability: Contradictions or duplications in data can lead to unreliability. Trusting data is essential, especially for making real-life decisions with ML. Data reliability can be assessed by examining bias and distribution. Extreme cases may indicate that the data is not reliable for ML training or may carry bias. Relevance: The relevance of data is important for contextualizing and determining if irrelevant information is being gathered. Relevant data enables appropriate decisions in real-life contexts using ML. Timeliness: Obsolete or outdated information costs time and money. Up-to-date information is crucial in some cases and can improve data quality. ML decisions based on outdated data can be costly and lead to incorrect decisions. Maximizing these five characteristics ensures the highest data quality. DATA PREPROCESSING Raw data cannot be directly passed to the ML model for training purposes. The data needs to be refined or preprocessed before it can be used for training the ML model. A series of steps will be performed to preprocess the imported data into a suitable shape for ML training. The quality of the data will be assessed to check for accuracy, completeness, reliability, relevance, and timeliness. Next, the required data will be calibrated, and text will be encoded into numerical data, which is ideal for ML training. Finally, correlations and time series will be analyzed, and irrelevant data will be filtered out for training ML models. Data Quality Assessment To assess the quality of the data, accuracy, completeness, reliability, relevance, and timeliness will be examined. 
First, completeness and reliability of the data will be checked by assessing formats, cumulative statistics, and anomalies such as missing data. The pandas functions will be used as follows:

    df.describe()

The describe function will provide descriptive statistics as output:

Figure 3.7 – Descriptive statistics of the DataFrame

The statistics show that the data is coherent and relevant, as it depicts real-life values such as a mean temperature of ~11°C and a wind speed of ~10 km/h. The minimum temperature is around -21°C, and the average visibility is 10 km. These facts reflect the relevance of the data and the conditions at its origin in Finland.

Next, column formats will be checked:

    df.dtypes

The formats of each column are:

    S_No: int64
    Timestamp: object
    Location: object
    Temperature_C: float64
    Apparent_Temperature_C: float64
    Humidity: float64
    Wind_speed_kmph: float64
    Wind_bearing_degrees: int64
    Visibility_km: float64
    Pressure_millibars: float64
    Weather_conditions: object

Most columns are numerical (float and int), as expected. The Timestamp column is in object format and needs to be changed to DateTime format:

    df['Timestamp'] = pd.to_datetime(df['Timestamp'])

Using pandas' to_datetime function, Timestamp will be converted to DateTime format. Next, any null values will be checked using pandas' isnull function:

    df.isnull().values.any()

If null values are discovered, calibration of missing data will be essential.

Calibrating Missing Data

Missing values in the data indicate poor data quality. Various techniques can be used to replace missing data without compromising correctness and reliability. Based on the missing values observed in the data, the forward fill method will be used to handle missing data:

    df['Weather_conditions'].fillna(method='ffill', inplace=True, axis=0)

NaN or null values are observed only in the Weather_conditions column.
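To make the forward fill idea concrete, here is a minimal pure-Python sketch of the behaviour. It is an illustration only, not the pandas implementation: the function name forward_fill and the sample observations are hypothetical, and None stands in for NaN.

```python
from typing import List, Optional


def forward_fill(values: List[Optional[str]]) -> List[Optional[str]]:
    """Replace each missing entry (None) with the last observed value."""
    filled: List[Optional[str]] = []
    last_seen: Optional[str] = None  # nothing observed yet, so leading gaps stay None
    for value in values:
        if value is not None:
            last_seen = value
        filled.append(last_seen)
    return filled


# Hypothetical hourly observations with gaps (None marks a missing reading).
observations = ["rain", None, None, "clear", None]
filled_observations = forward_fill(observations)
print(filled_observations)
```

A gap at the very start of the series has no earlier value to copy, so it stays missing in this sketch; that case would need a separate decision, such as dropping those rows.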
NaN values will be replaced using the fillna() method from pandas and the forward fill (ffill) method. Since weather is progressive, it is likely to replicate the previous event. Therefore, the forward fill method will be used to replicate the last observed non-null value until another non-null value is encountered.

Label Encoding

As machines do not understand human language or text, all text needs to be converted into numbers. The Weather_conditions column, which contains labels such as rain, snow, and clear, will be processed. These values are identified using pandas' value_counts() function:

    df['Weather_conditions'].value_counts()

The Weather_conditions column will be simplified by categorizing it into two labels: rain and no_rain. This will enable solving the business problem for the cargo company:

    df['Weather_conditions'].replace({"snow": "no_rain", "clear": "no_rain"}, inplace=True)

Snow and clear values will be replaced with no_rain, as both conditions imply no rain at the port. Labels will be converted into a machine-readable form or numbers using label encoding. Label encoding will convert categorical values into numbers by assigning each category a unique value. With only two categories (rain and no_rain), label encoding will be efficient, converting these values to 0 and 1. For more than two values, one-hot encoding is recommended to avoid numerical bias during training. One-hot encoding prevents bias by ensuring equal treatment of each categorical variable. Label encoding will be performed using scikit-learn as follows:

    from sklearn.preprocessing import LabelEncoder
    le = LabelEncoder()
    y = df['Weather_conditions']
    y = le.fit_transform(y)

The LabelEncoder class will be imported to encode the Weather_conditions column into 0s and 1s using the fit_transform() method.
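The difference between label encoding and one-hot encoding can be sketched in plain Python. This is a simplified illustration, not scikit-learn's implementation: the helper names label_encode and one_hot_encode are hypothetical, and the sorted ordering of classes is chosen here to mirror how LabelEncoder orders its classes.

```python
from typing import Dict, List


def label_encode(labels: List[str]) -> List[int]:
    """Map each distinct label to an integer, in sorted label order."""
    classes = sorted(set(labels))
    index: Dict[str, int] = {label: i for i, label in enumerate(classes)}
    return [index[label] for label in labels]


def one_hot_encode(labels: List[str]) -> List[List[int]]:
    """Give each distinct label its own 0/1 column, in sorted label order."""
    classes = sorted(set(labels))
    return [[1 if label == c else 0 for c in classes] for label in labels]


# Two categories: a single 0/1 column carries all the information.
binary_labels = ["rain", "no_rain", "no_rain", "rain"]
encoded = label_encode(binary_labels)
print(encoded)

# Three categories: one column per category avoids implying clear < rain < snow.
multi_labels = ["rain", "snow", "clear"]
one_hot = one_hot_encode(multi_labels)
print(one_hot)
```

With two categories the integer codes carry no misleading order, which is why label encoding is sufficient for the rain/no_rain target.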
The previous textual column will be replaced with the label-encoded or machine-readable Current_weather_condition column as follows:

    y = pd.DataFrame(data=y, columns=["Current_weather_condition"])
    df = pd.concat([df, y], axis=1)
    df.drop(['Weather_conditions'], axis=1, inplace=True)

The new label-encoded or machine-readable Current_weather_condition column will be concatenated to the DataFrame, and the previous textual Weather_conditions column will be dropped. The data will now be in machine-readable form and ready for further processing. The transformed data can be checked by executing df.head() in the notebook (optional).

New Feature – Future_weather_condition

To forecast weather conditions 4 hours into the future, a new feature named Future_weather_condition will be created by shifting Current_weather_condition by four rows, as each row is recorded with a time gap of an hour. Future_weather_condition will represent the label of future weather conditions 4 hours ahead and will be used as a dependent variable for ML forecasting:

    # A negative shift pulls the value recorded four rows (hours) later
    # into the current row, so each row is paired with its future label.
    df['Future_weather_condition'] = df.Current_weather_condition.shift(-4, axis=0)
    df.dropna(inplace=True)

The pandas dropna() function will be used to discard or drop null values from the DataFrame, as the last four rows will have null values in the new column after shifting.

Data Correlations and Filtering

With the data now fully machine-readable, correlations will be observed using the Pearson correlation coefficient to understand how each column relates to the others. Data and feature correlation is a crucial step before feature selection for ML model training, especially with continuous features. The Pearson correlation coefficient, a measure of the linear correlation between two variables (X and y), produces values between +1 and -1. A value of +1 indicates a perfect positive linear correlation, -1 indicates a perfect negative linear correlation, and 0 indicates no linear correlation.
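As a rough illustration of what the coefficient measures, the following sketch computes Pearson's r from its definition (covariance divided by the product of the standard deviations). The function name pearson and the sample readings are hypothetical; in practice the pandas corr() method does this for every pair of columns.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of products of deviations (unnormalised covariance).
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations (unnormalised variances).
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical readings: the second series is an exact linear function of the first.
temperature = [1.0, 2.0, 3.0, 4.0]
apparent = [3.0, 5.0, 7.0, 9.0]       # 2 * temperature + 1
falling = [9.0, 7.0, 5.0, 3.0]        # decreases as temperature increases

r_up = pearson(temperature, apparent)
r_down = pearson(temperature, falling)
print(r_up, r_down)
```

The normalising factors cancel between numerator and denominator, so the sketch skips dividing by the number of samples.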
Pearson correlation can help understand the relationship between continuous variables, though it does not imply causation. Pearson correlation coefficients will be observed using pandas as follows:

    df.corr(method="pearson")

To visualize using a heatmap:

    import seaborn as sn
    import matplotlib.pyplot as plt

    corrMatrix = df.corr()
    sn.heatmap(corrMatrix, annot=True)
    plt.show()

Figure 3.8 – Heatmap of correlation scores

From the heatmap in Figure 3.8, it can be seen that the coefficient between Temperature_C and Apparent_Temperature_C is 0.99, so the two columns carry almost the same information. The S_No (Serial number) column, being a continuous value and serving as an incremental index for the DataFrame, may be discarded or filtered out as it does not provide significant value. Therefore, both Apparent_Temperature_C and S_No will be dropped or filtered. The correlation of Future_weather_condition with other independent variables will be examined:

Figure 3.9 – Pearson correlation for Future_weather_condition

Values between 0.5 and 1.0 indicate a strong positive correlation, while values between -0.5 and -1.0 indicate a strong negative correlation. The graph shows a positive correlation with Current_weather_condition, and Temperature_C is also positively correlated with Future_weather_condition.

Time Series Analysis

As temperature is a continuous variable, its progression over time will be observed. A time series plot will be visualized using matplotlib as follows:

    time = df['Timestamp']
    temp = df['Temperature_C']
    plt.plot(time, temp)
    plt.show()

Figure 3.10 – Time series progression of Temperature in °C

The time series progression of temperature in Figure 3.10 depicts a broadly stationary pattern: apart from the repeating seasonal cycle, the mean, variance, and covariance appear roughly constant over time. Non-stationary behaviors, by contrast, include trends, random walks, or a combination of these. Temperature rises and falls with the seasons without a long-term drift, which is consistent with this observation.

This concludes the data analysis and processing.
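A rough way to back up the visual impression from the time series plot is to compare the mean and variance of consecutive windows. The sketch below is only an informal check under stated assumptions, not a formal stationarity test: the window_stats helper is hypothetical, and the data is synthetic (a sine wave standing in for three years of monthly temperatures).

```python
import math
from statistics import mean, pvariance
from typing import List, Sequence, Tuple


def window_stats(series: Sequence[float], window: int) -> List[Tuple[float, float]]:
    """Return (mean, variance) for each consecutive, non-overlapping window."""
    stats = []
    for start in range(0, len(series) - window + 1, window):
        chunk = series[start:start + window]
        stats.append((mean(chunk), pvariance(chunk)))
    return stats


# Synthetic seasonal series: three "years" of 12 monthly values around 10 degrees.
seasonal = [10 + 8 * math.sin(2 * math.pi * m / 12) for m in range(36)]
# The same series with an added upward trend of 0.5 degrees per month.
trending = [value + 0.5 * i for i, value in enumerate(seasonal)]

seasonal_stats = window_stats(seasonal, window=12)
trending_stats = window_stats(trending, window=12)

print(seasonal_stats)   # yearly mean and variance stay the same
print(trending_stats)   # yearly mean drifts upward
```

When the window matches the seasonal period, a purely seasonal series gives near-identical statistics for every window, whereas a trend shows up as a steadily changing mean.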
The processed data will now be registered in the workspace before proceeding with training the ML model.

DATA REGISTRATION AND VERSIONING

Before starting ML training, it is crucial that the data be registered and versioned in the workspace. This allows experiments or ML models to be traced back to the data source used for training. The purpose of versioning the data is to allow for replication of a model's training or to explain the workings of the model based on the inference or testing data. Therefore, the processed data will be registered and versioned for use in the ML pipeline. The Azure Machine Learning SDK will be used to register and version the processed data in the Azure Machine Learning workspace as follows:

    from azureml.core import Workspace

    subscription_id = '---insert your subscription ID here----'
    resource_group = 'Learn_MLOps'
    workspace_name = 'MLOps_WS'
    workspace = Workspace(subscription_id, resource_group, workspace_name)

The subscription ID, resource group, and workspace name will be obtained from the Azure Machine Learning portal, as shown in Figure 3.11:

Figure 3.11 – Workspace credentials (Resource group, Subscription ID, and Workspace name)

By requesting the workspace credentials, a workspace object will be obtained. The Workspace() function will connect the notebook to the Azure platform. Authentication will be completed by following a link and providing a random code along with Azure account details.
After authentication is confirmed, the default data store will be accessed, and the required data files will be uploaded to Azure Blob Storage connected to the workspace:

    from azureml.core import Dataset

    # get the default datastore linked to upload prepared data
    datastore = workspace.get_default_datastore()
    # upload the local file from src_dir to target_path in datastore
    datastore.upload(src_dir='Dataset', target_path='data')
    dataset = Dataset.Tabular.from_delimited_files(datastore.path('data/weather_dataset_processed.csv'))

Note: The Tabular.from_delimited_files() method may fail on Linux or macOS machines without .NET Core 2.1 installed. Instructions for correct installation are available here: Install .NET Core on Linux.

After successfully executing the commands, the data file will be uploaded to the data store, as shown in Figure 3.12. The dataset can be previewed as follows:

    # preview the first 3 rows of the dataset from the datastore
    dataset.take(3).to_pandas_dataframe()

Once the data is uploaded, the dataset will be registered and versioned in the workspace as follows:

    weather_ds = dataset.register(workspace=workspace,
                                  name='weather_ds_portofTurku',
                                  description='processed weather data')

The register(...) function will register the dataset in the workspace, as illustrated in Figure 3.12. Detailed documentation is available at Register Datasets.

Figure 3.12 – Processed dataset registered in the ML workspace

ML Pipeline

The data has been processed by addressing irregularities such as missing data, selecting features based on correlations, creating new features, and ingesting and versioning the processed data in the ML workspace. There are two methods to ingest data for ML model training in the ML pipeline. One method is from central storage (where all raw data is stored), and the other is through a feature store. To understand the feature store, its use will be explained before moving to the ML pipeline.
Feature Store A feature store complements central storage by storing important features and making them available for training or inference. It is a repository where raw data is transformed into useful features that ML models can use for training and making predictions. Raw data typically comes from various sources such as structured, unstructured, streaming, batch, and real-time data. This data needs to be pulled, transformed (using a feature pipeline), and stored in the feature store. The feature store then makes this data available for consumption. By having a centralized feature store, data scientists can avoid duplicating work, especially data processing, and can efficiently share and reuse features with other teams, increasing productivity. Figure 3.13 – Feature store workflow In Figure 3.13, the Feature Store uses a Feature Pipeline connected to Central Storage (which stores data from multiple sources) to transform and store raw data into useful features for ML training. Features stored in the feature store can be retrieved for training, serving, or discovering insights. Benefits of using a feature store include: Efficient feature engineering for training data Avoidance of unnecessary data pre-processing before training Prevention of repetitive feature engineering Features available for quick inference (testing) System support for serving features Exploratory Data Analysis by the feature store Opportunity to reuse model features Quick queries on features Reproducibility of training data sets Monitoring feature drift in production (feature drift will be discussed in Chapter 12, Model Serving and Monitoring) Features available for data drift monitoring While knowing the advantages of a feature store is useful for data ingestion in the ML pipeline, it may not be suitable for all cases. For the current implementation, the feature store will not be used. 
Instead, data will be directly accessed from central storage, where preprocessed and registered datasets are available for training and testing. With the ingested and versioned data, the ML pipeline can now be built. This pipeline will facilitate further feature engineering, feature scaling, and the creation of training and testing datasets, which will be used for ML model training and hyperparameter tuning. The ML pipeline and functionalities will be executed on cloud computing resources rather than locally on a computer.

MACHINE LEARNING PIPELINES

BASICS OF ML PIPELINES

Before beginning the implementation of the ML pipeline, the basics will be reviewed. The ML pipeline will be revisited, and the necessary resources for ML pipeline implementation will be set up. The focus will then shift to data ingestion. The ML pipeline discussed in Figure 14 of Chapter 1, Fundamentals of MLOps Workflow, will be reviewed to clarify its components.

Figure 4.1 – Machine Learning Pipeline

As depicted in Figure 4.1, a comprehensive ML pipeline includes the following steps:

1. Data ingestion
2. Model training
3. Model testing
4. Model packaging
5. Model registering

All these pipeline steps will be implemented using both the Azure ML service (cloud-based) and MLflow (open source) to provide a diverse perspective. Azure ML and MLflow are highlighted for their features in Table 4.1, and their unique capabilities are compared in Table 4.2.

Table 4.2 – MLflow versus Azure ML service

Implementation Steps

To implement the ML pipeline, storage resources for the dataset and computational resources for the ML models are required. As discussed in Characterizing Your Machine Learning Problem, the computation for the ML pipeline and the business problem will be performed as shown in Figure 4.2. The data will initially be processed on a local computer to start and preprocess the data for ML training. For ML training and pipeline implementation, cloud-based compute resources (Microsoft Azure) will be used.
Although ML training could be done on a local computer, cloud compute resources will be used to learn how to provision and utilize the necessary resources.

Figure 4.3 – Computation Location for Data and ML Tasks

Configuring Compute Resources

To configure the needed compute resources for the ML pipeline, follow these steps:
1. Navigate to the ML workspace.
Figure 4.4 – Azure Machine Learning Workspace
2. Select the Compute option and click the Create button to explore available compute options on the cloud.
3. Choose a suitable compute option for efficient ML model training. Select it based on training needs and cost limitations, and assign it a name. For example, in Figure 4.5, a compute or virtual machine named Standard_D1_v2 is selected: it is a CPU with 1 core, 3.5 GB of RAM, and 50 GB of disk space. To select the recommended machine configuration or size, choose from all options in the Virtual machine size section. After selecting the desired configuration, click Next to proceed, as shown in Figure 4.5.
Figure 4.5 – Create a Compute Resource in an Azure Workspace
4. Provision the previously created compute resource. After naming and creating the compute resource, it will be provisioned, ready, and running for ML training on the cloud, as shown in Figure 4.6.
Figure 4.6 – Provisioned Compute in an Azure ML Workspace

Once the compute is provisioned, select the JupyterLab option. JupyterLab, an open-source web-based user interface, offers features like a text editor, code editor, terminal, and custom components. It will be used as a programming interface connected to the provisioned compute for ML model training.

Hands-On Implementation

To begin implementing the ML pipeline, follow these steps:
1. Clone the repository imported into the Azure DevOps project. Click the Clone button in the upper-right corner of the Repos menu and then click the Generate Git Credentials button to create a hash password.
Figure 4.7 – Cloning an Azure DevOps Git Repository (Generate Git Credentials)
2. Copy the HTTPS link from the Command Line section to obtain the Azure DevOps repository link, for example:

    https://<username>@dev.azure.com/xxxxx/Learn_MLOps/_git/Learn_MLOps

3. Copy the password generated in step 1 and insert it into the link from step 2, immediately after the username and separated from it by a colon, before the @ character. Use the following git clone command to avoid permission errors:

    git clone https://user:<password>@dev.azure.com/user/repo_created

4. Access the terminal in JupyterLab to clone the repository to the Azure compute. Select the Terminal option from the Launcher tab or use the Terminal link from the Application URI column in the compute instances list in the Azure ML workspace. Execute the following command (with the password inserted as described in step 3):

    git clone https://<username>@dev.azure.com/xxxxx/Learn_MLOps/_git/Learn_MLOps

Figure 4.8 – Clone the Azure DevOps Git Repository on Azure Compute
5. Navigate to the 04_MLpipelines folder and follow the implementation steps in ML-pipeline.ipynb from the cloned repository. It is recommended to read the instructions in that file for better understanding and to execute the code in a new file adapted to your setup. The compute resource has now been provisioned and the repository cloned onto the compute.
6.
Start implementing the ML-pipeline.ipynb file by importing the necessary libraries, such as pandas, numpy, azureml, pickle, mlflow, and others:

    import pandas as pd
    import numpy as np
    import warnings
    from math import sqrt
    warnings.filterwarnings('ignore')
    from azureml.core.run import Run
    from azureml.core.experiment import Experiment
    from azureml.core.workspace import Workspace
    from azureml.core.model import Model
    from azureml.core.authentication import ServicePrincipalAuthentication
    from azureml.train.automl import AutoMLConfig
    import pickle
    from matplotlib import pyplot as plt
    from matplotlib.pyplot import figure
    import mlflow

7. Next, set up MLflow (for tracking experiments). Use the get_mlflow_tracking_uri() function to get the tracking URI where MLflow experiments and artifacts should be logged (in this case, the tracking URI for the provisioned training compute). Then use the set_tracking_uri() function to connect to that tracking URI (the uniform resource identifier of a specific resource). The tracking URI can point to a remote server, a database connection string, or a local path for logging data in a local directory.
In our case, we point the tracking URI to the local path by default (on the provisioned training compute):

    # workspace is the Workspace object; its creation is shown in the data ingestion steps
    uri = workspace.get_mlflow_tracking_uri()
    mlflow.set_tracking_uri(uri)

The URI defaults to the mlruns folder where MLflow artifacts and logs will be saved for experiments. By setting the tracking URI for your MLflow experiments, you have set the location for MLflow to save its artifacts and logs in the mlruns folder (on your provisioned compute). After executing these commands, check the current path. You will find the mlruns folder.

DATA INGESTION AND FEATURE ENGINEERING

Data is essential to train ML models; without data, there is no ML. Data ingestion is a trigger step for the ML pipeline. It deals with the volume, velocity, veracity, and variety of data by extracting data from various data sources and ingesting the needed data for model training. The ML pipeline is initiated by ingesting the right data for training the ML models. We will start by accessing the preprocessed data we registered in the previous chapter. Follow these steps to access and import the preprocessed data and get it ready for ML training:

1. Using the Workspace() function from the Azure ML SDK, access the data from the datastore in the ML workspace as follows:

    from azureml.core import Workspace, Dataset
    subscription_id = 'xxxxxx-xxxxxx-xxxxxxx-xxxxxxx'
    resource_group = 'Learn_MLOps'
    workspace_name = 'MLOps_WS'
    workspace = Workspace(subscription_id, resource_group, workspace_name)

NOTE
Insert your own credentials, such as subscription_id, resource_group, and workspace_name, and initiate a workspace object using these credentials. When these instructions are successfully executed in JupyterLab, you can run the remaining blocks of code in the next cells.

2. Import the preprocessed dataset that was prepared in the previous chapter.
The preprocessed dataset is imported with the get_by_name() method of the Dataset class from the Azure ML SDK, which retrieves the needed dataset:

    # Importing pre-processed dataset
    dataset = Dataset.get_by_name(workspace, name='processed_weather_data_portofTurku')
    print(dataset.name, dataset.version)

3. Upon successfully retrieving or mounting the dataset, you can confirm it by printing dataset.name and dataset.version, which should print processed_weather_data_portofTurku 1, or the name you gave the dataset previously.
4. After retrieving the preprocessed data, it is vital to split it into training and test sets in order to train the ML model and to test or evaluate it in the training phase and later stages. Hence, we split it into a training set (80%) and a test set (20%) as follows:

    import os

    # Load the registered dataset into a pandas DataFrame before splitting
    df = dataset.to_pandas_dataframe()
    # The first 77,160 rows (about 80% of the data) form the training set
    df_training = df.iloc[:77160]
    df_test = df.drop(df_training.index)
    os.makedirs('Data', exist_ok=True)  # make sure the output folder exists
    df_training.to_csv('Data/training_data.csv', index=False)
    df_test.to_csv('Data/test_data.csv', index=False)

5. After successfully splitting the data, these two datasets are stored and registered to the datastore (connected to the Azure ML workspace) as follows:

    datastore = workspace.get_default_datastore()
    datastore.upload(src_dir='Data', target_path='data')
    training_dataset = Dataset.Tabular.from_delimited_files(datastore.path('data/training_data.csv'))
    # Read back the held-out file written in step 4 (test_data.csv)
    validation_dataset = Dataset.Tabular.from_delimited_files(datastore.path('data/test_data.csv'))
    training_ds = training_dataset.register(workspace=workspace, name='training_dataset', description='Dataset to use for ML training')
    test_ds = validation_dataset.register(workspace=workspace, name='test_dataset', description='Dataset for validation ML models')

By using the register() function, we are able to register the training and test datasets, which can be imported later from the datastore.
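The positional 80/20 split from step 4 can be sketched without the Azure ML SDK. This is a minimal illustration, not the notebook's code: the helper name split_rows and the synthetic list of row indices are assumptions made for the example, and the row count of 96,453 is that of the raw CSV file. The hard-coded cut of 77160 used in step 4 is close to the value computed here.

```python
def split_rows(rows, train_percent=80):
    """Split rows by position: the first train_percent% go to training."""
    # Integer arithmetic avoids floating-point rounding at the cut point.
    cut = len(rows) * train_percent // 100
    return rows[:cut], rows[cut:]


# Stand-in for the 96,453 rows of the weather dataset (row indices only).
rows = list(range(96453))
train_rows, test_rows = split_rows(rows)

# Sizes of the two parts; together they cover every row exactly once.
print(len(train_rows), len(test_rows))
```

Because the split is positional rather than shuffled, the held-out rows are simply the last 20% of the file. Whether that is desirable depends on how the rows are ordered.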
Next, we will import the training data and ingest it into the ML pipeline, and use the test dataset later to test the model's performance on unseen data in production or for model analysis.

To ingest training data into the ML pipeline, it should be imported using the get_by_name() function and converted to a pandas dataframe using the to_pandas_dataframe() function:

    dataset = Dataset.get_by_name(workspace, name='training_dataset')
    print(dataset.name, dataset.version)
    df = dataset.to_pandas_dataframe()

The training dataset is now retrieved and will be used to train the ML models. The goal is to train classification models to predict whether it will rain or not.
Therefore, the features Temperature, Humidity, Wind_speed, Wind_bearing, Visibility, Pressure, and Current_weather_conditions will be selected to train the binary classification models for future weather prediction (4 hours ahead).

Selecting and Scaling Features

1. Select Features
Selecting the right features and scaling the data is essential before training the ML models. The features will be selected as follows. The values in the variable X represent the independent variables, while the variable y is the dependent variable (forecasted weather):

    X = df[['Temperature_C', 'Humidity', 'Wind_speed_kmph', 'Wind_bearing_degrees', 'Visibility_km', 'Pressure_millibars', 'Current_weather_condition']].values
    y = df['Future_weather_condition'].values

2. Split the Data
The training data should be split into training and validation sets using the train_test_split() function from sklearn. Fixing the random seed (random_state) is necessary to reproduce a training session with the same configuration. Therefore, random_state=1 will be used:

    from sklearn.model_selection import train_test_split
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=1)

With an 80% (training data) and 20% (validation data) split, the two sets are now ready for feature scaling and ML model training.

3. Scale the Data
For optimal and efficient ML model training, the data should be on the same scale. The data will be scaled using StandardScaler() from sklearn to calibrate all numeric values:

    from sklearn.preprocessing import StandardScaler
    sc = StandardScaler()
    X_train = sc.fit_transform(X_train)
    X_val = sc.transform(X_val)

With this step, the numeric values of the training data are scaled using StandardScaler, which standardizes each feature to zero mean and unit variance based on the X_train values; the same fitted transformation is then applied to X_val. The data is now ready for ML model training.

MACHINE LEARNING TRAINING AND HYPERPARAMETER OPTIMIZATION

The fun part, training ML models, is now ready to begin!
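Before moving on to training, the standardization performed in the scaling step can be sketched in plain Python. This is an illustrative sketch only: the helper names fit_standardizer and transform and the sample temperatures are assumptions, not part of the notebook. It mirrors what StandardScaler does per column: subtract the mean and divide by the population standard deviation, both computed on the training data alone.

```python
from statistics import fmean, pstdev


def fit_standardizer(column):
    """Return the mean and population standard deviation of one feature column."""
    mean = fmean(column)
    std = pstdev(column)
    # Guard against a constant column, which would otherwise divide by zero.
    if std == 0:
        std = 1.0
    return mean, std


def transform(column, mean, std):
    """Standardize values with statistics learned from the training column."""
    return [(value - mean) / std for value in column]


# Hypothetical temperature readings; the last training value is an outlier.
train_temps = [2.0, 4.0, 6.0, 8.0, 30.0]
val_temps = [5.0, 10.0]

mean, std = fit_standardizer(train_temps)
scaled_train = transform(train_temps, mean, std)
scaled_val = transform(val_temps, mean, std)  # reuses the training statistics

print(scaled_train)
print(scaled_val)
```

The scaled training column has mean 0 and standard deviation 1, but single values are not confined to a fixed interval: the outlier here lands well above 1. The validation values are transformed with the training statistics, which keeps information from the held-out data out of the fitting step.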
Model training is enabled through modular scripts or code, which perform all traditional ML training steps, such as fitting and transforming data to train the model and tuning hyperparameters to find the best model. The output of this step is a trained ML model. To address the business problem, two well-known models, the Support Vector Machine (SVM) classifier and the Random Forest classifier, will be trained. These models are chosen for their popularity and consistency of results. Models of your choice can also be selected; there are no limitations. Training will start with the Support Vector Machine classifier and then proceed to the Random Forest classifier.

Support Vector Machine

The Support Vector Machine (SVM) is a widely used supervised learning algorithm for classification and regression. Data points are classified using hyperplanes in an N-dimensional space. SVM is known for producing significant accuracy with less computation power, and understanding it theoretically is recommended before training models in practice.

Training the SVM Classifier:

1. Initiate the Experiment: The experiment is initiated using the Experiment() function from the Azure ML SDK to start a training run or experiment in the Azure ML workspace:

    myexperiment = Experiment(workspace, "support-vector-machine")

Similarly, the MLflow experiment is initiated:

    mlflow.set_experiment("mlflow-support-vector-machine")

Both Azure ML and MLflow experiments are now initiated, and the training step will be monitored and logged.

2. Hyperparameter Tuning: Hyperparameter tuning is performed to find the best parameters for the model.
Grid Search is used for tuning the SVM classifier:

    from sklearn.svm import SVC
    from sklearn.model_selection import GridSearchCV

    parameters = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}
    svc = SVC()
    run = myexperiment.start_logging()
    mlflow.start_run()
    run.log("dataset name", dataset.name)
    run.log("dataset Version", dataset.version)
    svc_grid = GridSearchCV(svc, parameters)
    svc_grid.fit(X_train, y_train)

Grid Search tests different parameter combinations to optimize model performance. The best parameters, C=1 and kernel='rbf', are found, and the dataset used for training is logged.

3. Train the Model with Best Parameters: A new SVM model is trained using the best parameters:

    # best_params_ holds the winning combination from the search;
    # get_params() would only return the settings of the unfitted base estimator
    best = svc_grid.best_params_
    svc = SVC(C=best['C'], kernel=best['kernel'])
    svc.fit(X_train, y_train)
    run.log("C", best['C'])
    run.log("Kernel", best['kernel'])

The trained SVM model output is:

    SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, decision_function_shape='ovr', degree=3, gamma='auto_deprecated', kernel='rbf', max_iter=-1, probability=False, random_state=None, shrinking=True, tol=0.001, verbose=False)

4.
Train the Random Forest Classifier:

Initialize the Experiment: The experiment is initialized in the Azure ML workspace and MLflow:

    myexperiment = Experiment(workspace, "random-forest-classifier")
    mlflow.set_experiment("mlflow-random-forest-classifier")

Train the Model: The Random Forest classifier is trained with fixed parameters (max_depth=10, random_state=0, n_estimators=100) rather than a hyperparameter search:

    from sklearn.ensemble import RandomForestClassifier
    rf = RandomForestClassifier(max_depth=10, random_state=0, n_estimators=100)
    run = myexperiment.start_logging()
    mlflow.start_run()
    run.log("dataset name", dataset.name)
    run.log("dataset Version", dataset.version)
    rf.fit(X_train, y_train)
    run.log("max_depth", 10)
    run.log("random_state", 0)
    run.log("n_estimators", 100)

The output of the Random Forest model is:

    RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini', max_depth=10, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None, oob_score=False, random_state=0, verbose=0, warm_start=False)

Both models, the SVM classifier and the Random Forest classifier, are now trained.

MODEL TESTING AND DEFINING METRICS

In this step, the performance of the trained models is evaluated on a separate test dataset (previously split and versioned). The model's inference is evaluated according to selected metrics, and a report on model performance is generated. Metrics to be measured include accuracy, precision, recall, and F-score:

Accuracy: Number of correct predictions divided by the total number of predictions.
Precision: Proportion of predicted positives that are actually positive. Precision = True Positives / (True Positives + False Positives)
Recall: Proportion of actual positives identified correctly. Recall = True Positives / (True Positives + False Negatives)
F-Score: Harmonic mean of precision and recall.
F1 Score = 2 * (Recall * Precision) / (Recall + Precision)

Testing the SVM Classifier: Metrics for the SVM model are calculated using sklearn.metrics and logged:

    from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

    # X_test and y_test are assumed to hold the scaled features and labels of the
    # registered test dataset; sha is assumed to hold the current Git commit hash
    predicted_svc = svc.predict(X_test)
    acc = accuracy_score(y_test, predicted_svc)
    fscore = f1_score(y_test, predicted_svc, average="macro")
    precision = precision_score(y_test, predicted_svc, average="macro")
    recall = recall_score(y_test, predicted_svc, average="macro")
    run.log("Test_accuracy", acc)
    run.log("Precision", precision)
    run.log("Recall", recall)
    run.log("F-Score", fscore)
    run.log("Git-sha", sha)

The results are logged in Azure ML and MLflow experiments.

Testing the Random Forest Classifier: Metrics for the Random Forest model are calculated and logged similarly:

    # Predictions of the Random Forest model on the same test data
    predicted_rf = rf.predict(X_test)
    acc = accuracy_score(y_test, predicted_rf)
    fscore = f1_score(y_test, predicted_rf, average="macro")
    precision = precision_score(y_test, predicted_rf, average="macro")
    recall = recall_score(y_test, predicted_rf, average="macro")
    run.log("Test_accuracy", acc)
    run.log("Precision", precision)
    run.log("Recall", recall)
    run.log("F-Score", fscore)
    run.log("Git-sha", sha)

The output is logged to Azure ML and MLflow experiments.

MODEL PACKAGING

After testing, the trained model is serialized into a file for export to test or production environments. Serialization can present compatibility challenges, especially when models are trained with different frameworks. ONNX (Open Neural Network Exchange) offers a standard for model interoperability. The trained model is serialized using ONNX to avoid compatibility issues.
Serialize the SVM Model:

    import os
    from skl2onnx import convert_sklearn
    from skl2onnx.common.data_types import FloatTensorType

    # The input width must match the number of feature columns used for training
    initial_type = [('float_input', FloatTensorType([None, X_train.shape[1]]))]
    onx = convert_sklearn(svc, initial_types=initial_type)
    os.makedirs("outputs", exist_ok=True)  # make sure the output folder exists
    with open("outputs/svc.onnx", "wb") as f:
        f.write(onx.SerializeToString())

Serialize the Random Forest Model:

    from skl2onnx import convert_sklearn
    from skl2onnx.common.data_types import FloatTensorType

    # Same input signature as for the SVM model
    initial_type = [('float_input', FloatTensorType([None, X_train.shape[1]]))]
    onx = convert_sklearn(rf, initial_types=initial_type)
    with open("outputs/rf.onnx", "wb") as f:
        f.write(onx.SerializeToString())

REGISTERING MODELS AND PRODUCTION ARTIFACTS

Serialized models are registered and stored in the model registry. A registered model serves as a logical container for one or more files that constitute the model. Downloading the registered model retrieves all files. Registered models can be deployed and used for inference on demand.

Register the SVM Model:

    model = Model.register(model_path='./outputs/svc.onnx',
                           model_name="support-vector-classifier",
                           tags={'dataset': dataset.name, 'version': dataset.version, 'hyparameter-C': '1', 'testdata-accuracy': '0.9519'},
                           model_framework='pandas==0.23.4',
                           description="Support vector classifier to predict weather at port of Turku",
                           workspace=workspace)
    print('Name:', model.name)
    print('Version:', model.version)

The model is registered with a name and tags. Successful registration is confirmed by the printed model name and version.
Figure 4.9 – Registered SVM model (with test metrics)

Register the Random Forest Model:

    model = Model.register(model_path='./outputs/rf.onnx',
                           model_name="random-forest-classifier",
                           tags={'dataset': dataset.name, 'version': dataset.version, 'hyparameter-C': '1', 'testdata-accuracy': '0.9548'},
                           model_framework='pandas==0.23.4',
                           description="Random forest classifier to predict weather at port of Turku",
                           workspace=workspace)
    print('Name:', model.name)
    print('Version:', model.version)

The model registration output reflects the name and version used. Both models are now visible in the Models section of the Azure ML workspace. The training and testing logs for each model can be analyzed, providing a comprehensive view of the model's performance and traceability.

Figure 4.10 – Registered models

Congratulations! Both the SVM classifier and Random Forest classifier, along with the serialized scaler, are registered in the model registry. These models can be downloaded and deployed later. This brings us to the successful implementation of the ML pipeline!
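As a closing illustration of the metrics used in the model testing step, the sketch below computes accuracy and macro-averaged precision, recall, and F-score by hand. It is a minimal sketch under stated assumptions: the function name macro_metrics and the toy labels (1 for rain, 0 for no rain) are hypothetical, and the macro average is taken as the unweighted mean of the per-class scores, which is what average="macro" requests in sklearn.metrics.

```python
def macro_metrics(y_true, y_pred):
    """Compute accuracy and macro-averaged precision, recall and F1."""
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1_scores = [], [], []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # Each ratio falls back to 0.0 when its denominator is empty.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f1_scores.append(f1)
    n_labels = len(labels)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return {
        "accuracy": correct / len(y_true),
        "precision": sum(precisions) / n_labels,
        "recall": sum(recalls) / n_labels,
        "f1": sum(f1_scores) / n_labels,
    }


# Toy labels: 1 = rain, 0 = no rain (made up for illustration).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
metrics = macro_metrics(y_true, y_pred)
print(metrics)
```

In this toy case six of eight predictions are correct (accuracy 0.75), while the rain class alone reaches a precision and recall of 2/3. Macro averaging gives the rarer rain class the same weight as the no-rain class, which matters when rainy hours are a minority of the data.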
