Understanding ML Problems and Pipelines Unit 2
47 Questions

Created by
@ClearerHeisenberg

Questions and Answers

What data quality aspects will be assessed?

Accuracy, completeness, reliability, relevance, and timeliness.

What pandas function provides descriptive statistics of a DataFrame?

df.describe()

Which column needs to be changed to DateTime format?

Timestamp

What method is used to handle missing data in the Weather_conditions column?

Forward fill method

The Weather_conditions column is simplified by categorizing it into ___ labels.

two

What is the purpose of label encoding in this context?

To convert categorical values into numbers

The Pearson correlation coefficient ranges from -1 to 1.

True

What function is used to visualize correlations via a heatmap?

sn.heatmap()

What will be created to forecast weather conditions 4 hours into the future?

Future_weather_condition feature

What does the dropna() function do in pandas?

Discards or drops null values from the DataFrame

What is the primary purpose of registering and versioning data in Azure ML?

To allow replication of a model's training and trace back to the data source used.

Which of the following steps is NOT part of the comprehensive ML pipeline?

Feature extraction

A feature store is essential for all ML pipelines.

False

What command is used to clone the repository to avoid permission errors?

git clone https://user:[email protected]/user/repo_created

Which libraries should be imported for implementing the ML-pipeline.ipynb file?

pandas, numpy, azureml, pickle, mlflow

How should the preprocessed dataset be accessed in the Azure workspace?

Using the Workspace() function from the Azure ML SDK.

The training dataset is split into an 80% training set and a 20% test set.

True

Which of the following metrics are used to evaluate model performance?

Accuracy

What function is used to register the training dataset?

register() function

Which function is used to import the preprocessed dataset from the Azure ML SDK?

get_by_name()

What is the formula for F1 Score?

F1 Score = 2 * (Recall * Precision) / (Recall + Precision)

What is the purpose of StandardScaler() in the data preparation process?

To scale the data for ML model training

What Python library is used to calculate metrics for SVM and Random Forest models?

sklearn.metrics

What is the output of the Random Forest model when trained with default parameters?

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini', max_depth=10, max_features='auto', n_estimators=100)

What command is used to predict outcomes using the SVM model?

svc.predict(X_test)

What format is used to serialize the trained model for export?

ONNX

What purpose does the model registry serve?

It stores and manages registered models, allowing for retrieval and deployment.

What is logged using run.log in the SVM classifier testing process?

Test_accuracy, Precision, Recall, F-Score, Git-sha

What Python function is used to register the SVM model?

Model.register()

The SVM model's description includes predicting weather at the port of ___

Turku

Match the model with its corresponding metrics:

SVM Classifier = Accuracy: 0.9519
Random Forest Classifier = Accuracy: 0.9548

What is the goal of the data science team at the cargo shipping company in Finland?

To save 20% of the costs for cargo operations at the port of Turku.

Which type of model is required to save 20% of operational costs at the port of Turku?

Supervised Learning

The size of the training data set for the ML model is considered big data.

False

What tools are used for implementing ML operations in the described process?

Azure Machine Learning and MLflow.

What is the first command to install MLflow?

pip3 install mlflow

The command to start the MLflow tracking UI is '____'.

mlflow ui

What needs to be created to manage related resources for an Azure solution?

A resource group.

Which programming language is primarily used in the ML solution?

Python

You can skip the installation section if the tools are already installed.

True

What is the purpose of Azure DevOps in this ML implementation?

To manage source code and CI/CD-related operations.

Which of the following is NOT one of the ten principles of source code management for ML?

Redundancy

How is data quality assessed before training an ML model?

By checking for accuracy, completeness, reliability, relevance, and timeliness.

The command to install the Azure Machine Learning SDK is '____'.

pip3 install --upgrade azureml-sdk

What is meant by version control in ML systems?

It is used to manage code, data, and model updates systematically.

Good ML models result from training on high-quality data.

True

Match the following components with their descriptions:

Modularity = Encourages reusability and easy upgrades
Testing = Ensures robustness of the system
Clean Code = Self-explanatory and high readability
Version Control = Tracks changes in code and models

Study Notes

Understanding Machine Learning (ML) Problems and Pipelines

  • A data science team aims to reduce cargo operations costs by 20% at the port of Turku, Finland through ML weather prediction.
  • The objective is to predict weather conditions (especially rain) 4 hours in advance to streamline operations.
  • The problem is first categorized using principles covered in earlier material, which yields an implementation roadmap.
  • A supervised learning model is needed for binary classification (rain or no rain) based on labeled data.
  • The weather dataset contains 10.7 MB of data with 96,453 rows, available on GitHub.

Data Assessment

  • Training data is small but sufficient for ML purposes.
  • The operations require tasks for training, testing, deployment, and monitoring of hourly weather forecasts.
  • A small to medium-sized data science team without DevOps engineers will handle the project, indicating agile operations.
  • Two tools will be utilized for MLOps implementation: Azure Machine Learning and MLflow.

Tools and Resources Setup

  • MLflow: An open-source tool for managing the ML lifecycle.

    • Installation via terminal: pip3 install mlflow.
    • MLflow tracking UI can be started with: mlflow ui, accessible at http://localhost:5000.
  • Azure Machine Learning: A cloud platform for ML model management.

    • Requires creating a free Microsoft Azure subscription.
    • A resource group is created to manage related solution resources; the suggested name is Learn_MLOps.
  • Azure Machine Learning Workspace: Central hub for tracking ML experiments.

    • The workspace is given a name and associated with the previously created resource group.
  • Azure Machine Learning SDK: Essential for code orchestration.

    • Install via terminal: pip3 install --upgrade azureml-sdk.
  • Azure DevOps: Used for source code management and CI/CD operations.

    • Create a project under a free Azure DevOps account.
    • Import a relevant GitHub repository to facilitate collaboration.
  • JupyterHub: An interactive tool for data analysis and visualization.

    • Install using: python3 -m pip install jupyterhub.

Principles of Source Code Management for ML

  • Emphasize modularity, single-task functions, and structured code for readability and maintenance.
  • Clean code leads to reduced maintenance costs and improved performance.
  • Robustness through testing (unit and acceptance) ensures the code performs as expected.
  • Version control (using Git) is critical for managing code, data, and model versions.
  • Implement logging for monitoring in production without relying solely on print statements.
  • Handling exceptions and edge cases is key for system stability.

Data Quality Characteristics

  • Assess data based on accuracy, completeness, reliability, relevance, and timeliness.
  • Raw data must be preprocessed before training; steps include encoding text to numerical formats and analyzing correlations.
  • Quality assessment involves checking for anomalies, data types, and null values using pandas functions.

Data Quality Assessment Techniques

  • Use df.describe() to examine descriptive statistics and ascertain data coherence.
  • Verify column formats with df.dtypes; ensure correct data types for analysis.
  • Convert the Timestamp column to DateTime format with:
    df['Timestamp'] = pd.to_datetime(df['Timestamp'])
    
  • Check for null values using df.isnull().values.any(), and plan for data calibration if missing values are found.

Data Handling

  • The Forward Fill method is used to address missing values in the 'Weather_conditions' column through Pandas’ fillna() function.
  • Weather changes progressively over time, so forward fill carries the last observed value forward until a new non-null value appears.
  • Label encoding converts categorical text data (such as weather conditions) into numerical format necessary for machine learning algorithms.
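
The assessment checks and the forward fill described above can be sketched as follows. This is a minimal illustration on a hypothetical four-row sample, not the real dataset; the sample values are assumptions. It uses Series.ffill(), which is the forward-fill equivalent of calling fillna() with the forward-fill method and is the form that current pandas versions recommend.

```python
import pandas as pd

# Hypothetical miniature sample standing in for the weather dataset.
df = pd.DataFrame({
    "Timestamp": ["2006-04-01 00:00:00", "2006-04-01 01:00:00",
                  "2006-04-01 02:00:00", "2006-04-01 03:00:00"],
    "Temperature": [9.5, 9.4, 9.4, 8.3],
    "Weather_conditions": ["rain", None, None, "clear"],
})

print(df.describe())  # descriptive statistics for the numeric columns
print(df.dtypes)      # Timestamp is still stored as text at this point

# Convert the Timestamp column to a proper datetime type.
df["Timestamp"] = pd.to_datetime(df["Timestamp"])

# True if any cell in the DataFrame is null.
has_nulls = df.isnull().values.any()

# Forward fill: each missing value takes the last observed value above it.
df["Weather_conditions"] = df["Weather_conditions"].ffill()
print(df)
```

Forward fill only makes sense here because the rows are ordered in time; on shuffled data it would copy unrelated observations.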

Label Encoding Process

  • Categorical labels in 'Weather_conditions' are simplified into two categories: "rain" and "no_rain" (where "snow" and "clear" are categorized as "no_rain").
  • Label encoding assigns a unique numerical value (0 or 1) to each of these categories, allowing for efficient machine processing.
  • The transformation process is carried out using LabelEncoder from scikit-learn, making the data ready for further analysis.
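
A minimal sketch of this two-step process, assuming a small hypothetical sample and assuming the encoded result is stored in a column named Current_weather_condition (the name the notes use in the next section). LabelEncoder numbers classes in sorted order, so "no_rain" becomes 0 and "rain" becomes 1.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of raw weather labels.
df = pd.DataFrame({"Weather_conditions": ["rain", "snow", "clear", "rain", "clear"]})

# Step 1: collapse the labels into two categories.
df["Weather_conditions"] = df["Weather_conditions"].replace(
    {"snow": "no_rain", "clear": "no_rain"}
)

# Step 2: convert the two text labels into integers.
encoder = LabelEncoder()
df["Current_weather_condition"] = encoder.fit_transform(df["Weather_conditions"])

print(list(encoder.classes_))  # index in this list is the encoded value
print(df)
```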

Future Weather Condition Feature

  • A new feature, 'Future_weather_condition', is created by shifting the 'Current_weather_condition' column by four rows to predict weather conditions four hours ahead.
  • The dropna() function helps to remove any rows generated with null values due to the column shift.
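
The shift can be sketched as below on hypothetical hourly data. The direction is an assumption: a shift of -4 aligns each row with the value observed four rows later, which is what a four-hours-ahead label requires when rows are hourly and sorted by time. The notes only state that the column is shifted by four rows.

```python
import pandas as pd

# Hypothetical encoded hourly observations (0 = no_rain, 1 = rain).
df = pd.DataFrame({"Current_weather_condition": [0, 0, 1, 1, 0, 1, 0, 0]})

# Each row receives the condition observed four rows (hours) later.
df["Future_weather_condition"] = df["Current_weather_condition"].shift(-4)

# The last four rows have no future value, so they are null and get dropped.
df = df.dropna()

# The shift introduced NaN, which made the column float; restore integers.
df["Future_weather_condition"] = df["Future_weather_condition"].astype(int)
print(df)
```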

Data Correlation Analysis

  • Pearson correlation coefficients measure the linear relationship between two variables, ranging from +1 (perfect positive correlation) through 0 (no linear correlation) to -1 (perfect negative correlation).
  • Heatmaps can visualize these correlations, highlighting strong relationships, such as between 'Temperature' and 'Apparent_Temperature_C' with a coefficient of 0.99.
  • Non-informative columns, like 'S_No', can be discarded to streamline the dataset, focusing on relevant features, particularly with respect to 'Future_weather_condition'.
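
As a rough sketch of the correlation step, the snippet below computes a Pearson correlation matrix with pandas on hypothetical values. The plotting helper is optional and assumes seaborn and matplotlib are installed; seaborn is imported under the alias sn to match the sn.heatmap() call quoted in the quiz.

```python
import pandas as pd

# Hypothetical sample; values are invented for illustration only.
df = pd.DataFrame({
    "S_No": [0, 1, 2, 3, 4],
    "Temperature": [9.0, 10.5, 12.0, 8.0, 15.0],
    "Apparent_Temperature_C": [7.5, 9.0, 11.0, 6.0, 14.5],
    "Humidity": [0.89, 0.80, 0.70, 0.93, 0.55],
})

# Drop the non-informative serial number before analysing correlations.
df = df.drop(columns=["S_No"])

# Pairwise Pearson coefficients; every entry lies between -1 and 1.
corr_matrix = df.corr(method="pearson")
print(corr_matrix)


def plot_heatmap(corr):
    """Draw the correlation matrix as an annotated heatmap (optional)."""
    import matplotlib.pyplot as plt
    import seaborn as sn

    sn.heatmap(corr, annot=True)
    plt.show()
```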

Time Series Analysis

  • Temperature, as a continuous variable, exhibits stationary patterns across time, revealing seasonal variability through time series plots.
  • This analysis confirms temperature's cyclical nature related to seasonal changes.

Data Registration and Versioning

  • Proper data registration is crucial for traceability in ML model training, allowing for future replication and understanding of model outcomes.
  • Azure Machine Learning SDK is utilized for registering and versioning the processed dataset within Azure workspace to maintain a systematic workflow.

Feature Store Concept

  • A feature store acts as a centralized repository for storing important features derived from raw data, enhancing efficiency in data usage.
  • Key benefits include reusability of features across projects, effective feature engineering, and easy access for model training and inference.

ML Pipeline Structure

  • A comprehensive ML pipeline entails data ingestion, model training, testing, packaging, and registering; it is implemented here with Azure ML cloud resources and MLflow.
  • Initial data processing occurs locally, while training and pipeline execution are conducted on Azure's cloud computing resources.

Compute Resource Configuration

  • Setting up compute resources in Azure allows for optimized training environments based on project requirements and cost limitations.
  • JupyterLab serves as an interface for model training on Azure, facilitating the implementation of the ML pipeline.

MLflow for Experiment Tracking

  • MLflow is integrated for tracking experiments, requiring setup of a tracking URI to store logs and artifacts within a designated path on the cloud.
  • This facilitates organization and retrieval of experimental data, enhancing overall efficiency in ML model development.
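
A minimal sketch of experiment tracking with MLflow's Python API, under stated assumptions: the function name log_training_run and its arguments are hypothetical, and the tracking URI is whatever location the project has designated (for example a cloud workspace URI or a local path). MLflow is imported inside the function so the definition can be read without the package installed.

```python
def log_training_run(tracking_uri, experiment_name, params, metrics):
    """Log one training run's parameters and metrics to an MLflow server.

    Hypothetical helper for illustration; all argument values are supplied
    by the caller.
    """
    import mlflow

    # Point MLflow at the location where logs and artifacts are stored.
    mlflow.set_tracking_uri(tracking_uri)
    # Group related runs under a named experiment.
    mlflow.set_experiment(experiment_name)

    with mlflow.start_run():
        mlflow.log_params(params)    # e.g. {"kernel": "rbf", "C": 1}
        mlflow.log_metrics(metrics)  # e.g. {"accuracy": 0.95}
```

Called as log_training_run(uri, "weather-demo", {"C": 1}, {"accuracy": 0.95}), the run then appears in the MLflow tracking UI under the named experiment.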

Data Ingestion and Feature Engineering

  • Data ingestion is the foundational step in ML pipelines, focusing on harvesting data from various sources necessary for effective model training.
  • The ability to manage data volume, velocity, veracity, and variety is crucial in preparing data for machine learning tasks.

Data Access and Preparation for ML Training

  • Access preprocessed data using the Azure ML SDK’s Workspace() function by providing subscription ID, resource group, and workspace name.
  • Import the preprocessed dataset with Dataset.get_by_name(), confirming successful retrieval by checking dataset.name and dataset.version.
  • Split the dataset into training (80%) and validation (20%) sets for model training and evaluation.
  • Store and register training and validation datasets in the Azure ML workspace's datastore to facilitate future access.
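
The access steps might look like the sketch below, which assumes the Azure ML SDK (v1, the azureml-sdk package installed earlier) is available and that valid credentials exist. The function name load_preprocessed_dataset and its arguments are hypothetical; the SDK is imported inside the function so the definition itself does not require the package.

```python
def load_preprocessed_dataset(subscription_id, resource_group,
                              workspace_name, dataset_name):
    """Fetch a registered dataset from an Azure ML workspace as a DataFrame.

    Hypothetical helper; every argument is supplied by the caller.
    """
    from azureml.core import Dataset, Workspace

    # Connect to the workspace that holds the registered dataset.
    workspace = Workspace(subscription_id, resource_group, workspace_name)

    # Retrieve the dataset by the name it was registered under.
    dataset = Dataset.get_by_name(workspace, name=dataset_name)

    # Confirm what was retrieved before using it.
    print(dataset.name, dataset.version)

    return dataset.to_pandas_dataframe()
```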

Feature Selection and Data Scaling

  • Select relevant features for training: Temperature, Humidity, Wind Speed, Wind Bearing, Visibility, Pressure, and Current Weather Conditions.
  • Define independent variables (X) and dependent variable (Y) for the model.
  • Use train_test_split() from sklearn to split data into training and testing sets, maintaining reproducibility with a fixed random state.
  • Scale numeric values for training data using StandardScaler() to improve model performance.
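
These preparation steps can be sketched on synthetic data as follows. The column names and the random values are assumptions made for illustration, as is the random_state value. The scaler is fitted on the training split only and then applied to the test split, so no information from the test set leaks into training.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the preprocessed weather data.
rng = np.random.default_rng(0)
n_rows = 100
df = pd.DataFrame({
    "Temperature": rng.normal(10, 8, n_rows),
    "Humidity": rng.uniform(0.3, 1.0, n_rows),
    "Wind_speed": rng.uniform(0, 30, n_rows),
    "Wind_bearing": rng.uniform(0, 360, n_rows),
    "Visibility": rng.uniform(0, 16, n_rows),
    "Pressure": rng.normal(1010, 8, n_rows),
    "Current_weather_condition": rng.integers(0, 2, n_rows),
    "Future_weather_condition": rng.integers(0, 2, n_rows),
})

feature_columns = ["Temperature", "Humidity", "Wind_speed", "Wind_bearing",
                   "Visibility", "Pressure", "Current_weather_condition"]

X = df[feature_columns].to_numpy()            # independent variables
y = df["Future_weather_condition"].to_numpy() # dependent variable

# 80/20 split; a fixed random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

# Fit the scaler on training data only, then reuse it for the test data.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```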

Machine Learning Model Training

  • Train two classifiers: Support Vector Machine (SVM) and Random Forest, recognized for their effectiveness in classification tasks.
  • Initiate experiments in Azure ML and MLflow to log model training metrics.
  • Use Grid Search for hyperparameter tuning of the SVM classifier to identify the best parameters (kernel and C value).
  • Train SVM model with identified optimal hyperparameters ensuring thorough logging of model parameters.
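
A minimal sketch of the grid search over kernel and C, run on synthetic data from make_classification. The candidate values in param_grid and the 3-fold cross-validation are assumptions chosen to keep the example fast, not values confirmed by the notes.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic binary classification data with seven features.
X, y = make_classification(n_samples=200, n_features=7, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Candidate hyperparameters (assumed values for illustration).
param_grid = {"kernel": ["linear", "rbf"], "C": [1, 10]}

# Try every combination with cross-validation and keep the best one.
grid = GridSearchCV(SVC(), param_grid, cv=3)
grid.fit(X_train, y_train)
print(grid.best_params_)

# Train the final SVM with the best parameters found.
svc = SVC(**grid.best_params_)
svc.fit(X_train, y_train)
predictions = svc.predict(X_test)
```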

Model Testing and Metrics Evaluation

  • Evaluate model performance using metrics: Accuracy, Precision, Recall, and F-Score.
  • For the SVM classifier: Calculate metrics using sklearn.metrics, logging results in Azure ML and MLflow.
  • Repeat the process for the Random Forest classifier, ensuring consistency in metric evaluation and logging.
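
The four metrics can be computed with sklearn.metrics as sketched below. The labels and predictions are hypothetical, chosen so the arithmetic is easy to follow: 3 true positives, 3 true negatives, 1 false positive, and 1 false negative. The last lines recompute the F-Score by hand from the formula F1 = 2 * (Recall * Precision) / (Recall + Precision).

```python
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)

# Hypothetical ground truth and model predictions (1 = rain, 0 = no_rain).
y_test = [1, 0, 1, 1, 0, 0, 1, 0]
predictions = [1, 0, 0, 1, 0, 1, 1, 0]

accuracy = accuracy_score(y_test, predictions)    # correct / total
precision = precision_score(y_test, predictions)  # TP / (TP + FP)
recall = recall_score(y_test, predictions)        # TP / (TP + FN)
fscore = f1_score(y_test, predictions)            # harmonic mean of the two

# The same F-Score computed directly from the formula.
manual_f1 = 2 * (recall * precision) / (recall + precision)

print(accuracy, precision, recall, fscore, manual_f1)
```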

Model Serialization and Registration

  • Serialize trained models into ONNX format for compatibility and deployment convenience.
  • Register serialized SVM and Random Forest models in the Azure ML model registry, including relevant metadata such as accuracy and hyperparameters.
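
One possible way to perform these two steps is sketched below. It assumes the skl2onnx package for converting scikit-learn models (the notes name only the ONNX format, not the converter) and the Azure ML SDK v1 for registration. The function names, the input name "float_input", and all argument values are hypothetical. Both libraries are imported lazily inside the functions.

```python
def export_to_onnx(model, n_features, output_path):
    """Serialize a fitted scikit-learn model to an ONNX file."""
    from skl2onnx import convert_sklearn
    from skl2onnx.common.data_types import FloatTensorType

    # Declare the model input: any number of rows, n_features float columns.
    initial_types = [("float_input", FloatTensorType([None, n_features]))]
    onnx_model = convert_sklearn(model, initial_types=initial_types)

    with open(output_path, "wb") as f:
        f.write(onnx_model.SerializeToString())
    return output_path


def register_model(workspace, model_path, model_name, tags, description):
    """Register a serialized model file in the Azure ML model registry."""
    from azureml.core.model import Model

    return Model.register(
        workspace=workspace,
        model_path=model_path,    # local path of the serialized file
        model_name=model_name,
        tags=tags,                # e.g. accuracy and hyperparameters
        description=description,
    )
```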

Conclusion

  • Both the SVM and Random Forest classifiers are now trained, evaluated, serialized, and registered, making them available for deployment and inference.
  • This process highlights the importance of structured data preparation, model evaluation, and registration in creating a robust ML pipeline.

Description

This quiz focuses on the core concepts of machine learning (ML) problems and pipelines, specifically applied to a real-world scenario in the cargo shipping industry. It examines how predictive modeling can help mitigate cost and operational disruptions due to weather conditions. Test your knowledge on the ML strategies that can be implemented for effective problem-solving.
