BMAN73701 Week 5: Advanced Machine Learning

Questions and Answers

Machine Learning with scikit-learn does not require any libraries such as NumPy or Matplotlib.

False (B)

What is the primary output format of data processed using scikit-learn?

NumPy arrays

Machine Learning requires _______ data preprocessing before analysis.

raw

Match the following data sources to their types:

  • Databases = Raw Data Sources
  • Excel = Raw Data Sources
  • Numerical = Tabular Data Types
  • Categorical = Tabular Data Types

What type of data does scikit-learn accept as input?

NumPy arrays or pandas DataFrames (C)

Who is the professor for the Programming in Python course?

Prof. Manuel López-Ibáñez

What is the formula for calculating Accuracy?

$(TP + TN) / (TP + TN + FP + FN)$ (A)

The Recall is calculated as the ratio of true positives to the total actual positives.

True (A)

What is the F1 score for the given model?

0.62

The ratio of predicted positives that are actual positives is called __________.

Precision

Match the terms with their descriptions:

  • Precision = Ratio of true positives to predicted positives
  • Recall = Ratio of true positives to actual positives
  • Accuracy = Ratio of correct predictions to total predictions
  • F1 Score = Harmonic mean of Precision and Recall
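
Written in terms of confusion-matrix counts (TP, TN, FP, FN), the four descriptions above correspond to the following standard formulas. This is a restatement for reference, not a formula sheet taken from the lecture:

```latex
\begin{align*}
\text{Precision} &= \frac{TP}{TP + FP} \\
\text{Recall}    &= \frac{TP}{TP + FN} \\
\text{Accuracy}  &= \frac{TP + TN}{TP + TN + FP + FN} \\
F_1              &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
\end{align*}
```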

What does the value '0.57' represent in this context?

Recall (A)

How many instances are classified correctly in this model?

4

The F1 score in this case is higher than both precision and recall.

False (B)

What happens if K is set too small in K-fold Cross-validation?

It results in faster computations but poorer generalization. (C)

Increasing K in K-fold Cross-validation always improves model accuracy.

False (B)

What method should be used in K-fold Cross-validation when classes are unbalanced?

Stratified K-fold Cross-validation

In Stratified K-fold Cross-validation, the proportion of __________ labels is maintained within each fold.

class

Match the following terms with their definitions:

  • K-fold Cross-validation = Division of dataset into K subsets for training and validation.
  • Stratified K-fold = A method ensuring each fold has the same proportion of class labels.
  • Training fold = Subset of data used to train the model.
  • Validation fold = Subset of data used to evaluate model performance.

Which library function can be used to implement stratified cross-validation in Python?

cross_val_score() (B)

K-fold Cross-validation is only applicable to classification problems.

False (B)

Identify one potential disadvantage of using a very large K value in K-fold Cross-validation.

Increased computation time.

What is the purpose of the model.fit() function in supervised learning?

To build the model (C)

Neural networks learn decision points and branches when modeling.

False (B)

What are the two types of predictions made by classifiers and regression models?

Classifiers predict labels, and regression predicts numerical outputs.

Random forests are a type of __________ learning model.

ensemble

What scoring metric is commonly used for regression models?

R-Squared (R2) (B)

Confusion matrices are used to assess the performance of regression models.

False (B)

What is the main goal of supervised machine learning?

Given examples, learn to classify or predict answers (D)

Classification tasks in machine learning predict a real-valued number.

False (B)

What is the purpose of the train/test random split in machine learning?

To separate data into training and testing sets for model evaluation.

In machine learning, __________ is used to validate a model's performance by dividing the training data into K subsets.

K-fold cross-validation

Match the following machine learning terms with their descriptions:

  • Supervised ML = Learning with labeled data
  • Unsupervised ML = Learning without labeled data
  • Classification = Assigning labels to data
  • Regression = Predicting continuous values

What does K represent in K-fold cross-validation?

The number of splits of the training data (B)

In supervised machine learning, the terms 'X' and 'y' typically represent the input and output data respectively.

True (A)

Define classification in the context of machine learning.

Classification is the process of assigning categories or labels to data.

The output layer of a neural network is where __________ are generated.

predictions

Match the machine learning stages with their correct sequence:

  • Train/test random split = 1
  • Train ML model = 2
  • Score on the test set = 3

What does the term 'hidden layer' refer to in a neural network?

The layers that perform intermediate computations (B)

Regression tasks involve assigning discrete labels to data.

False (B)

Explain the main difference between supervised and unsupervised machine learning.

Supervised learning uses labeled data for training, while unsupervised learning works with unlabeled data.

The training dataset in machine learning is commonly denoted as __________.

Xtrain

What does the 'Random' in Random Forests refer to?

Random decisions made during tree construction (B)

Random Forests consists of a single decision tree for classification and regression.

False (B)

What is the primary purpose of an ensemble of decision trees in Random Forests?

To improve accuracy and reduce overfitting.

In Random Forests, classification is based on __________ and regression is based on _________.

vote, average

Match the following terms related to Random Forests with their meanings:

  • Ensemble = Combination of multiple models
  • Decision Tree = A model that makes decisions based on features
  • Feature Split = Dividing the data at a particular point based on feature values
  • Information Gain = A measure used to determine the effectiveness of a feature in splitting data

Which of the following statements about Random Forests is true?

They can handle both classification and regression tasks. (D)

Random Forests can only be used with continuous data.

False (B)

Name one advantage of using Random Forests over a single decision tree.

Reduced risk of overfitting.

Flashcards

Precision

The ratio of correctly predicted positive instances to the total number of instances predicted as positive.

Recall

The ratio of correctly predicted positive instances to the total number of actual positive instances.

Accuracy

The ratio of correctly classified instances to the total number of instances.

F1-score

A measure that combines precision and recall to give a balanced evaluation of a model.

Confusion Matrix

A table that summarizes the performance of a classification model by showing the counts of true positives, true negatives, false positives, and false negatives.

True Positive (TP)

An instance that is correctly predicted as positive.

True Negative (TN)

An instance that is correctly predicted as negative.

False Positive (FP)

An instance that is incorrectly predicted as positive.
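
As a small worked example of the metrics defined in these flashcards, the sketch below computes them from confusion-matrix counts. The counts are hypothetical (chosen so that recall is about 0.57 and F1 about 0.62) and are not taken from the lecture's model; `fn` stands for false negatives, i.e. actual positives predicted as negative.

```python
# Hypothetical confusion-matrix counts (an assumption for illustration only).
tp, fp, fn, tn = 4, 2, 3, 0

precision = tp / (tp + fp)                   # true positives / predicted positives
recall = tp / (tp + fn)                      # true positives / actual positives
accuracy = (tp + tn) / (tp + tn + fp + fn)   # correct predictions / all predictions
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"precision={precision:.2f} recall={recall:.2f} "
      f"accuracy={accuracy:.2f} f1={f1:.2f}")
# prints: precision=0.67 recall=0.57 accuracy=0.44 f1=0.62
```

Because F1 is a harmonic mean, it always lies between precision and recall and can never exceed both.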

Raw Data

Data in its original, unprocessed form. It can be in various formats like databases, spreadsheets, text files, or web data.

Data Tidying

The process of cleaning and organizing raw data to prepare it for analysis. This involves dealing with inconsistencies, missing values, and transforming data into a usable format.

Tabular Data

Data organized in a table format, with rows representing observations and columns representing variables.

Numerical Data

Data that represents quantities or measurements, like age, weight, or income.

Categorical Data

Data that represents categories or groups, like gender, color, or product type.

Ordinal Data

A type of categorical data where the categories have a natural order or ranking, like small, medium, large.

Preprocessing

Transforming and preparing data for machine learning algorithms. This includes scaling, encoding, and handling missing values.

Data Analysis

The process of examining and interpreting data to gain insights and make informed decisions.

Supervised ML

A type of machine learning where the algorithm learns from labeled data, meaning it's given both inputs and expected outputs. It aims to predict outputs based on the provided examples.

Unsupervised ML

A type of machine learning where the algorithm learns from unlabeled data, meaning it has to discover patterns and structures on its own. It aims to find insights and relationships within the data.

Classification

A type of supervised learning that categorizes data into distinct groups or classes.

Regression

A type of supervised learning that predicts a continuous value (like a number) based on given input data.

Train/Test Split

The process of dividing the dataset into two parts: one for training the machine learning model and another for evaluating its performance.

Test Data

The part of the dataset used to evaluate the performance of the trained machine learning model.

Score (Model)

A metric used to assess the performance of a machine learning model on the test data.

K-fold Cross-validation

A robust technique for evaluating a model by repeatedly splitting the training data into K folds, training the model on K-1 folds, and testing on the remaining fold.

Validation Set

A portion of the training data used to evaluate the performance of the model during training, helping to prevent overfitting.

Input Layer

The first layer of a neural network that receives the raw input data.

Hidden Layer

A processing layer in a neural network that transforms the input data and learns complex features.

Output Layer

The final layer of a neural network that produces the predicted output.

Neural Network

A complex machine learning model inspired by the structure of the human brain, composed of interconnected nodes called neurons.

Random Forests

A machine learning algorithm that combines multiple decision trees to improve accuracy and prevent overfitting.

Decision Tree

A flowchart-like structure that uses a series of rules to classify data points based on their features.

Ensemble

A group of multiple learning models working together to improve prediction accuracy.

Randomness in Random Forests

In Random Forests, each tree is built independently, making random decisions about which features to use and how to split the data.

Voting or Averaging

After each tree in the Random Forest makes a prediction, these predictions are combined through voting (for classification) or averaging (for regression) to arrive at the final output.

Information Gain

A measure used in decision tree construction to determine the best feature to split the data based on how much it improves the separation of classes.

Variance

In regression, the variance of data points around the prediction line is used to evaluate the accuracy of the prediction.

Overfitting

A scenario where a model performs well on the training data but poorly on unseen data due to learning the noise and specific patterns of the training data.

Supervised Learning

A type of machine learning where an algorithm learns from labeled data, meaning each data point has a known output or target value. This allows the model to learn the relationship between inputs and outputs, making predictions on new, unseen data.

Model Training

The process of feeding labeled data to a machine learning model to adjust its parameters and improve its ability to make accurate predictions.

Hyperparameter Optimization

The process of finding the best set of parameters (hyperparameters) for a machine learning model to achieve optimal performance.

Machine Learning Pipeline

A sequence of steps used to process data and train a machine learning model. It typically involves data cleaning, preprocessing, feature engineering, model selection, and evaluation.

K's impact: too small

If K is too small, computation is faster but generalization is poorer. Each training run may not have enough examples, leading to inaccurate performance estimates on unseen data.

K's impact: too large

When K is too large, cross-validation takes longer and the variance of the scores increases. Each validation fold might be too small, leading to unstable performance estimates.

Stratified K-fold

A technique used for classification problems, where the proportion of classes is maintained across train and test sets, as well as in each fold. Ensures every fold reflects the overall class proportions for better model evaluation.

Why is stratification needed?

When dealing with unbalanced classes (more samples from one group), stratification keeps the minority class represented in every split, so the evaluation is not dominated by the majority class and the results are more reliable.

stratify=y_labels

This parameter in train_test_split() tells the function to split the data into train and test sets, maintaining the proportion of class labels in each set.
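
A minimal sketch of this parameter in use, assuming a hypothetical label array `y_labels` with 10% positives and placeholder features `X`:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical unbalanced labels: 90 negatives, 10 positives (an assumption).
y_labels = np.array([0] * 90 + [1] * 10)
X = np.arange(len(y_labels)).reshape(-1, 1)  # placeholder features

X_train, X_test, y_train, y_test = train_test_split(
    X, y_labels, test_size=0.2, stratify=y_labels, random_state=0
)

# Both sets keep the original 10% share of positives.
print(y_train.mean(), y_test.mean())
```

Without `stratify`, a random split of such a small minority class could easily leave the test set with very few positives, or none.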

cross_val_score()

When the estimator is a classifier and cv is left at its default or given as an integer, this function automatically uses stratified k-fold cross-validation to assess models, keeping class proportions consistent across folds.

Performance evaluation with K-fold

K-fold cross-validation allows for robust performance evaluation of a model by assessing its generalization ability across different folds, ensuring it performs well on unseen data.

Study Notes

Course Information

  • Course Title: Programming in Python for Business Analytics
  • Course Code: BMAN73701
  • Week: 5, Lecture 2
  • Topic: Advanced Machine Learning

Data Analysis Process

  • Data acquisition from raw sources (databases, web, excel, APIs)
  • Raw data tidied and organized into tabular data (numerical, categorical, ordinal)
  • Data analysis through summary statistics, analysis, and visualizations.

Machine Learning with scikit-learn

  • Built on top of NumPy, SciPy, and Matplotlib
  • Input data can be NumPy arrays or pandas DataFrames
  • Output is typically NumPy arrays
  • Open-source, constantly improving, and object-oriented
  • Used to fit (train) or transform data

Supervised Machine Learning

  • Learning from labeled examples (inputs paired with known answers)
  • Classification: assigning discrete categories or labels
  • Regression: predicting continuous real-valued numbers

Supervised ML Workflow

  • Randomly split data into training and testing sets
  • Train a machine learning model using the training data
  • Evaluate the model's performance on the test set
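
The three workflow steps above can be sketched as follows. The synthetic data from `make_classification` and the choice of a decision tree are assumptions made only to keep the example self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for a real dataset (an assumption).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# 1. Random train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# 2. Train the model on the training data only.
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# 3. Score on the held-out test set (mean accuracy for classifiers).
test_accuracy = model.score(X_test, y_test)
print(f"test accuracy: {test_accuracy:.3f}")
```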

K-fold Cross-validation

  • Divides the training data into K folds
  • Iterates K times, using each fold once as validation data and training on the remaining K-1 folds
  • Scores the model on the validation data for each iteration
  • Gives a more reliable estimate of how well the model generalizes to unseen data; K-fold can be more accurate than a single train/test random split if the training data is small
  • The best value for K is situational: too small, and the estimate of generalization may be poor; too large, and cross-validation takes longer to run
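
A minimal sketch of K-fold cross-validation with K = 5. The synthetic data and the decision tree are assumptions for illustration; any scikit-learn estimator could be used instead.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic training data (an assumption).
X_train, y_train = make_classification(n_samples=300, n_features=8, random_state=0)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)  # K = 5
model = DecisionTreeClassifier(random_state=0)

# One score per fold: train on 4 folds, validate on the remaining one.
scores = cross_val_score(model, X_train, y_train, cv=kfold)
print(scores)
print(f"mean={scores.mean():.3f} std={scores.std():.3f}")
```

The spread of the per-fold scores gives a rough idea of how stable the estimate is.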

Stratified K-fold Cross-validation

  • Maintains the proportion of class labels in the training and validation folds during K-fold Cross-validation
  • Improves the handling of unbalanced data sets
  • Used automatically by cross_val_score() when the estimator is a classifier
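
To see stratification at work, the sketch below splits a hypothetical unbalanced label vector (20% positives) and checks the share of positives in each validation fold. The features are placeholders, since only the labels drive the split.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical unbalanced labels: 80 negatives and 20 positives (an assumption).
y = np.array([0] * 80 + [1] * 20)
X = np.zeros((len(y), 1))  # placeholder features

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

positive_rates = []
for train_idx, val_idx in skf.split(X, y):
    # Share of positives in this validation fold.
    positive_rates.append(y[val_idx].mean())

print(positive_rates)  # each fold holds 4 positives out of 20 samples
```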

Supervised ML Model Building

  • Decision trees: learning decision points/branches
  • Neural networks (MLP): learning weights of neurons

Supervised ML Model Evaluation

  • Classifiers: accuracy
  • Regression: R2

Random Forests

  • Ensemble of decision trees
  • Random decisions when building the trees
  • Many trees combined
  • Reduces overfitting by combining predictions from multiple trees (voting for classification, averaging for regression)
  • Measures feature importance
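
The averaging step can be checked directly for a regression forest: the forest's prediction should equal the mean of its individual trees' predictions. The data and `n_estimators=20` are assumptions for this sketch.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data (an assumption).
X, y = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=0)

forest = RandomForestRegressor(n_estimators=20, random_state=0).fit(X, y)

# Predict the first three rows with every individual tree, then average.
tree_predictions = np.array([tree.predict(X[:3]) for tree in forest.estimators_])
manual_average = tree_predictions.mean(axis=0)

print(np.allclose(manual_average, forest.predict(X[:3])))  # prints: True
```

For classification, scikit-learn's forest averages the trees' predicted class probabilities, which plays the role of the vote.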

Credit Card Default Example

  • Dataset used for demonstration purposes, with 30,000 rows (unbalanced)
  • Using value_counts() gives the breakdown of the default variable, which should be considered before modeling
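
A small sketch of checking class balance with `value_counts()`. The column name `default` and the tiny table are hypothetical stand-ins, not the 30,000-row dataset itself.

```python
import pandas as pd

# Hypothetical target column: 8 non-defaults and 2 defaults (an assumption).
df = pd.DataFrame({"default": [0] * 8 + [1] * 2})

counts = df["default"].value_counts()                 # absolute counts per class
shares = df["default"].value_counts(normalize=True)   # proportions per class

print(counts)
print(shares)
```

When one class dominates, accuracy alone can be misleading, so this breakdown is worth checking before choosing a splitting strategy and metric.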

Feature Importance in Random Forests

  • Important features have a higher impact on the model's predictions
  • Available as forest.feature_importances_ (calculated after training the model)
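
A minimal sketch of reading feature importances from a trained forest. The variable name `forest` follows the notes; the synthetic data and settings are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with 6 features, 3 of them informative (an assumption).
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=0
)

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# The attribute exists only after fit(); the values sum to 1.
importances = forest.feature_importances_
for idx in importances.argsort()[::-1]:
    print(f"feature {idx}: {importances[idx]:.3f}")
```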

Hyper-parameter Optimization

  • Parameters are learned from the training data, whereas hyperparameters are set before training and need separate tuning
  • Basic methods: Grid Search and Random Search
  • Optimization algorithms can also be used to find the combination of hyperparameters that maximizes the cross-validation score
  • Tools for this include SMAC, irace, and scikit-optimize (skopt)
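
A minimal grid-search sketch using scikit-learn's `GridSearchCV`. The grid values, the random forest, and the synthetic data are assumptions chosen to keep the run short.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data (an assumption).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Hypothetical grid: 2 x 2 = 4 hyperparameter combinations.
param_grid = {"n_estimators": [10, 30], "max_depth": [2, 4]}

search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X, y)  # cross-validates every combination

print(search.best_params_)
print(f"best cross-validation score: {search.best_score_:.3f}")
```

`RandomizedSearchCV` has a similar interface but samples a fixed number of combinations instead of trying them all.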

Preprocessing with Cross-validation

  • Data transformations should be fitted within the model-building step for each fold in Cross-validation (not on the whole dataset beforehand)
  • This avoids data leakage, where information from the validation fold influences training and makes the evaluation overly optimistic

Pipelines

  • Combining preprocessing steps and machine learning models into a single object
  • Helps with data transformations and avoiding data leakage during model evaluation.
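
A minimal sketch of a pipeline evaluated with cross-validation. `LogisticRegression` is used here only as a simple stand-in model, and the data is synthetic; both are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Preprocessing and model combined into a single estimator object.
pipeline = make_pipeline(StandardScaler(), LogisticRegression())

# In each fold the scaler is fitted on the training part only,
# so no statistics from the validation part leak into training.
scores = cross_val_score(pipeline, X, y, cv=5)
print(scores.mean())
```

Scaling the full dataset before calling `cross_val_score()` would instead let validation rows influence the scaler's mean and standard deviation.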
