BMAN73701 Week 5: Advanced Machine Learning

Questions and Answers

Machine Learning with scikit-learn does not require any libraries such as NumPy or Matplotlib.

False

What is the primary output format of data processed using scikit-learn?

NumPy arrays

Machine Learning requires _______ data preprocessing before analysis.

raw

Match the following data sources to their types:

  • Databases = Raw Data Sources
  • Excel = Raw Data Sources
  • Numerical = Tabular Data Types
  • Categorical = Tabular Data Types

What type of data does scikit-learn accept as input?

NumPy arrays or Pandas DataFrames

Who is the professor for the Programming in Python course?

Prof. Manuel López-Ibáñez

What is the formula for calculating Accuracy?

$(TP + TN) / (TP + TN + FP + FN)$

The Recall is calculated as the ratio of true positives to the total actual positives.

True

What is the F1 score for the given model?

0.62

The ratio of predicted positives that are actual positives is called __________.

Precision

Match the terms with their descriptions:

  • Precision = Ratio of true positives to predicted positives
  • Recall = Ratio of true positives to actual positives
  • Accuracy = Ratio of correct predictions to total predictions
  • F1 Score = Harmonic mean of Precision and Recall

What does the value '0.57' represent in this context?

Recall

How many instances are classified correctly in this model?

4

The F1 score in this case is higher than both precision and recall.

False

What happens if K is set too small in K-fold Cross-validation?

It results in faster computations but poorer generalization.

Increasing K in K-fold Cross-validation always improves model accuracy.

False

What method should be used in K-fold Cross-validation when classes are unbalanced?

Stratified K-fold Cross-validation

In Stratified K-fold Cross-validation, the proportion of __________ labels is maintained within each fold.

class

Match the following terms with their definitions:

  • K-fold Cross-validation = Division of dataset into K subsets for training and validation.
  • Stratified K-fold = A method ensuring each fold has the same proportion of class labels.
  • Training fold = Subset of data used to train the model.
  • Validation fold = Subset of data used to evaluate model performance.

Which library function can be used to implement stratified cross-validation in Python?

cross_val_score()

K-fold Cross-validation is only applicable to classification problems.

False

Identify one potential disadvantage of using a very large K value in K-fold Cross-validation.

Increased computation time.

What is the purpose of the model.fit() function in supervised learning?

To build the model

Neural networks learn decision points and branches when modeling.

False

What are the two types of predictions made by classifiers and regression models?

Classifiers predict labels, and regression predicts numerical outputs.

Random forests are a type of __________ learning model.

ensemble

What scoring metric is commonly used for regression models?

R-Squared (R2)

Confusion matrices are used to assess the performance of regression models.

False

What is the main goal of supervised machine learning?

Given examples, learn to classify or predict answers

Classification tasks in machine learning predict a real-valued number.

False

What is the purpose of the train/test random split in machine learning?

To separate data into training and testing sets for model evaluation.

In machine learning, __________ is used to validate a model's performance by dividing the training data into K subsets.

K-fold cross-validation

Match the following machine learning terms with their descriptions:

  • Supervised ML = Learning with labeled data
  • Unsupervised ML = Learning without labeled data
  • Classification = Assigning labels to data
  • Regression = Predicting continuous values

What does K represent in K-fold cross-validation?

The number of splits of the training data

In supervised machine learning, the terms 'X' and 'y' typically represent the input and output data respectively.

True

Define classification in the context of machine learning.

Classification is the process of assigning categories or labels to data.

The output layer of a neural network is where __________ are generated.

predictions

Match the machine learning stages with their correct sequence:

  • Train/test random split = 1
  • Train ML model = 2
  • Score on the test set = 3

What does the term 'hidden layer' refer to in a neural network?

The layers that perform intermediate computations

Regression tasks involve assigning discrete labels to data.

False

Explain the main difference between supervised and unsupervised machine learning.

Supervised learning uses labeled data for training, while unsupervised learning works with unlabeled data.

The training dataset in machine learning is commonly denoted as __________.

Xtrain

What does the 'Random' in Random Forests refer to?

Random decisions made during tree construction

A Random Forest consists of a single decision tree for classification and regression.

False

What is the primary purpose of an ensemble of decision trees in Random Forests?

To improve accuracy and reduce overfitting.

In Random Forests, classification is based on __________ and regression is based on _________.

vote, average

Match the following terms related to Random Forests with their meanings:

  • Ensemble = Combination of multiple models
  • Decision Tree = A model that makes decisions based on features
  • Feature Split = Dividing the data at a particular point based on feature values
  • Information Gain = A measure used to determine the effectiveness of a feature in splitting data

Which of the following statements about Random Forests is true?

They can handle both classification and regression tasks.

Random Forests can only be used with continuous data.

False

Name one advantage of using Random Forests over a single decision tree.

Reduced risk of overfitting.

Study Notes

Course Information

  • Course Title: Programming in Python for Business Analytics
  • Course Code: BMAN73701
  • Week: 5, Lecture 2
  • Topic: Advanced Machine Learning

Data Analysis Process

  • Data acquisition from raw sources (databases, web, Excel, APIs)
  • Raw data tidied and organized into tabular data (numerical, categorical, ordinal)
  • Data analysis through summary statistics and visualizations

Machine Learning with scikit-learn

  • Built on top of NumPy, SciPy, and Matplotlib
  • Input data can be NumPy arrays or Pandas DataFrames
  • Output is typically NumPy arrays
  • Open-source, constantly improving, and object-oriented
  • Objects are used to fit (train) models or to transform data
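
The fit/transform pattern above can be sketched with a small example. This is a minimal illustration, assuming a made-up DataFrame (the column names and values are hypothetical) and using StandardScaler as one possible transformer; it shows a DataFrame going in and a NumPy array coming out.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two numerical columns in a Pandas DataFrame.
df = pd.DataFrame({"age": [25, 35, 45, 55],
                   "income": [20.0, 40.0, 60.0, 80.0]})

scaler = StandardScaler()
scaler.fit(df)                 # learn each column's mean and standard deviation
scaled = scaler.transform(df)  # apply them; the default output is a NumPy array

print(type(scaled).__name__, scaled.shape)
```

Most scikit-learn objects follow the same convention: fit() learns from data, and transform() or predict() applies what was learned.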

Supervised Machine Learning

  • Learning from examples with known answers (labeled data)
  • Classification: assigning discrete categories or labels
  • Regression: predicting continuous real-valued numbers

Supervised ML Workflow

  • Randomly split data into training and testing sets
  • Train a machine learning model using the training data
  • Evaluate the model's performance on the test set
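
The three workflow steps can be sketched as follows. The synthetic data from make_classification and the choice of a decision tree are assumptions made for illustration only; any scikit-learn classifier would follow the same steps.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a real dataset (an assumption for this sketch).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# 1. Randomly split the data into training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# 2. Train a model on the training data only.
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# 3. Score the model on the held-out test set (accuracy for classifiers).
test_accuracy = model.score(X_test, y_test)
print(test_accuracy)
```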

K-fold Cross-validation

  • Divides the training data into K folds
  • Iterates through the K folds, using each fold in turn as validation data and the remaining folds for training
  • Scores the model on the validation data in each iteration
  • Gives a more reliable estimate of how well the model generalizes to unseen data; K-fold scores can be more accurate than a single train/test random split when the training data is small
  • The best value for K is situational; too small, and the model may not generalize; too large, and it takes longer to train
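
A minimal sketch of K-fold cross-validation with K = 5 is shown here. The synthetic regression data and the LinearRegression model are assumptions chosen only to keep the example short; they also illustrate that K-fold is not limited to classification.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic training data (an assumption for this sketch).
X_train, y_train = make_regression(
    n_samples=100, n_features=4, noise=10.0, random_state=0)

# K = 5: each of the 5 folds is used once as the validation fold.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

# One score per fold; for regressors the default score is R-squared.
scores = cross_val_score(LinearRegression(), X_train, y_train, cv=kfold)
print(scores)
print(scores.mean())
```

The mean of the per-fold scores is usually reported as the cross-validation score.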

Stratified K-fold Cross-validation

  • Maintains the proportion of class labels in the training and validation folds during K-fold Cross-validation
  • Improves the handling of unbalanced data sets
  • Used automatically by cross_val_score() when the model is a classifier and cv is an integer (or left at its default)
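
To see what stratification does, the sketch below splits a hypothetical unbalanced label vector (90 negatives, 10 positives) with StratifiedKFold and counts the positives in each validation fold. The labels and the dummy feature matrix are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical unbalanced labels: 90% class 0, 10% class 1.
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((len(y), 1))  # dummy features; only y matters for the split

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Count how many positive labels land in each validation fold.
positives_per_fold = [int(y[val_idx].sum()) for _, val_idx in skf.split(X, y)]
print(positives_per_fold)
```

With 10 positives and 5 folds, each validation fold should receive the same share of positives, which keeps the 10% class proportion in every fold.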

Supervised ML Model Building

  • Decision trees: learning decision points/branches
  • Neural networks (MLP): learning weights of neurons

Supervised ML Model Evaluation

  • Classifiers: accuracy
  • Regression: R² (R-squared)
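
The classifier metrics defined in the quiz above (accuracy, precision, recall, F1 score) can all be computed from the four confusion-matrix counts. The counts below are hypothetical values chosen for illustration, not taken from the lecture.

```python
# Hypothetical confusion-matrix counts for a binary classifier.
tp, fp, fn, tn = 4, 2, 3, 1

accuracy = (tp + tn) / (tp + tn + fp + fn)   # correct predictions / all predictions
precision = tp / (tp + fp)                   # true positives / predicted positives
recall = tp / (tp + fn)                      # true positives / actual positives
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(round(accuracy, 2), round(precision, 2), round(recall, 2), round(f1, 2))
```

Because F1 is a harmonic mean, it always lies between precision and recall, so it cannot be higher than both.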

Random Forests

  • Ensemble of decision trees
  • Random decisions when building the trees
  • Many trees combined
  • Reduces overfitting by combining the predictions of many trees (a vote for classification, an average for regression)
  • Measures feature importance
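
A short sketch of a Random Forest in scikit-learn follows. The synthetic data and the choice of 50 trees are assumptions for illustration; the fitted individual trees are stored in the estimators_ attribute.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for a real dataset (an assumption for this sketch).
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

# n_estimators sets how many decision trees the ensemble combines.
forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X, y)

n_trees = len(forest.estimators_)    # the individual fitted trees
predictions = forest.predict(X[:5])  # combined prediction of all trees
print(n_trees, predictions)
```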

Credit Card Default Example

  • Dataset used for demonstration purposes, with 30,000 rows (unbalanced)
  • Using value_counts() gives the breakdown of the default variable, which should be considered before modeling
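
Checking the class balance with value_counts() might look like the sketch below. The ten-row Series is a hypothetical stand-in for the dataset's default column, not the real data.

```python
import pandas as pd

# Hypothetical stand-in for the default column (1 = default, 0 = no default).
default = pd.Series([0] * 8 + [1] * 2, name="default")

counts = default.value_counts()                # absolute count per class
shares = default.value_counts(normalize=True)  # proportion per class

print(counts)
print(shares)
```

A strongly skewed breakdown suggests using stratified cross-validation and looking beyond accuracy when evaluating the model.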

Feature Importance in Random Forests

  • Important features have a higher impact on the model's predictions
  • Available as forest.feature_importances_ (calculated after training the model)
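
A minimal sketch of reading feature importances from a trained forest is given below. The synthetic data (5 features, 2 of them informative) is an assumption for illustration; the variable name forest follows the notes.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with 5 features, only 2 of which carry signal (an assumption).
X, y = make_classification(n_samples=300, n_features=5, n_informative=2,
                           n_redundant=0, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# One importance value per feature; the values sum to 1.
importances = forest.feature_importances_
ranking = importances.argsort()[::-1]  # feature indices, most important first

print(importances)
print(ranking)
```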

Hyper-parameter Optimization

  • Parameters are learned from the training data, whereas hyperparameters are set before training and need additional tuning
  • Methods: Grid Search and Random Search
  • Optimization algorithms can also be used to find the combination of hyperparameters that maximizes the cross-validation score
  • Tools for this include SMAC, irace, and scikit-optimize (skopt)
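
Grid Search can be sketched with scikit-learn's GridSearchCV, which tries every combination in a grid and keeps the one with the best cross-validation score. The grid values, the synthetic data, and the 3-fold setting are assumptions for illustration; RandomizedSearchCV offers the Random Search counterpart.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for a real dataset (an assumption for this sketch).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Hypothetical grid: 2 x 2 = 4 hyperparameter combinations.
param_grid = {"n_estimators": [10, 50], "max_depth": [2, 4]}

search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X, y)  # runs 3-fold cross-validation for every combination

print(search.best_params_)
print(search.best_score_)
```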

Preprocessing with Cross-validation

  • Data transformations should be performed within the model-building step for each fold in cross-validation (not before)
  • Avoids data leakage, where the transformation uses information from the validation fold and the model evaluation benefits from it, making the score overly optimistic

Pipelines

  • Combining preprocessing steps and machine learning models into a single object
  • Helps with data transformations and avoiding data leakage during model evaluation.
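
The sketch below shows one way to combine a preprocessing step and a model in a Pipeline so that the scaler is refitted on the training folds only in each cross-validation iteration. StandardScaler, LogisticRegression, and the synthetic data are assumptions chosen for brevity, not the lecture's own example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a real dataset (an assumption for this sketch).
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

# Preprocessing and model bundled into a single estimator object.
pipeline = Pipeline([
    ("scale", StandardScaler()),      # fitted on the training folds only
    ("model", LogisticRegression()),  # trained on the scaled training folds
])

# In each iteration the whole pipeline is fitted without seeing the validation fold.
scores = cross_val_score(pipeline, X, y, cv=5)
print(scores.mean())
```

Scaling the full dataset before calling cross_val_score() would instead let validation data influence the scaler, which is the leakage described above.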

Description

Explore advanced concepts in machine learning tailored for business analytics. This quiz covers data acquisition, organization, analysis, and the application of scikit-learn for supervised learning techniques. Test your knowledge on classification, regression, and the workflow of machine learning models.
