Questions and Answers
Machine Learning with scikit-learn does not require any libraries such as NumPy or Matplotlib.
False (B)
What is the primary output format of data processed using scikit-learn?
NumPy
Machine Learning requires _______ data preprocessing before analysis.
raw
Match the following data sources to their types:
What type of data does scikit-learn accept as input?
Who is the professor for the Programming in Python course?
What is the formula for calculating Accuracy?
The Recall is calculated as the ratio of true positives to the total actual positives.
What is the F1 score for the given model?
The ratio of predicted positives that are actual positives is called __________.
Match the terms with their descriptions:
What does the value '0.57' represent in this context?
How many instances are classified correctly in this model?
The F1 score in this case is higher than both precision and recall.
What happens if K is set too small in K-fold Cross-validation?
Increasing K in K-fold Cross-validation always improves model accuracy.
What method should be used in K-fold Cross-validation when classes are unbalanced?
In Stratified K-fold Cross-validation, the proportion of __________ labels is maintained within each fold.
Match the following terms with their definitions:
Which library function can be used to implement stratified cross-validation in Python?
K-fold Cross-validation is only applicable to classification problems.
Identify one potential disadvantage of using a very large K value in K-fold Cross-validation.
What is the purpose of the model.fit() function in supervised learning?
Neural networks learn decision points and branches when modeling.
What are the two types of predictions made by classifiers and regression models?
Random forests are a type of __________ learning model.
What scoring metric is commonly used for regression models?
Confusion matrices are used to assess the performance of regression models.
What is the main goal of supervised machine learning?
Classification tasks in machine learning predict a real-valued number.
What is the purpose of the train/test random split in machine learning?
In machine learning, __________ is used to validate a model's performance by dividing the training data into K subsets.
Match the following machine learning terms with their descriptions:
What does K represent in K-fold cross-validation?
In supervised machine learning, the terms 'X' and 'y' typically represent the input and output data respectively.
Define classification in the context of machine learning.
The output layer of a neural network is where __________ are generated.
Match the machine learning stages with their correct sequence:
What does the term 'hidden layer' refer to in a neural network?
Regression tasks involve assigning discrete labels to data.
Explain the main difference between supervised and unsupervised machine learning.
The training dataset in machine learning is commonly denoted as __________.
What does the 'Random' in Random Forests refer to?
Random Forests consists of a single decision tree for classification and regression.
What is the primary purpose of an ensemble of decision trees in Random Forests?
In Random Forests, classification is based on __________ and regression is based on _________.
Match the following terms related to Random Forests with their meanings:
Which of the following statements about Random Forests is true?
Random Forests can only be used with continuous data.
Name one advantage of using Random Forests over a single decision tree.
Flashcards
Precision
The ratio of correctly predicted positive instances to the total number of instances predicted as positive.
Recall
The ratio of correctly predicted positive instances to the total number of actual positive instances.
Accuracy
The ratio of correctly classified instances to the total number of instances.
F1-score
Confusion Matrix
True Positive (TP)
True Negative (TN)
False Positive (FP)
Raw Data
Data Tidying
Tabular Data
Numerical Data
Categorical Data
Ordinal Data
Preprocessing
Data Analysis
Supervised ML
Unsupervised ML
Classification
Regression
Train/Test Split
Test Data
Score (Model)
K-fold Cross-validation
Validation Set
Input Layer
Hidden Layer
Output Layer
Neural Network
Random Forests
Decision Tree
Ensemble
Randomness in Random Forests
Voting or Averaging
Information Gain
Variance
Overfitting
Supervised Learning
Model Training
Hyperparameter Optimization
Machine Learning Pipeline
K's impact: too small
K's impact: too large
Stratified K-fold
Why is stratification needed?
stratify=y_labels
cross_val_score()
Performance evaluation with K-fold
Study Notes
Course Information
- Course Title: Programming in Python for Business Analytics
- Course Code: BMAN73701
- Week: 5, Lecture 2
- Topic: Advanced Machine Learning
Data Analysis Process
- Data acquisition from raw sources (databases, web, Excel, APIs)
- Raw data tidied and organized into tabular data (numerical, categorical, ordinal)
- Data analysis through summary statistics, analysis, and visualizations.
Machine Learning with scikit-learn
- Built on top of NumPy and Matplotlib
- Input data can be NumPy arrays or pandas DataFrames
- Output is typically NumPy arrays
- Open-source, constantly improving, and object-oriented
- Used to fit (train) or transform data
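The input and output conventions above can be illustrated with a small sketch. The toy DataFrame and its column names are made up for illustration, and StandardScaler is just one convenient transformer to show the fit/transform pattern with default settings.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data; the column names are invented for this example.
df = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "income": [30_000, 45_000, 80_000, 62_000],
})

scaler = StandardScaler()
# fit_transform learns the column means and standard deviations, then applies them.
scaled = scaler.fit_transform(df)  # a pandas DataFrame is accepted as input

print(type(scaled))  # with default settings the result is a NumPy array
print(scaled.shape)
```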
Supervised Machine Learning
- Learning from labeled examples (inputs paired with known answers)
- Classification: assigning discrete categories or labels
- Regression: predicting continuous real-valued numbers
Supervised ML Workflow
- Randomly split data into training and testing sets
- Train a machine learning model using the training data
- Evaluate the model's performance on the test set
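The three workflow steps can be sketched as follows. The synthetic data from make_classification, the 25% test size, and the choice of a decision tree are assumptions made for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: X holds the inputs, y the known labels.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Step 1: randomly split into training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Step 2: train the model on the training data only.
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# Step 3: evaluate on the held-out test set (accuracy for classifiers).
test_accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.2f}")
```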
K-fold Cross-validation
- Divides training data into k-folds
- Iterates through k-folds, using each fold as validation data
- Scores the model on validation data for each iteration
- Gives a more reliable estimate of how well the model generalizes to unseen data; K-fold scores can be more accurate than a single train/test random split when the training data is small
- The best value for K is situational; too small, and the score may be a poor estimate of how the model generalizes; too large, and evaluation takes longer because the model is trained K times
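A minimal sketch of K-fold scoring with cross_val_score, assuming synthetic data and K = 5 (both chosen only for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data.
X_train, y_train = make_classification(n_samples=300, n_features=10, random_state=0)

model = DecisionTreeClassifier(random_state=0)

# cv=5 trains and scores the model five times, each time holding out a different fold.
scores = cross_val_score(model, X_train, y_train, cv=5)

print(scores)         # one validation score per fold
print(scores.mean())  # the average is usually reported as the cross-validation score
```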
Stratified K-fold Cross-validation
- Maintains the proportion of class labels within each fold during K-fold Cross-validation
- Improves the handling of unbalanced data sets
- Automatically used in cross_val_score() when the model is a classifier
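To see what stratification does, the sketch below builds a hypothetical unbalanced label vector (80% class 0, 20% class 1) and checks the share of positives in every validation fold. The data and the choice of five folds are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical unbalanced labels: 80 negatives and 20 positives.
y = np.array([0] * 80 + [1] * 20)
X = np.arange(100).reshape(-1, 1)  # placeholder feature column

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

positive_rates = []
for train_idx, val_idx in skf.split(X, y):
    # Share of positive labels in this validation fold.
    positive_rates.append(y[val_idx].mean())

print(positive_rates)  # each fold keeps the overall 20% share of positives
```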
Supervised ML Model Building
- Decision trees: learning decision points/branches
- Neural networks (MLP): learning weights of neurons
Supervised ML Model Evaluation
- Classifiers: accuracy
- Regression: R² (coefficient of determination)
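The evaluation metrics can be worked through on a small example. The labels and predictions below are invented for illustration; the formulas follow the flashcard definitions of accuracy, precision, and recall, with the F1-score taken as the harmonic mean of precision and recall.

```python
from sklearn.metrics import confusion_matrix, r2_score

# Hypothetical true labels and classifier predictions.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)  # correct / all instances
precision = tp / (tp + fp)                  # correct positives / predicted positives
recall = tp / (tp + fn)                     # correct positives / actual positives
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(accuracy, precision, recall, f1)  # 0.8 0.75 0.75 0.75

# Regression models are scored with R² instead (made-up values).
r2 = r2_score([3.0, 5.0, 7.0], [2.5, 5.0, 7.5])
print(r2)  # 0.9375
```

Because the F1-score is a harmonic mean, it always lies between precision and recall and cannot exceed both.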
Random Forests
- Ensemble of decision trees
- Random decisions when building the trees
- Many trees combined
- Reduces overfitting by averaging predictions from multiple trees
- Measures feature importance
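The ensemble idea can be made concrete with a short sketch. The synthetic data and the choice of 50 trees are assumptions; the sketch relies on scikit-learn's RandomForestClassifier combining its trees by averaging their predicted class probabilities, which it then reproduces by hand.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X_train, y_train)

# Every fitted tree of the ensemble is available in estimators_.
tree_probas = np.array([tree.predict_proba(X_test) for tree in forest.estimators_])

# Average the per-tree class probabilities by hand.
manual_average = tree_probas.mean(axis=0)

print(len(forest.estimators_))
print(np.allclose(manual_average, forest.predict_proba(X_test)))
```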
Credit Card Default Example
- Dataset used for demonstration purposes, with 30,000 rows (unbalanced)
- Using value_counts() gives the breakdown of the default variable, which should be considered before modeling
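A sketch of the class-balance check with pandas. The DataFrame below is a small hypothetical stand-in, not the actual credit card data, and the column name "default" and the 78/22 split are assumed for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the default column (1 = default, 0 = no default).
df = pd.DataFrame({"default": [0] * 78 + [1] * 22})

counts = df["default"].value_counts()                     # absolute counts per class
proportions = df["default"].value_counts(normalize=True)  # shares per class

print(counts)
print(proportions)
```

A strongly uneven breakdown is a signal to use stratified splits and to look beyond accuracy when scoring the model.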
Feature Importance in Random Forests
- Important features have a higher impact on the model's predictions
- Available as forest.feature_importances_ (calculated after training the model)
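A minimal sketch of reading feature importances from a fitted forest. The synthetic data (six features, two of them informative) and the variable name forest are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: only 2 of the 6 features carry signal.
X, y = make_classification(
    n_samples=500, n_features=6, n_informative=2, n_redundant=0, random_state=0
)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)  # the attribute below exists only after fitting

importances = forest.feature_importances_

# List the features from most to least important.
for idx in np.argsort(importances)[::-1]:
    print(f"feature {idx}: {importances[idx]:.3f}")
```

The importances are normalized to sum to 1, so each value can be read as a feature's share of the total.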
Hyper-parameter Optimization
- Parameters are learned from the training data, whereas hyperparameters are set before training and need additional tuning
- Methods: Grid Search and Random Search
- Optimization algorithms used to find the best combination of hyperparameters maximizing the cross-validation score
- Tools such as SMAC, irace, and scikit-optimize (skopt)
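Grid Search and Random Search are both available in scikit-learn. The sketch below assumes synthetic data and a deliberately tiny, made-up search space over two random forest hyperparameters, so that it runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Hypothetical search space: 2 x 2 = 4 combinations.
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}

# Grid Search: tries every combination, scoring each with 3-fold cross-validation.
grid = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)

# Random Search: samples only n_iter combinations from the same space.
rand = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_grid,
    n_iter=3,
    cv=3,
    random_state=0,
)
rand.fit(X, y)
print(rand.best_params_, rand.best_score_)
```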
Preprocessing with Cross-validation
- Data transformations should be performed within the model-building step for each k-fold in Cross-validation (not before)
- Avoids data leakage, where information from the validation fold influences the fitted transformation and the evaluation score becomes overly optimistic
Pipelines
- Combining preprocessing steps and machine learning models into a single object
- Helps with data transformations and avoiding data leakage during model evaluation.
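A minimal sketch of a pipeline evaluated with cross-validation. StandardScaler and LogisticRegression are illustrative choices, not models named in the notes, and the step names "scale" and "clf" are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Preprocessing and model combined into a single estimator object.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# In each iteration the scaler is fitted on the training folds only,
# so no statistics from the validation fold leak into the model.
scores = cross_val_score(pipe, X, y, cv=5)

print(scores.mean())
```

Scaling the whole dataset once before calling cross_val_score() would instead let every validation fold contribute to the scaler's mean and standard deviation.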