Machine Learning Fundamentals Quiz
46 Questions

Questions and Answers

What is the primary purpose of feature selection in the data preprocessing phase?

  • To improve the accuracy of the model
  • To reduce the training time of the algorithm
  • To eliminate irrelevant features
  • All of the above (correct)

Untrained algorithms are used during the deployment phase.

False (B)

What are the outputs of a supervised machine learning algorithm?

Labels

During the prediction phase, new inputs are provided to a __________ machine learning algorithm.

trained

Match the phases of machine learning with their corresponding activities:

  • Data Preprocessing = Feature Selection
  • Training Phase = Model Training
  • Deployment Phase = Model Prediction
  • Input Phase = Providing Features

What is the primary purpose of PCA in data analysis?

To transform data into fewer uncorrelated components (D)

PCA can require the number of components to be specified in advance.

True (A)

Name one challenge associated with interpreting PCA components.

It is often hard to understand what components represent.

PCA primarily helps in visualizing __________ data.

high-dimensional

Match the following terms related to PCA with their descriptions:

  • Principal Component = A direction in the feature space along which the data varies the most
  • Variance = The measure of how much values differ from the mean
  • Dimensionality Reduction = The process of reducing the number of features while retaining essential information
  • Uncorrelated Features = Features that do not influence each other

What does PCA aim to achieve by transforming data?

To better capture the relationships between original features (A)

PCA is guaranteed to provide a clear interpretation of the resulting components.

False (B)

What kind of features does PCA produce?

Uncorrelated features

What does K-fold cross-validation do?

It helps in reducing overfitting by validating multiple splits. (D)

The output layer of a neural network has no influence on the predictions made by the model.

False (B)

What is the purpose of the 'MLPRegressor' in the provided content?

To create a multi-layer perceptron regressor for training a machine learning model.

Match the following terms with their descriptions:

  • K-fold cross-validation = A method to validate a model by splitting data into K subsets
  • MLPRegressor = A neural network model used for regression tasks
  • Training Data = Data used to fit the machine learning model
  • Validation Data = Data used to assess the model's performance

Which parameter was set to 500 in MLPRegressor?

Max iterations (D)

Using a single fold for validation can give a more accurate performance score than K-fold cross-validation.

False (B)

What is the effect of increasing the number of hidden layers in an MLPRegressor?

It can improve the model's ability to learn complex patterns, but may also lead to overfitting.

What is a dataset?

A collection of numerical and/or categorical values (C)

An observation groups values from different variables for multiple items.

False (B)

What programming libraries is scikit-learn built on top of?

NumPy and Matplotlib (scikit-learn also depends on SciPy)

Scikit-learn is ___-source, free to use and contribute.

open

Which of the following describes an observation?

Values of several variables for the same object (D)

Scikit-learn requires data input to be in the form of a Pandas DataFrame or Numpy array.

True (A)

What type of programming paradigm does scikit-learn follow?

Object-oriented

The score of the decision tree model on the test set is lower than its cross-validation score.

False (B)

The actual classes of the test set were: [2, 1, 0, 1, 0]. The predicted values for these classes are [____, ____, ____, ____, ____].

'C', 'C', 'A', 'B', 'A'

Match the following classes with their corresponding predicted values:

  • Class A = Predicted: 'A'
  • Class B = Predicted: 'B'
  • Class C = Predicted: 'C'

Which class had the highest predicted value?

Class C (C)

How many samples were used for the analysis?

40

The value corresponding to Class A is the highest among the values provided.

False (B)

What is the primary goal of supervised machine learning?

To predict outcomes based on labeled training data (D)

Unsupervised machine learning relies on labeled training data.

False (B)

What is the purpose of cross-validation in machine learning?

To assess how the results of a statistical analysis will generalize to an independent data set.

In supervised learning, we use _______ data for training the model.

labeled

Match the machine learning techniques with their definitions:

  • Supervised Learning = Learning from labeled data
  • Unsupervised Learning = Learning from unlabeled data
  • Cross-validation = Method to validate model performance
  • Overfitting = Model is too complex and fits noise

Which of the following actions is NOT part of data preprocessing?

Training the ML model (C)

Underfitting occurs when a model is too complex for the given data.

False (B)

What is overfitting in machine learning?

When a model learns noise from the training data rather than the underlying pattern.

The _______ data is used to evaluate the performance of the trained model.

test

Match the components of a machine learning model with their roles:

  • Training Data = Data used to train the model
  • Test Data = Data used to evaluate model performance
  • Model = The algorithm that makes predictions
  • Features = Input variables for the model

Which of the following best describes the bias-variance trade-off?

It is the balance between errors due to bias and variance in models (B)

Feature selection can help improve the performance of a machine learning model.

True (A)

What does the process of standardization refer to in data preprocessing?

Transforming data to have a mean of zero and a standard deviation of one.

Flashcards

Data preprocessing

The process of preparing data for machine learning models.

Feature selection

Choosing the most important features from the data.

Supervised machine learning

A type of machine learning where the model learns from labeled data.

ML algorithm

The specific method used to train the machine learning model.

Deployment

Using the trained model to make predictions on new data.

Principal Component Analysis (PCA)

A technique to reduce the number of features in data while keeping most of the variance.

Uncorrelated features

Features that have no linear relationship with each other (zero correlation). Uncorrelated features are not necessarily statistically independent.

Variance

The spread or variation in data values.

High-dimensional data

Data with many features.

Components

New features created from existing ones to capture most of the variance or information.

PCA limitations

Number of components must be predetermined and components' meaning may not be clear.

PCA application

Reducing the number of features in data and visualizing complex, high-dimensional data.

Decision Tree

A machine learning model represented as a tree structure, where each node represents a feature, each branch represents a decision based on that feature, and each leaf node represents a prediction.

Cross-validation (CV)

A technique to assess the performance of a machine learning model by splitting the data into multiple folds, training the model on some folds and testing it on the remaining folds. This process is repeated, averaging the results to obtain a more reliable performance estimate.

5-fold CV

A specific type of cross-validation where the data is split into five folds, training the model on four folds and testing it on the remaining fold. This process is repeated five times, each time using a different fold as the test set, and the results are averaged.

Decision Tree Score

A measure of how accurately a decision tree model predicts the labels of the data. It's usually represented as a percentage or a decimal.

Predict

To use a trained machine learning model to make a prediction about the label of a new data point.

Actual

The true label of a data point, compared to the model's prediction.

Test Set

A portion of data reserved for evaluating the performance of a trained machine learning model. It hasn't been used during training.

Score on Test Set

The accuracy of the machine learning model on the test set, indicating its performance on unseen data.

Dataset

A collection of data values, typically organized in rows and columns, representing a collection of observations or instances.

Observation

A single row in a dataset that represents a single instance or unit of data. Each observation contains the values for all the variables measured in the dataset.

Scikit-learn

A popular Python library used for machine learning. It provides a wide range of algorithms for classification, regression, clustering, and more.

NumPy

A fundamental Python library for numerical computing, providing support for arrays, matrices, and mathematical operations.

Matplotlib

A powerful Python library for creating static, animated, and interactive visualizations. It is widely used for creating plots and graphs from data.

Pandas DataFrame

A two-dimensional data structure in Pandas, similar to a spreadsheet, that allows efficient storage, manipulation, and analysis of data.

Object-Oriented

A programming paradigm where data and the operations that act on that data are bundled together as objects. This allows for code reusability and modularity.

What is supervised ML?

A type of machine learning where the model learns from labeled data, meaning it has both inputs (features) and corresponding outputs (labels).

What is the purpose of data preprocessing?

Preparing data for machine learning by cleaning, transforming, and selecting relevant features, making the data suitable for the model.

What does 'feature selection' do?

Choosing the most relevant features from the dataset to improve model performance and reduce complexity.

What is the 'training' process?

Feeding the labeled training data to the ML algorithm, allowing it to learn the relationship between inputs (features) and outputs (labels).

What is the 'test' phase?

Evaluating the trained model on unseen data (test set) to measure its performance and generalization ability.

What does 'overfitting' mean?

When the model performs well on the training data but poorly on unseen data, failing to generalize to new situations.

What does 'underfitting' mean?

When the model fails to capture the underlying patterns in the data, resulting in poor performance on both training and unseen data.

What is the 'bias-variance trade-off'?

A balancing act in model selection, where reducing bias can increase variance and vice versa.

What is the 'test (validation) data' used for?

Evaluating the trained model's performance on data it has never seen during training.

What is the purpose of 'remove missing value' in data preprocessing?

Handling incomplete data by removing rows or columns with missing values to avoid errors in subsequent steps.

Why 'select only relevant features' in data preprocessing?

Focusing on features with a strong relationship to the output, enhancing model accuracy and reducing complexity.

How does the ML model 'learn the relationship' during training?

The model identifies patterns and relationships between inputs and outputs in the training data.

What is 'predicted outputs'?

The model's predictions for the outputs based on new inputs in the test data.

What is the 'output layer'?

The final layer of a neural network where the model makes predictions.

K-fold Cross-Validation

A technique for evaluating a machine learning model's performance by dividing the data into K folds, training the model on K-1 folds, and evaluating it on the remaining fold. This process is repeated K times, each time using a different fold as the test set, and the results are averaged to get a more reliable estimate.

What does the 'CV score' represent?

The CV score represents the average performance of the machine learning model across all the K folds in the cross-validation process. A higher CV score generally indicates a better model.

Why use K-fold cross-validation?

K-fold cross-validation is used to detect overfitting and obtain a more robust estimate of the model's performance on unseen data. Because every sample is used for validation exactly once, it gives a more balanced view of the model's performance across different parts of the dataset than a single split.

What does 'input layer' represent?

The input layer in a neural network receives the raw data that is fed into the model. This data is then processed through the network's hidden layers and finally outputs a result.

What is the role of the 'hidden layer'?

Hidden layers in a neural network perform complex calculations on the input data and transform it into a representation that is easier for the network to understand and learn from.

What does the 'output layer' produce?

The output layer of a neural network produces the final prediction or result based on the processing done by the hidden layers. It represents the model's output after all the data transformations.

What is the advantage of using a hidden layer with a large number of units in a neural network?

A hidden layer with a large number of units allows the neural network to learn more complex patterns from the data. It provides more flexibility to the network as it can model more intricate relationships between the input features and the output.

What is the drawback of using a hidden layer with a large number of units in a neural network?

A hidden layer with a large number of units can make the neural network more prone to overfitting the training data. This might lead to poor performance on unseen data.

Study Notes

Introduction to Machine Learning

  • Machine learning is about building models from data to identify patterns or predict future samples
  • Machine learning overlaps with predictive analytics, statistical learning, and related fields
  • Machine learning is not the same as artificial intelligence

What is Data?

  • A dataset is a collection of numerical or categorical values
  • A variable is an attribute, criterion, feature, or dimension that is measured consistently across observations
  • An observation is the values of several variables for a single item, person, unit, etc.

Machine Learning with Scikit-learn

  • Built on NumPy, SciPy, and Matplotlib
  • Input can be a NumPy array or a Pandas DataFrame; output is a NumPy array
  • Open-source, free to use and contribute
  • Continuously updated
  • Object-oriented approach: create objects, call methods to fit (train) or transform data
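
A minimal sketch of the create / fit / transform pattern described above, assuming scikit-learn and NumPy are installed. `StandardScaler` and the sample values are chosen here for illustration and do not come from the lecture.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: 3 observations (rows), 2 variables (columns).
X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])

scaler = StandardScaler()       # 1. create the object
scaler.fit(X)                   # 2. fit: learn each column's mean and standard deviation
X_scaled = scaler.transform(X)  # 3. transform: apply what was learned

print(type(X_scaled))           # the output is a NumPy array
print(X_scaled.mean(axis=0))    # each column now has mean ~0
```

Most scikit-learn estimators follow the same shape: the object stores what it learned in `fit`, and a later call (`transform` or `predict`) reuses it on new data.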

Unsupervised ML

  • Goal is to learn something from data without knowing answers
  • Data preprocessing and feature selection are crucial steps
  • Algorithm examples: K-means, hierarchical clustering (unsupervised classification), Principal Components Analysis (dimensionality reduction), and some neural networks

K-Means Clustering

  • Divides data into k disjoint clusters, each with a center (centroid) that minimizes distance to its members
  • Very well-known algorithm
  • High-quality implementations
  • Handles large datasets well
  • Assumes clusters are convex and isotropic
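
As a rough illustration of the points above, the sketch below clusters six made-up 2-D points with scikit-learn's `KMeans`. The data, `n_clusters=2`, and the `random_state` are assumptions made for the example only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two made-up, well-separated groups of 2-D points.
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.8, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.9, 8.3]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)       # cluster index assigned to each point
centroids = kmeans.cluster_centers_  # one centre per cluster

print(labels)
print(centroids)  # each centroid is the mean of the points in its cluster
```

Which group receives label 0 and which receives label 1 is arbitrary; only the grouping itself is meaningful.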

Clustering example

  • Data is shown for stores grouped by type, size and mean sales

Principal Components Analysis (PCA)

  • Transforms data to have fewer uncorrelated features that explain most data variance
  • Useful for visualizing high-dimensional data and reducing features
  • Number of components must be specified
  • Components can be hard to interpret
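
The sketch below shows these properties on synthetic data, assuming scikit-learn's `PCA`. The data is invented for the example: three features, two of which are strongly correlated.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Made-up data: 100 samples, 3 features; the second feature closely tracks the first.
a = rng.normal(size=100)
X = np.column_stack([a, 2 * a + 0.1 * rng.normal(size=100), rng.normal(size=100)])

pca = PCA(n_components=2)         # the number of components is specified up front
X_reduced = pca.fit_transform(X)  # 3 features -> 2 components

print(X_reduced.shape)
print(pca.explained_variance_ratio_)          # share of variance kept by each component
print(np.corrcoef(X_reduced, rowvar=False))   # off-diagonal entries are ~0: uncorrelated
```

Each component is a weighted mix of all original features (see `pca.components_`), which is why a component rarely has a direct real-world meaning.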

Unsupervised Methodology

  • Fit the model to the training data
  • Transform the test data using fitted model
  • The model predicts a representation of the test data

Supervised ML

  • Goal is to learn the relationship between input and output data from labeled examples
  • Models: Multi-layer perceptron (neural network, regression); Decision trees (classification)

Deep (Artificial) Neural Networks

  • More layers mean higher capacity (prediction power)
  • Harder to train
  • Deep learning is a form of this

Supervised: fit, transform, predict

  • Train the model by learning the relationships between x and y where x is input and y is output
  • Build the model
  • Predict values for unseen data (test)

MLP Regression

  • The lecture's Python code shows how to fit and score the models.
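
The lecture code itself is not reproduced on this page. The following is a hedged sketch of what fitting and scoring an `MLPRegressor` can look like: `max_iter=500` matches the quiz answer above, while the synthetic data, `hidden_layer_sizes=(20,)`, and the `random_state` values are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))  # made-up inputs
y = X[:, 0] + 2 * X[:, 1]              # made-up target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# hidden_layer_sizes is an illustrative choice, not taken from the lecture.
model = MLPRegressor(hidden_layer_sizes=(20,), max_iter=500, random_state=0)
model.fit(X_train, y_train)            # train on the training split
y_pred = model.predict(X_test)         # predict unseen data
test_r2 = model.score(X_test, y_test)  # R^2 on the test split (1.0 is a perfect fit)

print(test_r2)
```

For regressors, `score` returns the coefficient of determination R², so values close to 1 indicate a good fit on the data passed in.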

Underfitting / Overfitting

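  • Underfitting and overfitting are both cases where the model does not represent the data well: an underfit model is too simple to capture the underlying pattern and performs poorly on both training and unseen data, while an overfit model is too complex, fits noise in the training data, and performs poorly on unseen data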
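
One way to see both effects is to compare training and test scores as model complexity grows. The sketch below uses a `DecisionTreeRegressor` on noisy synthetic data; the model choice, the data, and the depths tried are all assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)  # signal plus noise

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scores = {}
for depth in (1, 4, None):  # None: grow until the training set is memorised
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    scores[depth] = (tree.score(X_train, y_train), tree.score(X_test, y_test))
    print(depth, scores[depth])  # (training R^2, test R^2)
```

A very shallow tree scores low on both splits (underfitting). The unrestricted tree scores essentially perfectly on the training split but worse on the test split, which is the signature of overfitting.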

K-fold Cross-validation

  • Used to get a more accurate estimate of model performance
  • The data is split into training and testing data
  • The training data is then further split into K folds
  • Trains on K-1 folds and validates on the remaining fold, repeating K times so that each fold is used for validation exactly once
  • The scores are averaged to create a more accurate assessment
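
The splitting logic can be sketched in plain Python. The helper `k_fold_indices` below is hypothetical and written only for this example; it is not a scikit-learn function.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs; each sample is validated exactly once."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


folds = list(k_fold_indices(n_samples=10, k=5))
for train_idx, val_idx in folds:
    print("train:", train_idx, "validate:", val_idx)
```

In practice, scikit-learn's `KFold` and `cross_val_score` in `sklearn.model_selection` handle the splitting, fitting, and scoring, and the K scores are then averaged.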

Stratified K-fold Cross-validation

  • Stratified K-fold is a modification of K-fold used for classification problems
  • Ensures that the proportion of class labels is roughly the same in every fold, and therefore in each training and validation split
  • Useful when the training data has imbalanced classes
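
A small sketch with scikit-learn's `StratifiedKFold`, using made-up labels with a 4:1 class imbalance, shows the class ratio being preserved in every validation fold. The label counts and `n_splits=4` are assumptions chosen so the split works out evenly.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Made-up imbalanced labels: 16 samples of class 0, 4 samples of class 1.
y = np.array([0] * 16 + [1] * 4)
X = np.zeros((len(y), 1))  # feature values do not affect how the split is made

skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
fold_counts = []
for train_idx, val_idx in skf.split(X, y):
    counts = np.bincount(y[val_idx], minlength=2)  # [n_class_0, n_class_1]
    fold_counts.append(counts.tolist())
    print("validation class counts:", counts)
```

Every validation fold contains four samples of class 0 and one of class 1, the same 4:1 ratio as the full dataset. A plain K-fold split gives no such guarantee and could leave a fold with no minority-class samples.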

Decision Trees

  • Learn a hierarchy of if/else questions to classify outputs
  • Starts at root node and answers questions that eventually reach a leaf node with an output label
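
To make the if/else hierarchy concrete, the sketch below fits a shallow `DecisionTreeClassifier` and prints its learned rules with `export_text`. The iris dataset bundled with scikit-learn and `max_depth=2` are assumptions chosen to keep the printed tree short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()  # small dataset bundled with scikit-learn

tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Print the learned questions from the root node down to the leaves.
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)

# Classifying a sample means following the questions from the root to a leaf.
prediction = tree.predict(iris.data[:1])
print(prediction)
```

Each indented line in the printed rules is one threshold question on a feature, and each `class:` line is a leaf with its output label.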

Next lecture

  • More advanced models, including score metrics and confusion matrices
  • Optimization of hyperparameters

Quiz Team

Description

Test your knowledge on the fundamental concepts of machine learning, including feature selection, outputs of supervised algorithms, and the phases of machine learning. This quiz covers essential topics crucial for understanding the data preprocessing phase and deployment activities.
