Machine Learning Concepts Quiz

Questions and Answers

What does the elbow method rely on to determine the optimal number of clusters?

  • The number of iterations required for the algorithm to converge
  • The distance between each data point and its closest centroid (correct)
  • The number of data points in each cluster
  • The variance of the data within each cluster

When is k-means clustering most effective?

  • When data is very close together
  • When data points are well separated (correct)
  • When data points are randomly distributed
  • When data points are evenly distributed throughout the feature space

What is a major limitation of k-means clustering?

  • It is computationally expensive
  • It is not suitable for high-dimensional data
  • It is sensitive to the initial placement of centroids (correct)
  • It is not suitable for categorical data

Which of the following is not a strength of the k-means algorithm?

  • It provides insights into the quality of clusters (correct)

What does the term 'means' refer to in the context of k-means clustering?

  • The average of the data points in a cluster (correct)

What is the primary goal of the SVM algorithm regarding data points in the feature space?

  • To find a hyperplane that classifies the data points into different classes. (correct)

What is the name given to the closest data points from each class that influence the position of the hyperplane?

  • Support vectors (correct)

Which of these is NOT a characteristic of a hard margin in the SVM algorithm?

  • It is a more robust approach to noisy data. (correct)

What is the dimension of a hyperplane in an N-dimensional space?

  • N - 1 (correct)

Which of these would be considered a suitable application for the use of Support Vector Machines?

  • Classifying emails as spam or not spam. (correct)

Which of these techniques is NOT a supervised learning approach?

  • Clustering (correct)

Which of these scenarios is likely to use a classification algorithm?

  • Identifying fraudulent transactions in a financial dataset. (correct)

What is the main goal of supervised learning?

  • To create a model that can predict a target variable based on input features. (correct)

Which of these is a common metric used to evaluate the performance of a classification model?

  • Accuracy (correct)

Which of these concepts is NOT directly related to linear regression?

  • Decision boundary (correct)

Underfitting in a linear regression model refers to a model that:

  • Is too simple and cannot capture the underlying relationship in the data. (correct)

What is the difference between 'classification' and 'regression' in machine learning?

  • Classification predicts a categorical output, while regression predicts a continuous output. (correct)

Which of these is NOT a key skill needed for working with machine learning?

  • Extensive knowledge of quantum mechanics (correct)

What does the parameter 'M' represent in the linear basis hypothesis space ℋ_M?

  • The number of basis functions φ_j used in the linear model (correct)

Which of the following best describes the purpose of the regularisation term in the linear regression model?

  • To prevent the model from overfitting the training data. (correct)

What is the purpose of the pseudoinverse Φ† in the linear regression model?

  • To handle cases where the matrix Φ^T Φ might be singular or nearly singular (correct)

What mathematical concept is used to minimize the Euclidean norm in the context of linear regression?

  • Least Squares (correct)

What is the purpose of the basis functions φ_j in the linear basis hypothesis space ℋ_M?

  • To transform the input data into a higher-dimensional space (correct)

How does the regularization parameter λ affect the outcome of linear regression?

  • A larger λ encourages a simpler model with a smaller number of parameters. (correct)

What is the difference between regularized and non-regularized linear regression?

  • Non-regularized models can overfit the training data, while regularized models try to prevent overfitting. (correct)

What is the primary goal of linear regression?

  • To predict a numerical output value based on input variables. (correct)

Based on the course organization, what is the intended use of the weekly self-evaluations?

  • To promote active learning and understanding of the material. (correct)

Which of the following is NOT a topic covered in the course lectures?

  • Genetic algorithms (correct)

What is the primary focus of "Machine learning for smart industry" within the context of the course?

  • How to use AI to improve manufacturing processes. (correct)

What is the likely goal of having two lecturers for the course?

  • To ensure a broader range of perspectives on AI in engineering. (correct)

The course description lists four topics covered in the tutorials. Based on the information provided, which of the following is NOT likely to be part of a tutorial?

  • Developing a deep learning model for image recognition. (correct)

Based on the information provided about the lecturers, which would be a more likely area for the course to delve into?

  • The application of AI in robotics. (correct)

Which of the following is a potential benefit of submitting assignments a week before they're due?

  • Students can get feedback on their assignments before the deadline and have time to revise. (correct)

Which of these is NOT a topic that could be included in "Unsupervised learning" lecture?

  • Reinforcement Learning (correct)

Which of the following is the correct expression for the linear kernel function?

  • k(x, x') = x^T x' (correct)

What is the feature map φ(x) corresponding to the kernel function k(x, x') = (1 + x^T x')^2?

  • φ(x) = (1, √2·x₁, √2·x₂, √2·x₁x₂, x₁², x₂²) (correct)

Which of the following is a symmetric positive definite (SPD) kernel function?

  • k(x, x') = x^T x' (correct)
  • k(x, x') = x^T x' + 1 (correct)

What is the main advantage of using kernel functions compared to explicit feature maps?

  • Kernel functions can be defined without explicitly constructing the feature map. (correct)

Which of the following kernel functions is commonly used in machine learning?

  • All of the above (correct)

What is the purpose of the regularisation parameter λ in kernel ridge regression?

  • To control the complexity of the model and prevent overfitting. (correct)

What is the Gram matrix G in kernel ridge regression?

  • A matrix where each element G_ij represents the kernel function evaluated at the i-th and j-th input vectors. (correct)

What is the hypothesis space ℋ for a kernel ridge regression model?

  • All functions that can be expressed as a weighted sum of the kernel evaluations between the input vectors and a set of support vectors. (correct)

Flashcards

Elbow Method

A technique to determine the optimal number of clusters in k-means clustering using the inertia metric.

K-Means Clustering

A clustering algorithm that partitions data into k groups based on the mean values of the clusters.

Inertia Metric

The mean squared distance between each data point and its closest centroid in k-means clustering.

Centroid Initialization

The initial random placement of k cluster centroids in k-means clustering.

Cluster Quality

K-means does not provide explicit information about the quality of the clusters formed.

Unsupervised Learning

A type of machine learning where algorithms learn from unlabeled data without supervision.

Classification

Process of categorizing data into predefined classes or labels.

Spam Classification

Automatically identifying emails as spam or not spam using algorithms.

Regression

A statistical method for predicting continuous values based on input variables.

Decision Trees

A model that uses a tree-like graph of decisions to classify data.

Random Forest

An ensemble method using multiple decision trees to improve classification accuracy.

Supervised Learning

A machine learning approach where models are trained using labeled data.

Linear Regression

A method used to model the relationship between a scalar dependent variable and one or more independent variables.

History of AI

The development and evolution of artificial intelligence over time.

Machine Learning

A subset of AI that enables systems to learn from data and improve over time without being explicitly programmed.

Support Vector Machines

A supervised learning model used for classification and regression tasks that finds the optimal separating hyperplane.

Neural Networks

A computational model inspired by the human brain that is designed to recognize patterns and learn from data.

Feature Selection

The process of selecting a subset of relevant features for use in model construction, enhancing model performance and efficiency.

Linear Basis Hypothesis Space

A space of functions that are linear in their parameters, built from M basis functions.

Input Vector Dimension

The dimension d of input vector x in linear regression.

Basis Function

A function that maps an input vector to a scalar output value, ℝ^d ⟶ ℝ.

Euclidean Norm

The 2-norm of a vector, √(x₁² + ... + xₙ²); the least squares approach minimizes this norm of the difference between predicted and actual values.

Pseudoinverse

A generalized matrix inverse used to find least-squares solutions in linear regression.

Regularization Term

A term added to the loss function to prevent overfitting in linear regression.

Strength of Regularization (λ)

A parameter that controls how strongly regularization is applied in the model.

Smallest Norm Solution

A solution approach that seeks to minimize the magnitude of the weight vector in linear regression.

Support Vector Machine (SVM)

A machine learning algorithm for classification and regression tasks.

Hyperplane

A boundary that classifies data points in feature space.

Support Vectors

Data points closest to the hyperplane that influence its position.

Maximizing Margin

The process of increasing the distance between the hyperplane and support vectors.

Applications of SVM

SVMs are used in text classification, image classification, spam detection, and more.

Kernel Function

A function that computes similarity as an inner product in a higher-dimensional feature space, without constructing the feature map explicitly.

Linear Kernel

A kernel defined as k(x, x') = x^T x', capturing linear relationships.

Polynomial Kernel

A kernel defined as k(x, x') = (1 + x^T x')^m, where m > 0.

Gaussian Kernel

A kernel defined as k(x, x') = exp(-||x - x'||^2 / (2σ^2)), also known as the RBF (radial basis function) kernel.

Feature Map

A transformation that represents data in higher dimensions for kernel methods.

Kernel Ridge Regression

Combines ridge regression with kernel methods, focusing on the hypothesis in transformed space.

Gram Matrix

A matrix G whose elements G_ij = k(x_i, x_j) are the kernel function evaluated at pairs of input vectors; it is used to compute predictions.

Empirical Risk Minimization

A principle that minimizes the average loss over a sample for model learning and prediction.

Study Notes

Machine Learning for Smart Industry

  • This course covers unsupervised learning, supervised learning, neural nets and deep learning, and reinforcement learning.

Unsupervised Learning

  • Unsupervised learning uses machine learning algorithms to analyze and cluster unlabeled data sets
  • These algorithms discover hidden patterns or data groupings without human intervention
  • Unsupervised learning differs from supervised learning in that it does not use labeled data
  • Common algorithms include:
    • Clustering (exclusive and overlapping clustering algorithms)
    • K-means clustering
    • DBSCAN
    • Gaussian Mixture Model
    • Association rule
    • Dimensionality reduction

Clustering

  • Clustering involves grouping unlabeled data into clusters based on similarities
  • The goal is to identify patterns and relationships in data without the need for prior knowledge about the data's meaning.
  • Key algorithm types are exclusive and overlapping clustering methods
    • Exclusive clustering includes k-means clustering
    • Overlapping clustering includes fuzzy k-means clustering

K-Means Clustering

  • A centroid-based clustering approach based on a partitioning method
  • The goal is to group data points based on their closeness
  • Similarity measure options include Euclidean, Manhattan, and Minkowski distance
  • Datasets are separated into a predetermined number of clusters
  • Centroids are recalculated from the observations assigned to each cluster, and the assignment and update steps repeat until the centroids stop moving
  • Drawback: the number of clusters (k) must be chosen in advance
  • A common technique for selecting k is the elbow method with the inertia metric
  • The inertia metric measures the mean squared distance between each data point and its closest centroid
  • K-means performs best when the data is well separated
  • Not suitable for non-spherical clusters nor for overlapping data points
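
The assignment and update steps, together with the inertia metric used by the elbow method, can be sketched in plain Python. This is a minimal illustration rather than a production implementation: the function names and the toy data are hypothetical, and the initial centroids are simply the first k points (an assumption made for reproducibility; practical implementations usually initialise at random).

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def kmeans(points, k, n_iter=100):
    """Alternate assignment and update steps; return the final centroids."""
    # Assumption: deterministic start from the first k points.
    centroids = [tuple(float(v) for v in p) for p in points[:k]]
    for _ in range(n_iter):
        # Assignment step: attach every point to its closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        updated = []
        for j, members in enumerate(clusters):
            if not members:  # keep the centroid of an empty cluster in place
                updated.append(centroids[j])
                continue
            dim = len(members[0])
            updated.append(tuple(sum(m[d] for m in members) / len(members)
                                 for d in range(dim)))
        if updated == centroids:  # centroids have stopped moving
            break
        centroids = updated
    return centroids


def inertia(points, centroids):
    """Mean squared distance from each point to its closest centroid."""
    return sum(min(squared_distance(p, c) for c in centroids)
               for p in points) / len(points)


# Two well-separated groups of four points each (toy data).
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]

# Elbow method: compute the inertia for increasing k and look for the bend.
inertia_by_k = {k: inertia(points, kmeans(points, k)) for k in (1, 2, 3)}
```

On this toy data the inertia drops sharply from k = 1 to k = 2 and changes little afterwards, which is the "elbow" that suggests k = 2.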

DBSCAN

  • Density-Based Spatial Clustering of Applications with Noise
  • Finds groups/categories based on the density of data points
  • It determines the number of clusters automatically
  • Less sensitive to initialisation than k-means; suited to irregularly shaped or overlapping clusters
  • Handles both dense and sparse data regions, so it is flexible across diverse cluster shapes

DBSCAN Algorithm

  • Identifies core points: points with at least the minimum number of neighbours within the given radius
  • For each unassigned core point, creates a new cluster
  • Recursively finds all density-connected points and assigns them to the same cluster
  • Data points that do not belong to any cluster are labelled as noise
  • Parameters: the radius and the minimum number of points
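
A compact sketch of these steps in plain Python follows. The names `eps` (radius) and `min_pts` (minimum number of points) are illustrative choices, the neighbourhood search is a brute-force scan, and a point counts as its own neighbour; real implementations typically use spatial indexes instead.

```python
import math

NOISE = -1  # label used here for points that belong to no cluster


def neighbours_of(points, i, eps):
    """Indices of all points within distance eps of point i (including i)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, NOISE otherwise."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = neighbours_of(points, i, eps)
        if len(neighbours) < min_pts:
            labels[i] = NOISE  # may later be relabelled as a border point
            continue
        cluster += 1  # point i is a core point: start a new cluster
        labels[i] = cluster
        queue = list(neighbours)
        while queue:  # expand to all density-connected points
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours_of(points, j, eps)
            if len(reachable) >= min_pts:  # j is itself a core point
                queue.extend(reachable)
    return labels


points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
```

The two dense groups become clusters 0 and 1, and the isolated point at (50, 50) is labelled as noise; the number of clusters was never specified.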

Gaussian Mixture Model

  • A probabilistic model assuming instances were generated from a mixture of k Gaussian distributions
  • For each instance, a cluster j is selected at random from the k clusters, with a probability defined by the cluster's weight (mixture weight)
  • The location x(i) is then sampled from a Gaussian distribution with mean μ(j) and covariance matrix Σ(j)
  • The algorithm can estimate the weights Φ and the distribution parameters μ and Σ from the data
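
The generative process described above can be sketched for the one-dimensional case, where each covariance matrix reduces to a single standard deviation. The function name, the parameter values, and the fixed seed are assumptions for illustration only; estimating the parameters from data (for example with expectation-maximisation) is not shown.

```python
import random


def sample_gmm(n, weights, means, sigmas, seed=0):
    """Draw n (value, component) pairs from a 1-D Gaussian mixture."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        # Step 1: pick a component with probability given by its weight.
        j = rng.choices(range(len(weights)), weights=weights)[0]
        # Step 2: sample the location from that component's Gaussian.
        x = rng.gauss(means[j], sigmas[j])
        samples.append((x, j))
    return samples


weights = [0.3, 0.7]   # mixture weights, must sum to 1
means = [0.0, 100.0]   # component means
sigmas = [1.0, 1.0]    # component standard deviations
samples = sample_gmm(200, weights, means, sigmas)
```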

Association Rule

  • A rule-based machine learning technique to identify relationships/associations between parameters in a large dataset
  • Commonly used in market analysis to understand the relationships between different products
  • Common algorithms:
    • Apriori
    • FP-Growth
    • Eclat

Dimensionality Reduction

  • Aims to reduce the number of features in a dataset while preserving as much of the original information as possible
  • Used in the preprocessing stage
  • Common algorithms:
    • Principal component analysis (PCA)
    • Singular value decomposition (SVD)
    • Autoencoders

Supervised Learning

  • Supervised learning algorithms are given labeled data and learn a function that maps from input to output
  • Common use cases are classification, where the output is a discrete label, and regression, where the output is a continuous value.

Classification vs Regression

  • Classification predicts a discrete label, while regression predicts a continuous value.

Linear Models

  • Well-suited to illustrate introductory concepts, and act as a baseline for more complex problems.
  • How is closeness measured in the loss function for risk minimization?
  • Distance is measured with norms.

Distance and Norm

  • Euclidean norm (2-norm): √(x₁² + x₂² + ... + xₙ²)
  • Manhattan norm (1-norm): |x₁| + |x₂| + ... + |xₙ|
  • Maximum norm (∞-norm): max(|x₁|, |x₂|, ..., |xₙ|)
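
As a quick illustration, the three norms can be written directly from their definitions (the function names are illustrative):

```python
import math


def norm_1(x):
    """Manhattan norm: sum of absolute values."""
    return sum(abs(v) for v in x)


def norm_2(x):
    """Euclidean norm: square root of the sum of squares."""
    return math.sqrt(sum(v * v for v in x))


def norm_inf(x):
    """Maximum norm: largest absolute value."""
    return max(abs(v) for v in x)


x = (3, -4)
values = (norm_1(x), norm_2(x), norm_inf(x))  # (7, 5.0, 4)
```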

Linear Regression - Least Squares Fit (LSQ)

  • The hypothesis space ℋ is the space of linear functions f(x) = w₀ + w₁x.
  • The loss function is the Euclidean norm (2-norm) of the residuals.
  • The fit consists of finding the parameters w₀ and w₁ that minimise this loss.
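
For the straight-line case f(x) = w₀ + w₁x, the least squares parameters have a well-known closed form: w₁ is the covariance of x and y divided by the variance of x, and w₀ follows from the means. A minimal sketch, with an illustrative function name and toy data:

```python
def least_squares_line(xs, ys):
    """Return (w0, w1) minimising the sum of squared residuals."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    w1 = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    w0 = y_mean - w1 * x_mean
    return w0, w1


# Toy data lying exactly on y = 1 + 2x.
w0, w1 = least_squares_line([0, 1, 2, 3], [1, 3, 5, 7])
```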

Linear Regression - Basis Functions

  • Hypothesis space in which the parameters w appear linearly: ℋ_M = {f : f(x) = φ(x)ᵀw}, with M basis functions φⱼ
  • Collect the data into the matrix Φ (the basis functions evaluated at each input x) and the output vector y
  • Minimise the Euclidean norm (2-norm) of the residual: R_emp(w) = ||y - Φw||₂

Linear Regression – Regularisation

  • Solution for the parameter estimate: ŵ = Φ⁺y, where Φ⁺ is the pseudoinverse
  • The pseudoinverse handles a possibly infinite number of solutions by selecting one, e.g., the solution with the smallest norm
  • Regularisation adds a penalty term C(w) to the loss function:
    • min over w of (||y - Φw||² + λC(w)), where the parameter λ controls the strength of the regularisation

Linear Regression - ℓ² Regularisation

  • ℓ² regularisation / ridge regression: C(w) = ||w||₂²
  • Solution: ŵ = (ΦᵀΦ + λI_M)⁻¹ Φᵀy
  • I_M is the M × M identity matrix
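
The standard ridge regression closed form, ŵ = (ΦᵀΦ + λI)⁻¹Φᵀy, can be evaluated without any numerical library for small problems. The sketch below assumes a polynomial basis φⱼ(x) = xʲ, which is an illustrative choice and not taken from the notes, and solves the linear system with a small Gaussian elimination routine.

```python
def design_matrix(xs, n_basis):
    """Rows are [phi_0(x), ..., phi_{M-1}(x)] with the assumed basis phi_j(x) = x**j."""
    return [[x ** j for j in range(n_basis)] for x in xs]


def solve_linear_system(a, b):
    """Solve a @ w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(map(float, row)) + [float(bi)] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(aug[r][c] * w[c] for c in range(r + 1, n))
        w[r] = (aug[r][n] - tail) / aug[r][r]
    return w


def ridge_fit(xs, ys, n_basis, lam):
    """Return w = (Phi^T Phi + lam * I)^-1 Phi^T y."""
    phi = design_matrix(xs, n_basis)
    lhs = [[sum(row[i] * row[j] for row in phi) + (lam if i == j else 0.0)
            for j in range(n_basis)] for i in range(n_basis)]
    rhs = [sum(row[i] * y for row, y in zip(phi, ys)) for i in range(n_basis)]
    return solve_linear_system(lhs, rhs)


xs, ys = [0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0]  # exactly y = 1 + 2x
w_plain = ridge_fit(xs, ys, n_basis=2, lam=0.0)  # ordinary least squares
w_ridge = ridge_fit(xs, ys, n_basis=2, lam=1.0)  # weights shrink towards zero
```

With λ = 0 the fit recovers the line exactly; with λ = 1 the weight vector has a smaller norm, which is the effect the regularisation term is meant to have.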

Nonlinear Regression

  • Applies when the model is not linear in the parameters or the norm is non-Euclidean
  • The matrix formalism does not apply
  • No explicit solution: iterative methods are needed, e.g., gradient descent

Connection to Machine Learning

  • Nonlinear optimisation is common
  • Numerical implementation available in TensorFlow
  • Using hyperparameters to select hypothesis space
  • Using optimisation procedures, e.g., the Adam optimizer

TensorFlow

  • Import and process data using pandas
  • Check the data format using TensorFlow
  • Consider splitting the data into training, validation, and testing sets
  • Scale numeric data
  • Define the model using tf.keras.Sequential
  • Compile the model with an optimizer, e.g., tf.keras.optimizers.Adam()
  • Train the model with model.fit
  • Evaluate the model with model.evaluate
  • Compute predictions with model.predict

Neural Networks

  • A series of layers where nodes in each layer are fully connected to the next layer.
  • Use activation function to add nonlinearities, e.g., sigmoid, tanh, ReLU.
  • In simple terms: input → feature extraction → learning → outputs

Classification

  • Outputs are discrete labels (e.g., {0,1}, or categories)
  • Hypothesis space is K-dimensional.
  • Use an activation function such as softmax to produce outputs comparable to one-hot vectors
  • Other examples of activation functions are sigmoid and tanh

Kernel Functions

  • A kernel function k corresponds to mapping the input data into a higher-dimensional feature space
  • Equivalently, it computes the similarity between pairs of data points
  • It is used without constructing an explicit feature map
  • Examples include the linear, polynomial, and Gaussian kernels
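
The three example kernels, and the equivalence between a kernel and its feature map, can be checked numerically. The sketch below uses illustrative function names and covers the degree-2 polynomial kernel on 2-dimensional inputs, whose feature map is φ(x) = (1, √2·x₁, √2·x₂, √2·x₁x₂, x₁², x₂²).

```python
import math


def dot(x, z):
    return sum(xi * zi for xi, zi in zip(x, z))


def linear_kernel(x, z):
    return dot(x, z)


def polynomial_kernel(x, z, m=2):
    return (1 + dot(x, z)) ** m


def gaussian_kernel(x, z, sigma=1.0):
    sq_dist = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-sq_dist / (2 * sigma ** 2))


def phi_degree2(x):
    """Explicit feature map of the degree-2 polynomial kernel for 2-D inputs."""
    x1, x2 = x
    r = math.sqrt(2)
    return (1, r * x1, r * x2, r * x1 * x2, x1 ** 2, x2 ** 2)


x, z = (1, 2), (3, 4)
via_kernel = polynomial_kernel(x, z, m=2)            # (1 + 11)^2 = 144
via_features = dot(phi_degree2(x), phi_degree2(z))   # same value, up to rounding
```

Both routes give the same number, but the kernel needs only one inner product in the original 2-dimensional space, whereas the explicit route builds two 6-dimensional vectors first.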

Kernel Ridge Regression

  • Solution for regularised empirical risk minimization.
  • Define the kernel function (RBF/polynomial)
  • Consider hyperparameters e.g., gamma in RBF
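
A small sketch of kernel ridge regression with a Gaussian kernel on scalar inputs is given below. It uses the commonly stated dual solution α = (G + λI)⁻¹y with predictions f(x) = Σᵢ αᵢ k(xᵢ, x); this formula and its scaling of λ are assumptions here, since the notes do not spell them out, and conventions differ between texts. All function names are illustrative.

```python
import math


def gaussian_kernel(x, z, sigma=1.0):
    """Gaussian (RBF) kernel for scalar inputs."""
    return math.exp(-((x - z) ** 2) / (2.0 * sigma ** 2))


def solve_linear_system(a, b):
    """Solve a @ w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(map(float, row)) + [float(bi)] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * w[c] for c in range(r + 1, n))
        w[r] = (aug[r][n] - tail) / aug[r][r]
    return w


def krr_fit(xs, ys, lam, kernel=gaussian_kernel):
    """Return the dual coefficients alpha = (G + lam * I)^-1 y."""
    regularised_gram = [[kernel(xi, xj) + (lam if i == j else 0.0)
                         for j, xj in enumerate(xs)]
                        for i, xi in enumerate(xs)]
    return solve_linear_system(regularised_gram, ys)


def krr_predict(x, xs, alpha, kernel=gaussian_kernel):
    """Weighted sum of kernel evaluations between x and the training inputs."""
    return sum(a * kernel(xi, x) for a, xi in zip(alpha, xs))


xs, ys = [0.0, 1.0, 2.0, 3.0], [0.0, 1.0, 4.0, 9.0]
alpha_weak = krr_fit(xs, ys, lam=1e-8)   # almost no regularisation
alpha_strong = krr_fit(xs, ys, lam=1e6)  # very strong regularisation
```

With a tiny λ the model reproduces the training targets almost exactly; with a very large λ the predictions collapse towards zero, which illustrates how λ trades fit against model complexity.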

SVM Classification

  • The main objective is to identify the optimal hyperplane that effectively separates data points into classes, maximizing the margin between the support vectors.

Hyperplane

  • A boundary/partition in a d-dimensional space.
  • Equation of the linear hyperplane w•x+b=0.
  • The normal vector w and the offset b are important properties.
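
To make the roles of w and b concrete, the sketch below evaluates w•x + b for a point, converts it to a signed distance by dividing by ||w||, and classifies by the sign. The function names and the example numbers are illustrative.

```python
import math


def decision_value(w, b, x):
    """Value of w . x + b; zero exactly on the hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def signed_distance(w, b, x):
    """Signed Euclidean distance from x to the hyperplane w . x + b = 0."""
    return decision_value(w, b, x) / math.sqrt(sum(wi * wi for wi in w))


def classify(w, b, x):
    """Label +1 on the side the normal vector points to, -1 on the other."""
    return 1 if decision_value(w, b, x) >= 0 else -1


w, b = (3.0, 4.0), -5.0
d_far = signed_distance(w, b, (3.0, 4.0))  # (9 + 16 - 5) / 5 = 4.0
d_origin = signed_distance(w, b, (0.0, 0.0))  # -5 / 5 = -1.0
```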

SVM Formulation

  • The core concept of SVM is finding the hyperplane that maximises the margin between the closest data points (support vectors)
  • Minimise ||w||² subject to yᵢ(w•xᵢ + b) ≥ 1 for every training point i

Support Vector Machines (SVMs)

  • SVMs are powerful machine learning algorithms used in both linear and non-linear classification
  • They are also used for outlier detection tasks

Soft Margin SVM

  • Modification to the hard margin SVM to handle instances that are not linearly separable.
  • A penalty term (hinge loss) is introduced to account for misclassified instances.
  • The optimization problem now includes an additional term that penalizes violations of the margin or misclassifications.

Soft Margin Slack Variable

  • Allows some data points to violate the margin, improving the model's ability to generalize to unseen data
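  • ξ: slack variable, one per data point
    • ξ = 0 → the point is correctly classified and does not violate the margin
    • 0 < ξ < 1 → the point violates the margin but is still on the correct side of the hyperplane
    • ξ ≥ 1 → the point lies on the hyperplane (ξ = 1) or on the wrong side of it (ξ > 1, misclassified)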

Hinge Loss

  • A loss function used to maximise the margin in soft-margin SVM
  • It is zero for data points that are correctly classified outside the margin, and it penalizes points that violate the margin or are misclassified
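
The hinge loss in its usual form is max(0, 1 - y·f(x)), with label y in {-1, +1} and decision value f(x) = w•x + b. A short sketch with illustrative names shows the three regimes:

```python
def hinge_loss(y, score):
    """Hinge loss for a label y in {-1, +1} and a decision value score."""
    return max(0.0, 1.0 - y * score)


outside_margin = hinge_loss(+1, 2.0)   # correct and beyond the margin -> 0.0
inside_margin = hinge_loss(+1, 0.5)    # correct side, but inside the margin -> 0.5
misclassified = hinge_loss(+1, -1.0)   # wrong side of the hyperplane -> 2.0
```

The value of the hinge loss for a point plays the same role as the slack variable ξ of that point in the soft-margin formulation: zero outside the margin, between 0 and 1 inside it, and above 1 once the point is misclassified.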

SVM Using Kernel

  • Kernel function is a mathematical function used to transform data into a higher-dimensional feature space
  • The kernel implicitly handles the feature transformation, simplifying calculations and allowing non-linearly separable data to be transformed into linearly separable data in higher dimensions

Feature Scaling

  • Feature scaling transforms feature values into a common range, which is crucial to prevent differences in feature values from skewing the learning behavior of the model
  • Normalization: maps values into [0, 1]
  • Standardization: rescales values to mean 0 and standard deviation 1
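
Both transformations are short enough to write out. The sketch below uses min-max scaling for normalization and the population standard deviation for standardization; the function names are illustrative, and both functions assume the feature is not constant (otherwise the denominators are zero).

```python
import statistics


def normalize(values):
    """Min-max scaling: map the smallest value to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Shift to mean 0 and rescale to (population) standard deviation 1."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


feature = [1, 2, 3]
scaled = normalize(feature)          # [0.0, 0.5, 1.0]
standardized = standardize(feature)
```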

SVM Classification

  • Multiclass SVM: SVMs are not restricted to binary classification and can be extended to more than two classes

SVM Regression

  • Used for predicting continuous values rather than discrete labels or categories
  • Fits a function to the data set whose deviation from the training targets stays within a maximum absolute deviation

MLP for Classification

  • Output neurons use softmax for activation to ensure estimated probabilities are between 0 and 1 & add up to 1.
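
A minimal softmax sketch shows why the outputs can be read as probabilities. Subtracting the maximum before exponentiating is a common numerical-stability trick and does not change the result; the function name is illustrative.

```python
import math


def softmax(logits):
    """Map raw output scores to values in (0, 1) that sum to 1."""
    shift = max(logits)  # stabilises exp() for large scores
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([1.0, 2.0, 3.0])  # largest score gets the largest probability
```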

Cross Entropy

  • Measures the difference between predicted and actual probability distributions. Used as a loss function to penalize model predictions that deviate significantly from the true distribution
  • In multiclass classification, it penalizes the model when it estimates a low probability for the target class.
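
For a single example with a one-hot target, the cross entropy reduces to the negative log of the probability assigned to the target class. The sketch below uses an illustrative function name and made-up probabilities:

```python
import math


def cross_entropy(probs, target_index):
    """Cross entropy between predicted probabilities and a one-hot target."""
    return -math.log(probs[target_index])


confident_right = cross_entropy([0.9, 0.05, 0.05], 0)  # small loss
confident_wrong = cross_entropy([0.05, 0.9, 0.05], 0)  # large loss
```

The loss grows without bound as the probability of the target class approaches zero, which is how low-confidence predictions for the correct class are penalized.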

Learning Rate

  • Controls the step size that the model uses in updating parameters (weights).
  • Higher learning rate can lead to faster convergence but might miss optimal values
  • Lower learning rate leads to a better chance of reaching the minimum loss function but requires more epochs
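
The trade-off can be seen on a one-parameter toy problem. The sketch below runs plain gradient descent on the loss (w - 3)², whose minimum is at w = 3; the loss function, the step count, and the three learning rates are illustrative assumptions.

```python
def gradient_descent(learning_rate, n_steps=100, w=0.0):
    """Minimise the toy loss (w - 3)^2 and return the final parameter."""
    for _ in range(n_steps):
        grad = 2.0 * (w - 3.0)      # derivative of (w - 3)^2
        w -= learning_rate * grad   # step against the gradient
    return w


w_small = gradient_descent(0.01)  # still noticeably short of 3 after 100 steps
w_good = gradient_descent(0.1)    # very close to 3
w_large = gradient_descent(1.1)   # overshoots further on every step and diverges
```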

Optimizer

  • Gradient descent, stochastic gradient descent, SGD with momentum, AdaGrad, RMSprop, and Adam are some available options for updating network weights/parameters

Adam Optimizer

  • Combines ideas from AdaGrad and RMSprop; it is straightforward to implement, uses little memory, runs quickly, and requires less tuning.

Summary of Neural Networks Learning

  • Overview of ANN, MLP (many layers), backpropagation, and activation function.
  • Focus on different activation functions such as step, sigmoid, tanh and ReLU.
  • Training MLPs using backpropagation involving forward and backward pass.
  • Use of activation functions to add nonlinearities.

Feature Selection and PCA

  • PCA is a method for dimension reduction.
  • It seeks directions that maximise the variance and minimise the reconstruction error, using eigenvalues and eigenvectors or the Singular Value Decomposition (SVD)
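
As a sketch of the eigenvector view, the first principal component can be found by power iteration on the covariance matrix. This is one of several possible routes (library implementations typically use the SVD), and the function name, start vector, and toy data are assumptions. The sketch assumes the data has non-zero variance and that the start vector is not orthogonal to the leading eigenvector.

```python
import math


def first_principal_component(points, n_iter=200):
    """Return (direction, explained variance ratio) of the first component."""
    n, d = len(points), len(points[0])
    mean = [sum(p[k] for p in points) / n for k in range(d)]
    centred = [[p[k] - mean[k] for k in range(d)] for p in points]
    # Population covariance matrix of the centred data.
    cov = [[sum(r[a] * r[b] for r in centred) / n for b in range(d)]
           for a in range(d)]
    v = [1.0 / (k + 1) for k in range(d)]  # arbitrary non-zero start vector
    for _ in range(n_iter):
        # Power iteration: repeatedly apply the matrix and renormalise.
        cv = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(c * c for c in cv))
        if norm == 0.0:
            break
        v = [c / norm for c in cv]
    # Rayleigh quotient gives the variance along the direction v.
    eigenvalue = sum(v[a] * sum(cov[a][b] * v[b] for b in range(d))
                     for a in range(d))
    total_variance = sum(cov[k][k] for k in range(d))
    return v, eigenvalue / total_variance


# Toy data lying exactly on the line y = x.
direction, explained = first_principal_component([(0, 0), (1, 1), (2, 2), (3, 3)])
```

For this data the direction is the diagonal (1/√2, 1/√2) and it explains all of the variance, so one component suffices to represent the two original variables.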

Dimension Reduction

  • It reduces the number of variables while retaining the data variance.
  • Used for both supervised and unsupervised learning
