Data Science Fundamentals Quiz

Questions and Answers

Which of the following best describes the primary goal of data science?

  • Applying computational and statistical techniques to gain insights into real-world problems. (correct)
  • Efficiently storing and retrieving large datasets.
  • Creating visually appealing data presentations.
  • Developing new hardware and software for data storage.

Data analysis in data science solely involves choosing a model without any prior exploration of the data.

False (B)

Name the three macro-steps involved in a typical data science task.

Data collection, data analysis, and data presentation

A crucial aspect of statistical analysis is choosing a good __________, as selection __________ can lead to non-representative conclusions.

sample; bias

Match the following concepts with their descriptions:

  • Data Management = Support for data storage, retrieval, and infrastructure.
  • Correlation Coefficient = Measures the linear relationship between two variables.
  • Simpson's Paradox = A trend observable in groups is unobservable when they are combined.
  • Bonferroni's Principle = Addresses the risk of discovering meaningless patterns.

In the context of correlation coefficients, what does a value of 0 indicate?

No linear correlation. (D)

What does Simpson's paradox primarily point to?

The presence of a confounding variable influencing the other variables. (B)

Descriptive statistics involves making conclusions that extend beyond the given dataset.

False (B)

What is the primary drawback of a greedy policy in reinforcement learning?

It may lead to suboptimal solutions by not exploring alternatives. (B)

Temporal difference learning requires prior knowledge of the transition function and reward function.

False (B)

Describe the impact of a very small learning rate ($\alpha \rightarrow 0$) on the learning process.

slow but stable learning

A common assumption for proving the convergence of many RL algorithms is ______ exploration, ensuring every state-action pair is experienced infinitely many times.

infinite

Match the following Multi-Agent Reinforcement Learning (MARL) challenges with their descriptions:

  • Non-stationarity = The environment changes continuously due to simultaneous actions of multiple agents.
  • Credit assignment = Difficulty in correctly attributing rewards to the appropriate agent.
  • Equilibrium selection = Finding a globally optimal policy that balances the interests of all agents.

What characterizes a zero-sum game in the context of Multi-Agent Reinforcement Learning (MARL)?

The sum of all agents’ rewards is zero. (A)

A Nash equilibrium guarantees that all agents achieve the highest possible reward.

False (B)

Which learning mode in MARL balances independence and coordination by sharing information during training but allowing independent action during deployment?

CTDE (Centralized Training, Decentralized Execution) (A)

What are the key advantages of using deep learning in reinforcement learning?

Generalization and scalability of value and policy functions, efficient representation of complex environments, better sample efficiency.

In reinforcement learning, the absence of a fixed ground truth and the presence of non-i.i.d. data, which violate assumptions of supervised learning, lead to the ______ target problem and potential overfitting.

moving

Which of the following ensemble methods learns sequentially, with each model correcting the previous one?

Boosting (A)

Linear classifiers are generally unsuitable for large datasets due to their computational complexity.

False (B)

What is the primary goal of Support Vector Machines (SVMs) in classification tasks?

to find the optimal separating hyperplane that best divides data points into different classes

In gradient descent, a learning rate that is too ______ can cause training to become unstable and may prevent convergence.

high

What is the role of the sigmoid function in logistic regression?

To map predictions to the range [0, 1], representing probabilities (C)

The perceptron can only be used for classification problems and cannot be adapted for regression tasks.

False (B)

Explain how the C hyper-parameter in SVM affects the bias-variance trade-off.

A large C leads to a smaller margin and potential overfitting (low bias, high variance), while a small C allows misclassifications, leading to a larger margin and better generalization (high bias, low variance).

The data points closest to the hyperplane in SVM, which directly influence its position, are called ______.

support vectors

Which of the following statements best describes the purpose of a loss function in machine learning?

To quantify the difference between model predictions and actual target values (A)

In logistic regression, a feature weight close to zero indicates that the feature is highly predictive of the positive class.

False (B)

Describe the process of gradient descent and its goal in optimizing model parameters.

Gradient descent is an iterative optimization algorithm that minimizes the error function by adjusting model parameters in the direction opposite to the gradient until the error is minimized.

The C hyper-parameter in SVM controls the trade-off between achieving a larger ______ and minimizing classification errors.

margin

Which of the following is a characteristic of ensemble methods that use boosting?

Using homogeneous weak learners learned sequentially (B)

Regularization techniques in machine learning aim to increase the complexity of models to better fit the training data.

False (B)

In frequentist hypothesis testing, which condition leads to the rejection of the null hypothesis (H0)?

p-value < T (D)

According to the Central Limit Theorem, the sample mean of any random variable always follows a normal distribution, regardless of sample size.

False (B)

Bayes' theorem provides a method to update the probability of an event given new information. Write the formula for Bayes' Theorem.

p(A|B) = p(A) * p(B|A) / p(B)

__________ statistics aims to generalize results from a sample to an entire population, perform hypothesis testing, and build data models to draw conclusions.

Inferential

Match the following Gestalt principles with their descriptions:

  • Closure = Experiencing separate elements as a complete figure
  • Proximity = Perceiving objects near each other as a group
  • Similarity = Grouping similar elements into collective entities
  • Common Fate = Perceiving objects moving in the same direction as a collective unit

Which type of chart is most suitable for visualizing the distribution of a single variable and identifying whether the distribution is symmetrical?

Box plot (D)

Anscombe's quartet demonstrates that descriptive statistics are always sufficient for understanding the underlying patterns in a dataset.

False (B)

Explain the purpose of a heatmap and in what kind of analysis it would be most useful.

A heatmap represents the values in a matrix using colors and is useful for visualizing correlations or distances between entities.

In machine learning, finding a mapping between input and output variables directly from labeled data observations is known as __________ learning.

supervised

What is the primary difference between classification and regression in supervised learning?

Classification predicts discrete outputs; regression predicts continuous outputs. (C)

Unsupervised learning requires labeled data to train a model.

False (B)

Define what a confusion matrix is and what its purpose is in evaluating the performance of a machine learning model.

A confusion matrix measures the performance of a prediction model (binary or multi-class) by reporting true negatives, true positives, false negatives, and false positives.

__________ is calculated as the ratio of true positives to the sum of true positives and false positives, indicating the accuracy of the positive predictions.

Precision

When should a recall-precision curve be preferred over a receiver operating characteristic (ROC) curve for evaluating a classification model?

When the dataset is imbalanced. (B)

Lindley's paradox suggests that the Bayesian and frequentist inference approaches will always arrive at the same conclusions, regardless of the prior distribution.

False (B)

What is the primary purpose of using kernel functions in Kernel Machines?

To efficiently compute dot products in a higher-dimensional space without explicitly transforming the data. (B)

The One-vs-One (OvO) approach for multi-class SVM classification requires training fewer models than the One-vs-All (OvA) approach.

False (B)

In the context of neural networks, what is function approximation?

learning a parametric function that maps inputs to outputs

In a feed-forward neural network, each neuron applies a function: $f(x;\theta) = g(w \cdot x + b)$, where $g$ is a non-linear ________ function.

activation

Match the following activation functions with their primary use case:

  • Sigmoid = Binary classification
  • Softmax = Multi-class classification
  • ReLU = Deep networks

During the training of a neural network, what is the role of the loss function?

To measure the error between the network's predictions and the ground truth. (D)

Backpropagation is used to calculate the loss function in a neural network.

False (B)

What problem does early stopping address during neural network training?

overfitting

Convolutional Neural Networks (CNNs) are particularly well-suited for tasks involving ________ classification due to their ability to automatically extract hierarchical features.

image

Which of the following is a disadvantage of the One-vs-All (OvA) approach in multi-class SVM classification?

It struggles when the classes are highly imbalanced. (B)

Without activation functions, a neural network can effectively model non-linear dependencies in data.

False (B)

What are the trainable parameters in each layer of a feedforward neural network?

weights and biases

The ________ kernel captures complex decision boundaries by mapping data into an infinite-dimensional space.

Gaussian (Radial Basis Function, RBF)

What is the purpose of the 'patience' parameter in the early stopping technique?

To wait for a few epochs to see if the validation error decreases again before stopping the training. (A)

Match each name to the process it describes:

  • Convolutional Neural Networks (CNNs) = automatic hierarchical feature extraction from input data
  • Backpropagation = compute gradients for every parameter in a neural network
  • Kernel Machines = extend SVMs to handle non-linear classification problems

Which characteristic distinguishes Recurrent Neural Networks (RNNs) from feed-forward networks?

RNNs have loops in their architecture, allowing them to maintain a memory of previous inputs. (A)

Regression is used to predict discrete categories or labels.

False (B)

What key assumption differentiates parametric regression from non-parametric regression?

Parametric regression requires pre-defining the shape of the function.

In K-Nearest Neighbors Regression, the target value is predicted as a weighted combination of the K ________’ values.

nearest neighbors

Which loss function is commonly used in neural networks designed for regression tasks?

Mean Squared Error (MSE) (B)

Match each clustering performance measure with its description:

  • Dunn Index = A higher value indicates better clustering performance.
  • Silhouette Score = Ranges from -1 to 1; higher values suggest better clustering.
  • Homogeneity = Measures if each cluster contains only points of a single label; best value is 1.

Which of the following is a primary goal of clustering?

To have high intra-cluster similarity and high inter-cluster dissimilarity. (A)

What is a major limitation of the K-means algorithm regarding cluster shape?

It assumes linear separation and works best when clusters are linearly separable. (C)

Since clustering is unsupervised, there is a direct way to evaluate its performance.

False (B)

What is the role of the Moore-Penrose pseudo-inverse in linear regression?

To solve for the coefficients when there are more equations than variables. (B)

In regression trees, what criterion is used to evaluate and select attributes during training?

Mean Squared Error (MSE)

Which of the following adjustments is needed to adapt a Multi-Layer Perceptron (MLP) for regression tasks compared to classification?

Using one neuron with a linear activation function in the output layer. (C)

Support Vector Regression (SVR) separates classes by finding a margin.

False (B)

In the K-means algorithm, the number of clusters (K) must be _________.

set in advance

What is the significance of high intra-cluster similarity and high inter-cluster dissimilarity in clustering?

Indicates well-defined clusters

Which of the following is the MOST accurate description of the difference between data mining and machine learning?

Data mining aims to find frequent patterns and rules, while machine learning learns a model from the data. (C)

A high lift value (e.g., lift > 1) for an association rule indicates a negative dependence between the antecedent and the consequent.

False (B)

In the context of association rule mining, define the 'interest' of a rule and explain what it means when the interest is zero.

Interest is the difference between the confidence of a rule and the support of the consequent. An interest of zero means that the antecedent has no influence on the consequent.

The Apriori algorithm generates candidate itemsets of length k, given all frequent itemsets of length ______.

k-1

Match the linkage methods with their descriptions in hierarchical clustering:

  • Single Linkage = Minimum distance between any two points from each cluster.
  • Complete Linkage = Maximum distance between any two points from each cluster.
  • Average Linkage = Average distance between all pairs of points between two clusters.
  • Ward's Method = Minimizes the increase in variance when merging clusters.

What is the primary purpose of Principal Component Analysis (PCA)?

To project high-dimensional data into a lower-dimensional space while preserving variance. (D)

In Reinforcement Learning (RL), a sparse reward system provides the agent with a reward after each action.

False (B)

Define a Markov Decision Process (MDP) and briefly explain the significance of the Markov property within this framework.

An MDP is a mathematical model describing sequential decision-making with states, actions, rewards, and transition probabilities. The Markov property states that the future depends only on the current state and action, not on past history.

In Reinforcement Learning, the learning goal is to maximize the expected ______, which is the sum of future rewards an agent can collect from a given state.

return

What impact does a discount factor ($\gamma$) close to 0 have on the agent's decision-making in Reinforcement Learning?

The agent prioritizes immediate rewards over future rewards. (C)

The completeness measure in clustering evaluates whether clusters are internally homogenous.

False (B)

Given an association rule A -> B, what does the support of the rule represent?

The fraction of transactions that contain both itemsets A and B. (C)

A ______ is a tree-like diagram that represents the hierarchical structure of clusters in hierarchical clustering.

dendrogram

Explain the purpose of state value function, $V(s)$, and state-action value function, $Q(s, a)$, in Reinforcement Learning.

$V(s)$ measures the expected value of being in state $s$ while following a policy $\pi$, and $Q(s, a)$ measures the expected value of taking action $a$ in state $s$, and then following policy $\pi$. They both estimate how 'good' a state or state-action pair is for decision-making.

What is a greedy policy in Reinforcement Learning?

A policy that selects the action with the highest estimated state-action value. (B)

When training a machine learning model, what is the primary purpose of splitting data into training, validation, and test sets?

To choose appropriate hyper-parameters, select the best model, and measure performance on unseen data, respectively. (A)

Maintaining class proportions when splitting data for model training is generally recommended.

True (A)

In K-fold cross-validation, the data is split into k folds, and in each turn, one fold acts as the __________ set while the others form the training and validation sets.

test

What is the key difference between micro and macro averaging in the context of evaluating model performance?

Micro averaging sums all true positives, false positives, and false negatives before computing the metric, while macro averaging first computes the metric per fold and then averages. (A)

Describe overfitting in the context of machine learning.

Overfitting occurs when a model learns the training data too well, including its noise and specific details, leading to poor generalization on unseen data.

Match the following data preprocessing tasks with their descriptions:

  • Aggregation = Grouping data together.
  • Cleaning = Handling missing or inconsistent data.
  • Discretization = Converting numerical features into discrete intervals.
  • Normalization = Scaling data to a specific range.

In the context of the KNN classifier, how is the target variable for a new instance typically determined?

By finding the most frequent target value among the k nearest neighbors. (B)

Hamming distance can be used as a metric in KNN.

True (A)

How can the KNN algorithm be adapted for regression tasks?

By evaluating the target variable as a weighted combination of the target values of the k nearest neighbors. (A)

Name one requirement and one drawback of using the KNN algorithm.

Requirement: Numeric features. Drawback: High computational cost in high-dimensional spaces.

In a decision tree, what does each internal node represent?

An attribute used for splitting the data. (C)

The C4.5 algorithm uses __________ __________ __________ to select the best attribute for splitting the data.

information gain ratio

Which of the following statements best describes the Gini Index in the context of decision trees:

It measures the impurity of a node; lower values indicate purer nodes. (C)

Pruning is used to increase the complexity of decision trees.

False (B)

Give two pros and one con of decision trees.

Pros: Simple to understand, able to handle both numerical and categorical data. Con: Prone to overfitting.

Flashcards

Data Science

Applying computational and statistical techniques to gain insight into real-world problems.

Data Science Application Ingredients

Raw data, a problem statement, a model, and an evaluation metric.

Data Science Macro-Steps

Data collection, data analysis, and data presentation.

Data Management

Efficient storing, retrieving, and managing data with appropriate resources.

Correlation Coefficient

Measures the linear relationship between two variables, ranging from -1 to 1.

Bonferroni's principle

The risk in data analysis of discovering meaningless patterns, i.e., false positives.

Sample

Subset of observations from a larger group.

Selection Bias

Choosing a non-representative sample, leading to skewed results.

Data Splitting

Splitting data into training, validation, and test sets to train, select, and evaluate.

K-Fold Cross-Validation

Splits data into k folds, using each fold once as a test set.

Micro Average

Calculate metric summing all TPs, FPs, FNs before averaging.

Macro Average

Calculate metrics individually and average.

Overfitting

Model learns training data perfectly but fails to generalize.

Data Pre-processing

Preparing data through aggregation, cleaning, discretization, normalization, etc.

KNN Classifier

Classifies based on the most frequent target value of the k nearest neighbors.

KNN Distance Metric

Distance measures like Euclidean or Hamming used to find the nearest neighbors.

KNN for Regression

Evaluates target as a weighted combination of neighbors' targets.

Decision Tree

Classifies data based on attribute-value representations.

C4.5 Algorithm

Builds decision trees, using Information Gain Ratio for feature selection.

Feature Selection Metrics

Evaluates effectiveness of attribute splits: Gini Index, Information Gain Ratio

Gini Index

Measure the impurity of a node.

Information Gain

Expected reduction in entropy after splitting on an attribute.

Pruning

Reduces overfitting by removing unnecessary branches.

Inferential Statistics

Generalizing results from data to the whole population, hypothesis testing, and building data models for drawing conclusions.

Goal of Hypothesis testing

To evaluate if a hypothesis is likely to be true, given the available data.

Null Hypothesis Rejection

If the p-value < T, reject the null hypothesis. If p-value > T, there's not enough evidence to reject it.

Central Limit Theorem

Given a random variable, the sample mean follows a normal distribution if the sample size is large enough.

Bayes' Theorem

Provides a way to update the probability of an event given new information.

Lindley's Paradox

Bayesian and frequentist approaches can yield different results for the same hypothesis.

Anscombe's Quartet

Datasets with identical descriptive statistics but very different graphical representations.

Gestalt Principles Goal

The whole is other than the sum of its parts.

Histogram

Chart that plots counts or frequencies of a variable.

Bar Plot

Chart comparing the same variable across different categories (bars can be reordered).

Pie Chart

Chart representing parts/proportions of a whole.

Scatter Plot

Chart to analyze the relationship between two continuous variables, revealing trends or correlations.

Supervised Learning

Finding a mapping between input/output variables directly from labelled data.

Classification vs. Regression

Discrete output (classification) versus continuous output (regression).

Unsupervised Learning

Given data without labels, to cluster data, find patterns, and identify important features.

Kernel Machines

Extend SVMs to handle non-linear data by mapping it to higher dimensions.

Kernel Trick

Functions that compute dot products in transformed space without explicit transformation.

Polynomial Kernel

Captures polynomial relationships between features.

Gaussian (RBF) Kernel

Maps data into an infinite-dimensional space to capture complex boundaries.

Multi-class SVM

Train one SVM per class (One-vs-All) or per pair of classes (One-vs-One).

Neural Networks task

Learn a function that maps inputs to outputs by minimizing a loss function.

Feed-forward NN

Input, hidden, and output layers with weights, biases, and activation functions.

Activation Function Role

Introduce non-linearity, enabling NNs to learn complex patterns.

Common Activation Functions

Sigmoid, Softmax, ReLU - introduce non-linearity into neural networks.

NN Training

Supervised learning using labeled data, loss function, and gradient descent.

Backpropagation

Algorithm to compute gradients for every parameter, enabling efficient training.

Early Stopping

Stops training when validation error increases to prevent overfitting.

Convolutional NN

Designed for hierarchical feature extraction from input data.

One-vs-All (OvA)

One SVM classifier is trained for each class, distinguishing it from all others.

One-vs-One (OvO)

Train a binary classifier for each pair of classes.

Linear Classifiers

Models that separate data using a linear decision boundary. Simple, efficient, and good for large datasets.

Perceptron

A simple neural network model with a single neuron, classifying linearly separable examples.

Loss Function

Measures how well a model's predictions match actual values. Used to adjust the model.

Gradient Descent

Algorithm to minimize the loss function by iteratively adjusting model parameters until convergence.

Learning Rate

Parameter in gradient descent that controls the step size during parameter updates.

Logistic Regression

Statistical model estimating the probability of an input belonging to a particular class (0 or 1).

Logistic Regression Output

Probability close to 1: strongly positive class. Close to 0: strongly negative class.

Support Vector Machines (SVM)

Finds the optimal hyperplane to best divide data points from different classes.

Hyperplane (in SVM)

Decision boundary separating classes in SVM.

Margin (in SVM)

Distance between the hyperplane and the closest data points (support vectors).

Support Vectors

Data points closest to the hyperplane, influencing its position.

SVM Hyper-parameter C

SVM hyper-parameter controlling the trade-off between margin size and misclassification errors.

Large C (in SVM)

A high C leads to a smaller margin and can cause overfitting; the model tries to classify all training examples correctly.

Regularization

A penalty on complexity to improve generalization. The C parameter affects the strength of this penalty.

Small C (in SVM)

Allows some misclassifications to occur, leading to a larger margin.

Greedy Policy

Always selecting the action with the highest expected value without exploring alternatives.

Dynamic Programming

Algorithms for optimal policies, assuming full knowledge of the MDP.

Temporal Difference Learning

Learning optimal policies without knowing the transition or reward functions.

Learning Rate (α)

Determines how much the model updates its estimates.

Infinite Exploration

Every state-action pair must be experienced infinitely many times.

Non-Stationarity in MARL

Environment changes continuously due to multiple agents acting simultaneously.

Zero-Sum Game

A game where one agent’s gain is exactly another agent’s loss.

Joint Policy

Specifies the actions of all agents in every state.

Best Response Policy

Maximizes expected return given that other agents' policies are fixed.

Nash Equilibrium

No agent can improve its return by unilaterally changing its policy.

Completeness (clustering)

Assigning all points of a label to the same cluster.

V-Measure

Harmonic mean of homogeneity and completeness in clustering (ideal value: 1).

Data Mining

Transforming data into intelligible knowledge by finding patterns.

Confidence (rule)

Ratio of rule support to antecedent support.

Association Rule

Implication: X -> Ik (X is a subset of items, Ik is not in X).

Support (itemset)

Fraction of transactions containing the itemset.

Support (rule)

Fraction of transactions with both antecedent and consequent.

Lift (rule)

Ratio of rule support to product of antecedent and consequent supports.

Interest (rule)

Difference between rule confidence and consequent support.

Apriori Algorithm

Identifies frequent items, generates candidates, prunes infrequent sets, creates rules.

Dendrogram

Diagram showing hierarchical cluster structure.

Single Linkage

Minimum distance between any two points from each cluster.

Principal Component Analysis (PCA)

Projects high-dimensional data into a lower-dimensional space.

Reinforcement Learning (RL)

Agent interacts with environment to maximize rewards.

Learning Goal (RL)

Maximizing the sum of future rewards from a state.

Recurrent Neural Networks (RNNs)

Neural networks that process sequential data where order matters; they have loops to maintain memory of previous inputs.

Regression

Used for real-valued continuous target variables; predicts quantities.

Classification

Used for discrete category or label target variables; predicts categories.

Parametric Regression

Requires choosing the function's shape in advance; learns a fixed number of parameters.

Non-Parametric Regression

Does not assume a specific function shape; more flexible but needs more data.

K-Nearest Neighbors Regression

Predicts a target value based on a weighted combination of the K nearest neighbors’ values.

Linear Regression

Minimizes the Residual Sum of Squares (RSS) to find the best-fitting linear function.

Regression Trees

Extends decision trees to regression, using Mean Squared Error (MSE) to evaluate attributes.

Neural Network Architecture

Uses one neuron in the output layer with a linear activation function and MSE loss.

Support Vector Regression (SVR)

Finds a hyperplane that keeps all points within an epsilon-tube, allowing small errors.

Clustering

Partitions data points into groups based on similarity, aiming for high intra-cluster similarity and high inter-cluster dissimilarity.

K-means Algorithm process

  1. Choose K. 2. Initialize centroids. 3. Assign points. 4. Compute new centroids. 5. Repeat until convergence.

K-means: Random Initialization Dependence

The positions of the initial centroids greatly affect the results.

K-means Algorithm: Assumption of Linear Separation

It only works effectively if the clusters of the data are linearly separable.

Dunn Index for clustering

Higher values indicate better clustering performance.

Study Notes

Data Science

  • Data science applies computational and statistical methods to gain insights into real-world problems.
  • The core components include raw data and a well-defined problem statement.
  • Model choice is determined by maximizing performance on a specific evaluation metric.
  • The macro-steps are data collection, data analysis, and data presentation.
  • Data analysis includes exploration, model selection, and testing.
  • Data presentation aims to maximize information while minimizing unnecessary detail.
  • Data management provides essential support, including efficient storage, retrieval, and suitable infrastructure.

Statistics

  • A correlation coefficient measures the linear relationship between two variables, ranging from -1 to 1.
  • A value of 1 indicates a positive correlation, 0 indicates no correlation, and -1 indicates a negative correlation.
  • Pearson’s correlation coefficient, corr(X, Y), is the ratio of the covariance to the product of the standard deviations (see the sketch after this list).
  • The Bonferroni principle warns against discovering meaningless patterns in data analysis.
  • A sample is a subset of a population, and statistical analysis aims to draw conclusions about the entire population from the sample.
  • Selection bias occurs when a non-representative sample is chosen for analysis.
  • Simpson's paradox shows trends observable in data groups that disappear when the groups are combined, indicating confounding variables.
  • Descriptive statistics summarizes observed phenomena, useful for quick data analysis (e.g., mean).
  • Inferential statistics generalizes results from data to the entire population through hypothesis testing and model building.
  • A frequentist approach formulates a hypothesis, estimates a confidence level, repeats experiments, and computes test statistics.
  • A Bayesian approach applies probability to statistical problems, finding the probability of a hypothesis using Bayes’ theorem.
  • Hypothesis testing evaluates the likelihood of a hypothesis being true, given the data.
  • The common approach to hypothesis testing involves computing the probability of observing the data given the null hypothesis (p-value).
  • The null hypothesis is rejected if the p-value is less than a threshold (T).
  • The central limit theorem states that the sample mean of any random variable approaches a normal distribution as sample size increases.
  • Bayes' theorem updates the probability of an event given new information, expressed as p(A|B) = p(A) p(B|A)/p(B).
  • Lindley’s paradox indicates that Bayesian and frequentist approaches can yield different results based on prior distribution choices.
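
To make the correlation-coefficient definition concrete, here is a minimal NumPy sketch; the data values are made up for illustration.

```python
import numpy as np

# Toy data (made up for illustration).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# corr(X, Y) = cov(X, Y) / (sigma_X * sigma_Y)
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))  # population covariance
corr = cov_xy / (x.std() * y.std())
print(round(corr, 4))            # close to +1: strong positive linear relationship

# Cross-check against NumPy's built-in estimator.
print(np.corrcoef(x, y)[0, 1])
```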

Data Visualization

  • Anscombe’s quartet highlights the limitations of descriptive statistics by showing datasets with identical stats but different graphs.
  • The Gestalt principles demonstrate that the whole is different from the sum of its parts.
  • Closure: Perceiving separate elements as a complete figure.
  • Common fate: Grouping objects moving in the same direction as a unit.
  • Continuity: Perceiving disconnected figures as a continuous form.
  • Proximity: Perceiving objects near each other as a group.
  • Similarity: Grouping similar elements into collective entities.
  • Symmetry: Perceiving symmetry in objects regardless of distance.
  • A histogram plots counts or frequencies of a variable (see the plotting sketch after this list).
  • A bar plot compares the same metric across different categories, where bars can be reordered.
  • A pie chart represents parts/proportions of a whole.
  • A scatter plot analyzes the relationship between two continuous variables, revealing trends or correlations.
  • A box plot visualizes the distribution of values, showing the median, spread, and symmetry of the dataset.
  • A heatmap represents matrix entries with colors and is useful for correlations/distances between entities.
  • A bubble chart is a scatter plot variation that can represent more than two dimensions using color and point diameter.
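
As a sketch of how these chart types map to code, the following uses matplotlib with synthetic data; the layout and variable names are illustrative only.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
values = rng.normal(size=500)              # one continuous variable
x, y = rng.random(100), rng.random(100)    # two continuous variables

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].hist(values, bins=30)              # histogram: counts/frequencies
axes[0].set_title("Histogram")
axes[1].boxplot(values)                    # box plot: distribution and symmetry
axes[1].set_title("Box plot")
axes[2].scatter(x, y)                      # scatter plot: relation of two variables
axes[2].set_title("Scatter plot")
plt.tight_layout()
plt.show()
```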

ML Introduction

  • The three main machine learning paradigms are supervised, unsupervised, and reinforcement learning.
  • Supervised learning finds a mapping between input and output variables using labeled data.
  • Classification involves a discrete output, while regression involves a continuous output.
  • Unsupervised learning uses unlabeled data to cluster, find patterns, and identify important features.
  • The confusion matrix measures prediction model performance, reporting true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN).
  • Accuracy is the ratio of correct predictions (TP+TN) to all predictions (TP+TN+FP+FN).
  • Precision is the ratio of true positives (TP) to all positive predictions (TP+FP).
  • Recall is the ratio of true positives (TP) to all actual positives (TP+FN).
  • The F1 score is the harmonic mean of precision and recall, 2PR/(P+R) (see the metrics sketch after this list).
  • A recall-precision curve evaluates classification model performance, particularly for imbalanced datasets.
  • A receiver operating characteristic (ROC) curve assesses binary classification model performance, particularly for balanced datasets.
  • Regression metrics include Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and Mean Absolute Deviation (MAD).
  • Model selection chooses the best hyperparameters, splitting data into training, validation, and test sets.
  • K-fold cross-validation splits data into k folds, using one as a test set and the rest for training and validation over k turns.
  • Micro-averaging sums TPs, FPs, and FNs before computing metrics.
  • Macro-averaging computes per-fold metrics and then averages them.
  • Overfitting occurs when a model learns the training set too well but fails to generalize to new examples.
  • Data pre-processing tasks include aggregation, cleaning, discretization, normalization, conversion, and integration.
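
A minimal sketch of the classification metrics above, computed directly from confusion-matrix counts; the counts are made up.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)                    # accuracy of positive predictions
    recall = tp / (tp + fn)                       # coverage of actual positives
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Made-up counts for illustration.
print(classification_metrics(tp=40, tn=45, fp=10, fn=5))
```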

ML KNN

  • KNN classifies a new instance based on the most frequent target value of its k nearest neighbors, using a distance function for similarity (see the sketch after this list).
  • Distance metrics include Euclidean and Hamming distance.
  • KNN can be used for regression when the variable is continuous by averaging the target values of the k nearest neighbors.
  • KNN requires numeric features.
  • KNN has high computational complexity in high-dimensional spaces.
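
A minimal NumPy sketch of KNN classification by majority vote; the toy data is made up.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest neighbors (Euclidean distance)."""
    distances = np.linalg.norm(X_train - x, axis=1)   # distance to every training point
    nearest = np.argsort(distances)[:k]               # indices of the k closest points
    return Counter(y_train[nearest]).most_common(1)[0][0]

X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y = np.array(["a", "a", "b", "b"])
print(knn_predict(X, y, np.array([0.95, 0.9])))       # -> "b"
```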

ML Decision Trees

  • Decision Trees are widely used ML models that classify data based on attribute-value representations.
  • Each internal node represents an attribute.
  • Each branch corresponds to a value of that attribute.
  • Each leaf node represents a class.
  • A decision tree can be viewed as a set of classification rules, with each path from the root to a leaf representing one such rule.
  • The C4.5 algorithm builds decision trees by choosing attributes, partitioning the dataset, applying C4.5 recursively, and stopping.
  • C4.5 handles continuous attributes and uses information gain ratio.
  • Metrics that aid feature selection include the Gini index and information gain ratio.
  • The Gini index measures the impurity of a node, with a best value of 0 (see the sketch after this list).
  • Information gain measures the expected reduction in entropy, with higher values indicating better splits.
  • Information gain ratio normalizes information gain, solving the bias issue of information gain and the Gini Index.
  • Pruning reduces overfitting by removing unnecessary branches using pre-pruning (stopping tree growth early) and post-pruning (removing subtrees).
  • Decision trees are interpretable and handle mixed data types but are prone to overfitting.
  • Random forests are ensembles of decision trees trained on random feature subsets.
  • Ensemble models combine multiple weak classifiers, including bagging, boosting, and stacking.
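
The split criteria above are easy to compute by hand; here is a minimal NumPy sketch with made-up labels.

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    """Gini impurity: 1 - sum(p_i^2); 0 means the node is pure."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def information_gain(parent, children):
    """Expected reduction in entropy after a split."""
    n = len(parent)
    return entropy(parent) - sum(len(c) / n * entropy(c) for c in children)

labels = np.array([1, 1, 1, 0, 0, 0])
split = [np.array([1, 1, 1]), np.array([0, 0, 0])]     # a perfect split
print(gini(labels), information_gain(labels, split))   # 0.5 and 1.0
```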

Linear classifiers

  • Linear classifiers separate data using a linear decision boundary.
  • They are useful for large datasets where simplicity and low computational cost are crucial.
  • The perceptron is a simple neural network model with a single neuron, capable of classifying linearly separable examples.
  • A loss function measures the difference between model predictions and actual target values.
  • Gradient descent minimizes the loss function by iteratively adjusting model parameters (see the sketch after this list).
  • The learning rate determines the step size in parameter updates, requiring careful tuning to avoid slow convergence or instability.
  • The perceptron can be used for regression by removing the threshold activation function and using a different loss function.
  • Logistic regression estimates the probability that an input belongs to a class using the sigmoid function.
  • The output of logistic regression is the probability, and feature weights indicate the importance and direction of the feature's influence.
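
A minimal sketch of logistic regression trained by gradient descent, assuming NumPy; the tiny dataset and hyper-parameter values are made up.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic(X, y, lr=0.1, epochs=1000):
    """Fit a logistic-regression model by gradient descent on the cross-entropy loss."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        p = sigmoid(X @ w + b)             # predicted probabilities in [0, 1]
        w -= lr * X.T @ (p - y) / len(y)   # step opposite to the gradient
        b -= lr * np.mean(p - y)
    return w, b

# Tiny, made-up, linearly separable data.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
w, b = train_logistic(X, y)
print(sigmoid(X @ w + b).round(2))         # probabilities increase with x
```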

Support Vector Machines

  • SVMs find an optimal separating hyperplane to divide data points into classes.
  • SVMs aim to maximize the margin, which is the distance between the hyperplane and the closest data points.
  • A soft-margin SVM allows some misclassification using slack variables.
  • The C hyper-parameter balances margin size and classification errors (see the sketch after this list).
  • Large C values try to classify all points correctly, which can lead to overfitting; small C values allow misclassifications, improving generalization.
  • Regularization prevents overfitting by adding a penalty term to the loss function.
  • Kernel machines extend SVMs to non-linear problems by transforming the input space.
  • Kernel machines use the "kernel trick" to compute dot products in a higher-dimensional space efficiently.
  • Polynomial and Gaussian (RBF) kernels are commonly used.
  • Multi-class classification can be handled with one-vs-all (OvA) or one-vs-one (OvO) approaches.
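
A short sketch of the C trade-off with an RBF kernel, assuming scikit-learn is available; the make_moons data and C values are illustrative.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Non-linearly separable toy data.
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = SVC(C=C, kernel="rbf")   # Gaussian (RBF) kernel; C trades margin size vs. errors
    clf.fit(X_tr, y_tr)
    print(f"C={C}: accuracy={clf.score(X_te, y_te):.2f}, "
          f"support vectors={clf.n_support_.sum()}")
```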

Neural Networks

  • Neural Networks (NNs) are used for function approximation.
  • NNs minimize a loss function, such as binary or multinomial cross-entropy.
  • A feed-forward neural network (FNN) consists of an input layer, hidden layers, and an output layer.
  • Each neuron applies a function $f(x;\theta) = g(w \cdot x + b)$, where $g$ is a non-linear activation function (see the sketch after this list).
  • Activation functions introduce non-linearity, enabling the network to learn complex relationships.
  • Common activation functions include Sigmoid, Softmax, and ReLU.
  • Neural networks are trained using supervised learning with labeled data.
  • Backpropagation computes gradients for all parameters, enabling efficient training through gradient descent.
  • Early stopping prevents overfitting by monitoring validation error.
  • Convolutional Neural Networks (CNNs) are used for image classification, employing convolutional layers to learn spatial hierarchies.
  • Recurrent Neural Networks (RNNs) are used for sequential data processing, such as time series forecasting and NLP.
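
A minimal NumPy sketch of a feed-forward pass through one hidden layer; the layer sizes and random weights are illustrative only (no training is shown).

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    e = np.exp(z - z.max())      # subtract the max for numerical stability
    return e / e.sum()

def forward(x, params):
    """Each layer computes g(W @ x + b); the weights and biases are the trainable parameters."""
    h = relu(params["W1"] @ x + params["b1"])          # hidden layer (ReLU)
    return softmax(params["W2"] @ h + params["b2"])    # output layer (class probabilities)

rng = np.random.default_rng(0)
params = {"W1": rng.normal(size=(4, 3)), "b1": np.zeros(4),
          "W2": rng.normal(size=(2, 4)), "b2": np.zeros(2)}
print(forward(np.array([0.5, -1.0, 2.0]), params))     # two probabilities summing to 1
```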

Regression

  • Regression predicts real-valued continuous variables, unlike classification, which predicts discrete categories.
  • Parametric regression requires assuming a function shape in advance.
  • Linear regression is an example of parametric regression (see the sketch after this list).
  • Non-parametric regression doesn't assume a function shape but requires more data.
  • K-Nearest Neighbors (KNN) regression is an example of Non-parametric regression.
  • K-nearest neighbors, linear regression, decision trees, neural networks, and support vector machines can all be adapted for regression.
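
A minimal sketch of linear regression solved with the Moore-Penrose pseudo-inverse (the least-squares solution); the data is made up.

```python
import numpy as np

# Toy data (made up): y is roughly 2x.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.2, 4.1, 5.9, 8.1])

# Add a column of ones so the intercept is learned as an ordinary coefficient,
# then solve the least-squares problem with the pseudo-inverse.
X_aug = np.hstack([X, np.ones((len(X), 1))])
w = np.linalg.pinv(X_aug) @ y
print(w)   # [slope, intercept], approximately [1.95, 0.2]
```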

Unsupervised learning

  • Clustering groups data points based on similarity, maximizing intra-cluster and minimizing inter-cluster similarity.
  • The K-means algorithm assigns each data point to the closest centroid (see the sketch after this list).
  • The K-means algorithm computes new centroids based on the mean of assigned points.
  • K-means is dependent on random initialization.
  • K-means requires choosing the number of clusters (K) in advance.
  • K-means assumes linear separation.
  • Clustering performance measures include the Dunn Index, Silhouette Score, Homogeneity, Completeness, and V-Measure.
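
A minimal NumPy sketch of the K-means loop described above; it ignores the empty-cluster edge case, and the two-blob data is synthetic.

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Lloyd's algorithm sketch: assign points, recompute centroids, repeat."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # random initialization
    for _ in range(iters):
        # Assign every point to its closest centroid.
        labels = np.argmin(np.linalg.norm(X[:, None] - centroids, axis=2), axis=1)
        # Recompute each centroid as the mean of its assigned points.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):              # converged
            break
        centroids = new_centroids
    return labels, centroids

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(size=(50, 2)), rng.normal(size=(50, 2)) + 5.0])
labels, centroids = kmeans(X, k=2)
print(centroids.round(1))   # one centroid near (0, 0), one near (5, 5)
```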

Association Rule Mining

  • Data mining is the process of transforming data into intelligible knowledge.
  • The confidence of a rule is the probability of the consequent occurring, given the antecedent.
  • An association rule implies X -> Ik, where X is a subset of items and Ik is an item not in X.
  • The support of an itemset is the fraction of transactions containing the itemset.
  • The support of a rule is the fraction of transactions containing both the antecedent and consequent.
  • The lift of a rule measures the support of the rule relative to the expected support if the items were independent (see the sketch after this list).
  • Rule with Lift > 1 indicates positive dependence.
  • Interest measures the difference between the confidence of the rule and the support of the consequent.
  • Apriori identifies frequent items, generates candidate itemsets, prunes infrequent itemsets, and generates rules.
  • A dendrogram is a tree-like diagram that visualizes the hierarchical structure of clusters.
  • Single Linkage calculates the minimum distance between any two points from each cluster, able to capture elongated structures.
  • Principal Component Analysis (PCA) reduces dimensionality by projecting data into a lower space, preserving main characteristics.
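
A minimal sketch of support, confidence, and lift computed over made-up market baskets.

```python
def support(transactions, itemset):
    """Fraction of transactions containing every item in the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    return support(transactions, set(antecedent) | set(consequent)) / support(transactions, antecedent)

def lift(transactions, antecedent, consequent):
    """Lift > 1 indicates positive dependence between antecedent and consequent."""
    return confidence(transactions, antecedent, consequent) / support(transactions, consequent)

# Made-up market baskets.
T = [{"milk", "bread"}, {"milk", "bread", "butter"}, {"bread"}, {"milk", "butter"}]
print(support(T, {"milk", "bread"}))       # 0.5
print(confidence(T, {"milk"}, {"bread"}))  # 2/3
print(lift(T, {"milk"}, {"bread"}))        # (2/3) / (3/4), slightly below 1
```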

Reinforcement Learning (RL)

  • RL involves an agent learning to achieve goals by interacting with an environment and receiving rewards for its actions.
  • A Markov Decision Process (MDP) models sequential decision-making with states, actions, rewards, and transition functions.
  • The learning goal in single-agent RL is to maximize the expected return, often discounted to prioritize immediate rewards.
  • The discount factor determines the significance of future rewards: a value near 0 prioritizes immediate rewards, and one near 1 values future rewards almost equally.
  • State value function measures the expected value of being in a state.
  • State-action value function measures the expected value of taking an action in a state.
  • A greedy policy selects the action with the highest expected value.
  • Dynamic programming computes optimal policies, assuming full knowledge of the MDP.
  • Temporal difference learning learns optimal policies without knowing the transition or reward functions beforehand (see the sketch after this list).
  • Learning rate determines how much the model updates its estimates.
  • The common assumption made to prove convergence of RL algorithms is infinite exploration.
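
A minimal sketch of one temporal-difference (Q-learning) update, assuming NumPy; the table sizes, transition, and hyper-parameter values are made up.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One temporal-difference update from a sampled transition.
    No transition or reward function is needed, only the observed experience."""
    td_target = r + gamma * np.max(Q[s_next])   # reward plus discounted best next value
    Q[s, a] += alpha * (td_target - Q[s, a])    # alpha near 0: slow but stable learning
    return Q

Q = np.zeros((3, 2))                            # toy table: 3 states, 2 actions
Q = q_learning_update(Q, s=0, a=1, r=1.0, s_next=2)
print(Q)
```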

Multi-Agent Reinforcement Learning (MARL)

  • MARL introduces challenges over RL through Non-stationarity, credit assignment, and equilibrium selection.
  • A zero-sum game is one where the sum of the agents’ rewards is always zero.
  • A joint policy specifies the actions of all agents in every state.
  • A best response policy maximizes its expected return given fixed policies of other agents.
  • Nash equilibrium means no agent can improve its expected return by unilaterally changing its policy.
  • Independent learning is simple, but suffers from non-stationarity issues.
  • Central learning avoids non-stationarity, with increased computational complexity.
  • CTDE shares information during training but acts independently.
  • Deep learning in RL enables generalization and efficient representation of complex environments.
  • In RL, the training target changes continuously (a moving target problem) unlike supervised learning.
  • RL data isn't i.i.d., which can cause overfitting to recent experiences.
