Questions and Answers
Which of the following best describes the primary goal of data science?
- Applying computational and statistical techniques to gain insights into real-world problems. (correct)
- Efficiently storing and retrieving large datasets.
- Creating visually appealing data presentations.
- Developing new hardware and software for data storage.
Data analysis in data science solely involves choosing a model without any prior exploration of the data.
False
Name the three macro-steps involved in a typical data science task.
Data collection, data analysis, and data presentation
A crucial aspect of statistical analysis is choosing a good __________, as selection __________ can lead to non-representative conclusions.
Match the following concepts with their descriptions:
In the context of correlation coefficients, what does a value of 0 indicate?
What does Simpson's paradox primarily point to?
Descriptive statistics involves making conclusions that extend beyond the given dataset.
What is the primary drawback of a greedy policy in reinforcement learning?
Temporal difference learning requires prior knowledge of the transition function and reward function.
Describe the impact of a very small learning rate ($ \alpha \rightarrow 0 $) on the learning process.
A common assumption for proving the convergence of many RL algorithms is ______ exploration, ensuring every state-action pair is experienced infinitely many times.
Match the following Multi-Agent Reinforcement Learning (MARL) challenges with their descriptions:
What characterizes a zero-sum game in the context of Multi-Agent Reinforcement Learning (MARL)?
A Nash equilibrium guarantees that all agents achieve the highest possible reward.
Which learning mode in MARL balances independence and coordination by sharing information during training but allowing independent action during deployment?
What are the key advantages of using deep learning in reinforcement learning?
In reinforcement learning, the absence of a fixed ground truth and the presence of non-i.i.d. data, which violates assumptions of supervised learning, leads to the ______ target problem and potential overfitting.
Which of the following ensemble methods learns sequentially, with each model correcting the previous one?
Linear classifiers are generally unsuitable for large datasets due to their computational complexity.
What is the primary goal of Support Vector Machines (SVMs) in classification tasks?
In gradient descent, a learning rate that is too ______ can cause training to become unstable and may prevent convergence.
What is the role of the sigmoid function in logistic regression?
The perceptron can only be used for classification problems and cannot be adapted for regression tasks.
Explain how the C hyper-parameter in SVM affects the bias-variance trade-off.
The data points closest to the hyperplane in SVM, which directly influence its position, are called ______.
Which of the following statements best describes the purpose of a loss function in machine learning?
In logistic regression, a feature weight close to zero indicates that the feature is highly predictive of the positive class.
Describe the process of gradient descent and its goal in optimizing model parameters.
The C hyper-parameter in SVM controls the trade-off between achieving a larger ______ and minimizing classification errors.
Which of the following is a characteristic of ensemble methods that use boosting?
Regularization techniques in machine learning aim to increase the complexity of models to better fit the training data.
In frequentist hypothesis testing, which condition leads to the rejection of the null hypothesis (H0)?
According to the Central Limit Theorem, the sample mean of any random variable always follows a normal distribution, regardless of sample size.
Bayes' theorem provides a method to update the probability of an event given new information. Write the formula for Bayes' Theorem.
__________ statistics aims to generalize results from a sample to an entire population, perform hypothesis testing, and build data models to draw conclusions.
Match the following Gestalt principles with their descriptions:
Which type of chart is most suitable for visualizing the distribution of a single variable and identifying whether the distribution is symmetrical?
Anscombe's quartet demonstrates that descriptive statistics are always sufficient for understanding the underlying patterns in a dataset.
Explain the purpose of a heatmap and in what kind of analysis it would be most useful.
In machine learning, finding a mapping between input and output variables directly from labeled data observations is known as __________ learning.
What is the primary difference between classification and regression in supervised learning?
Unsupervised learning requires labeled data to train a model.
Define what a confusion matrix is and what its purpose is in evaluating the performance of a machine learning model.
__________ is calculated as the ratio of true positives to the sum of true positives and false positives, indicating the accuracy of the positive predictions.
When should a recall-precision curve be preferred over a receiver operating characteristic (ROC) curve for evaluating a classification model?
Lindley's paradox suggests that the Bayesian and frequentist inference approaches will always arrive at the same conclusions, regardless of the prior distribution.
What is the primary purpose of using kernel functions in Kernel Machines?
The One-vs-One (OvO) approach for multi-class SVM classification requires training fewer models than the One-vs-All (OvA) approach.
In the context of neural networks, what is function approximation?
In a feed-forward neural network, each neuron applies a function: $f(x;\theta) = g(w \cdot x + b)$, where $g$ is a non-linear ________ function.
Match the following activation functions with their primary use case:
During the training of a neural network, what is the role of the loss function?
Backpropagation is used to calculate the loss function in a neural network.
What problem does early stopping address during neural network training?
Convolutional Neural Networks (CNNs) are particularly well-suited for tasks involving ________ classification due to their ability to automatically extract hierarchical features.
Which of the following is a disadvantage of the One-vs-All (OvA) approach in multi-class SVM classification?
Without activation functions, a neural network can effectively model non-linear dependencies in data.
What are the trainable parameters in each layer of a feedforward neural network?
The ________ kernel captures complex decision boundaries by mapping data into an infinite-dimensional space.
What is the purpose of the 'patience' parameter in the early stopping technique?
Match each name to its correct process:
Which characteristic distinguishes Recurrent Neural Networks (RNNs) from feed-forward networks?
Regression is used to predict discrete categories or labels.
What key assumption differentiates parametric regression from non-parametric regression?
In K-Nearest Neighbors Regression, the target value is predicted as a weighted combination of the K ________’ values.
Which loss function is commonly used in neural networks designed for regression tasks?
Match each clustering performance measure with its description:
Which of the following is a primary goal of clustering?
What is a major limitation of the K-means algorithm regarding cluster shape?
Since clustering is unsupervised, there is a direct way to evaluate the performance.
What is the role of the Moore-Penrose pseudo-inverse in linear regression?
In regression trees, what criterion is used to evaluate and select attributes during training?
Which of the following adjustments is needed to adapt a Multi-Layer Perceptron (MLP) for regression tasks compared to classification?
Support Vector Regression (SVR) separates classes by finding a margin.
In the K-means algorithm, the number of clusters (K) must be _________.
What is the significance of high intra-cluster similarity and high inter-cluster dissimilarity in clustering?
Which of the following is the MOST accurate description of the difference between data mining and machine learning?
A high lift value (e.g., lift > 1) for an association rule indicates a negative dependence between the antecedent and the consequent.
In the context of association rule mining, define the 'interest' of a rule and explain what it means when the interest is zero.
The Apriori algorithm generates candidate itemsets of length k, given all frequent itemsets of length ______.
Match the linkage methods with their descriptions in hierarchical clustering:
What is the primary purpose of Principal Component Analysis (PCA)?
In Reinforcement Learning (RL), a sparse reward system provides the agent with a reward after each action.
Define a Markov Decision Process (MDP) and briefly explain the significance of the Markov property within this framework.
In Reinforcement Learning, the learning goal is to maximize the expected ______, which is the sum of future rewards an agent can collect from a given state.
What impact does a discount factor ($\gamma$) close to 0 have on the agent's decision-making in Reinforcement Learning?
The completeness measure in clustering evaluates whether clusters are internally homogenous.
Given an association rule A -> B, what does the support of the rule represent?
A ______ is a tree-like diagram that represents the hierarchical structure of clusters in hierarchical clustering.
Explain the purpose of state value function, $V(s)$, and state-action value function, $Q(s, a)$, in Reinforcement Learning.
What is a greedy policy in Reinforcement Learning?
When training a machine learning model, what is the primary purpose of splitting data into training, validation, and test sets?
Maintaining class proportions when splitting data for model training is generally recommended.
In K-fold cross-validation, the data is split into k folds, and in each turn, one fold acts as the __________ set while the others form the training and validation sets.
What is the key difference between micro and macro averaging in the context of evaluating model performance?
Describe overfitting in the context of machine learning.
Match the following data preprocessing tasks with their descriptions:
In the context of the KNN classifier, how is the target variable for a new instance typically determined?
Hamming distance can be used as a metric in KNN.
How can the KNN algorithm be adapted for regression tasks?
Name one requirement and one drawback of using the KNN algorithm.
In a decision tree, what does each internal node represent?
The C4.5 algorithm uses __________ __________ __________ to select the best attribute for splitting the data.
Which of the following statements best describes the Gini Index in the context of decision trees:
Pruning is used to increase the complexity of decision trees.
Give two pros and one con of decision trees.
Flashcards
Data Science
Applying computational and statistical techniques to gain insight into real-world problems.
Data Science Application Ingredients
Raw data, a problem statement, a model, and an evaluation metric.
Data Science Macro-Steps
Data collection, data analysis, and data presentation.
Data Management
Correlation Coefficient
Bonferroni's principle
Sample
Selection Bias
Data Splitting
K-Fold Cross-Validation
Micro Average
Macro Average
Overfitting
Data Pre-processing
KNN Classifier
KNN Distance Metric
KNN for Regression
Decision Tree
C4.5 Algorithm
Feature Selection Metrics
Gini Index
Information Gain
Pruning
Inferential Statistics
Goal of Hypothesis testing
Null Hypothesis Rejection
Central Limit Theorem
Bayes' Theorem
Lindley's Paradox
Anscombe's Quartet
Gestalt Principles Goal
Histogram
Bar Plot
Pie Chart
Scatter Plot
Supervised Learning
Classification vs. Regression
Unsupervised Learning
Kernel Machines
Kernel Trick
Polynomial Kernel
Gaussian (RBF) Kernel
Multi-class SVM
Neural Networks task
Feed-forward NN
Activation Function Role
Common Activation Functions
NN Training
Backpropagation
Early Stopping
Convolutional NN
One-vs-All (OvA)
One-vs-One (OvO)
Linear Classifiers
Perceptron
Loss Function
Gradient Descent
Learning Rate
Logistic Regression
Logistic Regression Output
Support Vector Machines (SVM)
Hyperplane (in SVM)
Margin (in SVM)
Support Vectors
SVM Hyper-parameter C
Large C (in SVM)
Regularization
Small C (in SVM)
Greedy Policy
Dynamic Programming
Temporal Difference Learning
Learning Rate (α)
Infinite Exploration
Non-Stationarity in MARL
Zero-Sum Game
Joint Policy
Best Response Policy
Nash Equilibrium
Completeness (clustering)
V-Measure
Data Mining
Confidence (rule)
Association Rule
Support (itemset)
Support (rule)
Lift (rule)
Interest (rule)
Apriori Algorithm
Dendrogram
Single Linkage
Principal Component Analysis (PCA)
Reinforcement Learning (RL)
Learning Goal (RL)
Recurrent Neural Networks (RNNs)
Regression
Classification
Parametric Regression
Non-Parametric Regression
K-Nearest Neighbors Regression
Linear Regression
Regression Trees
Neural Network Architecture
Support Vector Regression (SVR)
Clustering
K-means Algorithm process
K-means: Random Initialization Dependence
K-means Algorithm: Assumptions of Linear Separation
Dunn Index for clustering
Study Notes
Data Science
- Data science applies computational and statistical methods to gain insights into real-world problems.
- The core components include raw data and a well-defined problem statement.
- Model choice is determined by maximizing performance on a specific evaluation metric.
- The macro-steps are data collection, data analysis, and data presentation.
- Data analysis includes exploration, model selection, and testing.
- Data presentation aims to maximize information while minimizing unnecessary detail.
- Data management provides essential support, including efficient storage, retrieval, and suitable infrastructure.
Statistics
- A correlation coefficient measures the linear relationship between two variables, ranging from -1 to 1.
- A value of 1 indicates a positive correlation, 0 indicates no correlation, and -1 indicates a negative correlation.
- Pearson’s correlation coefficient, corr(X, Y) = cov(X, Y) / (σ_X σ_Y), is the ratio of covariance to the product of standard deviations (computed in the sketch after this list).
- The Bonferroni principle warns against discovering meaningless patterns in data analysis.
- A sample is a subset of a population, and statistical analysis aims to draw conclusions about the entire population from the sample.
- Selection bias occurs when a non-representative sample is chosen for analysis.
- Simpson's paradox shows trends observable in data groups that disappear when the groups are combined, indicating confounding variables.
- Descriptive statistics summarizes observed phenomena, useful for quick data analysis (e.g., mean).
- Inferential statistics generalizes results from data to the entire population through hypothesis testing and model building.
- A frequentist approach formulates a hypothesis, estimates a confidence level, repeats experiments, and computes test statistics.
- A Bayesian approach applies probability to statistical problems, finding the probability of a hypothesis using Bayes’ theorem.
- Hypothesis testing evaluates the likelihood of a hypothesis being true, given the data.
- The common approach to hypothesis testing involves computing the probability of observing the data given the null hypothesis (p-value).
- The null hypothesis is rejected if the p-value is less than a threshold (T).
- The central limit theorem states that the sample mean of any random variable approaches a normal distribution as sample size increases.
- Bayes' theorem updates the probability of an event given new information, expressed as p(A|B) = p(A) p(B|A)/p(B).
- Lindley’s paradox indicates that Bayesian and frequentist approaches can yield different results based on prior distribution choices.
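A minimal sketch of the correlation coefficient defined above, assuming NumPy is available; the function name and toy data are illustrative, not part of the original notes.

```python
import numpy as np

def pearson_corr(x, y):
    """Pearson's correlation: covariance divided by the product of standard deviations."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    cov = np.mean((x - x.mean()) * (y - y.mean()))
    return cov / (x.std() * y.std())

# Perfectly linear data gives corr = 1.0; independent noise gives corr near 0.
x = np.arange(10)
print(pearson_corr(x, 2 * x + 3))            # -> 1.0
print(pearson_corr(x, np.random.randn(10)))  # -> near 0 (varies run to run)
```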
Data Visualization
- Anscombe’s quartet highlights the limitations of descriptive statistics by showing datasets with identical stats but different graphs.
- The Gestalt principles demonstrate that the whole is different from the sum of its parts.
- Closure: Perceiving separate elements as a complete figure.
- Common fate: Grouping objects moving in the same direction as a unit.
- Continuity: Perceiving disconnected figures as a continuous form.
- Proximity: Perceiving objects near each other as a group.
- Similarity: Grouping similar elements into collective entities.
- Symmetry: Perceiving symmetry in objects regardless of distance.
- A histogram plots counts or frequencies of a variable (see the plotting sketch after this list).
- A bar plot compares the same metric across different categories, where bars can be reordered.
- A pie chart represents parts/proportions of a whole.
- A scatter plot analyzes the relationship between two continuous variables, revealing trends or correlations.
- A box plot visualizes the distribution of values, showing the median, quartiles, and symmetry of the dataset.
- A heatmap represents matrix entries with colors and is useful for correlations/distances between entities.
- A bubble chart is a scatter plot variation that can represent more than two dimensions using color and point diameter.
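A short plotting sketch, assuming Matplotlib and NumPy are installed, showing two of the chart types above on synthetic data.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))

# Histogram: counts of a single variable, good for checking symmetry of a distribution.
values = rng.normal(loc=0, scale=1, size=500)
ax1.hist(values, bins=30)
ax1.set_title("Histogram")

# Scatter plot: relationship between two continuous variables.
x = rng.uniform(0, 10, size=100)
y = 2 * x + rng.normal(scale=2, size=100)
ax2.scatter(x, y)
ax2.set_title("Scatter plot")

plt.tight_layout()
plt.show()
```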
ML Introduction
- The three main machine learning paradigms are supervised, unsupervised, and reinforcement learning.
- Supervised learning finds a mapping between input and output variables using labeled data.
- Classification involves a discrete output, while regression involves a continuous output.
- Unsupervised learning uses unlabeled data to cluster, find patterns, and identify important features.
- The confusion matrix measures prediction model performance, reporting true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN).
- Accuracy is the ratio of correct predictions (TP+TN) to all predictions (TP+TN+FP+FN).
- Precision is the ratio of true positives (TP) to all positive predictions (TP+FP).
- Recall is the ratio of true positives (TP) to all actual positives (TP+FN).
- F1 score is calculated as 2PR/(P+R); all four metrics are computed in the sketch after this list.
- A recall-precision curve evaluates classification model performance, particularly for imbalanced datasets.
- A receiver operating characteristic (ROC) curve assesses binary classification model performance, particularly for balanced datasets.
- Regression metrics include Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and Mean Absolute Deviation (MAD).
- Model selection chooses the best hyperparameters, splitting data into training, validation, and test sets.
- K-fold cross-validation splits data into k folds, using one as a test set and the rest for training and validation over k turns.
- Micro-averaging sums TPs, FPs, and FNs before computing metrics.
- Macro-averaging computes per-fold metrics and then averages them.
- Overfitting occurs when a model learns the training set too well but fails to generalize to new examples.
- Data pre-processing tasks include aggregation, cleaning, discretization, normalization, conversion, and integration.
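The confusion-matrix metrics above reduce to a few ratios; a minimal sketch with hypothetical counts:

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Example counts: 40 true positives, 50 true negatives, 10 false positives, 5 false negatives.
print(classification_metrics(tp=40, tn=50, fp=10, fn=5))
```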
ML KNN
- KNN classifies an instance based on the most frequent target value among its k nearest neighbors, using a distance function for similarity (see the sketch after this list).
- Distance metrics include Euclidean and Hamming distance.
- KNN can be used for regression when the variable is continuous by averaging the target values of the k nearest neighbors.
- KNN requires numeric features.
- KNN has high computational complexity in high-dimensional spaces.
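A minimal KNN sketch under the description above (Euclidean distance, majority vote), assuming NumPy; the toy data is illustrative.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among the k nearest training points (Euclidean distance)."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.8]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([4.9, 5.1])))  # -> 1

# For regression, replace the vote with the (possibly distance-weighted) mean
# of the k neighbors' target values.
```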
ML Decision Trees
- Decision Trees are widely used ML models that classify data based on attribute-value representations.
- Each internal node represents an attribute.
- Each branch corresponds to a value of that attribute.
- Each leaf node represents a class.
- Can be viewed as a set of classification rules, with each path from the root to a leaf representing one such rule.
- The C4.5 algorithm builds decision trees by choosing attributes, partitioning the dataset, applying C4.5 recursively, and stopping.
- C4.5 handles continuous attributes and uses information gain ratio.
- Metrics that aid feature selection include the Gini index and information gain ratio.
- The Gini index measures the impurity of a node, with a best value of 0.
- Information gain measures the expected reduction in entropy, with higher values indicating better splits (both impurity measures are computed in the sketch after this list).
- Information gain ratio normalizes information gain, correcting the bias toward many-valued attributes that affects both information gain and the Gini index.
- Pruning reduces overfitting by removing unnecessary branches using pre-pruning (stopping tree growth early) and post-pruning (removing subtrees).
- Decision trees are interpretable and handle mixed data types but are prone to overfitting.
- Random forests are ensembles of decision trees trained on bootstrap samples and random feature subsets.
- Ensemble models combine multiple weak classifiers using strategies such as bagging, boosting, and stacking.
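A sketch of the two impurity measures named above, assuming NumPy; the label values are illustrative.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a node: 1 - sum of squared class proportions (0 = pure)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    """Entropy of a node, the quantity reduced by information gain."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

labels = np.array(["yes", "yes", "no", "no", "no"])
print(gini(labels))     # 1 - (0.4^2 + 0.6^2) = 0.48
print(entropy(labels))  # ~0.971 bits
```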
Linear classifiers
- Linear classifiers separate data using a linear decision boundary.
- They are useful for large datasets where simplicity and low computational cost are crucial.
- The perceptron is a simple neural network model with a single neuron, capable of classifying linearly separable examples.
- A loss function measures the difference between model predictions and actual target values.
- Gradient descent minimizes the loss function by iteratively adjusting model parameters (a logistic-regression example is sketched after this list).
- The learning rate determines the step size in parameter updates, requiring careful tuning to avoid slow convergence or instability.
- The perceptron can be used for regression by removing the threshold activation function and using a different loss function.
- Logistic regression estimates the probability that an input belongs to a class using the sigmoid function.
- The output of logistic regression is the probability, and feature weights indicate the importance and direction of the feature's influence.
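A minimal logistic-regression-by-gradient-descent sketch of the ideas above, assuming NumPy; the learning rate, epoch count, and toy data are illustrative choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic(X, y, lr=0.1, epochs=500):
    """Gradient descent on the average logistic-regression log loss."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = sigmoid(X @ w + b)           # predicted probability of the positive class
        grad_w = X.T @ (p - y) / len(y)  # gradient of the average log loss
        grad_b = np.mean(p - y)
        w -= lr * grad_w                 # step against the gradient; lr is the learning rate
        b -= lr * grad_b
    return w, b

# Toy linearly separable data: class 1 when the feature is large.
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 0, 1, 1, 1])
w, b = train_logistic(X, y)
print(sigmoid(X @ w + b).round(2))  # probabilities increase with the feature
```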
Support Vector Machines
- SVMs find an optimal separating hyperplane to divide data points into classes.
- SVMs aim to maximize the margin, which is the distance between the hyperplane and the closest data points.
- A soft-margin SVM allows some misclassification using slack variables.
- The C hyper-parameter balances margin size and classification errors.
- Large C values try to classify all points correctly, which can lead to overfitting; small C values allow misclassifications, improving generalization.
- Regularization prevents overfitting by adding a penalty term to the loss function.
- Kernel machines extend SVMs to non-linear problems by transforming the input space.
- Kernel machines use the "kernel trick" to compute dot products in a higher-dimensional space efficiently.
- Polynomial and Gaussian (RBF) kernels are commonly used (see the sketch after this list).
- Multi-class classification can be handled with one-vs-all (OvA) or one-vs-one (OvO) approaches.
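A minimal sketch using scikit-learn's SVC (the notes do not name a library, so this choice is an assumption); it shows the C hyper-parameter, an RBF kernel, and the fitted support vectors on toy data.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[0, 0], [1, 1], [1, 0], [4, 4], [5, 5], [4, 5]])
y = np.array([0, 0, 0, 1, 1, 1])

# Large C penalizes misclassification heavily (risking overfitting);
# the RBF kernel handles non-linear boundaries via the kernel trick.
clf = SVC(C=1.0, kernel="rbf")
clf.fit(X, y)

print(clf.predict([[0.5, 0.5], [4.5, 4.5]]))  # -> [0 1]
print(clf.support_vectors_)                   # the points that define the margin
```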
Neural Networks
- Neural Networks (NNs) are used for function approximation.
- NNs minimize a loss function, such as binary or multinomial cross-entropy.
- A feed-forward neural network (FNN) consists of an input layer, hidden layers, and an output layer.
- Each neuron applies a function: f(x;θ) = g(w⋅x + b), where g is a non-linear activation function (a forward pass is sketched after this list).
- Activation functions introduce non-linearity, enabling the network to learn complex relationships.
- Common activation functions include Sigmoid, Softmax, and ReLU.
- Neural networks are trained using supervised learning with labeled data.
- Backpropagation computes gradients for all parameters, enabling efficient training through gradient descent.
- Early stopping prevents overfitting by monitoring validation error.
- Convolutional Neural Networks (CNNs) are used for image classification, employing convolutional layers to learn spatial hierarchies.
- Recurrent Neural Networks (RNNs) are used for sequential data processing, such as time series forecasting and NLP.
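A minimal forward-pass sketch matching the neuron equation above, assuming NumPy; the layer sizes and random weights are illustrative (no training shown).

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# One hidden layer: each layer applies g(W x + b) with a non-linear activation g.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)  # 3 inputs -> 4 hidden units
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)  # 4 hidden units -> 2 classes

x = np.array([0.5, -1.2, 3.0])
h = relu(W1 @ x + b1)     # hidden representation
p = softmax(W2 @ h + b2)  # class probabilities
print(p, p.sum())         # probabilities summing to 1
```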
Regression
- Regression predicts real-valued continuous variables, unlike classification, which predicts discrete categories.
- Parametric regression requires assuming a function shape in advance.
- Linear regression is an example of parametric regression (a pseudo-inverse fit is sketched after this list).
- Non-parametric regression doesn't assume a function shape but requires more data.
- K-Nearest Neighbors (KNN) regression is an example of Non-parametric regression.
- K-nearest neighbors, linear regression, decision trees, neural networks, and support vector machines can all be adapted for regression.
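A minimal least-squares sketch via the Moore-Penrose pseudo-inverse, assuming NumPy; the toy data approximates y = 2x + 1.

```python
import numpy as np

# Fit y ≈ X_aug @ w by least squares using the Moore-Penrose pseudo-inverse.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([3.1, 5.0, 6.9, 9.2])

X_aug = np.hstack([X, np.ones((len(X), 1))])  # append a column of ones for the intercept
w = np.linalg.pinv(X_aug) @ y                 # minimizes the squared error
print(w)          # [slope, intercept], roughly [2.0, 1.0]
print(X_aug @ w)  # fitted values
```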
Unsupervised learning
- Clustering groups data points based on similarity, maximizing intra-cluster and minimizing inter-cluster similarity.
- The K-means algorithm assigns data points to the closest centroid.
- The K-means algorithm computes new centroids based on the mean of assigned points (see the sketch after this list).
- K-means is dependent on random initialization.
- K-means requires choosing the number of clusters (K) in advance.
- K-means assumes linear separation.
- Clustering performance measures include the Dunn Index, Silhouette Score, Homogeneity, Completeness, and V-Measure.
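A minimal K-means sketch of the assign/update loop above, assuming NumPy; empty-cluster handling and convergence checks are omitted for brevity.

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Lloyd's algorithm: assign points to the nearest centroid, then recompute centroids."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # random initialization
    for _ in range(iters):
        # Assignment step: index of the closest centroid for each point.
        dists = np.linalg.norm(X[:, None] - centroids[None, :], axis=2)
        labels = np.argmin(dists, axis=1)
        # Update step: new centroid = mean of the points assigned to it.
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centroids

X = np.array([[0, 0], [0, 1], [1, 0], [8, 8], [8, 9], [9, 8]], dtype=float)
labels, centroids = kmeans(X, k=2)
print(labels, centroids)
```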
Association Rule Mining
- Data mining is the process of transforming data into intelligible knowledge.
- The confidence of a rule is the probability of consequent happening, given the antecedent.
- An association rule implies X -> Ik, where X is a subset of items and Ik is an item not in X.
- The support of an itemset is the fraction of transactions containing the itemset.
- The support of a rule is the fraction of transactions containing both the antecedent and consequent.
- The lift of a rule measures the support of the rule relative to the expected support if the items were independent (computed, together with support and confidence, in the sketch after this list).
- Rule with Lift > 1 indicates positive dependence.
- Interest measures the difference between the confidence of the rule and the support of the consequent.
- Apriori identifies frequent items, generates candidate itemsets, prunes infrequent itemsets, and generates rules.
- A dendrogram is a tree-like diagram that visualizes the hierarchical structure of clusters.
- Single Linkage calculates the minimum distance between any two points from each cluster, able to capture elongated structures.
- Principal Component Analysis (PCA) reduces dimensionality by projecting data into a lower space, preserving main characteristics.
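A minimal sketch of support, confidence, lift, and interest as defined above; the transactions are illustrative.

```python
# Support, confidence, lift, and interest for the rule A -> B over a toy transaction set.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

A, B = {"bread"}, {"milk"}
supp_rule = support(A | B)              # support of the rule
conf = supp_rule / support(A)           # P(B | A)
lift = conf / support(B)                # > 1 means positive dependence
interest = conf - support(B)            # 0 means A and B are independent
print(supp_rule, conf, lift, interest)  # 0.5, 0.667, 0.889, -0.083
```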
Reinforcement Learning (RL)
- RL involves an agent learning to achieve goals by interacting with an environment and receiving rewards for its actions.
- A Markov Decision Process (MDP) models sequential decision-making with states, actions, rewards, and transition functions; the Markov property means the next state and reward depend only on the current state and action.
- The learning goal in single-agent RL is to maximize the expected return, often discounted to prioritize immediate rewards.
- The discount factor determines the significance of future rewards: a value near 0 prioritizes immediate rewards, while one near 1 values future rewards almost equally.
- State value function measures the expected value of being in a state.
- State-action value function measures the expected value of taking an action in a state.
- A greedy policy selects the action with the highest expected value.
- Dynamic programming computes optimal policies, assuming full knowledge of the MDP.
- Temporal difference learning learns optimal policies without knowing the transition or reward functions beforehand (a tabular Q-learning sketch follows this list).
- Learning rate determines how much the model updates its estimates.
- The common assumption made to prove convergence of RL algorithms is infinite exploration.
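A minimal tabular Q-learning sketch combining the pieces above (TD update, learning rate, epsilon-greedy exploration), assuming NumPy; the two-state MDP is hypothetical.

```python
import numpy as np

# TD update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
n_states, n_actions = 2, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.9, 0.1  # learning rate, discount, exploration rate

def step(s, a):
    """Unknown to the agent: action 1 in state 0 leads to state 1 and pays off."""
    return (1, 1.0) if (s == 0 and a == 1) else (0, 0.0)

rng = np.random.default_rng(0)
s = 0
for _ in range(5000):
    # Epsilon-greedy: mostly exploit, sometimes explore (avoids the greedy-policy trap).
    a = rng.integers(n_actions) if rng.random() < epsilon else int(np.argmax(Q[s]))
    s_next, r = step(s, a)
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
    s = s_next

print(Q)  # Q[0, 1] should dominate Q[0, 0]
```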
Multi-Agent Reinforcement Learning (MARL)
- MARL introduces challenges beyond single-agent RL: non-stationarity, credit assignment, and equilibrium selection.
- A zero-sum game is one in which the sum of the agents' rewards is always zero.
- A joint policy specifies the actions of all agents in every state.
- A best response policy maximizes its expected return given fixed policies of other agents.
- A Nash equilibrium means no agent can improve its expected return by unilaterally changing its policy (checked numerically in the sketch after this list).
- Independent learning is simple, but suffers from non-stationarity issues.
- Central learning avoids non-stationarity, with increased computational complexity.
- CTDE (centralized training with decentralized execution) shares information during training but acts independently at deployment.
- Deep learning in RL enables generalization and efficient representation of complex environments.
- In RL, unlike supervised learning, the training target changes continuously (the moving-target problem).
- RL data isn't i.i.d., which can cause overfitting to recent experiences.
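A minimal sketch of the zero-sum and best-response ideas above on matching pennies, assuming NumPy; at the 50/50 Nash equilibrium neither player gains by deviating.

```python
import numpy as np

# Matching pennies: a zero-sum game. Row player's payoffs; the column player's are the negation.
R = np.array([[ 1, -1],
              [-1,  1]])

def best_response_value(R, col_strategy):
    """Row player's best pure-strategy payoff given the column player's mixed strategy."""
    expected = R @ col_strategy
    return expected.max()

# At the Nash equilibrium both players mix 50/50; no unilateral deviation helps.
uniform = np.array([0.5, 0.5])
print(best_response_value(R, uniform))           # 0.0 -> no profitable deviation
print(best_response_value(R, np.array([1, 0])))  # 1.0 -> pure strategies are exploitable
```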