Questions and Answers
What is the significance of having two hidden layers in an artificial neural network (ANN)?
- They can represent any decision boundary with high accuracy. (correct)
- They increase the overall computational speed.
- They improve the interpretability of the model.
- They simplify the training process.
How is the optimal size of the hidden layer(s) in a multi-layer ANN typically determined?
- By following pre-set standard sizes for specific tasks.
- Based on theoretical analysis of network performance.
- Through extensive simulations on training data.
- By using a trial-and-error heuristic approach. (correct)
What happens during the training of a multi-layer ANN when an error is detected in the output?
- New input patterns are generated.
- The hidden layers are removed.
- The entire network resets to its initial state.
- Weights are adjusted to reduce the error. (correct)
Why might more layers be added to an ANN structure?
Which of the following is true about the learning process in a multi-layer ANN?
What is the purpose of the backward pass in an artificial neural network (ANN)?
Which factor can contribute to the overfitting of an ANN?
What does the back propagation algorithm primarily aim to achieve?
What is indicated by a higher $R^2$ value in relation to an ANN?
What does momentum help achieve when training an ANN?
Which of the following is NOT a parameter that can affect the performance of an ANN?
What does the local gradient of a neuron during back propagation represent?
What is one potential consequence of a high learning rate in an ANN?
What is the primary purpose of the least squares method in linear regression?
Which of the following indicates that a linear regression model may not adequately fit the data?
When analyzing residuals to check for homoscedasticity, what does constant variance imply?
In a confusion matrix, what does a True Positive (TP) represent?
What does a non-independent residual analysis indicate?
The linear function is represented mathematically as which of the following?
Which statement is true regarding the slope in a linear regression model?
How does increasing the number of data points affect the least squares regression line?
What does it mean if the residuals have a fan-shaped pattern when plotted?
What is represented by the term 'error' in a linear regression model?
Which logical operation results in 1 only when the inputs are different?
In a single-layer perceptron, which type of problems can it not solve?
What is true about multi-layer artificial neural networks (ANNs)?
Which of the following accurately describes the layers in a multi-layer ANN?
What does the AND operator output when both inputs are 0?
What classification task is suited for a multi-layer ANN but not for a single-layer perceptron?
How do multi-layer ANNs propagate input signals?
Which result does the OR logical operation yield for inputs 0 and 0?
How does changing the mean (μ) of a normal distribution affect its graph?
What does the standard deviation (σ) determine in a normal distribution?
In a normal distribution defined by its mean and standard deviation, what does E(X) represent?
What mathematical function describes a normal distribution?
How is the variance (Var(X)) of a normal distribution calculated?
According to the central limit theorem, how does the mean of a sample (𝑥̅) vary around the population mean (μ)?
In a normal distribution, if the mean (μ) is increased while the standard deviation (σ) remains unchanged, what happens to the distribution?
Which of the following statements is true about the total area under a normal distribution curve?
What happens to the sampling distribution of 𝑥̅ as sample size n increases?
What does a p-value greater than 0.1 indicate?
What does a positive covariance between two variables indicate?
Which of the following p-value ranges indicates a low presumption against the null hypothesis?
In the context of A/B testing, what does Fisher's exact test evaluate?
In terms of linear correlation, what does it mean when the covariance is equal to zero?
Which statement describes the 68-95-99.7 Rule?
Which scenario best demonstrates anomaly detection in machine learning?
Which of the following represents a weak linear relationship in terms of correlation?
What is the interpretation of a p-value that falls within the range of 0.01 to 0.05?
Flashcards
Normal Distribution
A bell-shaped probability distribution defined by its mean (µ) and standard deviation (σ).
Mean (µ)
The average value of the distribution.
Standard Deviation (σ)
Measures the spread or dispersion of the data around the mean.
Probability Density Function (PDF)
Central Limit Theorem
Sample Mean (𝑥̅)
Standard Error
Variance
Sampling Distribution of x̄
68-95-99.7 Rule
Confidence Interval
t-Student value
p-value
Hypothesis Testing
Correlation
Linear Correlation
A/B Testing
Anomaly Detection
Linear Regression Model
Least Squares Method
Residual Analysis
Linearity
Homoscedasticity
Independence of Errors
Confusion Matrix
True Positives (TP)
True Negatives (TN)
False Positives (FP)
Multi-layer ANNs
Hidden Layers
Learning in Multi-layer ANNs
Optimal Hidden Layer Size
Complexity in Data
Logical operators
OR operation
AND operation
XOR operation
Single-layer perceptron
Multi-layer ANN
Hidden layers
Backpropagation
Gradient Descent
Local Maximum (ANN)
Overfitting (ANN)
Training Data
Test Data
Goodness-of-Fit (ANN)
R² (ANN)
Study Notes
Machine Learning for Business Analytics
- The presentation is about machine learning for business analytics, focusing on different aspects of data analysis.
- The instructors are Dr. Marc Hilbert and Dr. Andrii Kleshchonok.
- The presentation language is English.
- The date of the presentation is October 20, 2024.
Part 1: Introduction to Business Analytics
- This section introduces the core concepts of business analytics.
- Define the task (e.g., prediction, clustering, classification, anomaly detection).
- Define objectives, error metrics, and performance standards.
- Data collection: set up data streams, storage, input, parallelisation, and Hadoop.
- Preprocessing: noise and outlier filtering, completing missing data using histograms and interpolation, normalization to scale data.
Dimensionality Reduction/Feature Selection
- Choose features to use and extract data from.
- Explore methods such as PCA, LDA, LLE, GDA.
- Consider goals, questions related to tractability.
- Design experiments, including train/validate/test data sets and cross-validation.
- Perform deployment.
Classification vs. Clustering
- Classification: uses labeled data, requires training phases, and is domain sensitive. Easy to measure performance. Includes methods like Naive Bayes, KNN, SVM, Decision Trees, Random Forests.
- Clustering: uses unlabeled data, organizes patterns based on similarity, difficult to evaluate, and includes methods like K-means, Fuzzy C-means, Hierarchical Clustering, DBScan.
Examples of ML Problems
- Predict how much customers spend in online retail.
- Explore different types of online retail customers.
- Find categories for items in an online store.
- Suggest items users might want to buy online.
Part 2: Elements of Statistics
- Discusses fundamental statistical concepts.
- Random variable descriptions, discrete and continuous.
- Probability function mapping.
- Probability function area always equals 1.
Description of Random Variables
- A random variable takes on a range of values with specific probabilities.
- The probability is how often we expect different outcomes in repeated experiments.
Discrete vs. Continuous Random Variables
- Discrete: countable number of outcomes. Examples: dead/alive, treatment/placebo, dice rolls.
- Continuous: infinite continuum of values. Examples: blood pressure, weight, speed of a car, real numbers from 1 to 6.
Probability Function
- A probability function maps possible values of a variable against the probability of their occurrence. This value is between 0 and 1.
- The area under the probability function is equal to 1.0.
Continuous Case
- For continuous variables, the probability function is a continuous mathematical function that integrates to 1.
- Example: the negative exponential function (exponential distribution) integrates to 1.
Continuous Case (cont.)
- The probability function for continuous random variables is called the probability density function (PDF).
- Probabilities of continuous variables are associated with ranges of values, not single values.
All Probability Distributions
- All probability distributions are characterized by an expected value (mean) and a variance (standard deviation squared).
Mean or Expectation Value
- Discrete case mean (expected value): E(X) = Σ xᵢ p(xᵢ), summed over all values xᵢ.
- Continuous case mean (expected value): E(X) = ∫ x p(x) dx, integrated over all x.
Variance
- σ² = Var(X) = E[(X − μ)²]
- Variance is the expected squared distance from the mean.
Variance (cont.)
- Discrete case: Var(X) = Σ (xᵢ − μ)² p(xᵢ), summed over all values xᵢ.
- Continuous case: Var(X) = ∫ (x − μ)² p(x) dx, integrated over all x.
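As a small worked example of the discrete formulas for the mean and variance, the sketch below uses a fair six-sided die (one of the discrete examples mentioned earlier). The function names `expected_value` and `variance` are illustrative choices, not taken from the presentation.

```python
from fractions import Fraction


def expected_value(pmf):
    """E(X): sum of x * p(x) over all outcomes x."""
    return sum(x * p for x, p in pmf.items())


def variance(pmf):
    """Var(X): sum of (x - mu)^2 * p(x) over all outcomes x."""
    mu = expected_value(pmf)
    return sum((x - mu) ** 2 * p for x, p in pmf.items())


# Fair die: each face 1..6 has probability 1/6 (assumed for illustration).
die = {face: Fraction(1, 6) for face in range(1, 7)}

print(sum(die.values()))    # 1, the probabilities sum to one
print(expected_value(die))  # 7/2
print(variance(die))        # 35/12
```

Exact fractions are used so the results can be compared directly with a hand calculation.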
Normal Distribution
- A bell-shaped curve.
- Defined by the mean (μ) and standard deviation (σ).
- Changing μ shifts the distribution left or right. Changing σ increases or decreases the distribution spread.
The Normal Distribution: Mathematical Function
- f(x) = 1/(σ√(2π)) · e^(−(x−μ)²/(2σ²)).
The Normal PDF
- The area under the PDF curve always integrates to 1.
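The normal density and its unit area can be checked numerically. The following is a minimal sketch: `normal_pdf` implements the density formula, and `integrate` is a simple midpoint rule written for illustration (both names, and the values μ = 2 and σ = 1.5, are assumptions for this example).

```python
import math


def normal_pdf(x, mu=0.0, sigma=1.0):
    """Normal density with mean mu and standard deviation sigma."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))


def integrate(f, a, b, steps=10_000):
    """Midpoint-rule numerical integration of f over [a, b]."""
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) for i in range(steps)) * width


mu, sigma = 2.0, 1.5
# Integrate over mu +/- 8 sigma; the tails beyond that are negligible.
area = integrate(lambda x: normal_pdf(x, mu, sigma), mu - 8 * sigma, mu + 8 * sigma)
print(round(area, 6))
```

Changing `mu` only moves the interval, and changing `sigma` only rescales it, so the computed area stays at approximately 1 in both cases.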
Normal Distribution Definition
- Mean = E(X) = μ
- Standard deviation (Std Dev) = √Var(X) = σ
Central Limit Theorem
- The means of many random samples are approximately normally distributed around the true mean of the population, and the approximation improves as the sample size increases.
- The standard deviation of the sampling distribution (the standard error) decreases as the sample size increases.
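A quick simulation can illustrate both statements. This sketch draws repeated samples of die rolls (a clearly non-normal population) and reports the mean and spread of the sample means for several sample sizes. The helper `sample_means`, the seed, and the chosen sample sizes are assumptions made for this example.

```python
import random
import statistics


def sample_means(sample_size, n_samples=2000, seed=0):
    """Return the means of n_samples random samples of die rolls."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.randint(1, 6) for _ in range(sample_size))
        for _ in range(n_samples)
    ]


for n in (2, 10, 50):
    means = sample_means(n)
    # The centre stays near the population mean 3.5; the spread shrinks with n.
    print(n, round(statistics.fmean(means), 3), round(statistics.stdev(means), 3))
```

The spread of the sample means should shrink roughly in proportion to 1/√n, which is the behaviour the standard error describes.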
68-95-99.7 Rule
- 68% of the data falls within one standard deviation of the mean.
- 95% within two standard deviations.
- 99.7% within three standard deviations.
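These percentages can be reproduced from the normal distribution itself. For a normal variable, the probability of falling within k standard deviations of the mean equals erf(k/√2); the sketch below evaluates it with the standard library (the function name `prob_within` is illustrative).

```python
import math


def prob_within(k):
    """P(|X - mu| < k * sigma) for a normally distributed X."""
    return math.erf(k / math.sqrt(2.0))


for k in (1, 2, 3):
    print(k, round(prob_within(k), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```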
Confidence Interval
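- μ = x̄ ± t · (s/√n), where x̄ is the sample mean, s the sample standard deviation, and n the sample size.
- Uses the Student's t value, which depends on the sample size and the confidence level.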
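A minimal sketch of the confidence-interval calculation follows. The measurements are made up for illustration, and the t value is passed in by hand: 2.262 is the two-sided 95% value for 9 degrees of freedom (n = 10), as found in a standard t table.

```python
import math
import statistics


def confidence_interval(sample, t_value):
    """Interval for the population mean: sample mean +/- t * s / sqrt(n)."""
    n = len(sample)
    mean = statistics.fmean(sample)
    s = statistics.stdev(sample)  # sample standard deviation (n - 1 in the denominator)
    margin = t_value * s / math.sqrt(n)
    return mean - margin, mean + margin


# Hypothetical measurements, n = 10.
sample = [9.8, 10.2, 10.4, 9.9, 10.0, 10.1, 9.7, 10.3, 10.0, 9.6]
low, high = confidence_interval(sample, t_value=2.262)
print(round(low, 3), round(high, 3))
```

A larger sample or a lower confidence level gives a narrower interval, because both reduce t · s/√n.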
Testing Hypothesis
- Comparison between population distribution vs. sampling distribution.
- Test on the sample mean to either reject or fail to reject the null hypothesis.
Deterministic vs. Statistical Testing
- Deterministic: observe the event and decide (reject/don't reject null).
- Statistical: observe the event and decide, with a chance of error (reject/don't reject with chance p%).
Types of Errors in Hypothesis Testing
- Type I error (α): Rejecting the null hypothesis when it's true (false positive).
- Type II error (β): Failing to reject the null hypothesis when it's false (false negative).
p-value
- Probability of observing a result at least as extreme as the one actually observed by pure chance, i.e., assuming the null hypothesis is true.
- Informal significance levels help us interpret the results.
Anomaly Detection
- Techniques used to isolate and identify data points or values that are considered unusual or don't align with the rest of the data.
Examples of ML problems (cont.)
- Identify potential scams in online retail outlets.
A/B Testing
- A method for testing two different design options by comparing their success rate.
- Used to quantitatively assess whether there is a statistically significant difference between the two.
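One way to make the comparison concrete is Fisher's exact test on a 2x2 table of successes and failures per variant. The sketch below is a plain standard-library implementation of the two-sided test (summing the probabilities of all tables that are no more likely than the observed one). The click counts are invented for illustration, and the function name is an assumption.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the table [[a, b], [c, d]].

    Rows are the two variants; columns are success and failure counts.
    """
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def table_prob(x):
        # Hypergeometric probability of x successes in the first row.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    observed = table_prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    probs = (table_prob(x) for x in range(lowest, highest + 1))
    # Small relative tolerance guards against floating-point ties.
    return sum(p for p in probs if p <= observed * (1 + 1e-9))


# Hypothetical data: variant A got 30 clicks out of 100, variant B 45 out of 100.
p_value = fisher_exact_two_sided(30, 70, 45, 55)
print(round(p_value, 4))
```

A small p-value suggests the difference in success rates is unlikely to be due to chance alone; with identical rates in both variants the function returns 1.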
Underlining Links
- Assess the effect of underlining links on click-through rate.
Correlation
- Describes a linear relationship between two variables.
- Positive correlation (increasing X → increasing Y).
- Negative correlation (increasing X → decreasing Y).
- No correlation.
Correlation (cont.)
- cov(X,Y) = Σ((xᵢ - X̄)(yᵢ - Ȳ))/(n - 1).
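The covariance formula translates directly into code. This sketch also computes the correlation coefficient by dividing the covariance by the two sample standard deviations, which rescales it to the range from -1 to 1. The data are made up for illustration.

```python
import math


def covariance(xs, ys):
    """Sample covariance: sum((x - x_bar) * (y - y_bar)) / (n - 1)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)


def correlation(xs, ys):
    """Correlation coefficient: covariance scaled by both standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))


xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]         # increases with x
print(covariance(xs, ys))     # 5.0
print(correlation(xs, ys))    # positive, perfectly linear
print(correlation(xs, ys[::-1]))  # negative, perfectly linear
```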
Linear Correlation
- Linear relationships are visualized and evaluated on scatterplots.
- Assess the strength of the relationship between variables.
- Visual assessment of the relationship: Strong, weak, no relationship.
Linear Regression Model
- Assumes a linear relationship between variables.
- Defines the relationship using an equation (Yᵢ = β₀ + β₁Xᵢ + εᵢ).
- The dependent variable is Y, and the independent variable is X.
- Random error (ε) accounts for the fact the linear relation is an approximation.
Estimating Parameters: Least Squares Method
- The best fit is when the differences between prediction values and the actual values are minimal.
- Least squares minimizes the sum of the squared differences.
- The method is used for parameter estimation in linear regression, and can also be applied to other models.
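For a single independent variable, the least squares estimates have a closed form: the slope is the sum of cross-deviations divided by the sum of squared x-deviations, and the intercept follows from the means. The sketch below applies it to made-up data; `fit_line` and `sum_squared_errors` are illustrative names.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) minimizing the sum of squared residuals."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def sum_squared_errors(xs, ys, intercept, slope):
    """Sum of squared differences between actual and predicted values."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical noisy data roughly following y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
b0, b1 = fit_line(xs, ys)
print(round(b0, 3), round(b1, 3))
print(round(sum_squared_errors(xs, ys, b0, b1), 4))
```

Any other intercept or slope gives a larger sum of squared errors on the same data, which is what "best fit" means here.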
Least Squares Graphically
- Visual representation of the error minimisation using a line graph and the points.
Residual Analysis for Linearity, Homoscedasticity, and Independence
- Residual analysis assesses the validity of the assumptions of the linear model.
- Linearity: The errors should be randomly distributed around the line.
- Homoscedasticity: The variance of the errors should be constant across the values of the independent variable.
- Independence: The residual values at one point should not be correlated with the residual values at a different point.
Estimating Parameters: Classification
- This section covers estimation techniques specific to classification problems.
Comparing LP and Logit Models
- Comparison of linear probability (LP) models and logistic (logit) models, focusing on the differences in the shape of the fitted curves.
Confusion Matrix/Crosstabs
- Calculates the performance of a classification model.
- True positives (TP), True negatives (TN), False Positives (FP), False Negatives (FN).
Confusion Matrix
- A table that records the counts of the classifications.
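The four counts can be tallied directly from the true and predicted labels. The sketch below uses made-up binary labels, with 1 as the positive class; the function name `confusion_counts` is an illustrative choice.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP, and FN for a binary classifier."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            counts["TP" if actual == positive else "FP"] += 1
        else:
            counts["FN" if actual == positive else "TN"] += 1
    return counts


# Hypothetical labels for eight observations.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]
counts = confusion_counts(y_true, y_pred)
accuracy = (counts["TP"] + counts["TN"]) / len(y_true)
print(counts)    # {'TP': 3, 'TN': 3, 'FP': 1, 'FN': 1}
print(accuracy)  # 0.75
```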
Underfitting and Overfitting
- Underfitting: The model is too simple to capture the true relationship (e.g., a flat line through data with a curve).
- Overfitting: The model is too complex, fitting the training set too closely and losing generalisability (e.g., following the noise in the data, which does not reflect the underlying pattern).
Overfitting
- The model fits the training data very well.
- The model does not generalise well for the test data.
- There is a gap between training and test error.
Overfitting (cont.)
- Overfitting is a problem in machine learning.
- It occurs when a model is too complex.
- Good fit on training data but poor on unseen (test) data.
Overfitting of ANNs
- Parameters (e.g., number of neurons, initial weights); more neurons increase the flexibility of the model.
- Activation functions (sigmoid, etc.).
- Learning rate, momentum.
Training and test data set
- Training data is used to train or learn a model.
- Test data is used to evaluate the performance.
Goodness-of-fit of ANN
- Measure of how well the ANN models the data.
- Similar to R-squared for linear regression, but applied to ANNs.
- Close to 1.0 = better fit.
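A minimal sketch of the R² calculation, assuming the usual definition R² = 1 − SS_res/SS_tot. The same function applies to predictions from any model, including an ANN; the values below are invented for illustration.

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    y_bar = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - y_bar) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [3.0, 5.0, 7.0, 9.0]
print(r_squared(y_true, [3.0, 5.0, 7.0, 9.0]))  # 1.0, perfect predictions
print(r_squared(y_true, [6.0, 6.0, 6.0, 6.0]))  # 0.0, no better than predicting the mean
print(r_squared(y_true, [3.2, 4.7, 7.1, 8.8]))  # close to 1
```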
MSE related with training over time
- Plotting the training and test MSE against the number of epochs helps choose the optimal model.
Advantages of ANNs
- Efficient for massively parallel processing.
- Robust, tolerant to missing or noisy data.
- User-friendly programming.
Disadvantages of ANNs
- Difficult to design models for arbitrary applications.
- Difficult to assess internal operation of the ANN.
- Not easy to know which variable is influential (black box).
Part 5: Python Implementation
- This section outlines components for designing and training dense artificial neural networks using the Keras Python library.
- This includes: data (housing dataset); problem statement (regression or classification); preprocessing (standard scaling, one-hot encoding); architecture (number of layers/neurons, activation functions, dropout layers); training parameters (Adam optimizer, batch size, epochs); evaluation metrics and learning curves; and analysis of errors and residuals.
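The two preprocessing steps named above can be sketched in plain Python. This is an illustration of the idea rather than the presentation's own code: `standard_scale` and `one_hot` are hypothetical helpers, the values are invented, and the population standard deviation is used for scaling (an assumption; libraries differ on this detail).

```python
import statistics


def standard_scale(values):
    """Shift to mean 0 and scale to standard deviation 1."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation (assumed)
    return [(v - mean) / std for v in values]


def one_hot(labels):
    """Encode each categorical label as a 0/1 vector, one position per category."""
    categories = sorted(set(labels))
    index = {category: i for i, category in enumerate(categories)}
    rows = [[1 if index[label] == j else 0 for j in range(len(categories))]
            for label in labels]
    return categories, rows


# Hypothetical numeric and categorical features.
print(standard_scale([50.0, 80.0, 120.0, 150.0]))
print(one_hot(["urban", "rural", "urban", "coastal"]))
```

Scaling keeps features with large numeric ranges from dominating training, and one-hot encoding avoids implying an order among categories.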
Dropout Layers
- Randomly remove nodes within the NN during a forward pass to train an ensemble of subnetworks.
- Effectively improves generalisation ability.
- Leads to improved uncertainty estimation of predictions.
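The mechanism can be sketched on a single layer's activations. The version below is "inverted" dropout, where surviving activations are scaled by 1/(1 − rate) so the expected value is unchanged; this scaling convention is an assumption for the sketch, and `dropout` is an illustrative function, not a library call.

```python
import random


def dropout(activations, rate, rng):
    """Zero each activation with probability `rate`; rescale the survivors."""
    keep = 1.0 - rate
    return [a / keep if rng.random() >= rate else 0.0 for a in activations]


rng = random.Random(0)
layer_output = [0.5, 1.2, 0.8, 0.3, 0.9, 1.1]
# Each call removes a different random subset, i.e. trains a different subnetwork.
print(dropout(layer_output, rate=0.5, rng=rng))
print(dropout(layer_output, rate=0.5, rng=rng))
```

At prediction time dropout is normally switched off, which corresponds to `rate=0` here.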
Classification Implementation in Keras
- Steps to use Keras for classification models include defining a sequential model, defining the layers (input, hidden, output), specifying activation functions, compiling the model, and training it.
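The steps listed above might look as follows in code. This is a sketch under assumptions, not the presentation's model: it uses the Keras API bundled with TensorFlow, and the layer sizes, dropout rate, and loss function (suitable for integer class labels) are illustrative choices.

```python
def build_classifier(n_features, n_classes, hidden_units=32, dropout_rate=0.2):
    """Define and compile a small dense classifier."""
    # Imported inside the function so the module loads without TensorFlow installed.
    from tensorflow import keras
    from tensorflow.keras import layers

    model = keras.Sequential([
        keras.Input(shape=(n_features,)),                # input layer
        layers.Dense(hidden_units, activation="relu"),   # hidden layer 1
        layers.Dropout(dropout_rate),                    # dropout for regularisation
        layers.Dense(hidden_units, activation="relu"),   # hidden layer 2
        layers.Dense(n_classes, activation="softmax"),   # one probability per class
    ])
    model.compile(
        optimizer="adam",
        loss="sparse_categorical_crossentropy",  # assumes integer class labels
        metrics=["accuracy"],
    )
    return model
```

Training would then be a call such as `model.fit(X_train, y_train, batch_size=32, epochs=20, validation_split=0.2)`, where `X_train` and `y_train` are placeholders for preprocessed features and labels.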
High-level Language Model Overview
- Large language models are described and their parameters vs. the year are displayed.
Intuition behind LLM training
- Autoregressive models predict future tokens given the past history. Autoencoding models predict masked tokens given the rest of the context.
LLM Capabilities
- LLMs cover various tasks such as text classification, entity recognition, summarization, paraphrase, translation, and data generation.
GPT (Generative Pre-trained Transformer)
- Generative Large Language Model (LLM). Zero-shot and few-shot learning on diverse tasks. Includes a Chat Functionality and Human Feedback Loop.
Part 6: Exam Information
- This section contains exam-related details (e.g., dates, topics).