Questions and Answers
What is the significance of having two hidden layers in an artificial neural network (ANN)?
How is the optimal size of the hidden layer(s) in a multi-layer ANN typically determined?
What happens during the training of a multi-layer ANN when an error is detected in the output?
Why might more layers be added to an ANN structure?
Which of the following is true about the learning process in a multi-layer ANN?
What is the purpose of the backward pass in an artificial neural network (ANN)?
Which factor can contribute to the overfitting of an ANN?
What does the back propagation algorithm primarily aim to achieve?
What is indicated by a higher $R^2$ value in relation to an ANN?
What does momentum help achieve when training an ANN?
Which of the following is NOT a parameter that can affect the performance of an ANN?
What does the local gradient of a neuron during back propagation represent?
What is one potential consequence of a high learning rate in an ANN?
What is the primary purpose of the least squares method in linear regression?
Which of the following indicates that a linear regression model may not adequately fit the data?
When analyzing residuals to check for homoscedasticity, what does constant variance imply?
In a confusion matrix, what does a True Positive (TP) represent?
What does a non-independent residual analysis indicate?
The linear function is represented mathematically as which of the following?
Which statement is true regarding the slope in a linear regression model?
How does increasing the number of data points affect the least squares regression line?
What does it mean if the residuals have a fan-shaped pattern when plotted?
What is represented by the term 'error' in a linear regression model?
Which logical operation results in 1 only when the inputs are different?
In a single-layer perceptron, which type of problems can it not solve?
What is true about multi-layer artificial neural networks (ANNs)?
Which of the following accurately describes the layers in a multi-layer ANN?
What does the AND operator output when both inputs are 0?
What classification task is suited for a multi-layer ANN but not for a single-layer perceptron?
How do multi-layer ANNs propagate input signals?
Which result does the OR logical operation yield for inputs 0 and 0?
How does changing the mean (μ) of a normal distribution affect its graph?
What does the standard deviation (σ) determine in a normal distribution?
In a normal distribution defined by its mean and standard deviation, what does E(X) represent?
What mathematical function describes a normal distribution?
How is the variance (Var(X)) of a normal distribution calculated?
According to the central limit theorem, how does the mean of a sample (𝑥̅) vary around the population mean (μ)?
In a normal distribution, if the mean (μ) is increased while the standard deviation (σ) remains unchanged, what happens to the distribution?
Which of the following statements is true about the total area under a normal distribution curve?
What happens to the sampling distribution of 𝑥̅ as sample size n increases?
What does a p-value greater than 0.1 indicate?
What does a positive covariance between two variables indicate?
Which of the following p-value ranges indicates a low presumption against the null hypothesis?
In the context of A/B testing, what does Fisher's exact test evaluate?
In terms of linear correlation, what does it mean when the covariance is equal to zero?
Which statement describes the 68-95-99.7 Rule?
Which scenario best demonstrates anomaly detection in machine learning?
Which of the following represents a weak linear relationship in terms of correlation?
What is the interpretation of a p-value that falls within the range of 0.01 to 0.05?
Study Notes
Machine Learning for Business Analytics
- The presentation is about machine learning for business analytics, focusing on different aspects of data analysis.
- The instructors are Dr. Marc Hilbert and Dr. Andrii Kleshchonok.
- The presentation language is English.
- The date of the presentation is October 20, 2024.
Part 1: Introduction to Business Analytics
- This section introduces the core concepts of business analytics.
- Define the task (e.g., prediction, clustering, classification, anomaly detection).
- Define objectives, error metrics, and performance standards.
- Data collection: set up data streams, storage, input, parallelisation, and Hadoop.
- Preprocessing: noise and outlier filtering, completing missing data using histograms and interpolation, normalization to scale data.
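As a minimal sketch of two of the preprocessing steps listed above (filling missing values by interpolation, then normalizing), the following uses only the Python standard library. The function names and the sample data are hypothetical, and standard (z-score) scaling is assumed as the normalization method.

```python
from statistics import mean, pstdev

def fill_missing_linear(values):
    """Fill None gaps by linear interpolation between the nearest known neighbours.

    Gaps at the very start or end of the list are left as None.
    """
    filled = list(values)
    known = [i for i, v in enumerate(filled) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (filled[right] - filled[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = filled[left] + step * (i - left)
    return filled

def standardize(values):
    """Scale values to mean 0 and standard deviation 1 (z-scores)."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]

raw = [2.0, None, 6.0, 8.0, None, None, 14.0]  # hypothetical measurements
filled = fill_missing_linear(raw)
scaled = standardize(filled)
print(filled)
print(scaled)
```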
Dimensionality Reduction/Feature Selection
- Choose features to use and extract data from.
- Explore methods such as PCA, LDA, LLE, GDA.
- Consider the goals of the analysis and whether the problem is tractable.
- Design experiments, including train/validate/test data sets and cross-validation.
- Perform deployment.
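The cross-validation step in the list above can be sketched as a k-fold split over sample indices. This is an illustrative, standard-library-only version; the function name `k_fold_indices` and the choice of 10 samples and 5 folds are assumptions, not part of the presentation.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (training, validation) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once, reproducibly
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation

splits = list(k_fold_indices(10, k=5))
for training, validation in splits:
    print(sorted(training), sorted(validation))
```

Each sample appears in exactly one validation fold, so every data point is used for validation once and for training k - 1 times.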
Classification vs. Clustering
- Classification: uses labeled data, requires training phases, and is domain sensitive. Easy to measure performance. Includes methods like Naive Bayes, KNN, SVM, Decision Trees, Random Forests.
- Clustering: uses unlabeled data, organizes patterns based on similarity, difficult to evaluate, and includes methods like K-means, Fuzzy C-means, Hierarchical Clustering, DBScan.
Examples of ML Problems
- Predict how much customers spend in online retail.
- Explore different types of online retail customers.
- Find categories for items in an online store.
- Suggest items users might want to buy online.
Part 2: Elements of Statistics
- Discusses fundamental statistical concepts.
- Random variable descriptions, discrete and continuous.
- Probability function mapping.
- Probability function area always equals 1.
Description of Random Variables
- A random variable takes on a range of values with specific probabilities.
- The probability is how often we expect different outcomes in repeated experiments.
Discrete vs. Continuous Random Variables
- Discrete: countable number of outcomes. Examples: dead/alive, treatment/placebo, dice rolls.
- Continuous: infinite continuum of values. Examples: blood pressure, weight, speed of a car, real numbers from 1 to 6.
Probability Function
- A probability function maps possible values of a variable against the probability of their occurrence. This value is between 0 and 1.
- The area under the probability function is equal to 1.0.
Continuous Case
- For continuous variables, the probability function is a continuous mathematical function that integrates to 1.
- Example: the negative exponential function (exponential distribution) integrates to 1.
Continuous Case (cont.)
- The probability function for continuous random variables is called the probability density function (PDF).
- Probabilities of continuous variables are associated with ranges of values, not single values.
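A quick numerical check of the two points above, using the exponential density as the example: the density integrates to 1, and a probability is the area over a range. The midpoint-rule integrator and the rate of 1.0 are illustrative assumptions.

```python
import math

def exponential_pdf(x, rate=1.0):
    """Density of the exponential distribution: rate * e^(-rate * x) for x >= 0."""
    return rate * math.exp(-rate * x) if x >= 0 else 0.0

def integrate(f, a, b, steps=100_000):
    """Approximate the integral of f over [a, b] with the midpoint rule."""
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) for i in range(steps)) * width

total_area = integrate(exponential_pdf, 0.0, 50.0)    # close to 1
prob_0_to_1 = integrate(exponential_pdf, 0.0, 1.0)    # P(0 <= X <= 1) = 1 - e^(-1)
print(total_area, prob_0_to_1)
```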
All Probability Distributions
- All probability distributions are characterized by an expected value (mean) and a variance (standard deviation squared).
Mean or Expectation Value
- Discrete case mean (expected value): E(X) = Σ xᵢ p(xᵢ), summed over all xᵢ.
- Continuous case mean (expected value): E(X) = ∫ x p(x) dx, integrated over all x.
Variance
- σ² = Var(X) = E[(X − μ)²]
- Variance is the expected squared distance from the mean.
Variance (cont.)
- Discrete case: Var(X) = Σ (xᵢ − μ)² p(xᵢ), summed over all xᵢ.
- Continuous case: Var(X) = ∫ (x − μ)² p(x) dx, integrated over all x.
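As a worked example of the discrete formulas for the mean and variance, here is a fair six-sided die computed with exact fractions. The helper names are hypothetical; the die itself is one of the discrete examples mentioned earlier in the notes.

```python
from fractions import Fraction

def expected_value(pmf):
    """E(X): sum of x * p(x) over all outcomes."""
    return sum(x * p for x, p in pmf.items())

def variance(pmf):
    """Var(X): sum of (x - mu)^2 * p(x) over all outcomes."""
    mu = expected_value(pmf)
    return sum((x - mu) ** 2 * p for x, p in pmf.items())

die = {face: Fraction(1, 6) for face in range(1, 7)}
die_mean = expected_value(die)      # 7/2
die_variance = variance(die)        # 35/12
print(die_mean, die_variance)
```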
Normal Distribution
- A bell-shaped curve.
- Defined by the mean (μ) and standard deviation (σ).
- Changing μ shifts the distribution left or right. Changing σ increases or decreases the distribution spread.
The Normal Distribution: Mathematical Function
- f(x) = 1/(σ√(2π)) · e^(−(x − μ)²/(2σ²)).
The Normal PDF
- The area under the PDF curve always integrates to 1.
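The normal density can be written out directly and checked numerically. The sketch below compares a hand-written version against `statistics.NormalDist` from the standard library and approximates the area under the curve; the values μ = 5 and σ = 2 are arbitrary illustrative choices.

```python
import math
from statistics import NormalDist

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Normal density: 1 / (sigma * sqrt(2*pi)) * exp(-(x - mu)^2 / (2 * sigma^2))."""
    coefficient = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    exponent = -((x - mu) ** 2) / (2.0 * sigma ** 2)
    return coefficient * math.exp(exponent)

mu, sigma = 5.0, 2.0
steps = 20_000
lower, upper = mu - 10 * sigma, mu + 10 * sigma  # tails beyond this are negligible
width = (upper - lower) / steps
area = sum(normal_pdf(lower + (i + 0.5) * width, mu, sigma) for i in range(steps)) * width

print(area)                                             # close to 1
print(normal_pdf(mu, mu, sigma), NormalDist(mu, sigma).pdf(mu))
```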
Normal Distribution Definition
- Mean = E(X) = μ
- Standard deviation (Std Dev) = √Var(X) = σ
Central Limit Theorem
- The means of many random samples are approximately normally distributed around the true mean of the population, and the approximation improves as the sample size increases.
- Standard deviation of the sampling distribution decreases as the sample size increases.
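A small simulation can illustrate both statements. The sketch below draws repeated samples from a uniform(0, 1) population (an arbitrary, clearly non-normal choice) and compares the observed spread of the sample means with σ/√n. The seed, sample sizes, and repeat count are assumptions made for the illustration.

```python
import random
from statistics import stdev

def sample_means(sample_size, repeats, rng):
    """Draw `repeats` samples of `sample_size` uniform(0, 1) values; return their means."""
    return [
        sum(rng.random() for _ in range(sample_size)) / sample_size
        for _ in range(repeats)
    ]

rng = random.Random(42)                 # fixed seed so the run is repeatable
population_sigma = (1 / 12) ** 0.5      # standard deviation of uniform(0, 1)

spread = {}
centre = {}
for n in (4, 100):
    means = sample_means(n, repeats=2000, rng=rng)
    spread[n] = stdev(means)            # observed spread of the sample means
    centre[n] = sum(means) / len(means)
    print(n, round(spread[n], 4), round(population_sigma / n ** 0.5, 4))
```

The observed spread for n = 100 should be roughly one fifth of the spread for n = 4, matching the 1/√n scaling.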
68-95-99.7 Rule
- For normally distributed data, about 68% of the values fall within one standard deviation of the mean.
- 95% within two standard deviations.
- 99.7% within three standard deviations.
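The three percentages can be reproduced from the cumulative distribution function of the standard normal. This sketch uses `statistics.NormalDist`; the helper name `coverage` is hypothetical.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

def coverage(k):
    """Probability that a normal value lies within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(coverage(k), 4))   # about 0.6827, 0.9545, 0.9973
```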
Confidence Interval
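- μ = x̄ ± t · (s/√n), where x̄ is the sample mean, s the sample standard deviation, and n the sample size.
- Uses the Student's t value, which depends on the sample size and the confidence level.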
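A minimal sketch of the interval calculation follows. The Python standard library has no t-distribution, so the t value is passed in as a parameter; the value 2.776 is the usual table entry for 95% confidence with 4 degrees of freedom, and the sample itself is made up for the example.

```python
import math
from statistics import mean, stdev

def confidence_interval(sample, t_value):
    """Return (lower, upper) = sample mean -/+ t * s / sqrt(n)."""
    n = len(sample)
    centre = mean(sample)
    margin = t_value * stdev(sample) / math.sqrt(n)  # stdev uses n - 1
    return centre - margin, centre + margin

sample = [10, 12, 9, 11, 13]   # hypothetical measurements, n = 5
T_95_DF4 = 2.776               # assumed table value: 95% confidence, n - 1 = 4
lower, upper = confidence_interval(sample, T_95_DF4)
print(round(lower, 3), round(upper, 3))
```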
Testing Hypothesis
- Comparison between population distribution vs. sampling distribution.
- Test on the sample mean to either reject or fail to reject the null hypothesis.
Deterministic vs. Statistical Testing
- Deterministic: observe the event and decide (reject/don't reject null).
- Statistical: observe the event and decide, with a chance of error (reject/don't reject with chance p%).
Types of Errors in Hypothesis Testing
- Type I error (α): Rejecting the null hypothesis when it's true (false positive).
- Type II error (β): Failing to reject the null hypothesis when it's false (false negative).
p-value
- Probability of observing a result at least as extreme as the one observed by pure chance, i.e., assuming the null hypothesis is true.
- Informal significance levels help us interpret the results.
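As an illustration, a two-sided p-value for a test on the sample mean can be computed from the normal distribution. This sketch assumes a z-test with a known population standard deviation, which is a simplification; the numbers in the example are hypothetical.

```python
from statistics import NormalDist

def two_sided_p_value(sample_mean, null_mean, sigma, n):
    """Two-sided p-value for a z-test on the sample mean (sigma assumed known)."""
    standard_error = sigma / n ** 0.5
    z = (sample_mean - null_mean) / standard_error
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))

# Hypothetical example: null mean 100, sigma 15, sample of 100 with mean 103 (z = 2).
p_value = two_sided_p_value(sample_mean=103, null_mean=100, sigma=15, n=100)
print(round(p_value, 4))
```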
Anomaly Detection
- Techniques used to isolate and identify data points or values that are considered unusual or don't align with the rest of the data.
Examples of ML problems (cont.)
- Identify potential scams in online retail outlets.
A/B Testing
- A method for testing two different design options by comparing their success rate.
- Used to quantitatively assess whether there is a statistically significant difference between the two.
Underlining Links
- Assess the effect of underlining links on click-through rate.
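The quiz questions above mention Fisher's exact test in the A/B testing context. The following is a sketch of a two-sided version for a 2x2 table, built from the hypergeometric probabilities with `math.comb`. The function name and the click counts are hypothetical, and the two-sided rule used here (sum all tables no more probable than the observed one) is one common convention.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Rows are the two variants; columns are success / failure counts.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2

    def table_probability(k):
        # Hypergeometric probability of k successes in row 1, margins fixed.
        return comb(row1, k) * comb(row2, col1 - k) / comb(total, col1)

    observed = table_probability(a)
    k_min = max(0, col1 - row2)
    k_max = min(row1, col1)
    return sum(
        p
        for p in (table_probability(k) for k in range(k_min, k_max + 1))
        if p <= observed * (1 + 1e-9)  # small tolerance for float ties
    )

# Hypothetical click data: underlined links 30 clicks / 70 no clicks,
# plain links 15 clicks / 85 no clicks.
p_ab = fisher_exact_two_sided(30, 70, 15, 85)
print(p_ab)
```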
Correlation
- Describes a linear relationship between two variables.
- Positive correlation (increasing X → increasing Y).
- Negative correlation (increasing X → decreasing Y).
- No correlation.
Correlation (cont.)
- cov(X,Y) = Σ((xᵢ - X̄)(yᵢ - Ȳ))/(n - 1).
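The covariance formula above translates directly into code. The sketch below also derives the correlation coefficient by dividing the covariance by the two standard deviations; the data are made up so that the relationship is perfectly linear.

```python
from statistics import mean

def covariance(xs, ys):
    """Sample covariance: sum((x - x_bar) * (y - y_bar)) / (n - 1)."""
    x_bar, y_bar = mean(xs), mean(ys)
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (len(xs) - 1)

def correlation(xs, ys):
    """Correlation: covariance scaled by both standard deviations."""
    return covariance(xs, ys) / (covariance(xs, xs) * covariance(ys, ys)) ** 0.5

xs = [1, 2, 3, 4, 5]
ys_up = [2, 4, 6, 8, 10]     # increases with x
ys_down = [10, 8, 6, 4, 2]   # decreases with x

print(covariance(xs, ys_up), correlation(xs, ys_up))
print(covariance(xs, ys_down), correlation(xs, ys_down))
```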
Linear Correlation
- Linear relationships are visualized and evaluated on scatterplots.
- Assess the strength of the relationship between variables.
- Visual assessment of the relationship: Strong, weak, no relationship.
Linear Regression Model
- Assumes a linear relationship between variables.
- Defines the relationship using an equation (Yᵢ = β₀ + β₁Xᵢ + εᵢ).
- The dependent variable is Y, and the independent variable is X.
- The random error (ε) accounts for the fact that the linear relation is only an approximation.
Estimating Parameters: Least Squares Method
- The best fit is when the differences between prediction values and the actual values are minimal.
- Least squares minimizes the sum of the squared differences.
- The method is used for parameter estimation in linear regression, and can also be applied to other models.
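For a single independent variable, the least squares estimates have a closed form: the slope is the ratio of the summed cross-deviations to the summed squared x-deviations, and the intercept follows from the means. A minimal sketch with made-up data:

```python
def fit_line(xs, ys):
    """Least squares estimates (intercept, slope) for y = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope

xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]   # hypothetical observations
intercept, slope = fit_line(xs, ys)
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
sum_squared_error = sum(r ** 2 for r in residuals)
print(intercept, slope, sum_squared_error)
```

With an intercept in the model, the residuals of the fitted line sum to zero (up to rounding), which is a quick sanity check on the fit.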
Least Squares Graphically
- Visual representation of the error minimisation using a line graph and the points.
Residual Analysis for Linearity, Homoscedasticity, and Independence
- Residual analysis assesses the validity of the assumptions for the linear model.
- Linearity: The errors should be randomly distributed around the line.
- Homoscedasticity: The variance of the errors should be constant across the independent variable.
- Independence: The residual values at one point should not be correlated with the residual values at a different point.
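Two crude numerical checks can complement the residual plots. The sketch below is an assumption-laden illustration, not a formal test: it compares the residual variance in the lower and upper half of the x range (a ratio far from 1 suggests a fan shape) and computes the lag-1 autocorrelation of the residuals (values far from 0 suggest dependence). Both helpers assume the residuals are ordered by x.

```python
from statistics import pvariance

def spread_ratio(residuals):
    """Variance of the upper half of the residuals divided by that of the lower half."""
    half = len(residuals) // 2
    return pvariance(residuals[half:]) / pvariance(residuals[:half])

def lag1_autocorrelation(residuals):
    """Correlation between each residual and the next one."""
    centre = sum(residuals) / len(residuals)
    numerator = sum((a - centre) * (b - centre) for a, b in zip(residuals, residuals[1:]))
    denominator = sum((r - centre) ** 2 for r in residuals)
    return numerator / denominator

fan_shaped = [0.1, -0.1, 0.2, -0.2, 0.8, -0.9, 1.5, -1.6]   # spread grows with x
trending = [-3, -2, -1, 0, 1, 2, 3]                          # neighbours are alike

print(spread_ratio(fan_shaped))
print(lag1_autocorrelation(trending))
```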
Estimating Parameters: Classification
- This section covers estimation techniques specific to classification problems.
Comparing LP and Logit Models
- Comparison of linear predictive models vs. logistic predictive models focusing on the shape differences.
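The shape difference can be seen with two toy functions: a straight line is unbounded, while the logistic (sigmoid) curve stays strictly between 0 and 1. The coefficients below are hypothetical and chosen only to make the contrast visible.

```python
import math

def linear_model(x, b0=0.5, b1=0.25):
    """Straight line b0 + b1 * x; its output is not restricted to [0, 1]."""
    return b0 + b1 * x

def logit_model(x, b0=0.0, b1=1.0):
    """Logistic curve 1 / (1 + exp(-(b0 + b1 * x))); output lies in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

for x in (-4, 0, 4):
    print(x, linear_model(x), round(logit_model(x), 4))
```

For a binary outcome, the linear model can return values below 0 or above 1, which cannot be read as probabilities; the logistic curve avoids this.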
Confusion Matrix/Crosstabs
- Calculates the performance of a classification model.
- True positives (TP), True negatives (TN), False Positives (FP), False Negatives (FN).
Confusion Matrix
- A table that records the counts of the classifications.
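Counting the four cells is straightforward. The sketch below also derives accuracy, precision, and recall, which are common summary metrics but are an addition to the notes; the labels are made up.

```python
def confusion_counts(actual, predicted, positive=1):
    """Return (TP, FP, TN, FN) for binary labels."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1          # predicted positive, actually positive
        elif p == positive:
            fp += 1          # predicted positive, actually negative
        elif a == positive:
            fn += 1          # predicted negative, actually positive
        else:
            tn += 1          # predicted negative, actually negative
    return tp, fp, tn, fn

actual = [1, 1, 1, 0, 0, 0, 1, 0]       # hypothetical true labels
predicted = [1, 0, 1, 0, 1, 0, 1, 0]    # hypothetical model output
tp, fp, tn, fn = confusion_counts(actual, predicted)

accuracy = (tp + tn) / (tp + fp + tn + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(tp, fp, tn, fn, accuracy, precision, recall)
```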
Underfitting and Overfitting
- Underfitting: The model is too simple to capture the true relationship (e.g., a flat line through data with a curve).
- Overfitting: The model is too complex, fitting the training set too closely and losing generalisability (e.g., following noise in the data that does not reflect the underlying pattern).
Overfitting
- The model fits the training data very well.
- The model does not generalise well for the test data.
- There is a gap between training and test error.
Overfitting (cont.)
- Overfitting is a problem in machine learning.
- It occurs when a model is too complex.
- Good fit on training data but poor on unseen (test) data.
Overfitting of ANNs
- Factors that influence overfitting include the network parameters (e.g., number of neurons, initial weights); adding neurons increases the model's flexibility.
- Activation functions (sigmoid, etc.).
- Learning rate and momentum.
Training and test data set
- Training data is used to train or learn a model.
- Test data is used to evaluate the performance.
Goodness-of-fit of ANN
- Measure of how well the ANN models the data.
- Similar to R-squared for linear regression, but applied to ANNs.
- Close to 1.0 = better fit.
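The R-squared-style measure can be computed from any model's predictions, including an ANN's. A minimal sketch, with made-up targets and predictions:

```python
def r_squared(actual, predicted):
    """1 - (residual sum of squares) / (total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.5, 6.5, 9.5]        # hypothetical model output
print(r_squared(actual, predicted))     # about 0.95
print(r_squared(actual, actual))        # perfect fit gives 1.0
```

A model that always predicts the mean of the targets scores 0, so the value can be read as the share of variance explained relative to that baseline.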
MSE related with training over time
- Plotting the training and test MSE against the number of epochs helps choose the optimal model.
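One way to read such a plot is to pick the epoch where the test MSE is lowest, before the training and test curves drift apart. The sketch below assumes that selection rule; the two curves are invented numbers shaped like a typical overfitting run.

```python
def mse(actual, predicted):
    """Mean squared error between targets and predictions."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def best_epoch(test_mse_per_epoch):
    """Return the 1-based epoch with the lowest test MSE."""
    lowest = min(test_mse_per_epoch)
    return test_mse_per_epoch.index(lowest) + 1

# Hypothetical learning curves: training error keeps falling, test error turns upward.
train_curve = [0.90, 0.50, 0.30, 0.20, 0.12, 0.08, 0.05]
test_curve = [0.95, 0.60, 0.42, 0.35, 0.37, 0.44, 0.55]

chosen = best_epoch(test_curve)
final_gap = test_curve[-1] - train_curve[-1]   # train/test gap at the last epoch
print(chosen, final_gap)
```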
Advantages of ANNs
- Efficient for massively parallel processing.
- Robust, tolerant to missing or noisy data.
- User-friendly programming.
Disadvantages of ANNs
- Difficult to design models for arbitrary applications.
- Difficult to assess internal operation of the ANN.
- Not easy to know which variable is influential (black box).
Part 5: Python Implementation
- This section outlines components for designing and training dense artificial neural networks using the Keras Python library.
- This includes: data (housing dataset); problem statement (regression or classification); preprocessing (standard scaling, one-hot encoding); architecture (number of layers/neurons, activation functions, dropout layers); training parameters (Adam optimizer, batch size, epochs); evaluation metrics and learning curves; and analysis of errors and residuals.
Dropout Layers
- Randomly remove nodes within the NN during a forward pass to train an ensemble of subnetworks.
- Effectively improves generalisation ability.
- Leads to improved uncertainty estimation of predictions.
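The mechanism can be sketched in plain Python, without Keras. This version is an assumption-labelled illustration: it uses "inverted" dropout, where the surviving activations are scaled by 1/(1 - rate) during training so that their expected value is unchanged, and it passes activations through untouched at inference time.

```python
import random

def dropout(activations, rate, rng, training=True):
    """Zero each activation with probability `rate`; rescale the survivors."""
    if not training or rate == 0.0:
        return list(activations)
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)                       # fixed seed for a repeatable run
layer_output = [1.0] * 10_000                # hypothetical activations
dropped = dropout(layer_output, rate=0.5, rng=rng)

print(sum(dropped) / len(dropped))           # stays close to 1.0 on average
print(dropout([0.2, 0.7], rate=0.5, rng=rng, training=False))
```

Each forward pass draws a different mask, so each training step effectively updates a different subnetwork.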
Classification Implementation in Keras
- Steps to use Keras for classification models include defining a sequential model, defining the layers (input, hidden, output), specifying activation functions, compiling the model, and training it.
High-level Language Model Overview
- Large language models are described, and their parameter counts are shown against the year of release.
Intuition behind LLM training
- Autoregressive models predict future tokens given the past history. Autoencoding models predict missing tokens given the rest of the context.
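The two training objectives differ in how the (input, target) pairs are built from a sentence. The toy sketch below shows only that construction, not a model; the `[MASK]` placeholder and the function names are illustrative assumptions.

```python
def autoregressive_examples(tokens):
    """Pairs of (all tokens so far, next token)."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def masked_examples(tokens, mask="[MASK]"):
    """Pairs of (sentence with one token hidden, the hidden token)."""
    return [
        (tokens[:i] + [mask] + tokens[i + 1:], tokens[i])
        for i in range(len(tokens))
    ]

sentence = ["the", "cat", "sat"]
for context, target in autoregressive_examples(sentence):
    print("autoregressive:", context, "->", target)
for context, target in masked_examples(sentence):
    print("masked:", context, "->", target)
```

The autoregressive pairs only ever see tokens to the left of the target, while the masked pairs see context on both sides.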
LLM Capabilities
- LLMs cover various tasks such as text classification, entity recognition, summarization, paraphrasing, translation, and data generation.
GPT (Generative Pre-trained Transformer)
- Generative Large Language Model (LLM). Zero-shot and few-shot learning on diverse tasks. Includes a Chat Functionality and Human Feedback Loop.
Part 6: Exam Information
- This section contains exam-related details (e.g., dates, topics).