Untitled Quiz
49 Questions

Questions and Answers

What is the significance of having two hidden layers in an artificial neural network (ANN)?

  • They can represent any decision boundary with high accuracy. (correct)
  • They increase the overall computational speed.
  • They improve the interpretability of the model.
  • They simplify the training process.

How is the optimal size of the hidden layer(s) in a multi-layer ANN typically determined?

  • By following pre-set standard sizes for specific tasks.
  • Based on theoretical analysis of network performance.
  • Through extensive simulations on training data.
  • By using a trial-and-error heuristic approach. (correct)

What happens during the training of a multi-layer ANN when an error is detected in the output?

  • New input patterns are generated.
  • The hidden layers are removed.
  • The entire network resets to its initial state.
  • Weights are adjusted to reduce the error. (correct)

Why might more layers be added to an ANN structure?

  • To handle more complex data and numerous predictors. (correct)

Which of the following is true about the learning process in a multi-layer ANN?

  • It involves presenting input patterns and adjusting weights based on errors. (correct)

What is the purpose of the backward pass in an artificial neural network (ANN)?

  • To calculate and propagate the error backwards. (correct)

Which factor can contribute to the overfitting of an ANN?

  • An increasing number of hidden neurons. (correct)

What does the back propagation algorithm primarily aim to achieve?

  • Minimize the total error of the ANN. (correct)

What is indicated by a higher $R^2$ value in relation to an ANN?

  • A better fit to the data. (correct)

What does momentum help achieve when training an ANN?

  • Avoiding getting stuck in local minima of the error surface. (correct)

Which of the following is NOT a parameter that can affect the performance of an ANN?

  • Output data type. (correct)

What does the local gradient of a neuron during back propagation represent?

  • The change in the neuron's output relative to changes in input. (correct)

What is one potential consequence of a high learning rate in an ANN?

  • Divergence from optimal weights. (correct)

What is the primary purpose of the least squares method in linear regression?

  • To minimize the sum of the squared differences between actual and predicted values. (correct)

Which of the following indicates that a linear regression model may not adequately fit the data?

  • A non-linear pattern in residuals. (correct)

When analyzing residuals to check for homoscedasticity, what does constant variance imply?

  • The model is correctly specified. (correct)

In a confusion matrix, what does a True Positive (TP) represent?

  • The model predicted YES, and the actual answer was YES. (correct)

What does a non-independent residual analysis indicate?

  • Residuals are correlated and may indicate model misspecification. (correct)

The linear function is represented mathematically as which of the following?

  • Y = β0 + β1X + ε (correct)

Which statement is true regarding the slope in a linear regression model?

  • It represents the rate of change of Y with respect to X. (correct)

How does increasing the number of data points affect the least squares regression line?

  • It could potentially stabilize the coefficients and reduce variability. (correct)

What does it mean if the residuals have a fan-shaped pattern when plotted?

  • The variance of the residuals is not constant (heteroscedasticity). (correct)

What is represented by the term 'error' in a linear regression model?

  • The difference between predicted and actual values. (correct)

Which logical operation results in 1 only when the inputs are different?

  • XOR (correct)

In a single-layer perceptron, which type of problems can it not solve?

  • Non-linearly separable problems. (correct)

What is true about multi-layer artificial neural networks (ANNs)?

  • They can learn both linearly and non-linearly separable problems. (correct)

Which of the following accurately describes the layers in a multi-layer ANN?

  • At least one layer must be hidden. (correct)

What does the AND operator output when both inputs are 0?

  • 0 (correct)

What classification task is suited for a multi-layer ANN but not for a single-layer perceptron?

  • Multi-class classification. (correct)

How do multi-layer ANNs propagate input signals?

  • Layer-by-layer in a forward direction. (correct)

Which result does the OR logical operation yield for inputs 0 and 0?

  • 0 (correct)

How does changing the mean (μ) of a normal distribution affect its graph?

  • It shifts the distribution left or right. (correct)

What does the standard deviation (σ) determine in a normal distribution?

  • The width or spread of the distribution. (correct)

In a normal distribution defined by its mean and standard deviation, what does E(X) represent?

  • The expected value of the random variable. (correct)

What mathematical function describes a normal distribution?

  • A probability density function (PDF). (correct)

How is the variance (Var(X)) of a normal distribution calculated?

  • By squaring the standard deviation. (correct)

According to the central limit theorem, how does the mean of a sample (x̄) vary around the population mean (μ)?

  • It varies around μ with a standard deviation of σ/√n. (correct)

In a normal distribution, if the mean (μ) is increased while the standard deviation (σ) remains unchanged, what happens to the distribution?

  • The distribution shifts to the right. (correct)

Which of the following statements is true about the total area under a normal distribution curve?

  • It always equals 1. (correct)

What happens to the sampling distribution of x̄ as sample size n increases?

  • It approaches a Gaussian distribution. (correct)

What does a p-value greater than 0.1 indicate?

  • No presumption against the null hypothesis. (correct)

What does a positive covariance between two variables indicate?

  • The variables are positively correlated. (correct)

Which of the following p-value ranges indicates a low presumption against the null hypothesis?

  • 0.05 < p ≤ 0.1 (correct)

In the context of A/B testing, what does Fisher's exact test evaluate?

  • Non-random associations between two categorical variables. (correct)

In terms of linear correlation, what does it mean when the covariance is equal to zero?

  • There is no linear relationship between the two variables. (correct)

Which statement describes the 68-95-99.7 Rule?

  • 68% of data falls within one standard deviation from the mean. (correct)

Which scenario best demonstrates anomaly detection in machine learning?

  • Finding potential scams in an online retail shop. (correct)

Which of the following represents a weak linear relationship in terms of correlation?

  • cov(X,Y) = 0.01 (correct)

What is the interpretation of a p-value that falls within the range of 0.01 to 0.05?

  • Strong presumption against the null hypothesis. (correct)

Flashcards

Normal Distribution

A bell-shaped probability distribution defined by its mean (µ) and standard deviation (σ).

Mean (µ)

The average value of the distribution.

Standard Deviation (σ)

Measures the spread or dispersion of the data around the mean.

Probability Density Function (PDF)

A mathematical function that describes the relative likelihood of a continuous random variable near a given value; probabilities are obtained by integrating it over a range of values.

Central Limit Theorem

The mean of a sufficiently large sample drawn from a distribution with finite variance is approximately normally distributed, whatever the shape of the original distribution.

Sample Mean (𝑥̅)

The average of a specific sample.

Standard Error

Standard deviation of the sample mean (σ/√n).

Variance

The square of the standard deviation.

Sampling Distribution of x̄

The distribution of sample means (x̄) calculated from multiple samples of a population. As the sample size increases, this distribution becomes more concentrated around the population mean (μ).

68-95-99.7 Rule

A guideline for normal distributions, stating the percentage of data within specific standard deviations of the mean. Within one standard deviation, 68% falls, two standard deviations 95%, three standard deviations 99.7%.

Confidence Interval

A range of values likely to contain a population parameter. Calculation depends on the sample size and desired confidence level.

Student's t value

A critical value used in hypothesis testing when sample size is small. Value depends on sample size and confidence level.

p-value

The probability of obtaining results at least as extreme as the observed results, assuming the null hypothesis is true.

Hypothesis Testing

A statistical method for determining if there is enough evidence to support or reject a claim about a population parameter.

Correlation

A measure of the linear relationship between two variables. Positive correlation indicates a direct relationship, negative correlation an inverse one; zero correlation indicates no linear relationship.

Linear Correlation

Shows the straight line relationship; indicates how strongly two variables change together in a straight-line pattern

A/B Testing

A method to compare two versions of something (e.g., a website design) to see which performs better. Often used to determine if a change improves a result (like click-through rate)

Anomaly Detection

Identifying unusual or unexpected patterns in data.

Linear Regression Model

A model that describes the relationship between a dependent variable and one or more independent variables using a linear equation.

Least Squares Method

A method for finding the best-fitting line (or hyperplane) to a set of data points by minimizing the sum of squared errors.

Residual Analysis

A diagnostic tool to evaluate the assumptions of a linear regression model, such as linearity, constant variance, and independence of errors.

Linearity

A crucial assumption of linear regression; the relationship between variables is truly linear.

Homoscedasticity

The assumption that the variability of errors is constant across all levels of the independent variable.

Independence of Errors

The assumption that the errors in the model are not correlated with each other.

Confusion Matrix

A table that summarizes the performance of a classification model by showing the counts of true positive, true negative, false positive, and false negative predictions.

True Positives (TP)

Correctly predicted positive cases.

True Negatives (TN)

Correctly predicted negative cases.

False Positives (FP)

Predictions of positive cases when the actual answer was negative.

Multi-layer ANNs

Artificial neural networks with multiple hidden layers that can approximate any smooth mapping and represent complex decision boundaries.

Hidden Layers

Intermediate layers in a neural network that process information from prior layers and determine their output.

Learning in Multi-layer ANNs

Adjusting the weights between layers to minimize errors compared to target outputs, improving the network's performance using training data.

Optimal Hidden Layer Size

The ideal number of neurons in a hidden layer isn't known beforehand and needs to be determined empirically.

Complexity in Data

Data from various sources or with numerous predictors or memory, resulting in the need for multiple layers.

Logical operators

Used to combine or modify logical statements (e.g., OR, AND, XOR).

OR operation

Returns true if at least one input is true.

AND operation

Returns true only if all inputs are true.

XOR operation

Returns true if inputs are different.

Single-layer perceptron

A neural network with one layer capable of learning linearly separable problems.

Multi-layer ANN

Neural network with multiple hidden layers allowing for more complex tasks.

Hidden layers

Layers between input and output in a multi-layer ANN. Perform computations.

Backpropagation

Training algorithm for multi-layer ANNs adjusting weights to improve accuracy.

Backpropagation

An algorithm to adjust neural network weights by propagating errors backward through the network.

Gradient Descent

Optimizing a function (e.g., a neural net's error) by following the steepest descent of its gradient (slope).

Local Minimum (ANN)

A point on the error surface where a neural network's error is lower than at all nearby points but is not the global minimum.

Overfitting (ANN)

A neural network trained too well on the specific training data, making it perform poorly on other data (generalization issue).

Training Data

Data used to train a neural network; the error measured on it is the in-sample error.

Test Data

Data used only to evaluate how well a neural network performs on new, unseen data (out-of-sample error).

Goodness-of-Fit (ANN)

A measure (e.g., R²) to determine how well a neural network fits the data.

R² (ANN)

A measure of how well a neural network fits the data. A higher R² means a better fit; R² = 1 means a perfect fit.

Study Notes

Machine Learning for Business Analytics

  • The presentation is about machine learning for business analytics, focusing on different aspects of data analysis.
  • The instructors are Dr. Marc Hilbert and Dr. Andrii Kleshchonok.
  • The presentation language is English.
  • The date of the presentation is October 20, 2024.

Part 1: Introduction to Business Analytics

  • This section introduces the core concepts of business analytics.
  • Define the task (e.g., prediction, clustering, classification, anomaly detection).
  • Define objectives, error metrics, and performance standards.
  • Data collection: set up data streams, storage, input, parallelisation, and Hadoop.
  • Preprocessing: noise and outlier filtering, completing missing data using histograms and interpolation, normalization to scale data.

Dimensionality Reduction/Feature Selection

  • Choose features to use and extract data from.
  • Explore methods such as PCA, LDA, LLE, GDA.
  • Consider goals, questions related to tractability.
  • Design experiments, including train/validate/test data sets and cross-validation.
  • Perform deployment.

Classification vs. Clustering

  • Classification: uses labeled data, requires training phases, and is domain sensitive. Easy to measure performance. Includes methods like Naive Bayes, KNN, SVM, Decision Trees, Random Forests.
  • Clustering: uses unlabeled data, organizes patterns based on similarity, difficult to evaluate, and includes methods like K-means, Fuzzy C-means, Hierarchical Clustering, DBScan.

Examples of ML Problems

  • Predict how much customers spend in online retail.
  • Explore different types of online retail customers.
  • Find categories for items in an online store.
  • Suggest items users might want to buy online.

Part 2: Elements of Statistics

  • Discusses fundamental statistical concepts.
  • Random variable descriptions, discrete and continuous.
  • Probability function mapping.
  • Probability function area always equals 1.

Description of Random Variables

  • A random variable takes on a range of values with specific probabilities.
  • The probability is how often we expect different outcomes in repeated experiments.

Discrete vs. Continuous Random Variables

  • Discrete: countable number of outcomes. Examples: dead/alive, treatment/placebo, dice rolls.
  • Continuous: infinite continuum of values. Examples: blood pressure, weight, speed of a car, real numbers from 1 to 6.

Probability Function

  • A probability function maps possible values of a variable against the probability of their occurrence. This value is between 0 and 1.
  • The area under the probability function is equal to 1.0.

Continuous Case

  • For continuous variables, the probability function is a continuous mathematical function that integrates to 1.
  • Example: the negative exponential function (exponential distribution) integrates to 1.

Continuous Case (cont.)

  • The probability function for continuous random variables is called the probability density function (PDF).
  • Probabilities of continuous variables are associated with ranges, not single values.

All Probability Distributions

  • All probability distributions are characterized by an expected value (mean) and a variance (standard deviation squared).

Mean or Expectation Value

  • Discrete case mean (expected value): E(X) = Σ xᵢ p(xᵢ), summed over all possible values xᵢ.
  • Continuous case mean (expected value): E(X) = ∫ x p(x) dx, integrated over all x.

Variance

  • σ² = Var(X) = E[(X − μ)²]
  • Variance is the expected squared distance from the mean.

Variance (cont.)

  • Discrete case: Var(X) = Σ (xᵢ − μ)² p(xᵢ), summed over all possible values xᵢ.
  • Continuous case: Var(X) = ∫ (x − μ)² p(x) dx, integrated over all x.
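As a small worked example of the discrete formulas above, the following sketch computes the mean and variance of a fair six-sided die. The helper names `expected_value` and `variance` are chosen here for illustration, and exact fractions are used so that no rounding hides the result.

```python
from fractions import Fraction

def expected_value(values, probs):
    """E(X) = sum of x_i * p(x_i) over all outcomes."""
    return sum(x * p for x, p in zip(values, probs))

def variance(values, probs):
    """Var(X) = sum of (x_i - mu)^2 * p(x_i) over all outcomes."""
    mu = expected_value(values, probs)
    return sum((x - mu) ** 2 * p for x, p in zip(values, probs))

# A fair die: six outcomes, each with probability 1/6.
faces = [1, 2, 3, 4, 5, 6]
probs = [Fraction(1, 6)] * 6

die_mean = expected_value(faces, probs)
die_var = variance(faces, probs)
print(die_mean, die_var)  # 7/2 35/12
```

The probabilities sum to 1, as any probability function must, and the standard deviation is the square root of 35/12, roughly 1.71.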

Normal Distribution

  • A bell-shaped curve.
  • Defined by the mean (μ) and standard deviation (σ).
  • Changing μ shifts the distribution left or right. Changing σ increases or decreases the distribution spread.

The Normal Distribution: Mathematical Function

  • f(x) = 1/(σ√(2π)) · e^(−(x−μ)²/(2σ²)).

The Normal PDF

  • The area under the PDF curve always integrates to 1.
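The claim that the normal PDF integrates to 1 can be checked numerically. The sketch below is only an illustration: `normal_pdf` implements the density formula for the normal distribution, `integrate` is a plain trapezoidal rule written for this example, and the values of μ and σ are arbitrary.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of the normal distribution with mean mu and std dev sigma."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

def integrate(f, a, b, steps=20_000):
    """Approximate the integral of f over [a, b] with the trapezoidal rule."""
    h = (b - a) / steps
    total = 0.5 * (f(a) + f(b))
    for i in range(1, steps):
        total += f(a + i * h)
    return total * h

# Arbitrary parameters; the tails beyond 10 sigma are negligible.
mu, sigma = 2.0, 1.5
area = integrate(lambda x: normal_pdf(x, mu, sigma), mu - 10 * sigma, mu + 10 * sigma)
print(round(area, 6))
```

Changing `mu` only moves the curve and changing `sigma` only stretches it, so the area stays at 1 for any choice of parameters.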

Normal Distribution Definition

  • Mean = E(X) = μ
  • Standard deviation (Std Dev) = √Var(X) = σ

Central Limit Theorem

  • As the sample size increases, the distribution of the sample mean approaches a normal distribution centered on the true mean of the population.
  • The standard deviation of this sampling distribution, σ/√n, decreases as the sample size n increases.
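A quick simulation can make the shrinking spread visible. This is a minimal sketch under stated assumptions: samples are drawn from a Uniform(0, 1) population (chosen only because its standard deviation, √(1/12), is known exactly), the seed is fixed for repeatability, and `sample_means` is a helper written for this example.

```python
import random
import statistics

random.seed(0)  # fixed seed so repeated runs give the same numbers

def sample_means(n, trials=2000):
    """Means of `trials` independent samples of size n from Uniform(0, 1)."""
    return [
        statistics.fmean([random.random() for _ in range(n)])
        for _ in range(trials)
    ]

sigma = (1 / 12) ** 0.5  # population standard deviation of Uniform(0, 1)

for n in (4, 16, 64):
    means = sample_means(n)
    observed = statistics.stdev(means)  # spread of the sample means
    predicted = sigma / n ** 0.5        # standard error, sigma / sqrt(n)
    print(n, round(observed, 4), round(predicted, 4))
```

Each time n is multiplied by four, the observed spread of the sample means should roughly halve, in line with σ/√n.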

68-95-99.7 Rule

  • 68% of the data falls within one standard deviation of the mean.
  • 95% within two standard deviations.
  • 99.7% within three standard deviations.
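These percentages follow from the normal distribution itself: the probability of landing within k standard deviations of the mean equals erf(k/√2). The short sketch below evaluates that expression with the standard library; `prob_within` is a name chosen for this example.

```python
import math

def prob_within(k):
    """P(|X - mu| <= k * sigma) for a normally distributed X."""
    return math.erf(k / math.sqrt(2.0))

for k in (1, 2, 3):
    print(k, round(prob_within(k) * 100, 1))
```

The rule's figures are rounded values: the computed percentages are about 68.3, 95.4, and 99.7.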

Confidence Interval

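  • μ = x̄ ± t · (s/√n), where x̄ is the sample mean, s the sample standard deviation, and n the sample size.
  • Uses the Student's t value, which depends on the sample size and the confidence level.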
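The interval formula can be applied by hand to a small sample. In this sketch the ten measurements are made up for illustration, and the t value 2.262 is the two-sided 95% critical value for 9 degrees of freedom as read from a standard t table; `confidence_interval` is a helper named for this example.

```python
import math
import statistics

def confidence_interval(sample, t_value):
    """Return (low, high) = sample mean -/+ t * s / sqrt(n)."""
    n = len(sample)
    mean = statistics.fmean(sample)
    s = statistics.stdev(sample)  # sample standard deviation (n - 1 denominator)
    margin = t_value * s / math.sqrt(n)
    return mean - margin, mean + margin

# Made-up measurements (n = 10, so 9 degrees of freedom).
sample = [4.9, 5.1, 5.0, 5.3, 4.7, 5.2, 4.8, 5.0, 5.1, 4.9]
T_95_DF9 = 2.262  # from a t table: 95% confidence, 9 degrees of freedom

low, high = confidence_interval(sample, T_95_DF9)
print(round(low, 3), round(high, 3))
```

A larger sample shrinks the interval twice over: √n grows and the t value for the same confidence level falls.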

Testing Hypothesis

  • Comparison between population distribution vs. sampling distribution.
  • Test on the sample mean to either reject or fail to reject the null hypothesis.

Deterministic vs. Statistical Testing

  • Deterministic: observe the event and decide (reject/don't reject null).
  • Statistical: observe the event and decide, with a chance of error (reject/don't reject with chance p%).

Types of Errors in Hypothesis Testing

  • Type I error (α): Rejecting the null hypothesis when it's true (false positive).
  • Type II error (β): Failing to reject the null hypothesis when it's false (false negative).

p-value

  • The probability of obtaining a result at least as extreme as the one observed, assuming the null hypothesis is true.
  • Informal significance levels help us interpret the results.

Anomaly Detection

  • Techniques used to isolate and identify data points or values that are considered unusual or don't align with the rest of the data.

Examples of ML problems (cont.)

  • Identify potential scams in online retail outlets.

A/B Testing

  • A method for testing two different design options by comparing their success rate.
  • Used to quantitatively assess whether there is a statistically significant difference between the two.
  • Assess the effect of underlining links on click-through rate.
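One way to test such a difference is Fisher's exact test, which the quiz section associates with A/B testing. The sketch below is a from-scratch illustration using only the standard library, not a reference implementation: it sums the hypergeometric probabilities of all 2x2 tables that are no more likely than the observed one. The click counts are invented for the example.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(k):
        # Hypergeometric probability of k in the top-left cell, margins fixed.
        return comb(row1, k) * comb(row2, col1 - k) / comb(n, col1)

    p_obs = prob(a)
    k_min, k_max = max(0, col1 - row2), min(row1, col1)
    # Sum over every table at most as likely as the observed one
    # (small relative tolerance guards against floating-point ties).
    return sum(
        prob(k) for k in range(k_min, k_max + 1) if prob(k) <= p_obs * (1 + 1e-9)
    )

# Invented counts: rows are variants, columns are (clicked, did not click).
p_ab = fisher_exact_two_sided(30, 170, 50, 150)
print(p_ab)
```

Here variant A has 30 clicks out of 200 visitors and variant B has 50 out of 200; a small p-value suggests the difference in click-through rate is unlikely to be due to chance alone.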

Correlation

  • Describes a linear relationship between two variables.
  • Positive correlation (increasing X → increasing Y).
  • Negative correlation (increasing X → decreasing Y).
  • No correlation.

Correlation (cont.)

  • cov(X,Y) = Σ((xᵢ - X̄)(yᵢ - Ȳ))/(n - 1).
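The covariance formula translates directly into code. The sketch below also computes the Pearson correlation coefficient, which divides the covariance by both standard deviations; that normalization is standard but is an addition to the formula given here, and the data are made up.

```python
import math

def covariance(xs, ys):
    """Sample covariance with the (n - 1) denominator."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Pearson correlation: covariance rescaled to lie between -1 and 1."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]               # perfectly linear in xs
ys_rescaled = [100 * y for y in ys]  # same data in different units

print(covariance(xs, ys), correlation(xs, ys))
print(covariance(xs, ys_rescaled), correlation(xs, ys_rescaled))
```

Rescaling the units multiplies the covariance by 100 but leaves the correlation unchanged, which is why the strength of a linear relationship is usually judged from the correlation rather than from the raw covariance.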

Linear Correlation

  • Linear relationships are visualized and evaluated on scatterplots.
  • Assess the strength of the relationship between variables.
  • Visual assessment of the relationship: Strong, weak, no relationship.

Linear Regression Model

  • Assumes a linear relationship between variables.
  • Defines the relationship using an equation (Y = β₀ + β₁X + ε).
  • The dependent variable is Y, and the independent variable is X.
  • Random error (ε) accounts for the fact the linear relation is an approximation.

Estimating Parameters: Least Squares Method

  • The best fit is when the differences between prediction values and the actual values are minimal.
  • Least squares minimizes the sum of the squared differences.
  • The method is used for parameter estimation in linear regression, and can also be applied to other models.
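For a single predictor, the least squares estimates have a closed form: the slope is the sum of cross-deviations divided by the sum of squared deviations of X, and the intercept makes the line pass through the point of means. The sketch below applies it to made-up data; `fit_line` is a name chosen for this example.

```python
def fit_line(xs, ys):
    """Return (b0, b1) minimizing the sum of squared residuals of y = b0 + b1 * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx               # slope
    b0 = mean_y - b1 * mean_x    # intercept
    return b0, b1

# Made-up data lying close to the line y = 1 + 2x.
xs = [0, 1, 2, 3, 4]
ys = [1.1, 2.9, 5.2, 6.8, 9.0]

b0, b1 = fit_line(xs, ys)
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
print(round(b0, 2), round(b1, 2))  # 1.06 1.97
```

The residuals of a least squares line with an intercept always sum to zero, so it is their pattern that matters when checking the model, not their total.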

Least Squares Graphically

  • Visual representation of the error minimisation using a line graph and the points.

Residual Analysis for Linearity, Homoscedasticity, and Independence

  • Residual analysis assesses the validity of the assumptions of the linear model.
  • Linearity: The residuals should be randomly scattered around zero, with no systematic pattern.
  • Homoscedasticity: The variance of the errors should be constant across the values of the independent variable.
  • Independence: The residual values at one point should not be correlated with the residual values at a different point.

Estimating Parameters: Classification

  • This section covers estimation techniques specific to classification problems.

Comparing LP and Logit Models

  • Comparison of linear predictive models vs. logistic predictive models focusing on the shape differences.

Confusion Matrix/Crosstabs

  • Calculates the performance of a classification model.
  • True positives (TP), True negatives (TN), False Positives (FP), False Negatives (FN).
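The four counts can be tallied by comparing actual and predicted labels pair by pair. In this sketch the labels are invented, 1 stands for YES and 0 for NO, and `confusion_counts` is a helper written for the example; accuracy is added as one common summary of the table.

```python
def confusion_counts(actual, predicted):
    """Count TP, TN, FP, FN for binary labels (1 = positive, 0 = negative)."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for a, p in zip(actual, predicted):
        if p == 1:
            counts["TP" if a == 1 else "FP"] += 1
        else:
            counts["TN" if a == 0 else "FN"] += 1
    return counts

# Invented labels for eight cases.
actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

counts = confusion_counts(actual, predicted)
accuracy = (counts["TP"] + counts["TN"]) / len(actual)
print(counts, accuracy)  # {'TP': 3, 'TN': 3, 'FP': 1, 'FN': 1} 0.75
```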

Confusion Matrix

  • A table that records the counts of the classifications.

Underfitting and Overfitting

  • Underfitting: The model is too simple to capture the true relationship (e.g., a flat line through data with a curve).
  • Overfitting: The model is too complex, fitting the training set too closely and losing generalisability (e.g., following the noise in data, which does not reflect the underlying pattern)

Overfitting

  • The model fits the training data very well.
  • The model does not generalise well for the test data.
  • There is a gap between training and test error.

Overfitting (cont.)

  • Overfitting is a problem in machine learning.
  • It occurs when a model is too complex.
  • Good fit on training data but poor on unseen (test) data.

Overfitting of ANNs

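  • Overfitting depends on the network's parameters (e.g., number of neurons, initial weights).
  • It also depends on the choice of activation functions (sigmoid, etc.).
  • Learning rate and momentum matter as well; adding neurons increases the model's flexibility and with it the risk of overfitting.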

Training and test data set

  • Training data is used to train or learn a model.
  • Test data is used to evaluate the performance.

Goodness-of-fit of ANN

  • Measure of how well the ANN models the data.
  • Similar to R-squared for linear regression, but applied to ANNs.
  • Close to 1.0 = better fit.
  • Plotting the training and test MSE against epochs helps choose the optimal model.
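A goodness-of-fit value of this kind can be computed from any model's predictions, whether they come from a linear regression or an ANN. The sketch below uses the usual definition R² = 1 − SS_res/SS_tot; the numbers are made up and `r_squared` is a name chosen for this example.

```python
def r_squared(actual, predicted):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Made-up targets and model outputs.
actual = [3, 5, 7, 9]
predicted = [2.8, 5.3, 6.9, 9.2]

r2 = r_squared(actual, predicted)
print(round(r2, 3))  # 0.991
```

Computing the same value on test data as well as training data is what reveals overfitting: a high training R² paired with a much lower test R² indicates poor generalization.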

Advantages of ANNs

  • Efficient for massively parallel processing.
  • Robust, tolerant to missing or noisy data.
  • User-friendly programming.

Disadvantages of ANNs

  • Difficult to design models for arbitrary applications.
  • Difficult to assess internal operation of the ANN.
  • Not easy to know which variable is influential (black box).

Part 5: Python Implementation

  • This section outlines components for designing and training dense artificial neural networks using the Keras Python library.
  • The components are: data (housing dataset); problem statement (regression or classification); preprocessing (standard scaling, one-hot encoding); architecture (number of layers and neurons, activation functions, dropout layers); training parameters (Adam optimizer, batch size, epochs); evaluation metrics and learning curves; and analysis of errors and residuals.

Dropout Layers

  • Randomly remove nodes within the NN during a forward pass to train an ensemble of subnetworks.
  • Effectively improves generalisation ability.
  • Leads to improved uncertainty estimation of predictions.
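The mechanism can be sketched without any deep learning library. The function below is a minimal illustration, not the Keras implementation: it uses the "inverted dropout" variant, in which surviving activations are scaled by 1/(1 − rate) during training so their expected value is unchanged and nothing needs to be done at prediction time. The function name, the list-based activations, and the seeds are assumptions made for this example.

```python
import random

def dropout(activations, rate, rng, training=True):
    """Inverted dropout: zero each unit with probability `rate`, rescale the rest."""
    if not training or rate == 0.0:
        return list(activations)  # at prediction time every unit is kept
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)  # fixed seed so the mask is repeatable
hidden = [1.0] * 10

print(dropout(hidden, 0.5, rng))                  # about half the units zeroed, the rest doubled
print(dropout(hidden, 0.5, rng, training=False))  # unchanged
```

Because a different random mask is drawn on every forward pass, each training step effectively updates a different subnetwork, which is the ensemble effect described above.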

Classification Implementation in Keras

  • Steps to use Keras for classification models include defining a sequential model, defining the layers (input, hidden, output), specifying activation functions, compiling the model, and training it.

High-level Language Model Overview

  • Large language models are described and their parameters vs. the year are displayed.

Intuition behind LLM training

  • Autoregressive models predict future tokens given past history. Autoencoders predict tokens, given the rest of the context.

LLM Capabilities

  • LLMs cover various tasks such as text classification, entity recognition, summarization, paraphrase, translation, and data generation.

GPT (Generative Pre-trained Transformer)

  • Generative Large Language Model (LLM). Zero-shot and few-shot learning on diverse tasks. Includes a Chat Functionality and Human Feedback Loop.

Part 6: Exam Information

  • This section contains exam-related details (e.g., dates, topics).
