Data Analytics for Business Optimization Final Exam Notes PDF
IE University
Matilde Biondi
Summary
These notes cover data analytics for business optimization, focusing on trendlines and regression analysis, and introducing inferential statistics. They detail the goal of regression analysis and the different types of variables.
**DATA ANALYTICS FOR BUSINESS OPTIMIZATION**

**Final Exam**

**Matilde Biondi - 12392**

**-CHAPTER 6. TRENDLINE AND REGRESSION ANALYSIS-**

**What is the main goal of regression analysis?**

Many applications of business analytics involve modeling relationships between one or more independent variables and some dependent variable. For example, we can predict the level of sales based on the price we set, or extrapolate a trend into the future. Trendlines and regression analysis are tools for building such models and predicting future results.

**-Inferential Statistics-**

- Relates to decision making.
- We try to reach conclusions that extend beyond the immediate data alone → generalizing from the sample to the population: probability, testing hypotheses, etc.
- Which decisions?
  1. Determining whether any apparent characteristics of a situation are unusual.
  2. Estimating the value of unknown numerical quantities and determining the reliability of those estimates.
  3. Using past occurrences to attempt to predict the future.

**-Types of variable-**

![](media/image2.png) ![](media/image4.png)

**-How do inferential statistics work?-**

If there is a relationship between any two variables, it may be possible to predict the value of one of the variables from the value of the other.

**What do we mean if two variables have a strong statistical relationship with one another?**

- They move together.
- They appear to be related.
- There might be a correlation between them.

**[Remember]**: sometimes, however, statistical relationships exist even though a change in one variable is not caused by a change in the other.

**[Rule: correlation does not imply causation]**

![](media/image6.png)

Just because two trends seem to fluctuate in tandem, that doesn't prove that they are meaningfully related to one another!

**How is this possible?**

1. It may be the result of random chance, where the variables appear to be related but there is no true underlying relationship.
2. There may be a third, lurking variable that makes the relationship appear stronger (or weaker) than it actually is.

**What should we do?** With well-designed empirical research we can establish causation → randomization, controlled experiments, and predictive models with multiple variables.

**-What is a trend?-**

A trend is a pattern found in time series datasets; it describes whether the data is showing an upward or downward movement. It is a slow variation over a longer period of time, usually several years.

- **[Goal]**: forecasts, predictions, planning, testing theories, providing a likelihood of changes happening.

**Rule 1:** a trend cannot be inferred from two points! E.g.: if the crime rate drops from one year to the next, that's not evidence of a downward trend in the crime rate. It's unlikely the crime rate is going to be exactly the same from one year to the next.

**Rule 2:** you cannot pick convenient spots for your trend to begin and end. People sometimes observe that the crime rate has increased for three or four years in a row and decide that that's a trend → they picked only the years when the rate was increasing, so the main conclusion you can draw is that they're sandbagging...

**Rule 3:** you cannot simply draw a line between the first data point in your series and the last and call that your trend. Any measure includes error, so it is not a completely accurate measure of the variable whose trend you're interested in. We must fit a trend line with a statistical technique to get a legitimate estimate of the trend.

**Rule 4:** no change is a trend until a statistical test says it is. Statistical tests evaluate the likelihood of changes happening: they can tell you pretty quickly how likely the changes in your sample are if there is no trend in the population, and they also give you a pretty good idea of how strong the trend is.
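As a minimal illustration of Rules 3 and 4 (not from the course material; it assumes Python with SciPy and uses made-up crime-rate figures), a least-squares trendline can be fitted and its slope checked with a statistical test before we call anything a trend:

```python
# Minimal sketch (assumed example, not from the notes): fit a straight
# trendline to yearly crime-rate data and test whether the slope is
# statistically different from zero (Rule 4).
import numpy as np
from scipy import stats

years = np.arange(2010, 2020)                                            # hypothetical years
rate = np.array([52, 50, 51, 49, 47, 48, 45, 44, 43, 41], dtype=float)   # made-up rates

result = stats.linregress(years, rate)             # least-squares trendline
print(f"slope per year: {result.slope:.2f}")
print(f"p-value of the slope: {result.pvalue:.4f}")

# Only call it a downward trend if the test supports it (e.g. p < 0.05).
if result.pvalue < 0.05 and result.slope < 0:
    print("Statistically significant downward trend.")
else:
    print("No trend can be claimed from this sample.")
```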
**-Types of trends-**

- **Linear trend**: a linear pattern is a continuous decrease or increase in numbers over time. On a graph, the data appears as a straight line angled diagonally up or down (the angle may be steep or shallow) → the trend can be either upward or downward.
- **Exponential trend**: produces non-linear curved lines where the data rises or falls not at a steady rate but at an increasing rate. Instead of a straight line pointing diagonally up, the graph shows a curve where, if the trend is upward, the later points are much higher than the first ones (see the sketch after this list). ![](media/image8.png)
- **Seasonality**: we can identify a seasonality pattern when fluctuations repeat over fixed periods of time and are therefore predictable, and when those patterns do not extend beyond a one-year period. Seasonality may be caused by factors like weather, vacations, and holidays. It usually consists of periodic, repetitive, and generally regular and predictable patterns, and it can repeat on a weekly, monthly, or quarterly basis.
- **Irregular/Random patterns**: this type of analysis reveals fluctuations in a time series that are short in duration, erratic in nature, and follow no regularity in their occurrence. In prediction, the objective is to "model" all the components into some trend pattern so that the only component that remains unexplained is the random component. E.g.: long-term oil prices. ![](media/image10.png)
- **Stationary**: a stationary time series is one whose statistical properties, such as the mean and variance, are constant over time. A stationary series varies around a constant mean level, neither decreasing nor increasing systematically over time, with constant variance.
- **Cyclical patterns**: cyclical patterns occur when fluctuations do not repeat over fixed periods of time; they are therefore unpredictable and extend beyond a year. ![](media/image12.png)
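A small sketch of how one might tell a linear trend from an exponential one in practice (an illustration only, not part of the notes; it assumes Python with NumPy and uses an invented sales series): if the series is closer to a straight line on a log scale than on the raw scale, an exponential trend is the better candidate.

```python
# Minimal sketch (assumed example): check how close the series is to a
# straight line on the raw scale versus the log scale.
import numpy as np

t = np.arange(12)
sales = 100 * 1.15 ** t   # hypothetical series growing ~15% per period

r_raw = np.corrcoef(t, sales)[0, 1]           # linearity on the raw scale
r_log = np.corrcoef(t, np.log(sales))[0, 1]   # linearity on the log scale

if abs(r_log) > abs(r_raw):
    print("Closer to a straight line on the log scale -> exponential-looking trend")
else:
    print("Closer to a straight line on the raw scale -> linear-looking trend")
```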
**-What is a trendline?-**

**[Purpose]**: to model relationships between variables and understand how the dependent variable behaves as the independent variable changes.

**Why do we use it?**

- To highlight an underlying pattern of individual values → minimizing the distance between each point in a scatter plot.
- It represents the data's "best fit".

Our trendline is most reliable when its R-squared value is at or near 1:

- **0%** represents a **model that does not explain any of the variance** in the response variable around its mean.
- **100%** represents a **model that explains all the variation** in the response variable around its mean.

[The larger the R-squared, the better the regression model fits our observations. As a rule of thumb, aim for 80% or higher.]

**How to "read" the R-squared value?**

*E.g.: the R-squared for this model is 0.45, which means that 45% of the variation in average obesity is explained by variation in the % of people with a BA or higher education* (a worked sketch follows below).

**-How do we know if the trend line is a good fit?-**

![](media/image14.png)
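A minimal sketch of how the R-squared of a fitted trendline can be computed and read, mirroring the interpretation above (assumes Python with NumPy; the education and obesity numbers are invented for illustration):

```python
# Minimal sketch (assumed data): compute and interpret the R-squared of a
# fitted linear trendline.
import numpy as np

education = np.array([18, 22, 25, 27, 30, 33, 35, 40], dtype=float)  # hypothetical % with BA or higher
obesity = np.array([34, 33, 31, 32, 29, 28, 27, 24], dtype=float)    # hypothetical obesity %

b, a = np.polyfit(education, obesity, 1)          # slope and intercept of the trendline
fitted = a + b * education

ss_res = np.sum((obesity - fitted) ** 2)          # variation left unexplained
ss_tot = np.sum((obesity - obesity.mean()) ** 2)  # total variation around the mean
r2 = 1 - ss_res / ss_tot

print(f"{r2:.0%} of the variation in obesity is explained by "
      f"variation in the share of people with a BA or higher.")
```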
**-Types of trendline-**

- **Linear trendline**: best-fit straight line used with simple linear data sets; the pattern in the data points resembles a line → something is increasing or decreasing at a steady rate.
- **Logarithmic trendline**: best-fit curved line; the rate of change in the data increases or decreases quickly and then levels out. A logarithmic trendline can use negative and/or positive values. ![](media/image16.png)
- **Polynomial trendline**: curvilinear trendline; works well for large data sets with oscillating values that have more than one rise and fall. Its order can be determined by the number of bends on the graph (a quadratic polynomial trendline has one bend). *E.g.: a rise in the beginning, a peak in the middle, and a fall near the end (profit).*
- **Power trendline**: curved line best used with data sets that compare measurements increasing at a specific rate, for example the acceleration of a race car at one-second intervals. We cannot create a power trendline if the data contains zero or negative values. ![](media/image18.png)
- **Moving average trendline**: when the data points in your chart have a lot of ups and downs, a moving average trendline can smooth out the extreme fluctuations to show the pattern more clearly. Excel calculates the moving average of the number of periods that you specify (2 by default) and plots those average values as points in the line. *E.g.: to reveal fluctuations in a stock price.*

**-Correlation VS Regression-**

![](media/image20.png)

**Key difference:** correlation measures the degree of a relationship between two variables (x and y), whereas regression describes how one variable affects another.

**-Measure of association-**

![](media/image22.png)

**-Correlation-**

When a change in one variable is followed by a change in another variable, whether direct or indirect.

- Variables are considered "uncorrelated" when a change in one does not affect the other.
- **Goal 1:** knowing how two variables are correlated allows for predicting trends in the future.
- **Goal 2:** to find the numerical value that shows the relationship between the two variables and how they move together.

![](media/image24.png)

Correlation coefficients are used to measure the strength of the linear relationship between two variables:

- Range: between -1 and 1.
- Usually measured for linear relationships.
- A correlation **coefficient greater than zero** indicates a **positive relationship**, while a value **less than zero** signifies a **negative relationship**.
- A value of **zero** indicates **no relationship** between the two variables being compared.

**-Regression-**

How one variable affects another → the outcome is dependent on one or more variables.

- It can be used to assess the strength of the relationship between variables and to model the future relationship between them.
- Equation: Y = a + b(x) → prediction formula
  - Y = dependent variable, X = independent variable
  - a → the y-intercept, the value of Y when X = 0
  - b → the slope, or rise over run

**-Correlation calculation-**

- Step 1: decide which variable is the independent one and which the dependent one!
- Step 2: formulate a research hypothesis about the strength and direction of the relationship between the two variables.
- Step 3: calculate the correlation coefficient with a function, then accept or reject the hypothesis (see the sketch after these steps).
- Step 4: develop a visual tool → a scatter plot (insert the trendline, R² value and equation) or a combined chart.
- Step 5: create a simple correlation matrix with the Data Analysis ToolPak.
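The notes carry out these steps with Excel functions and the Data Analysis ToolPak; the following is a minimal non-Excel sketch of Steps 1 to 4, assuming Python with NumPy and using invented rainfall/umbrella figures:

```python
# Minimal sketch (assumed data and names; the notes use Excel for this):
# compute the correlation coefficient and the prediction formula Y = a + b(x).
import numpy as np

rainfall = np.array([82, 93, 56, 61, 110, 132, 78, 95], dtype=float)      # hypothetical X
umbrella_sales = np.array([15, 18, 11, 12, 22, 26, 14, 19], dtype=float)  # hypothetical Y

r = np.corrcoef(rainfall, umbrella_sales)[0, 1]   # correlation coefficient (-1..1)
b, a = np.polyfit(rainfall, umbrella_sales, 1)    # slope b and intercept a

print(f"correlation coefficient r = {r:.2f}")
print(f"prediction formula: Y = {a:.2f} + {b:.2f} * X")
print(f"predicted sales for 100 mm of rain: {a + b * 100:.1f}")
```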
**-CHAPTER 7. SIMPLE AND MULTIPLE REGRESSION ANALYSIS-**

**-When do we apply regression?-**

![](media/image26.png)

**Main function:** to estimate the relationships between two or more variables.

**Why should we use it?** → It helps us understand how the dependent variable changes when one of the independent variables varies, and it allows us to determine mathematically which of those variables really has an impact.

**Base of regression**: the sum of squares → we want the smallest possible sum of squares, i.e. a line that comes as close as possible to the data.

**-Types of regression-**

*(diagram)*

**-How to do it-**

1. Step 1: plot the data on a scatter plot → the regression line will show the relationship between the independent variable (rainfall) and the dependent variable (umbrella sales).
2. In the Regression dialog box, configure the following settings:
   a. Select the Input Y Range, which is your dependent variable. In our case, it's umbrella sales.
   b. Select the Input X Range, i.e. your independent variable. In this example, it's the average monthly rainfall.
   c. Check the Labels box.
   d. Choose your preferred Output option.
   e. Optionally, select the Residuals checkbox to get the difference between the predicted and actual values.

![](media/image28.png) ![](media/image30.png)

**-Types of hypothesis-**

![](media/image32.png)

**-Linear regression-**

A statistical technique that helps us understand the relationship between one dependent variable and one or more independent variables.

Main goal: to model the linear relationship between the explanatory (independent) variables and the response (dependent) variable.

**-Multiple regression-**

A regression with two or more independent variables. Only relevant variables must be included in the model, and the model should be reliable (a minimal fitting sketch follows below).

- **[Linearity]**: there is a linear relationship between the dependent variable and the independent variables.
- **[Normality]**: normality must be assumed in multiple regression, i.e. the variables must be normally distributed.
- **[No multicollinearity]**: none of the predictor variables are highly correlated with each other.
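The notes fit regressions through Excel's Regression dialog; as a hedged sketch of the same idea with two predictors, assuming Python with statsmodels and invented data, the output mirrors the quantities discussed here (coefficients, p-values, R-squared, adjusted R-squared):

```python
# Minimal sketch (assumed example; the notes do this with Excel's Data
# Analysis ToolPak): fit a multiple regression and read the same quantities
# the Excel output reports.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
rainfall = rng.uniform(50, 150, size=40)      # hypothetical predictor 1
temperature = rng.uniform(5, 30, size=40)     # hypothetical predictor 2
sales = 2 + 0.15 * rainfall - 0.30 * temperature + rng.normal(0, 1.5, size=40)

X = np.column_stack([rainfall, temperature])
X = sm.add_constant(X)                        # adds the intercept term "a"
model = sm.OLS(sales, X).fit()

print(model.params)                           # intercept and slope coefficients
print(model.pvalues)                          # which predictors really have an impact
print(model.rsquared, model.rsquared_adj)     # fit, penalized for extra predictors
```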
**-Multicollinearity-**

The statistical phenomenon where two or more independent variables are strongly correlated → an almost perfect or exact relationship between the predictors!

[Why is it a problem?] The researchers obtain unreliable results → we would not be able to distinguish between the individual effects of the independent variables on the dependent variable.

- E.g.: a person's height and weight, or the age and sales price of a car.

[What to do?] → obtain more data, remove a variable, etc.

[Rule of thumb:] if the correlation coefficient between two or more independent variables is more than 0.8 → multicollinearity!

**-Adjusted R² rule-**

Problem: R-squared increases every time you add an independent variable to the model! → A regression model that contains more independent variables than another can look like it provides a better fit merely because it contains more variables.

Adjusted R² is a modified version of R-squared that accounts for predictors that are not significant in a regression model. Two outcomes might occur:

- A lower adjusted R-squared indicates that the additional input variables are not adding value to the model.
- A higher adjusted R-squared indicates that the additional input variables are adding value to the model.

**-Reliability of data-**

**[Coefficient of variation]**

- Shows the extent of variability in relation to the mean of the population.
- The higher the coefficient of variation, the higher the level of dispersion around the mean of the data points.
- The lower the value of the coefficient of variation, the higher the precision of the estimate.
- A high CV indicates that the group is more variable.
- Recommendation: a value between 20 and 30 is acceptable.

**[How to find the most reliable variables?]** (see the screening sketch below)

1. Compare their kurtosis and skewness values → the shape of the histogram, estimating the asymmetry.
2. Find the smallest mean/median proportion (preferably lower than 5%).
3. Evaluate the variability: coefficient of variation → values between 20 and 30.
4. Select the variables which have the most similar distribution in terms of shape and dispersion → run a regression model for these most reliable variables!
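A minimal screening sketch for the checklist above (assuming Python with pandas and invented predictor columns; the notes do this with Excel's descriptive statistics):

```python
# Minimal sketch (assumed data and column names): screen candidate variables
# by skewness, kurtosis, coefficient of variation, and mutual correlation
# (multicollinearity check).
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "rainfall": rng.normal(100, 20, 60),
    "temperature": rng.normal(18, 5, 60),
    "humidity": rng.normal(60, 25, 60),
})

summary = pd.DataFrame({
    "skewness": df.skew(),
    "kurtosis": df.kurt(),
    "cv_percent": 100 * df.std() / df.mean(),                        # coefficient of variation
    "mean_median_gap_pct": 100 * (df.mean() - df.median()).abs() / df.mean(),
})
print(summary.round(2))

# Rule of thumb from the notes: |r| > 0.8 between predictors -> multicollinearity.
corr = df.corr()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1)).stack()
print(upper[upper.abs() > 0.8])
```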
**-CHAPTER 8. FORECASTING-**

![](media/image34.png)

**[Forecasting is a tool used for predicting future observation values based on past information.]**

Forecasting can be used for:

- Strategic planning (long-range planning).
- Finance and accounting (budgets and cost controls).
- Marketing (future sales, new products).
- Production and operations.

Business analysts may choose from a **wide range of forecasting techniques** to support decision making. Selecting the appropriate method depends on the characteristics of the forecasting problem, such as the time horizon of the variable being forecast, as well as the available information on which the forecast will be based.

**[General rules]**:

- Forecasts are more accurate for shorter time periods.
- Every forecast should include an error estimate.
- A forecast is only as good as the information included in it (past data).

![](media/image36.png)

**-Working with past data-**

*(screenshot)*

**-Time series-**

A time series consists of a set of observations that are measured at specified (usually equal) time intervals.

- Time series can be taken over very short periods or long periods (years, centuries).
- They are a basic tool in forecasting.

*What should we consider when looking at time series?*

1. Trends
2. Cyclical elements
3. Seasonality

![](media/image38.png)

**-Quantitative forecast method-**

*(chart)*

**-How do we decide the forecasting method?-**

We choose the most accurate forecast, that is, the prediction with the lowest deviation ("fitting error"). Error metrics used (with =ABS() in Excel; a sketch follows at the end of this chapter):

- MAD (mean absolute deviation) = AVERAGE(ABS(actual values - forecasted values))
- MSE (mean squared error) = the average of the squared deviations
- RMSE (root mean squared error) = the square root of the MSE
- MAPE (mean absolute percentage deviation) = the average of (absolute deviation / actual value)
  - Step 1: calculate the absolute deviations.
  - Step 2: divide each absolute deviation by its actual value.
  - Step 3: calculate the average of all the values from Step 2.

**-Forecasting methods in Excel-**

None of these methods will give you definitive answers without the ability to see the future; the results are best used to make educated guesses (minimal sketches of the three methods follow at the end of this chapter).

![](media/image40.png)

**[A. Moving Average (simple moving average)]**

- A forecasting method based on the premise that, if the values in a time series are averaged over a sufficient period, the effect of short-term variations will be reduced.
- With this method, short-term cyclical, seasonal, and irregular variations are smoothed out.
- By choosing the number of cases to be included in any average (n), the degree of smoothing can be controlled (e.g. 12 for monthly data, 4 for quarterly data, and so on).
- The larger the number of cases (n) included in an average, the greater the smoothing effect.

In the simple moving average model the forecast value is:

F(t+1) = (A(t) + A(t-1) + ... + A(t-n+1)) / n

![](media/image42.png)

**Where:**

- t → current period
- F(t+1) → forecast for the next period (we choose)
- n → forecasting horizon (how far back we look)
- A → actual figures from each period

**[B. Exponential Smoothing]**

The prediction of the future depends mostly on the most recent observation and on the error of the latest forecast:

F(t+1) = F(t) + α · (A(t) - F(t))

![](media/image44.png)

It involves only one parameter (α), which smooths the time series data by computing exponentially weighted averages and provides short-term forecasts. The smoothing constant α expresses how much our forecast will react to observed differences: the older the data, the less priority ("weight") it is given; newer data is seen as more relevant and is assigned more weight.

![](media/image46.png)

**[C. Weighted Average]**

- We give more importance to what happened recently, without losing the impact of the past:

F(t+1) = w(t) · A(t) + w(t-1) · A(t-1) + ... + w(t-n+1) · A(t-n+1)

**Where:**

- t: current period
- F(t+1): forecast for the next period (we choose)
- n: forecasting horizon (how far back we look)
- A: actual figures from each period
- w: weight we give to each period (the weights sum to 1)

![](media/image49.png) ![](media/image51.png)
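The formulas above are computed in Excel in the notes; the following is a minimal plain-Python sketch of the three methods for a made-up series, just to make the mechanics concrete:

```python
# Minimal sketch (assumed data; the notes do this in Excel): the three
# forecasting formulas from this chapter, computed for the next period.
import numpy as np

actuals = np.array([120, 132, 128, 141, 150, 147, 158, 163], dtype=float)  # hypothetical A

# A. Simple moving average: F(t+1) = mean of the last n actual values.
def moving_average_forecast(a, n=4):
    return a[-n:].mean()

# B. Exponential smoothing: F(t+1) = F(t) + alpha * (A(t) - F(t)).
def exponential_smoothing_forecast(a, alpha=0.3):
    forecast = a[0]                       # start the recursion at the first actual value
    for actual in a[1:]:
        forecast = forecast + alpha * (actual - forecast)
    return forecast

# C. Weighted average: more weight on recent periods, weights sum to 1.
def weighted_average_forecast(a, weights=(0.1, 0.2, 0.3, 0.4)):
    w = np.array(weights)                 # oldest weight first, newest last
    return float(np.dot(w, a[-len(w):]))

print("moving average       :", moving_average_forecast(actuals))
print("exponential smoothing:", round(exponential_smoothing_forecast(actuals), 1))
print("weighted average     :", weighted_average_forecast(actuals))
```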
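To decide between the methods, a sketch of the error metrics from earlier in this chapter (again with invented actual and forecast values):

```python
# Minimal sketch (assumed arrays): the "fitting error" metrics used to pick
# the forecasting method with the lowest deviation.
import numpy as np

actual = np.array([120, 132, 128, 141, 150, 147], dtype=float)    # hypothetical actuals
forecast = np.array([118, 129, 131, 138, 148, 151], dtype=float)  # hypothetical forecasts

deviations = actual - forecast
mad = np.mean(np.abs(deviations))             # mean absolute deviation
mse = np.mean(deviations ** 2)                # mean squared error
rmse = np.sqrt(mse)                           # root mean squared error
mape = np.mean(np.abs(deviations) / actual)   # mean absolute % deviation

print(f"MAD={mad:.2f}  MSE={mse:.2f}  RMSE={rmse:.2f}  MAPE={mape:.1%}")
```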