Statistics for Business Analytics & Data Science PDF

Document Details

Uploaded by WinningSodalite2051

Zarqa University

Dr. Mais Haj Qasem

Tags

statistics distributions data analysis business analytics

Summary

This document is a lecture or presentation on various statistical concepts, including discrete and continuous distributions, probability distributions, the normal distribution, standard deviation, mean, median and mode. It also introduces examples and exercises.

Full Transcript

Statistics for Business Analytics & Data Science, Part 1

Prepared by: Dr. Mais Haj Qasem

Outline

What we will learn:
1. Continuous and Discrete
2. Mean, Median, Mode
3. Standard Deviation
4. What is a Distribution
5. Normal Distribution
6. Skewness

Continuous and Discrete

Statisticians and data scientists have to understand the difference between discrete and continuous datasets and variables. The similarity is that both are types of numerical data. In practice, however, whether the underlying data is discrete or continuous affects many of the decisions that data scientists and statisticians make. See the table on the following slide, which highlights discrete and continuous data.

[Table slide: Bank Dataset]

A variable is a quantity whose value changes.

A discrete variable is a variable whose value is obtained by counting. Examples:
❖ number of students present
❖ number of red marbles in a jar
❖ number of heads when flipping three coins
❖ students' grade level

A continuous variable is a variable whose value is obtained by measuring. Examples:
❖ height of students in class
❖ weight of students in class
❖ time it takes to get to school
❖ distance travelled between classes

Distribution

A probability distribution is a mathematical function that, stated in simple terms, gives the probability of occurrence of each possible outcome of an experiment. A distribution in statistics is a function that shows the possible values of a variable and how often they occur. A distribution does not have to be shown as a chart; it may be a table, or simply a statement such as the probability that a person picked from a group is under 10 years old. It is usually drawn as a chart because that makes it easier to understand.

Discrete & Continuous Distribution

A probability distribution is a formula or table that assigns a probability to each possible value of a random variable X. There are two types of probability distributions:
1. A discrete distribution means X can take on one of a countable (typically finite) number of possible values.
2. A continuous distribution means X can take on an infinite (uncountable) number of different values.

Discrete Distribution

A discrete distribution gives the probability of each value of a discrete random variable. A discrete random variable is one with countable values, such as a set of non-negative integers. Each possible value of the discrete random variable can be associated with a non-zero probability, so a discrete probability distribution can always be presented in tabular form. A discrete distribution can be used to compute the probability that X is exactly equal to some value.

Here, age is not a discrete value, while age group is. Someone's age may be 21 years, 10 months, and 11 hours, whereas an age group is a range of ages, and everyone in the age column must fall into exactly one age group. For example, the probability that someone's age falls in the 30–40 group is 30%; for a discrete variable such as age group, each category has a definite probability that can be calculated directly.

If you flip a coin twice, the possible combinations are:
❖ Tails/tails (TT)
❖ Heads/tails (HT)
❖ Tails/heads (TH)
❖ Heads/heads (HH)

There are four combinations because the coin is flipped twice and each flip has two possible outcomes (2 × 2 = 4). Each combination makes up one-quarter of the possible outcomes. HT and TH each account for one-quarter of the outcomes, and both give exactly one head, so for counting heads they are effectively the same thing.
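
The four combinations listed above can be enumerated in code, and counting the heads in each one gives the probability of every possible number of heads. This is a minimal sketch that assumes a fair coin, so that all four combinations are equally likely; the variable names are illustrative only.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Enumerate every equally likely result of two flips: HH, HT, TH, TT.
outcomes = ["".join(flips) for flips in product("HT", repeat=2)]

# Count how many outcomes contain 0, 1, or 2 heads.
head_counts = Counter(outcome.count("H") for outcome in outcomes)

# Each outcome has probability 1/4, so divide each count by the number of outcomes.
distribution = {
    heads: Fraction(count, len(outcomes))
    for heads, count in sorted(head_counts.items())
}

for heads, prob in distribution.items():
    print(f"P({heads} heads) = {float(prob):.2f}")
```

Using exact fractions rather than floats makes it easy to confirm that the probabilities add up to exactly 1.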
As a result, you will get TT one-quarter of the time, HH one-quarter of the time, and a mixed result (HT or TH) half of the time. Let the random variable X represent the number of heads produced by this experiment. Because X can only take the values 0, 1, or 2, it is a discrete random variable.

Number of heads    Probability
0                  0.25
1                  0.50
2                  0.25

Continuous Distribution

A continuous distribution describes the probabilities of a continuous random variable's possible values. A continuous random variable is a random variable with an infinite and uncountable set of possible values (known as the range). The probability that a continuous random variable X falls within an interval is defined as the area under the curve of its probability density function (PDF) over that interval. As a result, only ranges of values can have a nonzero probability. The probability that a continuous random variable equals any single value is always zero.

Here the peak (P) is labelled 50%, but this does not mean that 50% of people have a balance of $10,000; the probability that someone's balance is exactly $0, or exactly $10,000, is zero. What can be stated is the probability of finding someone whose balance is between $9,500 and $10,500, which is the area shaded in grey. The shaded area represents the actual percentage of people whose balance lies between $9,500 and $10,500, and because it sits around the peak it covers the balances that occur most often.

$$ P(\$9{,}500 < X < \$10{,}500) = \text{shaded area} $$

The continuous normal distribution can be used to represent the weight distribution of adult males. For example, you may compute the probability that a man weighs between 160 and 170 pounds. In this example, the shaded area under the curve covers the weight range between 160 and 170 pounds. Because the area of this range is 0.136, the probability that a randomly selected man weighs between 160 and 170 pounds is 13.6%. The total area under the curve is 1.0.
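
The area between 160 and 170 pounds can be computed from the normal cumulative distribution function. The lecture does not state the mean and standard deviation of the weight distribution, so the sketch below assumes a mean of 180 pounds and a standard deviation of 10 pounds, chosen only because they reproduce the quoted area of 0.136; `normal_cdf` is a helper written for this example.

```python
import math


def normal_cdf(x, mean, sd):
    """Return P(X <= x) for a normal distribution, using the error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


# Assumed parameters (not given in the lecture): mean 180 lb, sd 10 lb.
MEAN_WEIGHT = 180.0
SD_WEIGHT = 10.0

# Area under the curve between 160 and 170 = CDF(170) - CDF(160).
p_160_170 = normal_cdf(170, MEAN_WEIGHT, SD_WEIGHT) - normal_cdf(160, MEAN_WEIGHT, SD_WEIGHT)

print(f"P(160 < X < 170) = {p_160_170:.3f}")
```

Under these assumed parameters the range from 160 to 170 pounds lies between two and one standard deviations below the mean.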
The probability that X is exactly equal to any number, on the other hand, is always 0, since the area under the curve at a single point with no width is zero. For example, the probability of a man weighing exactly 190 pounds to infinite accuracy is zero. The probability that a man weighs more than 190 pounds, less than 190 pounds, or between 189.9 and 190.1 pounds is nonzero, but the probability that he weighs exactly 190 pounds is zero.

Discrete Vs Continuous Distribution

A continuous probability distribution differs from a discrete probability distribution in several ways:
❖ The probability that a continuous random variable will take on an exact value is zero.
❖ As a result, a continuous probability distribution cannot be represented in tabular form.
❖ A continuous probability distribution is instead described using an expression or formula.

Standard Deviation

Mean: The mean is the average of the given set of data. It works with both continuous and discrete data. It is the sum of all the values in the data set divided by the total number of values.

$$ m = \frac{x_1 + x_2 + \dots + x_n}{n}, \quad \text{where } n \text{ is the total number of values} $$

Variance: The average squared distance of each value from the mean. A value that is less than the mean gives a negative deviation and a value that is more than the mean gives a positive deviation, and the deviations always sum to zero. For that reason each deviation is squared before averaging, which gives a meaningful number.

$$ \text{Variance} = \frac{1}{N} \sum_{i=1}^{N} (x_i - m)^2 $$

Standard Deviation: A measurement that shows how much variation exists from the mean. It is the square root of the variance. It is also important for understanding skewness, which is discussed later.

$$ \text{Standard Deviation} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - m)^2} $$

Once the mean and the standard deviation are known, the bands above and below the mean are found by adding the standard deviation to the mean or subtracting it from the mean.
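
The mean, variance, and standard deviation formulas can be checked with a few lines of code. The sketch below uses the five heights that appear as a sample later in the lecture and divides by N, as the formulas above do (the population form, not the N - 1 sample form); the variable names are illustrative.

```python
import math

# Five heights used as a sample later in the lecture.
heights = [62, 61.3, 65.1, 70.4, 70.9]

n = len(heights)
mean = sum(heights) / n

# Population variance: average of the squared deviations from the mean.
variance = sum((x - mean) ** 2 for x in heights) / n

# Standard deviation: square root of the variance, in the original units.
std_dev = math.sqrt(variance)

print(f"mean = {mean:.2f}, variance = {variance:.2f}, sd = {std_dev:.2f}")

# Bands around the mean: mean - k*sd and mean + k*sd for k = 1, 2, 3.
for k in (1, 2, 3):
    print(f"{k} sd below = {mean - k * std_dev:.2f}, {k} sd above = {mean + k * std_dev:.2f}")
```

Python's standard library offers `statistics.pvariance` and `statistics.pstdev` for the same divide-by-N calculation; the explicit version is shown so that each step matches a term in the formulas.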
❖ 1st standard deviation above = mean + standard deviation
❖ 2nd standard deviation above = mean + 2 × standard deviation
❖ 3rd standard deviation above = mean + 3 × standard deviation
❖ 1st standard deviation below = mean - standard deviation
❖ 2nd standard deviation below = mean - 2 × standard deviation
❖ 3rd standard deviation below = mean - 3 × standard deviation

How much does each person deviate from the mean?

Mean = 155 cm

We are not interested in the deviation of each individual person from the mean value; we want to know how much people deviate from the mean on average. This is what the standard deviation tells us.

Standard Deviation = 11.5 cm

Standard Deviation Vs Variance

The standard deviation is the quadratic mean (root mean square) of the distances from the mean:

$$ \text{Standard Deviation} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - m)^2} $$

The variance is the squared standard deviation:

$$ \text{Variance} = \frac{1}{N} \sum_{i=1}^{N} (x_i - m)^2 $$

The only difference is that the standard deviation takes the square root and the variance does not. Because the root is taken, the standard deviation is always in the same unit as the original data (cm). For this reason it is advisable to use the standard deviation to describe data, as this makes interpretation easier. The variance is more difficult to interpret because its unit is the square of the original unit (cm²).

We will work on a sample from the height population: {62, 61.3, 65.1, 70.4, 70.9}

$$ \text{Mean} = \frac{62 + 61.3 + 65.1 + 70.4 + 70.9}{5} = 65.94 $$

$$ \text{Variance} = \frac{1}{N} \sum_{i=1}^{N} (x_i - m)^2 \approx 16.45 $$

$$ \text{Standard Deviation} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - m)^2} \approx 4.06 $$

Normal Distribution

The normal distribution is an important type of statistical distribution with several applications.
This distribution is used in a great many machine learning algorithms, so understanding the normal distribution is essential for any statistician, machine learning engineer, or data scientist. The normal distribution is also known as the Gaussian or Gauss distribution, and its bell-shaped curve is its probability density function. Many groups of measurements follow this pattern, which is why it is widely used in business and statistics.

All normal distributions can be described by just two parameters: the mean and the standard deviation. The mean is the location of the distribution's peak, or highest point. The distribution then falls away symmetrically around the mean, and its width is determined by the standard deviation.

Here, the share of observations that lie within one standard deviation on either side of the mean is 68.2%, and a further 27.2% lie between the 1st and 2nd standard deviations (both sides combined). This means that the further we move from the mean, the fewer observations we find. The share of observations beyond the 3rd standard deviation on each side is only about 0.1%. Such values are called outliers, or abnormal data that is out of the ordinary; for example, a recorded height of 2.2 or 130 would fall in that area. In many studies such outliers are removed.

In this chart the mean equals zero, but that is not a requirement. The further a value is from the mean, the lower its probability density. Data scientists often solve their problems using a normal distribution because many real-world quantities are approximately normally distributed. For example, when evaluating students' marks, a normal distribution can be used to judge each student's performance relative to the rest.

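
The band percentages quoted above can be reproduced from the normal curve itself. The sketch below relies on the standard identity that the share of a normal distribution within k standard deviations of the mean is erf(k / √2); `within_k_sd` is a helper named for this example. Exact values differ from the slide figures in the last digit because the slide rounds each half-band before adding.

```python
import math


def within_k_sd(k):
    """Share of a normal distribution lying within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2.0))


within_1 = within_k_sd(1)                      # inside +/- 1 sd
between_1_2 = within_k_sd(2) - within_k_sd(1)  # between 1 and 2 sd, both sides
between_2_3 = within_k_sd(3) - within_k_sd(2)  # between 2 and 3 sd, both sides
beyond_3 = 1.0 - within_k_sd(3)                # outside +/- 3 sd, both sides

print(f"within 1 sd:        {within_1:.2%}")
print(f"between 1 and 2 sd: {between_1_2:.2%}")
print(f"between 2 and 3 sd: {between_2_3:.2%}")
print(f"beyond 3 sd:        {beyond_3:.2%}")
```

The result does not depend on the mean or the standard deviation, which is why the same percentages apply to every normal distribution.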
For all normal distributions, 68.2% of the observations will be within one standard deviation of the mean, 95.4% will fall within two standard deviations, and 99.7% will fall within three standard deviations. This fact is known as the "empirical rule," a heuristic that indicates where the majority of the data in a normal distribution will appear. It also means that data falling outside of three standard deviations ("3-sigma") signifies rare occurrences.

Skewness

Skewness is asymmetry in a statistical distribution, in which the curve appears distorted or skewed either to the left or to the right. Skewness can be quantified as a measure of how much a specific distribution differs from a normal distribution. A normal distribution has a skew of zero.

Skewness is an important topic for data scientists because it gives them a first indication of the shape of the data they are working with and helps them analyse and understand that data.

Types of Skewness:
❖ Positive Skewness
❖ Negative Skewness
❖ Zero Skewness

Positively Skewed or Right-Skewed (Positive Skewness): A positively skewed (or right-skewed) distribution is a distribution in which most values are clustered on the left side of the distribution while the right tail is longer.
❖ Example: Most of the players score poorly while just a few of them perform well.

Negatively Skewed or Left-Skewed (Negative Skewness): A negatively skewed (also known as left-skewed) distribution is a distribution in which most values are clustered on the right side of the distribution graph while the left tail is longer.
❖ Example: Most of the players scored 40+ runs in a match and only a few of them scored less than 10 runs.

Zero Skewness: The distribution is symmetric, as in the normal distribution, where the mean coincides with the most frequent value in the data set.

Mean, Median, Mode

The examples on the following slides all use the same data set:

{58.8, 61.2, 61.3, 62, 62.5, 62.6, 63.6, 65.1, 65.4, 65.5, 66.1, 66.4, 66.7, 67.4, 67.9, 68, 68.1, 68.3, 68.6, 69.9, 70.4, 70.7, 70.9, 71.8, 71.8, 72.2}

Mean: The mean is the average of the given set of data. It works with both continuous and discrete data. It is the sum of all the values in the data set divided by the total number of values.

$$ m = \frac{x_1 + x_2 + \dots + x_n}{n}, \quad \text{where } n \text{ is the total number of values} $$

Median: The median is the middle value of a given collection of data when it is sorted in order. To calculate the median, the first step is to order the values from smallest to largest. If the number of values or observations is odd, the median is the middle observation. If there is an even number of observations, the median is the average of the two middle values.
❖ Example: 2, 3, 11, 13, 17, 27, 34, 47. Median = (13 + 17) / 2 = 15

Mode: The most frequent number occurring in the data set is known as the mode. It is a hump, or peak, in the distribution.

In a right-skewed distribution the mean is pulled toward the right tail, so in this case the mean is usually larger than the median.
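
The three measures can be computed for the 26-value data set used on the mean, median, and mode slides. This is a minimal sketch using Python's standard `statistics` module; comparing the mean with the median is shown only as a rough indication of the direction of skew, not a formal skewness measure.

```python
import statistics

data = [58.8, 61.2, 61.3, 62, 62.5, 62.6, 63.6, 65.1, 65.4, 65.5, 66.1, 66.4, 66.7,
        67.4, 67.9, 68, 68.1, 68.3, 68.6, 69.9, 70.4, 70.7, 70.9, 71.8, 71.8, 72.2]

mean = statistics.mean(data)      # sum of the values divided by their count
median = statistics.median(data)  # even count, so the average of the two middle values
mode = statistics.mode(data)      # the most frequent value

print(f"n = {len(data)}, mean = {mean:.2f}, median = {median:.2f}, mode = {mode}")

# Rough indication of skew direction from the mean/median relationship.
if mean < median:
    print("mean < median: the data leans toward a left (negative) skew")
elif mean > median:
    print("mean > median: the data leans toward a right (positive) skew")
else:
    print("mean == median: no skew indicated by this comparison")
```

With 26 values the median is the average of the 13th and 14th sorted values, which matches the even-count rule described above.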
In a left-skewed distribution, the mean is usually smaller than the median. This matters when a data scientist works on a dataset that has many missing values. One way of handling them is to look at the skewness and fill the missing values with an estimate such as the mean or the median, although this is one of the older approaches. The median is often preferred because it is not affected by outliers, whereas the mean is pulled toward any outliers in the data. So if the standard deviation is very high, the median is the better choice, but if the standard deviation is low and reasonable, the mean can be used. In a left-skewed distribution the median is the better choice because it is closer to the mode, which means it is closer to where most of the data lies.

Homework Challenge

You are an analyst working for a high-end clothes design boutique. The company is developing a new line of clothes for very tall people. Your team is analyzing the viability of the project from a sales perspective, and your manager has asked you to assist with some input variables to help test the financial forecast.

You need to create two distributions:
❖ A normal distribution of 1000 observations for heights of men in Jordan
❖ A normal distribution of 1000 observations for heights of women in Jordan

Also, for each of the two populations you have been asked to identify the minimum height of the tallest 2.2% of people in the population (that is, roughly the people who lie beyond the 2nd standard deviation above the mean).

In Jordan, men's heights have a mean of 69.1 inches (175.5 cm) and a standard deviation of 2.9 inches (7.4 cm), while women's heights have a mean of 63.7 inches (161.8 cm) and a standard deviation of 2.7 inches (6.9 cm).
❖ Hint: 1 inch = 2.54 cm

Hint 1:
❖ Use the NORM.INV() function combined with RAND() in Excel
❖ Example: NORM.INV(RAND(), 69.1, 2.9)

Hint 2:
❖ If you want to visualize your distributions, you will need to allocate your data to bins first (ranges of values, in the same way that ages were grouped into age groups). Check out the article on Microsoft for more information on how to do this in Excel.
❖ How to use the Histogram tool in Excel
❖ https://support.microsoft.com/en-us/help/214269/how-to-use-the-histogram-tool-in-excel
❖ Note: this histogram won't be dynamic. We will learn how to make a dynamic one in the homework solution.
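
For readers working outside Excel, the same idea can be sketched in Python. This is one possible approach under stated assumptions, not the lecture's solution: `NormalDist.samples` stands in for NORM.INV(RAND(), mean, sd), the seed value is arbitrary, and the cutoff is taken from the theoretical distribution with `inv_cdf` instead of being read from the simulated sample.

```python
from statistics import NormalDist

CM_PER_INCH = 2.54
TOP_SHARE = 0.022  # the tallest 2.2% of each population

# Means and standard deviations in inches, as given in the homework.
populations = {
    "men": NormalDist(mu=69.1, sigma=2.9),
    "women": NormalDist(mu=63.7, sigma=2.7),
}

samples = {}
cutoffs_in = {}
for name, dist in populations.items():
    # 1000 simulated heights; the fixed seed only makes the run repeatable.
    samples[name] = dist.samples(1000, seed=42)
    # Height below which 97.8% of the population falls.
    cutoffs_in[name] = dist.inv_cdf(1 - TOP_SHARE)
    print(f"{name}: tallest {TOP_SHARE:.1%} start at "
          f"{cutoffs_in[name]:.1f} in ({cutoffs_in[name] * CM_PER_INCH:.1f} cm)")
```

The cutoff lands close to two standard deviations above the mean, which is consistent with the homework's description of the tallest 2.2%.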
