FM217 Intro Stats Part 1 PDF

Summary

This document provides an introduction to data science focusing on introductory statistics. It covers measures of centrality, such as mean, median, and mode, and measures of dispersion, including range, variance, and standard deviation. The document also touches on the concept of percentiles.

Full Transcript

Introduction to Data Science: Refresher for September 4, 2024 class on Introductory Statistics
Tuesday, September 03, 2024 9:09 PM

Data is useful only if it is presented meaningfully.

What is a population? What is a sample?
Sampling – Random Sample – Subset of the population that (hopefully) represents the population.
If possible, study the entire population to avoid sampling bias.

E.g., X_1 = {14, 15, 15, 16, 16, 17, 17, 18}
How do you aggregate X?

Measures of Centrality (the typical/central value):
1. Average/Mean
2. Median
3. Mode
4. …...

Mean = (Sum of values) / n
where n = |X| is the size of the sample set X = {x_1, x_2, …, x_n}.

Problems with Mean: a single extreme value pulls the mean away from the typical value.
X_2 = {0, 17, 18, 19}
X_3 = {2, 2, 2, 2, 2, 2, 2, 2, 2, 10000}

Possible solution: Median = Middle value of the sorted data.
If n is even, in this course, we will take the average of the two middle values.

Problems with Median:
X_4 = {1, 1, 1, 2, 2, 3, 3, 10, 10, 10, 10, 10, 10}

Possible solution: Mode = Most frequently occurring value
X_5 = {1, 1, 1, 1, 100, 100, 100, 100, 100, 45}
Unimodal: Hometowns = {Mohali, Mohali, Delhi, Chandigarh, Bangalore}
Bimodal: Hometowns = {Mohali, Mohali, Delhi, Delhi, Chandigarh, Bangalore}
Multimodal: more than one mode.
Mode can be non-numeric.

Q) How does one differentiate between the 3 sets of values given below?
Y_1 = {-1, 0, 1}
Y_2 = {-10, 0, 10}
Y_3 = {-100, 0, 100}

Measures of Dispersion/Scatter/Spread (real value >= 0):
1. Range
2. Variance
3. Standard Deviation
4. CoV
5. IQR (Interquartile Range)
6. …...

Problem with Range: Y_4 = {1, 1, 1, 1, 1, 10, 10, 10, 10, 10} and Y_5 = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10} have the same range.

Variance = Var(X) = E[(x - mean)^2]
Standard Dev. = SQRT(Variance)

Problem: Results can be made to look better/worse by changing units.
X = {1, 2, 2, 3} (marks out of 10)
Y = {100, 200, 200, 300} (marks out of 1000)
Standard Deviation of X = 1/SQRT(2) ~ 0.7071
Standard Deviation of Y = 100/SQRT(2) ~ 70.71

Possible solution: Coefficient of Variation = Standard Deviation / Mean
CoV of X = CoV of Y (both ~ 0.3536), which reflects the fact that the two sets have a similar spread.
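The measures above can be tried out on the data sets from these notes. The following is a minimal sketch using Python's standard `statistics` module. It assumes the population forms of variance and standard deviation (dividing by n), which is what reproduces the 0.7071 figure quoted above. The helper names `value_range` and `coefficient_of_variation` are illustrative and not part of the course material.

```python
from statistics import mean, median, multimode, pstdev, pvariance

# Data sets copied from the notes.
x3 = [2, 2, 2, 2, 2, 2, 2, 2, 2, 10000]
hometowns = ["Mohali", "Mohali", "Delhi", "Delhi", "Chandigarh", "Bangalore"]
y4 = [1, 1, 1, 1, 1, 10, 10, 10, 10, 10]
y5 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
marks_out_of_10 = [1, 2, 2, 3]
marks_out_of_1000 = [100, 200, 200, 300]


def value_range(values):
    """Range = largest value minus smallest value."""
    return max(values) - min(values)


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean (mean must be non-zero)."""
    return pstdev(values) / mean(values)


# The outlier 10000 drags the mean far from the typical value; the median stays put.
print("mean:", mean(x3), "median:", median(x3))

# multimode returns every most-frequent value, and works on non-numeric data.
print("modes:", multimode(hometowns))

# Same range, but very different variance.
print("ranges:", value_range(y4), value_range(y5))
print("variances:", pvariance(y4), pvariance(y5))

# Standard deviation changes with the units; the coefficient of variation does not.
print("std devs:", pstdev(marks_out_of_10), pstdev(marks_out_of_1000))
print("CoVs:", coefficient_of_variation(marks_out_of_10),
      coefficient_of_variation(marks_out_of_1000))
```

Note that `statistics.median` already averages the two middle values when n is even, which matches the convention adopted in this course.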
A more natural and robust technique of measuring dispersion is the average absolute deviation:
1. Mean absolute deviation around the mean
2. Mean absolute deviation around the median
3. Mean absolute deviation around the mode
4. Median absolute deviation around the mean
5. Median absolute deviation around the median
6. Median absolute deviation around the mode
7. Maximum absolute deviation around the mean
8. Maximum absolute deviation around the median
9. Maximum absolute deviation around the mode

There are several formulae that give bounds on these values under various conditions.

Percentile: Relative grade / position
X = {1, 2, 3, 4, 5, 6, 6, 8, 9, 10}
Position of the nth percentile = |X| * n / 100
100th percentile of X = 10 * 100/100 = 10th position = 10
70th percentile of X = 10 * 70/100 = 7th position = 6
65th percentile of X = 10 * 65/100 = 6.5, i.e., the 6th or 7th position = 6 (both positions hold the value 6).
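The absolute-deviation variants and the percentile position rule can be sketched in a few lines of Python. This is an illustrative sketch rather than a definitive implementation: the function names are hypothetical, and `percentile_candidates` assumes the simple rule from these notes (position = |X| * n / 100 on the sorted data, 1-indexed, with a fractional position giving both neighbouring values). Statistical libraries typically use other interpolation conventions and may give different answers.

```python
import math
from statistics import mean, median

# Data sets copied from the notes.
x = [1, 2, 2, 3]
x3 = [2, 2, 2, 2, 2, 2, 2, 2, 2, 10000]
grades = [1, 2, 3, 4, 5, 6, 6, 8, 9, 10]


def absolute_deviations(values, center):
    """Distance of every value from a chosen center (mean, median, or mode)."""
    return [abs(v - center) for v in values]


def mean_abs_dev(values, center):
    return mean(absolute_deviations(values, center))


def median_abs_dev(values, center):
    return median(absolute_deviations(values, center))


def max_abs_dev(values, center):
    return max(absolute_deviations(values, center))


def percentile_candidates(values, n):
    """Values at position |X| * n / 100 in the sorted data (1-indexed).

    Assumes 0 <= n <= 100. A whole-number position yields the same value
    twice; a fractional position yields the values on either side of it.
    """
    ordered = sorted(values)
    position = len(ordered) * n / 100
    low = max(1, math.floor(position))   # clamp so tiny n still maps to position 1
    high = max(1, math.ceil(position))
    return ordered[low - 1], ordered[high - 1]


# Deviations of X = {1, 2, 2, 3} around its mean.
print(mean_abs_dev(x, mean(x)), median_abs_dev(x, mean(x)), max_abs_dev(x, mean(x)))

# With one extreme outlier, the median-based measure barely reacts.
print(mean_abs_dev(x3, median(x3)), median_abs_dev(x3, median(x3)))

# Percentile positions from the notes.
print(percentile_candidates(grades, 70))
print(percentile_candidates(grades, 65))
```

Passing the center as an argument keeps one helper per aggregate (mean, median, maximum) while still covering all nine combinations listed above.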
