CB2200 Business Statistics - Topic 1 Introduction to Statistics PDF

Summary

This document provides an introduction to business statistics, covering topics such as what is/are statistics, why study statistics, types of variables, organizing and visualizing data, and descriptive statistics. It also includes information about television audience measurement (TAM).

Full Transcript

CB2200 Business Statistics
Topic 1 Introduction to Statistics

Reference
• Levine, D.M., Kathryn, A.S. and David, F.S. Business Statistics: A First Course, Pearson Education Ltd, Chapter 1 & 2 & 3
• Liu, K. I., To K. M., Speaking of Statistics, Pearson Education Ltd, Chapter 1

Outline
Introduction
• What is/are Statistics?
• Why Study Statistics?
• Types of Variables
• Organizing and Visualizing Data
• Use of Excel in Organizing Data
Descriptive Statistics
• Measures of Central Tendency
• Measures of Variation
• Distribution Shape
• Use of Excel in Descriptive Statistics

What is/are Statistics?
Statistics is the branch of mathematics that transforms data into useful information for decision makers.
• Descriptive Statistics: collecting, summarizing, and describing data
  - How to summarize data with tables & charts?
  - What measures of central tendency & variation to use?
  - What is the shape of the distribution?
• Inferential Statistics: drawing conclusions and/or making decisions concerning a population based only on sample data
  - What is the difference between sample & population?
  - What is the difference between sample statistics and population parameters?

Example: Television Audience Measurement (TAM)
What is TAM?
• It is used to calculate the number of people watching TV
  - With the help of the collected data
• It allows brands and media companies to plan their approach
  - Schedule the programs and determine the price of advertisements shown in the programs
Who is doing TAM in Hong Kong?
• HK HOY TV, TVB, ViuTV and HK4As awarded a six-year contract (2024-2030) to GfK for performing TAM
How is TAM Done?
• Through population sampling, researchers record what a few people are watching at a given point in time and apply that information to all TV viewers in Hong Kong
• A representative panel of 2,700 individuals from 1,000 households has been selected; this sample reports ratings data for all TV viewing
• Each television is connected to a device known as a set meter, which gathers all of the required data on its own
• TV rating is expressed as a simple percentage calculation, but it involves comprehensive scientific methods in a series of processes that includes the following:
  - Establishment survey: to understand the TV viewership universe and other detailed information about TV markets
  - Sampling and panel creation: to ensure the panel is consistent with population values for a number of characteristics
  - Data collection
    Diary method: each household member is required to record the TV channels they watched and the watching time throughout a normal day. Interviewers collect the completed diary cards on a regular weekly basis
    Meter device method: each viewer in a household is required to press a remote-control handset to indicate presence when watching a TV program. Viewing data stored in the device are transferred to researchers every night
  - Data processing: three major procedures, namely data entry and cleaning, data integration, and weighting and calculation

People Meter Data
Panel_ID   Member Number   Eff_Date   Channel Code   Start Time   End Time
1124339    2               2019-7-9   99             776          806
1124339    2               2019-7-9   1              1108         1262
1124339    3               2019-7-9   1              1185         1262
....       ....            ....       ....           ....         ....

It's a STATISTICAL STUDY!

Start time = 776 means 12:56 pm and End time = 806 means 1:26 pm etc.
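The two examples above suggest that Start Time and End Time are minutes counted from midnight (776 = 12 × 60 + 56). A minimal Python sketch of that conversion follows, assuming this encoding holds for every row; the helper name `minutes_to_clock` is made up for illustration and is not part of any TAM system.

```python
def minutes_to_clock(minutes: int) -> str:
    """Convert minutes since midnight to a 12-hour clock string."""
    hours, mins = divmod(minutes, 60)
    suffix = "am" if hours < 12 else "pm"
    hour12 = hours % 12 or 12  # hours 0 and 12 are both shown as 12
    return f"{hour12}:{mins:02d} {suffix}"


# (start, end) pairs copied from the People Meter Data rows
viewing_sessions = [(776, 806), (1108, 1262), (1185, 1262)]

for start, end in viewing_sessions:
    # Viewing duration is simply the difference in minutes
    print(minutes_to_clock(start), "to", minutes_to_clock(end),
          f"({end - start} minutes)")
```

The first session prints as 12:56 pm to 1:26 pm, which matches the interpretation given on the slide.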
Channel Code = 1 means watching TVB Jade; Channel Code = 99 means watching Viu TV
Hence, on “9 July 2019”, individual “2” of household “1124339” watched channel “99” (Viu TV) from 12:56 pm (“776”) to 1:26 pm (“806”)

Basic Steps in a Statistical Study
Step 1: State the goal of your study precisely; that is, determine the population you want to study and exactly what you'd like to learn about it (population parameters).
Step 2: Choose a sample from the population. (Be sure to use an appropriate sampling technique.)
Step 3: Collect raw data from the sample and summarize these data by finding sample statistics of interest.
Step 4: Use the sample statistics to make inferences about the population.
Step 5: Draw conclusions; determine what you learned and whether you achieved your goal.
• Population: measures used to describe the population are called parameters
• Sample: measures computed from sample data are called statistics

Process of a Statistical Study

Statistical Study – Exercise
Describe how you would apply the five basic steps in a statistical study to estimate the average time that local CityU students use to travel from home to campus.
• Step 1:
• Step 2:
• Step 3:
• Step 4:
• Step 5:

How to Make Money Nowadays?
Any relationship?
Walmart put all its checkout-counter data into a giant digital warehouse and set the disk drives spinning. Out popped a most unexpected correlation: diapers and beer in the same cart, usually on Fridays.
With statistical analysis, Walmart found that young mothers always ask fathers to purchase diapers for babies after work. Evidently, young fathers would make a late-night run to the store to pick up Huggies and get some Blue Light while they were there.
Capitalizing on the discovery, the store
• placed the disparate items together
• placed high-priced diapers beside beer (as males are not concerned about the price)
Sales zoomed!!!

Why Study Statistics?
Knowing how to do statistics can lead to a well-recognized and respected profession: Statistician
“I don't like numbers. Statistics are not for me!!!”
Unfortunately, statistics are there for everybody, like it or not!
• “The 2023 survey results showed a decrease in students' expected annual salary upon graduation, where it declined to HKD292.9k (HKD24.4k per month) from HKD303.0k (HKD25.2k per month) in 2022, dropping 3%.” APAC at Universum
• “According to the Composite CPI, overall consumer prices rose by 2.0% in March 2024 over the same month a year earlier, slightly larger than the average rate of increase in January and February 2024 (1.9%).” HKSAR C&S
• “The total world population is expected to reach 9.3 billion in 2050.” U.S. Census Bureau

Accounting, Marketing, Economics, Finance, Information Management

Accounting: Audit Sampling
• The application of audit procedures to less than 100% of items within a population of audit relevance such that all sampling units have a chance of selection in order to provide the auditor with a reasonable basis on which to draw conclusions about the entire population
• The auditor shall determine a sample size sufficient to reduce sampling risk to an acceptably low level
How to obtain a sufficient SAMPLE SIZE?
Source: http://www.ifac.org/sites/default/files/publications/files/A028%202012%20IAASB%20Handbook%20ISA%20530.pdf

Economics: Economics Indicators
• Allow analysis of economic performance and predictions of future performance
How to construct ECONOMICS INDEX?
Source: http://hk.centadata.com/cci/cci_e.htm

Finance: Risk and Portfolio Management
• Use statistical models to analyze the market portfolio
• Efficient frontier of portfolio (a basket of stocks)
How to measure the EXPECTED RETURN and RISK?
Source: http://en.wikipedia.org/wiki/Modern_portfolio_theory#The_efficient_frontier_with_no_risk-free_asset

Big Data
• In June 2024, the research firm Statista estimated that the global data analytics market is expected to see significant growth over the coming years, with a forecasted market value of over $650 billion by 2029.
• In June 2024, Omar Choucair, CFO of Trintech, said: “Big tech companies such as Alphabet, Microsoft and Meta are investing billions of dollars into AI because these tools have the potential to streamline processes and unlock valuable insights from vast amounts of financial data, which can ultimately enhance decision making.”
• In May 2024, Nexford University reported that data scientist is the best-paying job in the world in 2024. Data scientists use their skills in statistics, mathematics, programming, and domain expertise to analyse and interpret complex data sets and to maintain a company's data infrastructure.

More Than Just Numbers
8.32, 7.91, 9.64, 9.18, 10.33, 7.46
As just numbers, this list is uninteresting, but what can you say if this list represents:
• Weight of a newborn puppy?
• Minutes to run a mile?
• Minutes to swim 400 meters?
Copyright © 2013 Pearson Education, Inc. All rights reserved.

Variables
A variable is any characteristic, number, or quantity that can be measured or counted
E.g.
• A person's gender
• The weight of a newborn puppy
• The concentration of CO2 in the atmosphere
• People's income
• Examination grade
• Vehicle type
Data
Data are the values measured or observed for each variable of each object
Data can be numeric or categorical
• Numeric data take numeric values and work well for statistics
  - Number of students attending the lecture
  - People's income
• Categorical data take non-numeric values
  - People's gender
  - Examination grade
  - Vehicle type
• Ordinal data mix numeric and categorical data
  - Satisfaction rating ranging from 1 to 10

Types of Variables
• Categorical variables describe qualities of the objects of interest
  Examples: Marital Status, Political Party, Eye Color (defined categories)
• Numerical variables describe quantities of the objects of interest
  - Discrete (counted items). Examples: Number of Children, Defects per hour
  - Continuous (measured characteristics). Examples: Weight, Voltage

Types of Variables – Exercise
Age   Gender   Major                       Credits   District           GPA
18    Male     Management Sciences         16        Hong Kong Island   3.6
21    Male     Accountancy                 18        New Territories    3.1
20    Female   Marketing Information Mgt   16        Kowloon            2.8
Numerical:
Categorical:

Numerical or Categorical?
Why are you in college?
Answer: 1. Personal Growth 2. Career Opportunities 3. Parental Pressure 4. Personal Networking
Data: 1, 4, 3, 2, 2, 1, 2, 3, 3, 1, 4, 2
Coding categorical data with numbers: although the above data values are numbers, the variable is still categorical
Reason for coding: easier to input into a computer

Coding Yes/No Questions
Use 0 for “No” and 1 for “Yes”
Useful for data with only two possible values
• True or False
• Black or White
• Success or Failure
• Dead or Alive
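The coding idea on the last two slides can be sketched in a few lines of Python. This is only an illustration: the `REASONS` lookup is built from the answer list above, while the `answers` list for the Yes/No part is made-up sample data. Counting the codes is meaningful; averaging the 1 to 4 codes would not be, because the variable is still categorical.

```python
from collections import Counter

# Lookup table taken from the "Why are you in college?" answer list
REASONS = {
    1: "Personal Growth",
    2: "Career Opportunities",
    3: "Parental Pressure",
    4: "Personal Networking",
}

coded = [1, 4, 3, 2, 2, 1, 2, 3, 3, 1, 4, 2]  # data from the slide

# Decode each number back to its category, then count the categories
reason_counts = Counter(REASONS[code] for code in coded)
for code, label in REASONS.items():
    print(code, label, reason_counts[label])

# Yes/No coding: 0 for "No", 1 for "Yes" (hypothetical answers)
answers = ["Yes", "No", "Yes", "Yes", "No"]
coded_answers = [1 if answer == "Yes" else 0 for answer in answers]

# With 0/1 coding, the mean of the codes is the proportion of "Yes"
share_yes = sum(coded_answers) / len(coded_answers)
print("Proportion answering Yes:", share_yes)
```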
Organizing and Visualizing Data
• Categorical variables describe qualities of the objects of interest: Summary Table, Bar Chart, Pie Chart
• Numerical variables describe quantities of the objects of interest: Frequency Distribution, Histogram

Organizing Categorical Data
• Suppose you asked 60 customers to pick which of three colours, say green, red, or blue, they like best for a product
The data might look like this:
green, red, green, green, red, red, blue, blue, green, red, green, blue, red, blue, green, green, blue, green, green, blue, green, blue, green, red, blue, green, green, green, green, red, red, red, blue, green, green, green, green, blue, red, red, green, green, red, blue, green, red, green, green, blue, red, green, red, green, blue, blue, blue, green, green, green, green

A natural way to describe the data is to count how many of each colour you have got
A summary table:
Colour   Number of Customers
blue     15
green    30
red      15
Total    60
• It is customary to list the values of the variable in alphabetical order of the category, or in descending (or ascending) order of the count
• In a statistical context, the proper name for count is frequency

You cannot tell from the table which customer picked what colour.
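The summary table above can be rebuilt by counting the raw responses. A minimal sketch using Python's standard `collections.Counter`; the variable names are illustrative, and the 60 responses are copied from the slide.

```python
from collections import Counter

raw = """
green, red, green, green, red, red, blue, blue, green, red,
green, blue, red, blue, green, green, blue, green, green, blue,
green, blue, green, red, blue, green, green, green, green, red,
red, red, blue, green, green, green, green, blue, red, red,
green, green, red, blue, green, red, green, green, blue, red,
green, red, green, blue, blue, blue, green, green, green, green
"""

# Split on commas and trim whitespace to get one colour per customer
responses = [colour.strip() for colour in raw.split(",")]

# Frequency = how many customers picked each colour
frequency = Counter(responses)

# List the categories in alphabetical order, as in the summary table
print(f"{'Colour':<8}Number of Customers")
for colour in sorted(frequency):
    print(f"{colour:<8}{frequency[colour]}")
print(f"{'Total':<8}{sum(frequency.values())}")
```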
That customer-level detail is often unimportant in reporting
The table tells us
• 15 customers picked blue, 30 customers picked green, and 15 customers picked red
• More customers picked green than the other two colours
• About the same number of customers picked the two other colours

(Frequency) Bar Chart
[Figure: bar chart “Customer's Favourite Colour”; vertical axis: Number of customers (0 to 35); horizontal axis: Colour (blue, green, red)]

Features of a Bar Chart
• It is customary to arrange the bars in the alphabetical order of the categories of the variable, or in descending (or ascending) order of the count
• It is up to you to decide the gap between two bars, as long as the gaps are the same
• It is up to you to decide the width of each bar, as long as they all have the same width
  Keeping the widths of the bars equal ensures that the area of each bar is proportional to the number of individuals in that category
• The height of each bar is proportional to the number of individuals in that category

The proportion of each category can also be included in the summary table and bar chart
Colour   Number of Customers   Percent of Customers
blue     15                    25% (= 15/60)
green    30                    50%
red      15                    25%
Total    60                    100%
• In a statistical context, percent is called relative frequency

(Relative Frequency) Bar Chart
[Figure: bar chart “Customer's Favourite Colour”; vertical axis: Percent of customers (0% to 60%); horizontal axis: Colour (blue, green, red)]
The two bar charts, frequency and relative frequency, look the same if the vertical scales are removed

Pie Chart
[Figure: pie chart “Customer's Favourite Colour”; blue 25%, green 50%, red 25%]

Features of a Pie Chart
• It shows the size relationship between the categories of the variable and the variable itself
• The slices are mutually exclusive.
  The sum of all slices equals 100 percent
• It is customary to arrange the slices in the alphabetical order of the categories of the variable, or in descending (or ascending) order of the count
• Slices of very low percent may need to be combined with others
• Percentages (or counts) should be shown as it is difficult to compare slices of similar size

Organizing Numerical Data
• Suppose you asked 100 people about the amount they spent in their last visit to a supermarket
The data might look like this:
44.8, 230.5, 303.6, 70.8, 534.4, 166.2, 466, 85.1, 63, 47.8, 36.5, 35.7, 12.7, 11.9, 297.5, 74.1, 77.1, 251.2, 127.1, 118.6, 211.2, 221.9, 49.1, 349.1, 556.6, 768, 231.7, 247.2, 87.4, 304.3, 311.3, 825.8, 15.9, 526, 5.2, 156.7, 65.2, 143.3, 138.5, 478.4, 124.2, 205.1, 90.8, 3.1, 334.8, 7.4, 113.8, 79.2, 128.8, 26.6, 15.2, 554.4, 2.9, 70.2, 540.7, 36.4, 588.9, 151.5, 14.2, 235.7, 13.7, 187.4, 817.8, 140.3, 114.9, 219.5, 31.4, 99.4, 47.3, 111.8, 230.2, 478.2, 4.6, 783.5, 483.5, 99.3, 92.8, 464.2, 172.9, 380.1, 234.5, 120.2, 100.3, 109.8, 276.1, 157.7, 192.9, 13.1, 62.2, 44.2, 35.9, 239.9, 193.8, 591.9, 249.1, 17.9, 89.3, 369.1, 38.2, 154.3

Similar to categorical data, numerical data can be presented in the form of a table. It is called a frequency distribution
• The frequency distribution is a summary table in which the data are arranged into numerically ordered classes
Amount Spent ($)   Frequency
0 - < 100          40
100 - < 200        22
200 - < 300        15
300 - < 400        7
400 - < 500        5
500 - < 600        7
600 - < 700        0
700 - < 800        2
800 - < 900        2
900 - < 1000       0
Total              100

Steps to construct a frequency distribution
1. Sort data in ascending order: 2.9, 3.1, 4.6, 5.2, 7.4, ...
2. Find the range: 825.8 - 2.9 = 822.9
3. Select the number of classes: 10
4. Compute the class interval (width): 822.9 / 10 = 82.29. Round up to a convenient number, say 100
5. Determine class boundaries (limits):
   Class 1: 0 but less than 100
   Class 2: 100 but less than 200
   ...
6. Assign each observation to a class and count the number of observations

Features of frequency distribution
• The exact value of each observation is lost
• The width of each interval is identical
  Widths can be unequal. However, this should be done only under very special circumstances, such as when the data are sparsely distributed or have a very long tail at one or both ends
• The lower value of the first class interval is often the smallest value in the data, or a smaller value selected for convenience, such as 0
• Class boundaries include the left endpoint, but not the right
  Another endpoint policy can be adopted, but it needs to be applied consistently
• The number of classes depends on the range (maximum value - minimum value) of the data
  A large data range and a high number of observations allow a larger number of classes.
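The six construction steps can be traced in code. A minimal Python sketch using the 100 amounts from the slide; the class width of 100 is the hand-picked "convenient number" from step 4, not something the code derives, and the variable names are illustrative.

```python
amounts = [
    44.8, 230.5, 303.6, 70.8, 534.4, 166.2, 466, 85.1, 63, 47.8,
    36.5, 35.7, 12.7, 11.9, 297.5, 74.1, 77.1, 251.2, 127.1, 118.6,
    211.2, 221.9, 49.1, 349.1, 556.6, 768, 231.7, 247.2, 87.4, 304.3,
    311.3, 825.8, 15.9, 526, 5.2, 156.7, 65.2, 143.3, 138.5, 478.4,
    124.2, 205.1, 90.8, 3.1, 334.8, 7.4, 113.8, 79.2, 128.8, 26.6,
    15.2, 554.4, 2.9, 70.2, 540.7, 36.4, 588.9, 151.5, 14.2, 235.7,
    13.7, 187.4, 817.8, 140.3, 114.9, 219.5, 31.4, 99.4, 47.3, 111.8,
    230.2, 478.2, 4.6, 783.5, 483.5, 99.3, 92.8, 464.2, 172.9, 380.1,
    234.5, 120.2, 100.3, 109.8, 276.1, 157.7, 192.9, 13.1, 62.2, 44.2,
    35.9, 239.9, 193.8, 591.9, 249.1, 17.9, 89.3, 369.1, 38.2, 154.3,
]

# Steps 1-2: sort and find the range
ordered = sorted(amounts)
data_range = ordered[-1] - ordered[0]

# Steps 3-4: choose the number of classes and a convenient width
num_classes = 10
raw_width = data_range / num_classes   # about 82.29
class_width = 100                      # rounded up by hand

# Steps 5-6: classes start at 0 and include the left endpoint only,
# so floor division gives the class index of each observation
counts = [0] * num_classes
for amount in amounts:
    counts[int(amount // class_width)] += 1

relative = [count / len(amounts) for count in counts]

print("Range:", round(data_range, 1), " raw width:", round(raw_width, 2))
for i, (count, share) in enumerate(zip(counts, relative)):
    low, high = i * class_width, (i + 1) * class_width
    print(f"{low:>4} - < {high:<5}{count:>4}{share:>7.2f}")
```

Floor division works here only because the first class starts at 0 and all classes share one width; unequal widths would need explicit boundaries.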
  In general, 5 to 15 classes will be sufficient
• The width of a class depends on the number of classes adopted and the data range
  To determine the width of a class, divide the range of the data by the number of classes desired

Histogram
• A histogram is a bar chart for grouped numerical data in which the frequencies of each group of numerical data are represented as individual vertical bars
• For example:
[Figure: “Histogram for Customer's Spent”; vertical axis: Frequency (0 to 50); horizontal axis: Amount ($), 100 to 1000]

Features of a histogram
• The chart is made from the constructed frequency distribution
• The height of each bar is in proportion to the frequency of its interval
• There is no gap between bars
  If an interval has 0 frequency, the height of the bar in the histogram is 0
• The widths of the bars must be identical because the widths of the intervals are identical
• The bars must be drawn in the same sequence as the intervals in the frequency distribution

The proportion of each class can also be displayed in the frequency distribution (or histogram)
Amount Spent ($)   Frequency   Relative Frequency
0 - < 100          40          0.40
100 - < 200        22          0.22
200 - < 300        15          0.15
300 - < 400        7           0.07
400 - < 500        5           0.05
500 - < 600        7           0.07
600 - < 700        0           0.00
700 - < 800        2           0.02
800 - < 900        2           0.02
900 - < 1000       0           0.00
Total              100         1.00

Use of Excel in Organizing Data
PivotTable
• A PivotTable can be used to create a summary table for categorical variables
• Steps to create a pivot table manually
1. Click PivotTable in the Insert ribbon. In the Create PivotTable dialog box, enter or confirm the address in the Table Range as the data that you want to analyze. Choose Existing worksheet and specify a location (A1, say) for the PivotTable report to be placed
2. Drag Age to the Row Labels area, and Age to the Values area to create the frequency table
   • The default reported value for a categorical variable is Count. This can be changed from the dropdown list under the Values area
   • Grand Total is included by default
3. If the relative frequency is also wanted, drag Age into the Values field again. From the dropdown list of Count of Age 2, select Value Field Setting, then set Show Value As % of Grand Total. Enter "% of Total" into the Custom Name box
   If you do not see the Field List, click Field List in the Show group under the Analyze ribbon of PivotTable Tools

Histogram
• Use the “max” and “min” functions to compute the range
• Define the upper limit of each class
• Use the Excel “Data Analysis” Add-Ins tool to create a histogram
  File > Options > Add-Ins > click “Go” at the bottom > check “Analysis ToolPak” and click “OK”
  You can find “Data Analysis” in the “Data” menu bar
  For Mac: Tools > Add-Ins > check “Analysis ToolPak” and click “OK” > select “Data”, then select “Data Analysis”
• Choose “Histogram” in the “Data Analysis” browser
• State the “Input range” and “Bin range”, check the required boxes, and click the “OK” button
  “Bin range” refers to the upper limits
  - Class 1: larger than 0 but less than or equal to 250
  - Class 2: larger than 250 but less than or equal to 500
  - ...
• Double-click the bars on the chart, and adjust the gap width to 0%

Faulty Graphs: Chart Junk
[Figure: bad presentation versus good presentation of “Nominal Wage Index in Hong Kong (Sept 1992=100)”; values shown: 2009: 157.9, 2010: 163.1, 2011: 178.3, 2012: 187.5; axes: Nominal Wage Index by Year]

Faulty Graphs: No Relative Basis
[Figure: bad presentation versus good presentation of “A's Obtained by Students in CB2345”; one chart shows Frequency and the other Percentage, by Programme (Accountancy, Business Analysis, Business Operations Management, Information Management, Marketing)]

Faulty Graphs: Compressing the Vertical Axis
[Figure: bad presentation versus good presentation of “Unemployment Rate in Hong Kong”, 2008 to 2012; one vertical axis runs from 0.0% to 20.0% and the other from 0.0% to 6.0%]

Faulty Graphs: No Zero Point on the Vertical Axis
[Figure: bad presentation versus good presentation of “Hang Seng Index”]

Principles of Excellent Graphs
• The graph should not distort the data
• The graph should not contain unnecessary adornments (sometimes referred to as chart junk)
• The scale on the vertical axis should begin at zero
• All axes should be properly labeled
• The graph should contain a title
• The simplest possible graph should be used for a given set of data

Summary Definitions
• The central tendency is the extent to which all the data values group around a typical or central value
• The variation is the amount of dispersion or scattering of values
• The shape is the pattern of the distribution of values from the lowest value to the highest value

Measures of Central Tendency
What is Median? Why is it being reported here?
Source: Headline Daily, 4 April 2023, https://paper.stheadline.com/index.php?product=Headline&issue=20230404&vol=2023040401&token=0cc6a3c260dc1fe2&page=4

Central Tendency
• Mean: $\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$ (sample), $\mu = \frac{\sum_{i=1}^{N} X_i}{N}$ (population)
• Median: middle value in the ordered array
• Mode: most frequently observed value

Measures of Central Tendency: The Mean
• Sample mean (pronounced x-bar), where n is the sample size:
  $\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{X_1 + X_2 + \cdots + X_n}{n}$
• Population mean (pronounced mu), where N is the population size:
  $\mu = \frac{\sum_{i=1}^{N} X_i}{N} = \frac{X_1 + X_2 + \cdots + X_N}{N}$
• The most common measure of central tendency
• Affected by extreme values (outliers)
• Best to use when the distribution of the data values is symmetrical and there are no clear outliers
Example: the values 1, 2, 3, 4, 5 have mean 15/5 = 3, while the values 1, 2, 3, 4, 10 have mean 20/5 = 4

Measures of Central Tendency: The Median
• Robust measure of central tendency
• In an ordered array, the median is the “middle” number (50% above, 50% below)
  - If n or N is odd, the median is the middle number
  - If n or N is even, the median is the average of the 2 middle numbers
• Best to use when the distribution of data values is skewed or when there are clear outliers
[Figure: two dot plots on a 0 to 10 scale, each with Median = 3]

Measures of Central Tendency: The Mode
• Value that occurs most often
• Not affected by extreme values (outliers)
• Used for both numerical and categorical data
• There may be no mode
• There may be several modes
[Figure: two dot plots; one on a 0 to 14 scale with Mode = 9, one on a 0 to 6 scale with No Mode]

Effects of Outliers
Imagine that the five graduating seniors on a college basketball team receive the following first-year contract offers to play in the National Basketball Association (zero indicates that the player did not receive a contract offer):
0   0   0   0   $3,500,000
The mean contract offer is:
$\text{mean} = \frac{0 + 0 + 0 + 0 + \$3{,}500{,}000}{5} = \$700{,}000$
• Is it therefore fair to say that the average senior on this basketball team received a $700,000 contract offer?
• Why does the Census & Statistics Department report the median household income?
• When the Hong Kong Housing Authority revises the rent of public housing, it needs a reference for the average income of tenants. Should it consider the mean or the median?
• Is the median of a variable always smaller than its mean?

Comparison of Mean, Median & Mode

Measures of Variation
What is Lower Quartile? Why do we need it?
Variation: Range, Interquartile Range, Variance, Standard Deviation
• Measures of variation give information on the spread or variability or dispersion of the data values
[Figure: distributions with the same center but different variation]

Measures of Variation: The Range
• Simplest measure of variation
• Difference between the largest and the smallest values:
  $\text{Range} = X_{\text{largest}} - X_{\text{smallest}}$
• Ignores the way in which data are distributed
  [Figure: two differently distributed data sets on a 7 to 12 scale; in both, Range = 12 - 7 = 5]
• Sensitive to outliers
  1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5       Range = 5 - 1 = 4
  1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120     Range = 120 - 1 = 119

Quartiles
Quartiles split the ranked data into 4 segments with an equal number of values per segment (25% each, separated by Q1, Q2 and Q3)
• The first quartile, Q1, is the value for which 25% of the observations are smaller and 75% are larger
• Q2 is the same as the median (50% of the observations are smaller and 50% are larger)
• Only 25% of the observations are greater than the third quartile, Q3

Calculating the quartile position for n values:
• Q1 position: $r = \frac{n+1}{4}$
• Q2 position: $r = \frac{n+1}{2}$
• Q3 position: $r = \frac{3(n+1)}{4}$
If r is a whole number, it is the ranked position to use
When r is not a whole number, the following linear interpolation steps can be used to determine the quartile value
1. $d = r - [r]$, where $[r]$ is the integer part of r
2. Quartile value $= X_{[r]} + d \, (X_{[r]+1} - X_{[r]})$, where $X_{[r]}$ is the value at the $[r]$-th ranked position

Sample Data in Ordered Array: 11 12 13 16 16 17 18 21 22
Quartile   Quartile Position   Quartile Value
Q1         (9+1)/4 = 2.5       12 + 0.5*(13 - 12) = 12.5
Q2         (9+1)/2 = 5         16
Q3         3(9+1)/4 = 7.5      18 + 0.5*(21 - 18) = 19.5

Quartiles – Exercise
Data in ordered array: 3 6 7 7 9 12
Quartile   Quartile Position   Quartile Value

Measures of Variation: Interquartile Range
Interquartile range IQR = Q3 - Q1; it measures the spread in the middle 50% of the data
• The interquartile range is also called the midspread because it covers the middle 50% of the data
• Not influenced by outliers or extreme values
• Usually, values that fall outside the range [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are considered outliers

Sample Data in Ordered Array: 11 12 13 16 16 17 18 21 22
Quartile   Quartile Value
Q1         12.5
Q2         16
Q3         19.5
Interquartile range = 19.5 - 12.5 = 7
Since Q1 - 1.5*IQR = 12.5 - 1.5*7 = 2 and Q3 + 1.5*IQR = 19.5 + 1.5*7 = 30, no data value falls outside [2, 30]. There is no outlier in this sample

Measures of Variation: Variance
• Most preferred measure of variation due to its mathematical properties
• It shows the variation of each value from the mean
• Sample variance: $S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$
• Population variance (pronounced sigma squared): $\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$

Measures of Variation: Standard Deviation
• It is the square root of the variance. It has the same units as the original data
• Sample standard deviation: $S = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}$
• Population standard deviation (pronounced sigma): $\sigma = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}}$
[Figure: two distributions centred at $\mu$ or $\bar{X}$, one with a smaller and one with a larger standard deviation]
• Smaller standard deviation means most values of X are closer to its mean value.
  Larger standard deviation means the values of X are more spread out

Note: all the data sets are random samples from the population
[Figure: three dot plots on an 11 to 21 scale]
• Data A: $\bar{X}$ = 15.5, s = 3.338
• Data B: $\bar{X}$ = 15.5, s = 0.926
• Data C: $\bar{X}$ = 15.5, s = 4.570

Measures of Variation
• The more the data are spread out, the greater the range, variance, and standard deviation
• The more the data are concentrated, the smaller the range, variance, and standard deviation
• If the values are all the same (no variation), all these measures will be zero
• None of these measures is ever negative

Which measure of variation should be used?
• Variance (or standard deviation) is more often used due to:
  - Its nice mathematical features.
    It is easier and more reliable (statistically) to use the sample variance to infer the population variance
  - For most distributions (not too skewed), a majority (around 65%) of all values are within one standard deviation on either side of the mean
  - For most distributions (not too skewed), a small minority (around 5%) of all values deviate more than two standard deviations on either side of the mean
• If a distribution is known to be very skewed, then another measure of variation should be used

Example:
• Stock A, with an average price of $50 and a standard deviation of $10, is expected to trade between $30 and $70 (mean ± 2 standard deviations) approximately 95% of the time
• Stock B, with an average price of $50 and a standard deviation of $1, is expected to trade between $48 and $52 approximately 95% of the time
• Stock B is considered safer and more reliable (lower volatility)

Distribution Shape
Data sets may have similar central tendency measures and similar standard deviations, but differ in shape

The Skewness
• The skewness measures the extent to which data values are not symmetrical
• It equals 0 if the distribution of the variable is symmetrical
• It is less than 0 if the distribution is left-skewed (or negatively skewed), and larger than 0 if the distribution is right-skewed (or positively skewed)
• The skewness can help us to decide which type of central tendency (mean or median) is appropriate to use

Position of mean and median for a unimodal continuous distribution
Shape          Mean and Median   Skewness Statistic
Left-Skewed    Mean < Median     < 0
Symmetric      Mean = Median     = 0
Right-Skewed   Median < Mean     > 0
• If data are skewed, the median may be a more appropriate measure of central tendency

The Five Number Summary and Boxplot
The five numbers that help describe the center, spread and shape of data are
Xsmallest -- Q1 -- Median -- Q3 -- Xlargest
[Figure: boxplot; each of the four segments between Xsmallest, Q1, Median, Q3 and Xlargest holds 25% of the data; the left whisker ends at Xsmallest or Q1 - 1.5 (IQR) and the right whisker at Xlargest or Q3 + 1.5 (IQR)]

Distribution Shape and Boxplot
[Figure: boxplot marking Xsmallest, Q1, Q2, Q3 and Xlargest, with three pairs of segments labelled A and B, C and D, E and F]
• If A = B, then Symmetric; if A > B, then Left-Skewed; if A < B, then Right-Skewed
• If C = D, then Symmetric; if C > D, then Left-Skewed; if C < D, then Right-Skewed
• If E = F, then Symmetric; if E > F, then Left-Skewed; if E < F, then Right-Skewed
Look at all three pairs of comparisons and go with the majority
[Figure: boxplots of left-skewed, symmetric and right-skewed distributions, each marking Q1, Q2 and Q3]

Boxplot Example
Data: 0 2 2 2 3 3 4 5 5 9 27
Xsmallest = 0, Q1 = 2, Q2 = 3, Q3 = 5, Xlargest = 27
The data are right skewed, as the plot depicts

Calculating Descriptive Statistics in Excel
The preparation time for the examination of 12 randomly selected students (in days):
5 21 18 9 4 17 11 28 19 2 18 22
Remarks:
• For a population data set, use STDEV.P and VAR.P to compute the population standard deviation and population variance, respectively
• Use only the QUARTILE.EXC function for computing the quartiles.
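The quartile position rule from the Quartiles slides, r = k(n + 1)/4 with linear interpolation, can be sketched by hand in Python. It is assumed here, not confirmed by the slides, that this (n + 1)-based rule is the convention behind QUARTILE.EXC. The function name `quartile` is illustrative, and the sketch assumes at least three observations so that every position falls inside the data.

```python
import math


def quartile(sorted_values, k):
    """k-th quartile (k = 1, 2, 3) using the r = k(n + 1)/4 position rule."""
    n = len(sorted_values)
    r = k * (n + 1) / 4                  # ranked position, counted from 1
    lower = math.floor(r)                # [r], the integer part of r
    d = r - lower                        # fractional part used to interpolate
    value = sorted_values[lower - 1]     # X_[r] (Python lists start at 0)
    if d > 0:
        value += d * (sorted_values[lower] - sorted_values[lower - 1])
    return value


# Worked example from the Quartiles slide (already in ascending order)
sample = [11, 12, 13, 16, 16, 17, 18, 21, 22]

q1, q2, q3 = (quartile(sample, k) for k in (1, 2, 3))
iqr = q3 - q1

# Values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are flagged as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in sample if x < lower_fence or x > upper_fence]

print("Q1, Q2, Q3:", q1, q2, q3)
print("IQR:", iqr, " fences:", lower_fence, upper_fence)
print("Outliers:", outliers)
```

For this sample the sketch reproduces the slide values: Q1 = 12.5, Q2 = 16, Q3 = 19.5, IQR = 7, fences [2, 30], and no outliers.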
  The QUARTILE.INC function (or QUARTILE function) will not be used in this course

Use of the Excel “Data Analysis” Add-Ins tool to find descriptive measures
• File > Options > Add-Ins > click “Go” at the bottom > check “Analysis ToolPak” and click “OK”
• You can find “Data Analysis” in the “Data” menu bar
• Choose “Descriptive Statistics” in the “Data Analysis” browser
[Figure: Descriptive Statistics dialog showing the data cells, the output cell, and the option to generate descriptive statistics]

Drawing Boxplot
• Select the data range
  The data set does not need to be sorted first in Excel
• Insert > click “Recommended Charts” in the Charts group > select “Box & Whisker” from All Charts and click “OK”
• Click “Add Chart Element” and “Data Labels” to show the mean and quartile values on the plot
• Example 1: Refer to the students' preparation time for the examination data set
• Example 2: Compare the number of hours spent on study per week by female and male students from a sample of 710 students
  - Two columns of data, one for each gender
  - Include the column title when selecting the data range for plotting so that the column title can be used as the legend for the plot
  - Outliers (values outside [Q1-1.5*IQR, Q3+1.5*IQR]) are shown by default. The shown min and max points exclude these outliers
  - It is common not to show the 5 statistics on the plot
  The maximum number of hours spent on study is similar between the male and female students. The median and the quartiles are much lower for the male students. The plot shows that female students tend to spend more hours studying than male students

Calculating Descriptive Statistics in Calculator (For Casio fx-50F)
Data Set:
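As a cross-check on the Excel or calculator output for the students' preparation-time data, the same sample measures can be computed with Python's standard `statistics` module. This is a sketch rather than the course's method: it assumes Python 3.8 or later, and it relies on `statistics.quantiles` with `method="exclusive"`, which uses (n + 1)-based positions like the quartile rule taught above.

```python
import statistics

# Preparation time (in days) of the 12 randomly selected students
prep_days = [5, 21, 18, 9, 4, 17, 11, 28, 19, 2, 18, 22]

# Measures of central tendency
mean = statistics.mean(prep_days)
median = statistics.median(prep_days)
mode = statistics.mode(prep_days)

# Measures of variation (sample versions, dividing by n - 1)
data_range = max(prep_days) - min(prep_days)
sample_variance = statistics.variance(prep_days)
sample_sd = statistics.stdev(prep_days)

# Quartiles with (n + 1)-based positions
q1, q2, q3 = statistics.quantiles(prep_days, n=4, method="exclusive")
iqr = q3 - q1

print("Mean:", mean, " Median:", median, " Mode:", mode)
print("Range:", data_range, " IQR:", iqr)
print("Sample variance:", round(sample_variance, 3))
print("Sample standard deviation:", round(sample_sd, 3))
print("Q1, Q2, Q3:", q1, q2, q3)
```

For a population data set, `statistics.pvariance` and `statistics.pstdev` play the role that VAR.P and STDEV.P play in Excel.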
