SSF1093 Week 1 Introduction to Statistics 2023 PDF
Universiti Malaysia Sarawak
2023
Summary
This document provides an introduction to statistics for SSF1093 students. It covers topics such as the aims of statistics, data types, variables, and different kinds of analysis. The document also includes examples and exercises, offering a comprehensive overview of the SSF 1093 Introduction to statistics course.
Full Transcript
Week 1

What is statistics?
Statistics is the science of conducting studies to collect, organize, summarize, analyze, and draw conclusions from data. Statistics uses information, numbers, and visual graphics to summarise collected data and its interpretation.

Table 2: Labour Force Outcome by Gender and STEM Fields of Study

                                                 Labour Force Outcome
STEM Fields of Study                             Employed       Unemployed    Outside       Total
Male:                                            3054 (92.0)    75 (2.3)      192 (5.8)     3321
  Science, mathematics and computer              942 (91.7)     21 (2.0)      64 (6.2)      1027
  Engineering, manufacturing and construction    1734 (92.1)    43 (2.3)      106 (5.6)     1883
  Agriculture and veterinary                     104 (88.1)     3 (2.5)       11 (9.3)      118
  Health and welfare                             274 (93.5)     8 (2.7)       11 (3.8)      293
Female:                                          2254 (82.9)    122 (4.5)     343 (12.6)    2719
  Science, mathematics and computer              1089 (81.3)    67 (5.0)      183 (13.7)    1339
  Engineering, manufacturing and construction    634 (83.1)     33 (4.3)      96 (12.6)     763
  Agriculture and veterinary                     75 (75.0)      5 (5.0)       20 (20.0)     100
  Health and welfare                             456 (88.2)     17 (3.3)      44 (8.5)      517
Note: Figures in parentheses refer to percentages.

The study of statistics involves math and relies upon calculations of numbers. But it also relies heavily on how the numbers are chosen and how the statistics are interpreted.

Do you agree with the interpretations of data below?
1) A new advertisement for Anchor's butter introduced last October resulted in a 30% increase in butter sales over the following three months. Thus, the advertisement was effective.
2) 75% more interracial marriages are occurring this year than 20 years ago. Thus, our society accepts interracial marriages.

An incumbent politician might say, "During my administration, expenditures increased a mere 3%." His opponent, who is trying to unseat him, might say, "During my opponent's administration, expenditures have increased a whopping $6,000,000."
– Here both figures are correct; however, expressing a 3% increase as $6,000,000 makes it sound like a very large increase. Here again, ask yourself: which measure better represents the data?
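To see why both statements can be correct at once, it may help to work out the budget they jointly imply. The short Python sketch below assumes that the 3% and the $6,000,000 describe exactly the same change; the variable names are illustrative only and do not come from the lecture.

```python
# Figures quoted in the text: the same increase described two ways.
pct_increase = 3            # "a mere 3%"
abs_increase = 6_000_000    # "a whopping $6,000,000"

# Assumption: both figures describe the same change, so
# abs_increase = base * pct_increase / 100, which gives the starting budget.
base = abs_increase * 100 / pct_increase
new_total = base + abs_increase

print(f"Implied starting expenditure: ${base:,.0f}")
print(f"Expenditure after the increase: ${new_total:,.0f}")
```

Under that assumption the starting expenditure is $200,000,000, so $6,000,000 sounds large in isolation but is small relative to the base. Neither number is wrong; they answer different questions.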
Household income between Chinese and Bumiputra is widening

Median household income          1989     2019     Growth (1989-2019)
Chinese                          1,180    7,400    530%
Bumiputra                        680      5,400    700%
Difference                       500      2,000
Share, Bumiputra to Chinese      58%      73%

12MP points out that the absolute difference has increased 4 times, so the income gap between ethnic groups has widened in the past 3 decades. Technically, it is true, but …..

But it has different meaning to people of different backgrounds and interests. Consider:
i. Will the job market be strong when I graduate, given that 70% of fresh graduates are unemployed now?
ii. What are the chances of finding a job with a degree in social sciences when I graduate later?
iii. With a vaccination rate of 80% among people above 18 years old in Malaysia, should we allow more freedom of movement within Malaysia?
iv. Will the total fertility rate improve if a friendlier working environment, such as flexible working hours and childcare facilities, is implemented in Malaysia?

People are making decisions in the face of uncertainty. Different people use the information in different ways. How do we make sound decisions under uncertainty?

Based on the trend of women's labour force participation rates, do you think a woman should invest in higher education?

[Figure: The Labour Force Outcome by Marital Status and Gender. Bars for married women, married men, unmarried women and unmarried men; horizontal axis in % (0 to 100); categories: Employed, Unemployed, Outside.]

Aims of statistics
Statistics is a TOOL to help process, summarise, analyse, display and interpret data so that we can draw conclusions and make decisions from collected data.
– 1. To extract information from data using statistical techniques (make sense of data)
– 2. To report findings from data: summarise and analyse the available data in a useful and informative manner
Quantify the collected data into numerical form so that statistical techniques can be used in the analysis.
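Returning to the household income table earlier in this section, the growth rates, the absolute gap and the Bumiputra-to-Chinese share can all be recomputed from the four median incomes. The sketch below is one way to do that in Python; the function and variable names are illustrative, not part of the lecture.

```python
# Median household incomes from the table: (1989, 2019).
median_income = {"Chinese": (1180, 7400), "Bumiputra": (680, 5400)}

def growth_pct(start, end):
    """Percentage growth from start to end."""
    return (end - start) * 100 / start

growth = {group: growth_pct(*years) for group, years in median_income.items()}

# Absolute gap (Chinese minus Bumiputra) in each year.
gap_1989 = median_income["Chinese"][0] - median_income["Bumiputra"][0]
gap_2019 = median_income["Chinese"][1] - median_income["Bumiputra"][1]

# Relative position: Bumiputra median as a percentage of Chinese median.
share_1989 = median_income["Bumiputra"][0] * 100 / median_income["Chinese"][0]
share_2019 = median_income["Bumiputra"][1] * 100 / median_income["Chinese"][1]

for group, g in growth.items():
    print(f"{group}: growth of about {g:.0f}%")
print(f"Gap: {gap_1989} -> {gap_2019} ({gap_2019 / gap_1989:.0f} times)")
print(f"Share: {share_1989:.0f}% -> {share_2019:.0f}%")
```

The unrounded growth figures are about 527% and 694%, which the table rounds to 530% and 700%. The absolute gap grows fourfold while the share rises from about 58% to 73%, so whether the gap has "widened" depends on whether it is measured in absolute or relative terms.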
Knowing statistical techniques and having a basic understanding of statistics is useful in daily life and helps you make decisions in your best interest.
– Suppose a pharmacist tells you that 'taking calcium will lower blood pressure in some people'. Will you buy the calcium he or she recommends?

Gathering Data
Where do data come from? A study.
– What are the critical skills to remain relevant and employable with the advent of AI and automation?
– What is the most appealing name for a new brand of smart phone?

What is a study?
A research process in which information is recorded for a group of observation units (people, plants, animals, non-living objects).
Data refers to the information that has been collected from an experiment, a survey, an observation or a historical record (secondary data).
Data are used to answer questions, for instance, the time spent on different household activities, or the effectiveness of a new drug in controlling the spread of covid-19.

Do we have to collect the required data ourselves to answer the objectives of a study?

Primary data versus Secondary data
– Asking customers at a shopping mall about their voting intentions in the upcoming Sarawak election is an example of secondary data. True?
– Analysing unemployment data from the Labour Force Survey collected by the Department of Statistics is an example of secondary data. True?
– A hotel employee asks a customer who is checking out to rate his satisfaction on a scale of 1-10. This is an example of collecting primary data. True?
Limitation of secondary data: you have no control over how the data are collected. Advantage: it saves time and costs.

Cross-section and time series data
Cross-section data: data collected on different elements at the same point in time or for the same period of time.
Time-series data: data gathered for different periods of time.
i. Water bills of a family for each month of 2021.
ii. Number of covid-19 infected cases in each state in 2020.
iii. Gross sales of face masks in the first quarter of 2021.
Why do statistics?
– The annual earnings of undergraduates exceed, on average, those of secondary graduates by RM1500.
– On the basis of existing research, there is no conclusive evidence of a negative relationship between time spent gaming and test scores.
– Heavy users of tobacco suffer significantly more respiratory disease than non-users.
– 60% of the voters interviewed expressed their support for GPS in the coming state election. The support for GPS to lead the state is estimated at 60% ± 15%.

Like professional people, you must be able to make sense of data, and to read and understand the various statistical studies performed in your field. To have this understanding, you must be knowledgeable about the vocabulary, symbols, concepts, and statistical procedures used in these studies.

In year 3, you are required to conduct your own study in your field. To accomplish this, you must be able to design your study; collect, organize, analyze, and summarize data; and possibly make reliable predictions or forecasts for future use. You must also be able to communicate the results of the study in your own words.

You can also use the knowledge gained from studying statistics to become a better consumer and citizen. For example, you can make intelligent decisions about what products to purchase based on consumer studies, about government spending based on the fiscal budget, and so on.

After analysing the results (factors), what can policy makers, firms and individuals do in order to retain highly educated women in the workforce?
– Improve an existing situation or process.
– Use statistics (evidence) generated from a smaller group of data to make inferences, estimates or extrapolations about a population.

Recap
Statistical results can be useful if they are correctly analysed and interpreted. But the results depend heavily on how trustworthy and reliable the collected data are.
Statistics is just a tool and technique to understand the world around us. Use it wisely. Use it appropriately.
It is a great tool for making decisions based on empirical evidence instead of one's beliefs or biases.
The statistical concepts that we are going to discuss in this lecture will help us understand which measurements or types of statistics we need to answer our research question. Each successive level builds upon the previous one.

Ways to summarise data

Types of Statistics/Statistical analysis
The study of statistics can be divided into two main groups:
– descriptive statistics
– inferential statistics

DESCRIPTIVE STATISTICS
Consists of methods for organising, displaying and describing collected data in meaningful ways, which allows simpler interpretation of the data.
Methods:
a. Tables and graphs
How many cases/observations are in the table? What does 37 indicate?
b. Summarising quantitative data numerically
– Measures of central tendency (location, position) (MCT). Statistics/indicators: mean, median, mode
– Measures of variability (spread/dispersion), describing how spread out the data are. Statistics: range, variance, standard deviation, inter-quartile range

Examples:
– A pet shop conducted a study on the number of fish sold each day for one month and found that an average of 10 fish were sold each day.
– Under the 12MP, the average household income target is set at RM10,000 a month or more.
– 30% of the voters would support Party AAA in the coming GE15.

Inferential statistics makes inferences about populations using sample data drawn from the population.
a. Suppose we want to know the average income of all households in Kuching division.
b. Suppose we are interested in the exam marks of all SPM students in Malaysia.
It isn't very practical to try and get the income of each household from a population of so many million residents. It is not feasible to measure the exam marks of all students in the whole of Malaysia.
Instead of using the entire population to gather the data, the statistician will collect a sample or samples from the millions of residents, or take a smaller sample of students (e.g., 1000 SPM students) to represent the larger population of all Malaysian students. The resulting data are then used to estimate the value of the population parameter.

Inferential Statistics
Inference is the process of generalising, drawing conclusions or making decisions about a population from a sample.
– A diet high in fruit and vegetables will lower blood pressure.
Use sample statistics to infer/generalise about population parameters.
a. Estimation: from a point estimate such as the sample mean, we estimate the population mean.
– For instance, only 7 out of 64 students taking SSF1093 are asked to give their GPA in 6th form.
– From the mean GPA of the 7 students, we can estimate the mean GPA of all 64 students. Estimating the population mean from a sample in this way is an example of inferential statistics.
b. Hypothesis testing: to test somebody's claim. For instance:
– Students, on average, sleep 1 hour less during exam time than during non-exam time.
– The alcohol content is less than 15ml in a 50ml bottle of beer.
– Our brand of crackers has one-third fewer calories than brand ABC.
c. Correlation and regression: analyse relationships between two or more variables
d. Prediction/Forecast

[Diagram: a story about a sample, generalised to a larger population]

Descriptive statistics or inferential statistics?
– The mean price of a single storey house in Kuching was RM200,000 in 2010.
– 95% of Malaysian school children aged 7-12 would have a smart phone by 2022.
– Out of 100 adults surveyed, most acknowledged that they do not wear a face mask at their workplace.
– Based on the trend of enrolment, the Dean at FSSH says that about 85% of the newly enrolled students at the faculty are girls.
– In a sample of 100 individuals, 36% think that watching television is the best way to spend an evening.
– A researcher concludes that drinking decaffeinated coffee can raise cholesterol levels by 7%.

Key concepts
– Population versus sample
– Primary versus secondary data
– Cross-sectional versus time series data
– Quantitative versus categorical data: nominal, ordinal, interval and ratio

Key Definitions
Population: every member of a defined group; the collection of all items of interest or under investigation.
Sample: an observed subset of the population.
Values calculated using population data are called parameters. Values computed from sample data are called statistics.

Population
A population can be small or large, as long as it includes the data of all elements/subjects being studied.
– In statistics, a population does not necessarily refer to people. A population could be anything, for example measurements of rainfall in a particular area or a batch of batteries.
For example, if you were interested in the exam marks of a group of 5 students, then those 5 students would be your population.
– Given the marks of the 5 students (65, 80, 70, 55, 100), the data can be summarised with descriptive statistics such as the mean or median.
– The mean and median values calculated from the entire population data are called PARAMETERS.

Sample
Suppose you want to know the average monthly expenses of all households in Sarawak. Is it practical to try and get the information from each household in Sarawak?
Instead of asking all the 10 million households, researchers usually take a smaller sample, let's say 500 households, to find out the mean expenses.
Sample: a group of subjects selected from a population.
Since the mean is calculated from sample data, the sample mean is known as a STATISTIC.
From the sample mean, using the techniques of inferential statistics, the researcher can make inferences about the mean monthly expenses of the entire 10 million households in Sarawak (the population mean).
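The distinction between a parameter and a statistic can be illustrated with the five exam marks given above. In the Python sketch below, the five marks are treated as the whole population; the sample size of 3 and the random seed are assumptions made only for illustration.

```python
import random
import statistics

# The five marks from the text, treated as the entire population.
marks = [65, 80, 70, 55, 100]

# PARAMETERS: computed from every member of the population.
pop_mean = statistics.mean(marks)        # 74
pop_median = statistics.median(marks)    # 70
pop_range = max(marks) - min(marks)      # 45
pop_variance = statistics.pvariance(marks)  # population variance, 234

# STATISTIC: computed from a sample drawn from the population.
# Assumption: a simple random sample of 3 students; the seed only makes
# the draw repeatable and has no other meaning.
random.seed(1)
sample = random.sample(marks, 3)
sample_mean = statistics.mean(sample)

print("Population mean (parameter):", pop_mean)
print("Sample:", sample)
print("Sample mean (statistic):", sample_mean)
```

The sample mean will usually differ from the population mean of 74, and a different sample would give a different value. Inferential statistics is about quantifying how far such a statistic is likely to be from the parameter it estimates.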
Examples:
Select 100 students currently enrolled at FSS and collect data on the programme they are enrolled in.
– Identify the population of interest in the study.
– What is the sample?

You have been hired by the Election Commission to examine how Malaysians feel about the fairness of the voting procedures in Malaysia. Who will you ask?

Dr Sandra wants to know how students in her class did on their last test. She asks the 10 students sitting in the front row to state their latest test score. She concludes from their report that the class did extremely well.
– What is the sample?
– What is the population?
– Can you identify any problems with choosing the sample in the way that Dr Sandra did?

A sample is typically a small subset of the population. In choosing a sample, it is therefore crucial that it does not over-represent one kind of citizen at the expense of others.

Identify each of the following as taken from a population or a sample.
– A list of all registered voters' names in the Kuching Division.
– Incomes of 100 families living in Kuching.
– Grade point averages of all students in FSS.
– Number of cars parked at one of the car parks in UNIMAS.
– Number of cars parked from 10am to 1pm at the car park at FSSH.
– Math marks for 20 students in Dr Ali's class.

In each situation, indicate whether the value given in bold print is a statistic or a parameter.
1. Of 10 students sampled from a class of 200, 8 (80%) said they would like the school library to have longer hours.
2. The mean CGPA of all second year students at FSSH was reported as 2.95 in academic year 2016.
3. Based on a census conducted by DOSM, 39.5% of the Kuching children who are 5 years old speak languages other than English at home.
4. A study involving 500 individuals with hypertension is conducted and it is found that 80% of the individuals are able to control their hypertension with the drug.
Variable: characteristics of an observation unit
A variable holds a value or quantity that can be measured or counted, or simply a description of what is being studied in the sample or population.
Examples: educational attainment; height; time taken to finish a question; CGPA for STPM/Matriculation; grade for SPM English; method of payment of shoppers; weight of a bottle of 100 Plus; number of children in a household.
Why is it called a variable? The observation or value may vary between observations (units) in the population or sample, and may change in value over time.

Types of variables / Data
– Categorical: defined categories, groups or attributes; non-numeric values. Examples: marital status; eye colour; condition (good, acceptable, poor).
– Numerical: a measurable quantity.
  – Discrete (how many). Examples: number of children; defects per hour; total students in FSS.
  – Continuous (how much): measured characteristics; any value so long as the instrument of measurement allows. Examples: weight; voltage.

1. Which of the following statements are true?
I. Categorical variables are the same as qualitative variables.
II. Categorical variables are the same as quantitative variables.
III. Quantitative variables can be continuous variables.
(A) I only (B) II only (C) III only (D) I and II (E) I and III

2. A researcher sends out a survey to learn about people's water usage habits. Some of the questions included in the survey are given below.
Q1. How many times a week do you take a shower?
Q2. Do you leave the water running when you brush your teeth?
Q3. When you water your lawn, how long do you let the water run?
For each question, determine if it leads to categorical responses or quantitative responses.
Categorize the following variables as (i) qualitative or quantitative and, (ii) for quantitative variables, as discrete or continuous:
– Response time
– Rating of job satisfaction
– Occupation aspired to after completing a bachelor degree
– Number of words remembered
– Number displayed on a footballer's jersey
– Choice on a test item: true or false
– Clothing size: 6, 8, 10, 12, 14, 16, 18
– Average daily temperature
– Household income
– Primary student's daily expenses

Data and Level of measurement
The level of measurement refers to the relationship among the values that are assigned to the attributes of a variable. It helps you decide what statistical analysis is appropriate for the values that were assigned.

NOMINAL variable: takes qualitative values that do not have any ordering (ranking) relationship among them. Has at least two categories.
– Gender: male and female
– Brand of mobile phone: Nokia, Apple iPhone, Samsung Galaxy, Motorola
– Property owned: house, apartment, condominium
In nominal measurement, we use the values as a shorter name for the attribute.

ORDINAL variable: takes qualitative values that have an ordering relation among them.
– Class position: first, second, third… last.
– Size of T-shirt: XS, S, M, L, XL
– Do you like the minimum wage policy implemented by the government? Yes, a lot [ ] It is OK [ ] Not very much [ ]
The values assigned to each attribute show the ranking, but this does not imply that the distance between 1 and 2 is equal to the distance between 3 and 4. The interval between values is not interpretable in an ordinal measure.

INTERVAL variable: takes quantitative values but does not have a 'true' zero value. The values can be ranked and the difference between two values is meaningful.
– IQ level: 80, 160, 240
– Temperature: 50F, 100F; 30C
Example: 40F - 30F = 20F - 10F. But you can't say 40F is twice as hot as 20F.

RATIO variable: takes quantitative values and has a 'true' zero value.
– Height: 20cm, 150cm, 180cm
– Hours of revision: 0, 1.5 hrs, 5 hrs
– Monthly expenses on food: Ahmad spent RM100 a month while Brian spent RM50. Ahmad's spending is 2 times Brian's.

In each of the following cases, decide if the variable is of nominal, ordinal or numerical type:
a) The country of a tourist coming to Malaysia.
b) The grade obtained by a student in an examination, classified as A+, A, B, C or D.
c) The age of an individual.
d) Price of a share of a public limited company at a stock exchange.
e) Impression of a foreign tourist about Malaysia, classified as Fantastic, Good, Average, Bad, Terrible.
f) The educational level of a worker, classified as illiterate, primary level, secondary level and post-secondary level.

Basic Vocabulary of Statistics
POPULATION: A population consists of all the items or individuals about which you want to draw a conclusion.
SAMPLE: A sample is the portion of a population selected for analysis.
PARAMETER: A parameter is a numerical measure that describes a characteristic of a population.
STATISTIC: A statistic is a numerical measure that describes a characteristic of a sample.

Qualitative data are labels or names used to identify an attribute of each element. Qualitative data use either the nominal or ordinal scale of measurement. Statistical analysis for qualitative data is rather limited: use frequencies and percentages.

Quantitative data indicate either how many or how much.
– Quantitative data that measure how many are discrete.
– Quantitative data that measure how much are continuous because there is no separation between the possible values for the data.
Quantitative data are always numeric. Arithmetic summaries (such as mean, median, range, etc.) are meaningful only with quantitative data.
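The difference in what can be computed for qualitative and quantitative data can be seen in a small Python sketch. The responses below are hypothetical values invented purely for illustration (they are not from the lecture), using two variables mentioned earlier: method of payment and number of children in a household.

```python
from collections import Counter
import statistics

# Hypothetical responses, invented for illustration only.
payment_method = ["cash", "card", "cash", "e-wallet", "cash"]  # qualitative (nominal)
num_children = [0, 2, 1, 3, 2]                                 # quantitative (discrete)

# Qualitative data: summarise with frequencies and percentages.
freq = Counter(payment_method)
pct = {category: count * 100 / len(payment_method)
       for category, count in freq.items()}

# Quantitative data: arithmetic summaries are meaningful.
mean_children = statistics.mean(num_children)
median_children = statistics.median(num_children)
range_children = max(num_children) - min(num_children)

print("Frequencies:", dict(freq))
print("Percentages:", pct)
print("Mean:", mean_children, "Median:", median_children, "Range:", range_children)
```

Asking for the "mean payment method" would make no sense, which is why only counts and percentages are reported for the qualitative variable, while the mean, median and range are reported for the number of children.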