STA111 Week 1 and 2 Descriptive Statistics PDF
Summary
This document provides an overview of descriptive statistics, covering topics such as data definitions, variables, and frequency distributions. It details the different types of data and statistical applications in various fields.
STA 111 DESCRIPTIVE STATISTICS
WEEK 1 AND 2
DEFINITION: DATA, VARIABLE AND FREQUENCY DISTRIBUTION

COURSE OUTLINE
1. Definition of statistical data; types, sources and methods of collection.
2. Methods of data presentation: tables, charts and graphs.
3. Measures of central tendency.
4. Frequency and cumulative distributions.
5. Grouped and ungrouped data.
6. Measures of location, partition, dispersion, skewness and kurtosis.
7. Rates, ratios and index numbers.
8. Theory of probability.

OBJECTIVES
By the end of this module, you should be able to:
1. explain the basic concepts of descriptive statistics.
2. present data in graphs and charts.
3. differentiate between measures of location, dispersion and partition.
4. describe the basic concepts of skewness and kurtosis as well as their usefulness for a given data set.
5. differentiate rates from ratios and explain how they are used.
6. compute the different types of index numbers from a given data set and interpret the output.
7. apply the basic concepts of probability.

RELATIONSHIP BETWEEN DATA AND STATISTICS
Datum is the singular form of data. Data are factual information used for the purpose of analysis; they are the raw information from which statistics are created. Statistics are the results of data analysis, its presentation and interpretation. In other words, some computation has taken place that provides some understanding of what the data mean. Statistics often go beyond the mere presentation of data in a table, chart or graph, and sometimes require rigorous computation. Both statistics and data are frequently used in scholarly research. Statistics are often reported by government agencies, for example unemployment statistics or educational literacy statistics. These types of statistics are often referred to as statistical data.
Statistics is the science (and art) of collecting and analyzing observations for the purpose of learning about ourselves, our surroundings, and our universe at large. Data are raw facts and the building blocks of statistics. Data play a pivotal role in our ...

DATA DESCRIPTION

BASIC DEFINITION
Statistics is concerned with scientific methods for collecting, organizing, presenting and analysing data, as well as with drawing valid conclusions and making reasonable decisions on the basis of such analysis.

TYPES OF STATISTICS
1. Descriptive statistics: deals with methods of organising, summarising and presenting data in a convenient and informative way.
2. Inferential statistics: a body of methods used to draw conclusions or inferences about characteristics of populations based on sample data.

APPLICATION OF STATISTICS
(a) Government: for example, the number of workers (male and female), the amounts to be paid over the years, the equipment required, etc. Other examples include local, state and federal information.
(b) Biological sciences.
(c) Physical sciences, etc.
N.B.: Statistics is applicable everywhere, i.e. to all aspects of human endeavour.

DATA AND VARIABLE

DATA
Data can be defined as pieces of information that represent the qualitative or quantitative attributes of a variable or set of variables. Data are typically the results of measurements and can be the basis of graphs, images or observations of a set of variables. Data are often viewed as the lowest level of abstraction from which information and knowledge are derived for statistical analysis.

TYPES OF DATA
There are two main types of data:
Primary data: data obtained through direct observation, interviews and questionnaires are called data from a primary source (primary data).
Secondary data: data obtained from published materials are called secondary data because they are obtained from a second-hand source (e.g. files).
SOURCES OF DATA
There are two main sources of data: primary and secondary. Data originally collected by the investigator in the course of an investigation are known as primary data. These data are more accurate and uniform, and their collection involves the supervision of the investigator. However, primary data collection is time- and labour-consuming. Secondary data are published or maintained by governmental or non-governmental organizations, such as the department of census/National Population Commission, the Bureau of Statistics, the department of health, agriculture and fisheries departments, and official publications of the United Nations, the World Health Organization (WHO), UNESCO, etc.

VARIABLE
A variable is any quality that can have a number of values, which may be either discrete or continuous. In other words, a variable is a property that can take on different values. Individuals in a class may differ in sex, age, intelligence, height, etc. These properties are variables.

TYPES OF VARIABLE
Continuous variable: A variable that can theoretically assume any value between two given values is called a continuous variable, e.g. height, weight.
Discrete variable: A variable that can be counted, or for which there is a fixed set of values, is called a discrete variable, e.g. number of students, number of cars. It can take only exact values.
Constant: If the variable can assume only one value, it is called a constant.
Quantitative variables: These variables assume values that vary in magnitude. They are easy to measure and compare, e.g. weight, height, age, distance, marks obtained in a test.
Qualitative variables: These variables differ in kind and can only be categorized, e.g. gender, nationality, socio-economic status, academic qualifications, marital status.

HOW TO COLLECT STATISTICAL DATA
There are various places where data can be collected. Such places include government, the biological sciences, the physical sciences, business, economics, and the social and demographic sectors.
METHODS OF DATA COLLECTION
There are various methods we can use to collect data. The method depends on the problem and the type of data to be collected. Some of these methods are:
1. Direct observation
2. Experiments
3. Interviewing
4. Questionnaire
5. Abstraction from published statistics

(1) Direct observation: Observational methods are used mostly in scientific enquiries where data are observed directly from controlled experiments. They are used more in the natural sciences, through laboratory work, than in the social sciences. They are, however, useful in studying small communities and institutions.
(2) Experiment: A means of collecting evidence to show the effect of one variable upon another. For example, a psychologist who wishes to develop an equation to predict a child's attention span on the basis of the child's age obtains pairs of observations (age, attention span) from a sample of ten children.
(3) Interviewing: The person collecting the data, called the interviewer, asks direct questions of the people from whom the data are wanted (the interviewees). The interviewer collects the required information verbally, either by going to the interviewees in person or by telephone.
(4) Questionnaire: A set of questions or statements is assembled to get information on a variable (or set of variables). The entire package of questions or statements is called a questionnaire.
(5) Abstraction from published statistics: These are pieces of data found in published materials, such as population figures or accident figures. This method of collecting data can be useful as a preliminary to other methods.

METHODS OF DATA PROCESSING
Data are often recorded numerically on data sheets. Unless the numbers of observations and variables are small, the data must be analysed by computer. The data then go through the following three steps:
1. CODING: The data are transferred, if necessary, to coded sheets for simplicity.
2. TYPING: The data are typed and stored by at least two independent data entry persons (double-key data entry).
3. EDITING: The data are checked by comparing the two independently typed data sets.
The standard practice for key-entering data from paper questionnaires is to key in all the data twice. Ideally, the second entry should be done by a different key operator whose job specifically includes verifying mismatches between the original and second entries. It is believed that this "double key/verification" method produces a 99.8% accuracy rate for total keystrokes.

ERROR IN DATA COLLECTION
Two major types of error can arise when a sample of observations is taken from a population: sampling error and non-sampling error.

Sampling error: refers to differences between the sample and the population that exist only because of the observations that happened to be selected for the sample. Example: suppose two samples of size 10 are selected from 1000 households. If we happened to get the highest income levels in the first sample and the lowest income levels in the second, the difference between the two samples is due to sampling error.
Note: Increasing the sample size will reduce this type of error.

Non-sampling error: these errors are more serious and are due to mistakes made in the acquisition of data or to the sample observations being selected improperly. Three types of non-sampling error are:
1. Errors in data acquisition
2. Non-response errors
3. Selection bias
Note: Increasing the sample size will not reduce this type of error.

1. Errors in data acquisition: arise from the recording of incorrect responses, due to:
(a) incorrect measurements being taken because of faulty equipment;
(b) mistakes made during transcription from primary sources;
(c) inaccurate recording of data due to misinterpretation of terms;
(d) inaccurate responses to questions concerning sensitive issues.
2. Non-response error: refers to error (or bias) introduced when responses are not obtained from some members of the sample, i.e. the sample observations that are collected may not be representative of the target population.
3. Selection bias: occurs when the sampling plan is such that some members of the target population cannot possibly be selected for inclusion in the sample.

DATA PRESENTATION

FREQUENCY DISTRIBUTIONS
When statistical data contain a large number of values, it is not practical to draw a bar chart and it is often difficult to calculate averages. Instead, it is more useful to express the data as a frequency distribution. Frequency distributions can show either the actual number of observations falling in each range or the percentage of observations.

GENERAL RULES FOR FORMING FREQUENCY DISTRIBUTIONS
1. Determine the largest and smallest numbers in the raw data and thus find the range (largest - smallest).
2. Divide the range into a convenient number of class intervals having the same size. If this is not feasible, use class intervals of different sizes or open class intervals. The number of class intervals is usually taken between 5 and 20, depending on the data.
3. Determine the number of observations falling into each class interval, that is, find the class frequencies. This is best done by using a tally or score sheet.

EXAMPLE 1: Consider the following numbers of chairs in each of 42 rooms.
8, 2, 4, 5, 2, 2, 3, 8, 5, 4, 6, 6, 3, 5, 7, 2, 7, 6, 7, 2, 7, 8, 2, 3, 1, 4, 6, 5, 4, 2, 7, 7, 3, 6, 1, 3, 5, 6, 5, 6, 7, 4
(i) Construct a frequency distribution table.
(ii) Find the number of rooms having one, five and eight chairs.

SOLUTION:
(i) STEP 1. Range = largest - smallest = 8 - 1 = 7.
STEP 2. Since the range is 7, which is less than 10, there is no need to divide it into class intervals (compare Example 2).
STEP 3.
Since there are no class intervals, we use the values 1, 2, 3, 4, 5, 6, 7, 8 directly. As 8 is the highest value it must be included, even though the range is 7, as shown below:

Number of chairs    Number of rooms (frequency)
1                   2
2                   7
3                   5
4                   5
5                   6
6                   7
7                   7
8                   3
Total               42

(ii) The number of rooms having one chair is 2, the number of rooms having five chairs is 6, and the number of rooms having eight chairs is 3.

EXAMPLE 2: In the following table, the weights of 40 male students at a university are recorded to the nearest kg. Construct a frequency distribution.
138 164 150 132 144 125 149 157 146 158
140 147 136 148 152 144 168 126 138 176
163 119 154 165 146 173 142 147 135 153
140 135 161 145 135 142 150 156 145 128

SOLUTION:
Range = largest - smallest = 176 - 119 = 57.
If 5 class intervals are used, the class-interval size is 57/5 = 11.4, or about 11.
If 10 class intervals are used, the class-interval size is 57/10 = 5.7, or about 6.
If 20 class intervals are used, the class-interval size is 57/20 = 2.85, or about 3.
One convenient choice for the class-interval size is 5. The required frequency distribution is shown below.

EXERCISE 1:
(1) The scores of a group of 80 students in an examination were recorded as follows:
82 56 68 74 86 80 83 91 70 67
76 92 86 65 81 61 63 65 62 73
68 66 78 66 81 82 63 65 93 71
62 84 78 72 71 70 76 80 61 59
93 87 71 73 77 88 58 70 79 55
70 69 68 56 87 82 67 58 87 71
78 68 72 72 77 86 77 80 90 69
71 75 76 81 81 48 72 76 78 75
(i) Find the range.
(ii) Decide on how many classes or groups you want and find the class interval.
(iii) Construct a frequency distribution.
(2) A survey was taken on River Park Estate in Abuja. In each of 20 homes, people were asked how many cars were registered to their households. The results were recorded as follows: 1, 2, 1, 0, 3, 4, 0, 1, 1, 1, 2, 2, 3, 2, 3, 2, 1, 4, 0, 0. Prepare a frequency distribution table.
(3) The daily number of rock climbers in Lake Louise, Alberta was recorded over a 30-day period. The results are as follows: 31, 49, 19, 62, 24, 45, 23, 51, 55, 60, 40, 35, 54, 26, 57, 37, 43, 65, 18, 41, 50, 56, 41, 54, 39, 52, 35, 51, 63, 42. Create a cumulative frequency distribution table for the data by grouping the data in class intervals of 10.

CUMULATIVE FREQUENCY DISTRIBUTION
A cumulative frequency distribution table is a more detailed table. It looks almost the same as a frequency distribution table, but it has added columns that give the cumulative frequency and the cumulative percentage of the results.

Example: At a recent chess tournament, all 10 participants had to fill out a form giving their name, address and age. The ages of the participants were recorded as follows: 36, 48, 54, 92, 57, 63, 66, 76, 66, 80. Prepare a cumulative frequency distribution table for the ages. Use the following steps to present these data in a cumulative frequency distribution table.

Class intervals: If a variable takes a large number of values, it is easier to present and handle the data by grouping the values into class intervals. Continuous variables are more likely to be presented in class intervals, while discrete variables can be grouped into class intervals or not. The endpoints of a class interval are the lowest and highest values that a variable can take within that interval. Class interval width is the difference between the lower endpoint of an interval and the lower endpoint of the next interval.

ASSIGNMENT
Represent the data given in the example above by drawing a cumulative frequency curve (ogive).
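The cumulative frequency table for the chess tournament ages can be sketched in Python. This is only an illustration under stated assumptions: the class width of 10 and the starting endpoint of 30 are choices made here and are not given in the notes, and the function name cumulative_frequency_table is hypothetical. The width is the difference between consecutive lower endpoints, as in the class-interval definition above.

```python
from collections import Counter


def cumulative_frequency_table(values, width, start):
    """Group integer values into equal-width class intervals and
    accumulate the class frequencies.

    width and start are assumptions chosen by the caller; start must
    not exceed the smallest value.
    Returns rows of (lower, upper, frequency, cumulative frequency,
    cumulative percentage).
    """
    if start > min(values):
        raise ValueError("start must not exceed the smallest value")

    # Class index k covers the interval [start + k*width, start + (k+1)*width - 1].
    counts = Counter((v - start) // width for v in values)
    n = len(values)

    rows = []
    running = 0
    for k in range(max(counts) + 1):
        lower = start + k * width
        upper = lower + width - 1  # inclusive endpoint for integer data
        freq = counts.get(k, 0)    # empty classes get frequency 0
        running += freq
        rows.append((lower, upper, freq, running, 100 * running / n))
    return rows


# Ages of the 10 chess tournament participants (from the example above).
ages = [36, 48, 54, 92, 57, 63, 66, 76, 66, 80]

# Assumed grouping: classes 30-39, 40-49, ..., 90-99.
table = cumulative_frequency_table(ages, width=10, start=30)

print("Class    Freq  Cum.freq  Cum.%")
for lower, upper, freq, cum, pct in table:
    print(f"{lower}-{upper}  {freq:>4}  {cum:>8}  {pct:5.1f}")
```

The last cumulative frequency equals the number of observations (10), which is a quick check on the table. The cumulative frequency column is also what an ogive plots against the upper class endpoints.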