Statistics Lecture Notes PDF
Summary
These lecture notes provide a foundational overview of statistics, including definitions of key terms, descriptions of descriptive and inferential statistics, and the different variable types. They introduce concepts such as data collection, numerical processing, and the presentation of results for analysis.
Full Transcript
Statistics

1 Introduction

Statistics is a set of techniques for collecting and processing data in order to extract useful information from them, from a qualitative or quantitative point of view.

CONTENT
Data collection.
Numerical processing of information.
Presentation of the results.

PURPOSE
Draw conclusions about the entire population, even when only the data of one or more samples are known.
Facilitate analysis and decision-making processes.

COMPONENTS
DESCRIPTIVE STATISTICS: the set of methods for collecting data, organizing them in tables and graphs, and summarizing them with indices that describe their essential characteristics.
INFERENTIAL STATISTICS: the set of methods for processing the data and deducing conclusions that go beyond the direct evidence of the data. More precisely, from the knowledge of one or more samples taken from the population we want to go back to the characteristics of the population itself. For example, suppose you want to know the speed of somatic growth of a particular animal or plant. It is clearly not possible to examine all the existing individuals of that species, but only some of them (a sample). However, the findings should not be limited to the few cases of the sample considered: they should be extended to the entire population, so that they acquire general validity and contribute to the construction of scientific theories.

TERMINOLOGY
SET: a collection of objects of any type (people, animals, plants, answers to questionnaires, symptoms, laboratory results, ...).
VARIABLE: a logical entity that can take different values, which constitute the output set. Generally X, Y, Z, T, ... indicate the variables; $x_1, x_2, \ldots$ indicate the values of the variable X; $y_1, y_2, \ldots$ indicate the values assumed by the variable Y, and so on for the other variables. There are two types of variables:
– QUALITATIVE VARIABLE: its values are generally classified into categories (colour of a rock, shape of a leaf, blood group).
– QUANTITATIVE VARIABLE: it is expressed by a numerical value and can be of two types:
  ∗ DISCRETE: it can take only integer values, so its values belong to a finite or countable set (number of children per family, number of trees per km²);
  ∗ CONTINUOUS: it can assume any value in a given interval (height, weight, pressure level).

POPULATION: the set of all the possible observable objects (individuals = objects).
SAMPLE: a finite subset of the population, on which one actually works.
FREQUENCIES: let X be the variable that describes the objects of the input set (population or sample). If X takes a given value $x_i$ (qualitative or quantitative) a number $f_i$ of times, then $f_i$ is called the ABSOLUTE FREQUENCY of $x_i$. Let $x_1, x_2, \ldots, x_r$ be the distinct observations (outputs) of the variable X and let $f_1, f_2, \ldots, f_r$ be the corresponding absolute frequencies; then

$$n = \sum_{i=1}^{r} f_i = f_1 + f_2 + \cdots + f_r \qquad (1)$$

is the total number of objects (individuals) of the input set (if the input set is a sample, then n is called the SAMPLE SIZE); moreover $p_i = f_i / n$ is called the RELATIVE FREQUENCY or PROPORTION of $x_i$. Note that

$$\sum_{i=1}^{r} p_i = p_1 + p_2 + \cdots + p_r = 1. \qquad (2)$$

Example 1.1 Let X be the discrete variable that counts the number of children per family in the entire population of a certain geographical area, and suppose that a given sample yields the following data: 1, 0, 2, 2, 4, 1, 1, 3. Then:
the sample size is $n = 8$ (the number of families considered);
the distinct observations are $x_1 = 0$, $x_2 = 1$, $x_3 = 2$, $x_4 = 3$, $x_5 = 4$;
the absolute frequencies are $f_1 = 1$, $f_2 = 3$, $f_3 = 2$, $f_4 = 1$, $f_5 = 1$;
the relative frequencies are $p_1 = 1/8$, $p_2 = 3/8$, $p_3 = 1/4$, $p_4 = 1/8$, $p_5 = 1/8$.

PHASES OF THE STATISTICAL SURVEY
1.
Experimental design: that is, the planning of a statistical study. There are two types of statistical surveys:
EXPERIMENT
INVESTIGATION
In either case, technical, administrative and ethical problems frequently arise at this stage. In particular, to use a sample as a guide to an entire population, it is important that the sample truly represents the overall population. Representative sampling ensures that inferences and conclusions can safely be extended from the sample to the population as a whole. Moreover, the statistical survey must be planned according to the purpose for which it is carried out, for example:
estimating certain characteristics of the population;
investigating the association between certain characteristics of the population;
comparing the effectiveness of different experimental methods.

2. Sampling: the collection of data about a population, carried out according to a precise procedure. It can be done in various ways:
SIMPLE RANDOM SAMPLING: each individual or object of the population has the same probability of being included in the sample;
CLUSTER SAMPLING: this sampling technique is used when "natural" but relatively homogeneous groupings are evident in a statistical population. It is often used in marketing research. In this technique, the total population is divided into groups (or clusters) and a simple random sample of each group is selected. The proportion of individuals in the various clusters of the population must be equal to the corresponding proportion in the samples.
MULTISTAGE SAMPLING: the population is divided into several "first-stage groups (or clusters)" from which the sample can be chosen; each selected first-stage group is divided into several "second-stage groups (or clusters)" from which the sample can be chosen, and so on for as many stages as are considered necessary.

3.
Description of the collected data: the acquisition of raw information from the data collected. Sometimes this stage is an end in itself, sometimes it is the preliminary stage of statistical inference. Generally this stage consists of the following operations:
organizing the data in tables and graphs;
calculating summary indices that quantitatively describe some characteristics of the data.
At this stage we can also verify the adequacy of the experimental design and of the sampling.

4. Use of tests: the logical and mathematical process that leads to the deduction of the characteristics of the population.

1.1 Elements of mathematics

In mathematics and computer science,
the rounding of a number x is the number y nearest to x that has a fixed number of significant digits (the leftmost ones);
the truncation of a number x is the greatest number y ≤ x that has a fixed number of significant digits (the leftmost ones).
For example, consider the real number x = 175.63414325436536:
the rounding of x to the hundredths is y = 175.63; the truncation of x to the hundredths is z = 175.63;
the rounding of x to the tenths is u = 175.6; the truncation of x to the tenths is v = 175.6;
the rounding of x to the units is s = 176; the truncation of x to the units is t = 175;
the rounding of x to the tens is l = 180; the truncation of x to the tens is m = 170.

2 DESCRIPTIVE STATISTICS

Descriptive statistics is the discipline that qualitatively or quantitatively describes the main features of a collection of information. Even when a data analysis draws its main conclusions using inferential statistics, descriptive statistics are generally also presented. For example, in a paper reporting on a study involving human subjects, there typically appears a table giving the overall sample size, the sample sizes in important subgroups (e.g.
for each treatment or exposure group), and demographic or clinical characteristics such as the average age, the proportion of subjects of each sex, and the proportion of subjects with related comorbidities.

2.1 Measurement scales and data types

The formal properties of different data (which accordingly allow different operations) are associated with different types of measurement scales. It is therefore important, in statistical analysis, to know the different scales of measurement. These are:

Nominal scale
– it carries the lowest level of information;
– it is used when the data can be grouped into categories, which are identified by symbols (e.g. eye colour); individuals assigned to different classes are mutually different; those of the same class are equivalent with respect to the property used in the classification;
– the assignment of numbers to identify the various nominal categories (e.g. the numbers of the players in a team) does not allow those numbers to be used as numerical quantities;
– this scale is called a dichotomous scale when there are only two categories (e.g. male/female).

Ordinal scale or Nominal scale with order
– this scale contains more information than the previous one;
– the gradation between classes is added to the property of equivalence between individuals of the same class; for example:
  1) a reagent colours a series of tubes according to the amount of the substance they contain, which allows the tubes to be sorted by the intensity of the colour;
  2) answers that are apparently defined at the nominal level can be expressed on an ordinal scale (e.g. young, adult, elderly; insufficient, sufficient, good, very good, excellent);
  3) any symbolic representation (e.g. - -, -, =, +, ++);
– with this scale it is impossible to judge the distance between different levels (e.g. is the distance between insufficient and sufficient different from that between very good and excellent?)
Interval scale
– the capability of measuring the distances between all pairs of observations is added to the two properties of the ordinal scale;
– it is based on objective and consistent measurements, even if the point of origin and the units are arbitrary (e.g. the temperature measured in degrees Celsius or Fahrenheit, calendars);
– only the differences can be manipulated quantitatively;
EXAMPLE: temperature measurements can be easily ordered and the differences between them are directly comparable and quantifiable; but a temperature of 40 degrees is not double a temperature of 20 degrees with respect to absolute zero.

Ratio scale
– the natural origin of the measurements is added to the three properties of the interval scale;
– this type of measurement is the most sophisticated and complete (e.g. height, distance, age, weight);
– not only the differences but the values themselves can be multiplied or divided by a constant amount in order to obtain new valid information;
– 0 (zero) means zero amount, in contrast to what happens, for example, with the temperature of 0 degrees Celsius (the freezing point of a particular substance under particular conditions).

2.2 Different types of tables in Statistics

A table is a means of arranging data in rows and columns. The use of tables is pervasive throughout all communication, research and data analysis. Tables appear in print media, handwritten notes, computer software, architectural ornamentation, traffic signs and many other places. The precise conventions and terminology for describing tables vary depending on the context. However, before any processing, a set of data must be organized and summarized in a frequency distribution, since an unordered set of data almost never reveals the characteristics of the phenomenon under consideration. The frequency distribution is the organization of raw data in table form with classes and frequencies, and it depends on the type of data.
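As a small illustration of how raw data become a frequency distribution, the sketch below recomputes the absolute and relative frequencies of Example 1.1. It is only a minimal sketch: the function name `frequency_distribution` and the variable names are hypothetical and do not come from the notes, and exact fractions are used so that the relative frequencies can be compared directly with the values 1/8, 3/8, 1/4 given in the example.

```python
from collections import Counter
from fractions import Fraction


def frequency_distribution(data):
    """Return (n, rows), where rows lists (x_i, f_i, p_i) sorted by x_i.

    Hypothetical helper for illustration only.
    """
    counts = Counter(data)              # absolute frequency f_i of each distinct value x_i
    n = sum(counts.values())            # equation (1): n = f_1 + ... + f_r
    rows = [(x, f, Fraction(f, n)) for x, f in sorted(counts.items())]
    return n, rows


# Data of Example 1.1: number of children in 8 families
children = [1, 0, 2, 2, 4, 1, 1, 3]
n, rows = frequency_distribution(children)

for x, f, p in rows:
    print(f"x = {x}   f = {f}   p = {p}")

# Equation (2): the relative frequencies sum to 1
total_p = sum(p for _, _, p in rows)
print("sum of relative frequencies:", total_p)
```

Using `Fraction` rather than floating-point division keeps the check of equation (2) exact; for larger data sets decimal relative frequencies are usually more readable.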
In particular we have the following frequency distributions:

QUALITATIVE DATA

EXAMPLE (nominal scale): blood type (blood group) in a sample of a population:

Table 1: Blood type

blood type    frequency
A                 60
B                 16
AB                 7
O                 66
total            149

EXAMPLE (ordinal scale): assessments obtained by the students of two classes in the genetics exam (it is preferable to list the observations in ascending or descending order):

Table 2: Genetics exam

assessment      class 1    class 2
insufficient       35         25
sufficient         35         30
good               20         15
excellent          11          5
total             101         75

QUANTITATIVE DATA

– DISCRETE DATA
In order to construct the frequency distribution table of a set of discrete data we have to:
a) identify the minimum value (0 in the example) and the maximum value (9 in the example) of the data set;
b) choose the classes: when organizing discrete data it is preferable to define one class for each observed value, but if this choice leads to a large number of classes, the values have to be grouped into fewer classes (see the case of continuous data in the following pages). How many frequency classes should be chosen? Usual practice is a minimum of 4-5 and a maximum of 15-20, and the choice depends on the total number of observations. Indeed:
  · if the number of classes is too low, information on the characteristics of the distribution is lost, which makes it non-significant;
  · if the number of classes is too high, the values are dispersed and the shape of the distribution is lost;
c) obtain the absolute frequency $f_i$ of each class by counting how many data fall in it;
d) from the absolute frequency $f_i$, calculate the relative frequency $p_i$, obtained as the ratio of the absolute frequency $f_i$ to the total number of data n; it is especially useful when you want to compare two or more distributions of the same phenomenon with different numbers of observations (as in the above example of the genetics exam);
e) we can also compute the relative and absolute cumulative frequencies ($p_i^C$ and $f_i^C$ respectively):

$$p_i^C = \sum_{1 \le k \le i} p_k, \qquad f_i^C = \sum_{1 \le k \le i} f_k.$$

EXAMPLE: The following data have been obtained by counting the number of leaves (discrete variable) that appeared on 45 branches of equal length of a plant in a given time interval:

3 4 5 6 7 2 3 2 3 2 6 4 3 9 3
2 0 3 3 4 6 5 4 2 3 6 7 3 4 2
5 1 3 4 3 7 0 2 1 3 1 5 0 4 5

class   abs. freq.   rel. freq.   rel. cum. freq.   abs. cum. freq.
x_i     f_i          p_i          p_i^C             f_i^C
0        3           0.0667       0.0667             3
1        3           0.0667       0.1333             6
2        7           0.1556       0.2889            13
3       12           0.2667       0.5556            25
4        7           0.1556       0.7111            32
5        5           0.1111       0.8222            37
6        4           0.0889       0.9111            41
7        3           0.0667       0.9778            44
8        0           0.0000       0.9778            44
9        1           0.0222       1.0000            45
total   45           1

– CONTINUOUS DATA
Note that, in the following, the text in brackets refers to the example below of the heights (cm) of 40 plants, obtained by approximating the measurements to the units.

COMPUTATION OF THE CLASSES
In order to construct the frequency distribution table of a set of continuous data we must first of all determine the classes. The choice of classes is personal but must comply with the following constraints:
∗ The number of classes must be from a minimum of 4-5 to a maximum of 15-20;
∗ The classes must be disjoint;
∗ The first class must contain the minimum of the data;
∗ The last class must contain the maximum of the data;
∗ If a − b is a class and c − d is the next class, then c = b + u, where

$$u = \begin{cases}
\;\vdots \\
100 & \text{if the measurements have been approximated to the hundreds} \\
10 & \text{if the measurements have been approximated to the tens} \\
1 & \text{if the measurements have been approximated to the units} \\
0.1 & \text{if the measurements have been approximated to the tenths} \\
0.01 & \text{if the measurements have been approximated to the hundredths} \\
\;\vdots
\end{cases}$$
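Returning to the discrete leaf-count example above, steps a) to e) can be sketched in a few lines of Python. This is only an illustrative sketch, not a procedure given in the notes: the function name `cumulative_table` is hypothetical, and it assumes one class per integer value between the minimum and the maximum, so that a value that never occurs (8 in the example) still gets a row with frequency 0.

```python
from collections import Counter
from itertools import accumulate

# The 45 leaf counts of the example, one digit per branch
raw = "345672323264393" "203346542367342" "513437021315045"
leaves = [int(ch) for ch in raw]


def cumulative_table(data):
    """Return rows (x_i, f_i, p_i, p_i^C, f_i^C), one class per integer value.

    Hypothetical helper for illustration only.
    """
    counts = Counter(data)
    n = len(data)
    # a) and b): one class for each integer from the minimum to the maximum
    classes = list(range(min(data), max(data) + 1))
    # c) absolute frequencies (Counter gives 0 for values that never occur)
    abs_freq = [counts[x] for x in classes]
    # d) relative frequencies
    rel_freq = [f / n for f in abs_freq]
    # e) cumulative frequencies: running sums of f_k, then divided by n
    abs_cum = list(accumulate(abs_freq))
    rel_cum = [c / n for c in abs_cum]
    return list(zip(classes, abs_freq, rel_freq, rel_cum, abs_cum))


table = cumulative_table(leaves)
for x, f, p, pc, fc in table:
    print(f"{x:>2} {f:>3} {p:.4f} {pc:.4f} {fc:>3}")
```

The relative cumulative frequencies are computed from the absolute cumulative ones rather than by summing rounded values of p_i, which avoids accumulating rounding error in the last column.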
In the following we give some guidelines for choosing the classes:
a) identify the minimum and maximum values (= 64 cm and 198 cm) of the data set;
b) establish the range to be covered by the classes (from 60 cm to 200 cm), which must of course include the whole range of variation of the data (from 64 cm to 198 cm inclusive); note that 200 − 60 = 140;
c) taking into account the size of the data set n (= 40), decide the number of classes r, remembering that 4 ≤ r ≤ 20 (if in the example we choose r = 7, the amplitude of each class is equal to 20 cm = 140 cm / 7 = range / r).

WARNINGS:
∗ Define the minimum and maximum of each class accurately, to avoid uncertainty in the allocation of a single datum between two contiguous classes.
∗ The determination of the extreme values (step b)), of the number of classes and of each class interval (step c)) is subjective; such a choice may result in a completely different representation of the data. Generally, the effects of these choices are greater for small samples than for large samples.
∗ The initial and terminal classes must not be open (for example we cannot choose