Lecture 1 On Statistics PDF

Lecture 1

Topics:
- Brief introduction
- Types of variables (Lecture notes 1; Book chapters: 1.2-1.3)
- Frequencies, relative frequencies and cumulative relative frequencies (Lecture notes 1; Book chapter: 2.3)

P. Rancoita [email protected]

WHAT IS STATISTICS?

Statistics (1)

Statistics is a discipline of study dealing with scientific methods for the collection, analysis, interpretation, and presentation of data, as well as with methods for drawing valid conclusions on the basis of such analysis.

Statistical methodologies can be classified into two groups:
- Descriptive statistics: it seeks only to describe and summarize the data of a sample.
- Inferential statistics: it consists of techniques for reaching conclusions about a population based upon information contained in a sample. Inferential statistics is based on probability theory.

Statistics (2)

[figure]

Example of population and sample

Example. We select 100 lung cancer patients from the national cancer registry (CR) in order to estimate the mean number of cigarettes smoked per day by a lung cancer patient before the diagnosis of the disease.
- target population = all lung cancer patients
- parameter = mean number of cigarettes smoked per day before the diagnosis
- sample = 100 lung cancer patients from the CR

Importance of Statistics in Medicine

Some examples:
- establishing risk factors for a disease or other health events (for prevention or for increasing the understanding of the phenomenon);
- establishing prognostic factors for a disease (so that, for example, different treatment strategies can be adopted depending on them);
- assessing the benefits of new therapies;
- comparing the benefits of competing therapies.

The role of statistics in a scientific study

Statistics is necessary (or must be accounted for) in every phase of a study:
- the design of the study;
- the data collection;
- the statistical analysis of the collected data;
- the interpretation of the results of the analysis.
About 50% of the literature is thought to have some shortcomings from a statistical point of view (Ercan et al, Eur. J. Gen. Med. 2007).

Design of a study (1)

A correct design of the study allows:
1) conclusions about the population to be drawn from the results obtained in the sample;
2) a correct interpretation of the results with respect to the aim of the study.

Design of a study (2)

Examples of common errors:

In a case-control study (which compares diseased and healthy subjects), the selected healthy subjects do not have characteristics (i.e. prognostic factors other than the one under study) similar to those of the patients. Thus, the two groups are not comparable and it is not possible to assess the effect of interest while excluding confounding effects.

Example. When studying a disease for which age is a prognostic factor, the two groups are not comparable if they do not have the same age distribution (e.g. one group contains a higher number of young subjects than the other).

Design of a study (3)

In a study about a specific (target) population, the selected sample is not representative of that population. Thus, the results cannot be generalized to the target population.

Example. In "Emotional category data on images from the International Affective Picture System" (Mikels et al, 2005), the authors wanted to identify which images were able to elicit a particular emotion more than others. They used samples of students with mean age 18-19 years, thus their findings are not generalizable to older individuals.

Data collection (1)

Precise data collection is the basis of good research.

Examples of possible errors:
- Variables are poorly measured.
- The data of a variable are recorded with different units of measurement (for different patients), without specifying them.
- Data are poorly written in the medical record of the patient, thus leading to errors when recording the information in the electronic database.

Data collection (2)

Example.
Even a single poorly measured or reported data point may completely alter the result of the analysis (on the right-hand side).

Statistical analysis (1)

Any kind of statistical analysis makes assumptions about the data.
⇒ Before performing any analysis, it is necessary to verify whether the corresponding assumptions are met.

Example. Several statistical methods assume that the observations are all independent. In some studies, measurements are taken before and after the treatment in order to assess its efficacy. But data referring to the same subject are dependent, thus appropriate methods need to be applied for the analysis.

The results of the analysis must be reported in a correct and precise manner in order to avoid misinterpretations.

Statistical analysis (2)

Example. We want to represent the weights of a group of patients together with their mean.

Wrong solution: A graph like the ones below may give a misleading interpretation of the data, especially if the weights show a particular trend with respect to the order of the patients.

Statistical analysis (3)

Correct solution: A graph like the one below gives a better idea of the weights that are most represented in the sample.

Interpretation of the results

For a correct interpretation of the results, it is necessary to account for:
- the exact meaning of the statistical analysis that was employed;
- the representativeness of the sample with respect to the target population.

Example of misinterpretation: When a statistical analysis shows a significant association between two variables, interpreting this association as causality goes beyond the meaning of the standard statistical analysis and can be supported only by clinical/biological knowledge of the phenomenon.

Interpretation of the results (2)

Example (possible misinterpretation of association results).
When analyzing data about coronary heart disease (CHD), an association between heavy coffee drinking and CHD mortality is usually found. Nevertheless, the real risk factor for CHD is heavy smoking (which is also associated with CHD mortality). Heavy coffee drinking is associated with CHD mortality (although it is not the cause) because heavy smokers are often also heavy coffee drinkers.

Statistics

[figure]

Population and sample

population = collection of subjects or objects of interest (target) that share common observable characteristics
unit = any individual or element of the target population
⇓ we select
sample = subset of the (target) population which is representative of the entire population

Variables and data

Definition. A variable is any kind of observable characteristic that can vary among the units of a population.
Example. Examples of variables are: sex and age.

Definition. A parameter of the population is a numerical characteristic related to a variable of the (target) population.
Example. Examples of parameters related to the previous variables are: the percentage of females and the mean age.

Definition. A datum is the observed value of a variable for one particular unit of the sample.
Example. In a study, the reported values of sex and age of the patients are data.

Types of variables

- categorical (or qualitative): nominal, ordinal
- numerical (or quantitative): discrete, continuous

Variables: categorical vs numerical (1)

Definition. A variable is called categorical (or qualitative) if its values denote membership of a category/group, that is, its values represent a particular quality of the units of the population. The possible categories of a variable must be mutually exclusive, that is, a unit cannot belong to more than one category.
Example. The variable sex is categorical, since its values are: male and female.

Variables: categorical vs numerical (2)

Definition.
A variable is called numerical (or quantitative) if its values represent quantities that can be measured or counted.
Example. The variable age is numerical.

Remark. A numerical variable can be transformed into a categorical one by dividing the interval of all its possible values into subintervals, which then define the categories of the new variable.
Example. Age can be divided into three classes: < 30, between 30 and 60 (both inclusive), > 60. The resulting variable is categorical (and, in particular, ordinal).

Variables: categorical vs numerical (3)

Remark. In a database or in a case report form (CRF), the categories of categorical variables are often labeled with numbers. Therefore, it is necessary to understand the “real meaning” of the labels (numbers) in order to define the type of variable.
Example. The values of the level of satisfaction can be denoted as: 1 (= low), 2 (= medium) and 3 (= high). Although the values of the variable are labeled with numbers, they represent three categories (low, medium, high) and thus the variable is categorical (and not numerical).

Types of variables: nominal vs ordinal

- categorical (or qualitative): nominal, ordinal
- numerical (or quantitative): discrete, continuous

Categorical variables: nominal vs ordinal

Definition. A categorical variable is called ordinal if its values (or categories) have an intrinsic (and not simply “aesthetic”) order. Otherwise, the variable is called nominal. If a nominal variable assumes only two values, it is called dichotomous (or dichotomic).
Example (1). The level of satisfaction (which assumes the values: low, medium, high) is an ordinal variable. The categories of the variable can be ordered in the following way: low ≺ medium ≺ high.
Example (2). The presence of fever (which assumes the values: no/yes) is a nominal (dichotomous) variable. In fact, it is not possible to order the values no and yes.

Categorical variables: remark

Remark.
Since the categories of categorical variables are often labeled with numbers, it is necessary to understand the “real meaning” of the labels (numbers) in order to distinguish also between nominal and ordinal variables.
Example. The presence of a symptom (e.g. fever) can be denoted as: 0 (= no) and 1 (= yes). The variable is categorical since the numbers represent categories. Moreover, although the values 0 and 1 can be ordered, the variable is nominal, since the categories they represent (no and yes) cannot be ordered.

Types of variables: discrete vs continuous

- categorical (or qualitative): nominal, ordinal
- numerical (or quantitative): discrete, continuous

Numerical variables: discrete vs continuous

Definition. A numerical variable is called discrete if it can assume a finite or countable number of numerical values. Discrete variables usually result from counting.
Example. The number of children of a person is a discrete variable.

Definition. A numerical variable is called continuous if it can assume any numerical value over an interval or over several intervals. A continuous variable usually results from making a measurement of some type.
Example. Temperature is a continuous variable.

Data set 1: data of breast cancer patients

[figure]

Types of variables in Data set 1

- categorical
  - nominal: (none listed)
  - ordinal: tumor grade
- numerical
  - discrete: n. positive lymph nodes
  - continuous: age, tumor size, progesterone receptor, estrogen receptor

Exercise

Determine the type (nominal, ordinal, discrete, continuous) of the following variables:
a. number of medications per day
b. type of surgery
c. surgery duration
d. grade of postoperative complications
e. weight

It is possible to redo the online in-class test without looking at the solutions in the next slide (deadline Oct.
21st) by using the Quizizz link: https://quizizz.com/join?gc=79051408
The link is available in Blackboard (https://bb.unisr.it/) in folder Lecture 1 (within Lectures-Rancoita).

Solution

Determine the type (nominal, ordinal, discrete, continuous) of the following variables:
a. number of medications per day: numerical, discrete
b. type of surgery: categorical, nominal
c. surgery duration: numerical, continuous
d. grade of postoperative complications: categorical, ordinal
e. weight: numerical, continuous

Frequency distribution (1)

The frequency distribution is a way of summarizing the data of a variable.

Definition. The (absolute) frequency distribution of a variable lists how many times each specific value (or interval of values) is observed in the sample.

Example (1). tumor grade: II, II, III, II, II, II, III, II, I, I, II, II, III, III, II, II, II, III, III, I, II, II, I, II, II, III, II, III, I

value (i)    frequency (n_i)
I             5
II           16
III           8
total        29

Frequency distribution (2)

Example (2). age: 70, 73, 32, 65, 80, 66, 50, 54, 39, 55, 56, 57, 65, 65, 44, 43, 32, 45, 36, 55, 34, 62, 64, 53, 53, 65, 45, 58, 59.

value (i)    frequency (n_i)
32           ?
33           ?
...          ?
80           ?
total        29

Does it make sense?

Class intervals

Some variables (especially numerical ones) may assume a high number of different values. Therefore, a table with the frequencies of each possible value assumed by the variable does not represent the data well.
⇓
The data are divided into classes of disjoint intervals (such that each value belongs to only one class).

Notation for the intervals:
- a ⊣ b: all numbers greater than a and less than or equal to b,
- a ⊢ b: all numbers greater than or equal to a and less than b,
- a ⊢⊣ b: all numbers greater than or equal to a and less than or equal to b.
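
The absolute frequency distribution and the interval notation above can be sketched in Python as follows. This is a minimal illustration on the tumor grade and age data of Examples (1) and (2); the function names `absolute_frequencies` and `in_interval`, and the `closed` argument used to encode the three notations, are assumptions made for this sketch and do not come from the lecture.

```python
from collections import Counter

tumor_grade = ["II", "II", "III", "II", "II", "II", "III", "II", "I", "I",
               "II", "II", "III", "III", "II", "II", "II", "III", "III", "I",
               "II", "II", "I", "II", "II", "III", "II", "III", "I"]
age = [70, 73, 32, 65, 80, 66, 50, 54, 39, 55, 56, 57, 65, 65, 44,
       43, 32, 45, 36, 55, 34, 62, 64, 53, 53, 65, 45, 58, 59]


def absolute_frequencies(values):
    """Return {value: n_i}, i.e. how many times each value is observed."""
    return dict(sorted(Counter(values).items()))


def in_interval(x, a, b, closed="right"):
    """Membership test for the three interval notations (hypothetical helper).

    closed="right": a ⊣ b   (a < x <= b)
    closed="left":  a ⊢ b   (a <= x < b)
    closed="both":  a ⊢⊣ b  (a <= x <= b)
    """
    if closed == "right":
        return a < x <= b
    if closed == "left":
        return a <= x < b
    if closed == "both":
        return a <= x <= b
    raise ValueError("closed must be 'right', 'left' or 'both'")


grade_freq = absolute_frequencies(tumor_grade)
age_freq = absolute_frequencies(age)

print(grade_freq)  # {'I': 5, 'II': 16, 'III': 8}
print(len(age_freq), "distinct ages among", len(age), "patients")
```

For tumor grade the table has three rows, while for age almost every value gets its own row with a frequency of 1 or 2, which is why the raw table is not a useful summary and class intervals are introduced.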
Example of frequency table with classes

age: 32, 32, 34, 36, 39, 43, 44, 45, 45, 50, 53, 53, 54, 55, 55, 56, 57, 58, 59, 62, 64, 65, 65, 65, 65, 66, 70, 73, 80

We compute the frequencies of the variable age, using 5 equally spaced intervals.

Maximum = 80; Minimum = 32;
⇒ length of the interval (∆) = (Maximum − Minimum) / (n. classes) = (80 − 32)/5 = 9.6 ≈ 10

class (i)    frequency (n_i)
30⊣40        5
40⊣50        5
50⊣60        9
60⊣70        8
70⊣80        2

Relative frequencies

Relative frequency = frequency / total n. of observations = n_i / n
⇒ We are comparing each category/class to the total.

variable: tumor grade

value (i)    frequency (n_i)    relative frequency (p_i = n_i/n)
I             5                 5/29 = 0.17
II           16                 16/29 = 0.55
III           8                 8/29 = 0.28
total        29                 1

variable: age

class (i)    frequency (n_i)    relative frequency (p_i = n_i/n)
30⊣40         5                 5/29 = 0.17
40⊣50         5                 5/29 = 0.17
50⊣60         9                 9/29 = 0.31
60⊣70         8                 8/29 = 0.28
70⊣80         2                 2/29 = 0.07
total        29                 1

Graphical representations of relative or absolute frequencies (1)

Categorical variables: bar graph and pie chart

Definition. A bar graph is a graph composed of bars whose heights are the absolute or relative frequencies of the different categories.

Graphical representations of relative or absolute frequencies (2)

Definition. A pie chart consists of a circle which is divided into portions that represent the absolute or relative frequencies of the different categories.

Graphical representations of relative or absolute frequencies (3)

Numerical variables: histogram

Definition. In the case of equally spaced classes, a histogram is a graph that displays the classes on the horizontal axis and the absolute or relative frequencies of the classes on the vertical axis. The heights of the bars in the graph correspond to the absolute or relative frequencies of the corresponding class intervals.
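
The class frequency table and the relative frequencies for age can be reproduced with a short Python sketch. It assumes right-closed classes a ⊣ b as in the tables above; the starting point 30 is read off the lecture's table, and rounding the interval length up with `math.ceil` is an assumption of this sketch (the slide only writes 9.6 ≈ 10). The function name `frequency_table` is hypothetical.

```python
import math

age = [32, 32, 34, 36, 39, 43, 44, 45, 45, 50, 53, 53, 54, 55, 55,
       56, 57, 58, 59, 62, 64, 65, 65, 65, 65, 66, 70, 73, 80]


def frequency_table(values, start, width, n_classes):
    """Return rows (a, b, n_i, p_i) for right-closed classes a ⊣ b.

    Values outside [start, start + width * n_classes] are not counted,
    so the classes should be chosen to cover the whole data range.
    """
    n = len(values)
    rows = []
    for k in range(n_classes):
        a = start + k * width
        b = a + width
        n_i = sum(1 for x in values if a < x <= b)  # a ⊣ b: a < x <= b
        rows.append((a, b, n_i, n_i / n))
    return rows


# Interval length: (Maximum - Minimum) / n. classes = (80 - 32) / 5 = 9.6,
# rounded up to 10 (assumed rounding rule).
width = math.ceil((max(age) - min(age)) / 5)
table = frequency_table(age, start=30, width=width, n_classes=5)

for a, b, n_i, p_i in table:
    # (a, b] is the same set as a ⊣ b
    print(f"({a}, {b}]  n_i = {n_i}  p_i = {p_i:.2f}")
```

Because the classes are disjoint and cover all 29 observations, the frequencies sum to n and the relative frequencies sum to 1.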
Graphical representations of relative or absolute frequencies (4)

- example of symmetric distribution
- example of asymmetric distribution (with a longer tail on the right side)

Cumulative relative frequency

Cumulative (relative) frequency = for each value (or class), it is the relative frequency of the set of all values up to that value (or class).

Example. Table of cumulative rel. freq. of age.

class (i)    freq. (n_i)    rel. freq. (p_i)    cumulative rel. frequency (F_i)
30⊣40         5             0.17                0.17
40⊣50         5             0.17                0.17+0.17 = 0.34
50⊣60         9             0.31                0.17+0.17+0.31 = 0.34+0.31 = 0.65
60⊣70         8             0.28                0.17+0.17+0.31+0.28 = 0.65+0.28 = 0.93
70⊣80         2             0.07                0.17+0.17+0.31+0.28+0.07 = 0.93+0.07 = 1
total        29             1                   -

Graph of cumulative (rel.) frequencies

[figure]

Quick exercise on frequencies

First, construct the frequency, relative frequency and cumulative frequency table for the variable representing the weights (in kg) of twelve children, using the following classes: 10 ⊣ 13, 13 ⊣ 16, 16 ⊣ 19. After building the table, answer the questions of the test.

Weight: 18, 17, 11, 15, 19, 16, 18, 17, 13, 14, 11, 19

Use the following Microsoft Forms link to answer the test: https://forms.office.com/e/guPsGEUd4M
The link is available in Blackboard (https://bb.unisr.it/) in folder Lecture 1 (within Lectures-Rancoita).

INSTRUCTIONS: answer each question, if necessary writing I DON’T KNOW [in capital letters] in open-ended questions.

Solutions

class (i)    freq. (n_i)    rel. freq. (p_i)    cumulative rel. frequency (F_i)
10⊣13        3              3/12 = 0.25         0.25
13⊣16        3              3/12 = 0.25         0.25+0.25 = 0.50
16⊣19        6              6/12 = 0.50         0.25+0.25+0.50 = 1

1. What is the frequency of the class 10⊣13? 3
2. Which is the class with the highest relative frequency? 16⊣19
3. What is the cumulative relative frequency of the class 13⊣16? 0.50
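
As a check on the two cumulative tables, the following Python sketch computes cumulative relative frequencies directly from the class counts. The function name `cumulative_relative_frequencies` is hypothetical. One point worth noting: cumulating the counts before dividing gives 19/29 ≈ 0.66 for the class 50⊣60, whereas the age table's 0.65 comes from adding relative frequencies that were already rounded to two decimals.

```python
from itertools import accumulate


def cumulative_relative_frequencies(frequencies):
    """Return F_i = (n_1 + ... + n_i) / n for each class, in class order."""
    n = sum(frequencies)
    return [running_total / n for running_total in accumulate(frequencies)]


# Age example: class counts for 30⊣40, 40⊣50, 50⊣60, 60⊣70, 70⊣80
age_counts = [5, 5, 9, 8, 2]
F_age = cumulative_relative_frequencies(age_counts)

# Quick exercise: weights (kg) of twelve children, right-closed classes a ⊣ b
weights = [18, 17, 11, 15, 19, 16, 18, 17, 13, 14, 11, 19]
classes = [(10, 13), (13, 16), (16, 19)]
weight_counts = [sum(1 for w in weights if a < w <= b) for a, b in classes]
F_weight = cumulative_relative_frequencies(weight_counts)

print([round(f, 2) for f in F_age])  # [0.17, 0.34, 0.66, 0.93, 1.0]
print(weight_counts)                 # [3, 3, 6]
print(F_weight)                      # [0.25, 0.5, 1.0]
```

The last cumulative relative frequency is always 1, which is a quick sanity check that no observation fell outside the chosen classes.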
