BIO203 Biostatistics Lecture 2 (Descriptive Statistics) shf 2024 PDF

Summary

This document is the slide transcript of BIO203 Biostatistics, Lecture 2 (descriptive statistics, shf, 2024). It relates the types of biological variables (nominal, ranked, ratio / interval) to the tests that suit them, distinguishes populations from samples, and explains how sample statistics are used to infer population parameters. It then covers measures of central tendency (arithmetic, geometric and harmonic mean, median, mode), quartiles, box and whisker plots, percentiles, and dispersion (sum of squares and variance).

Full Transcript


BIO203 - Biostatistics: Lecture 2 - descriptive statistics
BIO203.01 - Biostatistics shf - 2024

Biological data: tests that can be used
Different questions can be answered with different tests.
Variable types: nominal; measurement (ranked, ratio / interval).
  nominal variables: tests for nominal variables
  ranked variables: non-parametric tests (rank tests)
  ratio / interval variables: parametric tests

Biological data: measurements of something
Measurement variables (ratio / interval): length, number of cells / colonies / ..., protein concentration, ... whatnot.

Biological data: basic questions to ask
The same question / data can be looked at in different ways.
How accurate is your class time management?
  option I: do my observations match my expectation?
  option II: are these two measured values the same?
  option III: does any of the measured values differ from the others?
[Figure: example class durations in minutes for each option; the expectation is 50 min]

Time between breaks (min):

  option I (expectation: 50)
    reality:  51  54  50  56  57  49  60  52    average: 53.6
  option II
    shf:      51  54  50  56  57  49  60  52    average: 53.6
    x:        50  50  50  49  50  51  50  52    average: 50.3
  option III
    shf:      51  54  50  56  57  49  60  52    average: 53.6
    ac:       50  50  50  49  50  51  50  52    average: 50.3
    x:        62  56  51  44  51  52  58  52    average: 53.3
    y:        30  32  43  49  51  53  12  41    average: 38.9
    z:        51  50  52  51  50  51  50  51    average: 50.8

  Hypothesis (option I): The average class duration is exactly 50 minutes.
  Hypothesis (option II): The average class duration is the same for both instructors.
  Hypothesis (option III): The average class duration is the same for all instructors.

Sampling: population vs sample
  population (all possible individuals): what we want to know; unfortunately, we know nothing.
  sample (a subgroup), obtained by sampling: what we can know.

Sampling: population vs sample - non-representative sample
  Laboratory interns are not a representative subgroup of the population: sampling them gives biased information.

Sampling: population vs sample - representative sample
  A random sample is a representative subgroup of the population.
  What we can know: sample mean $\bar{X}$, sample variance $s^2$.
  Inference: from the sample we assume the population mean $\mu$ and the population variance $\sigma^2$.

Biological data: sample vs population
Population: all classes for all courses ever given (and to be given) by the same instructor.
The observed classes are a sample of a much larger population.
[Figure: grid of class durations; the sample used in option I is highlighted. Sample average: 53.6; total average: 52.7]
Hypothesis: The average class duration is exactly 50 minutes.

Biological data: sample vs population
The three options above correspond to one sample (option I), two samples (option II) and five samples (option III).
Only limited numbers of observations were analyzed → sample,
and not all classes ever given (or to be given) by each instructor → population.

Sampling: population vs sample - representative sample
  population I (control group): assume $\mu$ / $\sigma^2$; sample mean $\bar{X}$, sample variance $s^2$
  population II (test group): assume $\mu$ / $\sigma^2$; sample mean $\bar{X}$, sample variance $s^2$
  The sample means and variances are compared; the relationship between the two populations is assessed by hypothesis testing.

Biological data: tests that can be used
Different questions can be answered with similar / different tests.
  nominal variables: tests for nominal variables
  ranked variables: non-parametric tests (rank tests)
  ratio / interval variables: parametric tests; if the criteria are not met, variable conversion leads to the non-parametric tests
What are the parameters?

Parametric tests: what are the parameters?
Measurement data can be (fully) described by two parameters: the mean and the dispersion.
  "on average" → mean
  "roughly / around" → mean ± some measure of variation

Centrality parameters: mean
The mean (average) is the sum of all measured values divided by the number of observations.
  $\text{mean} = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n}$
  $\text{mean} = \frac{50 + 55 + 49 + 45 + 60 + 50}{6} = 51.5$

Centrality parameters: sample mean vs population mean
The sample mean is an estimate of the population mean.
  sample mean (what you measure): $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
  population mean (assumed; estimated based on the sample mean): $\mu = \frac{\sum_{i=1}^{n} x_i}{n}$
The estimate gets better (more accurate) with increasing n (sample size).

Measure of central tendency: can be absurd
The majority of the Turkish population has more than the average number of legs.
  Turkish population: 86,408,416
  People with one leg: 8,361 (15.10.2022)
  $\text{mean number of legs} = \frac{(86{,}408{,}416 - 8{,}361) \times 2 + 8{,}361 \times 1}{86{,}408{,}416} = 1.99990$
  → 86,400,055 Turkish people have more than the average number of 1.99990 legs.

Centrality parameters: different types
The average can be described by different parameters of centrality (center, average): mean, median, mode.
Kinds of mean: arithmetic mean, parametric mean, geometric mean, harmonic mean.

  arithmetic mean: sum of observations / n
    $\text{mean}_{\text{arithmetic}} = \frac{\sum x_i}{n}$
  geometric mean: nth root of the product of n values
    $\text{mean}_{\text{geometric}} = \sqrt[n]{x_1 \times \dots \times x_n}$
  harmonic mean: reciprocal of the arithmetic mean of the reciprocals of the values
    $\text{mean}_{\text{harmonic}} = \frac{n}{\sum \frac{1}{x_i}}$
  median: the middle value of the ordered data, at position (n + 1) / 2 for an odd n
    values: 5.5  5.8  5.9  6.7  9.9
    order:  1    2    3    4    5      (median = 5.9)
  mode: local maximum in a distribution (a distribution can have more than one: mode 1, mode 2)

Centrality parameters: geometric mean - example
The geometric mean describes the average change.
  day:          1    2    3     4     5
  size in mm:   5    9    12    15    18
  fold change:       1.8  1.33  1.25  1.2
  $\text{mean}_g = \sqrt[4]{1.8 \times 1.33 \times 1.25 \times 1.2} = \sqrt[4]{3.6} \approx 1.38$ (average fold change per day)
  plant growth: $5\,\text{mm} \times \text{mean}_g^{4} = 5\,\text{mm} \times 3.6 = 18\,\text{mm}$

Central tendency: where is the center?
[Figure: dot plot of arrival time (min after beginning of class, axis 0 to 14), ranging from early students to very late students; mean = 3.96]
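The three kinds of mean defined on the "different types" slide can be sketched in Python as follows. This is a minimal illustration rather than part of the lecture: the function names are hypothetical, and the geometric mean is computed through logarithms, which assumes strictly positive values. The example data are the plant sizes from the geometric mean slide.

```python
import math


def arithmetic_mean(values):
    """Sum of observations divided by n."""
    return sum(values) / len(values)


def geometric_mean(values):
    """nth root of the product of n values (computed via logs; values must be > 0)."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def harmonic_mean(values):
    """Reciprocal of the arithmetic mean of the reciprocals."""
    return len(values) / sum(1 / v for v in values)


# Plant sizes on day 1 to day 5 (mm), taken from the slide.
sizes_mm = [5, 9, 12, 15, 18]

# Day-to-day fold changes: 1.8, 1.33..., 1.25, 1.2
fold_changes = [after / before for before, after in zip(sizes_mm, sizes_mm[1:])]

# Average fold change per day.
mean_g = geometric_mean(fold_changes)

# Applying the average fold change four times to the day-1 size
# reproduces the day-5 size.
predicted_final_mm = sizes_mm[0] * mean_g ** len(fold_changes)

print(f"geometric mean fold change: {mean_g:.3f}")
print(f"predicted size on day 5: {predicted_final_mm:.1f} mm")
print(f"arithmetic mean of class durations: {arithmetic_mean([50, 55, 49, 45, 60, 50])}")
```

The geometric mean is the right average here because growth is multiplicative: the arithmetic mean of the fold changes would overestimate the final size when applied repeatedly.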
[Figure: dot plot of arrival time (min after beginning of class, axis 0 to 14); the very late students are marked as outliers. Mean with outliers: 3.96; mean without outliers: 2.66]

Central tendency: mean and median
Ordered data (33 values, range: 1 - 12.5), shown from largest to smallest in four groups of 8 values (25% each) around the median:
  12.5  12.0  11.0  10.5  10.0  3.8  3.7  3.7
  3.6  3.5  3.4  3.3  3.2  3.1  3.0  3.0
  2.9  (median)
  2.8  2.7  2.7  2.7  2.6  2.4  2.4  2.2
  2.1  2.0  1.9  1.9  1.8  1.7  1.5  1.0
mean: 3.96
median: 2.9 (16 values, 50%, above and 16 values, 50%, below)
3rd quartile: 3.65
1st quartile: 2.15

The median of a numerical data set is the point at which there are an equal number of data points whose values lie above and below the median value.
How to:
1. Order the numbers from smallest to largest.
2. If the data set contains an odd number of values, the median is the data point that is exactly in the middle.
3. If the data set contains an even number of values, the median is the arithmetic mean between the two numbers that lie in the middle.

The 1st quartile is the midpoint of the lower half.
The 3rd quartile is the midpoint of the upper half.

Central tendency: mean and median
Median and quartiles for odd and even n. A midpoint is the number in the middle if the range or half holds an odd number of observations, and the mean between the middle two numbers if it holds an even number.

  n      data      halves with                   1st quartile   median   3rd quartile
  odd    1 to 11   odd number of observations    3              6        9
  odd    1 to 9    even number of observations   2.5            5        7.5
  even   1 to 10   odd number of observations    3              5.5      8
  even   1 to 8    even number of observations   2.5            4.5      6.5

Central tendency: mean and median
IQR (inter-quartile range): the range between the 1st quartile (2.15) and the 3rd quartile (3.65).
Outliers: more than 1.5 × IQR above the 3rd quartile or below the 1st quartile.
  outliers (high) ≥ 3.65 + ((3.65 - 2.15) × 1.5) ≥ 5.9
  outliers (low) ≤ 2.15 - ((3.65 - 2.15) × 1.5) ≤ -0.1

Central tendency: mean and median
Box and whisker plot of the same data.
[Figure: box and whisker plot; labeled from top to bottom: outliers, upper criterion for outliers, maximum, 3rd quartile, median, 1st quartile, minimum; the box spans the IQR]

Descriptive statistics: box and whisker plot
Tukey style
[Figure: Tukey-style box and whisker plot, variable y (0 to 25) against variable x; labeled outliers, maximum, 3rd quartile, median, IQR, 1st quartile, minimum, outliers]
Min/Max
[Figure: the same plot with whiskers drawn to the minimum and maximum]
With notches
[Figure: the same plot with notches around the median]
  notch: 95% confidence interval of the median
  $\text{notch} = \text{median} \pm 1.57 \times \frac{\text{IQR}}{\sqrt{n}}$
If the notches of two plots (of the same variable) do not overlap, the medians are most likely significantly different from each other.
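The median, quartile and outlier rules from the slides can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, the quartiles are computed as medians of the lower and upper halves (excluding the median itself when n is odd, as in the slides), and the notch uses the factor 1.57 quoted on the notch slide. The data are the 33 arrival times listed above.

```python
import math
import statistics

# The 33 arrival times (min after beginning of class) listed on the slides.
arrival_min = [
    12.5, 12.0, 11.0, 10.5, 10.0, 3.8, 3.7, 3.7, 3.6, 3.5, 3.4,
    3.3, 3.2, 3.1, 3.0, 3.0, 2.9, 2.8, 2.7, 2.7, 2.7, 2.6,
    2.4, 2.4, 2.2, 2.1, 2.0, 1.9, 1.9, 1.8, 1.7, 1.5, 1.0,
]


def quartiles_by_halves(values):
    """Return (1st quartile, median, 3rd quartile) as midpoints of the halves."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    # For odd n the median itself belongs to neither half.
    upper = data[half + 1:] if n % 2 else data[half:]
    return statistics.median(lower), statistics.median(data), statistics.median(upper)


def outlier_limits(q1, q3, factor=1.5):
    """Low and high outlier criteria: factor x IQR beyond the quartiles."""
    iqr = q3 - q1
    return q1 - factor * iqr, q3 + factor * iqr


def notch(median, iqr, n):
    """Approximate 95% confidence interval of the median."""
    half_width = 1.57 * iqr / math.sqrt(n)
    return median - half_width, median + half_width


q1, med, q3 = quartiles_by_halves(arrival_min)
low, high = outlier_limits(q1, q3)
outliers = [v for v in arrival_min if v <= low or v >= high]

print(f"Q1 = {q1:.2f}, median = {med:.2f}, Q3 = {q3:.2f}, IQR = {q3 - q1:.2f}")
print(f"outlier criteria: low = {low:.2f}, high = {high:.2f}")
print(f"outliers: {sorted(outliers)}")
print("notch: {:.2f} to {:.2f}".format(*notch(med, q3 - q1, len(arrival_min))))
```

Other quartile conventions exist (for example interpolation between ranks), so library functions may return slightly different quartiles for small data sets.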
Descriptive statistics: box and whisker plot
Further variants of the same plot: Tukey style, Min/Max, with notches, with data points, and the violin plot (which shows the data density).
[Figure: five panels, variable y (0 to 25) against variable x: Tukey style, Min/Max, w. Notches, w. Data points, Violin plot]

Central tendency: percentiles

Percentiles: empirical cumulative distribution function - ECDF
[Figure: ecdf(data); x axis: evaluation scores (1 to 5) with Min, 1st Q, median, 3rd Q and Max marked; y axis: cumulative probability Fn(x) from 0.0 to 1.0]
[Figure: the same ECDF for the evaluation scores of the Fen Edebiyat Fakültesi (Faculty of Arts and Sciences); BIO49A lies at 88.67%]

Central tendency: percentiles
Definition 1 (exclusive): The nth percentile is the lowest score that is greater than a certain percentage ("n") of the scores.
Definition 2 (inclusive): The nth percentile is the lowest score that is greater than or equal to a certain percentage of the scores.

Grades as recorded (student: grade):
  1: 89, 2: 78, 3: 94, 4: 66, 5: 50, 6: 43, 7: 92, 8: 75, 9: 81, 10: 53,
  11: 97, 12: 44, 13: 50, 14: 69, 15: 73, 16: 93, 17: 58, 18: 87, 19: 77, 20: 45

Grades ordered by rank:
  student      rank   grade
  student 11   20     97
  student 16   19     94
  student 3    18     94
  student 7    17     92
  student 1    16     89
  student 18   15     87
  student 9    14     81
  student 2    13     78
  student 19   12     77
  student 8    11     75
  student 15   10     73
  student 14   9      69
  student 4    8      66
  student 17   7      58
  student 10   6      53
  student 13   5      50
  student 5    4      50
  student 20   3      45
  student 12   2      45
  student 6    1      43

Definition 1 (exclusive)
Which score corresponds to the xth percentile?
  $P_x = \frac{x(n + 1)}{100}$   (x: percentile, n: total number of observations)
  $P_{75} = \frac{75(21)}{100} = 15.75 \approx 16$ → rank 16 → $P_{75} = 89$
Which percentile corresponds to a score y?
  $P(y) = \frac{\text{rank}(y) - 1}{n} \times 100$
  $P(78) = \frac{12}{20} \times 100 = 60$

Definition 2 (inclusive)
Which score corresponds to the xth percentile?
  $P_x = \frac{x \times n}{100}$   (x: percentile, n: total number of observations)
  $P_{75} = \frac{75(20)}{100} = 15$ → rank 15 → $P_{75} = 87$
Which percentile corresponds to a score y?
  $P(y) = \frac{\text{rank}(y)}{n} \times 100$
  $P(78) = \frac{13}{20} \times 100 = 65$

Central tendency: many centers – which one is a good one?
If the data are normally distributed, the mean and median are close to each other.
If the distribution is skewed, they are not.
  skewed to the right:   mean: 2.65, median: 2
  normally distributed:  mean: 5.03, median: 5
  skewed to the left:    mean: 7.35, median: 8
[Figure: three histograms (skr, normal, skl), frequency against values from 0 to 10]
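The exclusive and inclusive percentile formulas from the percentile slides can be sketched in Python as follows. This is a minimal illustration with hypothetical function names. Two choices are assumptions rather than something the slides state: fractional ranks are rounded half up (the slides round 15.75 to 16), and for tied scores the percentile of a score counts the scores below it (exclusive) or at or below it (inclusive), which equals the rank formulas whenever the score is not tied.

```python
import math

# The 20 grades from the ranked table on the slides.
grades = [97, 94, 94, 92, 89, 87, 81, 78, 77, 75,
          73, 69, 66, 58, 53, 50, 50, 45, 45, 43]


def score_at_percentile(scores, x, inclusive=False):
    """Score at the xth percentile: rank x(n+1)/100 (exclusive) or x*n/100 (inclusive)."""
    data = sorted(scores)
    n = len(data)
    position = x * n / 100 if inclusive else x * (n + 1) / 100
    rank = math.floor(position + 0.5)   # assumption: round half up
    rank = min(max(rank, 1), n)         # keep the rank inside 1..n
    return data[rank - 1]


def percentile_of_score(scores, y, inclusive=False):
    """Percentile of score y: (rank - 1)/n * 100 (exclusive) or rank/n * 100 (inclusive)."""
    n = len(scores)
    if inclusive:
        count = sum(1 for s in scores if s <= y)
    else:
        count = sum(1 for s in scores if s < y)
    return 100 * count / n


print("P75 exclusive:", score_at_percentile(grades, 75))
print("P75 inclusive:", score_at_percentile(grades, 75, inclusive=True))
print("P(78) exclusive:", percentile_of_score(grades, 78))
print("P(78) inclusive:", percentile_of_score(grades, 78, inclusive=True))
```

The two definitions differ by at most one rank, so the gap between them shrinks as the number of observations grows.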
Measures of centrality for the arrival-time data (minutes late coming to class):
  3.96: mean
  3.19: geometric mean
  2.90: median
  2.74: harmonic mean
[Figure: the data plotted with outliers and without outliers (minutes late coming to class, axis 2 to 12), with the four measures marked]

The median is the most robust measure of centrality (and not sensitive to outliers).
Non-parametric tests are more suitable for data that contain outliers.
  parametric tests are based on the mean
  non-parametric tests are based on the median or other measures of rank

Centrality parameters: the mean does not describe everything
Three statisticians go duck hunting. The first statistician shoots and hits 1 m to the left. The second statistician hits 1 m to the right. The third statistician gets excited and yells: We got him!
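The claim that the median is robust against outliers while the mean is not can be checked on the arrival-time data with a short Python sketch. The variable names are hypothetical, and the cutoff of 5.9 is the upper outlier criterion derived on the box plot slides; the sketch only uses the standard `statistics` module.

```python
import statistics

# The 33 arrival times (min after beginning of class) listed on the slides.
arrival_min = [
    12.5, 12.0, 11.0, 10.5, 10.0, 3.8, 3.7, 3.7, 3.6, 3.5, 3.4,
    3.3, 3.2, 3.1, 3.0, 3.0, 2.9, 2.8, 2.7, 2.7, 2.7, 2.6,
    2.4, 2.4, 2.2, 2.1, 2.0, 1.9, 1.9, 1.8, 1.7, 1.5, 1.0,
]

# Upper outlier criterion from the slides: Q3 + 1.5 x IQR = 5.9
upper_criterion = 5.9
without_outliers = [v for v in arrival_min if v < upper_criterion]

mean_with = statistics.fmean(arrival_min)
mean_without = statistics.fmean(without_outliers)
median_with = statistics.median(arrival_min)
median_without = statistics.median(without_outliers)

print(f"mean:   {mean_with:.2f} with outliers, {mean_without:.2f} without")
print(f"median: {median_with:.2f} with outliers, {median_without:.2f} without")
```

Removing the five very late students moves the mean by more than a minute, while the median shifts far less, which is why rank-based (non-parametric) methods cope better with such data.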
Central tendency: center and dispersion

Central tendency: dispersion
Coffee guy accuracy: 250 g of Turkish Coffee, weighed 20 times.

  #     $x$    $x - \bar{x}$   $(x - \bar{x})^2$
  1     251     1.0             1.0
  2     250     0.0             0.0
  3     252     2.0             4.0
  4     251     1.0             1.0
  5     247    -3.0             9.0
  6     250     0.0             0.0
  7     251     1.0             1.0
  8     252     2.0             4.0
  9     250     0.0             0.0
  10    250     0.0             0.0
  11    250     0.0             0.0
  12    246    -4.0            16.0
  13    251     1.0             1.0
  14    251     1.0             1.0
  15    250     0.0             0.0
  16    249    -1.0             1.0
  17    247    -3.0             9.0
  18    253     3.0             9.0
  19    248    -2.0             4.0
  20    251     1.0             1.0

  average $\bar{x}$: 250.0
  ∑ of differences from the mean: 0.0
  ∑ of squared differences from the mean: 62.0 (sum of squares)

The sum of differences from the mean is always 0.

Central tendency: dispersion - sum of squares
Squaring each difference from the mean removes the sign; the sum of the squared differences is the sum of squares: 62.0.

Central tendency: dispersion - variance
  variance $s^2$ = sum of squares / (n - 1) = 62.0 / 19 = 3.3
It's a sample, so the sum of squares is divided by the degrees of freedom (n - 1) rather than by n.
Unit: this would be gram².
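The dispersion calculation for the coffee weights can be retraced with a short Python sketch. It is a minimal illustration with hypothetical variable names; it recomputes the mean, the differences from the mean, the sum of squares and the sample variance (divided by n - 1) from the 20 weights listed in the table, and compares the result with `statistics.variance`, which also divides by n - 1.

```python
import math
import statistics

# The 20 measured weights (g) from the coffee table.
coffee_g = [251, 250, 252, 251, 247, 250, 251, 252, 250, 250,
            250, 246, 251, 251, 250, 249, 247, 253, 248, 251]

n = len(coffee_g)
mean_g = sum(coffee_g) / n

# Differences from the mean: they always add up to 0.
differences = [x - mean_g for x in coffee_g]

# Squaring removes the sign, so the squared differences can be summed.
sum_of_squares = sum(d ** 2 for d in differences)

# Sample variance: sum of squares divided by the degrees of freedom (n - 1).
# Its unit is gram squared.
sample_variance = sum_of_squares / (n - 1)

print(f"mean: {mean_g:.1f} g")
print(f"sum of differences: {sum(differences):.1f}")
print(f"sum of squares: {sum_of_squares:.1f}")
print(f"sample variance: {sample_variance:.1f} g^2")

# Cross-check against the standard library (which also uses n - 1).
assert math.isclose(sample_variance, statistics.variance(coffee_g))
```

Dividing by n - 1 instead of n compensates for the fact that the differences are taken from the sample mean rather than from the unknown population mean.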
