Defining Data
Summary
This document provides an overview of defining data, including classifications, derived variables, transformed variables, outcome and exposure variables, descriptive statistics, and sample statistics. It covers various aspects of data analysis and is suitable for a beginner-level understanding of statistics.
Full Transcript
Defining data

Classify data variables as numerical or categorical
– If numerical, further classify as discrete or continuous
– If categorical, further classify as ordinal or nominal
– Variables with only two possible values are binary (dichotomous)

Derived variables
– Variables derived by grouping values into categories using a threshold or cutoff

Transformed variables
– Log transformation
– Standardised scores

Outcome and exposure variables – common synonyms:
  Outcome              Exposure
  Response             Explanatory
  Y                    X
  Dependent            Independent
  Case/control group   Treatment group
                       Predictor

______________________________________

Descriptive statistics

Frequency distribution

Histograms
– Bin: the width of each bar
– Frequency: the height (length) of each bar
– Range: the range of the data
– Mode: the peak
– Density: normalised so that the total area of the bars equals one

Skewness (where does the peak sit?)
– In the middle: normal
– To the left (rises early and tails off to the right): positive skew (right-hand skew)
– To the right (rises later): negative skew (left-hand skew)

Modality
– Unimodal (one peak)
– Bimodal (two peaks)
– Multimodal (multiple peaks)
– Uniform (flat – all values roughly equally frequent)

Sample statistics

Central tendency (where do the data tend to cluster – where is the middle?)
– Mean = sum of all observations / number of observations
– Median = order the observations from lowest to highest and take the middle value
– Mode = the most common value

Mean and median
– Normal data: the mean and median are roughly the same
– Skewed data: the mean gets pulled towards the tail by the extreme observations; the median is much less affected by the skew
– When data are not normally distributed, prefer the median: it is a better description of the central tendency of the data
– Geometric mean vs arithmetic mean: see the section on logged variables below

Dispersion (how spread out are the data – how much variation?)
– The variance is roughly the average of the squared differences from the mean:
    variance = s² = Σ(observation − mean)² / (number of observations − 1)
– Taking the square root gives the standard deviation:
    SD = s = √(s²)

Standard deviation – normal data
– ≈68% of observations lie within ±1 SD of the mean
– ≈95% of observations lie within ±2 SD of the mean
– ≈99.7% of observations lie within ±3 SD of the mean

Skewed distributions
– Because a skewed distribution is not symmetric, ±SD intervals capture different amounts of the distribution on each side

Measure of dispersion for non-normal data: the interquartile range (IQR)
– Order the values and extract the ones at a given rank (percentile)
– The IQR runs from the 25th to the 75th percentile – the middle 50% of values
– The 50th percentile is the median

Box plot
– Good for displaying non-normally distributed data
– Middle line: the median
– Box: from the 25th to the 75th percentile (the IQR)
– Whiskers: extend 1.5 × IQR beyond the quartiles
    Upper: Q3 + 1.5 × IQR
    Lower: Q1 − 1.5 × IQR
– Skew: judged from whether one half of the box (or one whisker) is larger than the other
– Outliers are plotted individually so they can be visualised

Data presentation
– Normally distributed data: use the mean and standard deviation
– Skewed data: use the median and IQR as the measures of central tendency and dispersion
– A robust statistic is a measure that is not heavily affected by skewness and extreme outliers
– The median and IQR are robust to the influence of outliers because they are based only on the ranks of the observations, not their magnitudes
– They are the best measures of central tendency and spread for non-normal data
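As an illustration (not part of the original notes), here is a minimal Python sketch with made-up data, using the standard library's statistics module, showing how one extreme observation drags the mean and SD while the median and IQR barely move:

    from statistics import mean, median, stdev, quantiles

    data = [2, 3, 3, 4, 5, 5, 6, 7]
    with_outlier = data + [40]   # add one extreme observation

    for label, xs in (("original", data), ("with outlier", with_outlier)):
        q1, _, q3 = quantiles(xs, n=4)   # 25th and 75th percentiles
        print("%-12s mean=%5.2f sd=%5.2f median=%4.1f iqr=%4.2f"
              % (label, mean(xs), stdev(xs), median(xs), q3 - q1))

The mean roughly doubles and the SD increases several-fold, while the median and IQR change only slightly – the robustness the notes describe.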
______________________________________

Categorical summaries & display

Bar chart – categorical data
Histogram – continuous data
– The x-axis of a histogram is a number line; choose the intervals (bins) that best capture the distribution

Relationships and comparisons

Contingency tables – show how two variables are related
– Conditional distribution / relative frequency: can be presented as either row percentages or column percentages
Case-control studies
– Participants are recruited into the study based on whether they have the outcome
– The outcome is fixed by design

______________________________________

Scatter plot

Used to see how two variables covary. Interpreting a scatter plot:
– Is there a relationship?
– What is the direction of the relationship?
– What is the strength of the relationship?

The shape of the relationship may be linear, exponential or quadratic.

Correlation coefficient (r)
– Quantifies the strength of the linear relationship between two variables
– Takes values from −1 to +1:
    r = +1 – perfect positive linear relationship
    r = −1 – perfect negative linear relationship
    r = 0 – no linear relationship
– r has no units
– r tells us nothing about the steepness of the line or how much y changes with x
– r tells us about the amount of scatter around the line
– r only quantifies the strength of the LINEAR relationship between two variables: r = 0 does not always mean there is no relationship, only that there is no linear relationship (it could, for example, be quadratic)
– r should only be used for linear relationships, and only between two continuous variables

______________________________________

Z-scores

Linear transformation – changes the units of a measurement
– The centre or spread of the distribution will change, but the shape of the distribution will not

Z-scores are also known as SD scores
– They let us compare scores from normal distributions with different units
– They allow calculation of the probability of a score occurring in our target population by using the normal probability distribution (a reference range)

A z-score measures the distance of each observation from the mean in units of standard deviation:
    z-score = (observation − mean) / SD
– A z-transformed variable has a mean of 0 and a standard deviation of 1

If the sample data are normally distributed, we would expect the range mean − 1.96 × SD to mean + 1.96 × SD to capture 95% of the observations in the sample. If the sample is representative of the population, this reference range is a useful guide for comparing an individual's value with other people in the population.

Standard normal distribution
– The area under the curve represents the probability of observing z-scores of particular values
– The total area = 1, representing the probability of any z-score

68–95–99.7 rule (three-sigma rule) – lets us make probability statements about observations:
– 68.27% of observations fall within one SD of the mean
– 95.45% of observations fall within two SD of the mean
– 99.7% of observations fall within three SD of the mean

Reporting, e.g. for z = +2: "longer/taller than ___% of [unit of measure]", i.e. roughly the 97th percentile. State the z-score (in SDs), the direction, and the percentage in relation to the unit of measure or the ___th percentile.
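A small sketch of the z-score calculation and the 95% reference range (again not from the notes; the height figures of mean 170 cm and SD 10 cm are invented for illustration), using Python's statistics.NormalDist for the standard normal probabilities:

    from statistics import NormalDist

    # Hypothetical reference data: adult height, mean 170 cm, SD 10 cm
    mu, sd = 170.0, 10.0
    x = 190.0
    z = (x - mu) / sd                     # z-score = (observation - mean) / SD
    pct = NormalDist().cdf(z) * 100       # percentile from the standard normal
    print("z = %.1f; taller than %.1f%% of the population" % (z, pct))

    # 95% reference range: mean ± 1.96 * SD
    print("reference range: %.1f to %.1f cm" % (mu - 1.96 * sd, mu + 1.96 * sd))

This prints z = 2.0 and about 97.7%, matching the "roughly the 97th percentile" statement above.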
______________________________________

Logged variables

Logarithmic scales represent an equal amount of multiplicative change
– They pull low values apart and push high values together

Log transform
– Reduces positive skew (makes the distribution more symmetrical) and makes analysis easier
– Log transformation changes the shape of the distribution

Back transformation – take the antilog to convert the log units back to the original scale
– Only works for individual data points

Geometric mean = the antilog of the mean of a logged variable
– It does not get back to the arithmetic mean
– The geometric mean is a better measure of central tendency than the arithmetic mean when data are positively skewed
– When a variable is positively skewed, the geometric mean is closer to the median than the arithmetic mean is

______________________________________

Describing binary variables (prevalence & incidence)

Prevalence is a type of proportion; incidence is a type of rate
– Prevalence of disease: "% living with HIV/AIDS"
– Incidence of disease: "rate of new HIV cases (per 100,000 per year)"

Prevalence
– The proportion of people in a population who have the disease (or some other quantity) at a particular point in time
    Prevalence = number of people with the disease / total number at risk in the population

Incidence
– The rate of new cases of a disease (or some other quantity) in a population
    Incidence rate = number of new cases of disease / number of person-years at risk of disease

Cumulative incidence (risk)
– The proportion of the population "at risk" of developing a disease who actually get the disease during a specified time period
    Cumulative incidence = number of new cases of disease in the period / number of people initially disease-free

The relationship between prevalence and incidence:
    Prevalence ≈ incidence × average duration of disease

______________________________________

From population to sample

The first step is to define the research question: it clarifies the target population and helps design the study.

The sampling process
– Anecdote: a bad sampling process, likely biased and unrepresentative of the target population
– Simple random sampling: each individual has the same chance of being selected, but this is difficult in practice, so representativeness is often good enough
– Non-responders and dropouts prevent a truly random sample: the analysis sample differs from the selected sample, and dropouts often share similar characteristics
– This is a form of bias called selection bias

Bias: a departure away from the true value we are trying to estimate

______________________________________

Observational and experimental design

Types of study design

Experimental (also called an intervention study)
– Manipulate a variable to study its effect
– Common experiments compare Group A (exposed) with Group B (not exposed), with all other sources of variation between the groups made equal (controlled for)
– Randomised experiment: participants are randomly assigned to treatment groups
– Not always ethical

Observational
– No manipulation: just observe the natural variation in a population
– The absence of random assignment of exposure prevents attribution of causality
– Cohort study: take a random sample > ascertain their consumption or exposure > then follow them up; people are selected before they have the exposure
– Case-control study: select people on the basis of whether they have the outcome, find comparable people without it, then ask about past exposure

Confounding
– Just because we find an association does not mean it is causal: correlation does not imply causation
– The way to remove confounders is to stratify the sample – split the sample into similar groups and compare within each (see the sketch below)
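To make the stratification idea concrete, here is a Python sketch with invented counts (not from the notes) in which a crude comparison suggests an exposure-outcome association, yet the association disappears once the sample is split into strata of a confounder (age group):

    # Hypothetical counts: (cases, total) per exposure group within each
    # stratum of the confounder (age group)
    strata = {
        "young": {"exposed": (2, 20),   "unexposed": (18, 180)},
        "old":   {"exposed": (72, 180), "unexposed": (8, 20)},
    }

    def risk(cases, n):
        return cases / n

    # Crude comparison: pool the strata, ignoring the confounder
    crude = {}
    for grp in ("exposed", "unexposed"):
        cases = sum(strata[s][grp][0] for s in strata)
        total = sum(strata[s][grp][1] for s in strata)
        crude[grp] = risk(cases, total)
    print("crude risk ratio: %.2f" % (crude["exposed"] / crude["unexposed"]))

    # Stratified comparison: within each age group the risks are equal
    for s in strata:
        rr = risk(*strata[s]["exposed"]) / risk(*strata[s]["unexposed"])
        print("%s risk ratio: %.2f" % (s, rr))

The crude risk ratio is about 2.85, but within each stratum it is exactly 1.00: the apparent association is driven entirely by age, not by the exposure.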
_________________________________

Probability

Probability is used to understand and quantify random uncertainty (stochastic variation) in a study
– It provides the theory behind inference: a probability model describes the random process of sampling
– It is used to express risk and chance, e.g. in screening/diagnostic studies and survival studies

Definition of probability
– The long-run proportion of times that the outcome occurs over an indefinitely long series of independent trials

Independence: knowing the outcome of one event does not affect the probability that the other occurs

Contingency tables are good for helping to think clearly about probability
– Marginal probability: the overall probability that each variable takes a particular value – uses the margins of the contingency table
– Joint probability: the probability of two outcomes taking particular values
– Conditional probability: uses the rows of the contingency table

Probability trees can also help in understanding statistics

______________________________________

Point estimates and population parameters

The aim of statistical inference is to make statements about the population using a sample of observations:
– Estimate a population value (what is the best estimate of the relationship between smoking and fetal growth?)
– Estimate the precision of that estimate with a confidence interval (what range of values can I be confident contains the true underlying value for the relationship?)
– Do a hypothesis test with a p-value (could the relationship I have just observed be attributed to chance?)

Estimates and parameters
– Population: the universe to which we wish to generalise. In the population we have (population) parameters: the true, fixed, unknown values we wish to estimate
– Sample: the (finite) study that we perform. From the sample we obtain (point) estimates of these parameters: our best guess of the population parameter
– Point estimates are observed; parameters are unknown

______________________________________

Sampling variation and sampling distribution

Sampling distribution
– Our sample (and therefore our point estimate) generally varies from one sample to another; this variation is captured by the sampling distribution
– Understanding this variation is important so we can understand how much (or how little) our sample tells us about our population

Random error: the difference due to random sampling variation, sometimes referred to as "random error"

Standard error: the standard deviation of the sampling distribution
– How far the typical estimate is from the actual population parameter
– It describes the typical error, or precision, of the point estimate

______________________________________

Bias and precision

Bias and precision are two characteristics of a sample statistic (a point estimate of a population parameter). There are two reasons why an estimate may differ from its population value:
– Precision: the variability of a sample statistic – the random component, e.g. measurement error and biological variability
– Bias: the systematic component, e.g. selection biases

______________________________________

Confidence intervals

Looking at confidence intervals over repeated independent sampling shows the link to the interpretation of a confidence interval as "a range of values which we are 95% confident contains the true population value".

We do not know the true value, but the true value is fixed: either it is inside the confidence interval or it is not.

A confidence interval is a "parameter catcher"
– Its width is driven by the standard error
– Smaller studies have wider confidence intervals

95% CI = point estimate ± 1.96 × standard error
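A minimal sketch of the formula above, with made-up measurements; NormalDist().inv_cdf(0.975) recovers the 1.96 multiplier rather than hard-coding it:

    from math import sqrt
    from statistics import mean, stdev, NormalDist

    sample = [5.1, 4.8, 5.6, 5.0, 4.7, 5.3, 5.5, 4.9, 5.2, 5.4]
    est = mean(sample)
    se = stdev(sample) / sqrt(len(sample))   # standard error of the mean
    z = NormalDist().inv_cdf(0.975)          # = 1.96 for a 95% interval
    print("point estimate: %.2f" % est)
    print("95%% CI: %.2f to %.2f" % (est - z * se, est + z * se))

Note how the √n in the denominator makes the interval narrower as the study gets larger, which is why smaller studies have wider confidence intervals.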
______________________________________

Hypothesis test

The p-value is the probability of observing data as extreme as ours if the null hypothesis is true.

Hypothesis testing:
– Turn the research question of interest into a statistical hypothesis (the null hypothesis)
– Calculate a test statistic (based on the sample data)
– Calculate a p-value using the probability distribution of the test statistic, then interpret it to evaluate the hypothesis

Null hypothesis: no difference or no association
Alternative hypothesis: encapsulates all alternatives to the null – there is a difference/association

The p-value is always conditional on the null being true, so you have to understand the null hypothesis to be able to interpret the p-value.

The test statistic follows a known sampling distribution (a probability distribution) under the null hypothesis, so we use it to work out the probability of seeing a difference as extreme as the one we observed under the null:
    test statistic = (point estimate − null value) / SE(point estimate), which gives the z statistic

The z statistic reflects the distance of our point estimate from the null value, in units of SE
– When the p-value is small, we reject the null hypothesis in favour of the alternative hypothesis
– When the p-value is large, we fail to reject the null hypothesis

The p-value is the probability of observing a difference as extreme as the one observed, assuming the null hypothesis is true. So a p-value greater than 0.05 means there is more than a 1 in 20 chance of seeing a difference between two groups as big as the one observed if the null hypothesis is true. The p-value is thus a measure of the strength of evidence against the null hypothesis. In that case we would say there is insufficient evidence to reject the null hypothesis. This does not mean the null is true; it just means there is an absence of evidence against it. Besides, we can never logically prove a hypothesis, only falsify it.

______________________________________

Central limit theorem and the normal distribution

The central limit theorem (CLT) concerns the distribution of point estimates: given certain conditions, this distribution will be nearly normal even if the sample data are non-normal. It is the sample means and point estimates from the multiple random samples that are normally distributed, not the samples themselves.

Distribution of sample means
– If the distribution of the parent population is normal, then so too is the sampling distribution of the sample mean

Skewed parent population
– The distribution of sample means from a skewed distribution is approximately normal, but this depends on the sample size and the level of skewness

The distribution will be normal if:
– The sample size is sufficiently large, or the population is considered to have a normal distribution
– The observations in the sample are independent
– The data are not excessively skewed

Binomial distribution and its normal approximation
– The binomial distribution is the sampling distribution for the number (or proportion) of binary events
– As the sample size (n) increases, the binomial distribution becomes very close to the normal distribution

Normal distribution
– The only two parameters required to describe the normal probability distribution are its mean and standard error
    point estimate ± (z × SE)
    (z × SE) is the error factor
    z comes from the standard normal distribution (z = 1.96 for 95%)
    SE = s / √n
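A quick simulation of the CLT (an illustration, not from the notes): the parent population below is a right-skewed exponential distribution with population mean 1, yet the means of repeated samples of size 50 pile up roughly symmetrically around 1:

    import random
    from statistics import mean

    random.seed(1)

    def sample_mean(n):
        # One random sample of size n from a skewed (exponential) parent
        return mean(random.expovariate(1.0) for _ in range(n))

    means = [sample_mean(50) for _ in range(10_000)]
    print("mean of the sample means: %.3f" % mean(means))   # close to 1.0
    below = sum(m < 1.0 for m in means) / len(means)
    print("share of means below 1.0: %.2f" % below)         # roughly 0.5

Even though individual exponential observations are heavily skewed (median around 0.69, mean 1), the distribution of the sample means is approximately normal, as the CLT predicts.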
One-sample hypothesis test for a mean:
– Formulate the null hypothesis: µ = null value (µ0)
– Calculate the point estimate (the sample mean): x̄
– Check assumptions: independence; sufficiently large sample (n > 30); not extremely skewed
– Calculate the test statistic: z = (x̄ − µ0) / SE
– Calculate the p-value and interpret it to evaluate the hypothesis

Interpretation template: "There is a _._% chance of observing a difference as big as ___ (units) between the first and second variable if the null hypothesis of no difference is true."

Estimating a proportion
– Population parameter of interest: π
– Point estimate: p (the sample proportion)
– SD = √(π(1 − π) / n)

______________________________________

The link between hypothesis tests & confidence intervals

A confidence interval gives a range of plausible values that we are XX% confident will contain the true parameter:
    CI = point estimate ± (z × SE)

A p-value tells us the strength of evidence against the null hypothesis:
    z = (point estimate − null value) / SE

If the 95% CI does not contain the null value, then the p-value will be lower than 0.05.

______________________________________

Comparing means

The t-distribution is used in a variety of statistical tests designed for situations where the population standard deviation (σ) is unknown. These tests are typically employed when working with small sample sizes or when estimating parameters.
– It uses a different (larger) multiplier than 1.96 (2.4980 in the lecture example)
– The t-distribution is similar to the normal distribution but has heavier tails
– The degrees of freedom of a t-distribution determine how heavy the tails are
– A t-distribution with high degrees of freedom (n > 50) is very close to the normal distribution

Unknown population standard deviation
– When σ is unknown, the sample standard deviation (s) is used as an estimate
– This introduces additional uncertainty into the standard error of the mean, requiring the use of the t-distribution, which accounts for this variability

Small sample size
– For small sample sizes (n < 30 or so), s is a less reliable estimate of σ, so the t-distribution rather than the normal distribution should be used
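To close the loop, here is a sketch of a one-sample t-test with invented data, assuming SciPy is available (scipy.stats.ttest_1samp); the manual calculation underneath mirrors the z = (x̄ − µ0) / SE formula, just with s in place of σ and the t-distribution supplying the p-value:

    from math import sqrt
    from statistics import mean, stdev
    from scipy import stats

    sample = [5.1, 4.8, 5.6, 5.0, 4.7, 5.3, 5.5, 4.9]   # made-up data
    mu0 = 5.0                                            # null value

    t_stat, p_value = stats.ttest_1samp(sample, popmean=mu0)
    print("t = %.2f, p = %.3f" % (t_stat, p_value))

    # By hand: t = (x̄ - µ0) / (s / √n), with n - 1 degrees of freedom
    n = len(sample)
    t_manual = (mean(sample) - mu0) / (stdev(sample) / sqrt(n))
    print("t by hand = %.2f" % t_manual)

With only 8 observations, the p-value comes from a t-distribution with 7 degrees of freedom, whose heavier tails give a larger p-value than the normal distribution would for the same test statistic.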