Statistics and Data Classification Quiz

Questions and Answers

What does the null hypothesis represent?

  • No difference or no association (correct)
  • A specific probability distribution
  • There is a difference or association
  • An alternative hypothesis

A smaller p-value indicates insufficient evidence to reject the null hypothesis.

False (B)

The p-value is the probability of observing a difference as extreme as what was observed, assuming the _______ is true.

null hypothesis

Match the following terms with their definitions:

  • Null hypothesis = Indicates no difference or association
  • Alternative hypothesis = Suggests there is a difference or association
  • Z statistic = Measures distance from the null value
  • P-value = Probability of observing a difference assuming null is true

Which of the following is NOT a type of categorical variable?

Continuous (A)

A box plot is best used for displaying normally distributed data.

False (B)

What is the purpose of using the interquartile range (IQR) in data analysis?

To measure dispersion and identify the middle 50% of values.

The _____ is the most common value in a data set.

mode

Which of the following describes a positive skew in a distribution?

Rises early and tails off to the right (A)

When data is normally distributed, the mean and median will be different.

False (B)

What are the two measures commonly used to describe central tendency?

Mean and median.

In a graph representing normal distribution, approximately _____% of observations lie within +/- 2 standard deviations of the mean.

95

Which of the following best describes a bimodal distribution?

Two peaks (C)

What is a point estimate?

A best guess of the population parameter derived from a sample (D)

Sampling distribution captures the fixed nature of population parameters.

False (B)

What does standard error indicate?

The typical deviation of a sample statistic from the actual population parameter.

A confidence interval provides a range of values which we are ____% confident contains the true population value.

95

Match the terms with their definitions:

  • Bias = Systematic difference from the true population value
  • Random error = Variability due to random sampling
  • Standard error = Measure of accuracy of a point estimate
  • Confidence interval = Range of values estimating population parameter

What does a p-value represent in hypothesis testing?

The probability of observing extreme data if the null hypothesis is true (A)

A wider confidence interval indicates higher precision in estimating the population parameter.

False (B)

What is the primary cause of bias in estimates?

Systematic components such as selection biases.

Which of the following statements regarding the Central Limit Theorem (CLT) is true?

The distribution of sample means will be normal if the sample size is large enough. (B)

A sufficiently large sample size can result in a normal distribution from a skewed parent population.

True (A)

What formula is used to calculate the standard error (SE)?

SE = s / √n

The z-value used for a hypothesis test at a 95% confidence level is ____.

1.96

What does a p-value inform us about in hypothesis testing?

The strength of evidence against the null hypothesis (C)

If the 95% confidence interval does not contain the null value, the p-value is greater than 0.05.

False (B)

What is the formula for the standard deviation when estimating a proportion?

√(π(1-π)/n)

The t-distribution is useful when the __________ standard deviation is unknown.

population

Match the following statistical terms with their descriptions:

  • p-value = Strength of evidence against the null hypothesis
  • Confidence Interval = Range of plausible values for a parameter
  • Standard Error = Estimate of variability in sample means
  • t-distribution = Distribution used when population standard deviation is unknown

What does a correlation coefficient (r) of 0 indicate?

No linear relationship (A)

Z-scores measure the distance of each observation from the median in units of standard deviation.

False (B)

What is the primary purpose of a scatter plot?

To see how two variables covary.

The prevalence of a disease is defined as the proportion of people in a population that has the disease at a particular point in time, calculated as number of people with the disease divided by _____.

total number at risk in the population

Which of the following is true about z-scores?

Z-scores can compare scores from normal distributions with different units. (D)

A positive skew indicates that the mean is less than the median.

False (B)

What does the area under the curve in a standard normal distribution represent?

Probability of observing z-scores of particular values.

The ____ rule states that about 95% of observations fall within two standard deviations of the mean.

68-95-99.7

Which statement about log transformation is correct?

It reduces positive skew and simplifies analysis. (D)

Conditional distribution can be represented as either row or column percentages in a contingency table.

True (A)

What is the geometric mean used for?

To measure central tendency for positively skewed data.

The cumulative incidence is calculated as the number of new cases of a disease divided by the number of people initially _____.

disease-free

What is indicated by an r value of -1?

Perfect negative linear relationship (D)

Case control studies focus on groups based on whether they have the outcome of interest.

True (A)

Flashcards

Data Variable Types

Data variables can be numerical (discrete or continuous) or categorical (ordinal or nominal). Binary/dichotomous variables have two categories.

Derived Variables

Calculated from other variables using thresholds or cutoffs.

Transformed Variables

Variables changed using methods like log transformations or standardized scores.

Frequency Distribution

Shows how often each data point occurs.

Histogram

A chart showing the distribution of data.

Skewness

Measure of asymmetry in data distribution.

Central Tendency (Mean)

Average of all data points, vulnerable to outlier effects.

Central Tendency (Median)

Middle value when data is ordered, less sensitive to extreme values.

Standard Deviation

Measures the spread/variation of data around the mean. It is the square root of the variance.

Interquartile Range (IQR)

Range of the middle 50% of the data, robust to outliers.

Population Parameter

The true, fixed, unknown value in a population we want to estimate.

Null Hypothesis

A statement of no difference or no association between variables.

Sample Estimate

Our educated guess for a population parameter, based on a sample.

P-value

The probability of observing a difference as extreme as the one seen, assuming the null hypothesis is true.

Sampling Distribution

The distribution of possible values a sample statistic can take on if we repeat many studies or samples.

Test Statistic (Z-statistic)

A measure of how far the observed data deviates from the null hypothesis's expected value in terms of standard error.

p-value < 0.05

Reject the null hypothesis; strong evidence against the null.

Standard Error

The standard deviation of the sampling distribution, showing the typical error or expected variability of an estimate.

p-value > 0.05

Fail to reject the null hypothesis; insufficient evidence against the null.

Bias

Systematic error in estimation, leading to consistently higher or lower estimates.

Precision

The variability of a sample statistic in a study. A measure of how close repeated estimates are to each other.

Confidence Interval

A range of values we are confident contains the true population parameter, along with its estimated precision.

P-value

The probability of observing a sample as extreme or more extreme than the one observed, if the null hypothesis is true.

Central Limit Theorem (CLT)

The distribution of sample means from many random samples will be approximately normal, even if the original data isn't normal, under specific conditions.

Normal Distribution of Sample Means

If the population distribution is normal, the distribution of sample means will also be normal, regardless of sample size.

Skewed Population and Sample Means

Even if the population is skewed, the distribution of sample means from many random samples will be approximately normal, especially with larger sample sizes.

Normal Approximation of Binomial

As sample size increases, a binomial distribution approaches a normal distribution.

Error Factor (z x SE)

The error factor is used to calculate the confidence interval: it is added to and subtracted from the point estimate.

Confidence Interval (CI)

A range of plausible values that we are confident contains the true population parameter.

T-distribution

A probability distribution used for hypothesis testing when the population standard deviation is unknown, especially for small samples.

Degrees of Freedom (df)

The number of independent pieces of information in a sample, used in the t-distribution. Higher df means a t-distribution closer to normal.

Sample Standard Deviation (s)

An estimate of the population standard deviation, used when the true population standard deviation is unknown.

Categorical data

Data that represents categories or groups, not numerical values.

Contingency table

A table showing the relationship between two categorical variables.

Conditional distribution

The distribution of one variable based on a certain value of another variable.

Case-control study

A study where individuals are selected based on whether they have a certain outcome.

Scatter plot

A graph showing the relationship between two continuous variables.

Correlation coefficient (r)

A measure of the strength and direction of a linear relationship between two variables.

Z-score

A measure of how many standard deviations an observation is from the mean.

Standard normal distribution

A normal distribution with a mean of 0 and standard deviation of 1.

68-95-99.7 rule

States the percentage of observations within one, two, and three standard deviations of the mean in a normal distribution.

Logarithmic scale

A scale where equal distances represent equal multiplicative changes.

Log transformation

A method to reduce positive skew and make the data more symmetrical.

Prevalence

Proportion of a population with a specific characteristic at a given time.

Incidence

Rate of new cases of a disease or condition in a population over a period of time.

Cumulative incidence

Proportion of initially disease-free individuals in a population that get a disease over a period.

Study Notes

Defining Data

  • Classify data as numerical or categorical
  • Numerical data can be further classified as discrete or continuous
  • Categorical data can be further classified as ordinal or nominal
  • Binary/dichotomous data has two possible values
  • Derived variables are calculated from other variables, for example by applying a threshold or cutoff
  • Transformed variables involve transformations like log transformations or standardized scores

Outcome and Exposure

  • Outcome variables are response or dependent variables (Y)
  • Exposure variables are explanatory or independent variables (X)
  • Case-control groups are defined by outcome status (cases have the outcome, controls do not)
  • Treatment groups are exposure dependent
  • Predictor is the exposure variable

Descriptive Statistics

  • Frequency distributions show the frequency of each data value.
  • Histograms are bar graphs showing the frequency of data within ranges.
  • Bin width is the size of each bar in a histogram
  • Frequency represents the number of data points in a bin
  • Range is the difference between highest and lowest data values
  • Mode is the most frequent data value
  • Density is the frequency normalized so that the area under the chart equals one.
  • Skewness describes the asymmetry of the distribution.
    • Positive skew (right-hand skew) - data tailing off to the right
    • Negative skew (left-hand skew) - data tailing off to the left
    • Normal distribution has no skew.
  • Modality describes how many peaks the graph has
    • Unimodal (one peak)
    • Bimodal (two peaks)
    • Multimodal (multiple peaks)
    • Uniform data (every value about equally frequent) has no peak and appears as a flat line

Sample Statistics

  • Central tendency measures the center of data.
    • Mean is the average of all values (Σ X / N)
    • Median is the middle value when sorted.
    • Mode is the most frequent value.
  • When data is not normally distributed, the median is preferred.
  • Dispersion measures the spread of data.
    • Variance calculates the average of squared differences from the mean.
    • Standard deviation is the square root of variance.
  • 68.27% of observations fall within one standard deviation of mean.
  • 95.45% of observations fall within two standard deviations of mean.
  • 99.7% of observations fall within three standard deviations of mean.
  • Geometric mean is a better measure of central tendency for positively skewed data
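
The measures listed above can be computed with Python's standard `statistics` module. This is a minimal sketch on a made-up, positively skewed sample; the values are hypothetical. The variance here follows the definition in the notes (dividing by N), whereas many courses use the sample version that divides by n - 1.

```python
import statistics

# Hypothetical, positively skewed sample (values chosen only for illustration).
values = [1, 2, 2, 3, 4, 8, 16]

mean = statistics.mean(values)                # sum of values / number of values
median = statistics.median(values)            # middle value of the sorted data
mode = statistics.mode(values)                # most frequent value
variance = statistics.pvariance(values)       # average squared difference from the mean
std_dev = statistics.pstdev(values)           # square root of the variance
geo_mean = statistics.geometric_mean(values)  # less affected by the long right tail

print(f"mean={mean:.2f} median={median} mode={mode}")
print(f"variance={variance:.2f} sd={std_dev:.2f} geometric mean={geo_mean:.2f}")
```

In this sample the mean is pulled above the median by the large values in the right tail, which is why the median or geometric mean is often preferred for such data.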

Measures of Dispersion

  • Interquartile range (IQR) is the difference between the 75th and 25th percentiles.
  • Box plot visually represents distribution, showing the median, quartiles and outliers.
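
As a rough sketch of how the IQR is obtained, the example below computes the quartiles of a hypothetical sample. The 1.5 × IQR rule used to flag outliers is a common box plot convention and is an assumption here, not something stated in these notes. Quartile values can also differ slightly between calculation methods.

```python
import statistics

# Hypothetical sample with one extreme value (illustrative only).
data = [2, 4, 4, 5, 6, 7, 8, 9, 30]

# quantiles(..., n=4) returns the three cut points Q1, Q2 (median), Q3.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Assumed convention: flag points beyond 1.5 * IQR from the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(q1, q2, q3, iqr, outliers)
```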

Categorical Summaries and Displays

  • Bar charts are for categorical data
  • Histograms are for continuous data
  • Contingency tables present the relationship between two categorical variables
  • Conditional distribution is the distribution of one variable at a given value of another variable
  • Relative frequency distribution can use row percentages or column percentages
  • Case-control studies select participants based on whether they have the outcome.

Scatter Plot

  • Scatter plot shows relationship between two co-varying variables
  • Evaluate relationship direction and strength

Correlation Coefficient

  • Quantifies the linear relationship strength between two variables.
  • r takes values from -1 to +1
  • r = 1 perfect positive linear relationship
  • r = -1 perfect negative linear relationship
  • r = 0 no linear relationship
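
The three reference values of r can be checked with a small hand-written function. This is an illustrative sketch; `pearson_r` is a hypothetical helper name and the data are made up. The last case shows that r = 0 means no linear relationship, not necessarily no relationship at all.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products of deviations, and sums of squared deviations.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

x = [-2, -1, 0, 1, 2]
r_positive = pearson_r(x, [2 * v + 1 for v in x])   # straight line, rising
r_negative = pearson_r(x, [-3 * v for v in x])      # straight line, falling
r_curved = pearson_r(x, [v ** 2 for v in x])        # U-shape: related, but not linearly

print(r_positive, r_negative, r_curved)
```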

Z-scores

  • Linear transformation of a measurement
  • Centre and spread change, but the shape of the distribution does not
  • Compare scores from different normal distributions
  • Reference range calculation
  • A Z-score measures the distance from the mean, in standard deviation units.
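
A short sketch of the z-score calculation, comparing scores from two hypothetical exams measured on different scales. The exam numbers are invented, and the reference range of mean ± 1.96 standard deviations assumes the measurement is normally distributed.

```python
def z_score(x, mean, sd):
    """Distance of x from the mean, in standard deviation units."""
    return (x - mean) / sd

# Hypothetical exams with different scales (illustrative numbers only).
z_exam_a = z_score(85, mean=70, sd=10)   # 1.5 SDs above the mean
z_exam_b = z_score(60, mean=50, sd=5)    # 2.0 SDs above the mean

# Assumed 95% reference range for a normally distributed measurement:
# mean +/- 1.96 standard deviations.
ref_low = 70 - 1.96 * 10
ref_high = 70 + 1.96 * 10

print(z_exam_a, z_exam_b, (ref_low, ref_high))
```

Although 85 is the larger raw score, the score of 60 on the second exam sits further above its own mean once both are expressed in standard deviation units.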

Standard Normal Distribution

  • Displays probability of observing a z-score.
  • Total area under the curve is one.

Logged Variables

  • Logarithmic scales represent equal multiplicative change
  • Pulls low values apart and high values together
  • Log transformation reduces positive skew and eases analysis
  • Back transformation converts log values back to original scale
  • Geometric mean is suitable for positively skewed data.
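
The link between log transformation, back transformation and the geometric mean can be sketched as follows. The values are hypothetical, and base 10 is an arbitrary choice; any base gives the same geometric mean as long as the back transformation uses the same base.

```python
import math
import statistics

# Hypothetical positively skewed values (illustrative only).
values = [1, 10, 100, 1000]

logged = [math.log10(v) for v in values]   # equal steps on the log scale
mean_log = statistics.mean(logged)         # mean on the log scale
geo_mean = 10 ** mean_log                  # back-transform to the original scale
arith_mean = statistics.mean(values)

print(logged, mean_log, geo_mean, arith_mean)
```

Each tenfold increase becomes one equal step on the log scale, and back-transforming the mean of the logs gives the geometric mean, which is far smaller than the arithmetic mean for these data.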

Prevalence and Incidence

  • Prevalence is the proportion of the population with a disease at a specific time.
  • Incidence is the rate of new cases of a disease during a specified period.
  • Cumulative incidence (also risk) is the proportion getting the disease in a specific time period.
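
A worked example may help separate the two proportions. The counts below are hypothetical. Prevalence is taken here over the whole population, and cumulative incidence over those who are disease-free at the start.

```python
# Hypothetical population counts (illustrative only).
population = 1000          # people in the population at the start
existing_cases = 50        # people who already have the disease at that time
new_cases = 19             # new cases that occur during one year of follow-up

prevalence = existing_cases / population

# Only people who are disease-free at the start can become new cases.
disease_free_at_start = population - existing_cases
cumulative_incidence = new_cases / disease_free_at_start

print(f"prevalence={prevalence:.3f}, cumulative incidence={cumulative_incidence:.3f}")
```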

Observational and Experimental Designs

  • Experimental studies (interventional) manipulate a variable to observe its effect.
  • Observational studies observe natural variation in a population.
  • Cohort and case-control studies are subtypes of observational studies
  • Cross-sectional studies collect data at a single point in time.

Sample Size and Power/Regression/Systematic Review

  • Sample size and power analysis inform the needed sample size to detect an effect.
  • Regression models examine relationships between variables.
  • Systematic review synthesizes findings from multiple studies.

Meta-analysis

  • Meta-analysis combines results from multiple studies.

Hypothesis Testing

  • Null hypothesis states no difference or association.
  • Alternative hypothesis asserts a difference or association.
  • P-value is the probability of observing results as extreme as, or more extreme than, those observed, if the null hypothesis is true.
  • Test statistic measures how far the data are from the null.
  • Degrees of freedom affect the shape of the distribution and are often related to sample size.
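
As a sketch of how a test statistic turns into a p-value, the example below uses a z statistic and the standard normal distribution. The helper name `two_sided_p_value` and the input numbers are hypothetical, and a two-sided test under a normal approximation is assumed.

```python
from statistics import NormalDist

def two_sided_p_value(estimate, null_value, standard_error):
    """z statistic and two-sided p-value under a normal approximation."""
    z = (estimate - null_value) / standard_error
    # Probability of a z at least this far from 0, in either direction.
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p

# Hypothetical numbers: observed difference 2.5, null value 0, SE 1.0.
z, p = two_sided_p_value(2.5, 0.0, 1.0)
print(f"z={z:.2f}, p={p:.4f}")
```

A z statistic of 1.96 gives a two-sided p-value of about 0.05, which is why 1.96 appears as the cutoff for a 95% confidence level.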

Point Estimates and Parameters

  • Statistical inference uses samples to make statements about populations.
  • Point estimates (sample values) are best guesses for population parameters (unknown values)
  • Confidence interval estimates the range likely encompassing a specific population parameter.

Confidence Intervals

  • Confidence intervals provide range of likely values for population parameter
  • Widens with decreasing sample size.
  • 95% CI means that, in repeated sampling, 95% of such estimated ranges will contain true value.
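
The effect of sample size on interval width can be illustrated with SE = s / √n and the error factor z × SE. This is a minimal sketch with hypothetical summary values; `ci95_mean` is an illustrative helper, and it uses z = 1.96 throughout, whereas a t-distribution multiplier would normally be used for small samples.

```python
import math

def ci95_mean(sample_mean, s, n, z=1.96):
    """Approximate 95% confidence interval for a mean: estimate +/- z * SE."""
    se = s / math.sqrt(n)      # standard error of the mean
    margin = z * se            # the "error factor" added and subtracted
    return sample_mean - margin, sample_mean + margin

# Hypothetical sample summaries (illustrative only).
small = ci95_mean(100, s=15, n=25)     # SE = 3.0
large = ci95_mean(100, s=15, n=100)    # SE = 1.5

print(small, large)
```

Quadrupling the sample size halves the standard error, so the interval for n = 100 is half as wide as the one for n = 25.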

Sampling Considerations

  • Random errors reflect variability in repeated sampling
  • Standard error estimates how much point estimates deviate from population parameters during repeated sampling.

Statistical Assumptions and Methods

  • Methods like t-tests have underlying assumptions like data normality, independence, and homogeneity of variances.
  • Using correct statistical method and applying appropriate corrections is critical
  • Non-parametric methods may be needed when assumptions cannot be met.

Chi-squared Test of Independence

  • Assesses independence between two categorical variables.
  • Examines if there's an association between variables.
  • Calculations involve (observed - expected)²/ expected for all cells.
  • Assumes no more than 20% of cells have an expected count below 5.
  • Degrees of freedom (df) are calculated as (rows - 1) × (columns - 1)
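
The calculation described above can be sketched in a few lines. The function name `chi_squared` and the table of counts are hypothetical; expected counts are computed as row total × column total / grand total, which is the usual expectation under independence.

```python
def chi_squared(table):
    """Chi-squared statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected

    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df

# Hypothetical 2 x 2 table of observed counts (illustrative only).
observed = [[20, 30],
            [30, 20]]
statistic, df = chi_squared(observed)
print(statistic, df)
```

Here every expected count is 25, so each cell contributes (5)² / 25 = 1 and the statistic is 4 on 1 degree of freedom.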

Clinical Trials

  • A clinical trial is a specific type of experimental study.

Risk Difference

  • The difference in probabilities between two groups.

Risk Ratio

  • The ratio of probabilities between two groups.

Odds Ratio

  • The ratio of odds between two groups.
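
The three measures above can be compared on a single hypothetical 2 × 2 table. The counts are invented for illustration, and the groups are assumed to be an exposed and an unexposed group with a binary outcome.

```python
# Hypothetical 2 x 2 counts (illustrative only).
exposed_cases, exposed_total = 30, 100
unexposed_cases, unexposed_total = 10, 100

risk_exposed = exposed_cases / exposed_total          # 0.30
risk_unexposed = unexposed_cases / unexposed_total    # 0.10

risk_difference = risk_exposed - risk_unexposed
risk_ratio = risk_exposed / risk_unexposed

# Odds = probability of the event / probability of no event.
odds_exposed = risk_exposed / (1 - risk_exposed)
odds_unexposed = risk_unexposed / (1 - risk_unexposed)
odds_ratio = odds_exposed / odds_unexposed

print(f"RD={risk_difference:.2f} RR={risk_ratio:.2f} OR={odds_ratio:.2f}")
```

With these counts the odds ratio (about 3.86) is further from 1 than the risk ratio (3.0), which is typical when the outcome is not rare.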

Rank-Based Tests

  • Non-parametric tests rank observations
