Biostatistics Lectures 1-4 Summary

Questions and Answers

What is the median in a data set?

  • The difference between the highest and lowest value
  • The average of all values divided by the number of values
  • The value that separates the lower half from the upper half (correct)
  • The most frequently occurring value

Which measure of spread is defined as the square root of the variance?

  • Interquartile range
  • Standard deviation (correct)
  • Mean absolute deviation
  • Range

Which of the following statements about outliers is true?

  • Outliers are calculated as any values further from the mean.
  • Outliers are values within the interquartile range.
  • Outliers lie beyond the whiskers in a box plot. (correct)
  • Outliers must always be removed from the dataset.

What does the interquartile range (IQR) represent?

  • The range of the middle 50% of the data (correct)

What is true about the mode of a dataset?

  • It can be more than one value if there are multiple most frequent values. (correct)

What does a p-value less than 0.05 in a Shapiro-Wilk test indicate?

  • There is a significant deviation from normality. (correct)

According to the central limit theorem, what happens to the means of repeated random samples?

  • They will be approximately normally distributed for sufficiently large samples, regardless of the population distribution. (correct)

What defines the standard error (SE) in the context of the central limit theorem?

  • The standard deviation of the distribution of sample means. (correct)

In a Q-Q plot, what does it indicate if the points lie close to the diagonal line y = x?

  • The variable is likely normally distributed. (correct)

What are the components of the distribution of a variable in the sample?

  • Mean x̄, standard deviation s. (correct)

What symbols are used to represent the sample mean and sample variance?

  • X̄ and S² (correct)

In what type of data are median and quantiles considered more appropriate?

  • Skewed or non-normal data (correct)

What does a probability distribution function f(x) specify?

  • The probabilities of different values of a random variable (correct)

What is a characteristic of the Normal distribution?

  • Mean, median, and mode are the same (correct)

Which statement about z-scores is correct?

  • They indicate the number of standard deviations a value is from the mean (correct)

What percentage of probability is included between -1σ and 1σ in a Normal distribution?

  • 68% (correct)

What type of variable is a random variable?

  • Its values are determined by random phenomena (correct)

What is the empirical description of probability distributions based on?

  • Measures used for frequency distributions (correct)

What does a relative frequency table provide?

  • The percentage of participants in each category (correct)

In what situation are contingency tables primarily used?

  • To analyze the relationship between two categorical variables (correct)

Which of the following is NOT appropriate for frequency tables?

  • Continuous variables with many distinct values (correct)

When using a bar plot to illustrate categorical variables, which is essential?

  • Bars should be the same width with space in between (correct)

What does it suggest when the category-specific proportions in a contingency table are similar to the marginal proportions?

  • The two variables are not associated (correct)

What is the key distinction between a histogram and a bar plot?

  • Histograms have no space between bars while bar plots do (correct)

In a right skewed histogram, which of the following is true?

  • Mean is greater than the median (correct)

Cumulative frequencies can be applied to which type of variable?

  • Discrete numeric and ordinal variables (correct)

What does the Wilcoxon-Mann-Whitney test assess?

  • Whether two samples come from the same distribution (correct)

What is represented by a p-value less than 0.05 in hypothesis testing?

  • There is evidence against the null hypothesis, suggesting significance (correct)

Which statement accurately describes the null hypothesis (H0)?

  • Sample means differ due to random error alone (correct)

What is the formula used to calculate the 95% Confidence Interval for a mean?

  • CI = X̄ ± Z × SE (correct)

What does a difference between population means of d = 0 indicate?

  • Both population means are the same (correct)

What is the z-value associated with a 95% confidence level?

  • 1.96 (correct)

In the context of proportions, what distinguishes a ratio X/Y?

  • The numerator is not included in the denominator, unlike in a proportion (correct)

Which of the following assumptions must hold for the t-test?

  • Both samples must be independent. (correct)

When calculating the standard error for the mean, what is the formula used?

  • SE = s/√n (correct)

What should be used instead of the z-distribution for smaller samples?

  • t-distribution (correct)

Which of the following approaches can be used to test the assumption of normality?

  • Q-Q plots and Shapiro-Wilk test (correct)

What can be stated about the confidence interval in relation to the true mean µ?

  • It gives a range that is likely to contain µ, at the stated confidence level. (correct)

What happens when the assumptions for the t-test are not satisfied?

  • Alternative statistical methods should be considered. (correct)

Flashcards

Frequency Table

Displays the number of participants in each category of a variable.

Relative Frequency Table

Shows the percentage of participants in each category.

Contingency Table

Table showing the relationship between two categorical variables.

Categorical Variables

Variables whose values are categories rather than numbers (e.g., blood type, educational level).

Pie Chart

Visualizes relative frequencies with slices representing percentages.

Bar Plot

Displays categorical frequency using bars.

Histogram

Graphical display for numerical data with bins.

Skewed Distribution

A distribution that is not symmetrical; a right-skewed distribution has a longer tail toward high values.

Mean

The average of a set of values, calculated by summing all values and dividing by the count.

Median

The middle value when a dataset is ordered. It's less affected by outliers than the mean.

Interquartile Range (IQR)

The difference between the 75th and 25th percentiles. A measure of data spread.

Outlier

A data point significantly different from other data points in a dataset.

Frequency Distribution

A visual or mathematical representation of data. Shows how often data items fall within specific ranges.

Normal Distribution

A common probability distribution where data points cluster around a central value, forming a bell shape.

Shapiro-Wilk test

A statistical test used to determine if a dataset comes from a normally distributed population.

Q-Q Plot

A graphical tool used to assess whether a dataset is normally distributed. The sample quantiles are plotted against the quantiles of a normal distribution.

Central Limit Theorem

For sufficiently large samples, the sampling distribution of the sample mean is approximately normal, even if the original population is not normally distributed.

Standard Error (SE)

The standard deviation of the sampling distribution of sample means, indicating the variability of sample means around the population mean.

Population Mean

The average value of a variable in an entire population; symbolized by µ.

Population Variance

A measure of how spread out the values in a population are from the population mean. Symbolized by σ².

Sample Mean

The average value of a variable calculated from a sample of the population; symbolized by X̄.

Sample Variance

A measure of how spread out the values in a sample are from the sample mean. Symbolized by S².

Normal Distribution

A symmetrical probability distribution described by its mean (µ) and standard deviation (σ), often used to model continuous variables in populations.

z-score

A value that tells you how many standard deviations a data point is from the mean in a normal distribution, calculated as (X-µ)/σ.

Probability Distribution

A function that describes the probabilities of different outcomes for a random variable.

Random Variable

A variable whose value depends on the outcome of a random experiment or phenomenon. It's often symbolized with an uppercase letter like X.

Confidence Interval (CI)

A range of values that likely contains the true value of a population parameter (like the mean), at a specified confidence level (e.g., 95%).

Standard Error (SE)

The standard deviation of the sampling distribution of a sample statistic (like the mean).

95% Confidence Interval

An interval constructed so that, across repeated samples from a population, about 95 out of 100 such intervals are expected to contain the true mean value.

t-distribution

Used to estimate population parameters for smaller sample sizes, when the population standard deviation is unknown.

One-sample t-test

A statistical method based on the t-distribution for inference about a single population mean, including estimating its 95% confidence interval.

Sample size (n)

The number of observations in a sample, influencing the width of the confidence interval.

Assumptions of t-tests

Requirements for the validity of a t-test, typically normality, independence, and equal variances in respective populations.

Z-value

A critical value for a given confidence level from the standard normal distribution.

Wilcoxon-Mann-Whitney Test

A non-parametric test of whether two samples come from the same distribution. Unlike the parametric t-test, it does not assume normality.

Null Hypothesis (H0)

The assumption that there's no difference between two population means.

Alternative Hypothesis (H1)

The claim that there is a real difference between two population means.

P-value

The probability of observing results as extreme as, or more extreme than, the ones obtained, if the null hypothesis is true.

Statistical Significance (p<0.05)

The probability value (p) is less than 0.05, which indicates the observed results are unlikely to occur if the null hypothesis were true, leading to rejection of the null.

Study Notes

Biostatistics Lectures 1-4 Summary

  • Biostatistics is the collection, classification, analysis, and interpretation of data from biomedical research. It helps create medical knowledge.
  • Science is empirical, relying on observations and experiences. Inductive reasoning draws general conclusions from specific observations.
  • Basic research and clinical research are interconnected. Clinical research uses randomized studies to avoid biased results.
  • In biostatistics, samples are studied because they are measurable subsets of populations. The sample itself is not the main focus.
  • Samples are used to make inferences about larger populations. Larger samples are more likely to reflect the population accurately. Small samples are more prone to random error, and biased samples misrepresent the population systematically; either can lead to inaccurate conclusions about the population.
  • Random error (sampling error) is the difference between the sample mean and the population mean due to sampling.
  • Sample quantities are known, measured values. Population quantities are unknown and have to be estimated.
  • Clinical research involves study design, data collection, data processing, and data analysis.
  • Statistical software, like R, facilitates reproducible research, which is a crucial aspect of science. Reproducible research means the results can be verified and repeated by others.
  • Reproducible research requires using code throughout the process.
  • Rectangular data is used in most studies. This is a tabular structure where rows represent cases (observations, records) and columns represent characteristics or variables.
  • A primary key is a unique identifier for each case.

Types of Variables

  • Categorical variables:
    • Nominal: No inherent order (e.g., blood type, sex).
    • Ordinal: Inherent order (e.g., educational level, satisfaction).
    • Dichotomous: Only two categories (e.g., yes/no, diseased/healthy).
  • Numerical variables:
    • Continuous: Measured on a continuum with infinite possible values (e.g. temperature).
    • Discrete: Counted, taking only distinct (typically whole-number) values (e.g., number of children).

Frequency Tables

  • Frequency tables present the number or percentage of participants in each category.
  • Useful for categorical variables and grouped numeric data like "age group".
  • Can also be used for ordinal variables.
  • Cumulative frequencies can be displayed for ordered variables (ordinal or discrete numeric) if the number of categories is limited.
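
As a minimal sketch of the tables described above, counts, percentages, and cumulative counts can be computed with Python's standard library. The data are made up for illustration, and the helper names `frequency_table` and `cumulative_frequencies` are hypothetical.

```python
from collections import Counter

# Hypothetical data: blood types of 10 participants (made up for illustration).
blood_types = ["A", "O", "B", "O", "A", "O", "AB", "A", "O", "B"]

def frequency_table(values):
    """Return {category: (count, relative frequency in %)}."""
    counts = Counter(values)
    n = len(values)
    return {cat: (k, 100 * k / n) for cat, k in sorted(counts.items())}

def cumulative_frequencies(values, ordered_categories):
    """Cumulative counts for an ordered variable, in the given category order."""
    counts = Counter(values)
    running, result = 0, []
    for cat in ordered_categories:
        running += counts[cat]
        result.append((cat, running))
    return result

table = frequency_table(blood_types)
for cat, (count, pct) in table.items():
    print(f"{cat:>2}: n={count}, {pct:.0f}%")

# Hypothetical ordinal variable: satisfaction on a 3-level scale.
satisfaction = ["low", "high", "medium", "medium", "high", "low", "medium"]
cumulative = cumulative_frequencies(satisfaction, ["low", "medium", "high"])
print(cumulative)
```

The category order is passed explicitly for the cumulative version because cumulative counts only make sense when the categories have an inherent order.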

Contingency Tables

  • Contingency tables (cross-tabulations) are used to explore the association between two categorical variables (e.g., exposure and outcome).
  • Marginal totals are the row and column sums of the table.
  • If the category-specific proportions are similar to the marginal proportions, the variables appear not to be associated.
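
The comparison of category-specific and marginal proportions can be sketched as follows. The exposure and outcome records are invented for illustration, and `row_proportion` is a hypothetical helper.

```python
from collections import Counter

# Hypothetical (exposure, outcome) records, made up for illustration.
records = (
    [("exposed", "diseased")] * 30 + [("exposed", "healthy")] * 70
    + [("unexposed", "diseased")] * 10 + [("unexposed", "healthy")] * 90
)

cells = Counter(records)                      # joint counts (the table cells)
row_totals = Counter(e for e, _ in records)   # marginal totals for exposure
col_totals = Counter(o for _, o in records)   # marginal totals for outcome
n = len(records)

def row_proportion(exposure, outcome):
    """Proportion with `outcome` among cases with the given `exposure`."""
    return cells[(exposure, outcome)] / row_totals[exposure]

marginal_diseased = col_totals["diseased"] / n
p_exposed = row_proportion("exposed", "diseased")
p_unexposed = row_proportion("unexposed", "diseased")

print(f"overall: {marginal_diseased:.2f}, "
      f"exposed: {p_exposed:.2f}, unexposed: {p_unexposed:.2f}")
```

In this invented example the proportions within each exposure group differ clearly from the overall proportion, which is the pattern that would suggest an association; similar values would suggest none.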

Plotting Categorical Variables

  • Pie charts and bar plots illustrate frequencies. Pie charts show each category's share of the whole (percentages); bar plots can show counts or percentages.
  • Bar plots, arranged horizontally or vertically, use bars of equal width separated by gaps.

Plotting Numeric Variables

  • Histograms and box plots are used for numeric data.
  • Histograms divide the range of values into bins (classes) and show how many observations fall in each bin; unlike in bar plots, there is no space between the bars.
  • Box plots visualize the spread of the data and flag outliers, shown as points lying beyond the whiskers.
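
The numbers behind a box plot can be sketched as below. This assumes the whiskers extend at most 1.5 × IQR beyond the quartiles, a common convention that these notes do not state; the data and the helper `box_plot_summary` are hypothetical.

```python
import statistics

# Hypothetical measurements with one unusually large value (made up for illustration).
values = [2, 3, 4, 5, 5, 6, 7, 8, 30]

def box_plot_summary(data, k=1.5):
    """Quartiles, IQR, and the points beyond the whisker fences.

    Assumption: fences lie k * IQR beyond the quartiles (k = 1.5 is a
    common default, not something specified in the notes).
    """
    q1, q2, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    lower_fence, upper_fence = q1 - k * iqr, q3 + k * iqr
    outliers = [x for x in data if x < lower_fence or x > upper_fence]
    return {"q1": q1, "median": q2, "q3": q3, "iqr": iqr, "outliers": outliers}

summary = box_plot_summary(values)
print(summary)
```

Quartile conventions differ slightly between software packages, so the exact fences may vary; the value 30 is flagged here under the convention used by `statistics.quantiles`.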

Measures of Location

  • Mean: Average of a set of values. Sensitive to outliers.
  • Median: The middle value when data is sorted. Not sensitive to outliers.
  • Quantiles: Values that divide the data into segments based on proportions (e.g., 10th percentile, median = 50th percentile, quartiles).
  • Mode: Most frequent value.
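
A short sketch of these measures using Python's `statistics` module, with invented data chosen to show how one extreme value moves the mean but not the median:

```python
import statistics

# Hypothetical data (made up for illustration); the second list drops the value 30.
with_outlier = [2, 3, 4, 5, 5, 6, 7, 8, 30]
without_outlier = [2, 3, 4, 5, 5, 6, 7, 8]

mean_with = statistics.mean(with_outlier)          # pulled upward by the value 30
mean_without = statistics.mean(without_outlier)
median_with = statistics.median(with_outlier)      # unchanged by the extreme value
median_without = statistics.median(without_outlier)

# Quartiles (25th, 50th, 75th percentiles); the middle one equals the median.
quartiles = statistics.quantiles(with_outlier, n=4)

# multimode returns every most-frequent value, so a dataset can have several modes.
modes = statistics.multimode([1, 1, 2, 2, 3])

print(f"mean: {mean_without} -> {mean_with:.2f}")
print(f"median: {median_without} -> {median_with}")
print("quartiles:", quartiles, "modes:", modes)
```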

Measures of Spread

  • Variance: The average of squared deviations from the mean (the sample variance divides by n − 1 rather than n).
  • Standard deviation: Square root of the variance.
  • Range: Difference between maximum and minimum values.
  • Interquartile range (IQR): Difference between 75th and 25th percentile.
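
The measures of spread can be computed in the same way. This is a sketch with made-up values; note that `statistics.variance` uses the n − 1 divisor while `statistics.pvariance` divides by n.

```python
import statistics

data = [4, 8, 6, 5, 3, 7]  # hypothetical values, made up for illustration

population_variance = statistics.pvariance(data)  # divides by n
sample_variance = statistics.variance(data)       # divides by n - 1
sample_sd = statistics.stdev(data)                # square root of the sample variance
value_range = max(data) - min(data)

q1, _, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1                                     # 75th minus 25th percentile

print(f"variance (n): {population_variance:.3f}, variance (n-1): {sample_variance:.3f}")
print(f"SD: {sample_sd:.3f}, range: {value_range}, IQR: {iqr}")
```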

Distribution as a Concept

  • Probability distributions describe the probability of the different values a variable can take.
  • In a density plot, the total area under the curve equals 1 (100%).

Population versus Sample

  • The population is the entire group of interest. Its mean (µ) and variance (σ²) are unknown.
  • Samples are parts of populations. Sample means and sample variances are known and are used to estimate their population counterparts.

Describing Numerical Variables

  • Median and quantiles are useful for skewed data.

Probability Distributions

  • Probability distributions describe variation in numeric data by giving the probability of each possible outcome.
  • They are typically summarized empirically with the same measures used for frequency distributions: mean, standard deviation, median, and quantiles.

The Normal Distribution

  • Defined by its mean and standard deviation (i.e., μ and σ).
  • It is symmetrical, so the mean, median, and mode coincide, which simplifies calculations.
  • Many variables approximately follow this curve, making it a useful analytical tool.
  • About 68%, 95%, and 99.7% of the probability falls within 1, 2, and 3 standard deviations of the mean, respectively.
  • Used in many biostatistical calculations and tests to create a standardized framework.
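
The probabilities within 1, 2, and 3 standard deviations, and the z-score formula (X − µ)/σ, can be checked with `statistics.NormalDist`. This is a small sketch; the helper names and the example values (130, µ = 100, σ = 15) are invented for illustration.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

def prob_within(k):
    """Probability that a normal variable lies within k SDs of its mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean."""
    return (x - mu) / sigma

within_1, within_2, within_3 = (prob_within(k) for k in (1, 2, 3))
z = z_score(130, mu=100, sigma=15)   # hypothetical value, mean, and SD
z_95 = std_normal.inv_cdf(0.975)     # z-value used for 95% confidence intervals

print(f"within 1, 2, 3 SD: {within_1:.4f}, {within_2:.4f}, {within_3:.4f}")
print(f"z-score: {z}, z for 95%: {z_95:.2f}")
```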

Normal Distribution Tests

  • Contextual knowledge: knowing how the variable typically behaves.
  • Shapiro-Wilk test: Tests whether a variable's distribution deviates significantly from normal. A small p-value (p < 0.05) suggests deviation from normality.
  • Q-Q plots (quantile-quantile plots): Compare the quantiles of the variable to the quantiles of a normal distribution. Points lying close to a straight line suggest a normal distribution.
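
The points of a Q-Q plot can be computed without plotting. The sketch below assumes plotting positions (i − 0.5)/n, one of several conventions, and summarizes straightness with the correlation of the points. This is only an informal aid and is not the Shapiro-Wilk test; the data are simulated and the function names are hypothetical.

```python
import math
import random
from statistics import NormalDist, fmean

def qq_points(data):
    """Pairs (theoretical normal quantile, observed value) for a Q-Q plot.

    Assumption: plotting positions (i - 0.5) / n; other conventions exist.
    """
    n = len(data)
    std_normal = NormalDist()
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, sorted(data)))

def qq_correlation(data):
    """Pearson correlation of the Q-Q points; near 1 means a near-straight line."""
    xs, ys = zip(*qq_points(data))
    mx, my = fmean(xs), fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(0)  # fixed seed so the simulated data are reproducible
normal_sample = [rng.gauss(120, 15) for _ in range(200)]    # simulated normal data
skewed_sample = [rng.expovariate(1.0) for _ in range(200)]  # simulated right-skewed data

r_normal = qq_correlation(normal_sample)
r_skewed = qq_correlation(skewed_sample)
print(f"normal sample: r = {r_normal:.3f}, skewed sample: r = {r_skewed:.3f}")
```

The observed values are plotted unstandardized, so a normal sample follows a straight line whose slope and intercept reflect its standard deviation and mean rather than the line y = x.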

Importance of the Normal Distribution

  • Many variables approximately follow a normal distribution.
  • Central limit theorem: The means of repeated random samples are approximately normally distributed even if the underlying distribution is not normal, provided the samples are sufficiently large.
  • The standard error (SE = s/√n) measures how precisely sample means estimate the population mean; a smaller SE indicates greater precision.
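
A small simulation can illustrate the central limit theorem and the standard error. It assumes an exponential population (mean 1, SD 1), chosen only because it is clearly non-normal; the sample size, number of repeats, and function name are arbitrary choices for this sketch.

```python
import math
import random
import statistics

def simulate_sample_means(n, repeats, rng):
    """Draw `repeats` samples of size n from a skewed population; return their means.

    Assumption: exponential population with mean 1 and SD 1.
    """
    return [
        statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
        for _ in range(repeats)
    ]

rng = random.Random(1)  # fixed seed for reproducibility
n = 50
means = simulate_sample_means(n, repeats=2000, rng=rng)

mean_of_means = statistics.fmean(means)   # close to the population mean (1)
sd_of_means = statistics.stdev(means)     # empirical standard error
theoretical_se = 1.0 / math.sqrt(n)       # sigma / sqrt(n), about 0.141

print(f"mean of sample means: {mean_of_means:.3f}")
print(f"SD of sample means: {sd_of_means:.3f} (theory: {theoretical_se:.3f})")
```

The standard deviation of the simulated means should be close to σ/√n, which is what the standard error estimates from a single sample via s/√n.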

The Three Distributions

  • Population distribution: The variable in the entire population, with mean µ and standard deviation σ; unknown.
  • Sample distribution: The variable in the sample, with mean x̄ and standard deviation s; known.
  • Sampling distribution: The distribution of sample means over repeated samples, with its own mean and a standard deviation called the standard error.

Confidence Intervals

  • Confidence intervals provide a range within which the true population mean is likely to fall; for large samples, the 95% CI is x̄ ± 1.96 × SE.
  • A 95% CI means that, if the study were repeated many times, about 95% of the resulting intervals would contain the true value.
  • For small samples, the t-distribution is used to account for the extra uncertainty from estimating the standard deviation from the sample. A t-value replaces the z-value; t-values are larger for smaller samples, giving wider intervals.
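
The interval x̄ ± critical value × SE, with SE = s/√n, can be sketched as follows. The measurements are invented, `mean_ci` is a hypothetical helper, and the t-value 2.262 is the tabulated 97.5th percentile for 9 degrees of freedom, hard-coded because the standard library has no t-distribution.

```python
import math
from statistics import NormalDist, mean, stdev

def mean_ci(data, critical_value):
    """Return (lower, upper) = mean -/+ critical_value * SE, with SE = s / sqrt(n)."""
    n = len(data)
    x_bar = mean(data)
    se = stdev(data) / math.sqrt(n)
    return x_bar - critical_value * se, x_bar + critical_value * se

# Hypothetical measurements (n = 10), made up for illustration.
measurements = [5.1, 4.9, 5.6, 5.8, 6.0, 5.2, 4.7, 5.5, 5.3, 5.9]

z_value = NormalDist().inv_cdf(0.975)  # about 1.96
t_value = 2.262  # tabulated t critical value for n - 1 = 9 degrees of freedom

ci_z = mean_ci(measurements, z_value)
ci_t = mean_ci(measurements, t_value)

print(f"z-based 95% CI: ({ci_z[0]:.2f}, {ci_z[1]:.2f})")
print(f"t-based 95% CI: ({ci_t[0]:.2f}, {ci_t[1]:.2f})")
```

With only ten observations the t-based interval is the appropriate one; the z-based interval is shown for comparison and is slightly too narrow.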

Assumptions of t-tests

  • Data comes from normally distributed populations.
  • Sample data must be independent.
  • Populations have equal standard deviations.

Comparing Two Independent Sample Means

  • Two sample means often differ because of random error alone, but the difference may also reflect a real difference between the population means.
  • Testing contrasts a null hypothesis that the population means are equal (H0) with an alternative hypothesis that they differ (H1).

P-values and Hypothesis Testing

  • P-value: The probability of observing the data, or more extreme results, if the null hypothesis is true.
  • A small p-value (typically less than 0.05) means the data would be unlikely if the null hypothesis were true, so we reject it in favor of the alternative hypothesis.
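
To make the definition of a p-value concrete, the sketch below uses an exact permutation test on the difference in means. This is a stand-in technique chosen because it needs only the standard library; it is not the t-test or the Wilcoxon-Mann-Whitney test discussed in these notes, and the data and function name are invented.

```python
from itertools import combinations
from statistics import fmean

def permutation_p_value(group_a, group_b):
    """Two-sided exact permutation p-value for the difference in means.

    Under H0 the group labels are arbitrary, so every way of splitting the
    pooled values into groups of the original sizes is equally likely.
    """
    pooled = list(group_a) + list(group_b)
    observed = abs(fmean(group_a) - fmean(group_b))
    as_extreme = total = 0
    for idx in combinations(range(len(pooled)), len(group_a)):
        chosen = set(idx)
        a = [pooled[i] for i in idx]
        b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        # Count splits at least as extreme as the observed one (small float tolerance).
        if abs(fmean(a) - fmean(b)) >= observed - 1e-12:
            as_extreme += 1
    return as_extreme / total

# Hypothetical measurements in two independent groups (made up for illustration).
treated = [12.1, 14.3, 13.5, 15.0, 14.8]
control = [10.2, 11.5, 12.0, 10.9, 11.8]

p_value = permutation_p_value(treated, control)
print(f"p = {p_value:.4f}; reject H0 at 0.05: {p_value < 0.05}")
```

Enumerating every split is feasible only for small samples (here 252 splits); larger samples would need random permutations or a parametric test.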

Proportions

  • Proportion: Part of the whole; a fraction of a total.
  • Proportion values are limited to the range 0 to 1.
  • Methods designed for means of numeric data are not directly applicable to proportions; methods based on the binomial distribution are used instead.

Binomial Distribution

  • Used for discrete variables with binary outcomes (success/failure).
  • Defined by n, the number of trials, and p, the probability of success; its shape is generally skewed (it is symmetric when p = 0.5).
  • It gives the probability that an outcome occurs exactly x times, and hence at least x times, in n trials.
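
The binomial probabilities can be written out directly from the formula P(X = x) = C(n, x) p^x (1 − p)^(n − x). This is a minimal sketch; the function names and the example values of n and p are chosen for illustration.

```python
import math

def binom_pmf(x, n, p):
    """P(X = x): probability of exactly x successes in n independent trials."""
    return math.comb(n, x) * p**x * (1 - p) ** (n - x)

def prob_at_least(x, n, p):
    """P(X >= x): sum of the probabilities for x, x + 1, ..., n successes."""
    return sum(binom_pmf(k, n, p) for k in range(x, n + 1))

p_exactly_5 = binom_pmf(5, n=10, p=0.5)
p_at_least_8 = prob_at_least(8, n=10, p=0.5)

# With p far from 0.5 the distribution is skewed: most mass sits near 0 successes.
skewed_pmf = [binom_pmf(k, n=10, p=0.1) for k in range(11)]

print(f"P(X = 5) = {p_exactly_5:.4f}, P(X >= 8) = {p_at_least_8:.4f}")
print("most likely count when p = 0.1:", skewed_pmf.index(max(skewed_pmf)))
```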

Confidence Intervals for Proportions

  • Confidence intervals for proportions give a range that is likely to contain the true population proportion, at a stated confidence level over repeated samples.
  • Exact (binomial-based) methods are used when sample sizes are small, while larger sample sizes allow methods based on the normal approximation.
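
Both approaches can be sketched in a few lines. The notes do not name specific methods, so this assumes the Wald interval for the normal approximation and the Clopper-Pearson interval (solved here by bisection) as the exact method; the function names and the example of 2 events in 10 participants are hypothetical.

```python
import math
from statistics import NormalDist

def binom_pmf(k, n, p):
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def wald_ci(x, n, level=0.95):
    """Normal-approximation (Wald) interval; reasonable only for large samples."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    p_hat = x / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return max(0.0, p_hat - z * se), min(1.0, p_hat + z * se)

def exact_ci(x, n, level=0.95, tol=1e-10):
    """Clopper-Pearson interval, found by bisection on binomial tail probabilities."""
    alpha = 1 - level

    def solve(f, target):
        # f must be increasing in p on [0, 1]; find p with f(p) = target.
        lo, hi = 0.0, 1.0
        while hi - lo > tol:
            mid = (lo + hi) / 2
            if f(mid) < target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    def upper_tail(p):  # P(X >= x), increasing in p
        return sum(binom_pmf(k, n, p) for k in range(x, n + 1))

    def lower_tail(p):  # P(X <= x), decreasing in p
        return sum(binom_pmf(k, n, p) for k in range(0, x + 1))

    lower = 0.0 if x == 0 else solve(upper_tail, alpha / 2)
    upper = 1.0 if x == n else solve(lambda p: -lower_tail(p), -alpha / 2)
    return lower, upper

# Hypothetical small sample: 2 events among 10 participants.
wald_small = wald_ci(2, 10)
exact_small = exact_ci(2, 10)
print(f"Wald:  ({wald_small[0]:.3f}, {wald_small[1]:.3f})")
print(f"Exact: ({exact_small[0]:.3f}, {exact_small[1]:.3f})")
```

For this small sample the Wald interval has to be truncated at 0, one sign that the normal approximation is not adequate here, while the exact interval stays within the range 0 to 1 by construction.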
