Statistical Analysis Using R

Study Notes

R is an open-source statistical programming language used for robust data analysis, modeling, and visualization.
R has a wide variety of statistical packages (over 18,000) available on CRAN (Comprehensive R Archive Network) for specialized statistical analysis.
R offers powerful tools for creating high-quality visualizations using packages like ggplot2 and lattice.
R provides several core data structures, including vectors, matrices, data frames, and lists.
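A minimal sketch of the four structures named above. All variable names and values here are made up for illustration.

```r
# Atomic vector: every element has the same type
v <- c(2.5, 3.1, 4.8)

# Matrix: two-dimensional, single type
m <- matrix(1:6, nrow = 2, ncol = 3)

# Data frame: columns may differ in type, each row is one observation
df <- data.frame(id = 1:3, score = v, group = c("a", "b", "a"))

# List: elements may have any type or length
lst <- list(values = v, table = df, label = "example")

# Compactly display the structure of the list
str(lst)
```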

Data preparation is crucial before analysis, involving cleaning, transforming, and organizing data.
Common tasks include importing data from formats like CSV, Excel, databases, and web APIs.
- Example: Importing a CSV file: data <- read.csv("data.csv", header = TRUE)
Data cleaning involves handling missing values, outliers, and duplicates.
- Example: Removing rows with missing values: data_cleaned <- na.omit(data)
Data transformation converts data into a suitable format, e.g., factorizing categorical variables.
- Example: Converting a column to a factor: data$category <- as.factor(data$category)
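The import, cleaning, and transformation steps can be chained as in the sketch below. It assumes a small hypothetical CSV (columns id, category, value), written to a temporary file so the example runs on its own. Duplicates are dropped with unique(), which is one possible approach and not the only one.

```r
# Write a small hypothetical CSV to a temporary file (illustrative data only)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,category,value",
             "1,a,10.5",
             "2,b,NA",
             "3,a,7.2",
             "3,a,7.2",
             "4,b,3.1"), csv_path)

# Import
data <- read.csv(csv_path, header = TRUE)

# Clean: drop exact duplicate rows, then rows containing missing values
data_cleaned <- na.omit(unique(data))

# Transform: convert the categorical column to a factor
data_cleaned$category <- as.factor(data_cleaned$category)

str(data_cleaned)
```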

Probability measures the likelihood of an event occurring, ranging from 0 (impossibility) to 1 (certainty).
Key concepts include:
- Sample Space (S) - all possible outcomes of a random experiment
- Event - a subset of the sample space
- Probability of an Event - when all outcomes are equally likely, the ratio of favorable outcomes to total outcomes
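As a small worked case, assume one roll of a fair six-sided die (a hypothetical experiment chosen for illustration). The sketch computes the probability of an even roll from the sample space and then checks it roughly by simulation.

```r
# Sample space for one roll of a fair six-sided die
S <- 1:6

# Event: the roll is even (a subset of S)
event <- S[S %% 2 == 0]

# Classical probability, valid here because outcomes are equally likely
p_even <- length(event) / length(S)
p_even

# Rough simulation check: proportion of even rolls in many simulated rolls
set.seed(1)
rolls <- sample(S, size = 10000, replace = TRUE)
mean(rolls %% 2 == 0)
```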

R supports various probability distributions, each describing the likelihood of different outcomes in a random experiment.
Common distributions include:
- Uniform Distribution - all outcomes equally likely.
- Normal Distribution - symmetric bell-shaped curve, characterized by mean (µ) and standard deviation (σ).
R functions for generating random numbers from distributions:
- runif(): uniform distribution
- rnorm(): normal distribution
- rbinom(): binomial distribution
- rpois(): Poisson distribution
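The sketch below draws from each of the four generators listed above. The parameter values are arbitrary choices for illustration, and set.seed() is used only to make the draws reproducible.

```r
set.seed(42)  # make the random draws reproducible
n <- 1000

u <- runif(n, min = 0, max = 1)        # uniform on [0, 1]
z <- rnorm(n, mean = 0, sd = 1)        # standard normal
b <- rbinom(n, size = 10, prob = 0.3)  # successes in 10 trials
p <- rpois(n, lambda = 4)              # counts with mean 4

# Sample means should lie close to the theoretical means 0.5, 0, 3, and 4
c(uniform = mean(u), normal = mean(z), binomial = mean(b), poisson = mean(p))
```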

Hypothesis testing in R is used to validate research assumptions or hypotheses regarding data sets.
R provides functions for testing hypotheses, including one-sample t-tests.

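A one-sample t-test compares the mean of a sample to a known or hypothesized population mean.
It assumes the data are approximately normally distributed.
Example syntax: t.test(x, mu = mu0) (where x is the data and mu0 is the hypothesized mean; mu must be passed by name, because the second positional argument of t.test() is y)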

A directional (one-sided) test specifies the direction of the alternative hypothesis, e.g., that the population mean is greater or smaller than the hypothesized value.
Example syntax: t.test(x, mu = mu0, alternative = "greater")
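A minimal sketch of both the two-sided and the directional test on simulated data. The sample x and the hypothesized mean mu0 are hypothetical. Note that mu is passed by name, because the second positional argument of t.test() is y (a second sample).

```r
set.seed(123)
x <- rnorm(30, mean = 52, sd = 5)  # simulated sample (illustrative only)
mu0 <- 50                          # hypothesized population mean

# Two-sided test: H1 is that the true mean differs from mu0
two_sided <- t.test(x, mu = mu0)

# One-sided test: H1 is that the true mean is greater than mu0
one_sided <- t.test(x, mu = mu0, alternative = "greater")

two_sided$p.value
one_sided$p.value
```

Both calls return an object of class "htest"; the test statistic, degrees of freedom, and confidence interval are available as $statistic, $parameter, and $conf.int.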

Linear regression is a statistical method for exploring the relationship between a dependent variable and one or more independent variables.
R's lm() function creates linear regression models, and the predict() function generates predictions from a fitted model.
Various types exist, including simple linear regression and multiple linear regression.
- Example: model <- lm(y ~ x)
- To predict values: res <- predict(model, newdata = data.frame(x = ...))
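The two calls above can be tried end to end on simulated data. In this sketch the true intercept (3), slope (2), and noise level are assumptions made up for the example, as are the new x values used for prediction.

```r
set.seed(1)
x <- 1:50
y <- 3 + 2 * x + rnorm(50, sd = 4)  # hypothetical linear relationship plus noise
df <- data.frame(x = x, y = y)

# Fit a simple linear regression of y on x
model <- lm(y ~ x, data = df)
coef(model)  # estimated intercept and slope

# Predict y for new x values; column names must match the model's predictors
res <- predict(model, newdata = data.frame(x = c(55, 60)))
res
```

For multiple linear regression, additional predictors are added to the formula, e.g. y ~ x1 + x2.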

Logistic regression is a statistical method for predicting a categorical (typically binary) response variable given one or more predictor variables.
Use R's glm() function with the family = binomial argument to create logistic regression models.
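A minimal sketch of a logistic regression on simulated binary data. The coefficients used to generate the data (-0.5 and 1.5) are arbitrary assumptions for illustration.

```r
set.seed(2)
n <- 200
x <- rnorm(n)

# Hypothetical data-generating process: log-odds = -0.5 + 1.5 * x
prob <- plogis(-0.5 + 1.5 * x)
y <- rbinom(n, size = 1, prob = prob)
df <- data.frame(x = x, y = y)

# Fit the logistic regression
model <- glm(y ~ x, data = df, family = binomial)
coef(model)  # coefficients are on the log-odds scale

# type = "response" returns predicted probabilities instead of log-odds
probs <- predict(model, newdata = data.frame(x = c(-1, 0, 1)), type = "response")
probs
```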

Model fitting is crucial for assessing how well a model represents data.
The process involves data collection, model selection, parameter estimation, and evaluation.
Techniques for evaluation, like cross-validation and bootstrapping, help determine model performance.
Accurate modeling aids in prediction, pattern identification, and informed decisions based on data.
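As one way to evaluate a model, the sketch below runs 5-fold cross-validation for a linear model using base R only. The simulated data, the number of folds, and the use of RMSE as the error measure are all assumptions chosen for illustration.

```r
set.seed(10)
n <- 100
df <- data.frame(x = runif(n, 0, 10))
df$y <- 1 + 0.5 * df$x + rnorm(n)  # hypothetical relationship plus noise

k <- 5
# Randomly assign each row to one of k folds of equal size
folds <- sample(rep(1:k, length.out = n))

rmse <- numeric(k)
for (i in 1:k) {
  train    <- df[folds != i, ]  # fit on the other k - 1 folds
  held_out <- df[folds == i, ]  # evaluate on the held-out fold
  fit  <- lm(y ~ x, data = train)
  pred <- predict(fit, newdata = held_out)
  rmse[i] <- sqrt(mean((held_out$y - pred)^2))
}

# Average out-of-sample error across the folds
mean(rmse)
```

Because each fold is held out exactly once, the averaged RMSE estimates how the model performs on data it was not fitted to.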