PSYC3910_Lec2_Review_More_R_fall2023_post.pptx

Full Transcript

Advanced Data Analysis: PSYC 3910 Lecture 2: Review and more R
Instructor: Bobby Stojanoski

Course Overview
• Lecture: Thursday 9:30 – 11:00 am, Bordessa Hall Room DTB205
• Office hours: Fridays 2:30 – 3:30 pm (or by appointment)
• TA: Guadalupe Blanco Velasco
• Tutorials (Session 45684): Tuesdays 9:40 – 11:00 am, Room CHA213
• Tutorials (Session 45685): Fridays 2:10 – 3:30 pm, Room CHA217 (TAs: Guadalupe Blanco Velasco and TBD)
• Tutorials (Session 45687): Fridays 11:10 am – 12:30 pm, Room DTB205
• Tutorials (Session 45688): Mondays 2:10 – 3:30 pm, Room CHA217

Course Overview
• Lecture slides will be uploaded to Canvas before class
• Be prepared (read the assigned chapter)
• Participate!
• Examples
• Assignments: work together, but submitted work should be your own

Course Overview
Textbook: Field, A., Miles, J., & Field, Z. (2012). Discovering Statistics Using R. Thousand Oaks, CA: Sage.
You can purchase the online version of the book here: https://www.vitalsource.com/en-ca/products/discovering-statistics-using-r-andy-field-v9781446289150
There is also a Kindle edition.

Topic overview:
Week 1 (Sept 7, 8): Introduction to R
Week 2 (Sept 14, 15): R and Review
Week 3 (Sept 21, 22): Fundamentals of GLM – Chapter 2
Week 4 (Sept 28, 29): Correlation – Chapters 6, 7
Week 5 (Oct 5, 6): Bivariate Regression & Intro to Multiple Regression – Chapter 7
(Oct 12, 13): Reading Week
Week 6 (Oct 19, 20): Midterm Exam
Week 7 (Oct 26, 27): Multiple Regression – Chapter 7
Week 8 (Nov 2, 3): Comparing 2 Means – Chapter 9
Week 9 (Nov 9, 10): One-way ANOVA – Chapter 10
Week 10 (Nov 16, 17): ANCOVA – Chapter 11
Week 11 (Nov 23, 24): Factorial (Two-way Independent) ANOVA – Chapter 12
Week 12 (Nov 30, Dec 1): Repeated Measures ANOVA – Chapter 13

Grading breakdown
• Weekly Assignments (10 in total; best 8 count): 5% each = 40%
• Midterm: 25%
• Final: 35%

Learning outcomes
• Understand the basics of programming using R
• Use R for data analysis
• Understand the logic of statistics
• Conceptual – given a data set, what should I do?
• Practical – run univariate and multivariate analyses
• Learn about common statistical tests and procedures, and their limitations
• Understand the rationale behind statistical tests
• Generate hypotheses, and select the appropriate analyses to address them
• Interpret findings and write up results

Why do we use statistics?
Theoretical inquiry:
• Systematic empirical work of science
• Testing hypotheses
• Description and classification
• Developing models
• Methods can help formulate research questions
Applied inquiry:
• Evaluation (e.g., a program or an RCT)
• Assessment and diagnosis
• Describing a population
• Prediction

Why do we use statistics? Can't we just rely on common sense?
• We don't trust ourselves enough. Humans are susceptible to biases, temptations and faulty logic.
• It's too easy for us to "believe what we want to believe".
• Statistics is a safeguard.

Introducing: Simpson's Paradox
Simpson's Paradox: a phenomenon where an association between two variables in a population emerges, disappears or reverses when the population is divided into subpopulations (or trends disappear when data are combined).

Example: you and a friend solve practice problems over a weekend. Per-day success rates:

Day      | You         | Your friend
Saturday | 7/8 = 87.5% | 3/3 = 100%
Sunday   | 1/2 = 50%   | 5/8 = 62.5%
Now add the totals:

Day      | You         | Your friend
Saturday | 7/8 = 87.5% | 3/3 = 100%
Sunday   | 1/2 = 50%   | 5/8 = 62.5%
Total    | 8/10 = 80%  | 8/11 ≈ 72.7%

Why does it occur? You would expect that winning in every group means winning overall. Not necessarily! That is only guaranteed when the group sizes are equal. When group sizes differ, each total is dominated by a particular group, and those dominant groups can belong to different categories. In this example, each player's total is dominated by the day on which they attempted 8 problems: on your 8-attempt day (Saturday) you solved 7, while on your friend's 8-attempt day (Sunday) they solved only 5. That is why you can win overall (when both days are combined) even though your friend's success rate is higher on each day.

Why are statistics so important in psychology?
Psychology is hard! Psychologists, for the most part, study humans, and humans are complicated. These problems are much more difficult than in other fields. In physics: "if your experiment needs statistics, you should have done a better experiment". Psychologists need statistics to better understand the things they study, and to be sure the effects they find are real. The relationship is deeper than that: statistics is intertwined with research design. If you want to be good at designing psychological studies, you need to at least understand the basics of stats.

Statistics has a PR problem
• Statistics has a bad rap!
• Fear that there is too much math (the truth is there is nearly no math)
• Courses can feel boring, stressful, and confusing
• Often focused on rote memorization of recipes for statistical tests
• Students can end up with little understanding of how and why
• Here we focus on the underlying logic and rationale (no need to memorize equations; I can't!)
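A minimal R sketch of the weekend example above, using the per-day counts from the slide: your friend's success rate is higher on both days, yet the comparison reverses once the days are combined.

```r
# Simpson's paradox, using the weekend problem-solving example from the slides.
# Each row: problems solved / attempted on that day.
you    <- data.frame(day = c("Saturday", "Sunday"), solved = c(7, 1), tried = c(8, 2))
friend <- data.frame(day = c("Saturday", "Sunday"), solved = c(3, 5), tried = c(3, 8))

# Per-day success rates: your friend wins BOTH days
you$rate    <- you$solved / you$tried       # 0.875, 0.500
friend$rate <- friend$solved / friend$tried # 1.000, 0.625

# Combined success rates: the comparison reverses and YOU win overall,
# because each total is dominated by that player's 8-attempt day
overall_you    <- sum(you$solved) / sum(you$tried)       # 8/10 = 0.8
overall_friend <- sum(friend$solved) / sum(friend$tried) # 8/11, about 0.73
```

Note that the reversal depends entirely on the unequal group sizes: if both players had attempted the same number of problems each day, winning both days would guarantee winning overall.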
Today: Review of Basics

Data collection: What to measure
Hypothesis:
• Those who play online video games are smarter than those who do not

Independent Variable
• The proposed cause
• A predictor variable
• A manipulated variable (in experiments)
• e.g., amount of brain training

Dependent Variable
• The proposed effect
• An outcome variable
• Measured, not manipulated (in experiments)
• e.g., cognition (language, memory, executive functioning)

Levels of measurement
Categorical (entities are divided into distinct categories):
• Binary variable: there are only two categories
• e.g., dead or alive
• Nominal variable: there are more than two categories
• e.g., whether someone is an omnivore, vegetarian, or vegan
• Ordinal variable: the same as a nominal variable, but the categories have a logical order
• e.g., what you find on a Likert scale: strongly disagree, disagree, neutral, agree, strongly agree

Continuous (entities get a distinct score):
• Interval variable: equal intervals on the variable represent equal differences in the property being measured
• e.g., the difference between 6 and 8 is equivalent to the difference between 13 and 15
• Think temperature (Celsius and Fahrenheit)
• Ratio variable: the same as an interval variable, but the ratios of scores must also make sense because the scale has a true 0 point
• e.g., the Kelvin scale (0 means absence of thermal energy), reaction times

Data collection: How to collect data
Between-group / between-subject / independent designs
• Different entities in each experimental condition
Repeated-measures (within-subject) designs
• The same entities take part in all experimental conditions
• More power (it's all about variance)
• Economical
• But: practice effects and fatigue

Data collection: Types of Variation
Systematic Variation
• Differences in performance created by a specific experimental manipulation
Unsystematic Variation
• Differences in performance created by unknown factors
• Age, gender, IQ, time of day, measurement error, etc.
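As a sketch of how these levels of measurement map onto R's data types (the variable names here are illustrative, not from the slides): categorical variables become factors, and ordinal variables become ordered factors.

```r
# Nominal variable: categories with no logical order
diet <- factor(c("omnivore", "vegan", "vegetarian", "omnivore"),
               levels = c("omnivore", "vegetarian", "vegan"))

# Ordinal variable: Likert-style categories with a logical order
agreement <- factor(c("agree", "neutral", "strongly agree"),
                    levels = c("strongly disagree", "disagree", "neutral",
                               "agree", "strongly agree"),
                    ordered = TRUE)

is.ordered(diet)             # FALSE: nominal, no ranking among categories
is.ordered(agreement)        # TRUE:  ordinal
agreement[1] < agreement[3]  # TRUE: order comparisons are meaningful here
```

Interval and ratio variables simply stay numeric vectors; the distinction between them matters for interpretation (does 0 mean "none"?), not for R's storage.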
Randomization
• Minimizes unsystematic variation
• Counterbalancing
This is fundamental to both statistics and experimental design.

Measures of Central Tendency
• Once you have the data, you will want an easy way to "describe" them
• Each measure is designed to minimize the distance between the central value and all data points, but "distance" is defined differently in each case
• Each measure gives you a different type of information
• Each measure is more or less appropriate for different distributions

Central tendency: Mode
Mode: the most frequent score
• Bimodal: having two modes
• Multimodal: having several modes
Pros: can be applied to any type of data; not affected by outliers
Cons: can't do much with it

Central Tendency: Median
Median: the middle score when scores are ordered
• Example: number of TikTok followers. How popular are you among your closest friends?
• 22, 40, 53, 57, 93, 98, 103, 108, 116, 121, 252
• 22, 40, 53, 57, 93, 98, 103, 108, 116, 121, 252, 515
Pros: not affected by outliers
Cons: does not use all the data

Central Tendency: Mean
Mean: the sum of scores divided by the number of scores
• Example: number of TikTok followers

Dispersion: Interquartile range
Quartiles: the three values that split the sorted data into four equal parts
• Second quartile = median
• Lower quartile = median of the lower half of the data
• Upper quartile = median of the upper half of the data

Analysing Data: Histograms
Frequency distributions (aka histograms)
• A graph plotting values of observations on the horizontal axis, with a bar showing how many times each value occurred in the data set
The 'Normal' Distribution
• Bell-shaped
• Symmetrical around the centre

Analysing Data: Properties of frequency distributions
Kurtosis: the 'heaviness' of the tails
• Leptokurtic = heavy tails (higher probability of extreme values)
• Platykurtic = light tails
Skew: the (a)symmetry of the distribution
• Positive skew (scores bunched at low values with the tail pointing to high values)
• Negative skew (scores bunched at high values with the tail pointing to low values)

Analysing Data: Kurtosis (figure: leptokurtic, mesokurtic/normal, and platykurtic curves)
Analysing Data: Skew
Analysing Data: Skew and central tendency

Beyond raw data: z-scores
z-scores
• Standardise a score with respect to the other scores in the group; the units of the original data no longer matter
• Express a score in terms of how many standard deviations it is away from the mean:
  z = (X − X̄) / s
• The distribution of z-scores has a mean of 0 and SD = 1

Z-scores and the Normal Distribution
• 68.27% of scores lie within 1 SD of the mean, 95.45% within 2 SD, and 99.73% within 3 SD
• Example with M = 32, SD = 7:

Score:      11    18    25    32    39    46    53
z-score:    −3    −2    −1     0     1     2     3
Percentile:  0     2    16    50    84    98    99.9

Galton Board (see a normal distribution in action): https://www.youtube.com/watch?v=6YDHBFVIvIs

Beyond raw data: z-scores
• What's the probability that someone who drinks 5 diet cokes a day is 70 or older?
• Mean = 36, SD = 13
• z = (70 − 36)/13 = 2.62

Properties of z-scores
• With the normal distribution and z-scores we can go one step beyond our specific data:
• From a set of scores we can calculate the probability that a particular score will occur
• Determine whether getting a particular score is likely or unlikely
• Do you see where this is going?
• 1.96 cuts off the top 2.5% of the distribution
• −1.96 cuts off the bottom 2.5% of the distribution
• 95% of z-scores lie between −1.96 and 1.96
• 99% of z-scores lie between −2.58 and 2.58
• 99.9% of them lie between −3.29 and 3.29

Visualizations
Importance of visualizations
• Don't mislead the reader!

What makes a good graph?
• Show the data!
• Everything should be labeled and interpretable without consulting a figure caption or having to solve a puzzle
• Graphs should facilitate relevant quantitative interpretation and comparisons
• Graphs should represent variability and uncertainty, to permit inferential statistics by eye
• Graphs should simplify complexity and make data coherent
• Graphs should not waste ink, and should otherwise look pretty
• Reveal a story

Visualizations: ggplot2
install.packages("ggplot2")
library(ggplot2)
In ggplot2, a plot is made up of layers:
• Plot: the finished product
• Aesthetics, or aes(): what the graph will look like (colour, shape, size, location)
• Geoms ('geometric objects'): the visual elements (bars, lines, text, points), i.e., what kind of graph you want: line, bar, scatter, etc.

Visualizations: Aesthetics
An aesthetic's value can be set two ways:
• Specific value (don't use aes()): e.g., colour = "Red", linetype = 2
• Variable (use aes()): e.g., gender, experimental group: aes(colour = gender), aes(shape = group)

• An aesthetic is a visual property of the objects in your plot. Aesthetics include the size, shape, and/or colour of your data points. Let's use the word "level" to describe aesthetic properties.
• Provide the name of the aesthetic, followed by an equals sign and something that sets its value
• This can be a variable (e.g., colour = gender, which would produce different-coloured aesthetics for males and females) or a specific value (e.g., colour = "Red")
• Can be applied at a specific layer or at multiple layers

Let's plot some data
• First:
• install.packages("tidyverse")
• library(tidyverse)
• mpg comes as a tibble, but we will use data frames: mpg_df = data.frame(mpg)
• Variables in mpg:
• displ: a car's engine size, in litres
• hwy: a car's fuel efficiency on the highway, in miles per gallon (mpg)

Let's plot some data
• ggplot(data = mpg_df) + geom_point(mapping = aes(x = displ, y = hwy))
• What is mapping?
• Each geom function takes a mapping argument
• Mapping defines how variables in your data are mapped to visual properties
• It is always paired with aes()
• Variables in aes() (x and y) are linked to the x and y axes
• From here we can build a plotting template:
• ggplot(data = <your data>) + <geom_function>(mapping = <mappings>)

Let's plot some data
• How can visualizations help us make sense of the data?
• A group of red data points seems to stand out (higher mileage than other cars with similar engine sizes)
• What do you think might explain this? What's your hypothesis?
• To make sense of the red dots, we can add a third variable, class, to the two-dimensional scatterplot by mapping it to an aesthetic.

Let's plot some data
• ggplot(data = mpg_df) + geom_point(mapping = aes(x = displ, y = hwy))
• ggplot(data = mpg_df) + geom_point(mapping = aes(x = displ, y = hwy, colour = class))
Link the aesthetic to the name of the variable inside aes(). ggplot2 will automatically assign a unique level of the aesthetic (a unique colour) to each unique value of the variable; this is called scaling. ggplot2 also adds a legend.

Let's plot some data
• ggplot(data = mpg_df) + geom_point(mapping = aes(x = displ, y = hwy))
• ggplot(data = mpg_df) + geom_point(mapping = aes(x = displ, y = hwy, alpha = class))

Let's plot some data
• Another way to separate data points is to split your plot into facets with facet_wrap():
• ggplot(data = mpg_df) + geom_point(mapping = aes(x = displ, y = hwy)) + facet_wrap(~ class, nrow = 2)

Let's plot some data
• Other geoms. We've seen: ggplot(data = mpg_df) + geom_point(mapping = aes(x = displ, y = hwy))
• But we can plot the same data this way: ggplot(data = mpg_df) + geom_smooth(mapping = aes(x = displ, y = hwy))

Let's plot some data
• We can also combine these:
• ggplot(data = mpg_df) + geom_point(mapping = aes(x = displ, y = hwy)) + geom_smooth(mapping = aes(x = displ, y = hwy))
• Notice any redundancy?
Let's plot some data
• Remove the redundancy by moving the mapping to a higher layer in ggplot:
• ggplot(data = mpg_df, mapping = aes(x = displ, y = hwy)) + geom_point() + geom_smooth()
• This allows you to add other mappings at different layers:
• ggplot(data = mpg_df, mapping = aes(x = displ, y = hwy)) + geom_point(mapping = aes(color = class)) + geom_smooth()

Let's plot some data
• ggplot also allows for statistical transformations!
• Use the diamonds data set that comes with the tidyverse package; we will plot it using geom_bar()
• Plot the total number of diamonds, grouped by cut:
• ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut))
• ggplot(data = diamonds) + geom_boxplot(mapping = aes(x = cut, y = depth))
• How are these plots generated by the geom functions just from the variable/data frame? Don't we need summary stats, such as min and max values for bar graphs and error bars?
• No! ggplot2 has built-in "stat" functions that compute them for us
• They work behind the scenes
• They are used by each geom to get the values needed to create the visual elements

Let's plot some data
• You can also select which statistical transformations to run
• ggplot(data = diamonds, aes(x = cut, y = depth))
• To add the mean, displayed as bars, add a layer using the stat_summary() function:
• + stat_summary(fun = mean, geom = "bar", fill = "White", colour = "Black")
• To add error bars, add these as another stat_summary() layer:
• + stat_summary(fun.data = mean_cl_normal, geom = "errorbar", position = position_dodge(width = 0.90), width = 0.2)
• Combined:
• ggplot(data = diamonds, aes(x = cut, y = depth)) + stat_summary(fun = mean, geom = "bar", fill = "White", colour = "Black") + stat_summary(fun.data = mean_cl_normal, geom = "errorbar", position = position_dodge(width = 0.90), width = 0.2)
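Tying the aesthetic rules together: a short sketch (using the same mpg_df data frame as above, and illustrative plot-object names) contrasting a variable aesthetic, which goes inside aes(), with a specific value, which is set outside it.

```r
library(ggplot2)
mpg_df <- data.frame(mpg)

# Variable aesthetic: inside aes(); ggplot2 scales one colour per level
# of class and adds a legend automatically
p_by_class <- ggplot(mpg_df, aes(x = displ, y = hwy)) +
  geom_point(aes(colour = class)) +
  geom_smooth()

# Specific value: outside aes(); every point gets the same colour, no legend
p_red <- ggplot(mpg_df, aes(x = displ, y = hwy)) +
  geom_point(colour = "Red", size = 2)
```

Because the x/y mapping sits in the top ggplot() layer, both geoms in the first plot inherit it, which is exactly the redundancy-removal pattern shown above.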
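To close the loop on the review topics, a sketch in R of the descriptive-statistics examples from earlier in the lecture: the TikTok-follower lists (median vs mean), quartiles, and the diet-coke z-score question (mean 36, SD 13).

```r
# Median vs mean: one outlier (515) barely moves the median
# but drags the mean up substantially
followers <- c(22, 40, 53, 57, 93, 98, 103, 108, 116, 121, 252)
median(followers)                  # 98
round(mean(followers), 1)          # 96.6
median(c(followers, 515))          # 100.5
round(mean(c(followers, 515)), 1)  # 131.5

# Quartiles and interquartile range (dispersion)
quantile(followers, c(0.25, 0.50, 0.75))

# z-score: how many SDs is 70 from a mean of 36 with SD 13?
z <- (70 - 36) / 13
round(z, 2)                    # 2.62
# Probability of a score of 70 or higher under a normal distribution
pnorm(z, lower.tail = FALSE)   # about 0.004: very unlikely

# Landmark cut-off: 95% of z-scores lie between -1.96 and 1.96
round(qnorm(0.975), 2)         # 1.96
```

pnorm() and qnorm() are base R's normal CDF and quantile functions, so no percentile table is needed.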