Summary

These notes provide an overview of measurement, sampling, and research design. They discuss different types of reliability (test-retest, internal consistency, and interrater) and validity (face, construct, criterion-related, and content), then cover sampling methods, experimental, quasi-experimental, and non-experimental designs, survey construction, correlation and regression, and time-series analysis.

Full Transcript

Test #2 study notes

Reliability: the consistency, stability, or repeatability of one or more measures or observations

3 types of reliability:

1. Test-retest reliability: the extent to which a measure or observation is consistent or stable at two points in time
- If the same instrument returns the same result under the same conditions multiple times, it can be said to have good test-retest reliability
- If the same instrument returns different results under the same conditions multiple times, it can be said to have poor test-retest reliability

Measuring test-retest reliability
- If a measure has good test-retest reliability, the scores from the two time points should be strongly correlated (large effect size)
- If it has poor test-retest reliability, the scores will be weakly correlated (smaller effect size)

2. Internal consistency: a measure of reliability used to determine the extent to which multiple items used to measure the same variable are related
- The most common measurement of internal consistency is Cronbach's alpha, which measures how consistently a set of questions or items in a questionnaire or scale measures the same underlying concept
- Values range from 0 to 1
- The larger the value of Cronbach's alpha, the higher the internal consistency is said to be
- Scales with poor or questionable internal consistency are rarely used

3. Interrater reliability: a measure of the extent to which two or more raters of the same behaviour or event are in agreement about what they observed
- Interrater reliability is a concern wherever there is any subjective component to data scoring
- Subjective: refers to the potential for individual raters or evaluators to introduce their own personal biases, opinions, or interpretations when assessing or rating the same set of data, observations, or items
- Cohen's kappa: a statistical measure that assesses the agreement between raters when categorizing data
- Scores range from 0 to 1
- Indicates how strongly two raters' scores agree with one another
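To make these two reliability statistics concrete, here is a minimal NumPy sketch (not from the notes; the data, function names, and values are invented for illustration) computing Cronbach's alpha for a small questionnaire and Cohen's kappa for two raters:

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """items: rows = respondents, columns = scale items."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)       # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)   # variance of respondents' total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

def cohens_kappa(r1, r2) -> float:
    """Agreement between two raters' categorical codes, corrected for chance."""
    r1, r2 = np.asarray(r1), np.asarray(r2)
    p_o = np.mean(r1 == r2)  # observed agreement
    p_e = sum(np.mean(r1 == c) * np.mean(r2 == c) for c in np.union1d(r1, r2))
    return (p_o - p_e) / (1 - p_e)

# Hypothetical 5-item Likert questionnaire (1-5) answered by 6 respondents
responses = np.array([[4, 5, 4, 4, 5],
                      [2, 2, 3, 2, 2],
                      [5, 5, 5, 4, 5],
                      [3, 3, 2, 3, 3],
                      [1, 2, 1, 2, 1],
                      [4, 4, 5, 4, 4]])
print(f"alpha = {cronbach_alpha(responses):.2f}")  # high -> good internal consistency

# Two raters categorizing the same 8 observed behaviours
rater_a = ["aggr", "pass", "pass", "aggr", "pass", "pass", "aggr", "pass"]
rater_b = ["aggr", "pass", "aggr", "aggr", "pass", "pass", "aggr", "pass"]
print(f"kappa = {cohens_kappa(rater_a, rater_b):.2f}")  # 0.75: strong agreement
```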
Validity: has to do with whether measurements are actually measuring what they are supposed to measure

4 types of validity:

1. Face validity: the extent to which a measure for a variable or construct appears on the surface to measure what it is purported to measure
- If an instrument doesn't seem like it actually measures what it should, it can be said to have poor face validity
- Highly subjective
- Based on judgement or intuition
- The weakest form of validity

2. Construct validity: the extent to which an operational definition for a variable is actually measuring that variable
- Ex: university exams

3. Criterion-related validity: the extent to which scores obtained on some measure can be used to infer or predict a criterion or expected outcome
- If scores on some variable are related to or predict a certain expected outcome, that variable can be said to have good criterion-related validity
- Ex: language proficiency exams

Criterion-related validity subtypes:
A. Predictive: scores obtained predict scores they should predict
B. Concurrent: measures distinguish between groups that they should distinguish between
C. Convergent: two or more measures that should be related are related
D. Discriminant: a measure can be discriminated from other measures that it should not be related to

4. Content validity: how completely a measure reflects what it is trying to measure
- Ex: GAD involves both anxiety/worry and 3 or more related symptoms
- A measure that does not include related symptoms would not fully reflect GAD
- It would therefore have poor content validity

Participant reactivity: the reaction or response participants have when they know they are being observed or measured

Expectancy effects: effects of the researcher's expectations on the results

Range effects: a limitation in the range of data measured, in which scores are clustered to one extreme

Sensitivity: the extent to which a measure can change or be different in the presence of a manipulation
- Ex: a bathroom scale that can only measure weight in whole kilograms
- Not sensitive enough to track small changes in weight
- Can lead to range effects

Target population: all members of a group of interest to the researcher

Accessible population: the portion of the target population that can be clearly identified and directly sampled from
- Ex: survey studies often find it difficult to reach groups that make up very small portions of the population

Probability sampling: when samples are selected directly from the target population
- Ensures that every member of the population has a known chance of being selected in the sample

Non-probability sampling: when samples are selected from the accessible population
- Doesn't guarantee equal chances for every population member

2 types of non-probability sampling:
1. Convenience sampling
2. Quota sampling

Convenience sampling: participants are selected based on how easy or convenient it is to reach or access them
- The most common type of sampling in behavioural research

Participant pools: a group of accessible and available participants for a research study
- Ex: SONA studies

Convenience sampling limitations
- Does not ensure sample representativeness/generalizability
- Can be subject to bias if the convenience of recruitment leads to very specific/non-generalizable participant groups
- Ex: student-only groups of participants

Quota sampling: participants are selected based on criteria/characteristics in the target population
- Ensures that important traits are represented in the sample
- Two main types: simple quota sampling and proportionate quota sampling

Simple quota sampling: when an equal number of subjects or participants is selected for a given characteristic or demographic
- Ex: a study of 100 participants may choose to recruit 50 participants with characteristic A, and 50 with characteristic B

Proportionate quota sampling: when participants are selected such that the known characteristic or demographic is proportionately represented in the sample
- Ex: only 18.5% of Canada's population is 65+ years old. A proportionate sample would recruit 18.5% of its participants from the 65+ age group, rather than equal numbers of each age group (see the sketch after this section)

Quota sampling limitations
- Only possible when the proportions of some characteristics are already known in the population
- Ex: to sample based on frequency of age, you would need to know how frequent those age groups are in the population already
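A small sketch of how quota counts differ between the two quota types (illustrative only; apart from the 18.5% figure from the notes, the proportions and sample size are invented):

```python
# Hypothetical quota calculation for a 200-person sample, split by age group.
sample_size = 200
population_props = {"18-39": 0.37, "40-64": 0.445, "65+": 0.185}

# Simple quota sampling: equal numbers per group, regardless of population share
simple_quota = {group: sample_size // len(population_props)
                for group in population_props}

# Proportionate quota sampling: each group's share matches the population
proportionate_quota = {group: round(sample_size * prop)
                       for group, prop in population_props.items()}

print(simple_quota)         # {'18-39': 66, '40-64': 66, '65+': 66} (remainder unallocated)
print(proportionate_quota)  # {'18-39': 74, '40-64': 89, '65+': 37}
```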
Probability sampling: when samples are selected directly from the target population

4 types of probability sampling:
1. Simple random sampling
2. Stratified random sampling
3. Systematic sampling
4. Cluster sampling

Simple random sampling: sampling subjects and participants such that all individuals in a population have an equal chance of being selected

Sampling with replacement: each individual selected is replaced before the next selection, to ensure that the probability of selecting each individual is always the same

Sampling without replacement: when each individual selected is not replaced before the next selection
- The most common approach
- Requires sampling from a sufficiently large population

Replacement
- Ex: if we sample participants from a group of 20 students, there is a 5% (1/20) chance of selecting any participant
- If we do not replace them before the next selection, there is a 1/19 chance of selecting any remaining participant next time
- The effect of non-replacement is small when we are sampling from a large population
- Ex: if you draw one participant from a pool of 1000, there is a 1/1000 chance of selecting each participant
- If you draw again without replacement, there is a 1/999 chance of selecting each remaining participant
- The difference in probability is negligible, making sampling without replacement the most common approach

Stratified random sampling: a method in which a population is divided into subgroups or strata; participants are then selected from each subgroup using simple random sampling and are combined into one overall sample
- Common with characteristics like age
- Reduces the risk of under- or over-representing specific groups

Systematic sampling: when the first participant is selected using simple random sampling, and then every nth person is systematically selected until all participants have been selected
- Ex: randomly selecting one participant as a starting point and then including every nth person from that point forward

Systematic sampling limitations
- Can introduce bias if there is an unforeseen hidden pattern or periodicity in the data
- Sensitive to the starting point
- Can make it challenging to replace missing data, since more participants cannot simply be randomly selected (they would not follow the same sampling timeline/procedure as other participants)

Cluster sampling: clusters of individuals are identified in a population, and then a portion of clusters that are representative of the population is selected
- Ex: selecting several typical towns as being representative of a province

Cluster sampling limitations
- Works best when populations naturally cluster
- Does not include the whole population, since excluded clusters are not sampled at all
- Can pose challenges if clusters are not approximately equally sized
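A standard-library sketch contrasting the four probability sampling approaches, plus sampling with replacement (the population, strata, and cluster sizes are all invented for illustration):

```python
import random

random.seed(42)
population = list(range(1, 1001))  # 1000 hypothetical participant IDs

# Simple random sampling (without replacement: the most common approach)
simple = random.sample(population, k=20)

# Sampling WITH replacement: the same person can be drawn more than once
with_replacement = random.choices(population, k=20)

# Stratified random sampling: split into strata, then sample from each
strata = {"18-39": population[:400], "40-64": population[400:800],
          "65+": population[800:]}
stratified = [p for group in strata.values() for p in random.sample(group, k=7)]

# Systematic sampling: random start, then every nth person
step = 50
start = random.randrange(step)
systematic = population[start::step]  # 20 participants, evenly spaced

# Cluster sampling: pick whole clusters (e.g., towns), keep everyone in them
clusters = [population[i:i + 100] for i in range(0, 1000, 100)]  # 10 "towns"
cluster_sample = [p for town in random.sample(clusters, k=2) for p in town]
```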
Sampling error
- There is always the possibility that a sample collected does not accurately reflect the population
- The mean of the sample data is often different from the mean of the population it was drawn from
- Ex: measure the average height of a sample of 20 people. Will it be the same as the average height in Canada overall? (Usually no)

Sampling error: the difference between what is observed/measured in a sample and what is true in the population

Standard error of the mean (SEM): the distance that sample mean values tend to deviate from the population mean

Standard deviation: reflects how far individual data points can vary from the mean of a sample
- Relates to the variability of individual participants

Standard error: reflects how far the average of a sample can vary from the mean of the entire population
- Relates to the variability of whole samples

SEM: practical example
- The population mean for female weight in Canada is 155 lbs
- If the SEM of a sample of 100 participants was 5, what would that indicate about samples from this population?
- It indicates that the mean weight of a sample of 100 participants would tend to vary from the population average by about 5 lbs
- Ie: 155 +/- 5 lbs

Practical application: the more participants a study collects, the more confident you can be that it accurately reflects the population from which it was drawn
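The SEM works out to the sample standard deviation divided by the square root of the sample size, which is why larger samples inspire more confidence. A quick sketch (simulated weights; the 155 lb mean comes from the notes, and the standard deviation of 50 is chosen so that n = 100 gives an SEM near 5):

```python
# SEM = s / sqrt(n): larger samples -> smaller SEM -> sample means that stay
# closer to the population mean. The weight data below is simulated.
import numpy as np

rng = np.random.default_rng(0)
population_mean, population_sd = 155, 50  # lbs; sd chosen for illustration

for n in (25, 100, 400):
    sample = rng.normal(population_mean, population_sd, size=n)
    sem = sample.std(ddof=1) / np.sqrt(n)
    print(f"n={n:3d}  sample mean={sample.mean():6.1f}  SEM={sem:5.1f}")
# Quadrupling n roughly halves the SEM (n=100 -> ~5 lbs, n=400 -> ~2.5 lbs)
```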
Research questions: 3 categories
1. Exploratory: when there is not a strong research objective or clear direction
2. Descriptive: aims to simply describe some phenomenon or sample characteristic
- Relies on descriptive statistics only (mean, median, mode, standard deviation, etc.)
3. Relational: describes relationships between variables
- These can be correlational or cause-effect relationships
- The most common category

Analyses: 2 types
1. Measuring relationships
- Statistical tests are often referred to as "correlational" or "associational"
- Do not establish cause-effect relationships
- p < 0.05 in this context indicates that two variables are significantly related to one another
2. Measuring group differences
- Statistical tests are "comparative"
- Can be used with the experimental research design to establish cause-effect relationships
- p < 0.05 in this context indicates that two groups are significantly different from one another

Research designs

Experimental design: the use of methods and procedures in which the experimenter fully controls the conditions and experiences of participants
- Allows establishing cause-effect relationships

Features of the experimental design
1. Manipulation of variables: experimental studies involve actively manipulating one or more independent variables to observe their effects
- Independent variable (IV): the manipulated variable
- Dependent variable (DV): measures the effect of that manipulation on another variable (depends on the independent variable)
- Measuring the DV both before and after the manipulation is sometimes referred to as a pre-post design
- Pre-post design: research designs that involve measuring a particular variable both before and after a specific intervention, treatment, or event
2. Randomization: when each individual has an equal chance of being assigned to each group
- Participants are randomly placed in treatment groups, not assigned by the experimenter
- Ex: a clinical drug trial, where participants could be assigned to any drug group
3. Control group: a group in an experimental study that does not receive the experimental treatment or intervention
- There is always the possibility that the treatment group would have changed even if we didn't do anything
- To rule this out, we include a group that receives no treatment
- Used as a baseline for comparison, to assess the effects of the treatment (or independent variable) versus the effect of doing nothing at all

Statistical analyses for experimental designs measure group differences
- t-tests and ANOVA are the most common approaches
- Measure differences between two or more groups or time points
- An independent variable is sometimes referred to as a factor
- In some experiments, factors also have several levels: the specific conditions or groups created by manipulating that factor
- Ex: in a drug trial, the drug is the factor, but it may be administered in several different dosage groups (levels)
- When a factor has two levels, we can test the difference between them using a t-test
- When a factor has three or more levels, we can test the differences between them using ANOVA

Within-groups: measuring the same participants at different points in time
Between-groups: comparing groups of different participants to one another
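A sketch of how the two-level vs three-or-more-level rule plays out in practice (SciPy, with simulated dosage groups; the group names and all numbers are invented):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Factor "drug dosage" with three levels: placebo (control), low, high
placebo = rng.normal(50, 10, size=30)  # symptom scores per group
low = rng.normal(45, 10, size=30)
high = rng.normal(38, 10, size=30)

# Two levels -> independent-samples t-test
t, p = stats.ttest_ind(placebo, high)
print(f"t-test: t={t:.2f}, p={p:.4f}")  # p < 0.05 -> groups differ significantly

# Three or more levels -> one-way ANOVA
f, p = stats.f_oneway(placebo, low, high)
print(f"ANOVA:  F={f:.2f}, p={p:.4f}")
```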
Quasi-experimental designs
- It is difficult to control all factors, and sometimes researchers are interested in variables that do not permit random assignment
- These factors have pre-existing levels and cannot be assigned by the researcher: quasi-independent variables
- Sometimes studies lack a control group; these are called one-group designs, and are a type of quasi-experimental design (they are not true experimental designs since they lack a proper control)
- Ex: researchers may study age differences
- They cannot control participants' ages (a feature participants already have)
- This is a quasi-experimental design because participants cannot be randomly assigned to the age groups

Types of quasi-experimental designs
- Cross-sectional studies
- Longitudinal studies
- Cohort studies
- One-group studies

Analyzing quasi-experimental designs
- Measure group differences: t-test, ANOVA

Limitations
- Cannot be used to determine cause-effect relationships, because the experimenters cannot manipulate quasi-independent variables
- Studies with quasi-independent variables also lack randomization, and may lack a control group

Non-experimental designs: the use of methods to make observations without intervention from the researcher
- The researcher does not manipulate any variables directly
- Can be qualitative or quantitative

Quantitative non-experimental designs
- Include correlational studies, naturalistic studies, and surveys
- Also include studies of already existing data: meta-analysis and archival research

Analyzing quantitative non-experimental designs
- Using measures of relationships (correlation is a statistical measure of the strength of the relationship between two variables)
- Using measures of group differences
- May also include content analysis
- Cannot establish cause-effect relationships
- Can support a relationship between phenomena
- Can still hold significant value
- Ex: climate change research

Qualitative non-experimental designs
- Rely on structured or semi-structured interviews, focus groups, etc.

Internal validity: refers to the extent to which the study controls the conditions and experiences of participants
- Studies that are well controlled (account for external causes, have a control group, etc.) have high internal validity
- Mostly experimental designs
- Quasi- and non-experimental designs have less control, and less internal validity

External validity: refers to how generalizable a study is
- Studies with fewer constraints are more generalizable

Types of external validity
- Population validity: the extent to which results from a sample generalize to the population
- Ecological validity: the extent to which results generalize to different settings or environments
  Ex: do the results of a study depend on where it was performed?
- Temporal validity: the extent to which results generalize to different points in time
  Ex: do findings from a study on mental health during COVID still apply after the pandemic?
- Outcome validity: the extent to which results generalize to related variables
  Ex: does a treatment for anxiety also reduce hypervigilance?

Value of non-experimental work

Holistic understanding: a comprehensive and all-encompassing grasp of complex phenomena; takes into account context, relationships, and the intricacies of human experience

Big data: large and complex datasets that require advanced tools for management and analysis due to their size and complexity

Thick data: qualitative, context-rich information that provides deeper insights into human behaviour, emotions, and cultural nuances
- Often obtained through methods such as ethnography, interviews, and narratives
- Qualitative, harder-to-measure information about participants

Types of non-experimental approaches

Quantitative non-experimental methods:
- Correlational
- Naturalistic
- Survey
- Existing data (archival data)

Qualitative non-experimental approaches:
- Case studies
- Ethnography
- Phenomenology
Trustworthiness: the credibility, transferability, dependability, and confirmability of a qualitative study
1. Credibility: the extent to which observed results reflect the realities of participants in such a way that the participants themselves would agree with the researcher's report
- Parallels internal validity
2. Transferability: the extent to which observed results are useful or applicable beyond the setting/context of the research
- Parallels external validity (generalizability)
3. Dependability: the extent to which the observed results would be similar if another study were conducted in the same/similar context
- Parallels reliability (same results over time or in different contexts)
4. Confirmability: the extent to which observed results reflect the actual context of participant experiences, rather than the researcher's perspective
- Parallels objectivity/experimenter bias

Attrition: when participants fail to complete participation in a research study

Qualitative non-experimental approaches

Phenomenology: the qualitative analysis of conscious experiences from the first-person point of view of the participant
- In-depth, one-on-one interviews
- Involves having participants describe their experiences from their perspective
- The researcher then constructs a narrative to summarize the experiences described in the interview
- Individual experiences can be highly influenced by external, contextual factors: historical, political, societal, geographic, temporal, gender, familiarity

Ethnography: involves analyzing the behaviour and identity of a group or culture as it is described by members of the group or culture

Participant observation: when the researchers participate in or join the group or culture that they are observing
- Ex: the Hells Angels; going to church

Case study: qualitative analysis of an individual, a group, or an organization to illustrate a phenomenon or compare observations of many cases

Types of case studies
1. Illustrative: pertains to unique cases where little is known (like the patient H.M.)
2. Exploratory: a preliminary or pilot study conducted prior to a large-scale study. Provides information pertinent to conducting the larger study
3. Collective: the review of several related cases together
- Ex: Paul Broca discovered the language centre of the brain by describing case studies of patients
- He described a collection of multiple patients with similar symptoms and brain damage

Case studies: key considerations
- May not be generalizable if based on a single individual or small group of individuals
- A single case that misleads researchers could lead to bad interpretations being generalized
- Ex: in the H.M. case, it later turned out that the regions of his brain researchers thought had been removed were different from what was actually removed

Study designs
- Content analysis
- Meta-analysis
- Archival research

Archival research concerns
1. Selective survival: the process by which historical records survive or are excluded/decay over time
- Ex: what if some records are discarded while others are not? Then the historical record would be biased
2. Selective depository: the process by which records are selectively recorded at the time of creation
- What if only some records were saved, while others were discarded? This could also lead to bias in the historical record
Psychometrics: the branch of psychology that is concerned with the theory and technique of developing measures and assessments

Survey items
- Validation studies tend to report on important statistical measures (ex: Cronbach's alpha)
- Attempt to establish the usability of a new measurement technique
- May also refine older measures to shorten/strengthen them
- Provide the foundation for trusting and interpreting results

Types of questions that tend to be included in surveys
1. Open-ended items
2. Partially open-ended items
3. Restricted items

Open-ended item: a question that allows the respondent to give any response in their own words, without restriction
- Ex: "describe an experience you had", "how do you feel about..."
- Administered in surveys
- The analysis method often involves content analysis

Partially open-ended items: survey questions that include a few restricted options for answers, but with a last option that permits open-ended responses if none of the default options fit the response participants hope to give
- Can be treated qualitatively or quantitatively
- Researchers may report only the number of "other" responses
- Or they may report the actual qualitative responses to the "other" option

Restricted items: a question that includes a set number of answer options to which participants must respond
- No option for other responses
- May include "don't know" or "prefer not to say"
- Includes Likert scales (numeric response scales used to indicate a participant's rating or level of agreement)

Response set: the tendency for participants to respond the same way to all items when the direction of ratings is the same for all items in the survey

Reverse-coded items: an item phrased in the semantically opposite direction of most other items in a survey
- Must be scored in reverse from the other items (see the sketch after this section)

Participant fatigue: the declining engagement and willingness of participants to accurately complete surveys, questionnaires, or other research activities due to factors such as survey length, repetition, or boredom

Generalizing survey findings
- When response rates are low, it is still practical to publish findings
- Theoretical generalization: data is collected to support an existing theory
- Empirical generalization: data may be consistent with well-established research from the past
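Reverse scoring has a one-line formula: for a scale running from min to max, the reversed score is min + max - score. A tiny sketch with hypothetical responses:

```python
# Reverse-coding a 1-5 Likert item: a "5" on a reverse-phrased item
# (e.g., "I feel calm" on an anxiety scale) should count as a "1".
SCALE_MIN, SCALE_MAX = 1, 5

def reverse_score(score: int) -> int:
    return SCALE_MIN + SCALE_MAX - score

responses = [5, 4, 2, 1, 3]  # raw answers to a reverse-coded item
rescored = [reverse_score(s) for s in responses]
print(rescored)              # [1, 2, 4, 5, 3]
```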
Correlational designs: the measurement of two or more factors to determine the extent to which those factors are related

Correlation coefficient: a statistical measure of the strength and direction of the linear relationship between two variables
- The sign ranges from - to + (negative or positive direction)
- The magnitude ranges from 0 to 1 (small to large strength of the relationship)
- The most common is Pearson's r (ex: r = 0.25)
- Tells us both the strength and the direction

Direction: refers to whether the correlation between two variables is positive or negative
- Positive correlation: when one variable tends to increase as the other increases
- Negative correlation: when one variable tends to increase as the other decreases
- Indicated by the sign of the correlation value
- Ex: r = -0.65 indicates a negative correlation, r = +0.65 indicates a positive correlation

Strength: indicates how strong or weak the relationship between two variables is
- The effect size of the correlation
- A correlation coefficient closer to +1 or -1 represents a strong correlation
- A correlation coefficient closer to 0 suggests a weak or no correlation between the variables
- Ex: r = +0.25 indicates a weak positive relationship, while r = -0.70 indicates a strong negative relationship

Regression line: a straight line that represents the relationship between two variables
- Used to predict outcomes
- Any value of X (the predictor variable) has a likely corresponding value on Y (the criterion variable)
- We can plug in a value for X and solve for Y
- This gives us a likely (or estimated) value for Y

Regression: a statistical procedure used to determine the equation of a regression line, and to determine the extent to which that regression equation can be used to predict values of one factor

Coefficient of determination (R²): measures the proportion of variance in one variable that can be explained by the variance in another variable
- Measures how much variation in the dependent variable is explained by the independent variable
- A low R² indicates that much of the variability in the related factors is accounted for by extraneous variables
- A high R² indicates that much of the variability in Y is in fact explained by its relationship with X
- Ex: R² = 0.7 indicates that 70% of the variance in the dependent variable can be explained by the independent variable
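A sketch tying r, the regression line, and R² together (SciPy; the study-hours/exam-score data below is simulated purely for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
hours = rng.uniform(0, 10, size=50)                  # predictor X
scores = 55 + 3 * hours + rng.normal(0, 8, size=50)  # criterion Y, with noise

r, p = stats.pearsonr(hours, scores)
print(f"r = {r:+.2f}, p = {p:.4f}")  # sign gives direction, magnitude gives strength

fit = stats.linregress(hours, scores)      # equation of the regression line
predicted = fit.intercept + fit.slope * 6  # plug in X = 6 hours, solve for Y
print(f"Y = {fit.intercept:.1f} + {fit.slope:.1f}X -> predicted score: {predicted:.0f}")

print(f"R^2 = {r**2:.2f}")  # proportion of variance in Y explained by X
```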
Pearson's r is a type of bivariate correlation
- It is a correlation between only two variables

The common bivariate regression (and Pearson's r) is a linear regression
- The regression line is a straight line
- Assumes the relationship between your variables is a linear relationship

Non-linear relationships
- Common types are quadratic (U-shaped or inverted-U), cubic, exponential, and logarithmic
- Pearson's r is not an appropriate measure if the relationship between the variables is not a straight line, even if they are related (r can be near 0)
- Ex: many drug treatments produce an inverted-U-shaped dose response
- This is a non-linear relationship: increasing drug dosages indefinitely does not continue to improve the response

Testing assumption: a foundational requirement that ensures the reliability and accuracy of statistical analyses; indicates presumptions made about the data being analyzed

Parametric tests: statistical methods that assume specific properties about the underlying data distribution, such as normality
- Pearson's correlation is a parametric test
- It assumes the relationship is linear (or at least, it can only detect linear relationships)

Pearson's correlation has 3 main testing assumptions:
1. Linearity
2. Normality
3. Homoscedasticity

Assumption of linearity: assumes a linear relationship between the independent and dependent variables
- Required for linear regression
- Required for Pearson correlation

Assumption of normality: assumes that the distribution of the data follows an approximately normal distribution

Homoscedasticity: when data has consistent variability, or equal variance, across the range of values
- Variability in the data should be consistent
- If variability were not consistent, the strength of the relationship would be stronger for some values and weaker for others
- For regression, value predictions would be more meaningful at one end of the scale than the other

Heteroscedasticity: when data has unequal variability across the range of values

Independence of errors: errors between observed and predicted values should not be related for each value of the predictor variable

Using parametric tests
- Can only be used when the assumptions of the test have been met

Non-parametric tests: statistical methods that make fewer or no distributional assumptions
- Ex: Spearman's correlation
- Measures monotonic relationships: as one variable increases, the other tends to increase or decrease consistently, but not necessarily at a constant rate
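A quick demonstration (simulated data, not from the notes) of why the linearity assumption matters, and of what Spearman's correlation captures that Pearson's misses:

```python
import numpy as np
from scipy import stats

x = np.linspace(1, 10, 50)

# Monotonic but non-linear (logarithmic growth)
y_log = np.log(x)
print(stats.pearsonr(x, y_log)[0])   # high, but below 1 despite a perfect trend
print(stats.spearmanr(x, y_log)[0])  # exactly 1.0: perfectly monotonic

# Inverted-U dose-response: strongly related, yet r is near zero
y_invu = -(x - 5.5) ** 2
print(stats.pearsonr(x, y_invu)[0])  # ~0: Pearson's r cannot detect this
```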
Types of quasi-experimental designs:
1. One-group designs
2. Nonequivalent group designs
3. Time series designs
4. Developmental designs

One-group designs
- Single-group post-test-only design: data is measured only after the experimental intervention
  - Lacks a control group
  - Lacks a pre-treatment baseline
- Single-group pre-post design: data is measured both before and after the experimental intervention
  - Lacks a control group
  - Includes a pre-treatment baseline

Non-equivalent group designs: a control group is matched on certain pre-existing characteristics to be similar to the treatment group
- The intervention is performed for one group, but not the other
- The approach has two main designs:
- Non-equivalent post-test-only design:
  - The main issue is selection differences (differences not controlled by researchers)
- Non-equivalent pre-post design:
  - Combines a non-equivalent group with pre-test baseline measurement
  - Allows knowing whether there is a pre-treatment difference

Time series designs: designs that test the same participants over a period of time
- Three main approaches:
- Basic time series
  - Collects data over evenly spaced intervals
  - Can include many observations before and after treatment
  - A single-group design; does not include a control
  - Accounts for time-of-day effects: variations in research outcomes or phenomena based on the specific time of collection
- Interrupted time series
  - The dependent variable is measured at many different points in time in one group, before and after a naturally occurring treatment (naturally occurring event)
  - E.g., natural disasters, significant life events, medical procedures, etc.
  - Limitations: requires ongoing data collection, lacks a control group, little to no control over the treatment event
- Control time series
  - A time-series design that includes a second, non-equivalent control group observed during the same period as the treatment group
  - Can be basic or interrupted
  - Allows comparison to a group that received no treatment

Developmental designs
- Concerned with development across the lifespan
- The non-random assignment variable is age
- Developmental designs sometimes occur over very long periods of time
- Main types:
- Longitudinal studies: used to study changes across the lifespan by observing the same participants at different times and measuring the same dependent variable at each time
  - Limitations: time-consuming, high levels of attrition (dropouts), resource-intensive, requires consistency
- Cross-sectional studies: participants are grouped by their age, and variables are measured in each age group
  - Creates a cross-section of the aging population
  - Easier approach than longitudinal
  - Allows comparing different age groups at once
  - Subject to cohort effects: when differences in the characteristics of participants in different age groups or cohorts affect the observed result
- Cohort-sequential designs: two or more cohorts observed together over time
  - Accounts for cohort effects, while still providing the benefits of both approaches
  - Can be expensive, research-intensive, and may take a long time

Single-case designs
- Studies where a participant acts as their own control
- Structured by alternating treatment and baseline phases over many trials
- Three main types:
- Reversal design: the participant goes through condition A, then B, then A (A = baseline, B = treatment)
  - Useful when the effects of treatment are temporary and participants are expected to return to baseline
- Multiple-baseline design: treatment is administered successively over time to different participants, rather than all at once
  - No need for a second baseline period
- Changing-criterion design: involves changing the treatment only after some criterion has been met
  - Often used in studies of gradually improving behaviour

Analysis techniques

Comparing between groups or time points
- t-tests & ANOVA
- Assess whether there is a significant difference between the means of two or more groups
- Output a test statistic that reflects the significance of the test, used to derive a p-value
- Also have an effect size that describes the distance between the groups

Trends & patterns/cycles
- Trends: long-term, persistent movements or shifts in the overall direction of a time series
- Patterns/cycles: short-term or repetitive structures or behaviours within a time series, often related to seasonality or cyclicality

Autocorrelation: refers to the correlation of a time series with its own past or future values
- Tests for autocorrelation reflect pattern repetitions in the series
- Autocorrelation is strong when the repeating pattern lines up with itself
- Negative when the pattern is out of phase
- Low when there is no relationship

Stationarity: a time series is stationary if its statistical properties, like mean and variance, remain constant over time
- Non-stationary time series have trends

Curve fitting: adjusting a mathematical function to closely match the observed pattern or trend in time series data
- A form of nonlinear regression
- Aids in trend identification and forecasting by capturing the general pattern

Interrupted time series
- Findings are based on forecasting
- Interruptions in the time series are often interpreted as significant deviations from projected outcomes

Segmented regression: fits different linear relationships to distinct segments of a dataset, allowing for shifts or breakpoints in the data
- In this approach, boundaries between time segments are called breakpoints
- The p-value reflects whether the change in slope is statistically significant
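A minimal sketch of measuring autocorrelation by correlating a series with a lagged copy of itself (the series is simulated and the lag values arbitrary):

```python
# Autocorrelation: correlate a time series with a lagged copy of itself.
# The series below is simulated: a repeating (seasonal) pattern plus noise.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
t = np.arange(200)
series = np.sin(2 * np.pi * t / 24) + rng.normal(0, 0.2, size=200)  # 24-step cycle

def autocorr(x, lag):
    """Pearson correlation between the series and itself, shifted by `lag`."""
    return stats.pearsonr(x[:-lag], x[lag:])[0]

print(f"lag 24: {autocorr(series, 24):+.2f}")  # pattern lines up: strong positive
print(f"lag 12: {autocorr(series, 12):+.2f}")  # half a cycle out of phase: negative
print(f"lag 7:  {autocorr(series, 7):+.2f}")   # weaker relationship in between
```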
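And a sketch of segmented regression for an interrupted time series (all values invented): fit a separate line before and after the breakpoint and compare the slopes.

```python
# Segmented regression sketch: an interrupted time series with a breakpoint
# (e.g., a naturally occurring event at week 50), fitting one line per segment.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
weeks = np.arange(100)
bp = 50  # breakpoint: the "interruption"

# Simulated outcome: gentle upward trend, then a steeper decline post-event
outcome = np.where(weeks < bp, 20 + 0.1 * weeks, 25 - 0.3 * (weeks - bp))
outcome = outcome + rng.normal(0, 1.0, size=100)

pre = stats.linregress(weeks[:bp], outcome[:bp])
post = stats.linregress(weeks[bp:], outcome[bp:])
print(f"slope before: {pre.slope:+.2f} (p={pre.pvalue:.3f})")
print(f"slope after:  {post.slope:+.2f} (p={post.pvalue:.3f})")
# Comparing the two slopes shows the shift; a full analysis would also test
# whether the change in slope itself is statistically significant.
```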
