Chapter 3: Planning and Conducting Empirical Studies
This document details the planning and conduct of empirical studies, illustrating them with a real-world example concerning student use of AI tools. It's a chapter extracted from a broader academic source with a focus on the topic of Higher Education issues.
In this chapter we will be learning more about the design and analysis of empirical studies.

The results of empirical studies are often reported in the media:
https://www.theguardian.com/technology/2024/feb/01/more-than-half-uk-undergraduates-ai-essays-artificial-intelligence

They are usually more formally presented in reports, or scientific journals:
https://www.hepi.ac.uk/wp-content/uploads/2024/01/HEPI-Policy-Note-51.pdf

Reports of empirical studies often follow a similar format:
Introduction
Methods
Results
Conclusions

Section 3.1: Empirical Studies

The goal of an empirical study is to collect data in order to learn about a population or process. In this chapter we will introduce a structured approach to the design and analysis of empirical studies. We will look, in turn, at the Problem, Plan, Data, Analysis, and Conclusion steps of a study, which we abbreviate to PPDAC. PPDAC emphasizes the statistical aspects of designing an empirical study.

PPDAC

Here is a brief summary of these steps:
Problem: A clear statement of the study’s objectives.
Plan: The procedures used to carry out the study including how the data will be collected.
Data: The physical collection of the data, as described in the Plan.
Analysis: The analysis of the data collected, accounting for considerations in the Problem and the Plan.
Conclusion: The conclusions that are drawn about the Problem, and any limitations of the study.

Example: Generative AI in Education Study

We’ll use the following research study to illustrate the steps of PPDAC.

“Provide or punish?
Students’ views on generative AI in higher education” - conducted by the UK’s Higher Education Policy Institute (Hepi) and available at:
https://www.hepi.ac.uk/wp-content/uploads/2024/01/HEPI-Policy-Note-51.pdf

The report was also summarized in a Guardian news article titled “More than half of UK undergraduates say they use AI to help with essays” at
https://www.theguardian.com/technology/2024/feb/01/more-than-half-uk-undergraduates-ai-essays-artificial-intelligence

PROTIP: we’ll refer to this example several times, so you may find it helpful to familiarize yourself with the details that follow!

Example Introduction

Since ChatGPT was released on 30 November 2022, there has been an explosion of interest in artificial intelligence (AI) technology... There is now a plethora of generative AI tools.

The sector must now adapt to generative AI. Many higher education institutions have already begun to do so, such as by releasing guidance on the acceptable use of AI. But students’ attitudes to the technology are not well understood. How many students use generative AI? What do they use it for, and how do they expect to use it in the future? What do they consider acceptable? How do they feel about the way their institutions have approached it?

Using new polling data collected exclusively for this report, we address these questions for the first time and build a detailed picture of the growing influence of generative AI on UK higher education.

Example Methods

We polled 1,250 undergraduate students (rounded to the nearest five) through UCAS in November 2023. Before taking the survey students were told that their responses were confidential and would not be used to identify them.

The survey responses have been weighted to make the results more representative of the current student population. This accounts for differences in response rates for demographics such as age, sex, country and ethnic group.
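The report does not say how its weights were computed. As a rough illustration of what "weighting" can mean, the sketch below applies post-stratification on a single demographic variable: each respondent is weighted by their group's population share divided by its sample share. The group labels, population shares, and responses are all invented for illustration and are not HEPI data; the function names are hypothetical.

```python
from collections import Counter


def poststratification_weights(sample_groups, population_shares):
    """Weight per group = (share of group in population) / (share of group in sample)."""
    n = len(sample_groups)
    counts = Counter(sample_groups)
    return {g: population_shares[g] / (counts[g] / n) for g in counts}


def weighted_proportion(responses, groups, weights):
    """Weighted share of respondents whose response is True."""
    total = sum(weights[g] for g in groups)
    yes = sum(weights[g] for r, g in zip(responses, groups) if r)
    return yes / total


# Invented sample of 10 respondents: group "A" is over-represented (8 of 10).
groups = ["A"] * 8 + ["B"] * 2
used_ai = [True, True, True, True, False, False, False, False, True, True]

# Invented population shares: the two groups are actually the same size.
population_shares = {"A": 0.5, "B": 0.5}

weights = poststratification_weights(groups, population_shares)
unweighted = sum(used_ai) / len(used_ai)
weighted = weighted_proportion(used_ai, groups, weights)

print(f"unweighted proportion: {unweighted:.3f}")
print(f"weighted proportion:   {weighted:.3f}")
```

Because the under-represented group "B" has a higher rate of AI use in this made-up sample, weighting pulls the estimate upwards. Weighting can only correct for the variables it uses; it does not remove other sources of error discussed later in the chapter.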
Example Findings

The most popular use is enhancing and editing writing, for which 37% of students use AI tools such as Grammarly. Three-in-10 students (30%) say they have used AI to generate text using tools like ChatGPT, and a quarter (25%) have used AI for translation, for example using Google Translate. Only about a third of students (34%) say they have not used AI since starting their degree.

More than half of students (53%) have used generative AI. Of these students, the most popular use is as an ‘AI private tutor’, with 36% of students using AI to explain concepts to them. Other popular uses include suggesting ideas for research and summarising articles. Some students admit incorporating AI text into assessments, with 3% saying they have done so without editing and 5% editing the text first using a digital tool.

Digging deeper into the demographic breaks we see that while the overall proportion of male and female students using AI is about the same, there is a gender divide in the way it is used.

We also found evidence of an AI ‘digital divide’... when looking at POLAR quintiles, which categorise areas by the proportion of young people who enter higher education. The most deprived quintiles 1 and 2 have the lowest proportions of AI use, with students from quintile 5 the most likely to say they have used generative AI.

PROTIP: The report includes a large number of numerical and graphical summaries. Take a look through and see if any surprise you.

Example Conclusions

Five major conclusions emerge from the above results.
1 Generative AI has quickly become normalised among students in higher education.
2 Systems to deal with cheating seem to be having the desired effect, for now.
3 A ‘digital divide’ is beginning to emerge, with some students benefiting from AI more than others.
4 Students want more support with AI and more AI tools to be provided for them.
5 The ‘digital divide’ may apply to institutions’ use of generative AI as much as students.

Example

To summarize:
Survey of 1,250 current undergraduate students in the UK.
Questions about students’ use of AI tools.
Demographic data including age, gender, and racial identities also gathered.

Conclusions:
More than half (53%) of students have used generative AI to help them with assessments.
Evidence of differences between certain groups (such as male vs. female, and POLAR quintiles).
(Many other conclusions are drawn!)

Important: This is a very simplified summary of the study and is just to give us an approximate idea of what it involved. We’ll be taking a more detailed look as we work through this Chapter!

Section 3.2: Steps of PPDAC

Problem

The Problem step addresses questions starting with ‘What’.
What group of things or people do we want the conclusions to apply to?
What variates can we define?
What is (are) the question(s) we are trying to answer?
What conclusions are we trying to draw?

The Problem

In the Problem step the units and the target population or target process must be defined.

Definition: The target population or target process is the collection of units to which the experimenters conducting the empirical study wish the conclusions to apply.

What is the target population in the AI study?

Example

From the report’s introduction:
“The sector must now adapt to generative AI. Many higher education institutions have already begun to do so, such as by releasing guidance on the acceptable use of AI. But students’ attitudes to the technology are not well understood. How many students use generative AI? What do they use it for, and how do they expect to use it in the future? What do they consider acceptable? How do they feel about the way their institutions have approached it?
Using new polling data collected exclusively for this report, we address these questions for the first time and build a detailed picture of the growing influence of generative AI on UK higher education.”

Possible target population/process:
Current and future university students.
The same group, but limited to those attending UK universities.
One of the previous two groups, but also limited to undergraduate students only.

Key concept: When reading reports of studies it is often unclear exactly what the target population is. When we design studies, we should have a very clear idea of precisely who (or what) we want our results to apply to, and ensure that is clearly communicated.

The Problem - variates

Another important consideration in the Problem step is variates. As a reminder from Chapter 1:

Definition: A variate is a characteristic of a unit.

What are the variates of primary interest in the AI study?

PROTIP: to determine the variates, look at what is measured or recorded on each unit.

Key variates include:
Demographics: Age, sex, ethnic group, POLAR quintile.
Responses to survey questions.

PROTIP: Don’t forget that the answer to a survey question is a variate!

Exercise: What types of variates are these?

The survey leads to variates of various types:
“Did you use AI for [task]?” = Categorical (binary)
Likert scale (strongly agree to strongly disagree) = Ordinal
Free text fields = Complex

The Problem - attributes

Another definition from Chapter 1 that will be very important:

Definition: An attribute is a function of the variates over a population.

In the Problem step the questions of interest are specified in terms of the attributes of the target population.

What are the attributes of interest in the AI study?

From the introduction:
“How many students use generative AI? What do they use it for, and how do they expect to use it in the future?
What do they consider acceptable? How do they feel about the way their institutions have approached it?”

Possible attributes of interest:
The proportion of students in the target population who have used generative AI in their university studies.
The proportion of students in the target population who consider it acceptable to use generative AI in assessments.
The difference between the proportion of students in the target population identifying as female and those identifying as male who have used AI to write computer code.

Important: There are many other attributes that may be of interest!

The Problem - types

The problems that an empirical study is designed to address can usually be categorized as one of three types:
Descriptive: to determine a particular attribute of the population (e.g. the national unemployment rate).
Causative: to determine the existence or nonexistence of a causal relationship between two variates (e.g. does a new hockey helmet reduce the risk of concussion).
Predictive: to predict the response of a variate for a given unit (e.g. predict e-cigarette weekly sales if the sales tax on them is doubled).

In the AI study the researchers were trying to estimate various attributes of the target population, such as the proportion of students who had used generative AI in their university studies. What type of problem is this?

This is a descriptive problem.

Warning: identifying problem types is not always easy, and we can often think a problem is causative when it is in fact descriptive.

The report highlights some comparisons, such as those identifying as male being more likely to have used AI tools to write computer code, compared to those identifying as female. This is also a descriptive problem - they are not trying to establish if sex or gender has a causative effect on the use of AI.
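To make "an attribute is a function of the variates over a population" concrete, here is a minimal sketch that computes two of the descriptive attributes listed above: a proportion, and a difference between two group proportions. The eight units and their variate values are invented, and the function names are hypothetical; nothing here comes from the report.

```python
# Each dict is one unit; its keys are variates. All values are invented.
population = (
    [{"sex": "female", "used_ai_for_code": True}] * 1
    + [{"sex": "female", "used_ai_for_code": False}] * 3
    + [{"sex": "male", "used_ai_for_code": True}] * 2
    + [{"sex": "male", "used_ai_for_code": False}] * 2
)


def proportion(units, variate):
    """Attribute: share of units for which a binary variate is True."""
    return sum(1 for u in units if u[variate]) / len(units)


def difference_in_proportions(units, group_variate, group_a, group_b, variate):
    """Attribute: proportion in group_a minus proportion in group_b."""
    a = [u for u in units if u[group_variate] == group_a]
    b = [u for u in units if u[group_variate] == group_b]
    return proportion(a, variate) - proportion(b, variate)


overall = proportion(population, "used_ai_for_code")
female_minus_male = difference_in_proportions(
    population, "sex", "female", "male", "used_ai_for_code"
)

print(f"overall proportion:        {overall}")
print(f"female minus male (diff):  {female_minus_male}")
```

Both quantities simply summarise the population as it is. Computing a difference between groups is still a descriptive exercise; it says nothing about whether group membership causes the difference.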
Often the distinction between causative and descriptive problems can be difficult to discern. Usually we cannot answer causative problems from observational studies or sample surveys. To answer causative problems requires very careful study design, and very careful analysis.

When designing a study, it’s important to know what type of problem we are trying to answer so that we can take appropriate steps in our design. More on this in Chapter 8!

Steps of PPDAC - the Plan

The purpose of the Plan step is to decide what units are available for study, what units will be examined, and what variates will be collected and how.

Definition: The study population or study process is the collection of units available to be included in the study.

Warning: The study population is often misunderstood! Pay close attention to “available to be included” - this is crucial to the definition!

Key concept: ‘Available to be included’ means the set of units that could be included in the study; it does not mean the set of units that are included in the study.

Example: If I asked the class to tell me their favourite animal in the Teams chat, then anyone currently attending the class could participate. The study population would be ‘everyone attending class right now’, even though we know not everyone attending class would participate in the study.

[Image caption: This binturong, sadly, would not be in the study population.]

The Plan

Often the study population is a strict subset of the target population, but not always.

In many medical applications, animals (e.g. mice) may act as study units when the target population consists of people.

In manufacturing, researchers may want to draw conclusions about the future production process but they can only look at units produced in a laboratory in a pilot process.

In these cases, the study units are not part of the target population.
The Plan: AI study

From the AI study: “We polled 1,250 undergraduate students (rounded to the nearest five) through UCAS in November 2023.”

Reminder: UCAS is short for the Universities and Colleges Admissions Service, which provides support services to students applying to, and attending, university in the UK.

The Plan: study error

The report does not provide a lot of detail about who could be in the study! A reasonable definition of the study population for the AI study might be:

Study population: Students enrolled in the UCAS system in November 2023, who were given the option to participate in the survey.

PROTIP: Be as precise/descriptive as you can when describing the study population!

The AI study was limited to a study population of students enrolled in the UCAS system in November 2023. Suppose our target population was all current and future undergraduate university students in the world. Do you have any concerns about this?

We have introduced two important definitions:
Target population: who we want our results to apply to.
Study population: who has the potential to be in our study.
These groups are (usually) different, which can introduce errors.

Example: in the 2020 US Presidential election, polls somewhat over-estimated Joe Biden’s winning margin. Many polling companies use online surveys to make these predictions.

Suppose a polling company has a list of Instagram users, and messages a random group of users to ask them how they’ll vote.
Target population: people who will vote in the election.
Study population: people on the polling company’s list of Instagram users.
How might these two populations differ?

This graph shows the approximate age distribution of Instagram users:
[Figure not reproduced in this transcript.]

Instagram users are, on average, younger than the wider population.
In this case our target population (people who’ll vote in the election) might be older on average than our study population (people with Instagram accounts).

In the 2020 election, younger people were more likely to vote for Biden, while older people were more likely to vote for Trump. Source: Pew Research Center

We may be concerned that the proportion of people in the study population who support Biden is higher than the corresponding proportion in the target population. This seems intuitive, but how can we write this in formal, mathematical terms?

Hint: Think about the statistical model(s) we’d use for this example. What parameter(s) would we be trying to estimate?

In notation, we could say: “Suppose $Y_T \sim \text{Binomial}(n, \theta_T)$ represents the number of people in a random sample of size $n$ from the target population who support Biden, and $Y_S \sim \text{Binomial}(n, \theta_S)$ represents the number of people in a random sample of size $n$ from the study population who support Biden.”

We may be concerned that $\theta_T \neq \theta_S$. (Specifically, that $\theta_T < \theta_S$.)

Definition: If the attributes in the study population differ from the attributes in the target population then the difference is called study error.

Very important: a difference between the study and target population is not itself an example of study error. There must be a difference in attributes.

Example: Suppose that instead of predicting who would win the election, we were using our online survey to estimate the most common favourite colour in our population. We would not expect younger people to be more/less likely to have a particular favourite colour than older people, and so this would not be an example of study error!

Study error

We must carefully - and precisely - identify the target and study populations before we can discuss the existence (or otherwise) of study error.
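The polling and favourite-colour examples can be put side by side with a little arithmetic. The sketch below treats each population as a mix of two age groups and computes the overall proportion as a weighted average of the within-group proportions. Every number is invented purely for illustration (none is taken from Pew or any poll), and the sign convention (study population value minus target population value) is an assumption made here, not a definition from the chapter.

```python
def population_proportion(age_mix, proportion_by_age):
    """Overall proportion = sum over age groups of (group share) x (proportion within group)."""
    return sum(age_mix[g] * proportion_by_age[g] for g in age_mix)


# Invented age mixes: the study population (social media users) skews younger.
target_mix = {"younger": 0.3, "older": 0.7}
study_mix = {"younger": 0.7, "older": 0.3}

# Case 1: an attribute that varies with age (invented support rates).
support_by_age = {"younger": 0.60, "older": 0.45}
theta_T = population_proportion(target_mix, support_by_age)
theta_S = population_proportion(study_mix, support_by_age)
study_error_vote = theta_S - theta_T  # assumed convention: study minus target

# Case 2: an attribute that does not vary with age (invented rate).
likes_blue_by_age = {"younger": 0.30, "older": 0.30}
study_error_colour = population_proportion(
    study_mix, likes_blue_by_age
) - population_proportion(target_mix, likes_blue_by_age)

print(f"theta_T = {theta_T:.3f}, theta_S = {theta_S:.3f}")
print(f"study error (vote):   {study_error_vote:.3f}")
print(f"study error (colour): {study_error_colour:.3f}")
```

The two populations have the same (different) age mixes in both cases, but only the first attribute depends on age, so only the first case shows study error. In practice the within-group values are unknown, which is why study error usually cannot be computed like this.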
Important: We may reach different conclusions about study error for different interpretations of these populations! Let’s apply this to the AI study!

For illustration, let us suppose the following:
Target population: All current and future undergraduate university students in the world.
Study population: Students enrolled in the UCAS system in November 2023, who were given the option to participate in the survey.
Key Question: In what ways do these populations differ?

The study population is limited to students attending UK universities. We might expect cultural and demographic differences between students in the UK and those in the rest of the world.

Clearly the target and study populations differ, but do we think they differ in terms of any attributes of interest?

PROTIP: There are many ways in which the target and study populations differ; think about other examples of these differences, and think about whether they could lead to study error.

We must therefore also consider our attributes. Recall that we proposed the following as possible attributes of interest:
The proportion of students in the target population who have used generative AI in their university studies.
The proportion of students in the target population who consider it acceptable to use generative AI in assessments.
The difference between the proportion of students in the target population identifying as female and those identifying as male who have used AI to write computer code.
Key question: Do we think the target and study populations would differ in terms of these attributes?

Suppose that students in the UK were more likely to use generative AI in assessments than students in the world as a whole. (For example, perhaps because of the famously terrible British weather they spend more time indoors and on their computers.)
If this were the case, then this would result in study error because the attribute of interest would differ between the target and study populations.

However, it is difficult to know if the differences between the target and study populations would lead to study error. It is important to identify and discuss potential sources of study error, even if we cannot be sure that study error is present. Sometimes we can look for other studies and research that might highlight potential sources of study error (or offer reassurance that it is not a concern).

Since we do not know the values of the target population attributes or the study population attributes, the study error cannot be quantified. In most cases we rely on expertise from other sources to determine whether conclusions derived from the study population may apply to the target population. www.xkcd.com/2515

Important: two very commonly misunderstood points!
If study error is present it is due to differences between the target population and the study population. It does not result from differences between the target population and the sample, or the study population and the sample!
Study error concerns differences in attributes, not merely differences between groups! It is not sufficient to identify how the target and study populations differ to establish study error; we must also ensure this difference relates to the attributes under study.

The Plan

The target population is who we want our results to apply to, and the study population is who has the potential to be in our study. The next question then, is who is in our study? In other words, who is in our sample?

The AI report does not give us a lot of information on this, stating: “We polled 1,250 undergraduate students (rounded to the nearest five) through UCAS in November 2023.”

Are the study population and sample the same for this study?
When designing a study, we must think carefully about how our units are sampled from our study population. This brings us onto another important definition:

Definition: The sampling protocol is the procedure used to select a sample of units from the study population. The number of units sampled is called the sample size.

Sampling protocol

There are many (many) different ways to select a sample of units. The simplest (and usually best) example would be a random sample. For example, in a manufacturing process the units (the manufactured products) may be selected at random from a production line.

Random-seeming samples are often not that random, however!

Example: In political polling, the study population might be a list of email addresses or telephone numbers. A random sample of these emails/numbers could be contacted, but then the sample would be those people who responded (which is a lot less random). In this example, the sampling protocol would also include some exclusion criteria, such as the respondent being of voting age.

Sampling protocol: AI study

Important: Because the sampling protocol concerns how units were selected from the study population, it must be defined within the context of the study population.

As a reminder, our proposed study population for the AI study:
Students enrolled in the UCAS system in November 2023, who were given the option to participate in the survey.

A possible sampling protocol for the AI study is:
Students enrolled in the UCAS system in November 2023 were invited to participate in the survey.
Students entered the sample if they responded to the survey.
We do not know the exact sample size, but can say it is 1,250 to the nearest five.

The Plan: sample error

We now have three groups to consider:
Target population: All current and future undergraduate university students in the world.
Study population: Students enrolled in the UCAS system in November 2023, who were given the option to participate in the survey.
Sample: The 1,250 students who completed a survey.

Earlier, we discussed how differences between the target and study populations could lead to study error. We have a similar concern for differences between the study population and the sample.

Definition: If the attributes in the sample differ from the attributes in the study population then the difference is called sample error.

Important: as with study error, sample error concerns a difference in attributes between the study population and sample. Sample error is not guaranteed just because these two groups are different!

The AI study sampled 1,250 participants who completed a survey.

Key Question: The sample consists of people who (presumably) completed the survey voluntarily. Does this raise any concerns?

Reminder: sample error is defined in terms of attributes. Some of our possible attributes of interest, defined in terms of the study population:
The proportion of students in the study population who have used generative AI in their university studies.
The proportion of students in the study population who consider it acceptable to use generative AI in assessments.

The sample consisted of people who have volunteered to complete a survey asking them about their use of AI tools in their studies. We might be concerned that students who use AI tools are more likely to be interested in participating in such a study than those who have not used AI tools.

In this case, our sample might over-estimate the proportion of students in the study population who have used AI tools in their studies.

There are other possible sources of sample error in this study; as with study error any concerns about sample error should be discussed.
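The self-selection concern can be illustrated with a small simulation. This is a sketch under invented assumptions, not a model of the HEPI survey: the study population size, the true proportion of AI users, and the response rates are all made up, with AI users assumed to be three times as likely to respond. A simple random sample of the same size is drawn for comparison.

```python
import random

random.seed(1)  # fixed seed so the sketch is repeatable

# Invented study population: 100,000 students, exactly half have used AI tools.
study_population = [True] * 50_000 + [False] * 50_000
study_proportion = sum(study_population) / len(study_population)

# Invented response rates: AI users are three times as likely to respond.
RESPONSE_RATE = {True: 0.30, False: 0.10}

# Sampling protocol 1: everyone is invited, and the sample is whoever responds.
volunteers = [
    used_ai
    for used_ai in study_population
    if random.random() < RESPONSE_RATE[used_ai]
]
volunteer_proportion = sum(volunteers) / len(volunteers)

# Sampling protocol 2: a simple random sample of the same size, all responding.
srs = random.sample(study_population, len(volunteers))
srs_proportion = sum(srs) / len(srs)

# Sample error, taken here as sample value minus study population value.
volunteer_error = volunteer_proportion - study_proportion
srs_error = srs_proportion - study_proportion

print(f"study population proportion: {study_proportion:.3f}")
print(f"volunteer sample: {volunteer_proportion:.3f} (error {volunteer_error:+.3f})")
print(f"random sample:    {srs_proportion:.3f} (error {srs_error:+.3f})")
```

Under these assumed response rates the volunteer sample is expected to contain about 75% AI users (0.15 / (0.15 + 0.05)), even though only 50% of the study population has used AI, while the random sample stays close to 50%. In a real study the study population proportion is unknown, so the sample error cannot be computed directly like this.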
The sample is only a subset of the units in the study population. Different sampling protocols may lead to different sample errors. Some protocols tend to have smaller errors than others.

[Figure labels: Study population, Sample.]

The values of the study population attributes are unknown, so the sample error is unknown. Statistical models are used to quantify the size of this error.

Important: two very commonly misunderstood points!
If sample error is present it is due to differences between the study population and the sample. It does not result from differences between the target population and the sample!
Sample error concerns differences in attributes, not merely differences between groups! It is not sufficient to identify how the study population and sample differ to establish sample error; we must also ensure this difference relates to the attributes under study.

The Plan: what to measure

We measure the variates corresponding to any attributes of interest defined in the Problem step for the units in the sample. We may also measure other variates that can aid the analysis. When taking measurements we should think carefully about their accuracy.

One key measure in the AI study is whether a student had used AI tools for their university studies. They were also asked under what circumstances the use of generative AI was acceptable.

Can you think of possible problems with these measurements?

The Plan: measurement error

Definition: If the measured value and the true value of a variate are not identical the difference is called measurement error.

In the AI study, participants may not tell the truth about their use of AI tools in university work, or give different opinions to what they truly believe about whether the use of such tools is acceptable.

Measurement error is extremely common and can severely undermine our results.
For example, the relationship between human papillomavirus (HPV) and a form of cervical cancer was underestimated due to measurement error in the identification of HPV infection.

For more on measurement error, see my article in Significance Magazine:
https://rss.onlinelibrary.wiley.com/doi/10.1111/j.1740-9713.2020.01353.x

The impact of measurement error can be very hard to anticipate. We should design our studies with measurement error in mind, such as by choosing measurement methods that are more accurate. Any concerns about measurement error should be reported.

The Plan

For an empirical study the Plan should indicate the target and study populations, the sampling protocol, the variates which are to be measured, and the quality of the measurement systems that are intended for use. Attention must be paid to the various types of error that may occur and how they might impact the conclusions.

You will be asked to use PPDAC to critically examine the Conclusions from a study done by someone else. You should examine each step in the Plan for strengths and weaknesses - let’s try this with a new example!

Steps in the plan

We’ve now encountered three terms for groups of units:
The target population (who we want to study).
The study population (who we could study).
The sample (who we actually study).

And three types of error:
Study error.
Sample error.
Measurement error.

It can be helpful to lay these out diagrammatically!

[Figure not reproduced in this transcript.]

Data step

This concludes our discussion of the Plan step of PPDAC. A reminder of the other steps:
Problem: A clear statement of the study’s objectives.
Plan: The procedures used to carry out the study including how the data will be collected.
Data: The physical collection of the data, as described in the Plan.
Analysis: The analysis of the data collected, accounting for considerations in the Problem and the Plan.
Conclusion: The conclusions that are drawn about the Problem, and any limitations of the study.

The purpose of the Data step: collect data according to the Plan. In order to do this the variates must be clearly defined and satisfactory methods of measuring them must be used.

Mistakes can occur in recording or entering data into a database. For complex investigations, it is useful to put checks in place to avoid these mistakes, and detect those that are made.

[Figure: an example from a study of eye disease, where the variate ‘eye’ indicated which eye was affected by the disease.]

In many studies the units must be tracked and measured over a long period of time. For example, in the preceding eye disease study, patients were monitored for several months, and sometimes patients would change hospitals, or even move cities.

When data are recorded over time or in different locations, the time and place for each measurement should be recorded. Why might this be important?

Departures from the Plan may arise over time (e.g. persons may drop out of a long term medical study because of adverse reactions to a treatment). Such departures should be recorded. These departures will also impact the Analysis and Conclusion.

Analysis step

The Analysis step consists of the analyses of the data collected. The Analysis step should include numerical and graphical summaries of the data. A key part of the analysis is the selection of an appropriate model that describes the data and how the data were collected. Checking the fit of the model must be included.

To answer the questions posed in the Problem step we usually formulate these questions in terms of the model parameters.
For example: “if $Y \sim \text{Binomial}(n, \theta)$ and $\theta$ = the probability a randomly chosen student in the study population has used AI tools in their university studies, what is $\theta$?”

In the remainder of this course we will see many examples of formal analyses (estimation, tests of hypotheses). Departures from the Plan that affect the Analysis must be noted.

The AI study’s analyses were very simplistic. The report only presents numerical summaries, such as the proportion of students who gave a particular answer to a question.

The report claims there is evidence of a ‘digital divide’ in terms of AI use differing between different sociodemographic groups, such as based on the POLAR quintiles, but no formal analysis of this is conducted.

Conclusion step

In the Conclusion step, the questions posed in the Problem are answered to the extent permitted by the data: the Conclusion step is directed by the Problem. Potential study, sample or measurement errors, as described in the Plan step, should be discussed and quantified if possible. Departures from the Plan that affect the Analysis must be addressed. The limitations of the study must also be described.

Example

The AI study does not highlight any major limitations! What limitations can you identify in the study?

Some possible limitations:
Unclear definition of target and study population, unclear sampling protocol.
Self-selection of participants (and possible sample error).
Untruthful responses to survey questions (and thus measurement error).

We should also be alert to who has conducted and/or funded a study!

PPDAC: Final thoughts

It’s easy to think that ‘statistics’ is just the analysis step of a study. In practice, it’s common for researchers to only ask a statistician for help at this stage. What PPDAC (hopefully) demonstrates is that a statistician (or at least, statistical thinking) needs to be involved from the very beginning!
There are a lot of complicated decisions to be made right from the start that have consequences all the way through to the conclusions.

Chapter 3: Key Concepts

Chapter 3 can feel short but it covers some extremely important principles of study design and analysis!

Key concepts:
PPDAC: what does each step involve, and why is it important?
Terminology: target population, study population, sample, sampling protocol.
Errors: study error, sample error, measurement error.

PROTIP: Make sure you are very familiar with all the terminology introduced in this chapter. Practice with the end of chapter problems in the Course Notes, as well as real-life studies you see in the news/social media!