Chapter 1 R and the Tidyverse (Data Science) PDF
Summary
This document is an introductory chapter on data science and the R programming language. It walks through a complete data analysis and, along the way, introduces the different types of data analysis question, fundamental programming concepts in R, and the basics of loading, cleaning, and visualizing data.
Full Transcript
9/8/24, 1:31 PM Chapter 1 R and the Tidyverse | Data Science
https://datasciencebook.ca/intro.html

Chapter 1 R and the Tidyverse

1.1 Overview

This chapter provides an introduction to data science and the R programming language. The goal here is to get your hands dirty right from the start! We will walk through an entire data analysis, and along the way introduce different types of data analysis question, some fundamental programming concepts in R, and the basics of loading, cleaning, and visualizing data. In the following chapters, we will dig into each of these steps in much more detail; but for now, let’s jump in to see how much we can do with data science!

1.2 Chapter learning objectives

By the end of the chapter, readers will be able to do the following:

- Identify the different types of data analysis question and categorize a question into the correct type.
- Load the tidyverse package into R.
- Read tabular data with read_csv.
- Create new variables and objects in R using the assignment symbol.
- Create and organize subsets of tabular data using filter, select, arrange, and slice.
- Add and modify columns in tabular data using mutate.
- Visualize data with a ggplot bar plot.
- Use ? to access help and documentation tools in R.

1.3 Canadian languages data set

In this chapter, we will walk through a full analysis of a data set relating to languages spoken at home by Canadian residents (Figure 1.1). Many Indigenous peoples exist in Canada with their own cultures and languages; these languages are often unique to Canada and not spoken anywhere else in the world (Statistics Canada 2018). Sadly, colonization has led to the loss of many of these languages. For instance, generations of children were not allowed to speak their mother tongue (the first language an individual learns in childhood) in Canadian residential schools. Colonizers also renamed places they had “discovered” (K. Wilson 2018).
Acts such as these have significantly harmed the continuity of Indigenous languages in Canada, and some languages are considered “endangered” as few people report speaking them. To learn more, please see Canadian Geographic’s article, “Mapping Indigenous Languages in Canada” (Walker 2017); They Came for the Children: Canada, Aboriginal peoples, and Residential Schools (Truth and Reconciliation Commission of Canada 2012); and the Truth and Reconciliation Commission of Canada’s Calls to Action (Truth and Reconciliation Commission of Canada 2015).

Figure 1.1: Map of Canada.

The data set we will study in this chapter is taken from the canlang R data package (Timbers 2020), which has population language data collected during the 2016 Canadian census (Statistics Canada 2016a). In this data, there are 214 languages recorded, each having six different properties:

1. category: Higher-level language category, describing whether the language is an Official Canadian language, an Aboriginal (i.e., Indigenous) language, or a Non-Official and Non-Aboriginal language.
2. language: The name of the language.
3. mother_tongue: Number of Canadian residents who reported the language as their mother tongue. Mother tongue is generally defined as the language someone was exposed to since birth.
4. most_at_home: Number of Canadian residents who reported the language as being spoken most often at home.
5. most_at_work: Number of Canadian residents who reported the language as being used most often at work.
6. lang_known: Number of Canadian residents who reported knowledge of the language.

According to the census, more than 60 Aboriginal languages were reported as being spoken in Canada.
Suppose we want to know which are the most common; then we might ask the following question, which we wish to answer using our data:

Which ten Aboriginal languages were most often reported in 2016 as mother tongues in Canada, and how many people speak each of them?

Note: Data science cannot be done without a deep understanding of the data and problem domain. In this book, we have simplified the data sets used in our examples to concentrate on methods and fundamental concepts. But in real life, you cannot and should not do data science without a domain expert. Alternatively, it is common to practice data science in your own domain of expertise! Remember that when you work with data, it is essential to think about how the data were collected, which affects the conclusions you can draw. If your data are biased, then your results will be biased!

1.4 Asking a question

Every good data analysis begins with a question, like the one above, that you aim to answer using data. As it turns out, there are actually a number of different types of question regarding data: descriptive, exploratory, predictive, inferential, causal, and mechanistic, all of which are defined in Table 1.1. Carefully formulating a question as early as possible in your analysis, and correctly identifying which type of question it is, will guide your overall approach to the analysis as well as the selection of appropriate tools.

Table 1.1: Types of data analysis question (Leek and Peng 2015; Peng and Matsui 2015).

Question type: Descriptive
Description: A question that asks about summarized characteristics of a data set without interpretation (i.e., report a fact).
Example: How many people live in each province and territory in Canada?

Question type: Exploratory
Description: A question that asks if there are patterns, trends, or relationships within a single data set. Often used to propose hypotheses for future study.
Example: Does political party voting change with indicators of wealth in a set of data collected on 2,000 people living in Canada?

Question type: Predictive
Description: A question that asks about predicting measurements or labels for individuals (people or things). The focus is on what things predict some outcome, but not what causes the outcome.
Example: What political party will someone vote for in the next Canadian election?

Question type: Inferential
Description: A question that looks for patterns, trends, or relationships in a single data set and also asks for quantification of how applicable these findings are to the wider population.
Example: Does political party voting change with indicators of wealth for all people living in Canada?

Question type: Causal
Description: A question that asks about whether changing one factor will lead to a change in another factor, on average, in the wider population.
Example: Does wealth lead to voting for a certain political party in Canadian elections?

Question type: Mechanistic
Description: A question that asks about the underlying mechanism of the observed patterns, trends, or relationships (i.e., how does it happen?).
Example: How does wealth lead to voting for a certain political party in Canadian elections?

In this book, you will learn techniques to answer the first four types of question: descriptive, exploratory, predictive, and inferential; causal and mechanistic questions are beyond the scope of this book. In particular, you will learn how to apply the following analysis tools:

1. Summarization: computing and reporting aggregated values pertaining to a data set. Summarization is most often used to answer descriptive questions, and can occasionally help with answering exploratory questions.
For example, you might use summarization to answer the following question: What is the average race time for runners in this data set? Tools for summarization are covered in detail in Chapters 2 and 3, but appear regularly throughout the text.
2. Visualization: plotting data graphically. Visualization is typically used to answer descriptive and exploratory questions, but plays a critical supporting role in answering all of the types of question in Table 1.1. For example, you might use visualization to answer the following question: Is there any relationship between race time and age for runners in this data set? This is covered in detail in Chapter 4, but again appears regularly throughout the book.
3. Classification: predicting a class or category for a new observation. Classification is used to answer predictive questions. For example, you might use classification to answer the following question: Given measurements of a tumor’s average cell area and perimeter, is the tumor benign or malignant? Classification is covered in Chapters 5 and 6.
4. Regression: predicting a quantitative value for a new observation. Regression is also used to answer predictive questions. For example, you might use regression to answer the following question: What will be the race time for a 20-year-old runner who weighs 50 kg? Regression is covered in Chapters 7 and 8.
5. Clustering: finding previously unknown/unlabeled subgroups in a data set. Clustering is often used to answer exploratory questions. For example, you might use clustering to answer the following question: What products are commonly bought together on Amazon? Clustering is covered in Chapter 9.
6. Estimation: taking measurements for a small number of items from a large group and making a good guess for the average or proportion for the large group. Estimation is used to answer inferential questions.
For example, you might use estimation to answer the following question: Given a survey of cellphone ownership of 100 Canadians, what proportion of the entire Canadian population own Android phones? Estimation is covered in Chapter 10.

Referring to Table 1.1, our question about Aboriginal languages is an example of a descriptive question: we are summarizing the characteristics of a data set without further interpretation. And referring to the list above, it looks like we should use visualization and perhaps some summarization to answer the question. So in the remainder of this chapter, we will work towards making a visualization that shows us the ten most common Aboriginal languages in Canada and their associated counts, according to the 2016 census.

1.5 Loading a tabular data set

A data set is, at its core, a structured collection of numbers and characters. Aside from that, there are really no strict rules; data sets can come in many different forms! Perhaps the most common form of data set that you will find in the wild, however, is tabular data. Think spreadsheets in Microsoft Excel: tabular data are rectangular-shaped and spreadsheet-like, as shown in Figure 1.2. In this book, we will focus primarily on tabular data.

Since we are using R for data analysis in this book, the first step for us is to load the data into R. When we load tabular data into R, it is represented as a data frame object. Figure 1.2 shows that an R data frame is very similar to a spreadsheet. We refer to the rows as observations; these are the individual objects for which we collect data. In Figure 1.2, the observations are languages. We refer to the columns as variables; these are the characteristics of each observation. In Figure 1.2, the variables are the language’s category, its name, the number of mother tongue speakers, etc.
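To make the rows-as-observations and columns-as-variables idea concrete, here is a minimal sketch that builds a tiny data frame by hand with base R. The object name `languages` is hypothetical, and the three rows reuse values that appear in can_lang.csv; the `<-` assignment symbol it relies on is introduced in Section 1.6. Note that read_csv returns a tibble, the tidyverse flavor of a data frame, so printed output in the rest of the chapter looks slightly different.

```r
# A tiny data frame built by hand. The name `languages` is hypothetical;
# the values are copied from rows of can_lang.csv.
languages <- data.frame(
  category = c(
    "Non-Official & Non-Aboriginal languages",
    "Aboriginal languages",
    "Non-Official & Non-Aboriginal languages"
  ),
  language = c("Afrikaans", "Algonquin", "Amharic"),
  mother_tongue = c(10260, 1260, 22465)
)

nrow(languages)     # number of observations (rows)
ncol(languages)     # number of variables (columns)
names(languages)    # the variable names
languages[2, ]      # one observation: every variable for a single language
languages$language  # one variable: its value for every observation
```

Each call to `c()` supplies one column, so every variable holds exactly one value per observation; that rectangular shape is what makes the data tabular.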
Figure 1.2: A spreadsheet versus a data frame in R.

The first kind of data file that we will learn how to load into R as a data frame is the comma-separated values format (.csv for short). These files have names ending in .csv, and can be opened and saved using common spreadsheet programs like Microsoft Excel and Google Sheets. For example, the .csv file named can_lang.csv is included with the code for this book. If we were to open this data in a plain text editor (a program like Notepad that just shows text with no formatting), we would see each row on its own line, and each entry in the table separated by a comma:

    category,language,mother_tongue,most_at_home,most_at_work,lang_known
    Aboriginal languages,"Aboriginal languages, n.o.s.",590,235,30,665
    Non-Official & Non-Aboriginal languages,Afrikaans,10260,4785,85,23415
    Non-Official & Non-Aboriginal languages,"Afro-Asiatic languages, n.i.e.",1150,445,10,2775
    Non-Official & Non-Aboriginal languages,Akan (Twi),13460,5985,25,22150
    Non-Official & Non-Aboriginal languages,Albanian,26895,13135,345,31930
    Aboriginal languages,"Algonquian languages, n.i.e.",45,10,0,120
    Aboriginal languages,Algonquin,1260,370,40,2480
    Non-Official & Non-Aboriginal languages,American Sign Language,2685,3020,1145,21930
    Non-Official & Non-Aboriginal languages,Amharic,22465,12785,200,33670

To load this data into R so that we can do things with it (e.g., perform analyses or create data visualizations), we will need to use a function. A function is a special word in R that takes instructions (we call these arguments) and does something. The function we will use to load a .csv file into R is called read_csv.
In its most basic use-case, read_csv expects that the data file:

- has column names (or headers),
- uses a comma (,) to separate the columns, and
- does not have row names.

Below you’ll see the code used to load the data into R using the read_csv function. Note that the read_csv function is not included in the base installation of R, meaning that it is not one of the primary functions ready to use when you install R. Therefore, you need to load it from somewhere else before you can use it. The place from which we will load it is called an R package. An R package is a collection of functions that can be used in addition to the built-in R package functions once loaded. The read_csv function, in particular, can be made accessible by loading the tidyverse R package (Wickham 2021b; Wickham et al. 2019) using the library function. The tidyverse package contains many functions that we will use throughout this book to load, clean, wrangle, and visualize data.

    library(tidyverse)

    ## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
    ## ✔ dplyr     1.1.2     ✔ readr     2.1.4
    ## ✔ forcats   1.0.0     ✔ stringr   1.5.0
    ## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
    ## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
    ## ✔ purrr     1.0.1
    ## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
    ## ✖ dplyr::filter() masks stats::filter()
    ## ✖ dplyr::lag()    masks stats::lag()
    ## ℹ Use the conflicted package () to force all conflicts to

Note: You may have noticed that we got some extra output from R regarding attached packages and conflicts below our code line. These are examples of messages in R, which give the user more information that might be handy to know. The “Attaching core tidyverse packages” message is expected when loading tidyverse, since tidyverse automatically loads other packages too, such as dplyr.
In the future, when we load tidyverse in this book, we will silence these messages to help with the readability of the book. The Conflicts message is also totally normal in this circumstance. This message tells you when functions from different packages share the same name, which makes it ambiguous which one R should use. For example, in this case, the dplyr package and the stats package both provide a function called filter. The message above (dplyr::filter() masks stats::filter()) is R telling you that it is going to default to the dplyr package version of this function. So if you use the filter function, you will be using the dplyr version. In order to use the stats version, you need to use its full name, stats::filter. Messages are not errors, so generally you don’t need to take action when you see a message; but you should always read the message and critically think about what it means and whether you need to do anything about it.

After loading the tidyverse package, we can call the read_csv function and pass it a single argument: the name of the file, "can_lang.csv" (in the code below it is prefixed with data/, the folder the file is stored in). We have to put quotes around file names and other letters and words that we use in our code to distinguish them from the special words (like functions!) that make up the R programming language. The file’s name is the only argument we need to provide because our file satisfies everything else that the read_csv function expects in the default use-case. Figure 1.3 describes how we use read_csv to read data into R.

Figure 1.3: Syntax for the read_csv function.
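The call described above can also be tried without the book's data file. The sketch below is an illustration under stated assumptions, not the book's own code: it writes a three-row excerpt of can_lang.csv to a temporary file and reads it back with read_csv. The names `csv_lines`, `tmp_path`, and `can_lang_excerpt` are hypothetical, the `show_col_types = FALSE` argument is added here only to silence readr's column-type message, and the `<-` assignment symbol is introduced in Section 1.6.

```r
# Load the tidyverse without printing the attach/conflict messages.
suppressPackageStartupMessages(library(tidyverse))

# Three rows copied from can_lang.csv, written to a temporary file so the
# example does not depend on data/can_lang.csv being present.
csv_lines <- c(
  "category,language,mother_tongue,most_at_home,most_at_work,lang_known",
  "Aboriginal languages,\"Aboriginal languages, n.o.s.\",590,235,30,665",
  "Non-Official & Non-Aboriginal languages,Afrikaans,10260,4785,85,23415",
  "Aboriginal languages,Algonquin,1260,370,40,2480"
)
tmp_path <- tempfile(fileext = ".csv")
writeLines(csv_lines, tmp_path)

# The quoted file path is the only argument read_csv needs here, because
# the file has a header row, comma separators, and no row names.
# show_col_types = FALSE only suppresses an informational message.
can_lang_excerpt <- read_csv(tmp_path, show_col_types = FALSE)

can_lang_excerpt
```

The first data row contains a comma inside a quoted entry; read_csv treats the quoted text as a single value, so the excerpt still has six columns.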
    read_csv("data/can_lang.csv")

    ## # A tibble: 214 × 6
    ##    category          language mother_tongue most_at_home most_at_work lang_known
    ##    <chr>             <chr>            <dbl>        <dbl>        <dbl>      <dbl>
    ##  1 Aboriginal langu… Aborigi…           590          235           30        665
    ##  2 Non-Official & N… Afrikaa…         10260         4785           85      23415
    ##  3 Non-Official & N… Afro-As…          1150          445           10       2775
    ##  4 Non-Official & N… Akan (T…         13460         5985           25      22150
    ##  5 Non-Official & N… Albanian         26895        13135          345      31930
    ##  6 Aboriginal langu… Algonqu…            45           10            0        120
    ##  7 Aboriginal langu… Algonqu…          1260          370           40       2480
    ##  8 Non-Official & N… America…          2685         3020         1145      21930
    ##  9 Non-Official & N… Amharic          22465        12785          200      33670
    ## 10 Non-Official & N… Arabic          419890       223535         5585     629055
    ## # ℹ 204 more rows

Note: There is another function that also loads csv files named read.csv. We will always use read_csv in this book, as it is designed to play nicely with all of the other tidyverse functions, which we will use extensively. Be careful not to accidentally use read.csv, as it can cause some tricky errors to occur in your code that are hard to track down!

1.6 Naming things in R

When we loaded the 2016 Canadian census language data using read_csv, we did not give this data frame a name. Therefore the data was just printed on the screen, and we cannot do anything else with it. That isn’t very useful. What would be more useful would be to give a name to the data frame that read_csv outputs, so that we can refer to it later for analysis and visualization. The way to assign a name to a value in R is via the assignment symbol