Chapter 2 Reading in Data Locally and From the Web (Data Science) PDF

Summary

This document covers reading tabular data into R from various file formats, such as CSV and TSV, as well as from the web and from databases. The chapter also discusses absolute and relative file paths. It is suitable for an undergraduate data science course.

Full Transcript


9/8/24, 3:59 PM Chapter 2 Reading in data locally and from the web | Data Science
https://datasciencebook.ca/reading.html

Chapter 2 Reading in data locally and from the web

2.1 Overview

In this chapter, you’ll learn to read tabular data of various formats into R from your local device (e.g., your laptop) and the web. “Reading” (or “loading”) is the process of converting data (stored as plain text, a database, HTML, etc.) into an object (e.g., a data frame) that R can easily access and manipulate. Thus reading data is the gateway to any data analysis; you won’t be able to analyze data unless you’ve loaded it first. And because there are many ways to store data, there are similarly many ways to read data into R. The more time you spend upfront matching the data reading method to the type of data you have, the less time you will have to devote to reformatting, cleaning and wrangling your data (the second step to all data analyses). It’s like making sure your shoelaces are tied well before going for a run so that you don’t trip later on!

2.2 Chapter learning objectives

By the end of the chapter, readers will be able to do the following:

- Define the types of path and use them to locate files:
  - absolute file path
  - relative file path
  - Uniform Resource Locator (URL)
- Read data into R from various types of path using:
  - read_csv
  - read_tsv
  - read_csv2
  - read_delim
  - read_excel
- Compare and contrast the read_* functions.
- Describe when to use the following read_* function arguments:
  - skip
  - delim
  - col_names
- Choose the appropriate tidyverse read_* function and function arguments to load a given plain text tabular data set into R.
- Use the rename function to rename columns in a data frame.
- Use the read_excel function and arguments to load a sheet from an Excel file into R.
- Work with databases using functions from dbplyr and DBI:
  - Connect to a database with dbConnect.
  - List tables in the database with dbListTables.
  - Create a reference to a database table with tbl.
  - Bring data from a database into R using collect.
- Use write_csv to save a data frame to a .csv file.
- (Optional) Obtain data from the web using scraping and application programming interfaces (APIs):
  - Read HTML source code from a URL using the rvest package.
  - Read data from the NASA “Astronomy Picture of the Day” API using the httr2 package.
  - Compare downloading tabular data from a plain text file (e.g., .csv), accessing data from an API, and scraping the HTML source code from a website.

2.3 Absolute and relative file paths

This chapter will discuss the different functions we can use to import data into R, but before we can talk about how we read the data into R with these functions, we first need to talk about where the data lives. When you load a data set into R, you first need to tell R where those files live. The file could live on your computer (local) or somewhere on the internet (remote).

The place where the file lives on your computer is referred to as its “path”. You can think of the path as directions to the file. There are two kinds of paths: relative paths and absolute paths. A relative path indicates where the file is with respect to your working directory (i.e., “where you are currently”) on the computer. On the other hand, an absolute path indicates where the file is with respect to the computer’s filesystem base (or root) folder, regardless of where you are working.

Suppose our computer’s filesystem looks like the picture in Figure 2.1. We are working in a file titled project3.ipynb, and our current working directory is project3; typically, as is the case here, the working directory is the directory containing the file you are currently working on.
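The difference between the two kinds of path can be sketched in a few lines of base R. This is only an illustration under stated assumptions: the folder and file names (a project3 folder inside R's temporary directory, and example.csv) are hypothetical stand-ins created by the sketch itself, not files from the chapter.

```r
# Hypothetical sandbox: a temporary folder stands in for a project directory.
project_dir <- file.path(tempdir(), "project3")
dir.create(file.path(project_dir, "data"), recursive = TRUE, showWarnings = FALSE)
writeLines(c("country,score", "A,1"),
           file.path(project_dir, "data", "example.csv"))

# Make the sandbox the working directory, remembering the previous one.
old_wd <- setwd(project_dir)

# A relative path is interpreted with respect to the working directory.
relative_path <- file.path("data", "example.csv")

# normalizePath() expands it to an absolute path, which starts at the
# filesystem root and does not depend on the working directory.
absolute_path <- normalizePath(relative_path)

print(getwd())
print(relative_path)
print(absolute_path)

# Restore the original working directory.
setwd(old_wd)

# The relative path no longer points at the file, but the absolute one does.
print(file.exists(absolute_path))
```

Both paths refer to the same file while the working directory is the sandbox folder; only the absolute path keeps working after the working directory changes.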
Figure 2.1: Example file system.

Let’s say we wanted to open the happiness_report.csv file. We have two options to indicate where the file is: using a relative path, or using an absolute path. The absolute path of the file always starts with a slash / (representing the root folder on the computer) and proceeds by listing out the sequence of folders you would have to enter to reach the file, each separated by another slash /. So in this case, happiness_report.csv would be reached by starting at the root, and entering the home folder, then the dsci-100 folder, then the project3 folder, and then finally the data folder. So its absolute path would be /home/dsci-100/project3/data/happiness_report.csv. We can load the file using its absolute path as a string passed to the read_csv function.

happy_data
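A minimal sketch of passing an absolute path string to read_csv follows, assuming the readr package (part of the tidyverse) is installed. The file example_report.csv, its columns, and the variable names are hypothetical: the sketch writes its own small file to R's temporary directory so that it does not depend on the /home/dsci-100 folders from Figure 2.1 existing on your machine.

```r
library(readr)  # provides read_csv and write_csv

# Hypothetical stand-in data; this is not the happiness report data set.
example_path <- file.path(tempdir(), "example_report.csv")
write_csv(data.frame(country = c("A", "B"), score = c(1.5, 2.5)), example_path)

# normalizePath() returns the file's absolute path as a string.
absolute_path <- normalizePath(example_path)

# read_csv takes the path string as its first argument and returns a data frame.
# show_col_types = FALSE only silences the column-type message.
example_data <- read_csv(absolute_path, show_col_types = FALSE)
print(example_data)
```

The same call works with any absolute path string, provided a file actually exists at that location on the computer running the code.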
