Summary

This document is a guide to using R software for data analysis and data management, including creating/uploading R databases and using RStudio. It covers basic data structures like vectors and matrices, as well as functions for data manipulation and visualization in R.

Full Transcript

Data analysis
Michele Pezzoni

Create/upload an R database
Any R manual

Workspace in RStudio

Workspace
The workspace includes all variables, vectors, matrices, lists, and data frames. In the R language these are called objects.
To show the objects in the workspace:
    ls()
To remove all the objects:
    rm(list=ls(all=TRUE))
To check the current working folder, then select the folder where you want to save the workspace:
    getwd()
    setwd("C:/name/name/name")
To save the workspace in a “.Rdata” file:
    save.image("name.Rdata")
To load the saved workspace:
    load("name.Rdata")

Vectors and Variables
To assign a value to a variable:
    x=1
To show the object type:
    class(x)
To create a vector:
    y=c(1,5,4,8)
    class(y)
To create a matrix from two vectors:
    z=c(3,6,99,8)
    m_h=rbind(y,z)    # y and z become the rows (2 x 4 matrix)
    m_v=cbind(y,z)    # y and z become the columns (4 x 2 matrix)
    class(m_h)

The functions in R
Functions are used to:
– Import and export data
– Generate objects
– Perform operations on objects: average calculation, quantile, ...
– Generate graphs, ...
Each function is defined by its name and parameters.
General form: function_name(par1=value1,par2=value2,...)
Some parameters are mandatory, others optional.
How to know more about a specific function?
    ?mean

Let’s consider the function mean(), which calculates the arithmetic mean.
Let’s create a vector including the number of passengers handled at the Nice airport in the last 5 days. Yesterday’s number of passengers is missing. In R, missing data are reported using the letters NA (Not Available).
    y=c(29020,32500,40320,20328,NA)
    mean(y)                  # returns NA because of the missing value
    mean(y, na.rm = FALSE)   # same as the default: returns NA
    mean(y, na.rm = TRUE)    # ignores the NA: returns 30542

Dataframe (the dataset)
The dataframe is an R object that stores a table with
– individuals (observations) in the rows and
– individuals’ characteristics (variables) in the columns
3 ways to create a dataframe in R:
– 1) Import a text/Excel file
– 2) Dataframe generation
– 3) Use a dataframe already available in R or in a database package (strongly discouraged for your empirical paper)

1) Import a dataframe from a text/Excel file
You need to retrieve a text file / Excel sheet with individuals reported in the rows and individuals’ characteristics reported in the columns.
We download the Nobel Prize data available at the following link: http://www.nber.org/nobel/
We obtain the Excel sheet: Jones_Weinberg_2011_PNAS.xlsx

The dataframe from a text/Excel file: using RStudio
Select the folder where you downloaded the Excel file. The file should appear in the list of files; click on “import dataset”.
Three steps:
1) Select the appropriate settings
2) Check in the preview that the data are correctly displayed
3) Import

An alternative (and more complicated) way to import a dataframe: using the command line
You have to save the Excel file as text (.txt) or .csv.
The function read.table allows you to import the text file into R:
    nobel=read.table("Jones_Weinberg_2011_PNAS.csv", header=TRUE, sep=",")
    nobel=as.data.frame(nobel)
    class(nobel)
For details on the function see: ?read.table

Dataframe: Basic functions
Useful functions:
– dim() : shows the number of observations and the number of variables
– rownames() : shows the names of the rows (observations)
– colnames() : shows the names of the columns (variables)
– head() : shows the first 6 lines
– tail() : shows the last 6 lines
Examples:
    dim(nobel)
    rownames(nobel)
    colnames(nobel)
    head(nobel)
    tail(nobel)

Dataframe: Advanced functions for database management
Select part of a dataframe:
    subset(dataframe, subset=logical_expression, select=list_of_variables)
Join 2 dataframes:
    merge(df1,df2, by= list_of_variables, by.x=,by.y=,...)
Sort the dataframe according to a variable:
    df[order(df$nom_var),]

Exercise: Age and creativity of researchers
Exercise 1: Create a dataframe including the Nobel Prize winners’ data. Then, calculate the age at which a scientist is highly creative.
Variables of interest:
– year_research_mid = year in which the “prize-winning work” research was done
– year_birth = year of birth
The following commands select variables in the dataframe:
    nobel$year_research_mid
    nobel$year_birth
The following commands create the variable age_discovery and calculate its mean:
    nobel$age_discovery=nobel$year_research_mid-nobel$year_birth
    mean(nobel$age_discovery)

Exercise 2: Create two new dataframes of researchers who did their “prize-winning work” before 1905 (included) and after 1985 (included). Then, calculate the age at which a scientist is highly creative.
The following commands generate two new datasets, one including the researchers who did the “prize-winning work” before 1905 and the other after 1985:
    early_period=subset(nobel, subset= year_research_mid<=1905)
    late_period=subset(nobel, subset= year_research_mid>=1985)
We calculate the average age at which researchers did the “prize-winning work” in the two periods:
Before 1905:
    mean(early_period$age_discovery)
After 1985:
    mean(late_period$age_discovery)

Example: Age and creativity of researchers
What do we find?
Period          Age at “prize-winning work”
Whole period    39.03
Before 1905     36.92
After 1985      47.78

Fact: Physics experienced a revolution at the beginning of the 1900s -> from classical electromagnetism to quantum mechanics
Explanations in the Jones and Weinberg (2011) paper for physics:
(1) “One line of age-creativity research has emphasized that abstract contributions [theory] tend to come at earlier ages than inductive contributions [empirical], which draw more heavily on accumulated knowledge”;
(2) “A second line of age–creativity research has emphasized that

Calculate the lifespan
Exercise 3: Calculate the lifespan of each researcher in the dataframe
    nobel$duree=nobel$year_death-nobel$year_birth
    cbind(nobel$year_birth,nobel$year_death,nobel$duree)
Why are there many NAs (Not Available)?

Managing Missing Values
Problem in calculating the average lifespan:
    mean(nobel$duree)
Useful functions:
– is.na() : indicates missing values
– complete.cases() : identifies the lines of a dataframe (individuals) with no missing values
– na.omit() : removes individuals with at least one missing value
We can eliminate the NAs:
    nobel_no_na= subset(nobel, subset=!is.na(duree))
    dim(nobel)
    dim(nobel_no_na)
The average lifespan of a Nobel Laureate is:
    mean(nobel_no_na$duree)
Save the workspace:
    save.image("nobel.Rdata")

2) Dataframe generation
We can create a dataframe that includes a sample of 5 individuals.
We have information about the age, gender, and weight of individuals.
(3 variables)
    age=c(20,25,19,33,16)
    gender=c("M","F","M","F","M")
    weight=c(70,60,100,75,90)
Create the dataframe:
    data=data.frame(a=age, g=gender, w=weight)
    class(data)
    data
You can select each variable:
    data$g
You can select a single value:
    data[1,2]
You can change a value:
    data[1,1]=30
    data

3) The dataframes already available in R or in a database package
For a complete list with details:
    library(help = "datasets")

Example 1: Motor Trend US magazine, 1974
Data for 32 cars (Models 1973–74)
    cars=mtcars
    class(cars)

Example 2: Weight and food of chickens in a chicken coop
    chicken=chickwts
    class(chicken)

Example 3: Distribution of the magnitude of earthquakes near the Fiji islands from 1964
    quake=quakes
    class(quake)

A database package
Install the Ecdat package with the command:
    install.packages("Ecdat")
Load a package in the workspace:
    library("Ecdat")
    library(help="Ecdat")

Example DoctorAUS:
    doctor=DoctorAUS
    class(doctor)

New tool
toolbox.google.com/datasetsearch
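
The slides list subset(), merge(), order() and the missing-value helpers only in their general form. As a minimal sketch, they can be tried on two small dataframes. The dataframes people and cities and all their values are invented here for illustration; they are not part of the Nobel data or of the original slides.

```r
# Hypothetical toy data, invented for illustration (not the Nobel dataset)
people = data.frame(id = c(1, 2, 3, 4),
                    age = c(20, 25, NA, 33),
                    weight = c(70, 60, 100, 75))
cities = data.frame(id = c(1, 2, 4),
                    city = c("Nice", "Lyon", "Paris"))

# subset(): keep individuals older than 21 and only two variables.
# The individual with a missing age is dropped because the condition is NA.
older = subset(people, subset = age > 21, select = c(id, age))

# merge(): join the two dataframes on the common variable "id".
# By default only the ids present in both dataframes are kept.
joined = merge(people, cities, by = "id")

# order(): sort the dataframe by weight, from lightest to heaviest
sorted = people[order(people$weight), ]

# Missing values
is.na(people$age)               # TRUE only for the third individual
complete.cases(people)          # FALSE only for the third individual
people_no_na = na.omit(people)  # removes the individual with id 3

print(older)
print(joined)
print(sorted)
print(dim(people_no_na))
```

To keep the individuals without a match in the join, merge() also accepts the argument all=TRUE; the sketch above uses the default behaviour.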
