Big Data in Social Sciences - PDF
Summary
These are notes for a course on Big Data in Social Sciences, focusing on writing reproducible code in R. The document covers introductions to data analysis concepts, code examples, and best practices for reproducibility in research.
BIG DATA IN SOCIAL SCIENCES
Week 1 - Writing code for reproducible research

Our world is surrounded by data. It used to be mostly “old types of data”, but lately “new data sources” have increased significantly:
- Social media data: websites, blogs
- GIS data: satellite, climate
- Economic data: trade, company information
- Military data: casualty, insurgent attacks
- Randomized experiments / survey data

This shift to new data required new substantive ideas and new data analysis tools (e.g. massive changes like the internet and the computing revolution). In the past only statisticians analyzed data; nowadays everyone analyzes data. Quantitative reasoning is needed in this data-driven world: analyzing, interpreting, describing and evaluating data is essential to making good decisions in society and at work.

Writing code for reproducible research
Using comments can save a lot of time. Explain why, not how or what, and remember to update comments if the code changes.

Object names must start with a letter and can only contain letters, numbers, _ and . (spaces are not allowed). There are different naming conventions in programming:
- snake_case (recommended by the professor)
- camelCase
- PascalCase
- kebab-case (not usable for R object names, because - is the subtraction operator)

Put spaces on either side of mathematical operators apart from ^ (e.g. +, -, ==).

Pipes (%>%) allow you to pass the output of one function directly as the input to the next function, making code more readable and concise. A pipe should have a space before it and should usually be the last thing on a line. The result on the left-hand side of the pipe becomes the input of the function on the right-hand side.

As the script gets longer, remember to section the code (e.g. # load data / # plot data).

Reproducible research can be exactly redone: given the material used, another person must be able to reproduce it with your code, data, and environment and get your results.
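As a rough illustration of the style points in the "Writing code for reproducible research" notes above (comments that explain why, snake_case names, spacing around operators, pipes at the end of a line, sectioned code), here is a minimal sketch. The data frame `survey_data` and all of its values are made up for illustration. The sketch assumes R 4.1 or later because it uses the base pipe `|>`, so no package is needed; the tidyverse pipe `%>%` is laid out the same way.

```r
# ---- load data ----
# Hypothetical survey data, defined inline so the sketch runs on its own.
survey_data <- data.frame(
  respondent_id = 1:5,
  age = c(23, 35, 41, 29, 52),
  income = c(1800, 2400, 3100, 2000, 3600)
)

# ---- transform data ----
# Express income relative to the sample mean so that respondents are
# comparable on a common scale (this comment says why, not what).
mean_income <- mean(survey_data$income)
survey_data$income_ratio <- survey_data$income / mean_income

# Spaces around operators such as / and <-, but none around ^.
survey_data$age_squared <- survey_data$age^2

# ---- summarise data ----
# Each pipe has a space before it and is the last thing on its line.
mean_ratio_over_30 <- survey_data |>
  subset(age > 30) |>
  with(mean(income_ratio))

print(mean_ratio_over_30)
```

The section markers (`# ---- load data ----`) are ordinary comments; RStudio also happens to treat comments ending in four or more dashes as foldable sections.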
Code, dataset and environment must be released. Document the workflow so that it answers questions about the original dataset, the transformations applied to the data, the analysis done, how the paper was built, and how to follow that same process.

R projects
An R project enables your work to be bundled in a portable, self-contained folder containing all relevant data and code. setwd() is a function in R used to set the working directory, which is the folder where R reads and saves files by default. When you use this function, you tell R where to look for files and where to save output. A project is simply a working directory designated with a .Rproj file. When you open a project (using File/Open Project in RStudio or by double-clicking on the .Rproj file outside of R), the working directory will automatically be set to the directory that the .Rproj file is located in.
E.g. setwd("/Users/yourname/Documents/RProjects")
After setting the working directory, you can use relative paths in your code instead of specifying the full path for file operations like reading or writing files. To check your current working directory, use the getwd() function.

Quarto / R Markdown
Quarto integrates code and natural language in “literate programming”. It is the successor of R Markdown (which allows R code chunks to be included in a document) and uses a markup language similar to HTML or LaTeX. With Quarto you get a live document where the code executes and its output forms part of the document. It can be compiled to HTML or PDF, but this can take a while since the code needs to run.

Tidyverse
The tidyverse is an opinionated collection of R packages designed for data science. To install and load the complete tidyverse, the code is install.packages("tidyverse") followed by library("tidyverse").
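Returning to the working-directory points in the R projects notes above, the following is a minimal sketch of how setwd(), getwd() and relative paths fit together. The folder name `my_project` and the file `data/example.csv` are hypothetical and are created in a temporary location purely so the sketch can run anywhere; in a real project, opening the .Rproj file would set the working directory for you.

```r
# Remember the starting directory so the sketch can restore it afterwards.
original_dir <- getwd()

# Hypothetical project folder, created under tempdir() for illustration only.
project_dir <- file.path(tempdir(), "my_project")
dir.create(file.path(project_dir, "data"), recursive = TRUE, showWarnings = FALSE)

# Opening an .Rproj file in RStudio would perform this step automatically.
setwd(project_dir)

# Relative paths now resolve against the project folder,
# so no full path such as /Users/yourname/... is needed.
write.csv(data.frame(x = 1:3), file.path("data", "example.csv"), row.names = FALSE)
example_data <- read.csv(file.path("data", "example.csv"))

# Check where R is currently reading and writing files.
current_dir <- getwd()
print(current_dir)

# Restore the original working directory.
setwd(original_dir)
```

Because the script only uses relative paths after the working directory is set, the same code keeps working when the whole project folder is moved or shared.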
Example of the difference between base R and the tidyverse to get the same output:

Base R
## calculate the ratio compared to 1950
> UNpop$ratio ...
## convert to percentage increase and round
> UNpop$percent ...
> UNpop$ratio
... 1.761456 2.106604 2.426063 2.738238
> UNpop$percent
0 20 46 76 111 143 174

Tidyverse
> UNpop %>% ...
# convert to percentage increase and round

Columns involved: UNpop$world.pop, UNpop$ratio, UNpop$percent
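The right-hand sides of the assignments in the example above are not shown in these notes, so the following is only a sketch of what the two versions could look like. It assumes that `UNpop` holds world population (in thousands) per decade from 1950 to 2010, with the values as in the UNpop data set that accompanies Imai's "Quantitative Social Science"; the `year` column, the population values and the exact formulas are assumptions, chosen because they reproduce the ratio and percent values printed above. The tidyverse part runs only if dplyr is installed.

```r
# Assumed data (not shown in the notes): world population in thousands.
UNpop <- data.frame(
  year = seq(1950, 2010, by = 10),
  world.pop = c(2525779, 3026003, 3691173, 4449049, 5320817, 6127700, 6916183)
)

# ---- base R: add columns one at a time with $ ----
UNpop_base <- UNpop
## calculate the ratio compared to 1950 (the first row)
UNpop_base$ratio <- UNpop_base$world.pop / UNpop_base$world.pop[1]
## convert to percentage increase and round
UNpop_base$percent <- round((UNpop_base$ratio - 1) * 100)
print(UNpop_base$percent)

# ---- tidyverse: the same two steps inside one piped mutate() ----
if (requireNamespace("dplyr", quietly = TRUE)) {
  library(dplyr)
  UNpop_tidy <- UNpop %>%
    mutate(
      # ratio compared to 1950
      ratio = world.pop / world.pop[1],
      # convert to percentage increase and round
      percent = round((ratio - 1) * 100)
    )
  print(UNpop_tidy)
}
```

In the base R version each new column is written back with `UNpop_base$...`, while `mutate()` refers to columns by bare name and can use a column (`ratio`) created earlier in the same call.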