Big Data in Social Sciences - PDF

Summary

These are notes for a course on Big Data in Social Sciences, focusing on writing reproducible code in R. The document covers introductions to data analysis concepts, code examples, and best practices for reproducibility in research.

Full Transcript

BIG DATA IN SOCIAL SCIENCES

Week 1 - Writing code for reproducible research

Our world is surrounded by data. It used to be mostly about “old types of data”, but lately “new data sources” have increased significantly:
- Social media data: websites, blogs
- GIS data: satellite, climate
- Economic data: trade, company information
- Military data: casualties, insurgent attacks
- Randomized experiments / survey data

This shift to new data required new substantive ideas and new data analysis tools (e.g. massive changes like the internet and the computing revolution). In the past only statisticians analyzed data; nowadays everyone analyzes data.

Quantitative reasoning is needed in this data-driven world: analyzing, interpreting, describing and evaluating data is essential to make good decisions in society and at work.

Writing code for reproducible research

Comments can save a lot of time. Explain why, not how or what, and remember to update comments if the code changes.

Object names must start with a letter and can only contain letters, numbers, _ and . (no spaces). There are different naming conventions in programming:
- snake_case (recommended by the professor)
- camelCase
- PascalCase
- kebab-case (not valid for R object names, because - is read as the minus operator)

Put spaces on either side of mathematical operators apart from ^ (i.e. +, -, ==, etc.).

Pipes (%>%) allow you to pass the output of one function directly as the input to the next function, making code more readable and concise. They should have a space before them and are usually the last thing on a line.
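The naming, spacing and pipe conventions above can be sketched in a few lines of R. This is only an illustration: the object names and numbers are hypothetical, and the example uses the native pipe |> (available in R 4.1 or later) in place of %>% so that it runs without loading any package. For simple chains like this one, the two pipes read the same way.

```r
# Sketch only: object names and values are hypothetical, chosen for illustration.

# snake_case object names: start with a letter; letters, numbers, _ and . only
survey_scores <- c(4, 9, 16, 25)

# Spaces on either side of *, + and ==, but none around ^
scaled_scores <- survey_scores * 2 + 1
squared_scores <- survey_scores^2
is_large <- scaled_scores == 51

# Without a pipe, nested calls have to be read from the inside out
mean_root_nested <- round(mean(sqrt(survey_scores)), 1)

# With the native pipe |> (assumes R >= 4.1), each step feeds the next one;
# the pipe has a space before it and is the last thing on its line
mean_root_piped <- survey_scores |>
  sqrt() |>
  mean() |>
  round(1)

print(mean_root_nested)
print(mean_root_piped)
```

Both versions compute the same value; the piped version simply lists the steps in the order they happen.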
With a pipe, the output of the step on the left is redirected to become the input of the step on the right.

As the script gets longer, remember to divide the code into sections (# load data / # plot data).

Reproducible research can be exactly redone: given the material used, another person must be able to reproduce it with your code, data and environment, and get your results.
- Code, dataset and environment must be released.
- Document the workflow, so that it answers questions about the original dataset, the transformations applied to the data, the analysis done, how the paper was built, and how to follow that same process.

R projects

An R project enables your work to be bundled in a portable, self-contained folder containing all relevant data and code.

A project is simply a working directory designated with a .Rproj file. When you open a project (using File/Open Project in RStudio or by double-clicking on the .Rproj file outside of R), the working directory will automatically be set to the directory that the .Rproj file is located in.

setwd() is a function in R used to set the working directory, which is the folder where R reads and saves files by default. When you use this function, you tell R where to look for files and where to save the output.
E.g. setwd("/Users/yourname/Documents/RProjects")

After setting the working directory, you can use relative paths in your code instead of specifying the full path for file operations like reading or writing files.

To check your current working directory, you can use the getwd() function.

Quarto / R Markdown

Quarto integrates code and natural language in “literate programming”.
It is the successor of R Markdown (which allows R code chunks to be included) and is a markup language similar to HTML or LaTeX.

With Quarto you can get a live document where code executes and then forms part of the document. This can be compiled to HTML or PDF, but it can take a while since the code needs to run.

Tidyverse

The tidyverse is an opinionated collection of R packages designed for data science. To install the complete tidyverse, run install.packages("tidyverse"); to load it, run library("tidyverse").

Example of the difference between R basic and tidyverse to get the same output:

R basic
  ## calculate the ratio compared to 1950
  > UNpop$ratio [...] UNpop$world.pop [...]
  ## convert to percentage increase and round
  > UNpop$percent [...]
  > UNpop$ratio
  [...] 1.761456 2.106604 2.426063 2.738238
  > UNpop$percent
  0 20 46 76 111 143 174

Tidyverse
  > UNpop %>%
  # convert to percentage increase and round
  [...]
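The base R versus tidyverse comparison above can be sketched as follows. This is a hedged reconstruction, not the course's exact code: the population figures (world population in thousands, 1950 to 2010) and the expressions on the right-hand side of each assignment are assumptions, chosen because they reproduce the ratio and percent values shown in the notes. It also assumes the dplyr package (part of the tidyverse) is installed.

```r
# Sketch under stated assumptions: the data values and the exact expressions
# are assumed, not taken from the notes.
library(dplyr)  # assumes dplyr (part of the tidyverse) is installed

# Assumed data: world population in thousands, one row per decade
UNpop <- data.frame(
  year = seq(1950, 2010, by = 10),
  world.pop = c(2525779, 3026003, 3691173, 4449049, 5320817, 6127700, 6916183)
)

# Base R: add columns one at a time with $ and <-
base_result <- UNpop
## calculate the ratio compared to 1950
base_result$ratio <- base_result$world.pop / base_result$world.pop[1]
## convert to percentage increase and round
base_result$percent <- round((base_result$ratio - 1) * 100)

# Tidyverse: the same two steps inside one piped mutate() call;
# columns are referred to by name, without repeating UNpop$
tidy_result <- UNpop %>%
  mutate(
    ratio = world.pop / world.pop[1],
    percent = round((ratio - 1) * 100)
  )

print(base_result$percent)
print(tidy_result$percent)
```

Both approaches give the same percent column. The tidyverse version avoids repeating the data frame name and keeps the two transformations together in one step.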
