L01-IntroR - DATA-100-A - Intro to Data Analytics.pdf

Full Transcript


10/28/24, 1:25 AM
L01: Intro to the Course and R
AUTHOR: Shengda Hu, modified by Devan Becker

Welcome!

Course Roadmap
Get programming!
Play with data!
Learn the foundations of data science!
Start getting insights with real data!
We’ll work with real data throughout the course!

RTFS: READ THE SYLLABUS
I will follow the policies outlined in the syllabus.
It’s unfair to give advantages to some students and not others.
This is a large class, so “one exception” means a lot of extra work.
Add dates to your calendar now! Like, right now! Do it!

The Textbook(s)
1. R for Data Science (1e), 2017.
2. R for Data Science (2e), 2023. We will follow this 2nd edition.
3. Tidy modelling with R. Will supplement this with content from R4DS.
We will try to cover the majority of 2e (out of order), together with modelling basics in Tidy modelling with R.

The language
The language R and the packages in tidyverse evolved quite a bit between the two editions. We will mostly follow the second edition for this reason.
We will go through the material in a slightly different order from how the textbook does it.

Quarto versus Rmd
Quarto (in 2nd edition)
R Markdown (in 1st edition)
We will get you acquainted with both, which are not that different, and each has its own benefits.
The lectures will use Quarto, which facilitates generating outputs of various formats easily.
The labs/assignments will use R Markdown, which facilitates submission and grading.

Quarto (and R Markdown)
We will not spend much time talking about the format and details of Quarto (.qmd) or R Markdown (.Rmd) files.
Take the lecture files as templates
https://mylearningspace.wlu.ca/d2l/le/content/551434/viewContent/3838361/View
Read 2e, 29 Quarto and 1e, 27 R Markdown to understand more about these files
For Quarto, start with 29.3 Visual editor in RStudio
Source, Visual and Render in RStudio
Visual editor is available in RStudio for .Rmd files as well. Knit instead of Render
Of course, ask questions, and learn from mistakes
Check out 2e, 7 Workflow: scripts and projects on the website, to see the basics of the current RStudio interface, as well as running scripts and starting projects

Editor
Visual editor is very convenient. Otherwise, the Source file can be edited like any other text file.
Keyboard shortcuts are handy:
Usual Cmd/Ctrl+C, Cmd/Ctrl+V for copy-pasting; and Cmd/Ctrl+S for saving
Cmd/Ctrl+Alt+I adds a code block
Cmd/Ctrl+Shift+Enter runs the current chunk
They can also be set to different keyboard shortcuts following Tools -> Modify keyboard shortcuts...
Cmd/Ctrl+Shift+P brings up the Command Palette
Challenge: For the whole semester, across all courses, use your mouse as little as possible.

Console
The Console is in the lower half of the interface.
You should learn to love (at least like) the Console. It can provide instant feedback on code that you wish to try out before including it in your files.
Enormously useful: ?
Learn to read the Help generated by ?, which shows up in the lower right panel
    ?abs()

Running Code
When you click the Render (for .qmd files) or Knit (for .Rmd files) button, a document will be generated that includes both content and the output of embedded code, like the block above.
The code block can also be run separately, by clicking the arrow button on its upper right corner.
More choices are available in the Run button on the upper right corner of this panel.

How to Learn Guitar
Read the method book through once.
Play the song through once, exactly as it’s written.
Ask ChatGPT which strings to play.
If it makes sense, move on to the next song.

How to ACTUALLY Learn Guitar
Read the method book, playing as you go
Play the song slowly
Repeat sections you struggled with
Play similar songs, practice scales
Ask ChatGPT for other things that you can practice
Never move on - always go back and re-try
Record yourself so you can see the mistakes you made
Check your posture and technique!

Code versus Guitar
Guitar is more muscle memory, code is more brain memory.
Write down your mistakes! Remind yourself of the underlying logic!
For both, practice, practice, practice and keep a logbook.

Guitar | Coding
Reading tabs | Reading code
Listening to a song | Copying and pasting code
Playing a song | Writing code (no copy/paste!)
Learning many songs | Writing different (but similar) code
Practicing scales | Explain code logic to yourself (comments)
Posture | Posture
Hand position | Code style
Jamming | Code review
Soloing | NA (Coding will never be this cool)

Using GenAI (e.g., ChatGPT) in This Course
YOUR SUBMITTED WORK MUST BE WRITTEN BY YOU AND ONLY YOU.
Use GenAI to:
Get explanations of code or concepts.
  “Can you explain what "aes()" means in ggplot?”
Review your code.
  “Is this code well-written? …”
Solve similar problems to what you’re working on.
  “How can I use dplyr to remove NAs from my data?”

WARNING
GenAI has the capacity to reproduce its training set. Copyright infringement!
Submitted work must be your own! GenAI output is not your own (and might be copyrighted).
My course materials are copyrighted; GenAI records prompts. It is academic misconduct to put any course material into GenAI.

What is Data Analytics?
Four V’s of big data
Volume and Velocity demand computational power and knowledge
Variety creates needs for different methods, so that one size cannot fit all
Veracity is very tricky, as is being played out every day as we speak
AI might make matters worse or better.

Data Project Workflow
From the textbook:
Tidy: fix/remove any errors in the data
Transform: Put the data in a useful format
The “Understand” workflow is an infinite loop!

Preparation and Refinement
Data Analysis (for this course at least) mostly concentrates on Preparation and Refinement
Trustworthy data source: “garbage in, garbage out” … Statistics can help discover inconsistencies for further investigation
How analysis is presented matters: communication is part of the job, and coding is simply communicating with software / computer systems … Specifying qualities of the data helps clarify suitable applications

Data science
A concept that is still in the making, and that may take many shapes and forms for different people.
Key ideas:
Look at data from the point of view of a scientist: do experiments, make hypotheses, test hypotheses with more experiments, rinse and repeat, ……, arrive at some conclusion
Look at decision making of all kinds from the point of view of data: tell / refute a story with data, make / break a decision by data, persuade / disillusion using data, ……

Data scientists
Data science is a tool. The correctness of the conclusion depends on the correctness of the inputs, but also depends on interpretation. Use the tool responsibly!
The science: math, statistics, computer programming, etc.
The output: figures, diagrams, reports, presentation, etc.
The hope: try to be useful, and hopefully helpful.

To get you excited
From Wikipedia: Hans Rosling (27 July 1948 – 7 February 2017) was a Swedish physician, academic, and public speaker …… held presentations around the world, including several TED Talks in which he promoted the use of data to explore development issues.
From YouTube: The joy of stats, BBC-4
From more recent media:
Florence Nightingale’s data revolution, R. J. Andrews, in Scientific American, August 2022, pp. 78 – 85.
The Future of Recycling Is Sorty McSortface, Joe Fassler, in The Atlantic, August 2023.

To get you worried
Quality
Free as speech (never free as beer)
Cost in extraction and use: human, energy and environment
Effect in application
From more recent media:
Google changes estimates on flight emission: BBC news article (September 2022) and the rotten link to GitHub source maintained / given up by Google.
Genetics makes some people more likely to participate in genetic studies, Ars Technica, August 17, 2023.
More generally: Catalog of biases

The elephant(s?) in the room
Generative AI (way beyond the scope of this course)
LLMs (Large Language Models), a growing list of which can be found on Wikipedia
Other types of generative AIs, again from Wikipedia
All the tasks they can already perform, if we allow them. For instance, writing reasonable jokes and short stories as well as potentially unsettling poetry; proposing convincing but wrong answers to Stack Overflow questions and a downright creepy list of books to be banned.
What is for mere individual humans to do?
And tons of other questions.

Philosophical questions of LLMs
For now, AIs do not seem to understand themselves … or the relationship of themselves with reality
There may still be breakthroughs that could allow the kind of future in which they do … in the meantime, understanding as much as possible is the least we humans can do

Data need statistics (more below)

Data sets
Many freely available datasets in the wild
  Kaggle
  Open Government
  CIA
  World Bank
  COVID-19
  Weather / climate related
  …
Many more datasets in private companies
  Alphabet (parent of Google)
  Meta (previously Facebook)
  Musk’s X (previously Twitter)
  Amazon (?)
  ISPs
  Cellular providers
  Banks
  … and data brokers
Miranda warning for data collection?

Data sets and You
Very little stays with individuals
  Create, but do not accumulate
  Generate, but do not analyze
Structures in data do not become apparent until organized and cleaned
Power imbalance created / maintained using statistics and algorithms
  “Lies, damned lies, and statistics”
  Algorithm bias and Why we need to audit algorithms
Regain some balance? Understand fundamentals in statistics and design decisions in algorithms (“data literacy”) …
  Full courses needed, but they’ll be worth it
  Will discuss some descriptive statistics and basic model building

Data need programming

R the language
From Wikipedia: R is a programming language for statistical computing and graphics supported by the R Core Team and the R Foundation for Statistical Computing. ... R is used among data miners, bioinformaticians and statisticians for data analysis and developing statistical software. The core R language is augmented by a large number of extension packages containing reusable code and documentation.
tidyverse and tidymodels are suites of extension packages that this course will mostly work with, together with a few other packages that will be included as we go
In the following, we’ll start with the most elementary part of the R language

Microcosm of this Course
The Department of Statistical Sciences at the University of Toronto has a nice series of tutorials: https://dosstoolkit.com.
The following code will get it set up and run the intro tutorial. This MUST be done in R on your computer (or on a WLU Virtual Machine).

    # install the CRAN packages first
    install.packages('tidyverse')
    install.packages('remotes')
    install.packages('opendatatoronto')
    # install the GitHub packages; remotes is used throughout because devtools is not installed above
    remotes::install_github("rstudio-education/gradethis")
    remotes::install_github("kbodwin/flair")
    remotes::install_github("RohanAlexander/DoSStoolkit")
    # launch the intro tutorial
    learnr::run_tutorial("hello_world", package = "DoSStoolkit")

Arithmetic
The arithmetic in R is straightforward

    5 + 3 / 4   # mixed arithmetic operation -- recognizes order of operations
    4^3         # cube of 4
    2^(1/2)     # square root of 2
    log(3)      # NATURAL log (base e) of 3
    log10(3)    # log (base 10) of 3
    pi          # in language constant, 3.14159....
    cos(pi)     # trig function, angles in radian -- pi = 180 degrees
    sin(pi)     # issue with limited precision

    5.75
    64
    1.414214
    1.098612
    0.4771213
    3.141593
    -1
    1.224647e-16

Notice: log, log10, cos and sin have () following them containing another value. They are examples of a function, which will be briefly described below.

Object Types
There are different types of objects in R.
For instance:
2L is an integer (the L specifies that 2 is an integer)
3.4 is a double precision number (real number, writing 2 without L following it makes it a double as well)
'a' is a character ('abd' is a string)
TRUE is a boolean value (as is FALSE)

Booleans
The following expressions involve ==, a boolean operator that produces TRUE if both sides are the same and FALSE if they are not the same … there are always more details, but we can wait till they actually show up
!= is the opposite of ==.

    2 + 2 == 5
    2 + 2 != 5

    FALSE
    TRUE

Variables
Variables are useful in expressing relations and describing procedures. In the simplest situation, they help with repetitive work.

    # assign values to variables, or `names`
    x = 3 # "=" sign can be used
    y
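
As a minimal sketch of the object types and boolean operators described above, the following can be pasted into the Console. It uses only base R functions (typeof(), all.equal(), isTRUE()); the variable names are illustrative and do not come from the lecture. It also ties back to the limited-precision result for sin(pi) in the Arithmetic section, which is one of the "more details" behind ==.

```r
# Check how R stores each example value with typeof() (base R).
typeof(2L)     # "integer"
typeof(3.4)    # "double"
typeof(2)      # "double": no L suffix, so 2 is stored as a double
typeof('a')    # "character"
typeof('abd')  # "character": R reports the same type for one letter and for longer strings
typeof(TRUE)   # "logical": R's name for boolean values

# == and != return logical values that can be stored like any other value.
# (Variable names here are hypothetical, chosen for illustration.)
four_is_five <- 2 + 2 == 5      # FALSE
four_is_not_five <- 2 + 2 != 5  # TRUE

# Limited precision: sin(pi) is tiny but not exactly 0, so == reports FALSE.
exact_match <- sin(pi) == 0     # FALSE

# all.equal() compares numbers up to a small tolerance;
# isTRUE() turns its result into a single TRUE or FALSE.
close_enough <- isTRUE(all.equal(sin(pi), 0))  # TRUE
```

When comparing results of decimal arithmetic, a tolerance-based check such as all.equal() is usually a safer choice than ==, which tests for exact equality.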
