CSE5DEV Week 1-4 Syllabus PDF
Document Details
La Trobe University
2023
Dr Kiki Adhinugraha
Summary
This document provides an outline, lecturer information, subject materials, timetable, prerequisites, assessment details, and sources of help for a CSE5DEV course at La Trobe University. It also includes a brief overview of an R programming course. The document appears to be lecture notes.
Full Transcript
October 8, 2023

Outline
Section 1: General Information
Section 2: Introduction
Section 3: Basics of R Programming

Lecturer info
▶ Name: Dr Kiki Adhinugraha
▶ Email: [email protected]
▶ Website: https://scholars.latrobe.edu.au/kadhinugraha
▶ Consultation Time: Wed 01:00 PM-02:00 PM, PS1-215A, by appointment only
▶ Research interests: Spatial Data Science

Subject materials
▶ Subject Homepage: Go to LMS: https://lms.latrobe.edu.au/course/view.php?id=135911
▶ Lecture slides: All slides will be available on the Subject Homepage (LMS).
▶ Recommended textbooks
  ▶ Hastie et al. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2009, Springer.
  ▶ Downey. Think Stats: Exploratory Data Analysis. 2011, Amazon.
  ▶ Fischetti. Data Analysis with R. 2015, Packt Publishing.
▶ Other materials: I may point you to some other supporting materials through the LMS News & Announcements Forum. Everything you are required to know is covered by the subject slides.

Timetable
▶ Lecture
  ▶ Thursday - 9:00 AM to 11:00 AM - DMC-01-C121.
▶ Labs
  ▶ Lab-1: Thursday - 1:00 PM to 3:00 PM - BG-106.
  ▶ Lab-2: Thursday - 3:00 PM to 5:00 PM - BG-106.
  ▶ Lab-3: Friday - 9:00 AM to 11:00 AM - BG-104.

Prerequisites:
▶ CSE4DBF: DATABASE FUNDAMENTALS.
▶ MAT4NLA: NUMBER SYSTEMS AND LINEAR ALGEBRA.
▶ Or any programming subjects.

Lab coding: This subject uses the R programming language.
▶ This subject is not about programming, but we do need a language for the practical parts.
▶ R: At a minimum, you should be comfortable with basic data types and structures, reading data, writing output, loading packages, visualisation, and implementing algorithms.
▶ Note: I will cover the basics of R in Lecture 1, Lecture 2 and Lecture 3.
▶ Warning: The assignments and labs are all in R, so you will struggle if you are not comfortable with R.

Maths :)
This subject has some maths in it. We need to understand certain algorithms and data analysis tools.
▶ Equations (summations, recursions): $\sum_{i=0}^{n} 2$.
▶ Arithmetic and partial sum simplifications: $\sum_{i=0}^{n} 2 = 2n + 2$.
If you are uncomfortable with the above, it is advisable that you review your maths subjects.

Attending lectures:
▶ Everything you are required to know is covered by the subject slides.
▶ BUT the subject slides are designed assuming that you are attending the weekly lectures.
▶ During the lectures, you can ask questions at any time.
▶ If you cannot attend:
  ▶ All lectures will be recorded and are available to watch on the Subject Homepage (LMS).
  ▶ You can post your questions in the LMS subject Q & A Forum.

Attending labs
▶ The focus in all labs is on application rather than on theory.
▶ Labs consolidate your understanding and help you translate your knowledge into running code.
▶ Practising and coding are crucial: implement all lab tasks.
▶ Please finish the lab tasks for better learning.
Remember that computer skills need a systematic approach. Therefore, every week's learning is built on the previous week's learning.

Assessments: Assignments
This subject requires you to complete and submit 2 assignments:
▶ Assignment 1 - worth 15% of your final mark. It covers Week 1 to Week 6.
▶ Assignment 2 - worth 25% of your final mark. It covers Week 4 to Week 10.
Both assignments will be written reports using R programming.

Assessments: Exam
The final exam will be online.
▶ One 2-hour examination - worth 60% - Semester 2 exam period.
▶ Examinable materials
  ▶ Everything we cover in the lectures and labs is examinable unless explicitly stated otherwise.
▶ Requirements for passing: To obtain a pass, you must:
  ▶ Accumulate at least 50% over all forms of assessment.

Sources of help (We are here to help!)
▶ See me as soon as possible if you are having difficulties which may prevent you from completing the subject.
▶ Ask lab demonstrators!
▶ Weekly consultation times.
▶ Discussion forum and each other!
▶ Extensions:
  ▶ Please do not wait until the day before it is due.
  ▶ Extensions are only granted in exceptional circumstances and require the official process of "Special Consideration" to be followed.

Academic Misconduct
▶ Please read the University Plagiarism Statement in the subject guide very carefully.
▶ In short, cheating, whether by fabrication, falsification of data, or representing the work of someone else as your own, is an offence subject to University disciplinary procedures.
▶ Plagiarism may result in charges of academic misconduct, which carry a range of penalties including cancellation of results and exclusion from the subject.
▶ Exact penalties are decided in formal plagiarism hearings.
▶ All assignments, weekly quizzes and the exam must be done individually.

Student feedback on subject survey
The Student Feedback on Subjects (SFS) Survey is part of the quality assurance process that occurs across the university. In this survey you are invited to tell us about your learning experiences in this subject. Your views will be taken seriously and will assist us to enhance this subject for the next group of students. Your feedback will also contribute to the text for "Summary of Previous Student Feedback" below, so please take the time to tell us your views. The surveys are anonymous and will be distributed prior to the end of the teaching period.

CSE5DEV LMS: https://lms.latrobe.edu.au/course/view.php?id=106176

CSE5DEV content overlap:
There are two or more subjects that might have a very minor overlap with CSE5DEV. These are: CSE5DMI and CSE5ML.
▶ To find out whether your subject has any overlap with another subject(s), please check the Subject Learning Guide (SLG) and Subject Website.
▶ If you find that CSE5DEV overlaps with any elective subject, you should NOT enrol in that elective subject.
General Subject Goals
Key Goal
The goal of this subject is to equip graduate students with in-depth practical knowledge and a solid understanding of the latest data exploration techniques and tools, in order to find practical solutions to real-world problems.
The goals of CSE5DEV are:
▶ Theory:
  ▶ Understand the basics of data types and notations.
  ▶ Understand data exploration and analysis steps.
▶ Practice:
  ▶ Learn to implement visualisation techniques in the context of data exploration and analysis.
  ▶ Learn to implement various tools and techniques to solve a variety of problems.

Data exploration and analysis process steps
Process steps
In data exploration and analysis, we often need to execute different steps to achieve our goal. We need to know the right data to draw accurate conclusions and inform decision makers. The process steps are also known as Problem Solving.
Problem solving
Problem solving is the process of identifying a problem, developing possible solutions, and performing the appropriate action(s).

Data exploration and analysis process steps
Problem solving can be summarised as follows:
▶ What is the question(s)?
▶ Design a solution method.
▶ Implement the solution method.
▶ Testing.
▶ User evaluation.
▶ Refinement.

What Is the Question?
Question
We often start with high-level questions. For example:
▶ How to track house prices across different areas?
▶ How to track customer behaviour in different groups?
▶ What is going to be the fuel price in the next month?
Understanding the objectives and requirements is crucial to a successful data exploration project. In order to answer the above question(s), we need to:
▶ understand the data format.
▶ understand the structure and size of the data.
▶ know which variables suggest interesting relationships.
▶ know which observations are usual and unusual.

Design, implement and communicate
In the data exploration and analysis subject you will learn:
▶ How to format and organise data.
▶ How to clean and normalise data.
▶ How to use statistical techniques for the exploratory analysis of data.
▶ How to use visualisation tools to begin uncovering the structure of your data.
▶ How to implement various tools and techniques in the R programming language.
▶ How to communicate your results using the R programming language.

Subject Syllabus
Lecture 1   Introduction
Lecture 2   Data Collection & R Programming
Lecture 3   Data Wrangling & R Programming
Lecture 4   Data Cleaning & Normalisation
Lecture 5   Data Visualisation
Lecture 6   Data Exploration 1 (Analysis)
Lecture 7   Data Exploration 2 (Analysis)
Lecture 8   Data Exploration 3 (Analysis)
Lecture 9   Correlation & Pattern Discovery Analysis
Lecture 10  Case Study 1
Lecture 11  Case Study 2
Lecture 12  Revision

Data Science Project
Almost all data science and analysis projects require the same set of stages to be performed. These are:
Stage 1 - Identify the problem (question)
  What is the goal? What do you want to estimate? How to track house prices across different areas?
Stage 2 - Collect & prepare the data
  Data resources. Data representation. Clean and normalise the data.
Stage 3 - Explore the data
  Descriptive statistics. Visualisation. Does the result make sense?
Stage 4 - Communicate the results
  What are the findings? What did we learn? Report the findings.

Data can be explored using manual tools, automation tools, or both.
▶ Manual tools: Excel, Notepad, MS-Word.
▶ Automation tools: a programming language, e.g., R.
▶ Hybrid: manual tools and automation tools.

Users vs. Programmers
▶ Users see computers as a set of tools - word processor, email, Excel, notes, etc.
▶ Programmers learn computer languages to write programs.
▶ Programmers use some tools that allow them to build new tools.
▶ Programmers often build tools for lots of users and/or for themselves.

What is a program?
Program
A program is a set of actions (or rules) to accomplish a specific task.
Program Development Cycle
The process of creating a program that works correctly typically involves five phases, known as the program development cycle:
1 Design the program steps
2 Write R code
3 Correct syntax errors
4 Run the program
5 Correct logic/output errors

What is a programming language?
Programming language
A programming language comprises a set of instructions to produce various kinds of output. Programming languages are used in computer programming to implement algorithms.
Examples of computer programming languages are:
▶ R
▶ C, C++
▶ Java
▶ Python

What is R?
R
R is a high-level programming language that uses a set of instructions or rules for instructing a computer to perform specific tasks.

R Features
▶ R is a free software environment for statistical computing and graphics.
▶ It can be easily extended with 15,000+ packages available on CRAN (as of Jun 2019).
▶ Many other packages are provided in Bioconductor, R-Forge, GitHub, etc.
▶ Many R manuals and books are available on CRAN:
  ▶ An Introduction to R
  ▶ The R Language Definition
  ▶ ...

Why do we use R?
▶ R is easy to understand and implement.
▶ R is widely used in both academia and industry.
▶ R was ranked #1 in the KDnuggets 2016 poll on Top Analytics and Data Science software (actually R has been #1 in a row from 2011 to 2019!).
▶ The CRAN Task Views provide collections of packages for different tasks:
  ▶ Machine learning & Deep learning
  ▶ Statistical learning
  ▶ Visualisation
  ▶ Optimisation
  ▶ ...

R for CSE5DEV
To use R in CSE5DEV labs, we need to do the following steps:
▶ Step-1: Install the R programming language.
▶ Step-2: Install RStudio - an integrated development environment (IDE) for writing R code.

R for CSE5DEV
Before you can try to write any R programs, you need to:
▶ Make sure that R is installed on your computer and properly configured.
▶ If you are working in a Uni computer lab, this has been done already.
▶ If you are using your own computer, you can follow the instructions in the next slides to install R from the Internet.

Basics of R Programming
Step 1 - download and install R: https://cran.r-project.org/
Step 2 - download and install RStudio: https://rstudio.com/products/rstudio/download/

How do RStudio and R work?
[Diagram: the CSE5DEV student writes R code in the RStudio interface; RStudio runs R (the R programming software) in the background on the computer; the output is shown on the PC monitor.]
Note: you ONLY need to write and run your code in the RStudio interface.

RStudio Interface
[Screenshots of the RStudio interface.]

Further slides (content shown as images):
▶ R - Elementary arithmetic operators
▶ R - Numeric functions
▶ R - Special values
▶ R - Data Types

Program Design Steps
▶ Designing the program is regarded as the most important part of the Program Development Cycle.
▶ Define the task(s) that the program is to perform.
▶ Determine the steps that need to be implemented to perform the task(s).
▶ There are several ways to design a program, such as pseudocode and flowcharts.

Flowchart
What is a flowchart?
A flowchart is a diagram that graphically describes the steps that take place in a program.
▶ It shows steps in sequential order.
▶ It shows the steps as boxes of various kinds, and their order by connecting them using arrows.

Flowchart - examples
▶ Q: Calculate the average of two numbers: num1 and num2.
▶ Q: Calculate the sum of two numbers: A and B.
▶ Q: A school timetable.

Input, Processing, and Output
Computer programs
A computer program usually performs the following three steps:
▶ Take input(s).
▶ Perform some process on the input(s).
▶ Produce output(s).

Input, Processing, and Output - example
Example: Calculate the average of two numbers: num1 and num2.
Input        Process                       Output
num1, num2   average = (num1 + num2) / 2   average

Input, Processing, and Output - examples in R
▶ R: Calculate the average of two numbers: num1 and num2.
▶ R: Calculate the sum of two numbers: A and B.

Do's and Don'ts
▶ Please attend lectures and labs regularly.
▶ Please do ask questions whenever you have any doubt. Again, participation is very important to understand this subject well.
▶ Practice makes perfect. So, whenever you are given an exercise, please try and practise it.
▶ Come to lectures and labs on time.
▶ Avoid using mobile phones during lectures and labs.
▶ If you wish to communicate with me, please use your La Trobe email (this is due to the Privacy Act).

End of Week 1
See you next lecture (Week 2): Data Collection & R Programming
Table: CSE5DEV Timetable - check LMS

Week 2
CSE5DEV DATA EXPLORATION AND ANALYSIS

Overview
1 Section 1: CSE5DEV Syllabus
2 Section 2: Data Collection
3 Section 3: Basics of R Programming

What have we learned so far?
Lecture 1 - Introduction
1 Install R and RStudio.
2 Create an R Markdown file.
3 Add a chunk of code.
4 Write and run basic code.
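The input, processing, and output example from Week 1 (the average of two numbers) could be written in R roughly as follows. This is a minimal sketch: the values given to num1 and num2 are assumptions chosen for illustration and do not come from the slides.

```r
# Input: two numbers (illustrative values, not taken from the slides)
num1 <- 10
num2 <- 20

# Process: compute the average of the two inputs
average <- (num1 + num2) / 2

# Output: display the result
print(average)
## [1] 15
```

The same three-step shape (input, process, output) applies to the sum of A and B: only the process line changes.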
Week 2 Overview
Data Collection & R programming
This week covers the basics of Data Collection & R programming.
Learning outcomes:
• Learn about sources of data.
• Learn about data types.
• Learn how to import data into R Markdown.
• Learn about R programming.

Data Collection
Data collection is the process of gathering information from a specific source, which can be used to answer relevant questions and evaluate outcomes.
Data can help us in:
• learning more about customers, items, products, etc.
• discovering trends in the current system, organisation, etc.
• segmenting elements into different groups based on their individual needs.
• the decision-making process to improve the quality of the system.
• improving the quality of the product or service based on the feedback obtained.
Data Collection
[Diagram: data passes through data exploration & analysis techniques, implemented in R code, to produce knowledge, conclusions, actions, etc.]

Data Collection
Data sources: Data can be obtained from various sources.
[Diagram: data held on the local PC, on external PCs, and on the Internet.]

Data Collection
Data format: Data can be stored in different formats.
[Figure: examples of data formats.]

Data Collection
What is data?
Data is a set of facts such as numbers, words, measurements, observations or descriptions of things. It is a set of values of qualitative or quantitative variables, collected by a wide range of organisations and institutions, such as businesses and non-governmental organisations.
▶ Qualitative data: descriptive information (describes something).
▶ Quantitative data: numerical information (numbers).

Data Collection
What is data?: Qualitative vs Quantitative
Data
  Qualitative, e.g. "The trip was great"
  Quantitative
    Discrete, e.g. 10
    Continuous, e.g. 3.3

Data Collection
Data values can be:
▶ Numeric:
  • Discrete - integer values. Example: number of cars in the park.
  • Continuous - any value in a pre-defined range (float, double). Example: average mark (e.g., 63.4).
▶ Categorical: values are selected from a predefined number of categories.
  • Ordinal - categories can be meaningfully ordered. Example: grades (A, B, C, D, E, F).
  • Nominal - categories do not have any order. Example: eye colours (blue, black, honey, etc.).
  • Binary - the special case of nominal, with only 2 possible categories. Example: binary value (1, 0).
▶ Date: datetime, timestamp. Example: 11.10.2018.
▶ Text: multidimensional data.
▶ Time series: data points indexed in time order.

Data Collection
Data category: data can be one of two main categories: experimental or observational.
Experimental data
Data collected from strictly controlled/designed experiments, with efforts made to ensure statistical validity.
Examples:
• Medical clinical trials
• Election polls
Observational data
Data collected from 'real-world' settings without control over the captured underlying phenomena. It is easier to collect and obtain, but results and conclusions from such data may be biased or inconclusive.
Example: Almost all data used in data mining, business analytics and data science are observational data.

Data Collection
Data type: data can be
• Numbers
• Strings
• Relational data
• Factors or categorical variables
• Dates and times
• Descriptions
We can read data from various sources or files. Files can be in any format, such as:
• name.CSV
• name.DAT
• name.TXT
• name.XLS
• name.HTML
• name.json

Data Collection
When we get new data, we often ask:
• What is in it?
• What is wrong with it?
• What should I do with it?
Answer:
• Step 1. Import the data into your code.
• Step 2. Organise the data in a readable format.
• Step 3. ...
• ....
• Step n.
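The kinds of data values described above (discrete, continuous, ordinal, nominal, binary, date) could be represented in R roughly as follows. This is a sketch under stated assumptions: all object names and values are made up for illustration, and the date 11.10.2018 is assumed to be in day.month.year order.

```r
# Discrete numeric: whole-number counts stored as integers (illustrative values)
cars_in_park <- c(12L, 7L, 30L)

# Continuous numeric: any value within a range
average_mark <- c(63.4, 71.25, 58.0)

# Ordinal categorical: an ordered factor, here with F lowest and A highest
grades <- factor(c("B", "A", "F", "C"),
                 levels = c("F", "E", "D", "C", "B", "A"),
                 ordered = TRUE)

# Nominal categorical: a factor whose levels have no order
eye_colour <- factor(c("blue", "black", "honey", "blue"))

# Binary: only two possible categories, stored here as logical values
passed <- c(TRUE, FALSE, TRUE)

# Date: parsed from text; day.month.year order is an assumption
d <- as.Date("11.10.2018", format = "%d.%m.%Y")

class(cars_in_park)     # "integer"
class(average_mark)     # "numeric"
class(grades)           # "ordered" "factor"
class(eye_colour)       # "factor"
class(passed)           # "logical"
class(d)                # "Date"
grades[1] > grades[4]   # TRUE: B ranks above C in the ordered factor
```

Ordered factors support comparisons such as `>` because their levels carry a rank; the same comparison on the nominal factor eye_colour would not be meaningful.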
Data Collection
Data importing
Data importing can be defined as the process of writing R code to get the data from disk (PC) into the R environment.
This lecture will cover Step 1.
• Step 1. Import the data into the R environment.
  1 Read the data: write R code to import data into the RStudio environment.
  2 View the data: explore, access and print.
[Diagram: a table on disk is brought into R by writing R code.]
  col 1     col 2     col 3
  Value 1   Value 2   Value 3
  Value 4   Value 5   Value 6
  Value 7   Value 8   Value 9
  Value 10  Value 11  Value 12

Basics of R Programming
In this lecture, we will learn how to write R code for the following tasks:
• Import data: reading data from a file.
• View data.
• Access data.
• Check data types.
• Export data.

Importing data
R uses various functions to import data from the working directory into the R environment. We can import data from different formats such as:
• Text files: txt files.
• Comma Separated Values: CSV files.
• Excel files: xls or xlsx files.
• Web site: URL files.
• SPSS files.
• ... etc.

Importing data
R reading function syntax:
R Code: Read data function format
    Object_name <- R_read_function("file_name.ext", Arguments)
• Object_name: a variable that can hold different values.
• R_read_function: the function used to read data from a file, chosen according to the file extension.
• file_name.ext: the name of the file to read, including its extension and location.
• Arguments: options that control how the file is read.
Examples of R reading functions:
• read.table for TEXT files
• read.csv for CSV files

Importing data - Read data from text files
Example: read data from a text file called Mytext.txt and assign the data to the dat object (or variable).
R Code: Read data from text files
    dat <- read.table("Mytext.txt", header=TRUE, sep=" ", dec=".")
• The read.table function reads the file and saves its contents in the object dat.
• header=TRUE: the first row of the file holds the header information (column names). For read.table the default is header=FALSE unless the first row has one fewer field than the other rows, so state header=TRUE explicitly when the file has a header row. If your file does not have a header, set header=FALSE.
• sep=" ": the columns are separated by a space. Other separators, such as a tab ("\t") or a comma (","), can be given instead.
• dec=".": the character used in the file for decimal points is a dot.

Importing data - Read data from CSV files
Example: read data from a CSV file called data.csv and assign the data to the dat object (or variable).
R Code: Read data from CSV files
    dat <- read.csv("data.csv", header=TRUE, sep=",")
• read.csv: reads the data from "data.csv", which includes a header row and is separated by commas (,). For read.csv, header=TRUE and sep="," are the defaults.
• By default dat will be a data frame.
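The two reading functions above can be tried without any existing file by first writing a small file to a temporary location. The sketch below is only an illustration: the file contents, the object names csv_path, txt_path and dat_txt, and the use of tempfile(), writeLines() and write.table() to create the files are assumptions, not part of the lecture material.

```r
# Create a small CSV file in a temporary location (illustrative data)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Model,mpg,cyl",
             "Mazda RX4,21.0,6",
             "Datsun 710,22.8,4",
             "Valiant,18.1,6"),
           csv_path)

# Read the CSV file; header = TRUE and sep = "," are the read.csv defaults,
# written out here only to mirror the slide
dat <- read.csv(csv_path, header = TRUE, sep = ",")

# Export the same data as a space-separated text file.
# write.table() quotes text values by default, so "Mazda RX4" stays one field.
txt_path <- tempfile(fileext = ".txt")
write.table(dat, txt_path, sep = " ", row.names = FALSE)

# Read the text file back with read.table; header must be stated explicitly
dat_txt <- read.table(txt_path, header = TRUE, sep = " ", dec = ".")

class(dat)       # "data.frame"
dim(dat)         # 3 rows and 3 columns
names(dat_txt)   # "Model" "mpg" "cyl"
```

With real files, replace csv_path and txt_path with the file name (and location, if the file is not in the working directory).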
View data
We can use the following functions to view/check the data in dat:
• names() - shows the names attribute of a data frame, which gives the column names.
• head() - shows the first 6 rows.
• tail() - shows the last 6 rows.
• dim() - returns the dimensions of the data frame (number of rows and number of columns).
• nrow() - number of rows.
• ncol() - number of columns.
• str() - structure of the data frame: name, type and preview of the data in each column.
• sapply(dataframe, class) - shows the class of each column in the data frame.

View data
Example of functions for viewing/checking data.
    dat <- read.csv("data.csv", header=TRUE, sep=",")
    names(dat)
    ##  [1] "Model" "mpg"   "cyl"   "disp"  "hp"    "drat"  "wt"    "qsec"  "vs"    "am"    "gear"  "carb"
    head(dat)
    ##               Model  mpg cyl disp  hp drat    wt  qsec vs am gear carb
    ## 1         Mazda RX4 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
    ## 2     Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
    ## 3        Datsun 710 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
    ## 4    Hornet 4 Drive 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
    ## 5 Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
    ## 6           Valiant 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
    dim(dat)
    ## [1] 32 12
    nrow(dat)
    ## [1] 32
    ncol(dat)
    ## [1] 12

View data
We can use the print() function to display the dat data on the screen.
    print(dat)
    ##                  Model  mpg cyl  disp  hp drat    wt  qsec vs am gear carb
    ##  1           Mazda RX4 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
    ##  2       Mazda RX4 Wag 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
    ##  3          Datsun 710 22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
    ##  4      Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
    ##  5   Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
    ##  6             Valiant 18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
    ##  7          Duster 360 14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
    ##  8           Merc 240D 24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
    ##  9            Merc 230 22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
    ## 10            Merc 280 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
    ## 11           Merc 280C 17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
    ## 12          Merc 450SE 16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
    ## 13          Merc 450SL 17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
    ## 14         Merc 450SLC 15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
    ## 15  Cadillac Fleetwood 10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
    ## 16 Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
    ## 17   Chrysler Imperial 14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
    ## 18            Fiat 128 32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
    ## 19         Honda Civic 30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
    ## 20      Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
    ## 21       Toyota Corona 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
    ## 22    Dodge Challenger 15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
    ## 23         AMC Javelin 15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
    ## 24          Camaro Z28 13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
    ## 25    Pontiac Firebird 19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
    ## 26           Fiat X1-9 27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
    ## 27       Porsche 914-2 26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
    ## 28        Lotus Europa 30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
    ## 29      Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
    ## 30        Ferrari Dino 19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
    ## 31       Maserati Bora 15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
    ## 32          Volvo 142E 21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

View data
str(dat) - displays the structure of the data: the type and a preview of the data in each column.
    str(dat)
    ## 'data.frame': 32 obs. of 12 variables:
    ## $ Model: chr "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive"
    ## $ mpg  : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
    ## $ cyl  : int 6 6 4 6 8 6 8 4 4 6 ...
    ## $ disp : num 160 160 108 258 360 ...
    ## $ hp   : int 110 110 93 110 175 105 245 62 95 123 ...
    ## $ drat : num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
    ## $ wt   : num 2.62 2.88 2.32 3.21 3.44 ...
    ## $ qsec : num 16.5 17 18.6 19.4 17 ...
    ## $ vs   : int 0 0 1 1 0 1 0 1 1 1 ...
    ## $ am   : int 1 1 1 0 0 0 0 0 0 0 ...
    ## $ gear : int 4 4 4 3 3 3 3 4 4 4 ...
    ## $ carb : int 4 4 1 1 2 1 4 2 2 4 ...
Based on the above, we can see that:
• dat is an object of class data.frame with 32 observations (rows) and 12 variables (columns).
• Each column is either character (chr), numeric (num) or integer (int).

R object, data type, data structure
▶ Objects or variables are used to save data values that R programs can manipulate. A valid object name consists of letters, numbers and the dot or underscore characters. It should start with a letter, or with a dot that is not followed by a number.
▶ Examples of valid and invalid object names are:
  1 object_name2. - valid: contains letters, numbers, dot and underscore.
  2 object_name% - invalid: contains the character '%'. Only dot (.) and underscore are allowed.
  3 2object_name - invalid: starts with a number.
  4 .object_name, object.name - valid: a name can start with a dot (.), but the dot should not be followed by a number.
  5 .2object_name - invalid: the dot is followed by a number.
  6 _object_name - invalid: starts with _, which is not valid.
▶ Object assignment: objects can be assigned values using the <- symbol. For example, x <- 5, y <- 5.2, z <- "CSE5DEV".
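The naming rules above can be checked in R itself. The sketch below uses the base function make.names(), which rewrites a string into a syntactically valid name; a string that comes back unchanged was already valid. The helper is_valid_name() is a hypothetical name introduced only for this illustration.

```r
# Hypothetical helper: a name is valid if make.names() leaves it unchanged
is_valid_name <- function(name) make.names(name) == name

# The candidate names from the list above
candidates <- c("object_name2.", "object_name%", "2object_name",
                ".object_name", "object.name", ".2object_name",
                "_object_name")

# One row per candidate, with TRUE for valid names
data.frame(name = candidates, valid = is_valid_name(candidates))
##            name valid
## 1 object_name2.  TRUE
## 2  object_name% FALSE
## 3  2object_name FALSE
## 4  .object_name  TRUE
## 5   object.name  TRUE
## 6 .2object_name FALSE
## 7  _object_name FALSE

# Assignment with the <- symbol
x <- 5
y <- 5.2
z <- "CSE5DEV"
```

Because make.names() also alters reserved words such as `if`, this check treats them as invalid object names too.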
R object, data type, data structure
▶ Objects are reserved memory locations to store values. They store data of different types, and different types can do different things. The types of the stored values are known as R data types.
▶ In R, the data type can be one of the following:
  1 Logical: TRUE, FALSE.
  2 Integer: 21L, 3L, 3L, ... etc. The letter "L" declares the value as an integer.
  3 Numeric: real or decimal (2.1, 2.0, pi).
  4 Character: "a" or "swc".
  5 Complex: 1 + 0i or 1 + 4i.
  6 Date values: "2021-07-26".
▶ We can use the class() or typeof() function to check the data type of objects.

R object, data type, data structure - Examples of R objects assignment and data types
    # numeric
    x <- 5.5
    class(x)
    ## [1] "numeric"
    # integer
    x <- 200L
    class(x)
    ## [1] "integer"
    # complex
    x <- 6i + 2
    class(x)
    ## [1] "complex"
    # character/string
    x <- "R CSE5DEV"
    class(x)
    ## [1] "character"
    # logical/boolean
    x <- TRUE
    class(x)
    ## [1] "logical"

R object, data type, data structure - R Data Type Conversion
▶ In R, we can convert a value from one type to another using the following functions:
  • as.numeric()
  • as.integer()
  • as.complex()
  • as.Date()
▶ Examples of data type conversion are:
    x <- 2L  # integer
    y <- 4   # numeric
    # convert from integer to numeric:
    a <- as.numeric(x)
    # convert from numeric to integer:
    b <- as.integer(y)
    # print values of x and y
    print(x)
    ## [1] 2
    print(y)
    ## [1] 4
    # print the class name of a and b
    class(a)
    ## [1] "numeric"
    class(b)
    ## [1] "integer"

R object, data type, data structure - R Data Structures
▶ Data structures are used to store data, keep it organised, and enable easy modification and access.
▶ Data structures store a set of related data values and allow us to perform operations or apply functions on these values.
▶ Examples of R data structures are:
1 Vectors.
2 Matrices.
3 Data Frames.
4 Factors.
R object, data type, data structure — R Data Structures: Vectors —
▶ Vectors store a sequence of items (values) of the same type.
▶ We use the c() function to declare a vector; its values are separated by commas.
▶ We can create a vector that combines a set of values as follows:
# Vector of numerical values
numbers <- c(1, 2, 3, 4)
print(numbers)
## [1] 1 2 3 4
# Vector of strings
fruits <- c("apple", "orange", "banana")
print(fruits)
## [1] "apple" "orange" "banana"
# We can create a vector using the colon (:) operator
numbers <- 1:10
print(numbers)
## [1] 1 2 3 4 5 6 7 8 9 10
Some useful functions for vectors:
▶ Vector length: length() returns the number of values.
▶ Sort a vector: sort() sorts values alphabetically or numerically.
▶ Access vectors: use [] brackets to access vector items by index number.
▶ Change an item value: use [index number] to change the value of a specific item.
▶ Repeat vectors: use rep() to repeat vector items.
Examples of vector functions.
# Vector Length
fruits <- c("banana", "apple", "orange")
length(fruits)
## [1] 3
fruits <- c("banana", "apple", "orange", "mango", "lemon")
numbers <- c(13, 3, 5, 7, 20, 2)
# Sort vector
sort(fruits) # Sort strings alphabetically
## [1] "apple" "banana" "lemon" "mango" "orange"
sort(numbers) # Sort numbers
## [1] 2 3 5 7 13 20
# Access Vectors
fruits <- c("banana", "apple", "orange")
# Access the first item (banana)
fruits[1]
## [1] "banana"
# Access the third item (orange)
fruits[3]
## [1] "orange"
# Change an Item
fruits <- c("banana", "apple", "orange", "mango", "lemon")
# Change "apple" to "pear"
fruits[2] <- "pear"
print(fruits)
## [1] "banana" "pear" "orange" "mango" "lemon"
# Repeat Vector
repeat_vec <- rep(c(1,2,3), each = 3)
print(repeat_vec)
## [1] 1 1 1 2 2 2 3 3 3
Figure: Examples of vector operations.
Figure: Functions for vectors.
R object, data type, data structure — R Data Structures: Matrices —
▶ A matrix stores data in a two-dimensional rectangular layout with columns and rows.
▶ A column is a vertical representation of data, while a row is a horizontal representation of data.
▶ We use the matrix() function to create a matrix. We also specify the nrow and ncol parameters to set the number of rows and columns.
# Create a matrix
matr <- matrix(c(1,2,3,4,5,6), nrow = 3, ncol = 2)
print(matr)
##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## [3,]    3    6
Some useful functions for matrices:
▶ Access matrix items: use [] brackets with two index numbers: the first for the row and the second for the column.
▶ Access more than one row or column: use [] and c(): [c(1,2), ] or [, c(1,2)].
▶ Add rows and columns: use cbind() to add columns and rbind() to add rows.
▶ Remove rows and columns: use negative indices with c(): [-c(1), -c(1)].
▶ Check if an item exists: use the %in% operator: item %in% matrix.
▶ Matrix size: dim() returns the number of rows and columns.
▶ Matrix length: length() returns the total number of items in the matrix (rows × columns).
# Access Matrix Items
mart <- matrix(c("apple", "banana", "cherry", "orange"), nrow = 2, ncol = 2)
print(mart[1, 2])
## [1] "cherry"
print(mart[2,])
## [1] "banana" "orange"
# Access More Than One Row
mart <- matrix(c("apple", "banana", "cherry", "orange", "grape", "pineapple", "pear", "melon", "fig"), nrow = 3, ncol = 3)
print(mart[c(1,2),])
##      [,1]     [,2]     [,3]
## [1,] "apple"  "orange" "pear"
## [2,] "banana" "grape"  "melon"
# Access More Than One Column
print(mart[, c(1,2)])
##      [,1]     [,2]
## [1,] "apple"  "orange"
## [2,] "banana" "grape"
## [3,] "cherry" "pineapple"
# Add Rows and Columns
mart <- matrix(c("apple", "banana", "cherry", "orange", "grape", "pineapple", "pear", "melon", "fig"), nrow = 3, ncol = 3)
print(mart)
##      [,1]     [,2]        [,3]
## [1,] "apple"  "orange"    "pear"
## [2,] "banana" "grape"     "melon"
## [3,] "cherry" "pineapple" "fig"
# Add a column with cbind()
newmatrix <- cbind(mart, c("strawberry", "blueberry", "raspberry"))
print(newmatrix)
##      [,1]     [,2]        [,3]    [,4]
## [1,] "apple"  "orange"    "pear"  "strawberry"
## [2,] "banana" "grape"     "melon" "blueberry"
## [3,] "cherry" "pineapple" "fig"   "raspberry"
# Add a row with rbind()
newmatrix <- rbind(mart, c("strawberry", "blueberry", "raspberry"))
print(newmatrix)
##      [,1]         [,2]        [,3]
## [1,] "apple"      "orange"    "pear"
## [2,] "banana"     "grape"     "melon"
## [3,] "cherry"     "pineapple" "fig"
## [4,] "strawberry" "blueberry" "raspberry"
mart <- matrix(c("apple", "banana", "cherry", "orange", "mango", "pineapple"), nrow = 3, ncol = 2)
# Remove the first row and the first column
mart <- mart[-c(1), -c(1)]
print(mart)
## [1] "mango" "pineapple"
# Check if an Item Exists
mart <- matrix(c("apple", "banana", "cherry", "orange"), nrow = 2, ncol = 2)
"apple" %in% mart
## [1] TRUE
# Check the number of rows and columns
mart <- matrix(c("apple", "banana", "cherry", "orange"), nrow = 2, ncol = 2)
dim(mart)
## [1] 2 2
# Matrix Length
mart <- matrix(c("apple", "banana", "cherry", "orange"), nrow = 2, ncol = 2)
length(mart)
## [1] 4
Figure: Examples of matrix operations.
Figure: Functions for matrix.
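The matrix operations named above can be sketched with a few common base R functions. The particular operations shown here are an assumption chosen for illustration, and the matrices A and B are made-up values rather than data from the lecture.

```r
# Two small numeric matrices (matrix() fills column by column by default)
A <- matrix(1:6, nrow = 3, ncol = 2)
B <- matrix(c(10, 20, 30, 40, 50, 60), nrow = 3, ncol = 2)

sum_AB    <- A + B          # element-wise addition
scaled_A  <- A * 2          # multiply every item by a scalar
A_t       <- t(A)           # transpose: 3 x 2 becomes 2 x 3
product   <- t(A) %*% B     # matrix multiplication: (2 x 3) %*% (3 x 2) gives 2 x 2
row_sums  <- rowSums(A)     # sum of each row
col_means <- colMeans(B)    # mean of each column

print(product)
print(row_sums)
print(col_means)
```

The * operator works item by item, while %*% performs matrix multiplication and requires the number of columns of the first matrix to equal the number of rows of the second.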
The following functions can be used to check the data type of each column:
1 is.numeric(): checks if the data is numeric (TRUE or FALSE).
2 is.integer(): checks if the data is integer (TRUE or FALSE).
3 is.factor(): checks if the data is a factor (TRUE or FALSE).
4 is.character(): checks if the data is character (TRUE or FALSE).
R object, data type, data structure — R Data Structures: Factors —
▶ Factors can be used to categorise data and store it as levels.
▶ Factors store both strings and integers: each level is a string label that is stored internally as an integer code.
▶ They are very useful for columns that have a limited number of unique values: Demography {Male, Female}, Music {Rock, Classic, Jazz}, Training {Strength, Stamina}, Logical {True, False}.
▶ We use the factor() function to create a factor, passing a vector c() as an argument.
# Create a factor
music <- factor(c("Jazz", "Rock", "Classic", "Classic", "Pop", "Jazz", "Rock", "Jazz"))
print(music)
## [1] Jazz Rock Classic Classic Pop Jazz Rock Jazz
## Levels: Classic Jazz Pop Rock
Some useful functions for factors:
▶ Levels: use the levels() function to print or set the factor levels.
▶ Factor length: the length() function returns the number of items.
▶ Access factors: use [] brackets to access factor items.
▶ Change item value: use [] with the item index number to change its value.
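To illustrate how a factor holds both strings and integers, the sketch below recreates the music factor and inspects it with base R functions. This is an illustrative addition rather than part of the original slides; the variable names codes, labels and counts are made up for the example.

```r
music <- factor(c("Jazz", "Rock", "Classic", "Classic", "Pop", "Jazz", "Rock", "Jazz"))

codes  <- as.integer(music)    # integer code of each item (position of its level)
labels <- as.character(music)  # string label of each item
counts <- table(music)         # number of items in each level

print(codes)
print(nlevels(music))          # number of distinct levels
print(counts)
```

Because the levels are sorted alphabetically by default, "Classic" receives code 1 and "Rock" receives code 4.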
Examples of factor functions:
# print levels
music <- factor(c("Jazz", "Rock", "Classic", "Classic", "Pop", "Jazz", "Rock", "Jazz"))
print(levels(music))
## [1] "Classic" "Jazz" "Pop" "Rock"
# set the levels
music <- factor(c("Jazz", "Rock", "Classic", "Classic", "Pop", "Jazz", "Rock", "Jazz"), levels = c("Classic", "Jazz", "Pop", "Rock", "Other"))
print(levels(music))
## [1] "Classic" "Jazz" "Pop" "Rock" "Other"
# Factor length
length(music)
## [1] 8
# Access factors
print(music[3])
## [1] Classic
## Levels: Classic Jazz Pop Rock Other
# Change item value
music[4] <- "Pop"
print(music[4])
## [1] Pop
## Levels: Classic Jazz Pop Rock Other
R object, data type, data structure — R Data Structures: Data Frames —
A data frame is the most common and practical way of storing data in R, especially in data analysis.
▶ data.frame shows data in a table format.
▶ data.frame can store different types of data.
▶ Different columns can have different data types. For example, the first column can be numeric, the second character and the third logical.
▶ However, all values within a single column must have the same data type.
▶ We use the data.frame() function to create a data frame.
Note: read.csv()
All files that we import using the read.csv() function are stored as data.frame data structures.
Example: create a data frame consisting of 3 columns and 3 rows.
# Create a data frame
data_frame <- data.frame(
  Training = c("Strength", "Stamina", "Other"),
  ID = c(10, 11, 13),
  Time = c(6.6, 3.2, 4.0)
)
print(data_frame)
##   Training ID Time
## 1 Strength 10  6.6
## 2  Stamina 11  3.2
## 3    Other 13  4.0
Some useful functions for data frames:
▶ Summarise the data: use the summary() function to summarise the data.
▶ Access items: use single brackets [], double brackets [[ ]] or $ to access columns.
▶ Add rows and columns: use rbind() to add rows and cbind() to add columns.
▶ Remove rows and columns: use negative indices with c() to remove rows and columns.
▶ Number of rows and columns: use dim(), or ncol() and nrow(), to find the number of rows and columns.
▶ Data frame length: length() returns the number of columns.
# Summarise the data frame created above
summary(data_frame)
##    Training               ID             Time
##  Length:3           Min.   :10.00   Min.   :3.2
##  Class :character   1st Qu.:10.50   1st Qu.:3.6
##  Mode  :character   Median :11.00   Median :4.0
##                     Mean   :11.33   Mean   :4.6
##                     3rd Qu.:12.00   3rd Qu.:5.3
##                     Max.   :13.00   Max.   :6.6
# Access items
data_frame[1]
##   Training
## 1 Strength
## 2  Stamina
## 3    Other
data_frame[["Training"]]
## [1] "Strength" "Stamina" "Other"
data_frame$Training
## [1] "Strength" "Stamina" "Other"
# Add a new row
# The row is passed as a data frame so that each column keeps its type;
# c("Strength", 110, 11.0) would turn every value into a character string.
New_row_DF <- rbind(data_frame, data.frame(Training = "Strength", ID = 110, Time = 11.0))
print(New_row_DF)
##   Training  ID Time
## 1 Strength  10  6.6
## 2  Stamina  11  3.2
## 3    Other  13  4.0
## 4 Strength 110 11.0
# Add a new column
New_col_DF <- cbind(data_frame, Steps = c(1000, 6000, 2000))
print(New_col_DF)
##   Training ID Time Steps
## 1 Strength 10  6.6  1000
## 2  Stamina 11  3.2  6000
## 3    Other 13  4.0  2000
# Remove the first row and column
Data_Frame_New <- data_frame[-c(1), -c(1)]
print(Data_Frame_New)
##   ID Time
## 2 11  3.2
## 3 13  4.0
# Find the number of rows and columns
print(dim(data_frame))
## [1] 3 3
# Data Frame Length
print(length(data_frame))
## [1] 3
Note: data frame rules
In R, all data frames should respect the following rules.
▶ All column names should be non-empty.
▶ All row names should be unique.
▶ The data stored in data frame columns can be of numeric, factor or character type.
▶ Each column should contain the same number of items, and all items within a column should share one data type.
Example: create a data frame for five employees consisting of employee ID, name, salary and starting date.
# Create employee data frame.
employee <- data.frame(
  employee_id = c(1:5),
  employee_name = c("A","B","C","D","E"),
  employee_salary = c(611.3,512.2,621.0,722.0,343.21),
  start_date = as.Date(c("2014-01-01", "2015-08-23", "2016-10-15", "2016-04-11", "2016-04-26")),
  stringsAsFactors = FALSE)
print(employee)
##   employee_id employee_name employee_salary start_date
## 1           1             A          611.30 2014-01-01
## 2           2             B          512.20 2015-08-23
## 3           3             C          621.00 2016-10-15
## 4           4             D          722.00 2016-04-11
## 5           5             E          343.21 2016-04-26
Please note:
1 In R versions before 4.0.0, the data.frame() function converts character strings to factors (distinct groups) by default. From R 4.0.0 onwards, the default is stringsAsFactors = FALSE.
2 Use stringsAsFactors = FALSE if you want to keep the values as plain strings or change them later.
Example: import and view data
dat <- read.csv("data.csv", header=TRUE, sep =",")
str(dat)
## 'data.frame': 32 obs. of 12 variables:
## $ Model: chr "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive"
## $ mpg  : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl  : int 6 6 4 6 8 6 8 4 4 6 ...
## $ disp : num 160 160 108 258 360 ...
## $ hp   : int 110 110 93 110 175 105 245 62 95 123 ...
## $ drat : num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt   : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec : num 16.5 17 18.6 19.4 17 ...
## $ vs   : int 0 0 1 1 0 1 0 1 1 1 ...
## $ am   : int 1 1 1 0 0 0 0 0 0 0 ...
## $ gear : int 4 4 4 3 3 3 3 4 4 4 ...
## $ carb : int 4 4 1 1 2 1 4 2 2 4 ...
dim(dat)
## [1] 32 12
class(dat)
## [1] "data.frame"
class(dat$Model)
## [1] "character"
class(dat[[2]])
## [1] "numeric"
# Summary of dat
summary(dat)
##     Model                mpg             cyl             disp
##  Length:32          Min.   :10.40   Min.   :4.000   Min.   : 71.1
##  Class :character   1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8
##  Mode  :character   Median :19.20   Median :6.000   Median :196.3
##                     Mean   :20.09   Mean   :6.188   Mean   :230.7
##                     3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0
##                     Max.   :33.90   Max.   :8.000   Max.   :472.0
##        hp             drat             wt             qsec
##  Min.   : 52.0   Min.   :2.760   Min.   :1.513   Min.   :14.50
##  1st Qu.: 96.5   1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89
##  Median :123.0   Median :3.695   Median :3.325   Median :17.71
##  Mean   :146.7   Mean   :3.597   Mean   :3.217   Mean   :17.85
##  3rd Qu.:180.0   3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90
##  Max.   :335.0   Max.   :4.930   Max.   :5.424   Max.   :22.90
##        vs               am              gear            carb
##  Min.   :0.0000   Min.   :0.0000   Min.   :3.000   Min.   :1.000
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000
##  Median :0.0000   Median :0.0000   Median :4.000   Median :2.000
##  Mean   :0.4375   Mean   :0.4062   Mean   :3.688   Mean   :2.812
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000
##  Max.   :1.0000   Max.   :1.0000   Max.   :5.000   Max.   :8.000
Exporting data
The data stored in objects can be exported and saved as text or CSV files using the following functions:
▶ write.table: export a text file: write.table(data_to_export, file = "file_name.txt", sep = " ").
▶ write.csv: export a CSV file: write.csv(data_to_export, file = "file_name.csv"). write.csv() always uses a comma as the separator, so sep is not set; passing it only triggers a warning.
Basics of R Programming
In this lecture, we have learned
1 how to import data into the R environment (RStudio -> RMarkdown).
2 how to view data in R.
3 objects and how to manipulate them.
4 R data types.
5 R data structures.
6 how to export data.
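The import and export steps from this lecture can be combined into one round trip. The sketch below is a minimal illustration, assuming temporary files created with tempfile() so that no real file is overwritten; the names csv_path, txt_path, from_csv and from_txt are made up for the example.

```r
# Small data frame to export (values taken from the lecture's data frame example)
data_frame <- data.frame(
  Training = c("Strength", "Stamina", "Other"),
  ID = c(10, 11, 13),
  Time = c(6.6, 3.2, 4.0)
)

# Temporary file paths, so the sketch does not touch existing files
csv_path <- tempfile(fileext = ".csv")
txt_path <- tempfile(fileext = ".txt")

# Export: row.names = FALSE avoids writing the row numbers as an extra column
write.csv(data_frame, file = csv_path, row.names = FALSE)
write.table(data_frame, file = txt_path, sep = " ", row.names = FALSE)

# Import the files again and inspect the result
from_csv <- read.csv(csv_path, header = TRUE, sep = ",")
from_txt <- read.table(txt_path, header = TRUE, sep = " ")
str(from_csv)
print(dim(from_txt))
```

Reading the files back with the same separator that was used for writing is what keeps the three columns intact.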
End of Week 2
See you Next Lecture (Week 3): Data Wrangling & R Programming
Table: CSE5DEV Timetable
Check LMS
CSE5DEV Syllabus Week-Overview Data Wrangling Basics of R Programming
CSE5DEV DATA EXPLORATION AND ANALYSIS
Week 3
Data Wrangling & R programming
Overview
1 CSE5DEV Syllabus
2 Week-Overview
3 Data Wrangling
4 Basics of R Programming
Subject Syllabus
Lecture 1: Introduction
Lecture 2: Data Collection & R Programming
Lecture 3: Data Wrangling & R Programming
Lecture 4: Data Cleaning & Normalisation
Lecture 5: Data Visualisation
Lecture 6: Data Exploration 1 Analysis
Lecture 7: Data Exploration 2 Analysis
Lecture 8: Data Exploration 3 Analysis
Lecture 9: Correlation & Pattern Discovery Analysis
Lecture 10: Case Study 1
Lecture 11: Case Study 2
Lecture 12: Revision
Data Science Project
Almost all data science and analysis projects require the same set of stages to be performed. These are:
Stage 1: Identify the problem (question)
• What is the goal?
• What do you want to estimate?
• How to track house prices across different areas?
Stage 2: Collect & prepare the data
• Data resources
• Data representation
• Clean and normalise the data
Stage 3: Explore the data
• Descriptive statistics
• Visualisation
• Does the result make sense?
Stage 4: Communicate the results
• What are the findings?
• What did we learn?
• Report the findings
Week 3 Overview
Data Wrangling & R programming
This week covers the basics of Data Wrangling & R programming.
Learning outcomes:
• Learn about data representation.
• Learn how to convert data from one format to another.
• Learn R programming conditional statements.
• Learn how to use R programming packages.
What have we learned so far?
Data can come in different formats, but a computer program expects your data to be organised in a well-defined structure.
—— Theory ——
• Data Collection: working with data
1 Data sources: PC, internet, external.
2 Data formats: text, CSV, URL, etc.
3 Data values: qualitative or quantitative.
4 Data categories: experimental or observational.
—— R Programming ——
1 Install R and RStudio, create an R Markdown file, write and run basic code, etc.
2 Data types and data structures (vector, factor, matrix and data frame).
3 View, access and change data, etc.
4 Import data into the R environment (text and CSV files).
Note
The above steps (reading, viewing, accessing, changing, etc.) are crucial for Lecture 3 to Lecture 11. If you DON'T know how to perform them in R, please let us know as soon as possible.
Data Wrangling
Data wrangling can be defined as the process of organising data in a consistent representation or format that can be easily used and presented.
Figure: CSV file -> R code (import CSV file, view data, data type, data structure, access data) -> RStudio environment: transform data into a readable format.
Example: Consider the country population dataset (data1.csv).
The same data can be organised in different representations, as shown in the next slides.
Example: format-1.
Example: format-2.
Example: format-3.
Example: format-4.
Example: format-5.
From the previous examples, we have seen that
• The same data can be organised in different representations or formats.
• Each format shows the same values of four variables: country, year, population and cases.
• Different formats show the values in different representations.
Q: What type of representation will be used in CSE5DEV labs?
A: Tabular representation (observations-by-features).
Figure: Image from R for Data Science
Tabular representation
In CSE5DEV, we use the data frame data structure.
Figure: Image from R for Data Science
Organising data as observations-by-features is considered the most convenient and standard representation for data analysis.
Tabular data
Types of features/attributes: It is important to recognise the types of values each feature/attribute takes in order to understand which operations make sense for it.
Example
• Can we compute an average eye colour?
• How do we compute the difference between phone numbers?
• Can we say today is 'twice as hot/cold' as yesterday?
This is similar to problems like 6 apples / 4 people = 1.5 apples per person, but 10 people / 4 seats per car = 3 cars (2.5 rounded up, because a car cannot be split).
Tabular data
Qualitative vs. Quantitative attributes: Attribute values can be split into two types:
Qualitative attributes
Attributes that take values from a (finite) set of categories are called categorical or qualitative attributes. In some sense, they describe an object/observation rather than measure its properties.
Quantitative attributes
Attributes that represent quantities are called numerical or quantitative attributes. They provide concrete, quantifiable measurements of an object/observation.
Qualitative: Nominal vs. Ordinal: Qualitative attributes can be split further into two types:
Nominal attributes
Examples: zip codes, eye colour, operating system, gender.
Values of such attributes just specify names without any particular order or relation between them (except for = and ≠). Binary attributes are nominal attributes with only two values (Yes/No or 0/1). They can be symmetric or asymmetric, based on whether or not their values are equally informative.
Ordinal attributes
Examples: ratings, grades, street/avenue numbers.
Values of such attributes have some order, even though they don't specify an exact quantity.
Quantitative: Interval vs. Ratio: Quantitative attributes can also be split into two types:
Interval attributes
Examples: calendar dates, azimuth direction, Fahrenheit temperatures.
Such attributes represent quantities with meaningful differences (or fixed intervals) between their values (but no multiplicative relations).
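The 'twice as hot' question can be made concrete with a small calculation on an interval attribute. The sketch below uses made-up temperatures and a hypothetical helper c_to_f() introduced only for illustration; it shows that differences convert consistently between units while ratios do not.

```r
# Hypothetical helper: convert degrees Celsius to degrees Fahrenheit
c_to_f <- function(celsius) celsius * 9 / 5 + 32

yesterday_c <- 10   # made-up temperatures in degrees Celsius
today_c <- 20

# Ratios depend on the unit, so they carry no meaning for interval attributes
ratio_c <- today_c / yesterday_c                   # ratio on the Celsius scale
ratio_f <- c_to_f(today_c) / c_to_f(yesterday_c)   # ratio on the Fahrenheit scale

# Differences convert consistently: a gap in Celsius maps to a fixed gap in Fahrenheit
diff_c <- today_c - yesterday_c
diff_f <- c_to_f(today_c) - c_to_f(yesterday_c)

print(c(ratio_c = ratio_c, ratio_f = ratio_f))
print(c(diff_c = diff_c, diff_f = diff_f))
```

The same two days give a ratio of 2 in Celsius but a different ratio in Fahrenheit, which is why 'twice as hot' is not a meaningful statement for temperatures measured on these scales.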
Ratio attributes Examples: mass, length, distance, currency, age, electrical current. Such attributes represent quantities that have meaningful ratios between their values. Unlike interval a