COMP5310 Data Science W2 Review PDF
Document Details
Uploaded by Deleted User
The University of Sydney
Maryam Khanian
Tags
Summary
This document is an academic lecture note from the University of Sydney on data science. It covers data cleaning and exploration, using spreadsheets. The lecture covers topics like data types, cleaning, preprocessing and descriptive statistics and visualization in spreadsheets.
Full Transcript
COMP5310: Principles of Data Science W2: Data Cleaning and Exploration (via Spreadsheet) Presented by Maryam Khanian Based on slides by previous lecturers of this unit of study The University of Sydney Page 1 Last week: Introductions and housekeeping Objective...
COMP5310: Principles of Data Science W2: Data Cleaning and Exploration (via Spreadsheet) Presented by Maryam Khanian Based on slides by previous lecturers of this unit of study The University of Sydney Page 1 Last week: Introductions and housekeeping Objective Readings – Housekeeping; Learn about – Data Science from Scratch: Ch 1. backgrounds and goals; Define data – Install Anaconda and PostgreSQL. science. TO-DO in W1 Lecture – Ed Lessons Python modules 1-3. – Welcome, introductions. – Organise into project groups. – Unit overview, assessment, resources. – Choose project dataset. – Discuss definitions/scope of data science. The University of Sydney Page 2 Today: Data cleaning and exploration (via spreadsheet) Objective Readings – Use interactive tools to explore a – Introduction to Data Mining: Ch 2.1.1 new data set quickly. – Data Science from Scratch: Ch 2-3. Lecture Exercises – Data types, cleaning, preprocessing. – Spreadsheets: Visualisation. – Descriptive statistics, e.g., mean, stdev, – Spreadsheets: Descriptive statistics. median. TO-DO in W2 – Descriptive visualisation, e.g., – Ed Lessons Python modules 4-6. scatterplots, histograms. – Ed Lessons SQL modules 16-17. – Explore project data. The University of Sydney Page 3 Exploratory Analysis Workflow The University of Sydney Page 4 Example dataset 2020 Remote Working Survey Responses: https://data.nsw.gov.au/data/dataset/nsw-remote-working-survey The University of Sydney Page 5 PRELIMINARIES: TYPES OF DATA The University of Sydney Page 6 Attributes (features) Class Data is collection of examples (also called Tid Refund Marital Taxable instances, records, observations, objects) Status Income Cheat 1 Yes Single 125K No 2 No Married 100K No Examples or Objects are described with 3 No Single 70K No attributes (features, variables) Examples 4 Yes Married 120K No /Objects 5 No Divorced 95K Yes 6 No Married 60K No 7 Yes Divorced 220K No 8 No Single 85K Yes 9 No Married 75K No 10 No Single 90K Yes 10 The University of Sydney Page 7 Types of Data There are different types of data – Nominal Examples: ID numbers, eye color, zip codes – Ordinal Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height {tall, medium, short} – Interval Examples: calendar dates, temperatures in Celsius or Fahrenheit. – Ratio Examples: temperature in Kelvin, length, counts, elapsed time (e.g., time to run a race) The University of Sydney Page 8 Properties of Data Types – The type of data depends on which of the following properties/operations it possesses: – Distinctness: = ¹ – Order: < > – Differences are + - meaningful : – Ratios are * / meaningful – Nominal data: distinctness – Ordinal data: distinctness & order – Interval data: distinctness, order & meaningful differences – Ratio data: all 4 properties/operations The University of Sydney Page 9 Interval Data – Interval scales provide information about order, and also possess equal intervals. What year were you born? – Values encode differences. 1972 – Equal intervals between values. 1972 – Addition is defined. 1982 1987 – e.g., degrees in Celcius. 1991 – No true Zero and can represent values below zero. (No multiplication or division) The University of Sydney Page 10 Interval Data (cont’) – Central tendency can be measured What year were you born? by mode, median, or mean. 1972 – Dispersion can be estimated by the Inter-Quartile Range (IQR), stdev, 1972 variance 1982 1987 1991 The University of Sydney Page 11 Calculating descriptive statistics – First sort values, then: – Median is the middle value (or average of two middle values). – Minimum is the first value. – Maximum is the last value. – 10th percentile is item at index 0.1*N. – 90th percentile is item at index 0.9*N. – Range is Maximum minus Minimum. – IQR is the difference between the first and third quartile. The University of Sydney Page 12 Interval Data – How to calculate the median for the given output data: [1,1,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,5,5,5,5,5,5,5] the ‘cut-off’ points – How to calculate the IQR: are called quartiles [1,1,2,2,2,2,2,2,2,2,2,3,3,3, 3] [3,3,3,3,3,3,3,3,3,3,3,3,3,3, 3] [3,3,3,3,3,4,4,4,4,4,4,4,4,4, 4] [4,4,4,4,4,4,4,4,5,5,5,5,5,5, 5] – i.e.: Q3 – Q1 = 4 – 3 = 1. The University of Sydney Page 13 Calculating descriptive statistics – Mean is the sum of values divided by the number of values: – Variance: – Standard deviation: The University of Sydney Page 14 Ratio Data How long have you been in your current – All operations supported by job? interval data are good here (Reponses edited for example: scale in years) – Zero defined. 2 years – Multiplication defined. 10 years – Ratio is meaningful. 8 years – e.g., length, weight, income, 4 years degree in Kelvin 45 years – Degree in Celcius is not ratio data The University of Sydney Page 15 Ordinal Data – Values are ordered. My organisation encouraged people to – No distance is implied. work remotely NA – e.g., rank, agreement. Strongly disagree – Central tendency can be Disagree measured by mode or median. Somewhat disagree Neither agree nor disagree – The mean and stdev cannot be Somewhat agree defined from an ordinal set. Agree Strongly agree The University of Sydney Page 16 Ordinal Data – Countable: can assign a positive integer one-to-one to each response. – Order defined: 1. Strongly Disagree 2. Disagree 3. Somewhat Disagree 4. Neither Agree nor Disagree 5. Somewhat Agree 6. Agree 7. Strongly Agree The University of Sydney Page 17 Nominal Data Which of the following best describes your industry? – Values are names. Manufacturing Wholesale Trade – No ordering is implied. Electricity, Gas, Water and Waste Services – e.g., football jersey numbers. Professional, Scientific and Technical Services Transport, Postal and Warehousing The University of Sydney Page 18 What about text data? What do you like about remote work? (Manufactured example) – Not defined as traditional data Avoiding my commute type in statistics. Going to the gym at lunch time – Requires interpretation, coding Staying home with my dog or conversion. Spending lunch with my family – More in future lectures… Peace and quiet while working The University of Sydney Page 19 Self Assessment (1) – Which one(s) of these data types can be ordered? – Nominal – Ordinal – Interval – Ratio – None of them – All of them The University of Sydney Page 20 Self Assessment (2) – Which one(s) of these data types can use multiplication and division operations? – Nominal – Ordinal – Interval – Ratio – None of them – All of them The University of Sydney Page 21 Levels of Measurement Nominal Ordinal Interval Ratio Countable ! ! ! ! Order defined ! ! ! Difference defined (addition, subtraction) ! ! Zero defined (multiplication, division) ! The University of Sydney Page 22 Measures of Central Tendency Nominal Ordinal Interval Ratio Mode ! ! ! ! Median ! ! ! Mean ! ! The University of Sydney Page 23 Measures of Dispersion Nominal Ordinal Interval Ratio Counts / Distribution ! ! ! ! Minimum, Maximum ! ! ! Range ! ! ! Percentiles ! ! ! Standard deviation, Variance ! ! The University of Sydney Page 24 DATA ACQUISITION AND CLEANING The University of Sydney Page 25 Exploratory Analysis Workflow The University of Sydney Page 26 Data Acquisition – Where does data come from? – File Access – You or your organisation might already have a data set, or a colleague provides you access to data. – Web download from an online data server. – Typical exchange formats: CSV, Excel, sometimes also XML. – Programmatically (Cf. Data Science from Scratch, Ch 9) – Scraping the web (HTML). – Using APIs of Web Services (XML/JSON). – Database Access (Week 4 onwards). – Collect data yourself, e.g., via a survey. – This week: Using data from the WFH survey. The University of Sydney Page 27 Acquire data – Create new Excel spreadsheet – Go to your university email. – Click the Spreadsheet button. – File > Open > navigate to WFH survey data. The University of Sydney Page 28 Cleaning and Transforming Data – Real data is often 'dirty’. – Important to do some data cleaning and transforming first. – Typical steps involved: – Type and name conversion. – Filtering of missing or inconsistent data. – Unifying semantic data representations. – Matching entries from different sources. – Later also: – Rescaling and optional dimensionality reduction. The University of Sydney Page 29 Exercise: Reformat and clean data Review and discuss – Any problems with columns in spreadsheet? – How should we fix those problems? Clean – Change any text to numeric values in “Number of years…” columns. – Check format of "Thinking about your current job, how much of your time did you spend remote working last year?" Note that rounding applied on top of underlying data. Is this intentional? The University of Sydney Page 30 a P g e 0 3 WHAT QUESTIONS CAN WE ANSWER? The University of Sydney Page 31 Exploratory Analysis Workflow The University of Sydney Page 32 Exercise: What questions can you ask? – Review WFH Survey data. – List 3 questions you can ask. – Discuss how you would answer each question with this data. The University of Sydney Page 33 Some descriptive questions – What industries do people spend more time WFH? – Do more people who manage than not manage WFH? – In what industries do people with dependents (e.g., children) WFH the most? – Do large organizations encourage more people to WFH? WFH = working from home The University of Sydney Page 34 PIVOT TABLES The University of Sydney Page 35 Table and bar chart of industry Which of the Count of Which of Total following best the following best Wholesale Trade describes your describes your Transport, Postal and Warehousing Retail Trade industry? industry? Rental, Hiring and Real Estate Services Public Administration and Safety Accommodation and Professional, Scientific and Technical… Other Services Food Services 32 Mining Manufacturing Administrative and Information Media and Telecommunications Health Care and Social Assistance Support Services 76 Financial and Insurance Services Electricity, Gas, Water and Waste Services Agriculture, Forestry Education and Training Construction and Fishing 9 Arts and Recreation Services Agriculture, Forestry and Fishing Administrative and Support Services Arts and Recreation Accommodation and Food Services Services 38 0 50 100 150 200 250 300 Construction 56 Total The University of Sydney Page 36 Creating a pivot table – Summarize data by calculating statistics over sub-populations. – e.g., count of industry by name. – In Excel: – Select data range (e.g., C1:En) – Go to Insert > Pivot Table (should insert a new sheet). – Select industry under row. – Select industry under value. – Summarize by count. The University of Sydney Page 37 Exercise: Using a pivot table to summarise data – Pivot table – Create a table of average age by industry. – Discuss/explore – What other statistics can we calculate? – What other variable combinations could we explore? The University of Sydney Page 38 SUMMARISING NOMINAL DATA The University of Sydney Page 39 Summarise nominal data with bar charts Total Wholesale Trade Transport, Postal and Warehousing Retail Trade – Measures of central tendency: Rental, Hiring and Real Estate Services Public Administration and Safety Professional, Scientific and Technical Services Other Services Mining – Mode. Manufacturing Information Media and Telecommunications Health Care and Social Assistance – Measures of dispersion: Financial and Insurance Services Electricity, Gas, Water and Waste Services Education and Training – Counts/distribution Construction Arts and Recreation Services Agriculture, Forestry and Fishing Administrative and Support Services Accommodation and Food Services 0 50 100 150 200 250 300 Total The University of Sydney Page 40 Calculating the Mode – The most frequent value. – Defined for nominal data, but spreadsheets might not compute. – Can be read from a bar chart. The University of Sydney Page 41 Create Bar Charts – Count frequency of each category. – Display on bar chart. – In Excel – Needs a column of responses and a column of counts (can be aggregated in a pivot table). – Select data range (e.g., A2:B20). – Insert > Bar Chart. The University of Sydney Page 42 Exercise: Exploring nominal data – Visualise – Create histograms of current and desired industries (synthetic data). – Discuss – What do we need to do to make these comparable? – What is the mode? The University of Sydney Page 43 Bar charts comparing known and future industries Wholesale Trade Wholesale Trade Transport, Postal and Warehousing Transport, Postal and Warehousing Retail Trade Retail Trade Rental, Hiring and Real Estate Services Rental, Hiring and Real Estate Services Public Administration and Safety Public Administration and Safety Professional, Scientific and Technical… Professional, Scientific and Technical… Other Services Other Services Mining Mining Manufacturing Manufacturing Information Media and… Information Media and… Health Care and Social Assistance Health Care and Social Assistance Financial and Insurance Services Financial and Insurance Services Electricity, Gas, Water and Waste… Electricity, Gas, Water and Waste… Education and Training Education and Training Construction Construction Arts and Recreation Services Arts and Recreation Services Agriculture, Forestry and Fishing Agriculture, Forestry and Fishing Administrative and Support Services Administrative and Support Services Accommodation and Food Services Accommodation and Food Services 0 50 100 150 200 250 300 0 50 100 150 200 250 300 350 Count of Current Industry Count of Desired Industry Discuss: Do modes differ? Ranges? Number of responses? The University of Sydney Page 44 SUMMARISING ORDINAL DATA The University of Sydney Page 45 Summarise ordinal data: histograms, median, percentiles – Measures of central tendency: – Median, mode. – Measures of dispersion: – Counts/distribution. – Min/max/range. – Percentiles. The University of Sydney Page 46 Creating a Histogram chart – Count frequency, e.g., of ordinal values within each category – Display on histogram chart with one variable grouped inside – In Excel: – Needs a column of responses and a column of counts (can be aggregated in a pivot table). – Insert > Pivot Table > Select full range of data in the spreadsheet > Drag and drop column name that holds response data into the Rows and Values field (check Value is set to Count). – Select data range from Pivot Table (e.g., A2:B20). – Insert > Column Chart. The University of Sydney Page 47 Exercise: Exploring ordinal data – Visualise – Create a histogram diagram of the question “What year were you born? " – Discuss – What do the responses "1900" mean? – Does this reflect underlying working population distribution or are some age groups more well-represented in the survey data? The University of Sydney Page 48 SUMMARISING RATIO DATA: How do professional/programming experience compare? The University of Sydney Page 49 Ratio (and interval) data – Measures of central tendency: – Mean, median, mode – Measures of dispersion: – Counts/distribution – Min/max/range – Percentiles – Stdev/variance The University of Sydney Page 50 Creating a Scatterplot – Plots relationship between two different variables. – Display, e.g., professional experience on x-axis vs. programming experience on y-axis for each respondent. – In Excel: – Select data range (e.g., D1:En). – Insert > Scatter. The University of Sydney Page 51 REVIEW The University of Sydney Page 52 W2 Review: Data cleaning and exploration (via spreadsheet) Objective Readings – Use interactive tools to explore a – Introduction to Data Mining: Ch 2.1.1 new data set quickly. – Data Science from Scratch: Ch 2-3. Lecture Exercises – Data types, cleaning, preprocessing. – Spreadsheets: Visualisation. – Descriptive statistics, e.g., mean, stdev, – Spreadsheets: Descriptive stats. median. TO-DO in W2 – Descriptive visualisation, e.g., – Ed Lessons Python modules 4-6. scatterplots, histograms. – Ed Lessons SQL modules 16-17. – Explore project data. The University of Sydney Page 53