University of Sydney
DATA1002_1902
DATA1002/1902 Informatics: Data and Computation
1B: Data Science Activities
Dr. Josiah Poon
School of Computer Science
Image source: https://www.pngegg.com/en/png-pylpy/download

COMMONWEALTH OF AUSTRALIA
Copyright Regulations 1969
WARNING
This material has been reproduced and communicated to you by or on behalf of the University of Sydney pursuant to Part VB of the Copyright Act 1968 (the Act). The material in this communication may be subject to copyright under the Act. Any further reproduction or communication of this material by you may be the subject of copyright protection under the Act. Do not remove this notice.

Acknowledgement of Country
We acknowledge the traditional custodianship and law of the Country on which the University of Sydney campuses stand. We pay respects to those who have cared and continue to care for the Country.

Lecture in this Session
Data science work cycle
Data and metadata
Ethics and data science

The main Data Science (DS) activities¹
1) Problem formulation
2) Data acquisition
3) Data exploration
   − Data quality checks and cleaning
   − Summaries
4) Analysis
   − Summaries
   − Predictive model
5) Communication of results
6) Take action
7) Go to (1) if necessary
¹ There are several different ways people describe the work done in data science projects; this is just one view.
Problem formulation
Decide what the goal is, for the work
Decide what success looks like
Often, this involves negotiation and conversation with whoever is paying for the work – they are often not quite sure at first what they want, nor what is feasible

Problem formulation – Some example goals
Sociologist wants to know which demographic and economic factors influence choice of school for children
Farmer wants advice on what fertilizer to use, to maximize crop yield
Bank wants to automatically flag some credit card purchases as potentially fraudulent, to delay payment till checks have been made
Biologist wants to be able to find out which species of micro-organism are present in a location, given a list of protein fragments found in an environmental sample
Doctor wants to determine whether a patient is likely to have a particular disease, given results of tests (none of which is perfect)
Designer wants a car that brakes automatically when a pedestrian steps in front

Problem formulation – Types of goal
Understand/Explain
   − for scholarship or interest
   − as a guide to action
Predict then Respond
Automatically Predict then Automatically Respond

Problem formulation – Some vital features to consider
Does a human need to make sense of the outcome?
What is the damage if the data scientist gets things wrong?
   − quite different if there is a subsequent check, than if action happens directly; but also depends on domain
Are we summarizing a (more-or-less) unchanging situation, or are we expecting to handle new cases as they arise?

Problem formulation – Class activity: match up
Example goal
1. know which demographic and economic factors influence choice of school for children
2. advice on what fertilizer to use, to maximize crop yield
3. automatically flag some credit card purchases as potentially fraudulent, to delay payment till checks have been made
4. find out which species of micro-organism are present in a location, given a list of protein fragments found in an environmental sample
5. determine whether a patient is likely to have a particular disease, given results of tests (none of which is perfect)
6. a car that brakes automatically when a pedestrian steps in front
Types of goal
Understand/Explain
   − for scholarship or interest
   − as a guide to action
Predict then Respond
Automatically Predict then Automatically Respond

Data acquisition
Decide what data you would like to have, to address the goal
Find what data you can get
   − Not always the same as what you want
   − Often, need to use “proxy data”
   − Eg you want to know how many students are working more than 15 hours per week, but all you have is their earnings from employment
Get the data
Put it somewhere in file(s), so you can work with it with computational tools
   − Then get it into a chosen tool (we say: “ingest” the data)
   − Much more about this, in upcoming lectures on data management

Data acquisition – Data sources
Original sources (these all will contain errors!):
   − sensors (measure the world)
   − surveys (ask people)
   − digital logs (track IT activities)
Secondary sources
   − other scholars, organizations, etc
   − data may already be summarized, transformed, cleaned, etc
Combining sources (we say: “data integration”)

Data acquisition – Examples of datasets
Census
   − raw data has individual level demographics etc
   − available summaries combine these into counts in a suburb etc
Crop observations
   − many plantings, with many features (seed type, date, weather, soil, fertilizer etc), and resulting crop yields
Credit card histories
   − lots of transactions of many users, with many features, some transactions were reported as fraudulent
Medical records
   − lots of patients, their test results, diagnoses

Data exploration
Learn about the contents of the dataset
How is it structured?
What is the meaning of the different features?
   − eg is temperature the daily maximum, monthly average, at some specific time?
   − is income measured in actual dollars or inflation-adjusted ones?
Are there data quality concerns?
   − see later data management lectures
   − when there are, do data cleaning!
Can you find patterns connecting different features?

Data exploration – Quality checks and cleaning
Are there missing data?
Are the values within the data field range?
If data come from different sources, are they using the same format, and are they consistent?
If data have to be joined from different sources, are there ways to recreate unique entries?

Data exploration - Tools
Ease-of-use is crucial here
   − precision is less important
Perhaps work with a subset of the data, to keep the effort manageable
Usually, produce summaries and plots, get an idea, refine the processing (narrow focus to specific subsets, add more features, etc)

Data exploration - Summaries
Many different ways to summarize some aspect of a dataset
Eg: distribution of values in an attribute
Eg: table of count/average/etc for cases in different categories

Wind direction   Number of days   Average speed
SSE              4                13.4
E                3                16.7
WNW              7                30.2
etc
This is hypothetical data, not real

Eg: plot showing how two attributes occur for different items

Analysis
Decide what calculations to do that will provide an answer (or assistance, at least) towards the goal
Do the calculations
   − precision is crucial here
Validate what is produced
Interpret the results in terms of the goal
Source image: book cover images © O’Reilly, MIT Press

Analysis - Output
Tables or charts of summaries
   − Similar to what is done in exploration, but now final, precise, and done to make a carefully designed case
Or a predictive model

Analysis - Models
Most often, the output of analysis is a model of the domain
   − a mathematical or computational formula, that describes what (you think) is happening
“All models are wrong, but some models are useful.” – George Box (1919-2013)
Use the model to predict in a previously unseen case
Use aspects of the model for explanation, or to suggest interventions
   − Eg (hypothetical) “every year of extra preschool for a low SES child, increases their likely future earnings by $X/yr”
   − Eg (hypothetical) “exposure to second-hand cigarette smoke increases risk of lung cancer by Y%”
   − Warning: these explanations are often very misleading
      − model is not reality
      − average is not applicable in any particular case

Analysis – Model construction
Typically, the data scientists first choose a category of models
   − perhaps based on prior knowledge of the domain, but perhaps based on the techniques the scientists know how to use!
   − eg predicted feature is a linear combination of the available known features
   − eg decision tree (computer program that splits cases based on value of one feature at a time)
Then, use dataset to determine the “best” model in this category
   − eg least-squares regression to pick the coefficients in the linear combination
Lots more in later data-science lectures

Communication
Let others know what the data scientist found out
Many different stakeholders
   − client (who paid for the work), peers, subjects (their data was used), regulators, general public
Usually done through a variable mix of written reports, spoken presentations, websites
   − almost always, illustrated by charts (maybe interactive)

Take action (response)
Based on analysis, build a computer-hosted system that examines future cases, and takes appropriate action
   − deploy this system in use
Typically, build a predictor, that looks at features of the new case, and calculates what this is likely to be, and then flows into other systems
   − eg when credit card transaction shows as likely to be fraudulent, then halt payment and send alerts
   − eg when car’s cameras show image that might be a pedestrian, then start braking

Again and again (Do you remember Step 7?)
DS work is iterative
Within an activity type, lots of small revisions based on what happened the previous time
   − eg produce a chart; show it to someone and see how they (mis)understand the point or get confused or ask for clarification; adjust the chart. “rinse and repeat”
Move back to revisit an earlier activity type, after you see what happens downstream
   − Eg after exploring data and seeing some common errors, seek to acquire a different/better data set

Lecture in this Session
Data science work cycle
Data and metadata
Ethics and data science

Example (partial screenshot from bom.gov.au)
From http://www.bom.gov.au/climate/dwo/201907/html/IDCJDW2124.201907.shtml

Metadata
Metadata is facts about the data
Examples include explanation of data meaning, description of data format, information about data source, applicable policy, statistical models/summaries of data
Metadata is itself valuable and so worthy of being managed
Without knowing metadata, users can’t make any sensible use of the data itself

Data meaning
Each field in a data item needs context to interpret it properly
Eg in meteorology dataset, is this number a temperature, or a rainfall or a windspeed
   − if temperature, is it daily maximum, temperature at a particular instant, average calculated somehow?
   − is it in Fahrenheit or Celsius, or some other representation?
Are there special values used for missing data, out of range data, etc?
   − eg -1 or 99 often used this way
   − eg “FNU” in US passport of someone without a first name
   − eg geolocation data that uses midpoint of state if nothing more specific is known

Example (partial screenshot from bom.gov.au)
From http://www.bom.gov.au/climate/dwo/201907/html/IDCJDW2124.201907.shtml

More metadata from bom.gov.au
This is often called a “data dictionary”
From http://www.bom.gov.au/climate/dwo/IDCJDW0000.shtml

Data format
How is the data arranged?
Eg rows, with all data from one observation on each row
   − In a file, this might have fields separated by “,”
or, multiple rows per observation
   − Eg “repeating groups”
   − first row in a group gives time and place of the observation
   − then multiple rows, each with name of attribute “:” value of measurement
or tree structure with explicit description
   − eg JSON [see later lecture]
or, …

Schema
In many datasets, the format is a collection of observations which all have the same structure
   − eg “Date-time, location, temperature, rainfall”
This format can be documented in a header row
   − common in many settings
This description of the format, along with information on the type of each attribute, is called a schema for the data

Schema choices
In some settings the schema is known in advance
   − eg decided in a design process, or determined by the data source
Sometimes, data comes with a schema attached
In other cases, the schema is discovered by looking at the dataset and recognizing a pattern
   − this should always be checked with domain experts
Sometimes, data mostly fits a schema, but there are exceptional data items

Data origin
“Provenance” is a term indicating metadata about origin of the data
It is important to know where data has come from
   − if collected: when, where, by whom
   − if from a secondary source: when, from whom
   − if computed: when, by what process, using what inputs
Based on origin, there may be restrictions on use
   − license rules
   − human ethics rules

Example of provenance metadata from bom.gov.au page

Lecture in this Session
Data science work cycle
Data and metadata
Ethics and data science

Ethical concerns with data science
Is the process done appropriately?
Is the outcome used appropriately?
In DATA1002, we don’t dictate your moral values
But, we want you to recognize that there are issues that need to be decided, and that have implications

Ethics in DS process
Rights to use datasets
   − obtaining permission, licensing rules, etc
Care of the datasets
   − risk of privacy breach if you hold personal data
   − anonymisation is not always perfect
Competence with the tools
   − too often, analysis is done without adequate awareness of the limitations of the approach chosen
Built-in bias
   − eg voice recognition that doesn’t handle many accents
   − eg medical datasets that have many more men than women (so predictions may be inaccurate for women)

Ethics in DS outputs
Conclusions and automated systems can be used for good, or for ill
   − any professional has some responsibility for uses made of their work, if they can reasonably expect those uses to happen
Eg Volkswagen detected the pattern of use in air-quality tests, and switched to lower-polluting, lower-performance mechanisms just while the test was on
   − what would DS team think, when asked to build a system to detect occurrence of these tests?

Summary
Activities done in data science work, and how they typically occur
1. Problem formulation
2. Data acquisition
3. Data exploration
4. Analysis
5. Communication of results
6. Take action automatically
Data and metadata
   − Data meaning, Data format (schema), Provenance
Ethics in data science work
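
The metadata ideas recapped in the summary (data meaning, schema, special values for missing data) and the quality checks from data exploration can be made concrete with a small sketch. The Python below is only an illustration under stated assumptions: the field names, the sentinel value -1 for missing rainfall, the plausible temperature range, and the sample rows are all hypothetical and are not taken from the bom.gov.au data shown in the lecture.

```python
import csv
import io

# Hypothetical schema: attribute name -> type converter (an assumption for illustration).
SCHEMA = {"date": str, "location": str, "temp_c": float, "rain_mm": float}

# Assumed metadata: -1 marks a missing rainfall reading, and temperatures
# outside this range are treated as implausible.
MISSING_RAIN = -1.0
TEMP_RANGE = (-30.0, 60.0)

# Made-up sample data; the header row documents the format.
RAW = """date,location,temp_c,rain_mm
2019-07-01,Sydney,17.5,0.0
2019-07-02,Sydney,170.5,2.4
2019-07-03,Sydney,16.1,-1
"""


def load(text):
    """Parse CSV text, check the header against SCHEMA, and convert each field."""
    reader = csv.DictReader(io.StringIO(text))
    if reader.fieldnames != list(SCHEMA):
        raise ValueError(f"header {reader.fieldnames} does not match the schema")
    return [{name: conv(row[name]) for name, conv in SCHEMA.items()} for row in reader]


def quality_issues(rows):
    """Return (row_index, description) pairs for missing or out-of-range values."""
    issues = []
    low, high = TEMP_RANGE
    for i, row in enumerate(rows):
        if row["rain_mm"] == MISSING_RAIN:
            issues.append((i, "rain_mm missing"))
        if not low <= row["temp_c"] <= high:
            issues.append((i, "temp_c out of range"))
    return issues


rows = load(RAW)
issues = quality_issues(rows)
print(issues)
```

Keeping the sentinel value and the valid range as named constants next to the schema records the metadata in one place, so anyone reusing the data can see how missing and implausible values were interpreted.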