Module 4.1 - Data Science PDF

Summary

This document provides an overview of Data Science, covering topics such as large-scale data, expectations, and opportunities. It also examines the various skill sets and challenges associated with Data Science, including developing wisdom, curiosity, and the ability to ask good questions.

Full Transcript


Data Science

Large-scale Data is Everywhere!
• There has been enormous data growth in both commercial and scientific databases due to advances in data generation and collection technologies.
• New mantra: gather whatever data you can, whenever and wherever possible.
• Expectations: gathered data will have value, either for the purpose collected or for a purpose not envisioned.
[Figure labels: E-Commerce; Cyber Security; Social Networking: Twitter; Traffic Patterns; Sensor Networks; Computational Simulations]

Why Data Science? Commercial Viewpoint
• Lots of data is being collected and warehoused
  – Web data: Google has petabytes of web data; Facebook has billions of active users
  – Purchases at department/grocery stores and e-commerce: Amazon handles millions of visits/day
  – Bank/credit card transactions
• Computers have become cheaper and more powerful
• Competitive pressure is strong
  – Provide better, customized services for an edge (e.g. in Customer Relationship Management)

Why Data Science? Scientific Viewpoint
• Data is collected and stored at enormous speeds
  – Remote sensors on a satellite: NASA EOSDIS archives petabytes of earth science data per year
  – Telescopes scanning the skies: sky survey data
  – High-throughput biological data
  – Scientific simulations: terabytes of data generated in a few hours
• Data science helps scientists
  – in automated analysis of massive datasets
  – in hypothesis formation
[Figure labels: fMRI Data from Brain; Sky Survey Data; Gene Expression Data; Surface Temperature of Earth]

Great Opportunities to Solve Society’s Major Problems
• Improving health care and reducing costs
• Predicting the impact of climate change
• Finding alternative/green energy sources
• Reducing hunger and poverty by increasing agriculture production

What is Data Science?
Like any emerging field, it isn’t yet well defined, but incorporates elements of:
• Exploratory Data Analysis and Visualization
• Machine Learning and Statistics
• High-Performance Computing technologies for dealing with scale
Skill Sets for Data Science

Appreciating Data
Computer scientists do not naturally appreciate data: it’s just stuff to run through a program. The usual way to test algorithm performance is to run the implementation on “random data”. But interesting data sets are a scarce resource, which requires hard work and imagination to obtain.

Computer vs. Real Scientists (1)
Scientists strive to understand the complicated and messy natural world, while computer scientists build their own clean and organized virtual worlds. Thus: nothing is ever completely true or false in science, while everything is either true or false in Computer Science / Mathematics.

Computer vs. Real Scientists (2)
• Scientists are data-driven, while computer scientists are algorithm-driven.
• Scientists obsess about discovering things, while computer scientists invent rather than discover.
• Scientists are comfortable with the idea that data has errors; computer scientists are not.

Genius vs. Wisdom
• Software developers are hired to produce code. Data scientists are hired to produce insights.
• Genius shows in finding the right answer! Wisdom shows in avoiding the wrong answers.
• Data science (like most things) benefits more from wisdom than from genius.

Developing Wisdom
• Wisdom comes from experience.
• Wisdom comes from general knowledge.
• Wisdom comes from listening to others.
• Wisdom comes from humility, observing how often you have been wrong and why/how.
I seek to pass on wisdom, through my experience of the difficulty of making good predictions.

Developing Curiosity
• The good data scientist develops a curiosity about the domain/application they are working in.
• They talk shop with the people whose data they are working on.
• They read the newspaper every day, to get a broader perspective on the world.

Asking Good Questions
Software developers are not encouraged to ask questions, but data scientists are:
• What exciting things might you be able to learn from a given data set?
• What things do you/your people really want to know?
• What data sets might get you there?

Let’s Practice Asking Questions!
Who, What, Where, When, and Why on the following datasets:
• Baseball-reference.com
• Google ngrams
• NYC taxi cab records

Baseball-Reference.com: Biosketch

Statistical Record of Play
Summary statistics of each year’s batting, pitching, and fielding record, with teams and awards.

Baseball Questions
• How best to measure an individual player’s skill, value, or performance?
• How fairly do trades between teams work out?
• What is the trajectory of players’ performance as they mature and age?
• To what extent does batting performance correlate with the position played?

Demographic Questions
• Do left-handed people have shorter lifespans than right-handers?
• How often do people return to where they were born?
• Do player salaries reflect past, present, or future performance?
• Are heights and weights increasing in the population?

Google Ngrams
• Presents an annual time series of how frequently every “popular” word/phrase of 1 to 5 words occurs in scanned books.
• “Popular” means it appears more than 40 times in total.
• Google has scanned about 15% of all books ever published, making this resource quite comprehensive.

Google Ngram Viewer

Ngram Questions
• How has the amount of cursing changed over time?
• What is the lifespan of fame and technologies? Is it increasing/decreasing?
• How often do new words emerge? Do they stay in common usage?
• What words are associated with other words, i.e. can you build a language model?

NYC Taxi Cab Data
• Gives driver/owner, pickup/dropoff location, and fare data for every taxi trip taken.
• Data obtained from NYC via a Freedom of Information Act (FOIA) request.

Taxicab Questions
• How much do drivers make each night?
• How far do they travel?
• How much slower is traffic during rush hour?
• Where are people traveling to/from at different times of the day?
• Do faster drivers get tipped better?
• Where should drivers go to pick up their next fare?
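One of the taxicab questions, how much slower traffic is during rush hour, can be turned into a small computation once trip records are in hand. The following is a minimal sketch only: the trip values, the field layout, and the rush-hour window are all assumptions made up for illustration, not taken from the NYC data set.

```python
from statistics import mean

# Hypothetical trip records: (pickup_hour, distance_miles, duration_minutes).
# These values are invented for illustration; they are not real NYC taxi data.
TRIPS = [
    (8, 2.0, 20.0),
    (9, 3.0, 30.0),
    (17, 1.5, 15.0),
    (13, 4.0, 20.0),
    (23, 6.0, 18.0),
    (2, 5.0, 15.0),
]

# Assumed definition of rush hour: pickups at 7-9 and 16-18.
RUSH_HOURS = {7, 8, 9, 16, 17, 18}


def speed_mph(distance_miles, duration_minutes):
    """Average speed of one trip in miles per hour."""
    return distance_miles * 60.0 / duration_minutes


def average_speeds(trips, rush_hours=RUSH_HOURS):
    """Return (mean rush-hour speed, mean off-peak speed); None if a group is empty."""
    rush, off_peak = [], []
    for hour, distance, minutes in trips:
        if minutes <= 0:
            continue  # skip records with unusable durations
        group = rush if hour in rush_hours else off_peak
        group.append(speed_mph(distance, minutes))
    rush_mean = mean(rush) if rush else None
    off_peak_mean = mean(off_peak) if off_peak else None
    return rush_mean, off_peak_mean


rush_mean, off_peak_mean = average_speeds(TRIPS)
print(f"rush hour: {rush_mean:.1f} mph, off-peak: {off_peak_mean:.1f} mph")
```

On real records the same grouping idea would apply, but the data would first need cleaning (for example, trips with zero duration or implausible distances).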
Machine Learning Tasks …

Predictive Modeling: Classification
• Find a model for the class attribute as a function of the values of other attributes.
[Figure: model for predicting credit worthiness, with the class attribute marked]

Classification Example
[Figure: a training set is used to learn a model (classifier), which is then applied to a test set]
Source: Introduction to Data Mining, 2nd Edition, Tan, Steinbach, Karpatne, Kumar (09/09/2020)

Examples of Classification Task
• Classifying credit card transactions as legitimate or fraudulent
• Classifying land covers (water bodies, urban areas, forests, etc.) using satellite data
• Categorizing news stories as finance, weather, entertainment, sports, etc.
• Identifying intruders in cyberspace
• Predicting tumor cells as benign or malignant
• Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil

Classification: Application 1
• Fraud Detection
  – Goal: Predict fraudulent cases in credit card transactions.
  – Approach:
    • Use credit card transactions and the information on the account holder as attributes.
      – When does a customer buy, what does he buy, how often he pays on time, etc.
    • Label past transactions as fraud or fair transactions. This forms the class attribute.
    • Learn a model for the class of the transactions.
    • Use this model to detect fraud by observing credit card transactions on an account.

Classification: Application 2
• Churn prediction for telephone customers
  – Goal: To predict whether a customer is likely to be lost to a competitor.
  – Approach:
    • Use detailed records of transactions with each of the past and present customers to find attributes.
      – How often the customer calls, where he calls, what time of the day he calls most, his financial status, marital status, etc.
    • Label the customers as loyal or disloyal.
    • Find a model for loyalty.

Classification: Application 3
• Sky Survey Cataloging
  – Goal: To predict class (star or galaxy) of sky objects, especially visually faint ones, based on the telescopic survey images (from Palomar Observatory).
    – 3000 images with 23,040 x 23,040 pixels per image.
  – Approach:
    • Segment the image.
    • Measure image attributes (features): 40 of them per object.
    • Model the class based on these features.
    • Success story: could find 16 new high red-shift quasars, some of the farthest objects that are difficult to find!

Classifying Galaxies
Courtesy: http://aps.umn.edu
• Class: stages of formation (Early, Intermediate, Late)
• Attributes: image features, characteristics of light waves received, etc.
• Data size: 72 million stars, 20 million galaxies
  – Object Catalog: 9 GB
  – Image Database: 150 GB

Regression
• Predict a value of a given continuous-valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency.
• Extensively studied in the statistics and neural network fields.
• Examples:
  – Predicting sales amounts of a new product based on advertising expenditure.
  – Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
  – Time series prediction of stock market indices.

Clustering
• Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups.
[Figure: intra-cluster distances are minimized; inter-cluster distances are maximized]

Applications of Cluster Analysis
• Understanding
  – Customer profiling for targeted marketing
  – Group related documents for browsing
  – Group genes and proteins that have similar functionality
  – Group stocks with similar price fluctuations
• Summarization
  – Reduce the size of large data sets
[Figure, courtesy Michael Eisen]
[Figure: Clusters for Raw SST and Raw NPP, plotted by latitude and longitude; legend: Land Cluster 1, Land Cluster 2, Ice or No NPP, Sea Cluster 1, Sea Cluster 2]
Use of K-means to partition Sea Surface Temperature (SST) and Net Primary Production (NPP) into clusters that reflect the Northern and Southern Hemispheres.
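The K-means partitioning mentioned for the SST/NPP figure can be sketched in a few lines. This is a minimal illustration of the standard Lloyd iteration (assign each point to its nearest centroid, then move each centroid to the mean of its members), not the procedure used to produce the figure. The points and starting centroids below are toy values chosen for illustration, and the function name `kmeans` is hypothetical.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm: alternate an assignment step and a centroid update step."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(iterations):
        # Assignment step: label each point with the index of its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: move each centroid to the mean of the points assigned to it.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid in place
        if new_centroids == centroids:
            break  # converged: centroids stopped moving
        centroids = new_centroids
    return labels, centroids


# Toy 2-D data (made up): two visually separate groups.
POINTS = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centroids = kmeans(POINTS, centroids=[(1.0, 1.0), (8.0, 8.0)])
print(labels)
print(centroids)
```

The result depends on the starting centroids; in practice the algorithm is usually run several times with different initializations and the best clustering is kept.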
Clustering: Application 1
• Market Segmentation:
  – Goal: subdivide a market into distinct subsets of customers where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.
  – Approach:
    • Collect different attributes of customers based on their geographical and lifestyle related information.
    • Find clusters of similar customers.
    • Measure the clustering quality by observing buying patterns of customers in same cluster vs. those from different clusters.

Clustering: Application 2
• Document Clustering:
  – Goal: To find groups of documents that are similar to each other based on the important terms appearing in them.
  – Approach: To identify frequently occurring terms in each document. Form a similarity measure based on the frequencies of different terms. Use it to cluster.
[Figure: Enron email dataset]

Deviation/Anomaly/Change Detection
• Detect significant deviations from normal behavior
• Applications:
  – Credit Card Fraud Detection
  – Network Intrusion Detection
  – Identify anomalous behavior from sensor networks for monitoring and surveillance.
  – Detecting changes in the global forest cover.
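"Significant deviation from normal behavior" can be made concrete in many ways. The sketch below uses one simple option, a z-score threshold against a baseline of past values; this technique is chosen here for illustration and is not prescribed by the slides. The amounts, the threshold of 3 standard deviations, and the function name `find_anomalies` are all assumptions.

```python
from statistics import mean, stdev


def find_anomalies(baseline, observations, threshold=3.0):
    """Return observations lying more than `threshold` standard deviations
    from the mean of the baseline (values assumed to reflect normal behavior)."""
    mu = mean(baseline)
    sigma = stdev(baseline)  # sample standard deviation; needs at least 2 values
    if sigma == 0:
        # A constant baseline: anything different counts as a deviation.
        return [x for x in observations if x != mu]
    return [x for x in observations if abs(x - mu) / sigma > threshold]


# Hypothetical card transaction amounts (made up for illustration).
NORMAL_HISTORY = [20.0, 22.0, 19.0, 21.0, 18.0, 20.0]
NEW_TRANSACTIONS = [21.0, 19.5, 95.0, 23.0]

print(find_anomalies(NORMAL_HISTORY, NEW_TRANSACTIONS))  # [95.0]
```

Keeping the baseline separate from the new observations avoids a large outlier inflating the very mean and standard deviation it is being compared against.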
Motivating Challenges
• Scalability
• High Dimensionality
• Heterogeneous and Complex Data
• Data Ownership and Distribution
• Non-traditional Analysis

DS Career Path

Introduction
• Graduates of a data science program will mostly, and preferably, work as Data Scientists.
• Data Scientists can work in any type of organization:
  – Private
  – Governmental
  – Not-for-Profit

Industries
Any organization can benefit from the data it has, so data scientists can work in any industry:
  – Financial institutions (e.g., banks)
  – Government agencies (e.g., Civil Status and Passports Department and Police Department)
  – Healthcare (e.g., hospitals)
  – Online platforms (e.g., Uber)
  – Large retailers (e.g., Carrefour and Amazon)
  – Agricultural companies
  – And many more …

Data Scientist Responsibilities
• Data scientists usually need to build models from verified and validated data sets.
• These models will be used by the employer to predict, recommend, or evaluate any future business decision.

Data Scientist Responsibilities
• For example, a data scientist working for a hospital can build a data model that predicts the best treatment for a specific patient.
• The data scientist will use the data that was collected by the hospital about the patients and the treatments that worked and did not work for them in the past.
Data Scientist Responsibilities
• As another example, a data scientist working for the police department can build a data model that predicts the location and time of the next crime before it happens.
• The data scientist will use the data that was collected by the police department about previous crimes to build the proposed model.

Data Scientist Responsibilities
• As another example, a data scientist working for a large retailer can build a data model that predicts the demand for certain products and services.
• The data scientist will use the data that was collected by the retailer about previous purchasing transactions.
• The data scientist may also use data that is provided by external entities.

Data Scientist Responsibilities
• Before building the model, data scientists usually need to clean and normalize the data.
• Data could be collected from internal and/or external sources.
• Data scientists need to communicate with the data management staff to make sure that the necessary data is being collected.
  – The data compliance department should be involved to make sure that data collection is properly handled from a legal perspective.

More Opportunities
• In addition to working as data scientists, graduates of a data science program can work as software development engineers.
• In this field, they will mostly specialize in developing platforms that help data scientists in their jobs.
• They can also develop dashboards that present business intelligence charts and reports to users.

CIS Career Path

Introduction
Graduates of the Computer Information Systems (CIS) program can pursue a job in one of the following fields:
  – Business Analysis
  – Software Development
  – System Implementation

Introduction
• CIS is an interdisciplinary program that encompasses technology and business courses.
• This makes the graduates of this program knowledgeable about how business works and how technology can make businesses more efficient and more effective.

Introduction
People who have knowledge about the technology only will have the following issues while working in the software development field:
  – Difficulty in developing software that satisfies the business requirements
  – Difficulty in architecting the software systems according to international standards
  – Difficulty in maintaining existing systems due to lack of knowledge about the business behind them

Example
• The CIS program exposes students to healthcare information systems.
• When a CIS graduate joins a software development team that is responsible for developing an electronic health record (EHR), he/she will already be aware of the features and functionality of the proposed system.

You as a Business Analyst
• You will help customers define their requirements for any proposed software system.
• Because you are already aware of how existing systems work, you can make notes and suggestions on what the proposed software system should look like.
• Also, you are less likely to misinterpret the requirements provided by customers.

You as a Software Developer
• You will write code to make a software system.
• Because you are already aware of how business works, you will be able to choose the right architecture for the system.
• The right architecture is one that supports any future improvements without making radical changes to the existing architecture.

You as a System Implementer
• You will help users use the software system the right way.
• Because you are already aware of how business works, you will be able to provide very helpful advice on how the software should be used and utilized.
