Introduction To Data Science PDF

Summary

This document is an introductory presentation on Data Science. It covers the data science process, roles, and tools, and closes with a summary of what makes a data science project successful.

Full Transcript

Introduction to Data Science
Pedro G. Ferreira
[email protected]
Pedro G. Ferreira :: Introduction to Data Science - FCUP | 1

Agenda
– Large scale data
– Commercial and scientific interest and a vast amount of opportunities
– Applications of Data Analysis
– Data Science: Definitions, Skill Set, Roles
– Lifecycle of a DS Project
– Roles in DS
– Summary

Data, Data, Data!
There is an avalanche of data being produced. In the last 2 years more data was produced than in the entire human history. The amount of digital information increases tenfold every five years.
– 1 Human genome ~100Gb, sequenced in 4/5 days. ~1 Million expected to be completed.
– Credit cards have billions of transactions/year
– Hosts > 40 billion photos
– 300 hours of video uploaded every minute
– 40 Tb/sec each experiment
– 1M transactions/hour feeding a 2.5PetaB DB
Data sources: smartphone apps, public web, machine log data, data storage, sensor data, docs, scientific literature, archives, many other sources...

The value of data
Data as the raw material of science and business.
Data → Evidence → Understanding → Progress

Applications
– Autonomous vehicle control / Robotics
– Recommendation Systems
– Personalized Medicine / Genomics
– Personal Assistants / Voice Recognition

Data Science
DS is the discipline that deals with collection, processing/preparation, management, analysis, interpretation and visualization of large, heterogeneous and complex datasets.
The goal of DS is the extraction of non-obvious and useful information and knowledge from large volumes of data, in order to improve scientific, social and business decision making.
The ultimate mission for a data scientist is to solve a scientific or business problem and not just analyze the data or build a predictive model.

Data Science for Data Analysis

Data Science Skill Set
Data Science = Data Analytical Thinking + Automation

Data Scientist
The Data Scientist is responsible for guiding the DS project from start to end. Success comes from:
– Having quantifiable goals;
– Good methodology;
– Cross-discipline interaction;
– Repeatable workflow.

Project Roles
(Adapted from Practical Data Science with R, 2nd Edition, Manning)

Lifecycle of DS Project
(Adapted from Practical Data Science with R, 2nd Edition, Manning)

Define the Goal
Define measurable and quantifiable goals. Learn all about the context of the project.
– Why is the project needed?
– What is the current approach to the problem?
– Why is the current approach not enough?
– What resources will be needed?
– How do you plan to deploy the project?
(Adapted from Practical Data Science with R, 2nd Edition, Manning)

Data collection and Management
Identify the data you need, explore it and prepare it for analysis.
– What data is available for the DS?
– Is the data useful?
– Is its quality good enough?
– Explore and visualize the data.
– Clean the data: repair errors and transform variables.
(Adapted from Practical Data Science with R, 2nd Edition, Manning)

Modeling
Extract useful insights with statistics and machine learning.
– Classifying: deciding if something belongs to one category or the other.
– Scoring: estimating a numerical value, such as a price or probability.
– Ranking: learning to order the items by preferences.
– Clustering: grouping items into the most similar groups.
– Finding relations: finding correlations or potential causes of effects seen in data.
– Characterization: plotting and report generation.
(Adapted from Practical Data Science with R, 2nd Edition, Manning)

Model evaluation and Critique
– Is the model accurate enough for the needs?
– Does it generalize well?
– Does it perform better than the obvious?
– Do the results make sense in the problem domain?
(Adapted from Practical Data Science with R, 2nd Edition, Manning)

Presentation and Documentation
– Present results to the project sponsor and stakeholders.
– Document the model for those in the organization who will use, run and maintain it.
– Define the impact of the findings in terms of domain metrics.
– Report the most interesting findings and recommendations.
(Adapted from Practical Data Science with R, 2nd Edition, Manning)

Data Science Roles and tools: Data Engineer
– Information architects
– Build data pipelines and storage solutions
– Maintain data access
Tools:
– SQL: to store and organize data
– Java, Scala or Python: programming to process data
– Shell: to automate and run tasks
– Cloud Computing: AWS, Azure, Google Cloud Platform
(Adapted from DataCamp)

Data Science Roles and tools: Data Analysts
– Perform simpler analyses that describe data
– Create reports and dashboards to summarize data
– Clean data for analysis
Tools:
– SQL: retrieve and aggregate data
– Spreadsheets: simple analysis
– BI Tools (Tableau, PowerBI, Looker): dashboards and visualization
– Python or R: clean and analyze data
(Adapted from DataCamp)

Data Science Roles and tools: Data Scientist
– Strong background in Statistics
– Run experiments and analyses for insights
– Traditional machine learning
Tools:
– SQL: retrieve and aggregate data
– Python or R (advanced level): DS libraries (e.g. scikit-learn, pandas, tidyverse)
(Adapted from DataCamp)

Data Science Roles and tools: Machine Learning Scientist
– Predictions and extrapolations
– Classification and regression
– Deep Learning
    Image Processing
    Natural Language Processing
Tools:
– Python or R (advanced level): ML libraries (e.g. TensorFlow or Spark)
(Adapted from DataCamp)

Data Science Roles and tools
(Adapted from DataCamp)

Summary
The DS process involves a lot of back-and-forth between all the participants. The data scientist plays a pivotal role.
For a successful DS project you should have clear, verifiable and quantifiable goals, from more than one perspective: go beyond accuracy benchmarking.
Adjust the expectations for all stakeholders.
Determine the lower bound on model performance by comparing with a baseline model (the obvious guess). Your model should do better than that.
Make an effort to communicate your results and findings well.
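
The advice to compare a model against a baseline ("the obvious guess") can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the slides: the toy spam/ham data, the `num_links` feature, and the helper names `majority_baseline`, `fit_threshold_model` and `accuracy` are all hypothetical. The baseline always predicts the most common training label, and the candidate model is a deliberately simple threshold rule.

```python
from collections import Counter


def majority_baseline(train_examples):
    """Baseline ("obvious guess"): always predict the most common training label."""
    labels = [label for _, label in train_examples]
    most_common_label, _ = Counter(labels).most_common(1)[0]
    return lambda features: most_common_label


def fit_threshold_model(train_examples):
    """Hypothetical candidate model: cut halfway between the class means of num_links."""
    spam = [f["num_links"] for f, label in train_examples if label == "spam"]
    ham = [f["num_links"] for f, label in train_examples if label == "ham"]
    cutoff = (sum(spam) / len(spam) + sum(ham) / len(ham)) / 2
    return lambda features: "spam" if features["num_links"] > cutoff else "ham"


def accuracy(predict, examples):
    """Fraction of examples whose predicted label matches the true label."""
    correct = sum(1 for features, label in examples if predict(features) == label)
    return correct / len(examples)


# Hypothetical toy data: (features, label) pairs.
train = [
    ({"num_links": 0}, "ham"),
    ({"num_links": 1}, "ham"),
    ({"num_links": 8}, "spam"),
    ({"num_links": 0}, "ham"),
    ({"num_links": 6}, "spam"),
]
test = [
    ({"num_links": 0}, "ham"),
    ({"num_links": 1}, "ham"),
    ({"num_links": 7}, "spam"),
    ({"num_links": 0}, "ham"),
    ({"num_links": 9}, "spam"),
    ({"num_links": 2}, "spam"),
]

baseline_acc = accuracy(majority_baseline(train), test)
model_acc = accuracy(fit_threshold_model(train), test)

print(f"baseline accuracy: {baseline_acc:.2f}")
print(f"model accuracy:    {model_acc:.2f}")
print("model beats baseline:", model_acc > baseline_acc)
```

On this toy data the baseline sets the lower bound that the model has to clear. In a real project the same comparison would be run on held-out data and, as the slides suggest, on more than one metric rather than accuracy alone.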
