IntroductionDataScience_DS1.pdf
Introduction to Data Science
Pedro G. Ferreira [email protected]
Pedro G. Ferreira :: Introduction to Data Science - FCUP | 1

Agenda
– Large scale data
– Commercial and scientific interest and vast amount of opportunities
– Applications of Data Analysis
– Data Science
  – Definitions
  – Skill Set
  – Roles
– Lifecycle of a DS Project
– Roles in DS
– Summary

Data, Data, Data!
There is an avalanche of data being produced.
In the last 2 years more data was produced than in the entire human history.
The amount of digital information increases tenfold every five years.
Examples:
– 1 human genome is ~100Gb, sequenced in 4 to 5 days. ~1 million expected to be completed.
– Credit cards have billions of transactions/year.
– Hosts > 40 billion photos.
– 300 hours of video uploaded every minute.
– 40 Tb/sec each experiment.
– 1M transactions/hour feeding a 2.5 PetaB DB.
Data sources:
– Smartphone apps
– Public web
– Machine log data
– Data storage
– Sensor data
– Docs
– Scientific literature
– Archives
– Many other sources...

The value of data
Data as the raw material of science and business.
Data → Evidence → Understanding → Progress

Applications
– Autonomous vehicle control / Robotics
– Recommendation Systems
– Personalized Medicine / Genomics
– Personal Assistants / Voice Recognition

Data Science
DS is the discipline that deals with the collection, processing/preparation, management, analysis, interpretation and visualization of large, heterogeneous and complex datasets.
The goal of DS is the extraction of non-obvious and useful information and knowledge from large volumes of data, in order to improve scientific, social and business decision making.
The ultimate mission of a data scientist is to solve a scientific or business problem, not just to analyze the data or build a predictive model.

Data Science for Data Analysis

Data Science Skill Set
Data Science = Data Analytical Thinking + Automation

Data Scientist
The Data Scientist is responsible for guiding the DS project from start to end.
Success comes from:
– Having quantifiable goals;
– Good methodology;
– Cross-discipline interaction (interacting with others);
– Repeatable workflow (creating things that others can reproduce).

Project Roles
(the one who manages the data the most)
(Adapted from Practical Data Science with R, 2nd Edition, Manning)

Lifecycle of DS Project
(Adapted from Practical Data Science with R, 2nd Edition, Manning)

Define the Goal
Define measurable and quantifiable goals (have objectives).
Learn all about the context of the project.
– Why is the project needed?
– What is the current approach to the problem?
– Why is the current approach not enough?
– What resources will be needed?
– How do you plan to deploy the project?
(Which method are we using? Why is it not working? Is there another project that could bring in more money, or something like that?)
(Adapted from Practical Data Science with R, 2nd Edition, Manning)

Data collection and Management
Identify the data you need, explore it and prepare it for analysis.
– What data is available for the DS?
– Is the data useful?
– Is its quality good enough?
– Explore and visualize the data.
– Clean the data: repair errors and transform variables.
(Adapted from Practical Data Science with R, 2nd Edition, Manning)

Modeling
Extract useful insights with statistics and machine learning.
– Classifying: deciding if something belongs to one category or the other.
– Scoring: estimating a numerical value, such as a price or probability.
– Ranking: learning to order the items by preferences.
– Clustering: grouping items into the most similar groups.
– Finding relations: finding correlations or potential causes of effects seen in data.
– Characterization: plotting and report generation.
(things to do with the data)
(Adapted from Practical Data Science with R, 2nd Edition, Manning)

Model evaluation and Critique
– Is the model accurate enough for the needs?
– Does it generalize well?
– Does it perform better than the obvious?
– Do the results make sense in the problem domain?
(Adapted from Practical Data Science with R, 2nd Edition, Manning)

Presentation and Documentation
– Present results to project sponsor and stakeholders.
– Document the model for those in the organization who will use, run and maintain it.
– Define the impact of the findings in terms of domain metrics.
– Report most interesting findings and recommendations.
(Adapted from Practical Data Science with R, 2nd Edition, Manning)

Data Science Roles and tools
Data Engineer
– Information Architects
– Build data pipelines and storage solutions
– Maintain data access (give access to certain parts of the data; sometimes access is not given to all of it)
Tools (functions of a data engineer):
– SQL (Structured Query Language): to store and organize data
– Java, Scala or Python: programming to process data
– Shell: to automate and run tasks
– Cloud Computing (they also know how to do this): AWS, Azure, Google Cloud Platform
(Adapted from DataCamp)

Data Science Roles and tools
Data Analysts
– Perform simpler analyses that describe data
– Create reports and dashboards to summarize data
– Clean data for analysis
("analyze" the data, clean the data; they use SQL, Python, R, spreadsheets, etc. to analyze the data)
Tools:
– SQL: retrieve and aggregate data
– Spreadsheets: simple analysis
– BI Tools (Tableau, PowerBI, Looker): dashboards and visualization
– Python or R: clean and analyze data
(Adapted from DataCamp)

Data Science Roles and tools
Data Scientist
– Strong background in Statistics
– Run experiments and analyses for insights
– Traditional machine learning
Tools:
– SQL: retrieve and aggregate data
– Python or R (advanced level): DS libraries (e.g. scikit-learn, pandas, tidyverse)
(tidyverse, used in R, is an example of a data science library)
(Adapted from DataCamp)

Data Science Roles and tools
Machine Learning Scientist
– Predictions and extrapolations
– Classification and regression
– Deep Learning: Image Processing, Natural Language Processing
(needs to know more about machine learning, etc.)
Tools:
– Python or R (advanced level): ML libraries (e.g. TensorFlow or Spark)
(Adapted from DataCamp)

Data Science Roles and tools (summary)
(Adapted from DataCamp)

Summary (summary 2)
The DS process involves a lot of back-and-forth between all the participants. The data scientist plays a pivotal role.
For a successful DS project you should have clear, verifiable and quantifiable goals, from more than one perspective; go beyond accuracy benchmarking.
Adjust the expectations for all stakeholders. (stakeholders: the people involved in the project)
Determine the lower bounds on model performance by comparing with a baseline model (the obvious guess).
Your model should do better than that.
Make an effort to communicate your results and findings well.
(A bad report plus a bad presentation can ruin a good model. Communication is important when we produce a model. We need to create a structured presentation; the teacher values this a lot. The lifecycle slide may help organize the communication.)
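
The baseline comparison mentioned in the summary can be sketched in a few lines of Python. This is a minimal illustration, not the method used in the course: the labels, the candidate model's predictions and the helper names (`accuracy`, `majority_baseline`) are all hypothetical, and the "obvious guess" is assumed here to mean always predicting the most frequent class seen in training.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def majority_baseline(y_train):
    """Return the most frequent training label: the 'obvious guess'."""
    return Counter(y_train).most_common(1)[0][0]


# Hypothetical toy labels (1 = positive class, 0 = negative class); not real data.
y_train = [0, 0, 0, 1, 0, 1, 0, 0]
y_test = [0, 1, 0, 0, 1, 0, 0, 0, 1, 0]

# Hypothetical predictions from some candidate model on the test set.
model_pred = [0, 1, 0, 0, 0, 0, 0, 0, 1, 0]

# The baseline ignores the inputs and always predicts the majority class.
baseline_label = majority_baseline(y_train)
baseline_pred = [baseline_label] * len(y_test)

baseline_acc = accuracy(y_test, baseline_pred)
model_acc = accuracy(y_test, model_pred)

print(f"baseline accuracy: {baseline_acc:.2f}")
print(f"model accuracy:    {model_acc:.2f}")
print("model beats baseline:", model_acc > baseline_acc)
```

With these toy numbers the majority-class guess already reaches 0.70 accuracy, so a model is only interesting to the extent that it clears that bar. Accuracy is used here only because it is the simplest metric to show; the summary's advice to go beyond accuracy benchmarking still applies.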