Data_Collection_and_Storage_DS2.pdf
Data Collection and Storage
Pedro G. Ferreira
[email protected]
Pedro G. Ferreira :: Introduction to Data Science - FCUP | 1

Agenda
– Sources of Data
– Data Storage
– Data Pipelines
– Data Preparation

Sources of Data

Open Data
Provided in different repositories
– UCI Machine Learning Dataset Repository (many datasets are available for download)
– Kaggle (a platform for practicing data science)
– Google Datasets
Data published from scientific papers or competitions
Can be downloaded in different formats or accessed via APIs. An API is a way of downloading the data directly.
Public Records
– UN, WHO, Pordata (sources of open data)
(On Kaggle there is test data (hidden) and training data (available); m -> predictions)

Company Data (private data)
Collected by companies
Helps them make business decisions
Types of data:
– Web events (sites)
– Financial Transactions
– Survey data
– Customer Data
– Logistics Data
(companies: Meta, Google, OpenAI)

Data Storage
To organize and store your data, consider:
Location
– Server or cluster to run locally (a server is needed if the data is more complex)
– Cloud computing (AWS, Azure, GCP) (only some companies can manage this; these computers are more expensive)
Data types
– Unstructured: email, text, video and audio files, web pages, social media. Stored in a Document Database.
– Tabular and structured: data organized as rows and columns. Stored in a Relational Database.
Retrieval and Data Querying

Data Type       Database Type         Query Language
Unstructured    Document Database     NoSQL
Tabular         Relational Database   SQL

Data Pipelines
How do we scale the analysis?
– Multiple data sources
– Different data types
    Unstructured data
    Tabular data
    Real-time streaming data
Data pipeline
– Moves data into defined stages
– Data is collected and stored automatically
    Scheduled by frequency or triggered by an event
– Monitored with generated alerts
– Extract, Transform, Load (ETL)

Data Pipelines
- Copied from Angermueller C. et al. Deep learning for computational biology. Mol Syst Biol. 2016 Jul 29;12(7):878

Data Pipelines
(With machine learning the process is different.)
- Adapted from Aurélien Géron, Hands-On Machine Learning with Scikit-Learn & TensorFlow

Data Preparation
Why prepare data?
– Real-life data is messy
– Processing is done to prevent:
    Errors
    Biasing algorithms
    Incorrect results
Tidy data
– Organize cases as rows
– Features as columns
Adapted from DataCamp
(Features, i.e. characteristics, go in the columns; cases are the examples, which go in the rows.)
(What is wrong here: there is a duplicate, there is missing data, the formats differ (USA, FR, but then we have belgium), and age is a string instead of a float. Add IDs so that all cases are unique.)

Data Preparation
Remove duplicates
Unique Identifiers (add an ID so that every case is unique; there can be two Saras)
Homogeneity

Data Preparation
Data Types
Missing values
– Data entry
– Error
– Valid missing value
Handling missing values
– Impute (mean, max, median, ...) (insert a value where one is missing)
– Drop
– Keep it if the algorithm handles it

Summary
Data rarely comes ready for analysis. Real-life data is messy and dirty.
Preparing the data properly will save you time in later stages of analysis.
Keep in mind the multiple steps of the analysis pipeline, from retrieving the data to presenting the results.
Most algorithms require data in tabular format without missing data or duplicates.
Check that your data is in the right tabular format, with cases as rows and features as columns; that you have no missing values; and that data types are appropriately represented. (You need to check that everything is well organized.)
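
The preparation steps from the Data Preparation slides (remove duplicates, homogenize formats, fix data types, handle missing values, add unique identifiers) can be sketched in a few lines of plain Python. This is only an illustrative sketch and is not taken from the slides: the records, the field names, the `COUNTRY_CODES` mapping, and the `prepare` function are all hypothetical, chosen to mirror the problems noted on the tidy data slide. Mean imputation is used here as just one of the imputation options listed.

```python
from statistics import mean

# Hypothetical raw records; all values are invented for illustration.
RAW = [
    {"name": "Sara", "country": "USA", "age": "27"},
    {"name": "Sara", "country": "USA", "age": "27"},     # exact duplicate
    {"name": "Sara", "country": "FR", "age": "41"},      # a different Sara
    {"name": "Lis", "country": "belgium", "age": None},  # missing age, other format
]

# Assumed mapping used to bring country values to a single code format.
COUNTRY_CODES = {"belgium": "BE", "france": "FR", "united states": "USA"}


def prepare(records):
    """Return cleaned copies of the records, one dict per case (row)."""
    # 1. Remove exact duplicates while keeping the original order.
    seen = set()
    unique = []
    for rec in records:
        key = tuple(sorted(rec.items()))
        if key not in seen:
            seen.add(key)
            unique.append(dict(rec))

    # 2. Homogeneity: map every country value to one code format.
    for rec in unique:
        country = rec["country"]
        rec["country"] = COUNTRY_CODES.get(country.lower(), country.upper())

    # 3. Data types: store age as a float instead of a string.
    for rec in unique:
        rec["age"] = float(rec["age"]) if rec["age"] is not None else None

    # 4. Missing values: impute with the mean of the known ages.
    known = [rec["age"] for rec in unique if rec["age"] is not None]
    fill = mean(known) if known else None
    for rec in unique:
        if rec["age"] is None:
            rec["age"] = fill

    # 5. Unique identifiers: two cases may share a name, so add an id.
    for i, rec in enumerate(unique, start=1):
        rec["id"] = i
    return unique


clean = prepare(RAW)
for row in clean:
    print(row)
```

In this sketch duplicates are removed before formats are homogenized, so two rows that differ only in how the country is written would both survive; running the homogenization step first would catch those as well. Dropping rows with missing values, or keeping them when the algorithm handles them, are the alternatives to step 4.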