Learning from Data Lecture 1 PDF
University of Exeter
Dr Marcos Oliveira
Summary
This document is a lecture on learning from data. It covers the module overview, data characteristics, the three types of data (structured, semi-structured, and unstructured), and the tasks that occupy a data scientist's time, such as building training sets and cleaning and organising data.
Full Transcript
Learning from Data
Lecture 1
Dr Marcos Oliveira

1. Module Overview
2. Data Characteristics

Module Overview

Learning from Data
Tell me why

Learning from Data
Data to learn about Exeter

Learning from Data: new questions!
[Figure; axis labels: Number of timeline posts, Days before/after event]

Data all the way down
Digital infrastructure is being built out in public and private spaces across the US. With it, our capacity to generate and transmit data from and across devices, entities, systems, and sensor networks (embedded in the Internet of Things) is growing exponentially.
Capacity, however, is only one of the essential enablers of digitization. Capability is essential, too, and that resides in the skills of people.
The market for DSA skills is hot, but it's full of mismatches. The time-honored practice of treating degrees as proxies for skill sets doesn't work with DSA. There is a growing number of DSA degrees and credentials: since 2010, 303 new accredited DSA programs were launched in the US, a 52% increase overall. But most of them have not been around long enough for employers to get a clear-eyed view of the viability of the job candidates they produce.
There is a far deeper pool of STEM talent, but here, too, it's often unclear how well prepared these job candidates are to use data science and analytics in business pursuits. Meanwhile, business schools have very few programs that include DSA coursework.
It is left to hiring managers and recruiters to determine how candidates meet skill requirements in this changing environment. To do that they need two things: 1) a common nomenclature to trade in DSA competencies and skills; and 2) a closer, more collaborative relationship ...

[Figure 2: The demand is for business people with analytics skills, not just data scientists. Of 2.35 million job postings in the US, shares by industry (Finance and Insurance; Healthcare and Social Assistance; Information; Manufacturing; Professional, Scientific, and Technical Services; Retail Trade), split into analytics-enabled jobs (data-driven decision makers, functional analysts) and data science jobs (data analysts, data scientists and advanced analysts). Notes: Job category of analytics managers not shown. Totals may not equal 100%. Source: PwC analysis based on Burning Glass Technologies data, January 2017.]
Source: https://www.bhef.com/publications/investing-americas-data-science-and-analytics-talent

Learning from Data: Module Overview
Supervised learning: Linear regression, polynomial regression, logistic regression. Measures of error, model complexity and model selection. Multilayer Perceptron, Convolutional Neural Networks. K-Nearest Neighbors, Support Vector Machines, Linear Discriminant Analysis. Decision trees.
Unsupervised learning: Centroid-based clustering, hierarchical clustering, density-based clustering. Gaussian Mixture Model, clustering validation. Dimensionality reduction: PCA, t-SNE, UMAP.
Introduction to Computer Vision.
Natural language processing: TF-IDF, topic modeling, embeddings.

Learning from Data: Teaching Team
Module leaders
ECM3420: Chico Camargo ([email protected])
ECMM445: Diogo Pacheco ([email protected])
Lecturers (ECM3420 + ECMM445)
Chico Camargo
Diogo Pacheco
Marcos Oliveira ([email protected])

Learning from Data: Lecture Plan
Week | Lecturer | Topics
W1 | Marcos Oliveira | Module overview and data; Linear regression
W2 | Marcos Oliveira | Measures of error; Other regressions and data splits
W3 | Marcos Oliveira | Model complexity and selection; Preprocessing
W4 | Chico Camargo | Classifiers, Logistic regression, Multilayer Perceptron; Gradient descent, Evaluating classifier performance
W5 | Diogo Pacheco | Natural language processing: TF-IDF; Natural language processing: Topic modeling, embeddings
W6 | | No lectures or workshops
W7 | Diogo Pacheco | K-Nearest Neighbours; Support Vector Machines
W8 | Diogo Pacheco | Decision Trees; Networks
W9 | Chico Camargo | Unsupervised learning, introduction to clustering; Dimensionality reduction, feature selection, PCA, t-SNE, UMAP
W10 | Chico Camargo | Hierarchical Clustering, DBSCAN; KMeans, GMM, Clustering validation
W11 | Chico Camargo | Intro to computer vision; Convolutional neural networks
W12 | Diogo Pacheco | Applications of machine learning; Revision. No class.

Learning from Data: Workshops
What?
Python libraries such as matplotlib, pandas, scikit-learn, and keras. Jupyter notebook.
When?
ECM3420:
Thursdays 11:35-12:25 - (the new) Lovelace
Thursdays 12:35-13:25 - Harrison 208
Thursdays 14:35-15:25 - Babbage
ECMM445:
Thursdays 09:35-10:25 - Babbage
Thursdays 10:35-11:25 - Babbage
Who?
Songyuan Li ([email protected])
Owen Saunders ([email protected])
Arjun Biswas ([email protected])
Joshua Dare-Cullen ([email protected])

Learning from Data: Assessment
Coursework (40%) - Data analysis report
What: data insights.
Release: soon.
Deadline: 03/12/2024.

Exam (60%) - Multiple choice exam
May exam period.
Closed book in-person online exam.

[Slide illustrations: an excerpt from a research statement on urban science and crime, and plots of theft, robbery, and burglary in Chicago, IL, labelled "Insights".]

Background reading
An Introduction to Statistical Learning: With Applications in R, by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani
The Elements of Statistical Learning, by Trevor Hastie, Robert Tibshirani, and Jerome Friedman
Data Mining: Practical Machine Learning Tools and Techniques, by Ian H. Witten, Eibe Frank, Mark A. Hall, and Christopher J. Pal
Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, by Aurélien Géron
Library link: https://www.exeter.ac.uk/departments/library/

1. Module Overview
2. Data Characteristics

Data Characteristics

Data characteristics
What are the features of data?

Data variety
What is structured data, semi-structured data, and unstructured data?

Structured data
Structured data is data that adheres to a data model. It conforms to a tabular format with relationships between the different rows and columns. Examples of structured data include tables in SQL databases.
Data elements are addressable for effective analysis. Structured data makes it easier to contextualize and understand the data.
Search engines such as Google use structured data to match website content to relevant search queries.

LecturerID | FirstName | LastName | EMail
L0001 | Chico | Camargo | [email protected]
L0002 | Diogo | Pacheco | [email protected]
L0003 | Marcos | Oliveira | [email protected]

LecturerID | Module
L0001 | ECM3420
L0002 | ECMM445
L0003 | ECM1407

SELECT * FROM lecturers;
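The two lecturer tables in the structured data section can be loaded and queried with a few lines of Python. This is a minimal sketch, assuming the standard-library sqlite3 module and an in-memory database. The table name `modules` is hypothetical (the slide only names `lecturers`), the EMail column is left out because the addresses are redacted in the source, and the pairing of lecturer IDs with module codes is read off the flattened slide table, so treat it as an assumption.

```python
import sqlite3

# In-memory database: nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Structured data adheres to a data model: the schema is declared up front.
conn.executescript("""
CREATE TABLE lecturers (
    LecturerID TEXT PRIMARY KEY,
    FirstName  TEXT,
    LastName   TEXT
);
CREATE TABLE modules (              -- hypothetical table name
    LecturerID TEXT REFERENCES lecturers(LecturerID),
    Module     TEXT
);
""")

conn.executemany(
    "INSERT INTO lecturers VALUES (?, ?, ?)",
    [("L0001", "Chico", "Camargo"),
     ("L0002", "Diogo", "Pacheco"),
     ("L0003", "Marcos", "Oliveira")],
)
conn.executemany(
    "INSERT INTO modules VALUES (?, ?)",   # ID-to-module pairing is assumed
    [("L0001", "ECM3420"), ("L0002", "ECMM445"), ("L0003", "ECM1407")],
)

# The query from the slide: every element is addressable by row and column.
all_lecturers = conn.execute(
    "SELECT * FROM lecturers ORDER BY LecturerID"
).fetchall()

# The shared LecturerID column expresses the relationship between the tables.
lecturer_modules = conn.execute("""
    SELECT l.FirstName, l.LastName, m.Module
    FROM lecturers AS l
    JOIN modules AS m ON m.LecturerID = l.LecturerID
    ORDER BY l.LecturerID
""").fetchall()

for row in lecturer_modules:
    print(row)
```

Because the schema is known in advance, the join needs no parsing or guessing: the relationship between rows is carried entirely by the LecturerID key.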
Unstructured data
Unstructured data is data which is not organized according to a preset data model or schema. Therefore, it cannot be stored in a traditional relational database.
From 80% to 90% of data generated and collected by organizations is unstructured. This data might be rich in content but not immediately usable without first being sorted.
Unstructured data can be produced by people or machines.

SELECT ???? FROM ????? 🤯

[Figure: Unstructured vs. Structured]
The questions we ask are often unstructured!
Source: https://medium.com/@shay.strong/geospatial-machine-learning-structuring-unstructured-structured-data-58dbca8a48b0

Semi-structured data
Semi-structured data does not adhere to a data model, but it has some level of structure. The data contains tags, hierarchies, and other types of markers that give the data structure.

From: [email protected]
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: quoted-printable
X-Universally-Unique-Identifier: B173983D-92C5-48D1-8C89-0D5E98FA0A12
Date: Sun, 24 Sep 2023 10:15:23 +0100
To: Marcos Oliveira
MIME-Version: 1.0
Subject: If you're looking for someone who can make you happy, let's talk and see if we can create a fun connection.

Hello! My name is Agnes. I'm great to hang out with and I am searching for a man that'll be happy to enjoy my companionship. If you're in my region then you'll be glad to connect with me.

Types of data
Properties | Structured Data | Semi-Structured Data | Unstructured Data
Technology | Based on relational database table. | Based on XML/RDF (Resource Description Framework). | Based on character and binary data.
Transaction management | Matured transaction and concurrency management techniques. | Transaction is adapted from DBMS. Not matured. | No transaction management and no concurrency.
Version management | Versioning over tuples, rows, tables. | Versioning over tuples or graph is possible. | Versioned as a whole.
Flexibility | Schema dependent, less flexible. | More flexible than structured data but less flexible than unstructured data. | More flexible and there is an absence of a schema.
Analysis methods | SQL queries. | Query languages (e.g., Cassandra, MongoDB). | Natural Language Processing, audio analysis, video analysis, text analysis.
Source: https://www.mongodb.com/unstructured-data/structured-vs-unstructured

What do Data Scientists do?
Here's where the popular view of data scientists diverges pretty significantly from reality. Generally, we think of data scientists building algorithms, exploring data, and doing predictive analysis. That's actually not what they spend most of their time doing, however.

What data scientists spend the most time doing
Building training sets: 3%
Cleaning and organizing data: 60%
Collecting data sets: 19%
Mining data for patterns: 9%
Refining algorithms: 4%
Other: 5%

They had this to say:
What's the least enjoyable part of data science?
Building training sets: 10%
Cleaning and organizing data: 57%
Collecting data sets: 21%
Mining data for patterns: 3%
Refining algorithms: 4%
Other: 5%
Source: https://www.ipa.go.jp/digital/chousa/trend/datademocra/ug65p90000001hna-att/CrowdFlower_DataScienceReport_2016.pdf
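One concrete instance of the "cleaning and organizing" work counted in the survey is turning semi-structured input into a structured record. The sketch below assumes Python's standard-library email package and uses a shortened copy of the e-mail shown in the semi-structured data section. The sender address is a placeholder (the real one is redacted in the source), and the field names in `record` are hypothetical.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime

# Shortened copy of the slide's e-mail; the From address is a placeholder.
raw_email = """\
From: sender@example.com
Content-Type: text/plain; charset=us-ascii
Date: Sun, 24 Sep 2023 10:15:23 +0100
To: Marcos Oliveira
MIME-Version: 1.0
Subject: If you're looking for someone who can make you happy

Hello! My name is Agnes.
"""

# The header markers ("Name: value") are the structure we can rely on.
msg = message_from_string(raw_email)

# Structured part: one row with fixed, typed fields (hypothetical names).
record = {
    "to": msg["To"],
    "subject": msg["Subject"],
    "sent_at": parsedate_to_datetime(msg["Date"]),  # timezone-aware datetime
    "content_type": msg.get_content_type(),
}

# Unstructured part: the free-text body has no schema of its own.
body = msg.get_payload()

print(record)
print(body)
```

The headers map cleanly onto columns of a table, while the body stays free text that would need methods such as natural language processing before it could be analysed.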