Introduction to Data Science Lecture PDF

Summary

This lecture provides an introduction to data science. It covers the definition of big data, including the three Vs (volume, velocity, and variety), different data structures, and data science applications. It also discusses data science vs. business intelligence and the key skills required for data scientists.

Full Transcript

Introduction to Data Science

Outline
Introduction to Big Data
Data Science and Business Intelligence
The Skillset of Data Scientists
Summary

What is "Big Data"?!?
Is this really about size?

Naive Definition
Naive definition: Big data only depends on the data size.
1 Gigabyte? 1 Terabyte? 1 Petabyte?
The naive interpretation misses important aspects:
Time: Analyzing 1 Gigabyte of data per day is different from analyzing 1 Gigabyte of data per second.
Diversity: Analyzing spreadsheets with numeric data is different from analyzing Web pages that contain a mixture of text and images.
Distribution: Analyzing data from a single source is different from analyzing data from multiple sources.

Definition of Big Data
Following Gartner's IT Glossary: Big data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation.

The three Vs
Volume, Velocity, Variety
Some people actually use 10 Vs to define big data: Volume, Velocity, Variability, Variety, Veracity, Validity, Vulnerability, Volatility, Visualization, Value.

The 3 Vs: Volume
Scale of the data must be "big".
No clear definition: "that demand […] innovative forms of information processing" (Gartner).
Figure: Data center storage worldwide (© Statista 2018)

The 3 Vs: Velocity
Speed at which new data is created.
Speed at which data must be processed and analyzed.
Often close to real-time.

The 3 Vs: Variety
Diversity in data types and data sources.
Structured: Data with defined types and structure. Example: comma-separated values.
Semi-structured: Textual data with a parseable pattern. Example: XML files with schema.
Quasi-structured: Textual data with erratic formats that can be formatted with effort. Example: clickstream data.
Unstructured: Data that has no inherent structure, often with multiple formats. Example: Web sites, videos.

Examples for data types
Figure panels: Structured, Quasi-Structured, Semi-Structured, Unstructured

Defining Data Science
Unfortunately, there is no clear definition (yet?).
Goal is the extraction of knowledge from data.
Combination of techniques from different disciplines.
Scientific principles guide the data analysis.

What is "Data Science"?!?
Tools? Big Data? Machine Learning?

Mathematical Aspects
Optimization, Stochastics, Computational Geometry, Scientific Computing, Machine Learning

Computer Science Aspects
Data Structures and Algorithms, Databases, Distributed Computing, Software Engineering, Artificial Intelligence, Machine Learning

Statistical Aspects
Linear Models, Statistical Tests, Inference, Time Series Analysis, Machine Learning

Applications
Intelligent Systems, Robotics, Marketing, Medicine, Autonomous Driving, Social Networks

Data Science vs. Business Intelligence
Business Intelligence (Gartner IT Glossary): […] best practices that enable access to and analysis of information to improve and optimize decisions and performance.
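
One way to picture the difference between the two fields is to put a backward-looking query next to a forward-looking model. The sketch below is only an illustration and not part of the lecture: the monthly sales figures are made up, the function names are hypothetical, and a straight-line trend fit stands in for the much richer predictive modelling used in practice.

```python
# Hypothetical monthly sales figures, invented purely for illustration.
monthly_sales = [120.0, 135.0, 150.0, 160.0, 178.0, 190.0]


def total_sales(sales):
    """Business-intelligence-style question: how much did we sell in total?"""
    return sum(sales)


def best_month(sales):
    """Business-intelligence-style question: when did sales peak? (1-based month)"""
    return max(range(len(sales)), key=lambda i: sales[i]) + 1


def fit_linear_trend(sales):
    """Fit sales = intercept + slope * month by ordinary least squares."""
    n = len(sales)
    months = range(1, n + 1)
    mean_x = sum(months) / n
    mean_y = sum(sales) / n
    sxx = sum((x - mean_x) ** 2 for x in months)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(months, sales))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def forecast(sales, month):
    """Data-science-style question: what will sales be in a future month?"""
    intercept, slope = fit_linear_trend(sales)
    return intercept + slope * month


if __name__ == "__main__":
    print("Total so far:", total_sales(monthly_sales))
    print("Best month:", best_month(monthly_sales))
    print("Forecast for month 7:", round(forecast(monthly_sales, 7), 1))
```

The first two functions only summarize data that already exists. The forecast extrapolates beyond it, and therefore depends on the assumption that the linear trend continues.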
Business Intelligence vs. Data Science
Techniques. Business Intelligence: dashboards, alerts, queries. Data Science: optimization, predictive modelling, forecasting.
Data types. Business Intelligence: structured, data warehouses. Data Science: any kind, often unstructured.
Common questions. Business Intelligence: What happened…? How much did…? When did…? Data Science: What if…? What will…? How can we…?
Figure: Business Intelligence and Data Science placed by depth of insights (low to high) over time (past, present, future).

More Data → More Opportunities
Figure: volume of information (small to large) over time.
1990s: terabytes; relational databases & data warehouses.
2000s: petabytes; content management & unstructured data.
2010s: exabytes; key-value storages.

What are Data Scientists?
Not computer scientists, but should know about databases, data structures, algorithms, etc.
Not mathematicians, but should know about optimization, stochastics, etc.
Not statisticians, but should know about regression, statistical tests, etc.
Not domain experts, but must work together with them.

Skills of Data Scientists
Quantitative: Maths, Algorithms, Statistics
Collaborative: Teamwork, Communication
Technical: Programming, Infrastructures
Skeptical: Create hypotheses, but be skeptical about them
A bit of everything … but actually as much as possible of everything

Different types of Data Scientists
According to Microsoft Research:
Polymath: "Do it all"
Data Analyzer: Analyzing data
Data Evangelist: Data analysis, disseminating and acting on insights
Platform Builder: Collect data and create infrastructures
Data Preparer: Querying existing data, preparing data for analysis
Moonlighters (50%/20%): "Spare time" data scientists
Data Shapers: Analyzing and preparing data
Insight Actors: Use the outcome and act on insights
Source: Miryung Kim, Thomas Zimmermann, Robert DeLine, Andrew Begel: Data Scientists in Software Teams: State of the Art and Challenges, IEEE Transactions on Software Engineering (Online First)

Summary
Big data has a high volume, velocity, and variety
Different data structures: structured, semi-structured, quasi-structured, unstructured
Data science is a very diverse discipline: maths, computer science, statistics, applications
→ Data scientists require a diverse skillset
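
The four data structure categories named in the summary can be made concrete with a small sketch. It is only an illustration under stated assumptions: every sample input is invented, the clickstream line format is hypothetical, and word counting is just one generic thing that can be done with unstructured text.

```python
import csv
import io
import re
import xml.etree.ElementTree as ET

# All sample inputs below are invented for illustration.

# Structured: defined types and structure (comma-separated values).
csv_text = "id,price\n1,9.99\n2,14.50\n"
rows = [
    {"id": int(r["id"]), "price": float(r["price"])}
    for r in csv.DictReader(io.StringIO(csv_text))
]

# Semi-structured: textual data with a parseable pattern (XML).
xml_text = "<orders><order id='1'><item>book</item></order></orders>"
items = [order.findtext("item") for order in ET.fromstring(xml_text).iter("order")]

# Quasi-structured: erratic text that can be formatted with some effort.
# The line format here is a hypothetical clickstream entry.
click_line = "2018-03-01T12:00:05 user=42 GET /products/17?ref=home"
click_pattern = re.compile(r"(?P<time>\S+) user=(?P<user>\d+) GET (?P<path>[^?\s]+)")
match = click_pattern.match(click_line)
click = match.groupdict() if match else None

# Unstructured: no inherent structure, so only generic processing applies,
# such as counting the words in free text.
free_text = "Data science extracts knowledge from data."
word_count = len(free_text.split())

if __name__ == "__main__":
    print(rows)
    print(items)
    print(click)
    print(word_count)
```

The effort needed grows from top to bottom: the CSV and XML inputs are handled by standard parsers, the clickstream line needs a hand-written pattern, and the free text offers no fields to extract at all.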
