Introduction to Data Science
Alexandria University
Summary
This document provides an introduction to data science, explaining the concept of big data and its characteristics such as volume, velocity, and variety. It also covers the challenges and opportunities associated with big data, from data collection to analysis.
Full Transcript
Lec. 2. Introduction to Data Science

What makes data "Big" Data?
Big data, as its name suggests, is very big: its size starts at 1 TB at least.
The three characteristics of Big Data:
◦ Volume: data quantity
◦ Velocity: data speed
◦ Variety: data types

1st Characteristic of Big Data: Scale (Volume)
◦ A typical PC might have had 10 gigabytes of storage in 2000.
◦ Today, Facebook ingests 500 terabytes of new data every day.
◦ A Boeing 737 will generate 240 terabytes of flight data during a single flight across the US.
◦ Smartphones, the data they create and consume, and sensors embedded into everyday objects will soon result in billions of new, constantly updated data feeds containing environmental, location, and other information, including video.
Data volume is increasing exponentially: the amount of collected and generated data grows exponentially.

2nd Characteristic of Big Data: Speed (Velocity)
◦ Clickstreams and ad impressions capture user behavior at millions of events per second.
◦ High-frequency stock trading algorithms reflect market changes within microseconds.
◦ Machine-to-machine processes exchange data between billions of devices.
◦ Infrastructure and sensors generate massive log data in real time.
◦ Online gaming systems support millions of concurrent users, each producing multiple inputs per second.
Data is being generated fast and needs to be processed fast.
Online data analytics: late decisions ➔ missed opportunities.
Examples
◦ E-promotions: based on your current location, your purchase history, and what you like ➔ send promotions right now for the store next to you
◦ Healthcare monitoring: sensors monitor your activities and body ➔ any abnormal measurement requires an immediate reaction

3rd Characteristic of Big Data: Types (Variety)
Big Data isn't just numbers, dates, and strings. Big Data is also geospatial data, 3D data, audio and video, and unstructured text, including log files and social media. Traditional database systems were designed to address smaller volumes of structured data, fewer updates, or a predictable, consistent data structure.
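To make the variety point concrete, here is a minimal sketch of why one fixed schema is not enough: three inputs of different kinds each need a different parsing strategy. The sample inputs, field names, and function names are all hypothetical and invented for illustration; they do not come from the lecture.

```python
import csv
import io
import json
import re

# Hypothetical raw inputs, invented for illustration only.
CSV_ROW = "1001,2024-01-15,49.90"  # structured: fixed columns
JSON_POST = '{"user": "u42", "text": "Great store!", "tags": ["shopping"]}'  # semi-structured
LOG_LINE = "2024-01-15 10:32:07 WARN sensor-7 temperature=81.5"  # unstructured text


def parse_csv_row(line):
    """Structured data: column positions are known in advance."""
    order_id, date, amount = next(csv.reader(io.StringIO(line)))
    return {"kind": "transaction", "order_id": int(order_id),
            "date": date, "amount": float(amount)}


def parse_json_post(text):
    """Semi-structured data: fields are named but may vary per record."""
    doc = json.loads(text)
    return {"kind": "post", "user": doc.get("user"),
            "text": doc.get("text", ""), "tags": doc.get("tags", [])}


# Assumed log layout: "<date> <time> <level> <source> <key>=<number>"
LOG_PATTERN = re.compile(r"^(\S+ \S+) (\w+) (\S+) (\w+)=([\d.]+)$")


def parse_log_line(line):
    """Unstructured text: structure has to be recovered with a pattern."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None  # line does not follow the assumed layout
    timestamp, level, source, key, value = match.groups()
    return {"kind": "log", "timestamp": timestamp, "level": level,
            "source": source, key: float(value)}


records = [parse_csv_row(CSV_ROW), parse_json_post(JSON_POST),
           parse_log_line(LOG_LINE)]

if __name__ == "__main__":
    for record in records:
        print(record)
```

Each parser returns a plain dictionary, so the three kinds of record can at least sit in one collection, even though their fields differ.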
Big Data analysis includes different types of data
◦ Various formats, types, and structures
◦ Text, numerical, images, audio, video, sequences, time series, social media data, multi-dimensional arrays, etc.
◦ Static data vs. streaming data
◦ A single application can be generating or collecting many types of data
◦ To extract knowledge ➔ all these types of data need to be linked together

Data quantity (volume), data speed (velocity), data types (variety)

"Big Data are high-volume, high-velocity, and/or high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization."
Complicated (intelligent) analysis of data may make small data "appear" to be "big".
Bottom line: any data that exceeds our current processing capability can be regarded as "big".

Big Data is any data that is expensive to manage and hard to extract value from
◦ Volume: the size of the data
◦ Velocity: the latency of data processing relative to the growing demand for interactivity
◦ Variety and complexity: the diversity of sources, formats, quality, and structures

Sources of data
◦ Data from the internet
◦ Data from military corporations
◦ Hospital data
◦ NASA data
◦ And so on…

Where is all this data coming from?
◦ Relational data (tables/transactions/legacy data)
◦ Text data (web)
◦ Semi-structured data (XML)
◦ Graph data: social networks, Semantic Web (RDF), …
◦ Streaming data

Data Science
What is data science?
◦ An area that manages, manipulates, extracts, and interprets knowledge from tremendous amounts of data
◦ Data science (DS) is a multidisciplinary field of study whose goal is to address the challenges in big data
◦ Data science principles apply to all data, big and small

Data science is the science that uses computer science, statistics and machine learning, visualization, and human-computer interaction to collect, clean, integrate, analyze, visualize, and interact with data to create data products.
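The clean, integrate, and analyze steps named in this definition can be sketched on a toy scale. This is only an illustrative sketch under stated assumptions: the two small in-memory "sources", their field names, and the helper functions are hypothetical, and a real project would collect data from actual systems rather than hard-coded lists.

```python
from statistics import mean

# Hypothetical, hand-made records standing in for two separate data sources.
purchases = [
    {"customer": "c1", "amount": 20.0},
    {"customer": "c2", "amount": None},  # missing value, removed by clean()
    {"customer": "c1", "amount": 35.0},
    {"customer": "c3", "amount": 12.5},
]
profiles = {"c1": {"city": "Alexandria"}, "c3": {"city": "Cairo"}}


def clean(rows):
    """Clean: drop rows whose amount is missing."""
    return [row for row in rows if row["amount"] is not None]


def integrate(rows, profile_by_customer):
    """Integrate: attach profile fields to each purchase.

    Purchases from customers without a profile are skipped.
    """
    return [
        {**row, **profile_by_customer[row["customer"]]}
        for row in rows
        if row["customer"] in profile_by_customer
    ]


def analyze(rows):
    """Analyze: average purchase amount per city."""
    amounts_by_city = {}
    for row in rows:
        amounts_by_city.setdefault(row["city"], []).append(row["amount"])
    return {city: mean(amounts) for city, amounts in amounts_by_city.items()}


result = analyze(integrate(clean(purchases), profiles))

if __name__ == "__main__":
    print(result)
```

Keeping each step as its own small function mirrors the list of verbs in the definition and makes it easy to inspect the data after every stage.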
- Simply put, data science is an umbrella of several techniques used to extract information and insights from data.
- Companies learn your secrets, shopping patterns, and preferences. For example, can we know if a person is diabetic, even if he/she doesn't want us to know?
- Data science and elections (2008, 2012): 1 million people installed the Obama Facebook app, which gave access to info on "friends".

Data Scientist
◦ The most attractive job of the 21st century
They find stories and extract knowledge. They are not reporters.
A data scientist is the key person in acquiring, cleaning, representing, and analyzing data for business and research purposes.
Data scientists are the key to realizing the opportunities presented by big data. They bring structure to it, find compelling patterns in it, and advise executives on the implications for products, processes, and decisions.

The problem is that when data is this large and unsorted, we cannot analyze it and, moreover, cannot classify it. Data that is stored without ever being used becomes useless. How can this be solved?

The Data Analytics Lifecycle is designed specifically for Big Data problems and data science projects. The lifecycle has six phases, and project work can occur in several phases at once.