Programming for Big Data - Introduction (1).pptx
CP 422 - Programming for Big Data - Introduction
Dr. Emad Mohammed

Objectives = Take-home Messages
1. What are the characteristics of Big Data, and what are the main considerations in processing Big Data?
2. What is an analytic sandbox, and why is it important?
3. Explain the differences between Business Intelligence (BI) and Data Science.
4. Describe the challenges of the current analytical architecture for data scientists.
5. What are the key skill sets and behavioral characteristics of a data scientist?

Definitions
What are data? Data are facts and statistics (measurements) that are collected for further analysis.
What is information? Information is an understanding of a particular subject, object, event, and/or process. Information can be obtained from various data sources.
What is knowledge? Knowledge is an understanding of information that is gained from experience or education.
What is intelligence? Intelligence is a mental capability involved in the ability to acquire and apply knowledge and skills.
Example: “Learning how to drive a car”

The Knowledge Cycle
Data, Information, Knowledge, Intelligence
[Figure labels: cloud computing, agents, Internet of Things, social nets, mobile computing, Web 3.0, Google Maps, knowledge management, ontology, etc.]

What Are Data Processing Capabilities?
Skill – Rule – Knowledge Triangle
[Figure: triangle with three levels (Knowledge Level, Rule Level, Skill Level); other labels: Learning, data]

What is Big Data?
Volume: High volume
Variety: From many diverse sources
Velocity: Changing rapidly
Value: Has some positive value (utility)

What is Big Data?
“Big Data” is data whose scale, distribution, diversity, and/or timeliness require the use of new technical architectures and analytics to enable insights that unlock new sources of business value.
Requires new data architectures
New analytical methodologies and tools
Integrating multiple skills (not only programming)
Source: McKinsey, May 2011 article, Big Data: The next frontier for innovation, competition, and productivity

Trends and Technologies: Data Explosion
Facebook has 40 PB of data and captures 100 TB/day; Twitter captures 8 TB/day
3+ billion devices are “connected” (2013)
Everything will be connected to everything: Internet of Things, Web 3.0
Every device will have processing power and storage (embedded intelligence)
Carry on servers
Many sensor devices (environment, body, dynamic sensors)
Location/contents sensitive networking

Trends and Technologies: Data Explosion
Smart phones/mobile devices generate half of the data traffic
  Possible issue: Communication bottleneck
Being able to track everything, everywhere, anytime
  Possible issue: Processing bottleneck
Quality of life vs. Quality of service

Trends and Technologies: Data Explosion
Several new technologies have emerged to meet the needs of four areas: Sensing, Networking, Analyzing, and Application.
[Figure: Sensing, Networking, Analyzing, Application; labels recoverable from the diagram include context-aware routing, middleware, filtering, dashboard, location/content based, and privacy; security; fault tolerance]

Big Data Structures
Data Growth is Increasingly Unstructured
The four types below are listed from more structured to less structured.
Structured: Data containing a defined data type, format, and structure. Example: Transaction data and OLAP
Semi-Structured: Textual data files with a discernible pattern, enabling parsing. Example: XML data files that are self-describing and defined by an XML schema
“Quasi” Structured: Textual data with erratic data formats that can be formatted with effort, tools, and time. Example: Web clickstream data that may contain some inconsistencies in data values and formats
Unstructured: Data that has no inherent structure and is usually stored as different types of files.
Example: Text documents, PDFs, images, and video

Examples of the four data structures
Structured Data
Semi-Structured Data: View Source
Quasi-Structured Data: http://www.google.com/#hl=en&sugexp=kjrmc&cp=8&gs_id=2m&xhr=t&q=data+scientist&pq=big+data&pf=p&sclient=psyb&source=hp&pbx=1&oq=data+sci&aq=0&aqi=g4&aql=f&gs_sm=&gs_upl=&bav=on.2,or.r_gc.r_pw.,cf.osb&fp=d566e0fbd09c8604&biw=1382&bih=651
Unstructured Data: The Red Wheelbarrow, by William Carlos Williams

Big Data Usage: Typical Tasks
Problem-solving: The process of working through the details of a problem to reach a solution. Problem-solving may include mathematical or systematic operations.
Learning: The process that builds on prior knowledge.
Decision making: The process of selecting a logical choice from the available options.
Planning: The process that (1) identifies the objectives to be achieved, (2) formulates strategies to achieve them, (3) arranges or creates the means required, and (4) implements, directs, and monitors all steps in their proper sequence.

Typical Scenario
Case Study: Vehicle’s Commute Time Prediction
1. Where is the data?
2. How to collect the data?
3. How to extract meaningful info from the data?
4. How to use meaningful info in …?

Where is Data?
Data are typically collected from sensors and cameras
  Needs dedicated infrastructure
  Specialized sensors and analytics platforms
May be difficult to customize
  E.g., analyzing new routes
Alternatives?

Web Mining
Factors which influence commute time, with a possible web source for each:
Time of the day: Google Maps
Accidents: Twitter, etc.
Weather: Environment Canada
Etc.

How to Collect Data?
Data is gathered from the following sources:
Google Maps API and Web interface
Twitter and QR 660 accident tweets
Environment Canada annual reports

Large-Scale Data Analytics
Engineering: A set of tools that can be used to transform facts and science into a set of useful deliverables (systems).
Large-Scale: Data of huge size and different types that cannot fit into a single computer's memory (sizes in TB and PB).
Analytics: Algorithms (e.g., data mining) used to answer questions like:
  What happened, what is the problem, what actions are needed?
  Why is this happening, what will happen next?
  What if we try this, what is the best scenario?
Statistics are not enough.
What are the differences between data processing

Would this result in an arrhythmia episode in the next 3 minutes?
Is this fraud? Fraud is the existence of unrealistic patterns in the data.

Business Intelligence
Business Goals: Examples
Optimize business operations: Sales, pricing, profitability, efficiency
Identify business risk: Customer churn, fraud, default
Predict new business opportunities: Upsell, cross-sell, best new customer prospects
Regulatory requirements: Anti-Money Laundering, Fair Lending

Large Scale Data Analytics, Why and How?
Descriptive Analytics: What happened, what is the problem, what actions are needed?
Predictive Analytics: Why is this happening, what will happen next?
Prescriptive Analytics: What if we try this, what is the best scenario?
[Figure: analytics techniques plotted against Value; labels: Query, Standard Reports, Alerts, Statistical Modeling, Randomized Testing, Predictive Modeling, Optimization; Business Intelligence is marked “We are Here!”]

Big Data Cycle
[Figure: Big Data Cycle; labels recoverable from the diagram include Question?, Data Integration, Big Data, More Insights, Multi-Modal Experience and Optimization, Assisted Experience and Management, Concluding Performance, Monitoring, Actionable Insights, Analytics, Decision making and Learning, Apply Modifications, recommendations, Knowledge, Transformation into Actions]

Expected Background
Mathematical and quantitative capabilities
Experience with statistical methods and basic statistical software, such as R, Python, Java, or Scala
Basic programming skills

Big Data Tools (Sandbox)
1. Data Analytics: Mahout, R, Python
2. High-level Programming: Hive, Pig
3. Batch Parallel Programming model: Hadoop
4. Streaming Programming model: Storm, Kafka
5. In-memory: Spark, Giraph
6. Data Management: HBase, MongoDB, MySQL
7. Distributed Coordination: ZooKeeper
8. Cluster Management: YARN
9. File Systems: HDFS, GPFS
10. IaaS: Amazon, Azure, OpenStack, Docker
11. Monitoring: Ganglia, Nagios

Infrastructure: Hadoop computing cluster

Data Mining Professions
1. Manager (business / domain expert).
2. Data Science Solution Architect.
3. Data Mining Application programmer (Data Scientist). This course is the first step.
4. Data Analyst.
5. Data Infrastructure (facilities administrator/operator: storage, cloud, computation).
How to gain experience in data mining?
To gain experience in data mining, you MUST have sufficient knowledge of the overlapping areas of expertise in business needs, domain knowledge, analytical skills, and programming to manage the end-to-end scientific method process through each stage in the data mining lifecycle.
http://bigdatawg.nist.gov/V1_output_docs.php

Data Mining and Analytics Skills for Data Scientists
Skill 1: Competences (not only learning but looking for impact)
1. Data Machine Learning
2. Data Management (query, format, quality, cleansing, preprocessing)
3. Scientific/Research Methods
4. Application/subject domain related business
5. Mathematics and Statistics
Skill 2: Data Mining tools and platforms
  Data Analytics platforms
  Math & Stats apps & tools
  Databases (SQL and NoSQL)
  Data Management and Curation platform
  Data and applications visualization
  Cloud-based platforms and tools
Skill 3: Programming, programming languages, and IDEs
  General and specialized development platforms for data analysis and statistics
Skill 4: Soft skills or Social Intelligence
  Personal and inter-personal communication, teamwork (also called social intelligence or soft skills)

Stay Alert! The Ford Challenge
Driving is a major part of our current lives and so are, unfortunately, car accidents. In this project from Kaggle.com, the objective is to use any combination of vehicular, environmental, and driver physiological data to design a classifier that will detect whether the driver is alert or not.

Diabetic Patients Data Analysis

Demands for Big Data Jobs
McKinsey Global Institute on Big Data Jobs (2011)
http://www.mckinsey.com/mgi/publications/big_data/index.asp
Estimated gap of 140,000 to 190,000 people with data analytics skills by 2018
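
Worked example: parsing quasi-structured data
The Big Data Structures slides describe web clickstream data as quasi-structured: it can be formatted with effort, tools, and time. The Python sketch below is not part of the original slides. It illustrates that idea on a shortened version of the Google search URL shown there. The function name parse_clickstream_url is hypothetical, and the sketch assumes the parameters sit in the URL fragment (after the #), falling back to the query string when there is no fragment.

```python
from urllib.parse import urlsplit, parse_qs


def parse_clickstream_url(url: str) -> dict[str, str]:
    """Turn the key=value pairs of a clickstream URL into a flat record.

    Hypothetical helper for illustration only. It reads the fragment
    (the part after '#') and falls back to the query string.
    """
    parts = urlsplit(url)
    raw = parts.fragment or parts.query
    # keep_blank_values=True keeps empty fields such as "gs_sm=" so that
    # inconsistencies in the data stay visible instead of being dropped.
    fields = parse_qs(raw, keep_blank_values=True)
    # parse_qs returns a list per key; keep the last value for each key.
    return {key: values[-1] for key, values in fields.items()}


# Shortened version of the URL from the slides (an assumption for brevity).
url = (
    "http://www.google.com/#hl=en&q=data+scientist&pq=big+data"
    "&oq=data+sci&gs_sm=&biw=1382&bih=651"
)
record = parse_clickstream_url(url)

print(record["q"])   # current search query
print(record["pq"])  # previous search query
```

Once parsed, each URL becomes a record with named fields (for example q and pq), which could then be loaded into a table and treated as structured data. Empty fields such as gs_sm are kept as empty strings so that they can be handled explicitly during cleansing.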