Summary

This lecture covers the fundamentals of Big Data, including its characteristics, technologies, and applications. The different types of data are also explained, along with the various methods of analyzing and processing large volumes of data.

Full Transcript

1 References.

2 Big-Data
- Large-Scale Data Management
- Big Data Analytics
- Data Science and Analytics
How to manage very large amounts of data and extract value and knowledge from them

3 4

5 Revision

6

7 Enterprise data is data that is shared by the users of an organization, generally across departments and/or geographic regions. Because enterprise data loss can result in significant financial losses for all parties involved, enterprises spend time and resources on careful and effective data modeling, solutions, security and storage.

8

9 Data mining is the process of sorting through large data sets to identify patterns and relationships that can help solve business problems through data analysis. Data mining techniques and tools help enterprises to predict future trends and make more informed business decisions.

10 What is Big Data? What makes data “Big” Data?

11 Big Data Definition
- No single standard definition…
- “Big Data” is data whose scale, diversity, and complexity require new architecture, techniques, algorithms, and analytics to manage it and extract value and hidden knowledge from it…

12 Big Data: 3V’s

13 Characteristics of Big Data: 1-Scale (Volume)
- Data volume
  - 44x increase from 2009 to 2020
  - From 0.8 zettabytes to 35 zettabytes
- Data volume is increasing exponentially
[Figure: exponential increase in collected/generated data]

14 Characteristics of Big Data: 2-Complexity (Variety)
- Various formats, types, and structures
- Text, numerical, images, audio, video, sequences, time series, social media data, multi-dim arrays, etc…
- Static data vs. streaming data
- A single application can be generating/collecting many types of data
To extract knowledge ➔ all these types of data need to be linked together

15 Characteristics of Big Data: 3-Speed (Velocity)
- Data is being generated fast and needs to be processed fast
- Online Data Analytics
- Late decisions ➔ missing opportunities
- Examples
  - E-Promotions: based on your current location, your purchase history, what you like ➔ send promotions right now for the store next to you
  - Healthcare monitoring: sensors monitoring your activities and body ➔ any abnormal measurements require immediate reaction

16 Some Make it 4V’s

17 Harnessing Big Data
- OLTP: Online Transaction Processing (DBMSs)
- OLAP: Online Analytical Processing (Data Warehousing)
- RTAP: Real-Time Analytics Processing (Big Data Architecture & technology)

18 Who’s Generating Big Data
- Mobile devices (tracking all objects all the time)
- Scientific instruments (collecting all sorts of data)
- Social media and networks (all of us are generating data)
- Sensor technology and networks (measuring all kinds of data)
Progress and innovation are no longer hindered by the ability to collect data, but by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely manner and in a scalable fashion

19 The Model Has Changed…
- The model of generating/consuming data has changed
- Old model: few companies are generating data, all others are consuming data
- New model: all of us are generating data, and all of us are consuming data

20 What’s driving Big Data
- Optimizations and predictive analytics
- Complex statistical analysis
- All types of data, and many sources
- Very large datasets
- More real-time
- Ad-hoc querying and reporting
- Data mining techniques
- Structured data, typical sources
- Small to mid-size datasets

21 Value of Big Data Analytics
- Big data is more real-time in nature than traditional DW applications
- Traditional DW architectures (e.g. Exadata, Teradata) are not well-suited for big data apps
- Shared-nothing, massively parallel processing, scale-out architectures are well-suited for big data apps

22 Challenges in Handling Big Data
- The bottleneck is in technology
  - New architecture, algorithms, techniques are needed
- Also, in technical skills
  - Experts in using the new technology and dealing with big data

23 What Technology Do We Have For Big Data?

24

25 Big Data Technology

26 What is Data Mining?
- Discovery of useful, possibly unexpected, patterns in data
- Non-trivial extraction of implicit, previously unknown and potentially useful information from data
- Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns

Data Mining Tasks
- Classification [Predictive]
- Clustering [Descriptive]
- Association Rule Discovery [Descriptive]
- Sequential Pattern Discovery [Descriptive]
- Regression [Predictive]
- Deviation Detection [Predictive]
- Collaborative Filtering [Predictive]

Clustering
[Figure: clustering example; axis labels: income, education, age]

29 Association Rule Mining
Sales records (market-basket data):
  tran1  cust33  p2, p5, p8
  tran2  cust45  p5, p8, p11
  tran3  cust12  p1, p9
  tran4  cust40  p5, p8, p11
  tran5  cust12  p2, p9
  tran6  cust12  p9
- Trend: products p5, p8 are often bought together
- Trend: customer 12 likes product p9

30 Collaborative Filtering
- Goal: predict what movies/books/… a person may be interested in, on the basis of
  - Past preferences of the person
  - Other people with similar past preferences
  - The preferences of such people for a new movie/book/…
- One approach is based on repeated clustering
  - Cluster people on the basis of preferences for movies
  - Then cluster movies on the basis of being liked by the same clusters of people
  - Again cluster people based on their preferences for (the newly created clusters of) movies
  - Repeat the above until equilibrium
- The above problem is an instance of collaborative filtering, where users collaborate in the task of filtering information to find information of interest

31 Other Types of Mining
- Text mining: application of data mining to textual documents
  - Cluster Web pages to find related pages
  - Cluster pages a user has visited to organize their visit history
  - Classify Web pages automatically into a Web directory
- Graph mining
  - Deals with graph data

32 Data Streams
- What are data streams?
  - Continuous streams
  - Huge, fast, and changing
- Why data streams?
  - The arrival speed of streams and the huge amount of data are beyond our capability to store them
  - “Real-time” processing
- Window models
  - Landmark window (entire data stream)
  - Sliding window
  - Damped window
- Mining data streams

33 34
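The three window models listed on the Data Streams slide can be illustrated with a running mean. This is only a sketch under assumptions that are not in the lecture: the function names, the choice of the mean as the statistic, and the `size` and `decay` parameters are all hypothetical. Each function sees every item once and never stores the full stream.

```python
from collections import deque


def landmark_means(stream):
    """Mean over everything seen since the start of the stream."""
    count, total = 0, 0.0
    for x in stream:
        count += 1
        total += x
        yield total / count


def sliding_window_means(stream, size):
    """Mean over only the most recent `size` items (assumed parameter)."""
    window = deque()
    total = 0.0
    for x in stream:
        window.append(x)
        total += x
        if len(window) > size:
            total -= window.popleft()  # the oldest item leaves the window
        yield total / len(window)


def damped_means(stream, decay):
    """Weighted mean in which every older item is scaled by `decay` per step.

    `decay` is an assumed parameter between 0 and 1; smaller values forget faster.
    """
    weighted_sum, weight = 0.0, 0.0
    for x in stream:
        weighted_sum = decay * weighted_sum + x
        weight = decay * weight + 1.0
        yield weighted_sum / weight


if __name__ == "__main__":
    readings = [10, 12, 11, 50, 13]  # made-up sensor values
    print(list(landmark_means(readings)))
    print(list(sliding_window_means(readings, size=2)))
    print(list(damped_means(readings, decay=0.5)))
```

With `decay=1.0` the damped model gives every item equal weight and behaves like the landmark model; lowering `decay` makes it track recent data more closely, in the same spirit as shrinking the sliding window.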
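The market-basket records from the Association Rule Mining slide (slide 29) can be used to check the "p5, p8 often bought together" trend by counting how often each pair of products co-occurs. The sketch below is one simple way to do that, not the lecture's own method: the function names and the `min_support` threshold of 3 are assumptions chosen for illustration.

```python
from collections import Counter
from itertools import combinations

# Transactions copied from the slide; customer ids are omitted here.
transactions = {
    "tran1": {"p2", "p5", "p8"},
    "tran2": {"p5", "p8", "p11"},
    "tran3": {"p1", "p9"},
    "tran4": {"p5", "p8", "p11"},
    "tran5": {"p2", "p9"},
    "tran6": {"p9"},
}


def frequent_pairs(baskets, min_support):
    """Return product pairs that appear together in at least `min_support` baskets."""
    counts = Counter()
    for items in baskets:
        # Sorting gives each pair one canonical order before counting.
        for pair in combinations(sorted(items), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


def confidence(baskets, antecedent, consequent):
    """Fraction of baskets containing `antecedent` that also contain `consequent`."""
    with_antecedent = [b for b in baskets if antecedent in b]
    if not with_antecedent:
        return 0.0
    return sum(consequent in b for b in with_antecedent) / len(with_antecedent)


if __name__ == "__main__":
    baskets = list(transactions.values())
    print(frequent_pairs(baskets, min_support=3))  # assumed threshold
    print(confidence(baskets, "p5", "p8"))
```

On these six transactions only the pair (p5, p8) reaches the assumed threshold, and every basket containing p5 also contains p8, which matches the trend stated on the slide.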
