Big Data - W14S1 - PDF
Document Details
Uploaded by FlawlessHeliotrope6818
Institut Teknologi Bandung
Humasak T.A. Simanjuntak
Summary
This document is a lecture presentation on Big Data: its characteristics (volume, velocity, variety), its importance, and usage examples, followed by an introduction to NoSQL databases and MongoDB.
Full Transcript
Big Data
Humasak T.A. Simanjuntak

Overview
- Introduction: Explosion in Quantity of Data
- What's Big Data?
- Big Data Characteristics
- Importance of Big Data
- Usage Examples of Big Data

Introduction: Explosion in Quantity of Data
- 1946 (ENIAC) vs 2012 (LHC): ENIAC x 6,000,000 = 1 LHC (40 TB/s)
- Airbus A380: 1 billion lines of code, 640 TB per flight; each engine generates 10 TB every 30 minutes
- Twitter (X): generates approximately 12 TB of data per day
- New York Stock Exchange: 1 TB of data every day
- Storage capacity has doubled roughly every three years since the 1980s
[Figure: internet usage statistics from www.internetlivestats.com, captured on 22 November 2021]

Introduction: Our Data-driven World
- Science: databases from astronomy, genomics, environmental data, transportation data, ...
- Humanities and Social Sciences: scanned books, historical documents, social interaction data, new technology like GPS, ...
- Business & Commerce: corporate sales, stock market transactions, census, airline traffic, ...
- Entertainment: internet images, Hollywood movies, MP3 files, ...
- Medicine: MRI & CT scans, patient records, ...

What's Big Data?
No single definition; here is one from Wikipedia: "Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization. The trend to larger data sets is due to the additional information derivable from analysis of a single large set of related data, as compared to separate smaller sets with the same total amount of data, allowing correlations to be found to spot business trends, determine quality of research, prevent diseases, link legal citations, combat crime, and determine real-time roadway traffic conditions."

Big Data Characteristics
- How big is Big Data? What is big today may not be big tomorrow.
- Any data that challenges our current technology in some manner can be considered Big Data: in volume, in communication, in speed of generation, or in meaningful analysis.

Big Data: 3 V's
Big Data vectors (3 V's): "Big data are high-volume, high-velocity, and/or high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization." (Gartner, 2012)
- High volume: the amount of data
- High velocity: the speed at which data is collected, acquired, generated, or processed
- High variety: different data types such as audio, video, and image data (mostly unstructured)

Volume (Scale)
- Data volume is increasing exponentially: a 44x increase from 2009 to 2020, from 0.8 zettabytes to 35 ZB
- Examples of the exponential increase in collected/generated data:
  - 12+ TB of tweet data every day
  - 25+ TB of log data every day
  - 30 billion RFID tags today (1.3 billion in 2005)
  - 4.6 billion camera phones worldwide
  - Hundreds of millions of GPS-enabled devices sold annually
  - 2+ billion people on the Web by end of 2011
  - 76 million smart meters in 2009, 200 million by 2014
- CERN's Large Hadron Collider (LHC) generates 15 PB a year (photo: Maximilien Brice, © CERN)
- The EarthScope is the world's largest science project. Designed to track North America's geological evolution, this observatory records data over 3.8 million square miles, amassing 67 terabytes of data. It analyzes seismic slips in the San Andreas fault, as well as the plume of magma underneath Yellowstone and much more. (http://www.msnbc.msn.com/id/44363598/ns/technology_and_science-future_of_technology/#.TmetOdQ--uI)

Variety (Complexity)
- Relational data (tables / transactions / legacy data)
- Text data (Web)
- Semi-structured data (XML)
- Graph data: social networks, Semantic Web (RDF), ...
- Streaming data: you can only scan the data once
- A single application can generate and collect many types of data
- Big public data (online, weather, finance, etc.)
- To extract knowledge, all of these types of data need to be linked together

A Single View of the Customer
[Figure: one customer profile linking social media, banking, finance, gaming, purchase and entertainment history, and other known history]

Velocity (Speed)
- Data is being generated fast and needs to be processed fast
- Online data analytics: late decisions mean missed opportunities
- Examples:
  - E-promotions: based on your current location, your purchase history, and what you like, send promotions right now for the store next to you
  - Healthcare monitoring: sensors monitor your activities and body; any abnormal measurement requires an immediate reaction

Real-time/Fast Data
- Mobile devices (tracking all objects all the time)
- Social media and networks (all of us are generating data)
- Scientific instruments (collecting all sorts of data)
- Sensor technology and networks (measuring all kinds of data)
Progress and innovation are no longer hindered by the ability to collect data, but by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely manner and in a scalable fashion.

Real-Time Analytics/Decision Requirements
- Product recommendations that are relevant and compelling
- Learning why customers switch to competitors and their offers, in time to counter
- Influencing customer behavior
- Friend invitations to join a game or activity that expands the business
- Improving the marketing effectiveness of a promotion while it is still in play
- Preventing fraud as it is occurring, and preventing more of it proactively

Some make it 4 V's.

Harnessing Big Data
- OLTP: Online Transaction Processing (DBMSs)
- OLAP: Online Analytical Processing (data warehousing)
- RTAP: Real-Time Analytics Processing (Big Data architecture and technology)

The Model Has Changed
The model of generating and consuming data has changed.
- Old model: a few companies generate data, all others consume it
- New model: all of us generate data, and all of us consume it

What's Driving Big Data
A shift from ad-hoc querying and reporting, data mining techniques, structured data from typical sources, and small to mid-size datasets, toward optimizations and predictive analytics, complex statistical analysis, all types of data from many sources, very large datasets, and more real-time processing.

The Evolution of Business Intelligence
- 1990s: BI and reporting on data warehouses (Business Objects, SAS, Informatica, Cognos and other SQL reporting tools)
- 2000s: interactive business intelligence, in-memory RDBMS, real time and single view, OLAP (QlikView, Tableau, HANA)
- 2010s: Big Data for speed (graph databases) and Big Data for scale (batch processing over distributed data stores: Hadoop/Spark; HBase/Cassandra)
Big Data Analytics
- Big data is more real-time in nature than traditional data warehouse (DW) applications
- Traditional DW architectures (e.g. Exadata, Teradata) are not well suited for big data applications
- Shared-nothing, massively parallel, scale-out architectures are well suited for big data applications

Big Data Technology
[Figure: overview of the big data technology landscape]

Importance of Big Data
- Government: in 2012 the Obama administration announced the Big Data Research and Development Initiative, with 84 different big data programs spread across six departments
- Private sector:
  - Wal-Mart Stores, Inc. handles more than 1 million customer transactions every hour, imported into databases estimated to contain more than 2.5 petabytes of data
  - Facebook handles 40 billion photos from its user base
  - The Falcon credit card fraud detection system protects 2.1 billion active accounts worldwide
- Science:
  - The Large Synoptic Survey Telescope will generate 140 terabytes of data every 5 days
  - The Large Hadron Collider produced 13 petabytes of data in 2010
  - Medical computation such as decoding the human genome
  - A social science revolution and a new way of doing science (the microscope example)

Importance of Big Data: Technology Players in This Field
- Oracle: Exadata
- Microsoft: HDInsight Server
- IBM: Netezza

Usage Example of Big Data: Moneyball
- Moneyball: The Art of Winning an Unfair Game tells the story of the Oakland Athletics baseball team and its general manager Billy Beane
- The Oakland A's front office took advantage of more analytical gauges of player performance to field a team that could compete successfully against richer competitors in MLB
- Oakland had approximately $41 million in salary; the New York Yankees had $125 million in payroll that same season, so Oakland was forced to find players undervalued by the market
- Moneyball had a huge impact on other teams in MLB, and there is a Moneyball movie!

Usage Example of Big Data: US 2012 Election
Obama campaign:
- Predictive modeling and data mining for individualized ad targeting
- mybarackobama.com used to drive traffic to other campaign sites
- Facebook page (33 million "likes") and YouTube channel (240,000 subscribers and 246 million page views)
- A contest to dine with Sarah Jessica Parker, plus outreach on Reddit
- Every single night the team ran 66,000 computer simulations on Amazon Web Services
Romney campaign:
- The Orca big-data app
- YouTube channel (23,700 subscribers and 26 million page views)
- Support from blogs such as Ace of Spades HQ

Usage Example of Big Data: Predictions for the US 2012 Election
While the media continued reporting the race as very tight:
- Drew Linzer (June 2012) predicted 332 electoral votes for Obama and 206 for Romney
- Nate Silver's FiveThirtyEight blog predicted Obama had an 86% chance of winning, and predicted all 50 states correctly
- Sam Wang of the Princeton Election Consortium put the probability of Obama's re-election at more than 98%

Modern Databases: NoSQL and NewSQL
- Relational DBs cannot handle web scale. Or can they? To be honest, the jury is still out on this one.
- NoSQL: an attempt at using non-relational solutions
- NewSQL: scaling relational DBs
The NoSQL Movement
- "Not Only SQL": it is not "No SQL" ("not only relational" would have been a better name)
- Use the right tools (DBs) for the job
- It is more like a feature set, or rather the absence of a feature set

Definition from nosql-databases.org: "Next Generation Databases mostly addressing some of the points: being non-relational, distributed, open-source and horizontally scalable. The original intention has been modern web-scale databases. The movement began early 2009 and is growing rapidly. Often more characteristics apply such as: schema-free, easy replication support, simple API, eventually consistent/BASE (not ACID), a huge data amount, and more. So the misleading term 'nosql' (the community now translates it mostly with 'not only sql') should be seen as an alias for something like the definition above."

NoSQL (http://nosql-database.org/)
- Non-relational
- Scalability: vertical (add capacity to a single server) and horizontal (add more servers and storage)
- A collection of structures: hashtables, maps, dictionaries
- No pre-defined schema
- No join operations
- CAP rather than ACID:
  - CAP: Consistency, Availability and Partition tolerance (but not all three at once!)
  - ACID: Atomicity, Consistency, Isolation and Durability

Advantages of NoSQL
- Cheap and easy to implement
- Data is replicated and can be partitioned
- Easy to distribute
- Does not require a schema
- Can scale up and down
- Quickly processes large amounts of data
- Relaxes the data consistency requirement (CAP)
- Can handle web-scale data where relational DBs cannot

Disadvantages of NoSQL
- New and sometimes buggy
- Data is generally duplicated, with potential for inconsistency
- No standardized schema, no standard query format, no standard language
- Difficult to impose complicated structures
- Depends on the application layer to enforce data integrity
- No guarantee of support
- Too many options: which one, or ones, to pick?

NoSQL Presentation
Introduction to NoSQL by John Nunemaker, added to our course pages:
- http://glennas.wordpress.com/2011/03/11/introduction-to-nosql-john-nunemaker-presentation-from-june-2010/
- Movie: http://www.cs.sun.ac.za/rw334/nosql.mp4
- Slides: http://www.cs.sun.ac.za/rw334/whynosql.pdf

NoSQL Options
- Key-value stores
- Column stores
- Document stores
- Graph stores

Key-Value Stores
- A technology you know, love, and use all the time, for example a hashmap:
  - Put(key, value)
  - value = Get(key)
- Examples: Redis (an in-memory store), Memcached, and hundreds more (a minimal sketch of the put/get interface follows)
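To make the put/get contract concrete, here is a minimal JavaScript sketch that uses a plain Map as a stand-in for a key-value store such as Redis or Memcached; the function names and keys are illustrative, not taken from the slides.

    // A Map plays the role of the key-value store: values are opaque to the
    // store and can only be looked up by their key.
    const store = new Map();

    function put(key, value) {
      store.set(key, value);
    }

    function get(key) {
      return store.get(key);            // undefined if the key is absent
    }

    put("user:42", { name: "Ani", city: "Bandung" });
    console.log(get("user:42"));        // -> { name: 'Ani', city: 'Bandung' }

In a real store such as Redis the same contract holds over the network (for example via SET and GET commands): the store never interprets the value, which is what makes key-value stores easy to partition and scale.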
Column Stores
- Not to be confused with the relational-DB version of this (Sybase IQ, etc.)
- A multi-dimensional map; not all entries are relevant each time (column families)
- Examples: Cassandra, HBase, Amazon SimpleDB

Document Stores
- Key-document stores: since the document can be seen as a value, this can be considered a superset of key-value stores
- The big difference is that in a document store one can also query on the document, i.e. the document portion is structured (not just a blob of data)
- Examples: MongoDB, CouchDB

Graph Stores
- Use a graph structure: a labeled, directed, attributed multi-graph (a label for each edge, directed edges, multiple attributes per node, multiple edges between nodes)
- Relational DBs can model graphs, but following an edge requires a join, which is expensive
- Example: Neo4j (http://www.infoq.com/articles/graph-nosql-neo4j)

Example: MongoDB
- MongoDB stores data in the form of documents, which are JSON-like field-and-value pairs
- Documents are analogous to structures in programming languages that associate keys with values, where keys may hold other pairs of keys and values (e.g. dictionaries, hashes, maps, and associative arrays)
- MongoDB documents are BSON documents: a binary representation of JSON with additional type information
- MongoDB stores all documents in collections. A collection is a group of related documents that share a set of common indexes. Collections are analogous to tables in relational databases.

Survey
- Database popularity ranking: https://db-engines.com/en/ranking

JSON vs XML
- JSON (JavaScript Object Notation) is a data format for transferring data between different devices, operating systems, and programming languages, and is used for data storage in MongoDB
- JSON supports the embedding of documents and arrays within other documents and arrays
- Advantages of JSON:
  - Smaller than XML, so data transfer is faster and saves resources (bandwidth)
  - JSON is JavaScript: no additional library is needed for processing
  - Simpler than XML and easy for programmers to use

Features of MongoDB: Ad Hoc Queries
- MongoDB supports search by field, range queries, and regular expression searches
- Queries can return specific fields of documents and can also include user-defined JavaScript functions

Features of MongoDB: Indexing
- Any field in a MongoDB document can be indexed (indexes in MongoDB are conceptually similar to those in RDBMSs); secondary indexes are also available (a brief query and indexing sketch follows)
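As a rough illustration of the ad hoc query and indexing features described above, here is a short mongosh sketch; the collection name, fields, and values are hypothetical and not taken from the slides.

    // Hypothetical 'users' collection; field names are illustrative.
    db.users.find({ age: { $gte: 21 } });                     // range query
    db.users.find({ name: /^Sim/ });                          // regular expression search
    db.users.find({ city: "Bandung" }, { name: 1, _id: 0 });  // return specific fields only

    // Secondary index on a field other than _id, conceptually like an RDBMS index.
    db.users.createIndex({ age: 1 });                         // 1 = ascending order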
Features of MongoDB: Replication
- MongoDB provides high availability and increased throughput with replica sets. A replica set consists of two or more copies of the data.
- Each replica may act in the role of primary or secondary at any time. The primary replica performs all writes and reads by default; secondary replicas maintain a copy of the primary's data using built-in replication.
- When the primary replica fails, the replica set automatically conducts an election to determine which secondary should become the new primary.
- Secondaries can also serve read operations, but that data is eventually consistent by default.

Features of MongoDB: File Storage
- MongoDB can be used as a file system, taking advantage of load balancing and data replication over multiple machines for storing files.
- This function, called GridFS, is included with MongoDB drivers and readily available for the supported development languages (see "Language Support" for the list). MongoDB exposes functions for file manipulation and content to developers.
- GridFS is a specification for storing and retrieving files that exceed the BSON document size limit of 16 MB.

Features of MongoDB: Load Balancing
- MongoDB scales horizontally using sharding, a method for storing data across multiple machines.
- The user chooses a shard key, which determines how the data in a collection will be distributed: the data is split into ranges (based on the shard key) and distributed across multiple shards. (A shard is a master with one or more slaves.)

Features of MongoDB: Aggregation
- Aggregations are operations that process data records and return computed results. MongoDB provides a rich set of aggregation operations that examine and perform calculations on data sets.
- MapReduce can be used for batch processing of data and for aggregation operations:
  - A map stage processes each document and emits one or more objects per input document
  - A reduce phase combines the output of the map operation
- The aggregation framework enables users to obtain the kind of results for which the SQL GROUP BY clause is used.

MongoDB: CRUD Introduction
- A document is a set of field-and-value pairs; a collection is a group of MongoDB documents.

MongoDB: CRUD Operations
- Insert operation
- Select operation: the db.collection.find() method, optionally combined with further query stages
- Update operation
- Delete operation
(See the sketch below for shell examples.)
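The insert, select, update, and delete slides referenced shell examples that are not reproduced in this transcript. The following mongosh sketch is a stand-in, assuming a hypothetical students collection; field names and values are invented, and the final pipeline mirrors the $match/$group (GROUP BY-like) pattern described in the aggregation section above.

    // Create (insert)
    db.students.insertOne({ name: "Andi", major: "Informatics", gpa: 3.4 });
    db.students.insertMany([
      { name: "Budi",  major: "Informatics", gpa: 3.1 },
      { name: "Citra", major: "Mathematics", gpa: 3.8 }
    ]);

    // Read (select) with db.collection.find()
    db.students.find({ major: "Informatics" });              // filter by field
    db.students.find({ gpa: { $gt: 3.2 } }, { name: 1 });    // range filter + projection

    // Update one matching document
    db.students.updateOne({ name: "Budi" }, { $set: { gpa: 3.3 } });

    // Delete one matching document
    db.students.deleteOne({ name: "Citra" });

    // Aggregation: average GPA per major, similar to SQL GROUP BY
    db.students.aggregate([
      { $match: { gpa: { $gte: 3.0 } } },
      { $group: { _id: "$major", avgGpa: { $avg: "$gpa" } } }
    ]);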
MongoDB: Data Model
- Data in MongoDB has a flexible schema. Unlike SQL databases, where you must determine and declare a table's schema before inserting data, MongoDB's collections do not enforce document structure.
- This flexibility facilitates mapping documents to an entity or an object: each document can match the data fields of the represented entity, even if the data varies substantially.
- Remember:
  - The key challenge in data modeling is balancing the needs of the application, the performance characteristics of the database engine, and the data retrieval patterns.
  - Always consider how the application uses the data (queries, updates, and processing) as well as the inherent structure of the data itself.

MongoDB: Data Model (Example)
[Figure: example document structure]

MongoDB: Data Model (Key Decisions)
- Key decisions in designing data models for MongoDB are the structure of documents and how the application represents relationships between data.
- Two tools allow applications to represent these relationships: references and embedded documents.

Embedded Data (Denormalized Data Model)
- Embedded documents capture relationships between data by storing related data in a single document structure. MongoDB documents make it possible to embed document structures as sub-documents in a field or array within a document.
- This allows applications to retrieve and manipulate related data in a single database operation and to store related pieces of information in the same database record, so applications may need to issue fewer queries and updates to complete common operations.
- Use embedding when:
  - There are "contains" (one-to-one) relationships between entities
  - There are one-to-many relationships in which the "many" (child) documents always appear with, or are viewed in the context of, the "one" (parent) document
- Benefits: better performance for read operations (related data is requested and retrieved in a single database operation), and related data can be updated in a single atomic write operation.

References (Normalized Data Model)
- References capture the relationships between data by including links, or references, from one document to another.
- Use a normalized data model when:
  - Embedding would duplicate data without providing sufficient read-performance advantages to outweigh the implications of the duplication
  - Representing more complex many-to-many relationships
  - Modeling large hierarchical data sets
- References provide more flexibility than embedding, but client-side applications must issue follow-up queries to resolve them; in other words, normalized data models can require more round trips to the server.

1-to-1 Relationships with Embedded Documents
- Example: mapping the patron and address relationship. Embedding has an advantage over referencing when you need to view one data entity in the context of the other.
- In this one-to-one relationship the address belongs to the patron. If the address data is frequently retrieved with the name information, then with referencing the application must issue multiple queries to resolve the reference; with the embedded data model, the application can retrieve the complete patron information with one query.

1-to-Many Relationships with Embedded Documents
- Example: mapping the patron and address relationship where the patron has multiple address entities. Embedding has an advantage over referencing when you need to view many data entities in the context of another.
- If the application frequently retrieves the address data with the name information, referencing forces it to issue multiple queries to resolve the references; the more optimal schema embeds the addresses, so the application can retrieve the complete patron information with one query.

Many-to-Many Relationships with Document References
- Example: the publisher and book relationship. This illustrates the advantage of referencing over embedding, to avoid repetition of the publisher information: embedding the publisher document inside each book document would lead to repetition of the publisher data.
- To avoid that repetition, use references and keep the publisher information in a separate collection from the book collection.
- When using references, the growth of the relationship determines where to store the reference:
  - If the number of books per publisher is small with limited growth, storing the book references inside the publisher document may sometimes be useful
  - Otherwise, if the number of books per publisher is unbounded, that data model would lead to mutable, growing arrays
  - So, store the publisher reference inside the book document
(A sketch of the embedded and referenced styles follows.)
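To make the two modeling styles concrete, here is a hypothetical mongosh sketch of an embedded patron/address document and a referenced publisher/book pair; all collection names, _id values, and fields are invented for illustration.

    // Embedded (denormalized): addresses live inside the patron document,
    // so a single query returns the patron together with all addresses.
    db.patrons.insertOne({
      _id: "patron1",
      name: "Rina",
      addresses: [
        { street: "Jl. Ganesha 10", city: "Bandung", zip: "40132" },
        { street: "Jl. Merdeka 5",  city: "Jakarta", zip: "10110" }
      ]
    });
    db.patrons.findOne({ _id: "patron1" });   // one query, complete patron info

    // Referenced (normalized): publisher data is stored once, and each book
    // carries a publisher_id reference, avoiding repetition of publisher data.
    db.publishers.insertOne({ _id: "acme", name: "Acme Publishing", city: "Bandung" });
    db.books.insertMany([
      { title: "Intro to NoSQL",  publisher_id: "acme" },
      { title: "Big Data Basics", publisher_id: "acme" }
    ]);

    // Resolving the reference requires a follow-up query (an extra round trip).
    const book = db.books.findOne({ title: "Intro to NoSQL" });
    db.publishers.findOne({ _id: book.publisher_id });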
NewSQL
- Just like NoSQL, NewSQL is more of a movement than a specific product or product family; the "New" refers to the vendors, not to SQL.
- Goals:
  - Bring the benefits of the relational model to distributed architectures (VoltDB, ScaleDB, etc.), or
  - Improve relational DB performance so that horizontal scaling is no longer required (Tokutek, ScaleBase, etc.)
- "SQL-as-a-service": Amazon RDS, Microsoft SQL Azure, Google Cloud SQL

Thanks