NDBI040 Lecture 1 Introduction
Charles University
2019
Martin Svoboda
Summary
These are lecture notes for a course on modern database concepts. They cover introductory topics, including Big Data, relational databases, and NoSQL databases, with details on the individual NoSQL database types and their features.
Full Transcript
NDBI040: Modern Database Concepts
http://www.ksi.mff.cuni.cz/~svoboda/courses/191-NDBI040/
Lecture 1: Introduction
Martin Svoboda
[email protected]ff.cuni.cz
1. 10. 2019
Charles University, Faculty of Mathematics and Physics

Lecture Outline
  Big Data
    Characteristics
    Current trends
  NoSQL databases
    Motivation
    Features
  Overview of NoSQL database types
    Key-value, wide column, document, graph, …

NDBI040: Modern Database Concepts | Lecture 1: Introduction | 1. 10. 2019

What is Big Data?
  Buzzword? Bubble? Gold rush? Revolution?
  Dan Ariely: Big Data: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it.

What is Big Data?
  No standard definition
  Gartner (research and advisory company):
    High Performance Computing
    Big Data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization.

Where is Big Data?
  Sources of Big Data
    Social media and networks
      …all of us are generating data
    Scientific instruments
      …collecting all sorts of data
    Mobile devices
      …tracking all objects all the time
    Sensor technology and networks
      …measuring all kinds of data

Big Data Characteristics
  Volume (Scale)
  Source: http://www.ibmbigdatahub.com/

Big Data Characteristics
  Variety (Complexity)
  Source: http://www.ibmbigdatahub.com/

Big Data Characteristics
  Velocity (Speed)
  Source: http://www.ibmbigdatahub.com/
Big Data Characteristics
  Veracity (Uncertainty)
  Source: http://www.ibmbigdatahub.com/

Big Data Characteristics
  Basic 4V
    Volume (Scale)
      Data volume is increasing exponentially, not linearly
      Even large amounts of small data can result in Big Data
    Variety (Complexity)
      Various formats, types, and structures (from semi-structured XML to unstructured multimedia)
    Velocity (Speed)
      Data is being generated fast and needs to be processed fast
    Veracity (Uncertainty)
      Uncertainty due to inconsistency, incompleteness, latency, ambiguities, or approximations

Big Data Characteristics
  Additional V and C
    Value
      Business value of the data (needs to be revealed)
    Validity
      Data correctness and accuracy with respect to the intended use
    Volatility
      Period of time the data is valid and should be maintained
    Cardinality
    Continuity
    Complexity

Big Data Characteristics
  Additional V
  Source: https://www.xenonstack.com/blog/big-data-engineering/ingestion-processing-big-data-iot-stream/

Relational Databases
  Data model
    Instance → database → table → row
  Query languages
    Real-world: SQL (Structured Query Language)
    Formal: Relational algebra, relational calculi (domain, tuple)
  Query patterns
    Selection based on complex conditions, projection, joins, aggregation, derivation of new values, recursive queries, …
  Representatives
    Oracle Database, Microsoft SQL Server, IBM DB2
    MySQL, PostgreSQL
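The relational query patterns listed above (selection, projection, join, aggregation) can be illustrated with a small sketch. It uses Python's built-in sqlite3 module purely for convenience; the tables `customer` and `purchase`, their columns, and the sample rows are made up for this example and do not come from the lecture.

```python
import sqlite3


def build_demo_db():
    """Create a tiny in-memory database (all names and rows are illustrative)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
        CREATE TABLE purchase (
            id INTEGER PRIMARY KEY,
            customer_id INTEGER REFERENCES customer(id),
            amount INTEGER
        );
        INSERT INTO customer VALUES (1, 'Alice', 'Prague'), (2, 'Bob', 'Brno');
        INSERT INTO purchase VALUES (1, 1, 40), (2, 1, 60), (3, 2, 25);
        """
    )
    return conn


def names_in_city(conn, city):
    """Selection (WHERE) combined with projection (only the name column)."""
    rows = conn.execute(
        "SELECT name FROM customer WHERE city = ? ORDER BY name", (city,)
    )
    return [row[0] for row in rows]


def spending_per_customer(conn):
    """Join of both tables followed by aggregation (SUM per customer)."""
    rows = conn.execute(
        """
        SELECT c.name, SUM(p.amount)
        FROM customer AS c
        JOIN purchase AS p ON p.customer_id = c.id
        GROUP BY c.id, c.name
        ORDER BY c.name
        """
    )
    return rows.fetchall()
```

Note that the second query has to join two tables just to answer a simple question about one customer; the later slides on normal forms and aggregates return to this point.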
Relational Databases
  Features: Normal Forms
    Model
      Functional dependencies
      1NF, 2NF, 3NF, BCNF (Boyce-Codd normal form)
    Objective
      Normalization of database schema to BCNF or 3NF
      Algorithms: decomposition or synthesis
    Motivation
      Diminish data redundancy, prevent update anomalies
    However:
      Data is scattered into small pieces (high granularity), and so these pieces have to be joined back together when querying!

Relational Databases
  Features: Transactions
    Model
      Transaction = flat sequence of database operations (READ, WRITE, COMMIT, ABORT)
    Objectives
      Enforcement of ACID properties
      Efficient parallel / concurrent execution (slow hard drives, …)
    ACID properties
      Atomicity: partial execution is not allowed (all or nothing)
      Consistency: transactions turn one valid database state into another
      Isolation: uncommitted effects are concealed among transactions
      Durability: effects of committed transactions are permanent

Current Trends
  Big Data
    Volume: terabytes → zettabytes
    Variety: structured → structured and unstructured data
    Velocity: batch processing → streaming data
    …
  Big users
    Population online, hours spent online, devices online, …
    Rapidly growing companies / web applications
      Even millions of users within a few months

Current Trends
  Everything is in cloud
    SaaS: Software as a Service
    PaaS: Platform as a Service
    IaaS: Infrastructure as a Service
  Processing paradigms
    OLTP: Online Transaction Processing
    OLAP: Online Analytical Processing
    …but also…
    RTAP: Real-Time Analytical Processing
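As a rough illustration of the OLTP and OLAP paradigms, and of the atomicity property from the transactions slide, the sketch below contrasts a short all-or-nothing transaction with an aggregate query over a whole table. It uses Python's sqlite3 module; the `account` table, its CHECK constraint, and the sample balances are assumptions made for this example only.

```python
import sqlite3


def make_db():
    """In-memory database with a hypothetical account table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE account ("
        "id INTEGER PRIMARY KEY, owner TEXT, "
        "balance INTEGER CHECK (balance >= 0))"
    )
    conn.executemany(
        "INSERT INTO account VALUES (?, ?, ?)",
        [(1, "alice", 100), (2, "bob", 50), (3, "carol", 250)],
    )
    conn.commit()
    return conn


def transfer(conn, src, dst, amount):
    """OLTP-style work: a short transaction touching a few rows.

    Both updates are committed together, or neither is (atomicity).
    """
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE id = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE id = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected a negative balance; the credit
        # made by the first UPDATE was rolled back as well.
        return False


def total_and_average(conn):
    """OLAP-style work: one read-only aggregate over the whole table."""
    return conn.execute("SELECT SUM(balance), AVG(balance) FROM account").fetchone()
```

A transfer that would overdraw the source account returns False and leaves both balances untouched, which is the "all or nothing" behaviour described under ACID.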
Current Trends
  Data assumptions
    Data format is becoming unknown or inconsistent
    Linear growth → unpredictable exponential growth
    Read requests often prevail over write requests
    Data updates are no longer frequent
    Data is expected to be replaced
    Strong consistency is no longer mission-critical

Current Trends
  ⇒ New approach is required
    Relational databases simply do not follow the current trends
  Key technologies
    Distributed file systems
    MapReduce and other programming models
    Grid computing, cloud computing
    NoSQL databases
    Data warehouses
    Large scale machine learning

NoSQL Databases
  What does NoSQL actually mean?
    A bit of history …
      1998
        First used for a relational database that omitted usage of SQL
      2009
        First used during a conference to advocate non-relational databases
    So?
      Not: no to SQL
      Not: not only SQL
      NoSQL is an accidental term with no precise definition

NoSQL Databases
  What does NoSQL actually mean?
    NoSQL movement = The whole point of seeking alternatives is that you need to solve a problem that relational databases are a bad fit for
    NoSQL databases = Next generation databases mostly addressing some of the points: being non-relational, distributed, open-source and horizontally scalable. The original intention has been modern web-scale databases. Often more characteristics apply as: schema-free, easy replication support, simple API, eventually consistent, a huge data amount, and more.
    Source: http://nosql-database.org/
Types of NoSQL Databases
  Core types
    Key-value stores
    Wide column (column family, column oriented, …) stores
    Document stores
    Graph databases
  Non-core types
    Object databases
    Native XML databases
    RDF stores
    …

Key-Value Stores
  Data model
    The simplest NoSQL database type
      Works as a simple hash table (mapping)
    Key-value pairs
      Key (id, identifier, primary key)
      Value: binary object, black box for the database system
  Query patterns
    Create, update or remove value for a given key
    Get value for a given key
  Characteristics
    Simple model ⇒ great performance, easily scaled, …
    Simple model ⇒ not for complex queries nor complex data

Key-Value Stores
  Suitable use cases
    Session data, user profiles, user preferences, shopping carts, …
      I.e. when values are only accessed via keys
  When not to use
    Relationships among entities
    Queries requiring access to the content of the value part
    Set operations involving multiple key-value pairs
  Representatives
    Redis, MemcachedDB, Riak KV, Hazelcast, Ehcache, Amazon SimpleDB, Berkeley DB, Oracle NoSQL, Infinispan, LevelDB, Ignite, Project Voldemort
    Multi-model: OrientDB, ArangoDB

Document Stores
  Data model
    Documents
      Self-describing
      Hierarchical tree structures (JSON, XML, …)
        - Scalar values, maps, lists, sets, nested documents, …
      Identified by a unique identifier (key, …)
    Documents are organized into collections
  Query patterns
    Create, update or remove a document
    Retrieve documents according to complex query conditions
  Observation
    Extended key-value stores where the value part is examinable!
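The difference between the two data models above can be sketched in a few lines of Python. This is a toy in-memory model under stated assumptions, not the API of any real product: the classes `KeyValueStore` and `DocumentStore`, their methods, and the sample profile are all hypothetical. The key-value store treats the value as opaque bytes, while the document store can examine the value when answering a query.

```python
import json


class KeyValueStore:
    """Toy key-value store: values are opaque bytes the store never inspects."""

    def __init__(self):
        self._data = {}

    def put(self, key, value: bytes):
        self._data[key] = value  # create or update

    def get(self, key):
        return self._data.get(key)

    def delete(self, key):
        self._data.pop(key, None)


class DocumentStore:
    """Toy document store: values are structured and can be queried."""

    def __init__(self):
        self._collections = {}

    def insert(self, collection, doc_id, document: dict):
        self._collections.setdefault(collection, {})[doc_id] = document

    def find(self, collection, **conditions):
        """Return ids of documents whose top-level fields match all conditions."""
        docs = self._collections.get(collection, {})
        return sorted(
            doc_id
            for doc_id, doc in docs.items()
            if all(doc.get(field) == value for field, value in conditions.items())
        )


profile = {"name": "Alice", "city": "Prague", "tags": ["admin"]}

kv = KeyValueStore()
kv.put("user:1", json.dumps(profile).encode("utf-8"))  # reachable by key only

ds = DocumentStore()
ds.insert("users", "user:1", profile)
ds.insert("users", "user:2", {"name": "Bob", "city": "Brno"})
prague_users = ds.find("users", city="Prague")  # query on document content
```

With the key-value store, finding all users from Prague would require the application to fetch and decode every value itself, which is why such queries appear under "When not to use".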
Document Stores
  Suitable use cases
    Event logging, content management systems, blogs, web analytics, e-commerce applications, …
      I.e. for structured documents with similar schema
  When not to use
    Set operations involving multiple documents
    Design of document structure is constantly changing
      I.e. when the required level of granularity would outbalance the advantages of aggregates

Document Stores
  Representatives
    MongoDB, Couchbase, Amazon DynamoDB, CouchDB, RethinkDB, RavenDB, Terrastore
    Multi-model: MarkLogic, OrientDB, OpenLink Virtuoso, ArangoDB

Wide Column Stores
  Data model
    Column family (table)
      Table is a collection of similar rows (not necessarily identical)
    Row
      Row is a collection of columns
        - Should encompass a group of data that is accessed together
      Associated with a unique row key
    Column
      Column consists of a column name and column value (and possibly other metadata records)
      Scalar values, but also flat sets, lists or maps may be allowed

Wide Column Stores
  Query patterns
    Create, update or remove a row within a given column family
    Select rows according to a row key or simple conditions
  Warning
    Wide column stores are not just a special kind of RDBMSs with a variable set of columns!

Wide Column Stores
  Suitable use cases
    Event logging, content management systems, blogs, …
      I.e. for structured flat data with similar schema
  When not to use
    ACID transactions are required
    Complex queries: aggregation (SUM, AVG, …), joining, …
    Early prototypes: i.e. when database design may change
  Representatives
    Apache Cassandra, Apache HBase, Apache Accumulo, Hypertable, Google Bigtable

Graph Databases
  Data model
    Property graphs
      Directed / undirected graphs, i.e. collections of …
        - nodes (vertices) for real-world entities, and
        - relationships (edges) between these nodes
      Both the nodes and relationships can be associated with additional properties
  Types of databases
    Non-transactional = small number of very large graphs
    Transactional = large number of small graphs

Graph Databases
  Query patterns
    Create, update or remove a node / relationship in a graph
    Graph algorithms (shortest paths, spanning trees, …)
    General graph traversals
    Sub-graph queries or super-graph queries
    Similarity based queries (approximate matching)
  Representatives
    Neo4j, Titan, Apache Giraph, InfiniteGraph, FlockDB
    Multi-model: OrientDB, OpenLink Virtuoso, ArangoDB

Graph Databases
  Suitable use cases
    Social networks, routing, dispatch, and location-based services, recommendation engines, chemical compounds, biological pathways, linguistic trees, …
      I.e. simply for graph structures
  When not to use
    Extensive batch operations are required
      Multiple nodes / relationships are to be affected
    Only too large graphs to be stored
      Graph distribution is difficult or impossible at all
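The property graph model and one of the listed graph algorithms (shortest paths) can be sketched as follows. This is a minimal in-memory illustration, not how any of the listed systems is implemented: the `PropertyGraph` class and its method names are hypothetical, relationships are treated as directed, and path length is counted in number of relationships using a breadth-first traversal.

```python
from collections import deque


class PropertyGraph:
    """Toy directed property graph; nodes and relationships carry properties."""

    def __init__(self):
        self.nodes = {}  # node id -> dict of properties
        self.edges = {}  # node id -> list of (target id, dict of properties)

    def add_node(self, node_id, **properties):
        self.nodes[node_id] = properties
        self.edges.setdefault(node_id, [])

    def add_relationship(self, source, target, **properties):
        self.edges.setdefault(source, []).append((target, properties))
        self.edges.setdefault(target, [])

    def shortest_path(self, start, goal):
        """Breadth-first search: path with the fewest relationships, or None."""
        parents = {start: None}
        queue = deque([start])
        while queue:
            current = queue.popleft()
            if current == goal:
                path = []
                while current is not None:
                    path.append(current)
                    current = parents[current]
                return path[::-1]
            for neighbour, _properties in self.edges.get(current, []):
                if neighbour not in parents:
                    parents[neighbour] = current
                    queue.append(neighbour)
        return None


graph = PropertyGraph()
for person in ("a", "b", "c", "d"):
    graph.add_node(person, kind="person")
graph.add_relationship("a", "b", type="KNOWS", since=2015)
graph.add_relationship("b", "c", type="KNOWS", since=2017)
graph.add_relationship("c", "d", type="KNOWS", since=2018)
graph.add_relationship("a", "d", type="KNOWS", since=2019)
```

Here `graph.shortest_path("a", "d")` follows the direct relationship rather than the three-step route through `b` and `c`. The traversal hops from node to node, which hints at why distributing a single graph across hosts is listed as difficult.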
Native XML Databases
  Data model
    XML documents
      Tree structure with nested elements, attributes, and text values (beside other less important constructs)
    Documents are organized into collections
  Query languages
    XPath: XML Path Language (navigation)
    XQuery: XML Query Language (querying)
    XSLT: XSL Transformations (transformation)
  Representatives
    Sedna, Tamino, BaseX, eXist-db
    Multi-model: MarkLogic, OpenLink Virtuoso

RDF Stores
  Data model
    RDF triples
      Components: subject, predicate, and object
      Each triple represents a statement about a real-world entity
    Triples can be viewed as graphs
      Vertices for subjects and objects
      Edges directly correspond to individual statements
  Query language
    SPARQL: SPARQL Protocol and RDF Query Language
  Representatives
    Apache Jena, rdf4j (Sesame), Algebraix
    Multi-model: MarkLogic, OpenLink Virtuoso

Features of NoSQL Databases
  Data model
    Traditional approach: relational model
    (New) possibilities:
      Key-value, document, wide column, graph
      Object, XML, RDF, …
    Goal
      Respect the real-world nature of data (i.e. data structure and mutual relationships)

Features of NoSQL Databases
  Aggregate structure
    Aggregate definition
      Data unit with a complex structure
        Collection of related data pieces we wish to treat as a unit (with respect to data manipulation and data consistency)
    Examples
      Value part of key-value pairs in key-value stores
      Document in document stores
      Row of a column family in wide column stores
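To make the aggregate definition more concrete, the sketch below takes data that a normalized relational schema would scatter over several tables and assembles it into one nested unit that could be stored under a single key. The order/customer domain, all field names, and the key format `order:100` are assumptions chosen for illustration.

```python
# Normalized, relational-style rows (all names and values are made up).
customers = {7: {"name": "Alice"}}
orders = {100: {"customer_id": 7}}
order_lines = [
    {"order_id": 100, "product": "keyboard", "quantity": 1},
    {"order_id": 100, "product": "cable", "quantity": 2},
    {"order_id": 101, "product": "mouse", "quantity": 1},
]


def build_order_aggregate(order_id):
    """Join the scattered rows into one nested unit (the aggregate)."""
    order = orders[order_id]
    return {
        "order_id": order_id,
        "customer": customers[order["customer_id"]]["name"],
        "lines": [
            {"product": line["product"], "quantity": line["quantity"]}
            for line in order_lines
            if line["order_id"] == order_id
        ],
    }


# In an aggregate-oriented store the whole unit sits under one key,
# so it is read and written as a single piece.
aggregate_store = {"order:100": build_order_aggregate(100)}
```

Where to draw the boundary is a design decision: here the order lines are embedded in the order, while the customer is reduced to a name, which suits reading whole orders but not queries across all customers.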
Features of NoSQL Databases
  Aggregate structure
    Types of systems
      Aggregate-ignorant: relational, graph
        - It is not a bad thing, it is a feature
      Aggregate-oriented: key-value, document, wide column
    Design notes
      No universal strategy how to draw aggregate boundaries
      Atomicity of database operations: just a single aggregate at a time

Features of NoSQL Databases
  Elastic scaling
    Traditional approach: scaling-up
      Buying bigger servers as database load increases
    New approach: scaling-out
      Distributing database data across multiple hosts
        - Graph databases (unfortunately): difficult or impossible at all
  Data distribution
    Sharding
      Particular ways how database data is split into separate groups
    Replication
      Maintaining several data copies (performance, recovery)

Features of NoSQL Databases
  Automated processes
    Traditional approach
      Expensive and highly trained database administrators
    New approach: automatic recovery, distribution, tuning, …
  Relaxed consistency
    Traditional approach
      Strong consistency (ACID properties and transactions)
    New approach
      Eventual consistency only (BASE properties)
      I.e. we have to make trade-offs because of the data distribution

Features of NoSQL Databases
  Schemalessness
    Relational databases
      Database schema present and strictly enforced
    NoSQL databases
      Relaxed schema or completely missing
      Consequences: higher flexibility
        - Dealing with non-uniform data
        - Structural changes cause no overhead
      However: there is (usually) an implicit schema
        - We must know the data structure at the application level anyway
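The remark about the implicit schema can be illustrated with a short sketch. The store accepts non-uniform documents without complaint, but the application code still has to know every layout it may encounter. The collection `users`, its three document layouts, and the helper functions are hypothetical.

```python
# Non-uniform documents in one collection (field names are made up).
users = [
    {"name": "Alice Novak", "email": "alice@example.org"},
    {"first": "Bob", "last": "Dvorak"},  # older layout, no e-mail stored
    {"name": "Carol", "emails": ["carol@example.org"]},  # newer layout
]


def display_name(doc):
    """The application must know both naming layouts: an implicit schema."""
    if "name" in doc:
        return doc["name"]
    return f'{doc["first"]} {doc["last"]}'


def primary_email(doc):
    """Handle the list-valued and the single-valued variant, or no e-mail."""
    if doc.get("emails"):
        return doc["emails"][0]
    return doc.get("email")


summary = [(display_name(doc), primary_email(doc)) for doc in users]
```

Adding the `emails` list required no change on the database side, but every reader of the data had to learn about it, so the schema has moved into the application rather than disappeared.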
Features of NoSQL Databases
  Open source
    Often community and enterprise versions (with extended features or extent of support)
  Simple APIs
    Often state-less application interfaces (HTTP)

Features of NoSQL Databases
  Current State: Five advantages
    Scaling
      Horizontal distribution of data among hosts
    Volume
      High volumes of data that cannot be handled by RDBMS
    Administrators
      No longer needed because of the automated maintenance
    Economics
      Usage of cheap commodity servers, lower overall costs
    Flexibility
      Relaxed or missing data schema, easier design changes

Features of NoSQL Databases
  Current State: Five challenges
    Maturity
      Often still in pre-production phase with key features missing
    Support
      Mostly open source, limited sources of credibility
    Administration
      Sometimes relatively difficult to install and maintain
    Analytics
      Missing support for business intelligence and ad-hoc querying
    Expertise
      Still low number of NoSQL experts available in the market

Conclusion
  The end of relational databases?
    Certainly not
      They are still suitable for most projects
      Familiarity, stability, feature set, available support, …
    However, we should also consider different database models and systems
      Polyglot persistence = usage of different data stores in different circumstances

Lecture Conclusion
  Big Data
    4V characteristics: volume, variety, velocity, veracity
  NoSQL databases
    (New) logical models
      Core: key-value, wide column, document, graph
      Non-core: XML, RDF, …
    (New) principles and features
      Horizontal scaling, data sharding and replication, eventual consistency, …
Course Overview
  Outline and Objectives
    Principles
      Scaling, distribution, consistency
      Transactions, visualization, …
    Technologies
      MapReduce programming model
      Apache Hadoop
    Data formats
      XML, JSON, RDF, …
    NoSQL databases
      Core: RiakKV, Redis, MongoDB, Cassandra, Neo4j
      Non-core: XML, RDF
      Data models, query languages, …