0-Introduction_merged (1).pdf

Databases and Information Systems (DIS) Dr. Annett Ungethüm Universität Hamburg Foto: UHH/Esfandiari Course Outline Architectures of Database Systems Transaction Management Modern Database Technology Data Warehouses and OLAP Data Mining Big Data Analytics Exercises ▪ Will start on 09.04.2024, first deadline 23.04.2024 ▪ Working period: usually 1 or 2 weeks ▪ Handling ▪ Teams of students (~2 students) ▪ Mostly practical exercises ▪ Exercise hours in Stellingen/Informatikum, different buildings → Look at STiNE for the building and room of your exercise ▪ You can also work on exercises at home or at IRZ ▪ Demonstration of results/solutions during exercise hours, see advisors! 3 Requirements to obtain certificate ▪ Successful and timely submission: All but one exercise sheets must be submitted and approved on time. What exactly is required for an approval, depends on the specification of the task. For theoretical tasks, 50% of the maximum points are required in any case. ▪ Teamwork: All participants work in groups of two, submitting one common solution. Attempted fraud may lead to the reduction of points or, in severe cases, even to the denial or withdrawal of the certificate. ▪ Active participation: An active participation in the exercises through presentation of solutions and through discussions is also required. Refusal to participate actively may lead to the reduction of points. 4 Lecture Exercise ▪ 2x per week ▪ Monday: 12:15 ESA J ▪ Wednesday: 1015 Phil G ▪ In-person (unless stated otherwise) ▪ Exam: ▪ 120 min ▪ 31.07.2024, 930 (Audi 1) Lecture ▪ 25.09.2024, 930 (ESA A) 5 Moodle ▪ lernen.min.uni-hamburg.de → Exercise sheets and lecture slides Slideset with handout (where applicable) 6 Course Outline Architecture of Database Systems Towards a relational DBS architecture Evolution of the layer model Transaction Management The 5 Layer Model Properties of transactions Modern Database Technology Data Warehouses and OLAP Data Mining Big Data Analytics 7 Before The Relational Model, There Were Other Abstraction Models for Database Systems ▪ Hierarchical Model ▪ Network Model Data base task group report to the CODASYL programming language committee, April 1971 8 Towards a relational DBS architecture Hierarchical Model: Each record has only one parent record A record is a collection of fields where each field contains one value The fields of a record are defined by its type Used only in few database systems today, e.g. in IBM Information management System Network Model: Proposed by the Data Base Task Group of the CODASYL programming language committee as the most general form of a data structure → No limitation to the links between records Enables complex data structures but only simple operations Widely replaced by the relational model 8 The Relational Model ▪ Developed by Codd in 1970 ▪ Created to be simple ▪ Became more popular than the network model with increased computing power ▪ Postulates independence of the language and implementation How can this idea be transformed into a Database System? Codd, E.F. (1970). "A Relational Model of Data for Large Shared Data Banks". Communications of the ACM 13 (6): 377–387. 9 Towards a relational DBS architecture In the 70s, implementations of the relational model were too slow for productive use Computing power grew rapidly Now the relational model is the standard abstraction model for database systems Declarative queries, the use of values, and set orientation make it easy to use compared to the network model which uses pointers and is record oriented 9 How are Relational Database Systems Built? SELECT s.firstname, s.lastname, COUNT(l.name) FROM Student s INNER JOIN Program p ON s.programId = p.id INNER JOIN Attendance a ON a.studentId = s.studentId INNER JOIN Lecture l ON a.lectureId = l.id GROUP BY s.firstname, s.lastname WHERE p.name=‘DSE’ ? byte[] b = read(File f, int pos, int length) 10 Evolution of the layer model 10 The Monolithic Approach Query Evolution of Databases introduces more challenges Database Server ▪ New storage structures and access methods ▪ Changes of storage media ▪ Additional object types DB ▪ Integrity constraints ▪ Changing interfaces and application programs Memory Interface Encapsulation via a hierarchical structure 11 Evolution of the layer model 11 The 5 Layer Model SQL, JDBC, ODBC, … SQL Query Application Query processing - Parsing 1. READ Tree A - Plan generation 2. HASH JOIN B Data System - Plan optimization 3. FETCH C Run - Plan execution Database System Database System Data model semantics Table Person: id INT, name VARCHAR, birthday DATE Index P_id_IX - System catalog - Record format on Person.id Access System 1 ‘Smith’ 15.06.1982 - Logical access paths Storage Structures TID : TID : - Record management - Free space management TID : TID : Storage System TID : TID : - Physical access paths Buffered Pages - Page replacement strategy - Materialization strategy Buffer - Logging, Backup, Recovery Paged files File System Disks, Flash, RAID, SAN, … Hardware The 5 Layer Model Additional literature: Härder, Theo. "DBMS Architecture–the Layer Model and its Evolution." Datenbank-Spektrum 13 (2005): 45-57. Received requests from upper layer (examples): Data System: Select * from mytable; Access System: FIND NEXT/STORE record Storage System: insert into B-Tree, store internal record Buffer: fetch page File System: Read/write block Managed objects: Data System: Relations, Views Access System: External (logical) records, Index Structures Storage System: Internal (physical) records, Hash tables, Trees Buffer: Segments, Pages File System: Files, Blocks Layers are ideally independent of each other → For performance reasons they are sometimes merged in a DBS The following slides show examples of the work of each layer. They are not a complete representation of the layer’s tasks. 12 System Buffer P1 Calls of the System Buffer ▪ Provide page (logical reference), might require to replace another page ▪ FIX – Fix page in the buffer, such that the content can be accessed, usually done when a page is provided Main Memory ▪ UNFIX – Release a FIX, such that page can Secondary Memory be replaced fetch(P1) flush(P2) ▪ Revision mark – A note that the content has been changed and needs to be physically written to disc when it is replaced !!!The note must be written before applying the change!!! ▪ Write – Write the buffer to main memory/disc The 5 Layer Model Read and write operations use the buffer of a DBS Buffer can only hold a fraction of the whole dataset → Page replacement strategies are used to manage buffer (FIFI, LIFO, LRU, LFU) → Ring buffers exist but are usually applied for stream processing Buffer consists of page structured segments and buffer control block (holds information necessary for managing buffer) Special feature of DB buffer in contrast to general buffer management (e.g. by the operating system): Application knowledge can be used for buffer management 13 The 5 Layer Model SQL, JDBC, ODBC, … SQL Query Application Query processing - Parsing 1. READ Tree A - Plan generation 2. HASH JOIN B Data System - Plan optimization 3. FETCH C Run - Plan execution Database System Database System Data model semantics Table Person: id INT, name VARCHAR, birthday DATE Index P_id_IX - System catalog - Record format on Person.id Access System 1 ‘Smith’ 15.06.1982 - Logical access paths Storage Structures TID : TID : - Record management - Free space management TID : TID : Storage System TID : TID : - Physical access paths Buffered Pages - Page replacement strategy - Materialization strategy Buffer - Logging, Backup, Recovery Paged files File System Disks, Flash, RAID, SAN, … Hardware The 5 Layer Model 14 Record Management Costumer(ID, Name, Age, Street, City, PostalCode) Product(ID, ProductName, Price, Picture) place records on pages pages The 5 Layer Model Representation of complete records on pages Pages have a fixed length, records can have a variable length Addressing of Records Allocation Table TID-concept (Tuple Identifier) Heap management 15 Organization of Pages Page 115 Page 136 81 136 Header information 115 175 Header information Record 267 Record 123 Record 108 The 5 Layer Model Addresses of records are created when they are inserted. They provide a way to access the records later. Concatenation Pages are linked together by double linked lists Recording of free pages: heap management Page Header Information about previous and and following page Optionally also number of the page itself Information about the type of record (Table Directory) Information about free space Issues and challenges Distinct addressing of records for the entire lifetime of the record Support of data migration (Runtime) stability against relocation within a page Fast access, access as direct as possible No necessity for frequent reorganizations 16 TIDs ▪ Address is the tuple (Page ID, index in this page) ▪ Array in page holds the position of the records ▪ Migration possible without change of TID → Required when a record in a page becomes too big TID Page 115 Page 256 115 42 Record 108 TID(256,50) Record 124 TID 115 98 The 5 Layer Model Deleting a record: The entry in the page array is marked as invalid All other records on the same page can be moved to maximize the free continuous space → only the positions are changed, not their index in the page array This way, record addresses are not changed, the same TIDs can still be used Changing a record: Case 1: Record becomes smaller → all records are moved wihting the same page, positions in the page array are changed (same as with a deleted record) Case 2: Record becomes larger and space on the page is sufficient to store it → see case 1 Case 3: Record becomes larger and space on the page is not sufficient to store it → Record is moved to another page and TID is stored on the original page (see Figure), If record is moved again later, TID in original page is changed again → only one additional reference necessary even if record is changed multiple times 17 The 5 Layer Model SQL, JDBC, ODBC, … SQL Query Application Query processing - Parsing 1. READ Tree A - Plan generation 2. HASH JOIN B Data System - Plan optimization 3. FETCH C Run - Plan execution Database System Database System Data model semantics Table Person: id INT, name VARCHAR, birthday DATE Index P_id_IX - System catalog - Record format on Person.id Access System 1 ‘Smith’ 15.06.1982 - Logical access paths Storage Structures TID : TID : - Record management - Free space management TID : TID : Storage System TID : TID : - Physical access paths Buffered Pages - Page replacement strategy - Materialization strategy Buffer - Logging, Backup, Recovery Paged files File System Disks, Flash, RAID, SAN, … Hardware The 5 Layer Model 18 Use of Indexes ▪ To avoid full scans, indexes are used → Extra step to avoid unnecessary memory access and comparisons How to get DB-Index this mapping: DB-Scan My key value Query Attr1 Query TID (123,3) Query Attr2 The 5 Layer Model Types of access: Sequential access (scan) → not sufficiently fast for large record types or small result sets Sequenial access on a sorted attribute Direct access via primary key, foreign key(s), composite keys, or search terms Navigation from a record to a set of associated records Requirements: Efficient way of finding records, i.e. as direct as possible Avoiding sequential search of a dataset Facilitate access control by predefined access paths (constraints) Keep topological relations 19 B-Tree revisited https://www.cs.usfca.edu/~galles/visualization/BTree.html 20 B-Trees were covered in DB Grundlagen (not only at UHH, but also at other universities) 20 Sequential List Classification of Indexes blau braun ▪ Primary or Secondary Index (additional access path) grün TID(123, 3) TID(125, 4) lila ▪ Simple index or multilevel index rot ▪ Storage structure: schwarz ▪ Tree-structured, e.g. B-Tree ▪ Sequential, e.g. Sequential lists Binary search tree lila ▪ Scattered, e.g. Hash regions braun schwarz ▪ Access method: ▪ Key comparison blau grün rot ▪ Key transformation, e.g. hashing 123 TID(123, 3) 125 TID(125, 4) 21 The 5 Layer Model Keys do not always fit into main memory, i. e. it can be useful to chose an index that minimalizes disc accesses (e.g. B-Trees and its variants) Extension of binary trees: trees with multiple children per node → B-Trees were designed for use in DB systems Features: Dynamic reorganization by splitting and merging of pages, direct key access Extensions, e.g. B+ tree, B*-tree add more features, e.g. sorted sequential access 21 Creating an Index in SQL ▪ CREATE INDEX my_index ON myTable(myAttr); ▪ CREATE INDEX my_index ON myTable(myAttr,myAttr2); ▪ CREATE UNIQUE INDEX my_index ON myTable (myAttr) INCLUDE (myAttr2) ▪ CREATE UNIQUE INDEX index ON myTable (myAttr) [DIS]ALLOW REVERSE SCANS ▪ Some Systems allow computed indexes: ▪ CREATE INDEX my_index ON myTable (a + b * (c - 1), a, b); 22 The 5 Layer Model The order of the attributes in a multilevel index is not commutative → An existing index on (myAttr1,myAttr2) can be used if both attributes are queried or only myAttr1. It cannot be used (efficiently) if only my Attr2 is queried. UNIQUE does not allow duplicate values INCLUDE includes the attribute in the key, but does not use it for the creation of the index, i.e. in a search tree it is only part of the leaf nodes but not necessary for the other nodes → Useful if an attribute is part of the column list in a query but not of the WHERE clause [DIS]ALLOW REVERSE SCANS → Indexes are usually scanned in the order they were created in. This can be used to allow the opposite behavior. The default depends on the DB system. Not all of these are available on every system. Always check the docs! 22 Databases and Information Systems (DIS) Dr. Annett Ungethüm Universität Hamburg Foto: UHH/Esfandiari 1 The 5 Layer Model SQL, JDBC, ODBC, … SQL Query Application Query processing - Parsing 1. READ Tree A - Plan generation 2. HASH JOIN B Data System - Plan optimization 3. FETCH C Run - Plan execution Database System Database System Data model semantics Table Person: id INT, name VARCHAR, birthday DATE Index P_id_IX - System catalog - Record format on Person.id Access System 1 ‘Smith’ 15.06.1982 - Logical access paths Storage Structures TID : TID : - Record management - Free space management TID : TID : Storage System TID : TID : - Physical access paths Buffered Pages - Page replacement strategy - Materialization strategy Buffer - Logging, Backup, Recovery Paged files File System Disks, Flash, RAID, SAN, … Hardware The 5 Layer Model 2 Logical Query Execution Plan MensaMeals SELECT Mensa FROM MensaMeals, DailyOffers Meal Price WHERE MensaMeals.Meal = DailyOffers.Meal Pizza 6,50 AND MensaMeals.Price < 5; Pasta 4,90 Plan A: ∏Mensa (σPricePut( WriteOptions(), AgeAlice, “26”); 26 [Press Enter] Retrieve a KV-pair Retrieve a KV-pair std::string content; get AgeAlice database->Get( ReadOptions(), AgeAlice, &content); Delete a KV-pair Delete a KV-pair delete AgeAlice database->Delete(WriteOptions(), AgeAlice) 13 NoSQL Syntax of Set in Memcached via telnet set KEY META_DATA EXPIRATION_TIME_IN_S LENGTH_IN_BYTES [Press Enter] VALUE [Press Enter] 13 Log-Structured Merge Trees (LSM Trees) write- optimized in-memory writes buffer (C0) reads C0 max capacity T read- optimized compaction C1t on-disk storage (C1) C1t+1 14 NoSQL LSM Overview Many KV-stores rely on LSM-trees as their storage engine (e.g., BigTable, DynamoDB, LevelDB, Riak, RocksDB, Cassandra, HBase) Approach: Buffers writes in memory, flushes data as sorted runs to storage, merges runs into larger runs of next level (compaction) System Architecture Writes in C0 Reads against C0 and C1 (w/ buffer for C1) Compaction (rolling merge): sort, merge, includingdeduplication Further Reading Patrick E. O'Neil, Edward Cheng, Dieter Gawlick, Elizabeth J. O'Neil: The Log-Structured Merge-Tree (LSM- Tree). Acta Inf. 1996 14 LSM Trees: Merging LSM Tiering LSM Leveling ▪ Keep up to T-1 (sorted) runs per level L ▪ Keep 1 (sorted) run per level L ▪ Merge all runs of Li into 1 run of Li+1 ▪ Sort-Merge run of Li with Li+1 ▪ L1 ▪ L1 ▪ L2 ▪ L2 ▪ L3 ▪ L3 15 NoSQL 15 Storing key-value pairs: Sorted String Tables (sst) All keys are sorted within an sst file → New sst file if new key doesn‘t keep order *image: https://www.igvita.com/2012/02/06/sstable-and-log-structured-storage-leveldb/ sst files level 0 compaction sst files level 1 → Up to 7 levels (originally developed for LevelDB) 16 NoSQL Sorted String Tables Popular format for key-value stores Stores data in a compact format Basically implements an lsm tree For large SSTables, index can be kept separately in-memory 16 SSTables example K,V-Pairs are inserted in the Index 1 SSTable 1 following order: (2,’an’) 2 0 2 an 3 an example. (3, ‘example.’) 3 3 (0,’This’) (1,’is’) Index 2 SSTable 2 Assume each key is stored as a 0 0 0 This 1 is character and each character 1 5 is stored using 1 byte and memory is byte-addressable 17 NoSQL In practice, the length of a key is no necessarily constant → Delimiters or a fixed maximum key size are used to separate the key from the offset 17 Exercise: SSTables ▪ Insert the following (key,value)-pairs into an empty sst: (5,’!’), (0,’H’), (1,’ey, h’),(3,’l’),(4,’o’), (2,’el’) ▪ Assume each character takes up 1 Byte, each key is a character, and memory is byte-addressable ▪ How many SSTables are created? ▪ What do they look like? ▪ What does the index look like? 18 NoSQL 18 NoSQL Databases: Document Stores Document Stores Key-Value Stores Value is a (semi) key value structured document key value Value is a row Wide Column Stores key value or a table Get … a value by a given key Put … a new key-value pair into the database Delete … a key and the associated value 20 NoSQL Basic Data Model The general notion of a document – words, phrases, sentences, paragraphs, sections, subsections, footnotes, etc. Flexible schema: subcomponent structure may be nested, and vary from document-to-document Essentially, they support the embedding of documents and arrays within other documents and arrays Document structures do not have to be predefined, so are schema-free (XML, mostly JSON) 20 Document Stores ▪ Application-oriented management of structured, semi-structured, and unstructured information ▪ Collections of (key, document)-pairs key document 1234 {customer:”Jane Smith”, items:[{name:”P1”,price:49}, {name:”P2”,price:19}]} 1756 {customer:”John Smith”,...} 989 {customer:”Jane Smith”,...} 21 NoSQL Motivation Application-oriented management of structured, semi-structured, and unstructured information Scalability via parallelization on commodity HW (cloud computing) System Architecture Collections of (key, document) Scalability via sharding (horizontal partitioning) Custom SQL-like or functional query languages Example Systems MongoDB (C++, 2007, CP) → RethinkDB, Espresso, Amazon DocumentDB (Jan 2019) CouchDB (Erlang, 2005, AP) → CouchBase 21 Recap: JSON (JavaScript Object Notation) JSON Data Model Query Languages Data exchange format for semi-structured data Most common: libraries for tree traversal Not as verbose as XML (especially for arrays) and data extraction Popular format JSONig: XQuery-like query language JSONPath: XPath-like query language {“students:”[ JSONiq Example: {“id”: 1, “courses”:[ declare option jsoniq-version “…”; {“id“:“INF.01017UF”, “name“:“DM”}, for $x in collection(“students”) {“id“:“706.550”, “name“:“AMLS”}]}, {“id”: 5, “courses”:[ where $x.id lt 10 {“id“:“706.520”, “name“:“DIA”}]}, let $c := count($x.courses) ]} return {“sid”:$x.id, “count”:$c} [http://www.jsoniq.org/docs/JSONiq/html-single/index.html] 22 NoSQL 22 [Credit: https://api.mongodb.com/ Example MongoDB python/current] import pymongo as m Creating a Collection conn = m.MongoClient(“mongodb://localhost:123/”) db = conn[“dbs19”] # database dbs19 cust = db[“customers”] # collection customers mdict = { “name“: “Jane Smith”, Inserting into a Collection “address”: “Inffeldgasse 13, Graz” } id = cust.insert_one(mdict).inserted_id # ids = cust.insert_many(mlist).inserted_ids Querying a Collection print(cust.find_one({"_id": id})) ret = cust.find({"name": "Jane Smith"}) for x in ret: print(x) NoSQL 23 Challenges for Graph Engines Summary Key-Value Stores Memcached via telnet Add a new KV-pair set AgeAlice 0 120 1 [Press Enter] 26 [Press Enter] Retrieve a KV-pair get AgeAlice Delete a KV-pair delete AgeAlice Image: https://www.igvita.com/2012/02/06/sstable-and-log-structured- storage-leveldb/ Robert Ryan McCune, et al.: Thinking Like a Vertex: A Survey of Vertex-Centric Frameworks for Large-Scale Distributed Graph Processing. ACM Comput. Surv. 48(2): 25:1-25:39 (2015), 2 NoSQL Databases: Document Stores Document Stores Key-Value Stores Value is a (semi) key value structured document key value Value is a row Wide Column Stores key value or a table Get … a value by a given key Put … a new key-value pair into the database Delete … a key and the associated value 3 NoSQL Basic Data Model The general notion of a document – words, phrases, sentences, paragraphs, sections, subsections, footnotes, etc. Flexible schema: subcomponent structure may be nested, and vary from document-to-document Essentially, they support the embedding of documents and arrays within other documents and arrays Document structures do not have to be predefined, so are schema-free (XML, mostly JSON) 3 Document Stores ▪ Application-oriented management of structured, semi-structured, and unstructured information ▪ Collections of (key, document)-pairs key document 1234 {customer:”Jane Smith”, items:[{name:”P1”,price:49}, {name:”P2”,price:19}]} 1756 {customer:”John Smith”,...} 989 {customer:”Jane Smith”,...} 4 NoSQL Motivation Application-oriented management of structured, semi-structured, and unstructured information Scalability via parallelization on commodity HW (cloud computing) System Architecture Collections of (key, document) Scalability via sharding (horizontal partitioning) Custom SQL-like or functional query languages Example Systems MongoDB (C++, 2007, CP) → RethinkDB, Espresso, Amazon DocumentDB (Jan 2019) CouchDB (Erlang, 2005, AP) → CouchBase 4 Recap: JSON (JavaScript Object Notation) JSON Data Model Query Languages Data exchange format for semi-structured data Most common: libraries for tree traversal Not as verbose as XML (especially for arrays) and data extraction Popular format JSONig: XQuery-like query language JSONPath: XPath-like query language {“students:”[ JSONiq Example: {“id”: 1, “courses”:[ declare option jsoniq-version “…”; {“id“:“INF.01017UF”, “name“:“DM”}, for $x in collection(“students”) {“id“:“706.550”, “name“:“AMLS”}]}, {“id”: 5, “courses”:[ where $x.id lt 10 {“id“:“706.520”, “name“:“DIA”}]}, let $c := count($x.courses) ]} return {“sid”:$x.id, “count”:$c} [http://www.jsoniq.org/docs/JSONiq/html-single/index.html] 5 NoSQL 5 [Credit: https://api.mongodb.com/ Example MongoDB python/current] import pymongo as m Creating a Collection conn = m.MongoClient(“mongodb://localhost:123/”) db = conn[“dbs19”] # database dbs19 cust = db[“customers”] # collection customers mdict = { “name“: “Jane Smith”, Inserting into a Collection “address”: “Inffeldgasse 13, Graz” } id = cust.insert_one(mdict).inserted_id # ids = cust.insert_many(mlist).inserted_ids Querying a Collection print(cust.find_one({"_id": id})) ret = cust.find({"name": "Jane Smith"}) for x in ret: print(x) NoSQL 6 NoSQL Databases: Wide Column Stores Document Stores Key-Value Stores Value is a (semi) key value structured document key value Value is a row Wide Column Stores key value or a table Get … a value by a given key Put … a new key-value pair into the database Delete … a key and the associated value 7 NoSQL Wide Column Stores are sometimes called “Extensible Record Stores” Column Store != Wide Column Store Basic Data Model Database is a collection of key/value pairs Key consists of 3 parts: a row key, a column key, and a time-stamp (i.e., the version) Flexible schema: the set of columns is not fixed, and may differ from row-to- row Example Systems Google Bigtable HBase Further Reading Chang, Fay, et al. "Bigtable: A distributed storage system for structured data." ACM Transactions on Computer Systems (TOCS) 26.2 (2008): 1-26. 7 Example: Wide Column Data Model Column families Row key Personal data Professional data Column qualifiers 8 8 Example: Wide Column Data Model Personal data Professional data First Last Date of Job Date of ID Salary Employer Name Name Birth Category Hire First Middle Last Job Hourly ID Employer Name Name Name Category Rate First Last Job ID Salary Employer Group Seniority Bldg # Office # Name Name Category Job Date of Insurance Emergency ID Last Name Salary Employer Category Hire ID Contact Medical data single “table” 9 9 Example: Wide Column Data Model t1 Row key t0 First Last Date of Job Date of ID Salary Employer Name Name Birth Category Hire Personal data Professional data One “row” One “row” in a wide-column NoSQL database table = Many rows in several relations/tables in a relational database 10 10 Overview NoSQL Systems Key-Value Wide Table Document Time Series, Spatial, Arrays… Graphs 11 Further Reading Ronald Barber et al: Evolving Databases for New-Gen Big Data Applications. CIDR 2017 11 Course Outline Architecture of Database Systems Transaction Management Modern Database Technology Data Warehouses and OLAP Data Mining Big Data Analytics 12 Motivation for a Data Warehouse Preparation Multitude of different data sources Data cleansing / cleaning → historic reason for DWH Adaptation/Standardization/Interpretability Central defined key performance indicators (KPI Management) and dimensions One single truth, no different interpretations Allows new business processes Statistical methods (e.g. correlation analysis) Examples: customer segmentation, -evaluation More Organization: data from multiple business units Technical: different query models, usage patterns, data volumes Consolidation: heterogeneity (schema, data quality, “garbage in, garbage out!”) 12 Data-Warehouse concept External systems, operational Metadata Reporting, OLAP, Data Mining systems, flat files Load phase Analysis Data Warehouse 13 Data Warehouses and OLAP Goal: (analytic) access to consistent data Integration of external/operational data sources in a centralized database Extraction and multi-stage loading process Integrated (and historicized) data form a base for analytic tools 13 Descriptive Statistics Processing step Actions Data Creation of raw data Raw data Data recording calibration / checking „Microcosmos“ optional: anonymization Structural adaptation Validation (filtering, quality checking) Data preparation Statistical corrections processed raw data (imputation, estimation, „missing values“) Optional: anonymization, pre-aggregation Application-specific aggregation Enriched Data analysis data/aggregates, Representation / interpretation „Macrocosmos“ 14 Data Warehouses and OLAP Explorative statistics Find “undiscovered” structures and connections to generate new hypotheses Based on samples Mathematical statistics Also called “statistical inference” or “inductive statistics”, in German also “schließende Statistik” Based on samples Uses stochastic models → provides probability of error (in contrast to descriptive statistics) 14 DWH Definition A data warehouse is a central database which is optimized for analysis purposes and contains data of several, usually heterogeneous data sources. Erhard Rahm Physical database as an integrated view on (arbitrary) data with a focus on the evaluation aspect, where data is often but not necessarily historized.” GI 15 Data Warehouses and OLAP Some general approaches “Data warehouse is a subject-oriented, integrated, time-varying, non-volatile collection of data in support of the management's decision-making process.” (Bill Inmon) “Data Warehouse is an environment, not a product.” (Berson/ Smith) 15 Characteristics of DWHs Analysis-oriented organization of data Integration of data from different source systems Domain-oriented, models a specific application Integrated database, integration on structural goal and data level of multiple databases No user updates Historic data (almost) no updates and deletes (technical Data is kept over a long period of time updates / deletes only, quality assurance) 16 Data Warehouses and OLAP Operations in operational environment: Insert, Delete, Update, Select Operations in a data warehouse: Insert the initial and additional loading of data by (batch) processes, Select the access of data Non-volatile/stable database, loaded data is not deleted or modified, read- only access Non-volatile Data What happens in the OLTP system if the customer cancels his booking? Delete operation in OLTP Seat gets available again and can be sold to another passenger What happens in the DWH? Insert operation in DWH with, e.g., a flag indicating that the customer cancelled/deleted his booking Business can make analysis about cancelled booking: why might the customer have cancelled? How to prevent the customer or other customers to cancel next time? 16 Tales from the reality of data science “Compulsive data hoarders” exist in many (but not all) projects, so data is collected before it is defined what this data will be used for or if it will be used at all → Exponential data growth Especially in natural science, it’s not always clear which parts of the data might be useful at a later point in time → Everything is collected → Data growth is only restricted by the ratio of failed experiments/failed hardware 16 Reference architecture for DWH Data-Warehouse-System Data marts Analysis Metadata repository Internal update Detail & Sum data Provisioning Internal update Consolidated database Consolidation Internal update Operational External update data store Base data Acquisition & Transformation External update External update Source systems Data Warehouses and OLAP Further Reading Wolfgang Lehner: „Datenbanktechnologie für Data-Warehouse-Systeme: Konzepte und Methoden“, dpunkt-Verlag, 2003 Metadata examples Description of the Data Warehouse System Names, definitions, structure, and content of a DWH Identification of data sources Integration and transformation rules for filling the DWH Integration and transformation rules for end-user rules Operational information like updates, versions Usage and performance of the Data Warehouse (Monitoring) Security, access rights 17 Example DWH architecture ZID MID Salary ZID MID Salary Data analytics 1 1 3.300€ 1 2 6.176€ 2 1 3.500€ 2 2 5.176€ ZID Time ZID MID Salary MID Name 1 T1 1 1 3.300€ 1 Max Mueller Data provisioning 2 T2 1 2 6.176€ 2 Tim Mueller 2 1 3.500€ 2 2 5.176€ Zeit Name Salary T1 Max Mueller 3.300€ Data consolidation T1 Tim Mueller 6.176€ T2 Max Mueller 3.500€ T2 Tim Mueller 5.176€ Name Salary Data transformation Sources Name Gehalt Max Mueller Name3.500€ Salary Max Müller 3.500€ Tim Mueller 5.176€ 7,400$ Tim Mueller Data Warehouses and OLAP 18 Data acquisition and transformation Source ETL Tool Target systems systems RDBMS Extraction Selection RDBMS … … XML-DB Join Load XML-DB … … csv Extraction Selection csv 19 Data Warehouses and OLAP Goal Provisioning of data for the consolidated database Efficiency vs. Actuality What are data sources? Heterogeneous systems or files Local schemata and semantics Describing attributes of data sources Utilization of data for the Data Warehouse Origin (internal or external data) Cooperation (active sources, snapshot-sources, …..) Availability of source data (legal, social, organizational, technical) Cost of acquisition Quality (consistency, correctness, completeness, accuracy,…) Staging Area “Landing Zone” for data coming into a DWH Temporary memory for integrating extracted data after an external update 19 Schema integration Overcoming semantic/structural heterogeneity Integration of different local schemata /data models into one global schema Decoupling of transformation from the source systems and the consolidated database Data integration Adaptation of data formats etc. Correction of different spellings (abbreviations, etc.) ETL-components Extraction-, Transformation- and Load component 19 Example ETL Tools https://www.hitachivantara.com/en-us/products/data-management-analytics/pentaho/download-pentaho.html Data Warehouses and OLAP Commercial Systems Informatica PowerCenter Cognos Decision-Stream Oracle DW Builder IBM InfoSphere DataStage IBM DB2 Warehouse Enterprise Edition AB Initio Open Source Pentaho Data Integration (a.k.a Kettle) Talend ETL Integration Suite Clover ETL Data Integration JasperETL Open Source 20 Data Consolidation ▪ Integrated database from cleaned data ▪ Not specifically modeled, no specific optimizations Update Alternatives …for schema (classical logical and Real time Periodically Depending on physical database design) update quantity …for physical database design (index structures, etc.) t 21 Data Warehouses and OLAP Function Organization-spanning and application-independent data storage Collection, integration and distribution Analytic functionality for operative use 21 Data Provisioning and Analysis ▪ Dispositive database derived from consolidated data ▪ Mostly historic, maybe detailed data, mostly complete Schemas optimized for analysis Dimension tables may contain Snowflake Schema Star Schema redundant data/dependencies between non-key attributes dimension table 1 … dimension table 1.1 dimension table 5 dimension table2 … dimension table 1 Dimension tables fact table Fact tables can become very normalized to fact table reduce redundancy large and may be archived dimension table 4 … dimension table 2 dimension table 3 … … … 22 Data Warehouses and OLAP More optimization for analysis Logical access paths Partitioning Pre-calculation of summed data (e.g. materialized views) Physical access paths: specific index structures (e.g. bitmap index) Data analysis Data-Mart databases derived from dispositive database Specific extracts for a specific class of applications Mostly proprietary formats on the physical level (e.g. MOLAP-systems) 22 Survey https://evasys-online.uni-hamburg.de/evasys/online.php?pswd=J3Q6H 23 23 Derivatives of Key-Value Stores Summary Data Warehouses Example Reference Architecture 2 Data-Mart Concept Structural/content related subset of the complete DWH data, e.g. Geographic/Organization/Function/Competition oriented Data-Marts Dependent (redundant) Data-Mart Independent Data-Mart In reality, the distinction between data warehouse and data-mart can be difficult, because data warehouses do always provide a company-wide integration. 3 Data Warehouses and OLAP Why? Read-optimized layer: Data is stored in a denormalized data model for better read performance and better end user usability/understanding The Data Mart Layer is providing typically aggregated data or data with less history (e.g. latest years only) in a denormalized data model Dividing independent topics, ideally one subject per data mart Privacy aspects Reduction of data volume → Queries on the whole data warehouse can become a bottleneck Performance/Load distribution 3 DWH 2.0 ▪ Structured and “unstructured” data ▪ Life cycle of data with different storage areas Hot data: High speed, Cold data: Low speed, inexpensive, expensive storage for most large storage for old data; archival recent data data model with compression ▪ Metadata s an integral part of the DWH, not just an afterthought W.H. H. Inmon, Derek Strauss, Genia Neushloss: DW 2.0: The Architecture for the Next Generation of Data Warehousing 4 Data Warehouses and OLAP 4 Lambda Architecture (Big Data) 5 Data Warehouses and OLAP Batch Layer DWH alike, correct & complete Output in Read-only DB, complete replacement Hadoop/Spark de facto standard Speed Layer Stream processing Real-time, latency is king, „batch gap“ Correct-/completeness minor concern Apache Storm, Spark etc. Output into fast NoSQL DBs Indexes most recently added data Serving Layer Query processing Stores outputs, builds views Druid, Cassandra, HBase Example: Streaming service analytics → Speed layer: we need information about service issues → aggregate only needed to check if any threshold is hit and the system must react → Batch layer: analytics about user groups 5 Logical Data Warehouse ▪ Focus on Jobs to be done ▪ DWH ▪ Query processing on structured data ▪ Guaranteed quality ▪ Data Lake ▪ Very large volumes of unstructured data ▪ Logical Data Warehouse ▪ Provide access to both underlying systems ▪ Requires ETL functionality ▪ Virtual integration [https://blogs.gartner.com/henry-cook/2018/01/28/the-logical-data-warehouse-and-its-jobs-to-be-done/] 6 Data Warehouses and OLAP Data Lakes With cheap storage costs, people promote the concept of the data lake Combines data from many sources and of any type Allows for conducting future analysis and not miss any opportunity Collect everything All data, both raw sources over extended periods of time as well as any processed data Decide during analysis which data is important, e.g., no “schema“ until read Dive in anywhere Enable users across multiple business units to refine, explore and enrich data on their terms Flexible access Enable multiple data access patterns across a shared infrastructure: batch, interactive, online, search, and others 6 Multidimensional Modelling Sales Sony JVC Grundig Superstore Northern Germany Specialty retailer Independent shops Superstore Superstore Specialty retailer Southern Germany Specialty retailer Independent Independent shops shops JVC Sony Grundig Sales S X C C C Northern Germany Southern Germany Sony Grundig JVC Superstore Specialty Retailer Independent shops 7 Data Warehouses and OLAP Static table also called “summary table” Motivation: Reporting and interactive Analysis OLAP (Online Analytical Processing) on multidimensional data model Data Mining: Search for unknown patterns or relations in data Visualization 7 Basic Concept Descriptive information (cube edges) → Dimensions (partially ordered set D of categorical attributes) Quantified information (cube cells) → Facts (Measures / key performance indicators for analysis / aggregation) Multidimensional schema Ω Set of dimension hierarchies (D1,…, Dm) Set of measures (M1,...,Mn) 8 Data Warehouses and OLAP Central data structure: multidimensional cube Descriptive data (categorical attributes) Quantified data (sum attributes) Multidimensional modeling “Predict” analytic patterns of users Drill-paths for navigations operators Limit to meaningful aggregation options Data structures Descriptive information (cube edges) → dimensions Hierarchies, dimensional attributes Structural basis for selection and aggregation Quantified information (cube cells) → facts Measures / key performance indicators for analysis / aggregation Goal Orthogonal dimensional descriptions Clear separation of measures Dimensions / Dimension hierarchy Partially ordered set D of categorical attributes 8 ({𝐷_1, …, 𝐷_𝑛,〖𝑇𝑜𝑝〗_𝐷 };→) TopD is a generic maximum element w.r.t. “→”, i.e. ∀𝑖 (1≤𝑖≤𝑛):𝐷_𝑖→〖𝑇𝑜𝑝〗_𝐷 There is a Di with finest granularity, i.e. Di→Dj for all Dj “→” denotes the functional dependency Partial ordering allows arbitrary parallel hierarchies !!!Multidimensional Model is conceptual not physical!! 8 Roles of categorical attributes Schema of a Dimension Primary attribute: finest granularity (e.g. order) Classification attribute: implies multi-stage categorization on value level (e.g. customer, Top nation) Describing attribute: additional description (e.g. Example: Order dimension status, phone number) region nation payment adress phone discount method customer order shipping order status priority priority price Primary attribute Describing attribute order Classification attribute Orthogonality ▪ No functional dependencies between attributes from different dimensions 9 Data Warehouses and OLAP There is a functional dependency A→B if for all a∈A there is exactly one b∈B, e.g. Germany → Europe, but not Europe → Germany 9 Schema of a Dimension Example: Multiple hierarchies region Not modelled here: products sold only in some nations nation supplier customer product Primary attribute Describing attribute order Classification attribute Orthogonality ▪ No functional dependencies between attributes from different dimensions 10 Data Warehouses and OLAP 10 Dimensional Values Top Top region Africa Europe Asia America Middle East nation France Germany Russia Romania customer User_X User_Y order #5674 #2354 → Top can be reached via multiple paths 11 Data Warehouses and OLAP Functional dependencies define tree structure on instances Functional dependency corresponds to 1:N-relation! Every path from a classification attribute to Top defines a classification hierarchy 11 Facts and Measures Quantifying part of a multidimensional data schema Fact Measure Defined as F = (G, SumType) M = (G, f(F1,...,Fk), SumType) Granularity: G = {G1,...,Gn} Calculation formula f() Subset of all categorical attributes of all existing Scalar function +,-, ,/,mod etc., e.g. sales tax= dimensional hierarchies D1,..., Dn quantity price tax rate Summation type: SumTyp {FLOW, STOCK, VPU} Aggregation functions, e.g. SUM(), MAX(), COUNT(), e.g. SUM(quantity price) Example: Delivered quantity with granularity Order-based functions, e.g. cumulation, top-k G={orderNo, partNo, supplierNo} calculations Non-empty subset of all existing Facts F1,...,Fk Example Fact: Number and price per product Scalar measure: Sum of price and sales tax for one product Aggregated measure: Aggregated price for a completed order Order-based measure: The k most expensive products of the order 12 Data Warehouses and OLAP Summability Problem Not all functions can be summed up (median, quantiles, standard deviation) Even simple aggregation functions (SUM, AVG, MIN, MAX, COUNT,...) cannot be aggregated further at will Change of granularity G = (G1,...,Gn) is finer (same) G’ = (G1’,...,Gk’), i.e. G  G’ iff, for each Gj’ G’ exists a Gi G, s.t. Gi → Gj’ Coarsening / refinement of granularity adding / removing categorical attributes Example (orderNo, partNo) ≤ (customerNo, brand) ≤ (market segment) Necessary characteristics for summability Disjointness, completeness, type compatibility 12 Characteristics of Summability Disjointness Completeness →An individual value is considered exactly once during →Measures on higher aggregation-level are completely calculation derived from measures of lower aggregation-levels →General: Summability is given if functional dependencies exist between the categories that should be summed up nation Germany federal state ?? Berlin city Berlin Number of students 2020 2021 2022 Total Computer science 15 17 13 28 Number of Economics 10 15 11 21 2010 2011 2012 2013 restaurants Total 25 32 24 49 Dresden 34 38 31 29 Leipzig 123 121 128 131 Chemnitz 21 18 24 27 Case 1: sum by year gives correct results OTHER 11 13 13 12 Case 2: sum up number of computer science students Total 189 190 196 199 naive: 15+17+13 = 45 gives wrong result (2 year overlap) Number of restaurants in the triangle Dresden- Assume faculty founding in 2020: 15 + (17 - 15) + (13 - 2) = 28 Leipzig-Chemnitz OTHER for inns, not located in one of the cities 13 Data Warehouses and OLAP Possibly, the whole space cannot be captured Example: assume Berlin is not modeled as a federal state (city → federal state) Weak functional dependency (A⇒B): for each a∈A exists at most one b∈B example: city⇒ federal state Remove weak functional dependencies NULL, OTHER, dummy values 13 Summation types FLOW Summability Can be aggregated freely, usually measured in per time STOCK: aggregation over FLOW temporal dimension? VPU STOCK States which can be aggregated freely except no yes for a temporal dimension MIN/MAX SUM ✘ ✘ Value-per-Unit AVG States which cannot be summed (except for min, max, and avg), e.g. exchange rate COUNT → Values of different points in time might not be aggregated into one value 14 Data Warehouses and OLAP Summation Types Flow (event at time T) Can be aggregated freely Examples: order quantity of a certain item per day, number of traffic deaths per month, sales, earnings per year,... Stock (status at time T) Can be aggregated freely, without a temporal dimension Items in stock over time → disjointness voided Example: number of school kids per month, Value-Per-Unit (VPU) Current states, that cannot be summed Use of COUNT()-, MIN()-, MAX()- und AVG() is allowed Example: exchange rate, unit costs,... 14 Exercise Summation Types FLOW STOCK VPU Inventory Value added tax rate Ordered items per day #inhabitants per city Stock price Exams per semester SWS (Semesterwochenstunden/ semester hours) Exam grade 15 Data Warehouses and OLAP 15 The Data Cube Data cube: C = (DS,M) Set of dimensional hierarchies: DS = {D1,...,Dn} Set of fact-based measures: M = {M1,...,Mm} Domain of a data cube color Cartesian product of all value ranges of all attributes of the cube schema brand Data cube instance All cube cells from the cube domain 17 Data Warehouses and OLAP Cube is just a metaphor! Almost never, all cells are present (sparsity) Non-existent values on the implementation level become NULL or 0 on the model level! There are usually more than 3 dimensions! 17 Operations on Data Cubes I Slicing → Select “slices of a cube” Dicing → Select a sub-cube → Selection via specification of classification → Selection by specification of attributes of one dimension classification attributes of multiple dimensions color color color blue VW brand brand brand city=“Dresden”, AND city=“Leipzig” brand=“VW”, AND color=“blue” 18 Data Warehouses and OLAP 18 Operations on Data Cubes II Drill-Down and Roll-Up Drill-Across → Moving along the dimension hierarchy → Change from one sub-cube to another Drill-down: Disaggregation of measures into sub-measures Roll-up: Aggregation of measures into super-measures Hierarchy: Region  Nation  Federal state  City color color Refine federal state → city Drill-down city brand brand color color Roll-up Origin Target Aggregate city → federal state city=“Dresden”, AND city=“Dresden”, AND brand=“VW”, AND brand=“Ford”, AND color=“blue” color=“blue” 19 Data Warehouses and OLAP Drill-Down and Roll-Up Changes granularity, not Categorization Drill-Across Dimension stays on the same hierarchy-level, but selection value changes Also change of data cube (Join of multiple data cubes) 19 Exercise Data Cube ▪ Which operations are described? (slicing, dicing, roll-up, drill-down, drill-across) ▪ How could you visualize the results? a) name = “Jane Doh” b) refine name → first c) student=“Jane Doh” AND time = “2 pm” AND course = “AD” d) time = “4 pm” AND course = “Math I” course e) aggregate first → name f) student=“Jane Doh” AND time = “2 pm” AND course = “AD” → student=“Jane Doh” AND time = “10 am” AND course = “AD” student 20 Data Warehouses and OLAP 20 Data Cube in PostgreSQL ▪ Install extension: CREATE EXTENSION cube; ▪ Create a cube: SELECT cube(array[X,Y,Z]); //creates cube with dimensions X, Y, and Z ▪ Selected functions: ▪ Cube_union (cube,cube) ▪ Cube_inter(cube,cube) → intersection ▪ cube_subset ( cube, integer[] ) → new cube from existing group using list of dimension indexes, e.g. for dicing ▪ Rollup is a separate function and a subclause of GROUP BY (x, y, z) (x, y) (y, z)

0-Introduction_merged (1).pdf

Document Details

Tags

Related

Full Transcript