ilovepdf_merged (1).pdf
Document Details
Uploaded by EthicalGermanium
2021
Tags
Full Transcript
Database Management System - Part 1 DWBI\DQE Lab Ukraine Autumn 2021 Iryna Shkinder DWBI Software Engineer Alona Shkinder DWBI Software Engineer CONFIDENTIAL | © 2021 EPAM Systems, Inc. RECOMMENDATIONS Discuss and interact Raise hand, ask questions Duration: 3 hours Coffee break: by r...
Database Management System - Part 1 DWBI\DQE Lab Ukraine Autumn 2021 Iryna Shkinder DWBI Software Engineer Alona Shkinder DWBI Software Engineer CONFIDENTIAL | © 2021 EPAM Systems, Inc. RECOMMENDATIONS Discuss and interact Raise hand, ask questions Duration: 3 hours Coffee break: by request 5-10 minutes Feedback is appreciated Q&A after each module, Teams Questions Channel CONFIDENTIAL | © 2021 EPAM Systems, Inc. 2 I RY N A S HKI NDER DWBI Software Engineer BI Lab Teacher, RM, Mentor EPAM BI Lab 2016 Participant EPAM Career Path: Multiple projects/domains, Student -> DWBI Software Engineer ALO N A SHKI N DER DWBI Software Engineer BI Lab Teacher EPAM BI Lab 2018 Participant EPAM Career Path: Multiple projects, Student -> DWBI Software Engineer CONFIDENTIAL | © 2021 EPAM Systems, Inc. 3 Database Management System Part 1 CONFIDENTIAL | © 2021 EPAM Systems, Inc. 4 AGENDA 1. What is DBMS? 2. What is RDBMS? 3. What is SQL? 4. What is NoSQL? 5. What is IMDBMS? 6. What is CDBMS? 7. What is C(B)DBMS? 8. SMP vs MPP 9. What is RAID? CONFIDENTIAL | © 2021 EPAM Systems, Inc. 5 WHAT IS DBMS? A Database Management System (DBMS) is a software product/package/system created to handle/change/retrieve/etc data in a database. A DBMS manipulates with: data data formats metadata structures DBMS defines rules how to … with data. SQL -> used in DBMS CONFIDENTIAL | © 2021 EPAM Systems, Inc. 6 WHAT IS DBMS WIKI? A Database is an organized collection of data. A Database Management System (DBMS) is a computer-software application that interacts with end-users, other applications, and the database itself to capture and analyze data. A general-purpose DBMS allows the definition, creation, querying, update, and administration of databases. CONFIDENTIAL | © 2021 EPAM Systems, Inc. 7 COMPONENTS OF DBMS Usually DBMS consists of: Data Software Hardware Procedures (rules and presets) Database Access Language Database Engine Data Dictionary Query Processor Managers (Data, Databased, others) CONFIDENTIAL | © 2021 EPAM Systems, Inc. 8 COMPONENTS OF DBMS - Data Data is the most important component of the DBMS. This is the raw material from which information is generated. The main task of a DBMS is data processing. The database contains both the metadata and the actual (or operational) data. Metadata is a description of data or data about data. This is information stored by the DBMS to better understand the data stored in it. CONFIDENTIAL | © 2021 EPAM Systems, Inc. 9 COMPONENTS OF DBMS - Software Software is the set of programs used to control and manage the overall database. The DBMS software provides an easy-to-use interface to store, retrieve, and update data in the database. This component is used for processing of Database Access Language and converts it into actual database commands to execute or run them on the DB. CONFIDENTIAL | © 2021 EPAM Systems, Inc. 10 COMPONENTS OF DBMS - Hardware This component of DBMS consists of a set of physical electronic devices such as computers (PCs, workstations, servers, and supercomputers), I/O channels for data, storage devices, network devices (hubs, switches, routers, fiber optics) and any other physical components involved before any data is successfully stored into the memory. These are all those physical components that create the interface between computers and users. This DBMS component is used for keeping and storing the data in the database. CONFIDENTIAL | © 2021 EPAM Systems, Inc. 11 COMPONENTS OF DBMS - Procedures Procedures refer to general rules and instructions that help to design the database and to use a database management system. Used to setup and install a new DBMS, to login and logout of DBMS software, to manage DBMS or application programs, to take backup of the database, and to change the structure of the database, etc. Procedures are also used to ensure that there is an organized way to monitor and audit both the data that enter the database and the information that is generated through use of those data. CONFIDENTIAL | © 2021 EPAM Systems, Inc. 12 COMPONENTS OF DBMS – Database Access Language Database Access Language is a simple language designed to write commands to access, insert, update and delete data stored in any database. Every user writes a set of appropriate commands in the database access language, submit these to the DBMS, which then processes the data and generates it, displays a set of results into a user-readable form. User can create new databases, tables, insert data, fetch stored data, update data and delete the data using the access language. CONFIDENTIAL | © 2021 EPAM Systems, Inc. 13 COMPONENTS OF DBMS - Database Engine A Database Engine (or Storage Engine) is the underlying software component that a DBMS uses to create, read, update and delete (CRUD) data from a database. Most database management systems include their own application programming interface (API) that allows the user to interact with their underlying engine without going through the user interface of the DBMS. Database Engines are quite sophisticated and utilize special indexing structures to speed up data retrieval. CONFIDENTIAL | © 2021 EPAM Systems, Inc. 14 COMPONENTS OF DBMS - Data Dictionary The Data Dictionary holds a description of each database. The Data Dictionary (or Metadata Repository) is a "centralized repository of information about data such as meaning, relationships to other data, origin, usage, and format". This term can also be used with the following meanings: o A document describing a database or collection of databases o An integral component of a DBMS that is required to determine its structure o A piece of middleware that extends or supplants the native data dictionary of a DBMS CONFIDENTIAL | © 2021 EPAM Systems, Inc. 15 COMPONENTS OF DBMS - Query Processor Query Processor transforms the user queries into a series of instructions which machine can understand and perform the requested action for user. Query Processor typically contains the following components: o DML Compiler – It processes the DML statements into low level instruction (machine language) so that they can be executed. o DDL Interpreter – It processes the DDL statements into a set of tables containing metadata. o Embedded DML Pre-compiler – It processes DML statements embedded in an application program into procedural calls. o Query Optimizer – It executes the instruction generated by DML Compiler. CONFIDENTIAL | © 2021 EPAM Systems, Inc. 16 COMPONENTS OF DBMS - Managers Database Management tools o UI (Web/Desktop) o CLI Database Installation and Configuration tools Data loading and Data Migration tools Backup and Recovery tools Resource Management and Task Scheduling tools Performance and Tuning tools Others CONFIDENTIAL | © 2021 EPAM Systems, Inc. 17 D B M S F UNC TIONS What-Why-How? Handle DATA Store DATA Transform DATA Access to DATA Secure DATA Backup DATA Recover DATA Integrate DATA CONFIDENTIAL | © 2021 EPAM Systems, Inc. 18 WHAT IS RDBMS? A Relational Database Management System (RDBMS) is a database engine/system/products based on the Relational Model Who: Edgar F. Codd When: 1970. The Relational Model (RM) for Database Management System is an approach to work with data based on the relation algebra (tuples, predicate logic, structures, relations). CONFIDENTIAL | © 2021 EPAM Systems, Inc. 19 WHAT IS RDBMS WIKI? A Relational Database Management System (RDBMS) is a Database Management System (DBMS) based on the Relational Model invented by Edgar F. Codd at IBM's San Jose Research Laboratory. Rules: Present the data to the user as relations (a presentation in tabular form, i.e. as a collection of tables with each table consisting of a set of rows and columns); Provide relational operators to manipulate the data in tabular form. CONFIDENTIAL | © 2021 EPAM Systems, Inc. 20 WHAT IS SQL? Structured Query Language (SQL) is a standard computer language for Relational Database Management System plus data manipulation. ISO (International Organization for Standardization): https://www.iso.org/standard/63555.html The next generation “under development”: https://www.iso.org/standard/76583.html ISO/IEC 9075-1:2016 describes the conceptual framework used in other parts of ISO/IEC 9075 to specify the grammar of SQL and the result of processing statements in that language by an SQL- implementation. CONFIDENTIAL | © 2021 EPAM Systems, Inc. 21 WHAT IS SQL WIKI? SQL is a domain-specific language used in programming and designed for managing data held in a Relational Database Management System (RDBMS), or for stream processing in a Relational Data Stream Management System (RDSMS). Consists of: DQL DDL DCL DML CONFIDENTIAL | © 2021 EPAM Systems, Inc. 22 COMPONENTS OF SQL – DQL, DDL Data Query Language (DQL) - SQL commands used to perform queries on the data within schema objects. The purpose of DQL commands is to get the schema relation based on the query passed to it. Example: SELECT Data Definition Language (DDL) – SQL commands used to create and modify database objects such as tables, indexes, and users. Example: CREATE, ALTER, DROP, TRUNCATE CONFIDENTIAL | © 2021 EPAM Systems, Inc. 23 COMPONENTS OF SQL – DCL, DML Data Control Language (DCL) – SQL commands used to control access to data stored in a database (rights, permissions and other controls). Example: GRANT, REVOKE Data Manipulation Language (DML) - SQL commands used to manipulate/modify data presented in the database. Example: INSERT, UPDATE, DELETE CONFIDENTIAL | © 2021 EPAM Systems, Inc. 24 WHAT IS NOSQL? NoSQL is a class of Database Management Systems (DBMS) that does not use SQL and does not follow of Relation Model. NoSQL – no strict following of ACID. No SQL -> Not only SQL (OR) Non-relational It is an alternative solution. Was created to SUPPORT SQL. CONFIDENTIAL | © 2021 EPAM Systems, Inc. 25 WHAT IS NOSQL WIKI? A NoSQL database provides a mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases. NoSQL databases are increasingly used in big data and real-time web applications. The data structures used by NoSQL databases (e.g. key-value, wide column, graph, or document) are different from those used by default in Relational Databases, making some operations faster in NoSQL. CONFIDENTIAL | © 2021 EPAM Systems, Inc. 26 WHAT IS IMDBMS? An In-Memory Database Management System (IMDBMS) is a Database Management System that stores data on main memory. Why: to optimize work with disks (I/O). How: direct data manipulation and specialized architecture (CPU + memory). Real-Time data movement. CONFIDENTIAL | © 2021 EPAM Systems, Inc. 27 WHAT IS IMDBMS WIKI? An In-Memory Database Management System is a Database Management System that primarily relies on main memory for computer data storage. It is contrasted with Database Management Systems that employ a disk storage mechanism. In-Memory Databases are faster than disk-optimized databases because disk access is slower than memory access, the internal optimization algorithms are simpler and execute fewer CPU instructions. Accessing data in memory eliminates seek time when querying the data, which provides faster and more predictable performance than disk. CONFIDENTIAL | © 2021 EPAM Systems, Inc. 28 WHAT IS CDBMS? A Columnar Database Management System (CDBMS) is a Database Management System where data is stored in columns instead of rows. Why: to optimize performance, different storage options and schemas. How: data is ordered and sorted. A record can be retrieved as a group of columns. CONFIDENTIAL | © 2021 EPAM Systems, Inc. 29 WHAT IS CDBMS WIKI? A Columnar Database Management System (or Column-Oriented DBMS) is a Database Management System that stores data tables by column rather than by row. Storing data in columns rather than rows, the database can more precisely access the data it needs to answer a query rather than scanning and discarding unwanted data in rows. Query performance is often increased as a result, particularly in very large data sets. Loading performance is often decreased as a result, particularly in very large data sets (especially increment loading). CONFIDENTIAL | © 2021 EPAM Systems, Inc. 30 WHAT IS C(B)DBMS? A Cloud (Based) Database Management System is a is a Database Management System that is created/built/deployed/supported as a Cloud Service on a Cloud Platform. Reference: PaaS (Platform as a Service) or DBaaS (Database as a Service). Access: Via Web or API or CLI or SDK. Why: Scalability, Self-Supporting, No Hardware. CONFIDENTIAL | © 2021 EPAM Systems, Inc. 31 WHAT IS C(B)DBMS WIKI? A Cloud (Based) Database Management System is a is a Database Management System that that typically runs on a cloud computing platform, access to it is provided as a service. Database services take care of scalability and high availability of the database. Database services make the underlying software- stack transparent to the user. Scaling: Vertical – add resources to an existing instance. Horizontal – add an additional instance CONFIDENTIAL | © 2021 EPAM Systems, Inc. 32 SMP Symmetric Multiprocessing (SMP) is a Computing Architecture that is have two+ processors which are attached to a single instance. Consists of: Software and Hardware multiprocessing. One server – one love. Multiple servers – a cluster. Core: CPU and Memory CONFIDENTIAL | © 2021 EPAM Systems, Inc. 33 MPP Massively Parallel Processing (MPP) is a Computing Architecture that is have one program which is paralleled to multiple instances. Each instance has own resources. Program splitted to threads. Core: Interconnection and collaboration CONFIDENTIAL | © 2021 EPAM Systems, Inc. 34 WHAT IS RAID? Redundant Array of Independent Disks (RAID) is a Data Storage Virtualization Technology that combines multiple physical disk drive components into one or more logical units for the purposes of data redundancy or/and performance improvement. RAID 0 -> RAID 1 -> RAID 2 -> RAID 3 -> RAID 4 -> RAID 5 -> RAID 6 RAID types: RAID 0 (Striping) RAID 1 (Mirroring) RAID 5 (Striping with parity) RAID 6 (Striping with double parity) RAID 10 (1+0 -> Striping + Mirroring) CONFIDENTIAL | © 2021 EPAM Systems, Inc. 35 RAID 0 RAID 0 (also known as a stripe set or striped volume) splits ("stripes") data evenly across two or more disks, without parity information, redundancy, or fault tolerance. Since RAID 0 provides no fault tolerance or redundancy, the failure of one drive will cause the entire array to fail; as a result of having data striped across all disks, the failure will result in total data loss. Stripe Stripe Stripe A B C D E F CONFIDENTIAL | © 2021 EPAM Systems, Inc. 36 RAID 1 RAID 1 consists of an exact copy (or mirror) of a set of data on two or more disks. A classic RAID 1 mirrored pair contains two disks. The array will continue to operate so long as at least one drive is operational. Stripe Stripe Stripe Stipe A A B B C C D D CONFIDENTIAL | © 2021 EPAM Systems, Inc. 37 RAID 2, 3, 4 – Rare usage RAID 2 stripes data at the bit level and uses a Hamming code for error correction. Multiply correction bits to other disks. Stripe Stripe Stripe ECC ECC A0 A1 A2 ECC Ax ECC Ay B0 B1 B2 ECC Bx ECC By RAID 3 consists of byte-level striping with a dedicated parity disk. Uses “Stripes” – divided data blocks. Parity bytes to other disk. Stripe Stripe Stripe Parity A0 A1 A2 A Parity B0 B1 B2 B Parity RAID 4 consists of block-level striping with a dedicated parity disk. Uses “Blocks” of data. Parity bytes to other disk. Block Block Block Parity A0 A1 A2 A Parity B0 B1 B2 B Parity CONFIDENTIAL | © 2021 EPAM Systems, Inc. 38 RAID 5 RAID 5 consists of block-level striping with distributed parity. In comparison to RAID 4, RAID 5's distributed parity evens out the stress of a dedicated parity disk among all RAID members. RAID Level 5 requires a minimum of 3 drives to implement. Block Block Block Block A0 B0 C0 0 Parity A1 B1 1 Parity D0 A2 2 Parity C2 D2 3 Parity B3 C3 D3 CONFIDENTIAL | © 2021 EPAM Systems, Inc. 39 RAID 6 RAID 6 extends RAID 5 by adding another parity block. It uses block-level striping with two parity blocks distributed across all member disks. RAID Level 6 requires a minimum of 4 drives to implement. Block Block Block Block A0 B0 Q0 Parity P0 Parity A1 Q1 Parity P1 Parity D0 Q2 Parity P2 Parity C2 D2 P3 Parity B3 C3 Q3 Parity CONFIDENTIAL | © 2021 EPAM Systems, Inc. 40 Q&A CONFIDENTIAL | © 2021 EPAM Systems, Inc. T H A N K YO U F O R YO U R AT T E N T I O N CONFIDENTIAL | © 2021 EPAM Systems, Inc. Database Management System - Part 2 DWBI\DQE Lab Ukraine Autumn 2021 Iryna Shkinder DWBI Software Engineer Alona Shkinder DWBI Software Engineer CONFIDENTIAL | © 2021 EPAM Systems, Inc. RECOMMENDATIONS Discuss and interact Raise hand, ask questions Duration: 3 hours Coffee break: by request 5-10 minutes Feedback is appreciated Q&A after each module, Teams Questions Channel CONFIDENTIAL | © 2021 EPAM Systems, Inc. 2 Database Management System Part 2 CONFIDENTIAL | © 2021 EPAM Systems, Inc. 3 AGENDA 1. How does a RDBMS work? 2. Big O notation 3. Sort 4. Array vs Tree vs Hash 5. RDBMS components 6. Join(s) 7. Query + Data + Tools CONFIDENTIAL | © 2021 EPAM Systems, Inc. 4 HOW DOES A RDBMS WORK? A Relational Database Management System (RDBMS) is a database engine/system/products based on the Relational Model by Edgar F. Codd in 1970. To understand something you should know a basic principles and ask questions (why/what/when/how and etc). Check this out: http://coding-geek.com/how-databases- work/ Author: Christophe in his blog Coding Geek A translation (Habr): https://habr.com/company/mailru/blog/2668 11/ CONFIDENTIAL | © 2021 EPAM Systems, Inc. 5 BIG O NOTATION Big O notation is a mathematical notation that describes the limiting behavior of a function when the argument tends towards a particular value or infinity. It is a member of a family of notations invented by Paul Bachmann, Edmund Landau, and others, collectively called Bachmann–Landau notation or asymptotic notation. Big O notation is used to classify algorithms according to how their running time or space requirements grow as the input size grows. Used for analysis of algorithms. CONFIDENTIAL | © 2021 EPAM Systems, Inc. 6 BIG O -> COST BASED OPTIMIZATION Input: time + data Algorithms: O(1) – constant O(n) – linear O(log(n)) – low with big data O(n^2) – quadratic grow O(2^n) – exponential grow O(n!) – exploding grow Each algorithm “costs” N operations – we can measure and compare them. CONFIDENTIAL | © 2021 EPAM Systems, Inc. 7 BIG O COMPARISON O(n) O(log(n)) O(n^2) O(n!) 1 0 1 1 2 0.301029996 4 2 3 0.477121255 9 6 4 0.602059991 16 24 5 0.698970004 25 120 6 0.77815125 36 720 7 0.84509804 49 5040 8 0.903089987 64 40320 9 0.954242509 81 362880 10 1 100 3628800 CONFIDENTIAL | © 2021 EPAM Systems, Inc. 8 HOW TO SORT? Sorting is arranging in an ordered sequence. Sorting algorithms: Bubble sort Insertion sort Shell sort Selection sort Quick sort Merge sort Search -> Sort + Scan CONFIDENTIAL | © 2021 EPAM Systems, Inc. 9 SORTING ALGORITHMS CONFIDENTIAL | © 2021 EPAM Systems, Inc. 10 BUBBLE SORT Bubble sort (sometimes called sinking sort) is a simple sorting algorithm that repeatedly steps through the list, compares adjacent elements and swaps them if they are in the wrong order. The pass through the list is repeated until the list is sorted. The algorithm, which is a comparison sort, is named for the way smaller or larger elements "bubble" to the top of the list. Worst-case performance: O(n2) Best-case performance: O(n) CONFIDENTIAL | © 2021 EPAM Systems, Inc. 11 INSERTION SORT Insertion sort is a simple sorting algorithm that builds the final sorted array one item at a time. The array is virtually split into a sorted and an unsorted part. Values from the unsorted part are picked and placed at the correct position in the sorted part. Worst-case performance: O(n2) Best-case performance: O(n) CONFIDENTIAL | © 2021 EPAM Systems, Inc. 12 SHELL SORT Shell sort improves on the insertion sort by breaking the original list into several smaller sublists, each of which is sorted using an insertion sort. The method starts by sorting pairs of elements far apart from each other, then progressively reducing the gap between elements to be compared. Worst-case performance: O(n2) Best-case performance: O(n log2n) CONFIDENTIAL | © 2021 EPAM Systems, Inc. 13 SELECTION SORT Selection sort is an in-place comparison sorting algorithm. This algorithm divides the array into two parts: sorted (left) and unsorted (right) subarray. It selects the smallest/largest element from unsorted subarray and places in the first position of that subarray (ascending/descending order). It repeatedly selects the next smallest/largest element. Worst-case performance: O(n2) Best-case performance: O(n2) CONFIDENTIAL | © 2021 EPAM Systems, Inc. 14 QUICK SORT Quick sort is an efficient comparison sorting algorithm. It works by selecting a 'pivot' element from the array and partitioning the other elements into two sub-arrays, according to whether they are less than or greater than the pivot. The sub-arrays are then sorted recursively. Worst-case performance: O(n2) Best-case performance: O(n log n) CONFIDENTIAL | © 2021 EPAM Systems, Inc. 15 MERGE SORT Merge: combine or cause to combine to form a single entity. Merge sort: 2+ pre-sorted arrays are combined to 1 sorted array. Steps to merge: 1. 1 array is split to N smaller arrays 2. Sort N arrays 3. Compare the optimal values (max or min) 4. Choose one and put into final array 5. Move 6. Go to 3 CONFIDENTIAL | © 2021 EPAM Systems, Inc. 16 MERGE SORT Merge sort phases: Division Sorting Cost: N*log(N) Division: Divide array to atomic level Sorting: 1. Compare the pairs 2. Merge them 3. Compare the small arrays 4. Merge them 5. Go to 3 CONFIDENTIAL | © 2021 EPAM Systems, Inc. 17 ARRAY AND TABLE An Array is a systematic arrangement of similar objects, usually in rows and columns. 2 dimensional Array is a Table or a Matrix. A Table is an arrangement of data in rows and columns, or possibly in a more complex structure. Row – record, n-tuple, vector. Column – field, parameter, attribute. Lets find something scanning whole table starting from the beginning to the end – full scan. CONFIDENTIAL | © 2021 EPAM Systems, Inc. 18 BINARY TREE A Binary Tree is a tree data structure with nodes which has children – left and right child. 4 2 6 1 3 5 7 CONFIDENTIAL | © 2021 EPAM Systems, Inc. 19 BINARY PLUS TREE A Binary Plus Tree is a modification of Binary Tree where: The lowest nodes contain/store information The other nodes – routes to the lowest nodes The lowest nodes are linked to the successors/neighbors Purpose: a range search CONFIDENTIAL | © 2021 EPAM Systems, Inc. 20 BINARY PLUS TREE < 5? < < 3? 7? < < < < 2? 4? 6? 8? 1 2 3 4 5 6 7 8 CONFIDENTIAL | © 2021 EPAM Systems, Inc. 21 HASH A Hash function is any function that can be used to map data of arbitrary size to data of fixed size. How to hash for hash table: Need a key Need a function Need a bucket Need an additional function Steps: 1. A hash function compute the hash code 2. The hash code points to a bucket 3. An additional function check a bucket in case of multiple values CONFIDENTIAL | © 2021 EPAM Systems, Inc. 22 RELATIONAL DATABASE MANAGEMENT SYSTEM Executor Rewriter Cache Access Backup Admin Parser Optimizer Transaction Recovery Monitor QUERY MANAGER DATA MANAGER OTHER MANAGERS File Memory Network Client Security Process System ENGINE COMPONENTS CONFIDENTIAL | © 2021 EPAM Systems, Inc. 23 CLIENT MANAGER RDBMS Client Query Data Client Manager Manager Manager Steps: Authorization and Authentication Check other managers (availability, resources, network and etc) Query – to Query Manager Results – from Query Manager to Client Manager CONFIDENTIAL | © 2021 EPAM Systems, Inc. 24 QUERY MANAGER RDBMS Client Query Data Client Manager Manager Manager Steps: Parse (+ integrity checks) Rewrite (human != machine) Optimize (stats, indexes, access path, joins, dynamic programming, heuristics, genetic and greedy algorithms, caches) Compile Execute CONFIDENTIAL | © 2021 EPAM Systems, Inc. 25 SET OPERATIONS UNION - A ∪ B Union of the sets A and B is the set of distinct A B elements that belong to set A or set B, or both. INTERSECT - A ∩ B The intersection of the sets A and B is the set of A B elements that belong to both A and B i.e. set of the common elements in A and B. EXCEPT (MINUS) - A \ B (A - B) The difference between sets A and B is the set containing elements that are in A but not in B. i.e., A B all elements of A except the element of B. CONFIDENTIAL | © 2021 EPAM Systems, Inc. 26 JOIN(S) A Join (relational algebra) - a binary operation on tuples corresponding to the relation join of SQL. An SQL Join clause combines columns from one or more tables in a relational database. Relations: Inner Outer Joins: Nested loop Merge join Hash join CONFIDENTIAL | © 2021 EPAM Systems, Inc. 27 NESTED LOOP A Nested Loop join is a naive algorithm that joins two sets by using two nested loops.. Steps to nested loop: 1. Choose inner and outer relations 2. Inner – the smallest one 3. Iterate an outer relation 4. Check an inner relation for matching values CONFIDENTIAL | © 2021 EPAM Systems, Inc. 28 MERGE JOIN A Merge join is based on a Merge sort. Produces a sorted output. Steps to merge join: 1. Sort 2. Compare the optimal values (max or min) 3. If equal – put them to result 4. Move to the next elements 5. Go to 2 CONFIDENTIAL | © 2021 EPAM Systems, Inc. 29 HASH JOIN The task of a Hash join algorithm is to find, for each distinct value of the join attribute, the set of tuples in each relation which have that value. Steps to hash join: 1. Choose inner and outer relations 2. Inner – the smallest one 3. Build an in-memory hash table based on an inner relation 4. Iterate an outer relation 5. A hash functions to each element 6. Compare hashes CONFIDENTIAL | © 2021 EPAM Systems, Inc. 30 DATA MANAGER RDBMS Client Query Data Client Manager Manager Manager A Data manager gets data from tables. What it should check: Transactions Caches and buffers I/O bottleneck CONFIDENTIAL | © 2021 EPAM Systems, Inc. 31 TRANSACTIONS ACID: Atomicity Consistency Isolation Durability Levels of isolation: Serializable Repeatable read -> a phantom read (new data) Read committed -> a non-repeatable read (modification of data) Read uncommitted -> a dirty read (no commits) CONFIDENTIAL | © 2021 EPAM Systems, Inc. 32 TRANSACTIONS AND ISOLATION LEVELS Transaction is a single unit of logic or work, sometimes made up of multiple operations. Any logical calculation done in a consistent mode in a database is known as a transaction. Atomicity - transaction must either complete in its entirety or have no effect whatsoever Consistency - transaction must conform to existing constraints in the database Isolation - transaction must not affect other transactions Durability - transaction must get written to persistent storage Isolation levels define the degree to which a transaction must be isolated from the data modifications made by any other transaction in the database system. CONFIDENTIAL | © 2021 EPAM Systems, Inc. 33 READ UNCOMMITTED AND DIRTY READ Read Uncommitted – is the lowest isolation level. In this level, one transaction may read not yet committed changes made by other transaction, thereby allowing dirty reads. In this level, transactions are not isolated from each other. Dirty Read – occurs when a transaction reads a data that has not yet been committed. Example: Transaction T1 updates a row and leaves it uncommitted, meanwhile, transaction T2 reads the updated row. If T1 rolls back the change, T2 will have read data that is considered never to have existed. CONFIDENTIAL | © 2021 EPAM Systems, Inc. 34 READ COMMITTED AND NON-REPEATABLE READ Read Committed – this isolation level guarantees that any data read is committed at the moment it is read. Thus it does not allows dirty read. The transaction holds a read or write lock on the current row, and thus prevents other transactions from reading, updating, or deleting it. Non-Repeatable Read – occurs when a transaction reads same row twice and get a different value each time. Example: Transaction T1 reads data. Due to concurrency, another Transaction T2 updates the same data and commit, Now if T1 rereads the same data, it will retrieve a different value. CONFIDENTIAL | © 2021 EPAM Systems, Inc. 35 REPEATABLE READ AND PHANTOM READ Repeatable Read – this is the most restrictive isolation level. The transaction holds read locks on all rows it references and writes locks on all rows it inserts, updates, or deletes. Since other transactions cannot read, update or delete these rows, consequently it avoids non-repeatable read. Phantom Read – occurs when two same queries are executed, but the rows retrieved by the two, are different. Example: Transaction T1 retrieves a set of rows that satisfy some search criteria. Now, Transaction T2 generates some new rows that match the search criteria for T1. If T1 re-executes the statement that reads the rows, it gets a different set of rows this time. CONFIDENTIAL | © 2021 EPAM Systems, Inc. 36 SERIALIZABLE Serializable – this is the Highest isolation level. A serializable execution is guaranteed to be serializable. Serializable execution is defined to be an execution of operations in which concurrently executing transactions appears to be serially executing. A serial execution is one in which each SQL-transaction executes to completion before the next SQL-transaction begins. CONFIDENTIAL | © 2021 EPAM Systems, Inc. 37 Isolation Levels, Read Phenomena and Locks Relationship Relationship between Isolation Levels, Read Phenomena and Locks: Isolation Level Dirty Reads Non-Repeatable Reads Phantom Reads Read Uncommitted + + + Read Committed - + + Repeatable Read - - + Serializable - - - + may occur - do not occur CONFIDENTIAL | © 2021 EPAM Systems, Inc. 38 Q&A CONFIDENTIAL | © 2021 EPAM Systems, Inc. T H A N K YO U F O R YO U R AT T E N T I O N CONFIDENTIAL | © 2021 EPAM Systems, Inc.