DBMS Unit 2
DATABASE MANAGEMENT SYSTEM (EBCS22003)
UNIT II: SQL, INDEXING & HASHING (9 Hrs)
Mrs. C. Subalakshmi, AP/CSE, Dr.MGRERI

Syllabus: SQL – normalization – normalization using functional dependencies – multivalued and join dependencies – file organization – data dictionary – indexing and hashing: basic concepts and B+ tree indices – static and dynamic hash functions.

SQL (Structured Query Language)

SQL stands for Structured Query Language. It is used for storing and managing data in a relational database management system (RDBMS) and is the standard language for relational database systems. It enables a user to create, read, update, and delete relational databases and tables. RDBMSs such as MySQL, Informix, Oracle, MS Access, and SQL Server all use SQL as their standard database language. SQL allows users to query the database in a number of ways, using English-like statements.

Rules:
SQL follows these rules:
Structured Query Language is not case sensitive. By convention, SQL keywords are written in uppercase.
SQL statements are not tied to single text lines: a statement may be written on one line or spread across several.
Using SQL statements, you can perform most actions in a database.
SQL is based on tuple relational calculus and relational algebra.

SQL process:
When an SQL command is executed against an RDBMS, the system figures out the best way to carry out the request, and the SQL engine determines how to interpret the task. Several components take part in this process, such as the optimization engine, the query engine, and the query dispatcher. Non-SQL queries are handled by the classic query engine, but the SQL query engine does not handle logical files.

SQL Query Execution Order
(The original notes include a diagram here.) The following steps are performed:
Parsing: the query statement is tokenized.
Optimizing: the best algorithm for evaluating the SQL statement is chosen and byte code is produced.
From: the from keyword specifies the tables from which data is fetched.
Where: the where clause works like a conditional statement, filtering rows.
Join: a join combines data from more than one table based on a common field among them.
Group by: groups the rows of the table(s) by the values of the listed fields.
Having: the having clause also works like a conditional statement; it is mostly used with group by to filter groups.
Order by: sorts the result in a particular order, using "ASC" for ascending and "DESC" for descending order.
Select: this Data Manipulation Language statement retrieves the data from the database.
Limit: specifies how many rows the select statement returns.

Basic Structure:
What is the basic structure of SQL queries? The fundamental structure of an SQL query consists of three clauses: select, from, and where. What we want in the final result relation is specified in the select clause. Which relations we need to access to get the result is specified in the from clause. How the relations must be operated on to get the result is specified in the where clause.

select A1, A2, ..., An
from r1, r2, ..., rm
where P;

In the select clause, you specify the attributes that you want to see in the result relation. In the from clause, you specify the list of relations that must be accessed to evaluate the query. The where clause involves a predicate over attributes of the relations listed in the from clause.

Although an SQL query is written in the sequence select, from, where, to understand how the query operates you should read it in the order from, where, and then select. With the help of these three clauses, we can retrieve the information we want out of a huge set of data. Let us begin with the structure of queries imposed on a single relation.
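As a runnable sketch of this basic select–from–where skeleton, the snippet below uses Python's sqlite3 module with an instructor table; the sample rows are assumptions for illustration:

```python
import sqlite3

# Assumed sample instructor relation matching the attributes used in the notes.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE instructor (instr_id INTEGER, name TEXT, dept_name TEXT, salary REAL)")
conn.executemany("INSERT INTO instructor VALUES (?, ?, ?, ?)", [
    (1, "Srinivasan", "Comp. Sci.", 65000),
    (2, "Wu", "Finance", 90000),
    (3, "Mozart", "Music", 40000),
])

# select A1, ... from r where P -- read it as: from -> where -> select.
high_paid = conn.execute(
    "SELECT name, dept_name FROM instructor WHERE salary > 50000"
).fetchall()

# distinct eliminates duplicate values from the result.
depts = conn.execute("SELECT DISTINCT dept_name FROM instructor").fetchall()
print(high_paid)
```

sqlite3 stands in here for a full RDBMS; the same SQL text runs unchanged on MySQL or Oracle.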
Queries on a Single Relation

Consider a relation 'instructor' with the attributes instr_id, name, dept_name, and salary. Suppose we want the names of all instructors along with their corresponding department names. The SQL query that produces this result relation is:

select name, dept_name
from instructor;

Observe that we have not included a where clause in the query above: we want all instructors with their department names, so there is no condition to impose.

If we had asked only for dept_name, the query would by default list the department names of all instructors, retaining duplicates. To eliminate the duplicates, use the distinct keyword:

select distinct dept_name
from instructor;

You can also include arithmetic expressions in the select clause, using operators such as +, -, *, and /. For example, to display each instructor's name along with their salary reduced by 10%:

select name, salary * 0.9
from instructor;

To get specific tuples in the result relation, use the where clause to specify a condition. For example, to display the names of instructors whose salary is greater than 50000:

select name
from instructor
where salary > 50000;

In the where clause we can use the logical connectives and, or, and not, and we can include expressions that use comparison operators to compare operands. These were queries imposed on a single relation; now let us move on to queries over multiple relations.

Queries on Multiple Relations

SQL queries often need to access multiple relations from the data set in order to get the required result.
For example, take the two relations instructor and department. Suppose we want to retrieve the names of all instructors along with their department names and the corresponding department building. The instructor's name and department name are in the instructor relation, but the building name is in the department relation. The query is:

select name, instructor.dept_name, building
from instructor, department
where instructor.dept_name = department.dept_name;

Here the department name of each tuple of the instructor relation is matched with the department name of each tuple of the department relation. Observe that we have used the relation name as a prefix to the attribute name in the where clause, because the two attributes being compared have the same name.

The from clause of a multi-relation query acts as a Cartesian product of the relations listed in it. Consider the relations instructor and teaches; the schema of their Cartesian product is:

(instructor.ID, instructor.name, instructor.dept_name, instructor.salary, teaches.ID, teaches.course_id, teaches.sec_id, teaches.semester, teaches.year)

The prefix need not be added to attributes that appear in only one of the relations, so this can be written as:

(instructor.ID, name, dept_name, salary, teaches.ID, course_id, sec_id, semester, year)

The Cartesian product combines each tuple of the instructor relation with every tuple of the teaches relation. This results in an extremely large relation that is hardly of any use by itself. It is the where clause that restricts the unnecessary combinations and retains the meaningful ones needed to retrieve the desired result.

Natural Join

Like the Cartesian product, the natural join operation combines tuples from two relations to produce a result.
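Before looking at natural join in detail, the Cartesian-product-plus-where behaviour above can be checked end to end; a sketch with Python's sqlite3, where the table contents are assumed sample data:

```python
import sqlite3

# Assumed sample data for the instructor and department relations.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE instructor (name TEXT, dept_name TEXT);
CREATE TABLE department (dept_name TEXT, building TEXT);
INSERT INTO instructor VALUES ('Srinivasan', 'Comp. Sci.'), ('Wu', 'Finance');
INSERT INTO department VALUES ('Comp. Sci.', 'Taylor'), ('Finance', 'Painter');
""")

# The from clause alone is a Cartesian product: 2 x 2 = 4 combinations.
cart = conn.execute("SELECT * FROM instructor, department").fetchall()

# The where clause keeps only the meaningful combinations.
joined = conn.execute("""
    SELECT name, instructor.dept_name, building
    FROM instructor, department
    WHERE instructor.dept_name = department.dept_name
""").fetchall()
print(len(cart), joined)
```

With two rows per table, the Cartesian product has four tuples but only two survive the where predicate.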
The natural join operation combines only those tuples that have the same value for every attribute that appears in both relations being joined. (The original notes show the natural join of the instructor and teaches relations as a table here.) The natural join query can be written as:

select name, course_id
from instructor natural join teaches;

This is how you structure an SQL query to retrieve a desired result. There is a lot more to explore in SQL queries; to get the correct result you must be able to structure the query correctly.

SET Operators in SQL

SET operators are special operators used to combine the results of two queries. The operators covered under SET operators are:
1. UNION
2. UNION ALL
3. INTERSECT
4. MINUS

Certain rules must be followed to perform operations using SET operators in SQL:
1. The number and order of columns must be the same in both queries.
2. The data types of corresponding columns must be compatible.
Table 1: t_employees

ID  Name             Department   Salary  Year_of_Experience
1   Aakash Singh     Development  72000   2
2   Abhishek Pawar   Production   45000   1
3   Pranav Deshmukh  HR           59900   3
4   Shubham Mahale   Accounts     57000   2
5   Sunil Kulkarni   Development  87000   3
6   Bhushan Wagh     R&D          75000   2
7   Paras Jaiswal    Marketing    32000   1

Table 2: t2_employees

ID  Name            Department   Salary  Year_of_Experience
1   Prashant Wagh   R&D          49000   1
2   Abhishek Pawar  Production   45000   1
3   Gautam Jain     Development  56000   4
4   Shubham Mahale  Accounts     57000   2
5   Rahul Thakur    Production   76000   4
6   Bhushan Wagh    R&D          75000   2
7   Anand Singh     Marketing    28000   1

Table 3: t_students

ID  Name             Hometown   Percentage  Favourite_Subject
1   Soniya Jain      Udaipur    89          Physics
2   Harshada Sharma  Kanpur     92          Chemistry
3   Anuja Rajput     Jaipur     78          History
4   Pranali Singh    Nashik     88          Geography
5   Renuka Deshmukh  Panipat    90          Biology
6   Swati Kumari     Faridabad  93          English
7   Prachi Jaiswal   Gurugram   96          Hindi

Table 4: t2_students

ID  Name             Hometown  Percentage  Favourite_Subject
1   Soniya Jain      Udaipur   89          Physics
2   Ishwari Dixit    Delhi     86          Hindi
3   Anuja Rajput     Jaipur    78          History
4   Pakhi Arora      Surat     70          Sanskrit
5   Renuka Deshmukh  Panipat   90          Biology
6   Jayshree Patel   Pune      91          Maths
7   Prachi Jaiswal   Gurugram  96          Hindi

1. UNION:
UNION combines the results of two SELECT statements. Duplicate rows are eliminated from the result of the UNION operation.

Example 1: Write a query to perform a union of the tables t_employees and t2_employees.

Query:

mysql> SELECT * FROM t_employees UNION SELECT * FROM t2_employees;

Here, we have written two SELECT queries in a single statement. The first SELECT query fetches the records from the t_employees table, and UNION combines them with the records fetched by the second SELECT query from the t2_employees table.
You will get the following output:

ID  Name             Department   Salary  Year_of_Experience
1   Aakash Singh     Development  72000   2
2   Abhishek Pawar   Production   45000   1
3   Pranav Deshmukh  HR           59900   3
4   Shubham Mahale   Accounts     57000   2
5   Sunil Kulkarni   Development  87000   3
6   Bhushan Wagh     R&D          75000   2
7   Paras Jaiswal    Marketing    32000   1
1   Prashant Wagh    R&D          49000   1
3   Gautam Jain      Development  56000   4
5   Rahul Thakur     Production   76000   4
7   Anand Singh      Marketing    28000   1

Since we performed a union of the two tables, the records from both tables are displayed, except for the duplicate records.

2. UNION ALL
This operator combines all the records from both queries. Duplicate rows are not eliminated from the result of a UNION ALL operation.

Example 1: Write a query to perform a union all operation between the tables t_employees and t2_employees.

Query:

mysql> SELECT * FROM t_employees UNION ALL SELECT * FROM t2_employees;

Here, the first SELECT query fetches the records from the t_employees table, and UNION ALL combines them with the records fetched by the second SELECT query from the t2_employees table.
You will get the following output:

ID  Name             Department   Salary  Year_of_Experience
1   Aakash Singh     Development  72000   2
2   Abhishek Pawar   Production   45000   1
3   Pranav Deshmukh  HR           59900   3
4   Shubham Mahale   Accounts     57000   2
5   Sunil Kulkarni   Development  87000   3
6   Bhushan Wagh     R&D          75000   2
7   Paras Jaiswal    Marketing    32000   1
1   Prashant Wagh    R&D          49000   1
2   Abhishek Pawar   Production   45000   1
3   Gautam Jain      Development  56000   4
4   Shubham Mahale   Accounts     57000   2
5   Rahul Thakur     Production   76000   4
6   Bhushan Wagh     R&D          75000   2
7   Anand Singh      Marketing    28000   1

Since we performed a union all operation on the two tables, all records from both tables are displayed, including the duplicate records.

3. INTERSECT:
INTERSECT combines two SELECT statements but returns only the records that are common to both.

Example 1: Write a query to perform an intersect operation between the tables t_employees and t2_employees.

Query:

mysql> SELECT * FROM t_employees INTERSECT SELECT * FROM t2_employees;

(Note: MySQL supports INTERSECT only from version 8.0.31; it is standard SQL and has long been available in databases such as Oracle and PostgreSQL.)

Here, the first SELECT query fetches the records from the t_employees table, and INTERSECT keeps only those that also appear in the result of the second SELECT query on the t2_employees table.

You will get the following output:

ID  Name            Department  Salary  Year_of_Experience
2   Abhishek Pawar  Production  45000   1
4   Shubham Mahale  Accounts    57000   2
6   Bhushan Wagh    R&D         75000   2

Since we performed an intersect operation on the two tables, only the records common to both tables are displayed.

4. MINUS
o It displays the rows which are present in the first query but absent in the second query, with no duplicates.

Example 1: Write a query to perform a minus operation between the tables t_employees and t2_employees.
Query:

mysql> SELECT * FROM t_employees MINUS SELECT * FROM t2_employees;

(Note: MINUS is Oracle's keyword for this operator; standard SQL and MySQL 8.0.31+ spell it EXCEPT.)

Here, the first SELECT query fetches the records from the t_employees table, and the MINUS operation removes from them the records returned by the second SELECT query on the t2_employees table.

You will get the following output:

ID  Name             Department   Salary  Year_of_Experience
1   Aakash Singh     Development  72000   2
3   Pranav Deshmukh  HR           59900   3
5   Sunil Kulkarni   Development  87000   3
7   Paras Jaiswal    Marketing    32000   1

Since we performed a minus operation on the two tables, only the records of the first table that have no match in the second table are displayed.

SQL Aggregate Functions:

o An SQL aggregate function performs a calculation on multiple rows of a single column of a table and returns a single value.
o Aggregate functions are also used to summarize data.

Types of SQL Aggregate Functions

1. COUNT FUNCTION
o The COUNT function is used to count the number of rows in a database table. It works on both numeric and non-numeric data types.
o COUNT(*) returns the count of all rows in the specified table, counting duplicates and rows containing NULLs.
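Looping back to the set operators for a runnable check: the sketch below uses Python's sqlite3, where SQLite spells MINUS as EXCEPT; trimmed two-column versions of the employee tables are assumed for brevity:

```python
import sqlite3

# sqlite3 stands in for a full RDBMS; SQLite uses EXCEPT where Oracle uses MINUS.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t_employees (id INTEGER, name TEXT);
CREATE TABLE t2_employees (id INTEGER, name TEXT);
INSERT INTO t_employees VALUES (1, 'Aakash Singh'), (2, 'Abhishek Pawar'), (4, 'Shubham Mahale');
INSERT INTO t2_employees VALUES (2, 'Abhishek Pawar'), (4, 'Shubham Mahale'), (7, 'Anand Singh');
""")

def run(q):
    return conn.execute(q).fetchall()

union = run("SELECT * FROM t_employees UNION SELECT * FROM t2_employees")          # duplicates removed
union_all = run("SELECT * FROM t_employees UNION ALL SELECT * FROM t2_employees")  # duplicates kept
inter = run("SELECT * FROM t_employees INTERSECT SELECT * FROM t2_employees")      # common rows
minus = run("SELECT * FROM t_employees EXCEPT SELECT * FROM t2_employees")         # first minus second

print(len(union), len(union_all), len(inter), len(minus))  # 4 6 2 1
```

With two rows shared between the tables, UNION yields 4 distinct rows, UNION ALL all 6, INTERSECT the 2 shared rows, and EXCEPT the 1 row unique to the first table.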
Syntax

COUNT(*) or COUNT([ALL|DISTINCT] expression)

Sample table: PRODUCT_MAST

PRODUCT  COMPANY  QTY  RATE  COST
Item1    Com1     2    10    20
Item2    Com2     3    25    75
Item3    Com1     2    30    60
Item4    Com3     5    10    50
Item5    Com2     2    20    40
Item6    Com1     3    25    75
Item7    Com1     5    30    150
Item8    Com1     3    10    30
Item9    Com2     2    25    50
Item10   Com3     4    30    120

Example: COUNT()
SELECT COUNT(*) FROM PRODUCT_MAST;
Output: 10

Example: COUNT() with WHERE
SELECT COUNT(*) FROM PRODUCT_MAST WHERE RATE >= 20;
Output: 7

Example: COUNT() with DISTINCT
SELECT COUNT(DISTINCT COMPANY) FROM PRODUCT_MAST;
Output: 3

Example: COUNT() with GROUP BY
SELECT COMPANY, COUNT(*) FROM PRODUCT_MAST GROUP BY COMPANY;
Output:
Com1 5
Com2 3
Com3 2

2. SUM FUNCTION
The SUM function calculates the sum of the selected column. It works on numeric fields only.

Syntax
SUM() or SUM([ALL|DISTINCT] expression)

Example: SUM()
SELECT SUM(COST) FROM PRODUCT_MAST;
Output: 670

Example: SUM() with WHERE
SELECT SUM(COST) FROM PRODUCT_MAST WHERE QTY > 3;
Output: 320

Example: SUM() with GROUP BY
SELECT COMPANY, SUM(COST) FROM PRODUCT_MAST WHERE QTY > 3 GROUP BY COMPANY;
Output:
Com1 150
Com3 170

3. AVG FUNCTION
The AVG function calculates the average value of a numeric column. It returns the average of all non-NULL values.

Syntax
AVG() or AVG([ALL|DISTINCT] expression)

Example:
SELECT AVG(COST) FROM PRODUCT_MAST;
Output: 67.00

4. MAX FUNCTION
The MAX function finds the maximum value of a certain column, i.e., the largest of all selected values of that column.

Syntax
MAX() or MAX([ALL|DISTINCT] expression)

Example:
SELECT MAX(RATE) FROM PRODUCT_MAST;
Output: 30

5. MIN FUNCTION
The MIN function finds the minimum value of a certain column, i.e., the smallest of all selected values of that column.
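The aggregate examples above (COUNT, SUM, AVG, MAX) can be verified with a quick sqlite3 session; the rows below are copied from the PRODUCT_MAST sample table:

```python
import sqlite3

# PRODUCT_MAST rows from the sample table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product_mast (product TEXT, company TEXT, qty INTEGER, rate INTEGER, cost INTEGER)")
conn.executemany("INSERT INTO product_mast VALUES (?, ?, ?, ?, ?)", [
    ("Item1", "Com1", 2, 10, 20),  ("Item2", "Com2", 3, 25, 75),
    ("Item3", "Com1", 2, 30, 60),  ("Item4", "Com3", 5, 10, 50),
    ("Item5", "Com2", 2, 20, 40),  ("Item6", "Com1", 3, 25, 75),
    ("Item7", "Com1", 5, 30, 150), ("Item8", "Com1", 3, 10, 30),
    ("Item9", "Com2", 2, 25, 50),  ("Item10", "Com3", 4, 30, 120),
])

scalar = lambda q: conn.execute(q).fetchone()[0]

count_all = scalar("SELECT COUNT(*) FROM product_mast")                 # 10
companies = scalar("SELECT COUNT(DISTINCT company) FROM product_mast")  # 3
total_cost = scalar("SELECT SUM(cost) FROM product_mast")               # 670
avg_cost = scalar("SELECT AVG(cost) FROM product_mast")                 # 67.0
max_rate = scalar("SELECT MAX(rate) FROM product_mast")                 # 30

# SUM with WHERE and GROUP BY: only Item4, Item7, Item10 have qty > 3.
per_company = conn.execute(
    "SELECT company, SUM(cost) FROM product_mast WHERE qty > 3 GROUP BY company"
).fetchall()
print(per_company)  # [('Com1', 150), ('Com3', 170)]
```

Running this confirms the outputs listed in the notes, including the per-company sums 150 (Com1, Item7) and 170 (Com3, Item4 + Item10).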
Syntax
MIN() or MIN([ALL|DISTINCT] expression)

Example:
SELECT MIN(RATE) FROM PRODUCT_MAST;
Output: 10

Normalization

What is Normalization?
Normalization is the process of organizing the data in a database. It is used to minimize redundancy in a relation or set of relations, and to eliminate undesirable characteristics such as insertion, update, and deletion anomalies. Normalization divides larger tables into smaller ones and links them using relationships. Normal forms are used to reduce redundancy in database tables.

Why do we need Normalization?
The main reason for normalizing relations is to remove these anomalies. Failure to eliminate anomalies leads to data redundancy and can cause data integrity and other problems as the database grows. Normalization consists of a series of guidelines that help you create a good database structure.

Data modification anomalies can be categorized into three types:
Insertion Anomaly: a new tuple cannot be inserted into a relation because of missing data.
Deletion Anomaly: the deletion of some data results in the unintended loss of other important data.
Update Anomaly: an update of a single data value requires multiple rows of data to be updated.

Types of Normal Forms:
Normalization works through a series of stages called normal forms. The normal forms apply to individual relations; a relation is said to be in a particular normal form if it satisfies that form's constraints.

Following are the various normal forms:

1NF: A relation is in 1NF if every attribute contains only atomic values.
2NF: A relation is in 2NF if it is in 1NF and all non-key attributes are fully functionally dependent on the primary key.
3NF: A relation is in 3NF if it is in 2NF and no transitive dependency exists.
BCNF: Boyce–Codd normal form is a stronger definition of 3NF.
4NF: A relation is in 4NF if it is in Boyce–Codd normal form and has no multi-valued dependency.
5NF: A relation is in 5NF if it is in 4NF and contains no join dependency; every join must be lossless.

Advantages of Normalization
Normalization helps to minimize data redundancy.
Greater overall database organization.
Data consistency within the database.
A much more flexible database design.
Enforces the concept of relational integrity.

Disadvantages of Normalization
You cannot start building the database before knowing what the user needs.
Performance degrades when relations are normalized to higher normal forms, i.e., 4NF and 5NF.
It is very time-consuming and difficult to normalize relations of a higher degree.
Careless decomposition may lead to a bad database design and serious problems.

Normalization using Functional Dependencies

First Normal Form (1NF)
A relation is in 1NF if it contains only atomic values: an attribute of a table cannot hold multiple values; it must hold only single-valued attributes. First normal form disallows multi-valued attributes, composite attributes, and their combinations.

Example: The relation EMPLOYEE below is not in 1NF because of the multi-valued attribute EMP_PHONE.

EMPLOYEE table:

EMP_ID  EMP_NAME  EMP_PHONE               EMP_STATE
14      John      7272826385, 9064738238  UP
20      Harry     8574783832              Bihar
12      Sam       7390372389, 8589830302  Punjab

The decomposition of the EMPLOYEE table into 1NF is shown below:

EMP_ID  EMP_NAME  EMP_PHONE   EMP_STATE
14      John      7272826385  UP
14      John      9064738238  UP
20      Harry     8574783832  Bihar
12      Sam       7390372389  Punjab
12      Sam       8589830302  Punjab

Second Normal Form (2NF)
To be in 2NF, a relation must first be in 1NF.
In second normal form, all non-key attributes must be fully functionally dependent on the primary key.

Example: Suppose a school stores the data of teachers and the subjects they teach, and a teacher can teach more than one subject.

TEACHER table

TEACHER_ID  SUBJECT    TEACHER_AGE
25          Chemistry  30
25          Biology    30
47          English    35
83          Math       38
83          Computer   38

In this table, the candidate key is {TEACHER_ID, SUBJECT}, and the non-prime attribute TEACHER_AGE depends on TEACHER_ID alone, which is a proper subset of the candidate key. That is why it violates the rule for 2NF. To convert the table into 2NF, we decompose it into two tables:

TEACHER_DETAIL table:

TEACHER_ID  TEACHER_AGE
25          30
47          35
83          38

TEACHER_SUBJECT table:

TEACHER_ID  SUBJECT
25          Chemistry
25          Biology
47          English
83          Math
83          Computer

Third Normal Form (3NF)
A relation is in 3NF if it is in 2NF and does not contain any transitive dependency. 3NF is used to reduce data duplication and to help achieve data integrity. If no non-prime attribute is transitively dependent on a key, the relation is in third normal form.

A relation is in third normal form if, for every non-trivial functional dependency X → Y, at least one of the following holds:
1. X is a superkey.
2. Y is a prime attribute, i.e., each element of Y is part of some candidate key.

Example: EMPLOYEE_DETAIL table:

EMP_ID  EMP_NAME   EMP_ZIP  EMP_STATE  EMP_CITY
222     Harry      201010   UP         Noida
333     Stephan    02228    US         Boston
444     Lan        60007    US         Chicago
555     Katharine  06389    UK         Norwich
666     John       462007   MP         Bhopal

Superkeys in the table above: {EMP_ID}, {EMP_ID, EMP_NAME}, {EMP_ID, EMP_NAME, EMP_ZIP}, and so on.
Candidate key: {EMP_ID}
Non-prime attributes: in the given table, all attributes except EMP_ID are non-prime.

Here, EMP_STATE and EMP_CITY depend on EMP_ZIP, and EMP_ZIP depends on EMP_ID.
The non-prime attributes EMP_STATE and EMP_CITY are therefore transitively dependent on the candidate key EMP_ID, which violates the rule of third normal form. That is why we move EMP_CITY and EMP_STATE to a new table, with EMP_ZIP as the primary key.

EMPLOYEE table:

EMP_ID  EMP_NAME   EMP_ZIP
222     Harry      201010
333     Stephan    02228
444     Lan        60007
555     Katharine  06389
666     John       462007

EMPLOYEE_ZIP table:

EMP_ZIP  EMP_STATE  EMP_CITY
201010   UP         Noida
02228    US         Boston
60007    US         Chicago
06389    UK         Norwich
462007   MP         Bhopal

Boyce–Codd Normal Form (BCNF)
BCNF is an advanced version of 3NF; it is stricter than 3NF. A table is in BCNF if, for every functional dependency X → Y, X is a superkey of the table. That is, the table should be in 3NF and, for every FD, the left-hand side must be a superkey.

Example: Suppose there is a company where employees work in more than one department.

EMPLOYEE table:

EMP_ID  EMP_COUNTRY  EMP_DEPT    DEPT_TYPE  EMP_DEPT_NO
264     India        Designing   D394       283
264     India        Testing     D394       300
364     UK           Stores      D283       232
364     UK           Developing  D283       549

The functional dependencies in the table above are:
1. EMP_ID → EMP_COUNTRY
2. EMP_DEPT → {DEPT_TYPE, EMP_DEPT_NO}

Candidate key: {EMP_ID, EMP_DEPT}

The table is not in BCNF because neither EMP_ID nor EMP_DEPT alone is a superkey. To convert it into BCNF, we decompose it into three tables:

EMP_COUNTRY table:

EMP_ID  EMP_COUNTRY
264     India
364     UK

EMP_DEPT table:

EMP_DEPT    DEPT_TYPE  EMP_DEPT_NO
Designing   D394       283
Testing     D394       300
Stores      D283       232
Developing  D283       549

EMP_DEPT_MAPPING table:

EMP_ID  EMP_DEPT
264     Designing
264     Testing
364     Stores
364     Developing

Functional dependencies:
1. EMP_ID → EMP_COUNTRY
2. EMP_DEPT → {DEPT_TYPE, EMP_DEPT_NO}

Candidate keys:
For the first table: EMP_ID
For the second table: EMP_DEPT
For the third table: {EMP_ID, EMP_DEPT}

This decomposition is in BCNF because the left-hand side of each functional dependency is a key of its table.
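The claim that a decomposition like the 3NF one above loses no information can be checked mechanically: joining the decomposed tables back on the shared attribute must reproduce the original relation exactly. A sketch with Python's sqlite3, using the first three EMPLOYEE_DETAIL rows:

```python
import sqlite3

# First three rows of the EMPLOYEE_DETAIL example; enough to illustrate the check.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employee_detail (emp_id INTEGER, emp_name TEXT, emp_zip TEXT, emp_state TEXT, emp_city TEXT);
INSERT INTO employee_detail VALUES
  (222, 'Harry',   '201010', 'UP', 'Noida'),
  (333, 'Stephan', '02228',  'US', 'Boston'),
  (444, 'Lan',     '60007',  'US', 'Chicago');

-- The 3NF decomposition from the notes.
CREATE TABLE employee     AS SELECT emp_id, emp_name, emp_zip FROM employee_detail;
CREATE TABLE employee_zip AS SELECT DISTINCT emp_zip, emp_state, emp_city FROM employee_detail;
""")

# Joining the decomposed tables on emp_zip reproduces the original relation exactly,
# which is what "lossless decomposition" means.
rejoined = conn.execute("""
    SELECT e.emp_id, e.emp_name, e.emp_zip, z.emp_state, z.emp_city
    FROM employee e JOIN employee_zip z ON e.emp_zip = z.emp_zip
    ORDER BY e.emp_id
""").fetchall()
original = conn.execute("SELECT * FROM employee_detail ORDER BY emp_id").fetchall()
print(rejoined == original)  # True
```

The same check applied to a bad decomposition would produce spurious tuples or drop rows.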
Fourth Normal Form (4NF)
o A relation is in 4NF if it is in Boyce–Codd normal form and has no multi-valued dependency.
o For a dependency A → B, if a single value of A is associated with multiple values of B, then the relation has a multi-valued dependency.

Example: STUDENT

STU_ID  COURSE     HOBBY
21      Computer   Dancing
21      Math       Singing
34      Chemistry  Dancing
74      Biology    Cricket
59      Physics    Hockey

The STUDENT table is in 3NF, but COURSE and HOBBY are two independent attributes: there is no relationship between COURSE and HOBBY. In the STUDENT relation, the student with STU_ID 21 has two courses (Computer and Math) and two hobbies (Dancing and Singing). So there is a multi-valued dependency on STU_ID, which leads to unnecessary repetition of data. To bring the table into 4NF, we decompose it into two tables:

STUDENT_COURSE

STU_ID  COURSE
21      Computer
21      Math
34      Chemistry
74      Biology
59      Physics

STUDENT_HOBBY

STU_ID  HOBBY
21      Dancing
21      Singing
34      Dancing
74      Cricket
59      Hockey

Fifth Normal Form (5NF)
o A relation is in 5NF if it is in 4NF and contains no join dependency; joining must be lossless.
o 5NF is satisfied when all tables are broken into as many tables as possible in order to avoid redundancy.
o 5NF is also known as project-join normal form (PJ/NF).

Example:

SUBJECT    LECTURER  SEMESTER
Computer   Anshika   Semester 1
Computer   John      Semester 1
Math       John      Semester 1
Math       Akash     Semester 2
Chemistry  Praveen   Semester 1

In the table above, John takes both the Computer and Math classes for Semester 1 but does not take the Math class for Semester 2. In this case, the combination of all three fields is required to identify valid data. Suppose we add a new semester, Semester 3, but do not yet know the subject or who will be teaching it, so we would have to leave LECTURER and SUBJECT as NULL.
But all three columns together act as the primary key, so we cannot leave the other two columns blank. To bring the table into 5NF, we decompose it into three relations P1, P2, and P3:

P1

SEMESTER    SUBJECT
Semester 1  Computer
Semester 1  Math
Semester 1  Chemistry
Semester 2  Math

P2

SUBJECT    LECTURER
Computer   Anshika
Computer   John
Math       John
Math       Akash
Chemistry  Praveen

P3

SEMESTER    LECTURER
Semester 1  Anshika
Semester 1  John
Semester 2  Akash
Semester 1  Praveen

Multivalued Dependency
o A multivalued dependency occurs when two attributes in a table are independent of each other but both depend on a third attribute.
o A multivalued dependency involves at least two attributes that depend on a third, which is why it always requires at least three attributes.

Example: Suppose a bike manufacturer produces two colours (white and black) of each model every year.

BIKE_MODEL  MANUF_YEAR  COLOR
M2001       2008        White
M2001       2008        Black
M3001       2013        White
M3001       2013        Black
M4006       2017        White
M4006       2017        Black

Here the columns COLOR and MANUF_YEAR depend on BIKE_MODEL and are independent of each other. In this case, these two columns are said to be multivalued dependent on BIKE_MODEL. These dependencies are written as:

1. BIKE_MODEL →→ MANUF_YEAR
2. BIKE_MODEL →→ COLOR

This is read as "BIKE_MODEL multidetermines MANUF_YEAR" and "BIKE_MODEL multidetermines COLOR".

Join Dependency
o Join dependency is a further generalization of multivalued dependency.
o If the join of R1 and R2 over C is equal to the relation R, then a join dependency (JD) exists,
o where R1(A, B, C) and R2(C, D) are decompositions of a given relation R(A, B, C, D).
o Equivalently, R1 and R2 form a lossless decomposition of R.
o A JD ⋈{R1, R2, ..., Rn} is said to hold over a relation R if R1, R2, ..., Rn is a lossless-join decomposition of R.
o *((A, B, C), (C, D)) is a JD of R if the join of the decomposed relations over their common attributes is equal to the relation R.
o Here, *(R1, R2, R3) is used to indicate that relations R1, R2, R3, and so on are a JD of R.

File Organization

A file is a collection of records. Using the primary key, we can access the records. The type and frequency of access is determined by the type of file organization used for a given set of records.

File organization is a logical relationship among the various records. It defines how file records are mapped onto disk blocks. File organization describes the way records are stored in blocks, and the way the blocks are placed on the storage medium.

One approach to mapping the database to files is to use several files and store only fixed-length records in any given file. An alternative is to structure the files so that they can hold records of multiple lengths. Files of fixed-length records are easier to implement than files of variable-length records.

Objectives of file organization
Optimal selection of records: records can be selected as fast as possible.
Insert, delete, and update transactions on the records should be quick and easy.
Duplicate records should not be introduced as a result of insert, update, or delete operations.
Records should be stored efficiently, at minimal storage cost.

Types of file organization:
File organization includes various methods, each with pros and cons in terms of access and selection. The programmer decides the best-suited method according to the requirement. The record formats are:

1. Fixed-length records
2. Variable-length records

Fixed-Length Records
Fixed-length records means fixing a record length in advance and storing records of that length in the file.
If a record's size exceeds the block size, the record is divided across more than one block. Because of the fixed size, two problems occur:
Storing subparts of a record in more than one block requires access to all the blocks containing the subparts in order to read or write the record.
It is difficult to delete a record in such a file organization: the space the deleted record occupied must either be filled by another record (or part of one) or be marked so that it can be skipped.

The solution to these problems is to reserve a certain number of bytes at the beginning of the file, known as the file header. The file header carries a variety of information about the file, such as the address of the first record whose contents have been deleted; the address of the second such record is stored in the first, and so on, forming a chain of pointers to free slots. Insertion and deletion are easy in fixed-length-record files because the space freed by a deleted record is exactly the space required to insert a new record. But this approach fails for storing records of variable length.

Variable-Length Records
Variable-length records are records that vary in size, and storing them requires blocks that hold records of multiple sizes. Variable-length records arise in database systems in the following ways:
Storage of multiple record types in a file.
Record types that allow repeating fields, such as multisets or arrays.
Record types that allow variable lengths for one or more fields.

With variable-length records, the following two problems must be solved:
1. How to represent a single record so that its individual attributes can be extracted easily.
2. How to store variable-length records within a block so that a record in a block can be extracted easily.

Thus, the representation of a variable-length record can be divided into two parts: 1.
An initial part of the record, holding the fixed-length attributes (numeric values, dates, fixed-length character attributes) together with their values.
2. The data for the variable-length attributes, such as those of type varchar, represented in the initial part of the record by an (offset, length) pair. The offset is the position where the attribute's data begins, and the length is the length of the variable-size attribute.

Thus, the initial part stores fixed-size information about each attribute, whether it is a fixed-length or a variable-length attribute.

Slotted-Page Structure

Storing variable-length records within a block is the second of the problems above; such records are organized in a slotted-page structure within the block. In the slotted-page structure, a header is present at the start of each block. This header holds information such as:

The number of record entries in the block.
The end of free space remaining in the block.
An array containing the location and size of each record.

Inserting and Deleting

The variable-length records reside contiguously within the block, and the free space is contiguous as well. When a new record is inserted, it is placed at the end of the free space, and the header gains an entry with the size and location of the newly inserted record. When an existing record is deleted, its space is freed and its header entry is marked deleted; the records stored before the deleted one are moved so that the occupied space stays contiguous, the end-of-free-space pointer is updated, and all the free space once again lies between the final header entry and the first record.

The primary technique of the slotted-page structure is that no pointer should point directly at a record; it should instead point to the record's entry in the header, which contains the record's actual location. This allows records to be moved within the block to avoid fragmentation of the space inside the block, while the indirect pointers to the record stay valid.
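As a rough illustration of the slotted-page behaviour just described, here is an in-memory sketch (the class and names are invented; a real implementation packs the header into the same disk block rather than keeping Python lists):

```python
# Sketch of a slotted page: the header keeps the slot array and the
# end-of-free-space offset, and records are packed from the end of the
# block toward the header.
BLOCK_SIZE = 64

class SlottedPage:
    def __init__(self):
        self.data = bytearray(BLOCK_SIZE)
        self.slots = []                 # header array: (offset, length) per record
        self.free_end = BLOCK_SIZE      # records grow downward from the block end

    def insert(self, record: bytes):
        offset = self.free_end - len(record)
        self.data[offset:self.free_end] = record   # place at end of free space
        self.free_end = offset
        self.slots.append((offset, len(record)))
        return len(self.slots) - 1      # callers keep this slot number,
                                        # never a direct byte offset

    def get(self, slot):
        offset, length = self.slots[slot]
        return bytes(self.data[offset:offset + length])
```

External pointers store only the slot number, so the page is free to slide records around (for instance after a deletion) as long as it updates its own slot array.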
Data Dictionary in DBMS

We have learned about relations and their representation. A relational database system maintains all the information about a relation or table, from its schema to the applied constraints, and all of this metadata is stored. In general, metadata refers to data about data. The structure that stores the relational schemas and the other metadata about the relations is known as the Data Dictionary or System Catalog. A data dictionary is like an A-to-Z dictionary of the relational database system, holding all the information about each relation in the database.

The types of information a system must store are:

Names of the relations
Names of the attributes of each relation
Lengths and domains of attributes
Names and definitions of the views defined on the database
Various integrity constraints

In addition, the system keeps the following data about the users of the system:

Names of authorized users
Accounting and authorization information about users
Authentication information for users, such as passwords or other related information

The system may also store some statistical and descriptive data about the relations, such as:

The number of tuples in each relation
The method of storage for each relation, such as clustered or non-clustered

A system may also store the storage organization, whether sequential, hash, or heap, and it notes the location where each relation is stored:

If relations are stored in operating-system files, the data dictionary notes and stores the names of those files.
If the database stores all relations in a single file, the data dictionary notes and stores the blocks containing the records of each relation, in a data structure similar to a linked list.

Finally, it stores the information regarding each index of every relation:

Name of the index
Name of the relation being indexed
Attributes on which the index is defined
The type of index formed
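Since some systems store this metadata as relations in the database itself, the catalog entries listed above can be modeled roughly as follows (a sketch with invented structure and sample values, not any particular system's catalog; instructor is the example table used earlier in these notes):

```python
# Toy system catalog: each top-level entry mirrors one of the metadata
# categories listed above (relations, users, indexes, statistics).
catalog = {
    "relations": {
        # relation name -> list of (attribute name, domain, length)
        "instructor": [("ID", "varchar", 5), ("name", "varchar", 20),
                       ("dept_name", "varchar", 20), ("salary", "numeric", 8)],
    },
    "users": {
        # authorized user -> authentication/authorization info
        "alice": {"password_hash": "<hashed>", "can_create_tables": True},
    },
    "indexes": [
        # (index name, relation indexed, attributes, index type)
        ("instructor_id_idx", "instructor", ("ID",), "B+-tree"),
    ],
    "statistics": {
        # descriptive data: tuple counts and storage method per relation
        "instructor": {"num_tuples": 12, "storage": "heap"},
    },
}

def lookup_attributes(relation):
    # What a query processor does first: consult the dictionary
    # before touching the relation's data.
    return [name for name, _, _ in catalog["relations"][relation]]
```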
All of the above metadata is stored in a data dictionary, and the data dictionary is kept up to date whenever the relations change. Such metadata constitutes a miniature database. Some systems store the metadata as relations in the database itself; the system designers decide how the data dictionary is represented. A data dictionary typically stores its data in a non-normalized form, not following any normal form, so that the stored data can be accessed quickly.

For example, the data dictionary may underline a value to represent that the field contains a primary key. Whenever the database system needs to fetch records from a relation, it first looks up the location and storage organization of that relation in the data dictionary, and only after confirming these details does it retrieve the required records from the database.

Indexing in DBMS

Indexing is used to optimize the performance of a database by minimizing the number of disk accesses required when a query is processed. An index is a data structure used to locate and access the data in a database table quickly.

Index structure

Indexes can be created using one or more database columns.

The first column of the index is the search key, which contains a copy of the primary key or a candidate key of the table. These key values are stored in sorted order so that the corresponding data can be accessed easily.
The second column of the index is the data reference, which contains a set of pointers holding the addresses of the disk blocks where the value of the particular key can be found.

Indexing Methods

Ordered indices

The indices are usually sorted to make searching faster; indices kept in sorted order are known as ordered indices.
Example: Suppose we have an employee table with thousands of records, each 10 bytes long, with IDs starting at 1, 2, 3, and so on, and we have to search for the employee with ID 543.

Without an index, we have to scan the disk blocks from the start until we reach ID 543, so the DBMS reads 543 * 10 = 5430 bytes before reaching the record.
With an index of 2-byte entries, the DBMS reads only 542 * 2 = 1084 bytes to locate the record, far less than in the previous case.

Primary Index

If the index is created on the basis of the primary key of the table, it is known as a primary index. Primary keys are unique to each record, so there is a 1:1 relation between index entries and records. As primary keys are stored in sorted order, the performance of the searching operation is quite efficient. The primary index can be classified into two types: dense index and sparse index.

Dense index

The dense index contains an index record for every search-key value in the data file, which makes searching faster. The number of records in the index table is the same as the number of records in the main table, so it needs more space to store the index itself. Each index record holds the search-key value and a pointer to the actual record on the disk.

Sparse index

In a sparse index, index records appear for only some of the items in the data file; each entry points to a block. Instead of pointing to every record in the main table, the index points to records in the main table at intervals.

Clustering Index

A clustered index can be defined as an ordered data file. Sometimes an index is created on non-primary-key columns, which may not be unique for each record. In this case, to identify the records faster, we group two or more columns to obtain a unique value and create an index on them. This method is called a clustering index.
Records with similar characteristics are grouped together, and indexes are created for these groups.

Example: Suppose a company has several employees in each department. If we use a clustering index in which all employees belonging to the same Dept_ID are considered to be within a single cluster, the index pointers point to the cluster as a whole. Here Dept_ID is a non-unique key.

The scheme above can be confusing when one disk block is shared by records belonging to different clusters; using a separate disk block for each separate cluster is the better technique.

Secondary Index

In sparse indexing, as the size of the table grows, the size of the mapping also grows. The mapping is usually kept in primary memory so that address fetches are fast; secondary memory is then searched for the actual data using the address obtained from the mapping. If the mapping grows too large, fetching the address itself becomes slow, and the sparse index is no longer efficient. To overcome this problem, secondary indexing is introduced.

In secondary indexing, another level of indexing is introduced to reduce the size of the mapping. First, large ranges of the column values are selected so that the mapping size of the first level stays small; each range is then divided further into smaller ranges. The first-level mapping is stored in primary memory, so that address fetches are fast. The second-level mapping and the actual data are stored in secondary memory (hard disk).

Hashing in DBMS

The hashing technique is used to calculate the direct location of a data record on the disk without using an index structure. Data is stored in data blocks whose addresses are generated using a hash function. The memory locations where these records are stored are known as data buckets or data blocks.
The hash function can choose any column value to generate the address; most of the time, it uses the primary key. The hash function can be anything from a simple to a complex mathematical function. We can even consider the primary key itself as the address of the data block, so that each row is stored at the data-block address equal to its primary key.

In that arrangement, the data-block addresses are the same as the primary-key values. The hash function can also be a simple mathematical function such as mod, exponential, cos, or sin. Suppose we have a mod(5) hash function to determine the address of the data block: applying mod(5) to the primary keys generates 3, 3, 1, 4, and 2 respectively, and the records are stored at those data-block addresses.

Types of Hashing

Static Hashing
Dynamic Hashing

Static Hashing

In static hashing, the resultant data-bucket address is always the same. If we generate an address for EMP_ID = 103 using the hash function mod(5), it always results in the same bucket address, 3. There is no change in the bucket address, so the number of data buckets in memory remains constant throughout. In this example, there are five data buckets in memory used to store the data.

Operations of Static Hashing

o Search a record: when a record needs to be searched, the same hash function retrieves the address of the bucket where the data is stored.
o Insert a record: when a new record is inserted into the table, an address is generated for the new record based on its hash key, and the record is stored at that location.
o Delete a record: first fetch the record that is to be deleted, then delete the record at that address in memory.
o Update a record: first search for the record using the hash function, then update the data record in place.

Bucket Overflow

If we want to insert a new record into the file but the data-bucket address generated by the hash function is not empty, i.e., data already exists at that address, the situation is known as bucket overflow. This is a critical situation for static hashing. There are various methods to overcome it; commonly used methods are as follows:

1. Open Hashing

When the hash function generates an address at which data is already stored, the next available bucket is allocated to the record. This mechanism is called linear probing. For example, suppose a new record R3 needs to be inserted and the hash function generates address 112 for it, but that bucket is already full. The system then searches for the next available data bucket, 113, and assigns R3 to it.

2. Close Hashing

When a bucket is full, a new data bucket is allocated for the same hash result and linked after the previous one. This mechanism is known as overflow chaining. For example, suppose a new record R3 needs to be inserted and the hash function generates address 110 for it, but that bucket is too full to store the new data. A new bucket is then allocated at the end of bucket 110's chain and linked to it.

Dynamic Hashing

o The dynamic hashing method is used to overcome the problems of static hashing, such as bucket overflow.
o In this method, data buckets grow or shrink as the number of records increases or decreases. It is also known as the extendible hashing method.
o It makes hashing dynamic, i.e., it allows insertion and deletion without resulting in poor performance.

How to search a key

o First, calculate the hash address of the key.
o Check how many bits the directory uses; this number of bits is called i.
o Take the least significant i bits of the hash address.
This gives an index into the directory.
o Using this index, go to the directory and find the address of the bucket where the record might be.

How to insert a new record

o First, follow the same procedure as retrieval, ending up at some bucket.
o If there is still space in that bucket, place the record in it.
o If the bucket is full, split the bucket and redistribute the records.

For example, consider the following grouping of keys into buckets, depending on the last bits of their hash addresses:

The last two bits of the hash addresses of 2 and 4 are 00, so they go into bucket B0.
The last two bits of 5 and 6 are 01, so they go into bucket B1.
The last two bits of 1 and 3 are 10, so they go into bucket B2.
The last two bits of 7 are 11, so it goes into bucket B3.

Now insert key 9, with hash address 10001, into the above structure. Its last two bits are 01, so it must go into bucket B1; but bucket B1 is full, so it gets split. The split separates 5 and 9 from 6: the last three bits of the hash addresses of 5 and 9 are 001, so they go into bucket B1, while the last three bits of 6 are 101, so it goes into a new bucket B5.

Keys 2 and 4 are still in B0; the records in B0 are pointed to by the 000 and 100 entries, because the last two bits of both entries are 00.
Keys 1 and 3 are still in B2; the records in B2 are pointed to by the 010 and 110 entries, because the last two bits of both entries are 10.
Key 7 is still in B3; the record in B3 is pointed to by the 111 and 011 entries, because the last two bits of both entries are 11.

Advantages of dynamic hashing

o Performance does not decrease as the data in the system grows; the structure simply increases the memory used to accommodate the data.
o Memory is well utilized, as it grows and shrinks with the data; no allocated memory lies unused.
o This method is good for dynamic databases where data grows and shrinks frequently.
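The directory-and-split behaviour of the example above can be sketched as follows. This is a minimal illustration under two simplifying assumptions: each key's own binary value serves as its hash address, and a bucket holds only two keys; the class and names are invented.

```python
class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth  # how many address bits this bucket uses
        self.keys = []

class ExtendibleHash:
    BUCKET_CAPACITY = 2                 # tiny, to force splits quickly

    def __init__(self):
        self.global_depth = 1           # i: bits of the hash address in use
        self.directory = [Bucket(1), Bucket(1)]

    def _index(self, key):
        # Take the least significant i bits of the hash address.
        return key & ((1 << self.global_depth) - 1)

    def search(self, key):
        return key in self.directory[self._index(key)].keys

    def insert(self, key):
        bucket = self.directory[self._index(key)]
        if len(bucket.keys) < self.BUCKET_CAPACITY:
            bucket.keys.append(key)
            return
        # Bucket full: double the directory if this bucket already uses
        # all global_depth bits, then split the bucket.
        if bucket.local_depth == self.global_depth:
            self.directory = self.directory * 2
            self.global_depth += 1
        bucket.local_depth += 1
        new_bucket = Bucket(bucket.local_depth)
        high_bit = 1 << (bucket.local_depth - 1)
        for i in range(len(self.directory)):
            # Directory entries whose new distinguishing bit is 1
            # are repointed to the new bucket.
            if self.directory[i] is bucket and i & high_bit:
                self.directory[i] = new_bucket
        # Redistribute the old keys plus the new one.
        old_keys, bucket.keys = bucket.keys, []
        for k in old_keys + [key]:
            self.insert(k)
```

Because the sketch hashes the key's own bits rather than the separate hash addresses of the example, the exact bucket contents differ, but the mechanics are the same: directory doubling, local versus global depth, repointing of entries, and redistribution of keys after a split.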
Disadvantages of dynamic hashing

o As the data size increases, the number of buckets increases as well, and the addresses of the data must be maintained in the bucket address table. Because the data addresses keep changing as buckets grow and shrink, maintaining the bucket address table becomes tedious when there is a huge increase in data.
o Bucket overflow can still occur in this method, although it typically takes longer to reach that situation than in static hashing.
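For contrast with the dynamic scheme, the static setup from earlier, a fixed set of five mod(5) buckets with linear probing (open hashing) on overflow, can be sketched as follows (class name and sample keys are illustrative):

```python
NUM_BUCKETS = 5                     # fixed for the file's lifetime

class StaticHashFile:
    def __init__(self):
        self.buckets = [None] * NUM_BUCKETS   # one record per bucket here

    def _home(self, key):
        return key % NUM_BUCKETS    # the mod(5) hash function

    def insert(self, key, record):
        addr = self._home(key)
        for step in range(NUM_BUCKETS):       # linear probing on overflow
            slot = (addr + step) % NUM_BUCKETS
            if self.buckets[slot] is None:
                self.buckets[slot] = (key, record)
                return slot
        # Static hashing cannot add buckets, so a full file is fatal.
        raise OverflowError("all buckets full")

    def search(self, key):
        addr = self._home(key)
        for step in range(NUM_BUCKETS):
            slot = (addr + step) % NUM_BUCKETS
            entry = self.buckets[slot]
            if entry is None:       # an empty slot ends the probe chain
                return None
            if entry[0] == key:
                return entry[1]
        return None
```

EMP_ID 103 hashes to bucket 3; a second key 108 also hashes to 3 and is pushed to bucket 4 by probing, which is exactly the bucket-overflow situation dynamic hashing is designed to avoid.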