IM101 Week 1-5.docx

Full Transcript

**[LESSON PROPER FOR WEEK 1]** **What Is a Database?** A database, in the most general sense, is an organized collection of data. More specifically, a database is an electronic system that allows data to be easily accessed, manipulated and updated. In other words, a database is used by an organiza...

**[LESSON PROPER FOR WEEK 1]** **What Is a Database?** A database, in the most general sense, is an organized collection of data. More specifically, a database is an electronic system that allows data to be easily accessed, manipulated and updated. In other words, a database is used by an organization as an electronic way to store, manage and retrieve information. The database is one of the cornerstones of enterprise IT, and its ability to organize, process and manage information in a structured and controlled manner is *the* key to many aspects of modern business efficiency. However, databases go way beyond simply storing data. As we'll see later, the inherent logic and efficiency in *how* the data is stored and retrieved can provide an incredibly powerful business tool to an organization. This is especially true when databases are properly exploited for their reporting and business intelligence capabilities. **History of Databases** It is important to first realize that the organized, systematic methodology of storing records we know and heavily depend on in databases is not a recent invention. What *is* recent is the computerization of this methodology beginning in the 1960s. Note that even paper-based records, including ledger-based bookkeeping, are (technically) all forms of a database. That is, a database does not necessarily have to be computerized. Computerization only produced a database management system (DBMS), which is obviously several orders of magnitude more powerful, accurate and capable than what a humble ledger or a puny human brain can achieve. And although we are mostly using the term "database" to refer to the DBMS, the two are not the same thing; all thumbs (DBMSs) are fingers (databases), but not all fingers are thumbs. The ancient Egyptians used elaborate record-keeping systems to keep stock of grain harvests. The Library of Alexandria employed a sophisticated method to keep track of huge numbers of books and scrolls. These were all early examples of databases, although of course their capabilities would be laughable compared to the hugely capable computerized DBMSs of the 21st century. But even way back in time, back when the entire field of computing was still in its single-celled-organism stage (the 1960s), many people could already visualize that computers would be truly useful if they could provide a way of reliably storing and retrieving data. The development of databases therefore occurred almost in perfect step with the general development and growth of the computing capabilities of the day. As disk capacity and processor speed grew, so did the storage capacity and feature sets of the contemporary database offerings. One important leap that occurred in the mid-1960s was the switch from tape-based storage to direct access storage, or disks. This change allowed multitasking interactive data access, as opposed to the single-operator, batch-type processing necessitated by tapes. The earliest database systems were navigational in nature. This means that applications processed and read data by using pointers embedded in the data itself. The pointer led to the next data item and could be doubly linked, allowing linkage to both the previous and next data items. This is similar to how hyperlinks work on a Web page by leading the reader to a related Web page from the current one. The two main data models at this time were the hierarchical model epitomized by IBM's IMS system, and the Codasyl, or network, model. But all these were bested and reduced to mere interesting footnotes in history by the emergence of the relational model by a brilliant computer scientist by the name of E.F. Codd and the Relational Model. The relational model was a radical departure from the reigning hierarchical model in that it focused on the ability to search a database by content rather than by following a linked navigation system. This offered the significant advantage of allowing databases to grow and store more and more data, all without having to change or rewrite the applications that accessed that data. Essentially, Codd single-handedly designed a way to divorce the skeleton or structure of the database from the data records held in the database. So elegant was this model that it is the de facto standard for database design to this day, with such databases termed relational databases. There are a few very important non-relational databases (especially with the advent of big data and Web 2.0), but the relational model is still used for the overwhelming majority of commercial database offerings. Today, E.F. Codd's name would mostly evoke a nonchalant "E.F. who?" among most people, even many in the IT industry. However, his work has directly led to the huge benefits and efficiency that relational databases provide. His contribution to the world of computing is comparable in scale to that of Sir Isaac Newton's to the world of physics. Codd attended Oxford college, studying mathematics and chemistry, then worked as a pilot in the Royal Air Force during WWII before moving to the U.S. in 1948 to work as a mathematical programmer for IBM. After spending a decade in Canada, he returned to the U.S. in 1963 and received his Ph.D. in 1965. In 1970, Codd published a paper on data management titled "A Relational Model of Data for Large Shared Data Banks" for IBM. The giant company, however, was heavily invested in the hierarchical model via its Information Management System (IMS), and Big Blue executives were not interested in developing a competitor for one of their own lucrative product lines. Showing guile rarely seen in academic or scientific types, Codd slyly showed his model to select IBM customers, who upon viewing it needed little convincing of its superiority. The influential customers in turn put pressure on the very same IBM executives to develop the model and they reluctantly (and, one imagines, seething quietly with fury at Codd) placed the model under development in IBM's Future Systems project, with the system itself known as System R. However, the head honchos were still unwilling to threaten IMS, and sabotaged Codd's work by placing the System R project in the hands of developers who were unfamiliar with it. The developers thus failed to use Codd's own Alpha language for development, instead electing to use a much simpler language known as SEQUEL. This turned out to be an accidental masterstroke, however, since SEQUEL is much easier to understand and use. For copyright reasons, the name was changed to SQL, and is very familiar to database developers and administrators today as the language of choice for writing database queries. A shrewd young businessman who was developing his own database system read about SQL at a conference in 1979. He recognized its superiority and copied the language into a database product by his own small company. The businessman had also previously seen Codd's work on the relational model, and became convinced that it was the way to go for database systems. He based his own product on it, even though IBM refused to share System R's code with him. Remember, IBM was not interested in the relational model. That small company has grown quite a bit; today it's known as Oracle Corp. As for the businessman, his name is Larry Ellison, and his conviction helped him become one of the richest people in the world. It just goes to show how badly IBM miscalculated the potential of Codd's relational model. In fact, Oracle DB is the most widely used relational database for corporations today. **The Relational Database** As we have already seen, the work of E.F. Codd established the relational model of databases as the clearly superior method of data storage. The elegance of the relational model in storing and manipulating data has been so effective that since the late 1970s, none of the many pretenders to its throne have managed to overthrow it. So, let's go into some detail on exactly how the relational model works. A relational database is essentially a group of tables or, to use the technical name, entities (refer to rules 0 and 1 in Codd's 12 Rules of Relational Databases). Each table is made up of rows (tuples) and columns (attributes). The tables have relationships between them that are defined as using a certain column in one table that references a column in another table. That is the basic definition of a relational database. But as you will soon see, it can get much more elaborate than this. For instance, one of the fundamental concepts of relational databases is that of referential integrity. This rule states that relationships between tables must always remain consistent. In other words, any field located in a foreign key must be in agreement with the primary key that the foreign key references. Therefore, any updates or deletions to a primary key field must either also be applied to all its foreign keys, or must not be allowed to happen. The same restriction also applies to foreign keys; any updates (but not necessarily deletions) must either also be propagated back to the corresponding parent primary key, or not allowed. The idea of referential integrity is best explained by illustration. Assume that there are two tables in a bank's database: the CUSTOMER\_MASTER table for holding basic customer/account holder data, and the ACCOUNTS\_MASTER table for storing basic data about bank accounts. To uniquely identify each customer/account holder in the CUSTOMER\_MASTER table, we create a primary key column, which we call CUSTOMER\_ID. You can see that in order to identify the customer to which a certain bank accounts belongs in the ACCOUNTS\_MASTER table, we must reference an existing customer in the CUSTOMER\_MASTER table. This means we need to create a CUSTOMER\_ID column in the ACCOUNTS\_MASTER table as well. This is called a foreign key. This column is special because the values it contains are not new values that you can just conjure up. They must reference identical, existing values in the primary-key column of another table, in this case, the CUSTOMER\_ID column of the CUSTOMER\_MASTER table Clear, yes? Good. Now, referential integrity simply means that you cannot edit any CUSTOMER\_ID value in CUSTOMER\_MASTER without also editing the corresponding value in the ACCOUNTS\_MASTER table. If you change Andrew Smith's customer ID in CUSTOMER\_MASTER, you must also change it in ACCOUNTS\_MASTER. This makes perfect sense if you think about it. Otherwise, how would we link Andrew Smith's accounts back to him? Remember, the ACCOUNTS\_MASTER table does not hold any information about account holders apart from storing their CUSTOMER\_IDs as a foreign key. By the same logic, if a customer ID in the bank accounts table were allowed to exist without a corresponding customer ID in the CUSTOMER\_MASTER table, this usually means that a bank account can exist without an account holder, which clearly does not make sense. Now, following this train of thought a bit further, if you delete a customer ID in CUSTOMER\_MASTER, you must also delete all corresponding entries in ACCOUNTS\_MASTER. This is simply the manifestation of a neat follow-through action in real life, where if someone stops being a customer, you must also do away with their bank accounts. As Star Trek's Spock would put it, it's only logical! **The 12 Rules of Relational Databases** As the relational model started to become fashionable for database design in the early 1980s, Codd was at first bemused then angered by the trend by every other database vendor to slap the relational moniker on their product, even when it didn't apply. He came up with a list of 12 rules that determined whether a database could be called "relational". His difficulties with IBM reached a peak at this time, and he left to form his own consulting company with another lecturer/ consultant called Chris Date. Below are Codd's 12 rules or, as some call them, 12 commandments. There are actually 13, but they are numbered from 0 to 12, hence the name. We will look at them in more detail once we delve into the features of relational databases and reference them then so that you can really understand Codd's rules in context. In raw form, these rules do not make much sense for those who don't work in the database field, but anyway, here goes: **0. Foundation Rule** A relational database management system must manage its stored data using only its relational capabilities. **1. Information Rule** All information in the database should be represented in one and only one way -- as values in a table. **2. Guaranteed Access Rule** Each and every datum (atomic value) is guaranteed to be logically accessible by resorting to a combination of table name, primary key value and column name. **3. Systematic Treatment of Null Values** Null values (which are distinct from empty character strings, strings of blank characters, zeros or any other number) are supported in the fully relational DBMS for representing missing information in a systematic way that is independent of data type. **4. Dynamic Online Catalog Based on the Relational Model** The database description is represented at the logical level in the same way as ordinary data, so authorized users can apply the same relational language to its interrogation as they apply to regular data. **5. Comprehensive Data Sublanguage Rule** A relational system may support several languages and various modes of terminal use. However, there must be at least one language whose statements are expressible, per some well-defined syntax, as character strings and whose ability to support all of the following is comprehensible: - - - - - - **6. View Updating Rule** All views that are theoretically updateable are also updateable by the system. **7. High-Level Insert, Update and Delete** The ability to handle a base relation or a derived relation as a single operand applies not only to the retrieval of data, but also to the insertion, update and deletion of data. **8. Physical Data Independence** Application programs and terminal activities remain logically unimpaired whenever any changes are made in either storage representation or access methods. **9. Logical Data Independence** Application programs and terminal activities remain logically unimpaired when information-preserving changes of any kind -- and those that theoretically permit unimpairment -- are made to the base tables. **10. Integrity Independence** Integrity constraints specific to a particular relational database must be definable in the relational data sublanguage and storable in the catalog, not in the application programs. **11. Distribution Independence** The data manipulation sublanguage of a relational DBMS must enable application programs and terminal activities to remain logically unimpaired whether and whenever data are physically centralized or distributed. **12. Nonsubversion Rule** If a relational system has or supports a low-level (single-record-at-a-time) language, that low-level language cannot be used to subvert or bypass the integrity rules or constraints expressed in the higher-level (multiple-records-at-a-time) relational language. **Fundamental Database Concepts** We will now look at some key database concepts and data objects. Any database administrator worth his salt must be absolutely familiar with these. And they aren't just theoretical; virtually all DBAs will be intimately involved with these concepts and objects on a near-daily basis. They are to database administration what knowledge of the human body is to the field of medicine. We will only define each term briefly here. To better grasp the concepts, together with real-world examples, follow the term links and visit the "Related Terms" section to understand how each concept relates to and works with others in the realm of database administration. **Tables** The table is the basic data-storage unit in a relational database. Tables consist of columns and rows. The columns are the attributes or qualities that we want to express, while the rows hold the actual data, with one (or no) items per row. Think of the layout of a spreadsheet; this is very similar to the logical organization of a relational table. A simple example of the tables in the database of a commercial bank can be found below. **Relationship** ![](media/image31.png) Relationships are THE reason why relational databases work so well. If you only learn one concept about databases, this is the one to learn. As the name implies, relationships are the very core of relational databases.In relational databases, a relationship exists between two tables when one of them has a foreign key that references the primary key of the other table. (More on foreign and primary keys in a bit). In the diagram below, you can see examples of relationships. For instance, the AccountTypeID field (column) in the AccountTypes table references the AccountTypeID column of the Customer table. **Row** A row, also called a record, represents a set of data about a specific item. Every record in a table has exactly the same structure, but of course different data. Think of the rows in an Excel spreadsheet -- the concept of rows in a database is very similar. Each row in a table consists of distinct data items, with one (or zero) items for each column of the table. Rows are also called tuples, although this term is not very common. See an example of a record or row below: **Column** A column is a specific set of values in a table of the same type. It defines a specific attribute of the table or data. For example, we can create a column called CUSTOMER\_SURNAME in a table. This is a self-explanatory column whose purpose is to store customer surnames, one value for each row. Again, think of the rows in an Excel spreadsheet and you'll have a pretty good idea of how columns in a relational table work. **Primary Key** A primary key is a special column or combination of columns that uniquely identifies each record (row) in the table. The primary key column must be unique for each row, and must not contain any nulls (non-values). So, for example, to identify yourself in various databases belonging to various U.S. government departments, a unique identifier, the Social Security number, is assigned and used. The primary key, together with the closely related foreign key concept, are the main way in which relationships are defined. Primary keys may also be a combination of columns. For instance, for many companies, a calendar month is a short-term financial period. Therefore, to uniquely identify any period, you can combine the month column and the year column, such as May 2011, to form a primary key to uniquely identify each and every financial period. **Foreign Key** We cannot talk about the yin of primary keys without the yang of foreign keys. The two go hand-in-hand. A primary key uniquely defines a record, while a foreign key is used to reference the same record from another table. In a commercial bank's database, assume that you have a CUSTOMER\_ID column as the primary key in the CUSTOMER\_MASTER table. This uniquely identifies each customer; let us also assume the same table also contains other relevant customer information in other columns, such as CUSTOMER\_SURNAME, CUSTOMER\_FIRSTNAME, CUSTOMER\_SOCIAL\_SEC\_NUMBER, CUSTOMER, CUSTOMER\_GENDER, and so on. We also have another table called LOANS\_MASTER to keep track of loans given to the same bank's customers. We now only need a single column in this table to identify which customer has received a particular loan. We can call this column CUSTOMERID, and it will reference the CUSTOMER\_ID column of the CUSTOMER\_MASTER table. We don't need to store all the other customer information (name, gender, Social Security number, etc.) in the loans table. This is the elegance of the relational model. **SQL** Structured Query Language (SQL) is the de facto language used for the management and manipulation of data in relational databases. SQL can be used to query, insert, update and modify data. All major relational databases support SQL, which makes life much easier for database administrators (DBAs), who have to support databases on several different platforms. Proficiency in SQL is usually one of the very first things any DBA must learn early in his or her career. Note that some people pronounce SQL as one word, "sequel". Note that the SQL language is different from SQL Server, a relational database platform from Microsoft. It can be confusing for beginners because of the use of the generic term SQL. Most commercial RDBMS platforms have their own customized SQL implementations, but these tend to be fully compatible with the standard SQL. **Other Types of Databases** **Databases vs. Spreadsheets** "Excel also stores my data and I can retrieve and manipulate the data using filters, join it to other files and worksheets, perform advanced functions like VLOOKUP, PivotTable, and so on. So why do I need this fancy database thing you are talking about? You IT guys just want to milk money from me. No, my Excel sheets are perfectly adequate!" Sound familiar? It might be if you've ever worked as a database consultant for small businesses. It would seem at first glance that a lot of the functionality offered by databases can be achieved much more easily (and cheaply!) by just using spreadsheets. However, the spreadsheet has a number of limitations that make it unsuitable for managing some data situations: - - - - Spreadsheets are much easier to create and maintain. Databases require more investment in terms of both of financial outlay and human training. However, the reward for this is a much more robust and secure storage and retrieval data system. The point at which you need to move away from spreadsheets and into a database-based system may be when you answer "yes" to one or more of the following questions: - - - - Need more convincing? Well an authority no less than the U.S. government has decreed, via Section 404 of the Sarbanes-Oxley Act, that all public companies must move the reporting of key financial data away from spreadsheets. **Relational Databases** As we have already established, the reasons for the dominance of relational databases are simplicity, robustness, flexibility, performance, scalability and compatibility in managing generic data. However, to offer all of this, relational databases have to be incredibly complex internally. For example, a relatively simple SELECT statement could have hundreds of potential query execution paths, which the optimizer would evaluate at run time. All of this is hidden to us as users, but under the hood, RDBMS determines the "execution plan" that best answers requests by using things like cost-based algorithms. America's best-selling sedan of the 2000s was the Toyota Camry, and it's not hard to see why. The Camry is not the best in many of the categories used to judge cars, such as safety, gas mileage, interior volume, reliability and several others. However, it is always near the top in each category, giving it an overall aggregate score that puts it on top. So, what you may ask, does a midsized family sedan have to do with our discussion on databases? Like the Camry, the relational database does not excel or really shine in any one of the qualities of a good database, but it fulfills all sections nicely, enough to make it the default go-to option. With the advent of the Web 2.0 and especially cloud-based computing, there has been an increased need for Web-capable databases to serve up, store and manage mind-bogglingly large amounts of content. Content such as Facebook's user profiles and posts for millions around the world; Google's billions of searches and Web crawls of other websites; Dropbox's millions of stored user documents and files; eBay's millions of auction listings, and so on. Broadly speaking, the buzzword for this area is "big data." For all these Web-centric databases, the main concern is scalability. As more and more applications are launched in environments that have massive workloads (think of the diverse range of Web services available on the Web today), their scalability requirements can change very quickly and grow very large. Relational databases scale well, but usually only when that scaling happens on a single server. When the capacity of that single server is reached, you need to scale out and distribute that load across multiple servers, moving into so-called distributed computing. This is when the complexity of relational databases starts to rub against their potential to scale. Try scaling to hundreds or thousands of servers, and the complexities become overwhelming. The characteristics that make RDBMS so appealing are the very same that also drastically reduce their viability as platforms for large distributed systems. For cloud services to be viable, vendors have had to address this limitation, because a cloud platform without a rapidly scalable data store is next to useless. So, to provide customers with a scalable place to store application data, vendors have had to implement a new type of database system that focuses obsessively on scalability, at the expense of the other benefits that come with relational databases. Next, we will look at these new, non-relational databases in more detail. **Non-Relational Databases** One of the most severe limitations of relational databases is that each item can only contain one attribute. In the bank example we looked at before, each aspect of a customer's relationship with a bank must (or rather is best) stored as separate row items in separate tables. The customer's master details are in one table, the account details are in another table, the loan details in yet another, investments in a different table, and so on. All these tables are linked to each other through the use of relations, or primary keys and foreign keys. Non-relational databases, specifically a database's key-value stores or key-value pairs, are radically different from this model. Key-value pairs allow you to store several related items in one "row" of data in the same table. We place the word "row" in quotes because a row here is not really the same thing as the row of a relational table or spreadsheet, although it is still useful to call it that for comparison's sake. For instance, in a non-relational table for the same bank, each row would contain the customer's details as well as their account, loan and investment details. All data relating to one customer would be conveniently stored together as one record. This seems an obviously superior method of storing data. So why on earth would we still use relational databases over this model? Well, the problem with key-value stores is that, unlike relational databases, they cannot enforce relationships between data items. For instance, in our key-value database, the customer details (name, social security, address, account number, loan processing number, etc.) would all be stored as one data record (instead of being stored in several tables, as in the relational model). The customer's transactions (account withdrawals, account deposits, loan repayments, bank charges, etc.) would also be stored as another single data record. In the relational model, there is an inbuilt and foolproof method of ensuring and enforcing business logic at the database layer, for instance that a withdrawal is charged to the correct bank account. In key-value stores, this responsibility falls squarely on application logic. Many people are very jittery about leaving this crucial responsibility just to the application, which is why relational databases are not going away any time soon. However, when it comes to Web-based databases, the aspect of rigorously enforcing business rules is often far down the list of priorities. Topmost on the list is the ability to service large numbers of user requests, typically read-only queries. For instance, on a site like the giant online auction house eBay.com, the majority of users simply browse and look through posted items (read-only operations). Only a fraction of these users actually places bids or reserve the items (read-write operations). And remember, we are talking about millions, sometimes billions, of page views per day. Such a site's owners are more interested in quick response time to ensure faster page loading for the site's users, rather than the traditional priorities of enforcing business rules or ensuring a balance between reads and writes. You can extrapolate this for other large well-known sites on the Web -- Google, Facebook, Amazon, Twitter and so on. Relational-model databases can be tweaked and set up to run large-scale read-only operations (through data warehousing and data marts), and thus potentially still serve such customers. However, the real challenge is the relational model's lack of scalability, as we noted earlier. This is where non-relational models can really shine. They can easily distribute their data loads across dozens, hundreds and in extreme cases (think Google search) even thousands of servers. With each server handling only a small percentage of the total requests from users, response time is very good for each individual user. Although this distributed computing model also exists for relational databases, it is a real pain to implement. This is because the relational model insists on data integrity at all levels, and this must be maintained, even as the data is accessed and modified by several different servers. This is the reason for the non-relational model as the architecture of choice for Web 2.0 applications such as cloud-computing and social networking. Some examples of non-relational databases that power well-known sites are Amazon's SimpleDB and Google Search's BigTable. Other non-relational engines, either open-source or commercially available, are CouchDB, Mongo, Drizzle and the unusually named Project Voldemort. **Other Important Database Concepts** We have discussed the fundamental database concepts, but these are not the only ones. There are also other important secondary concepts and data structures worth learning about. In this section, we'll take a brief look at these. **Indexes** An index in an RDBMS is a data structure that works closely with tables and columns to speed up data retrieval operations. It works a lot like the index at the beginning of a book. In other words, it provides a reference point that allows you to quickly find and access the data you want without having to traverse the entire book (database). **Schema** A schema is the structure behind data organization. It is a visual overview of how different tables are related to each other. This serves to map out and implement the underlying business rules for which the database is created. The Oracle DB has a somewhat different definition of schema. Here, schema refers to a user's collection of database objects. The schema name and username are the same but function quite distinctly; i.e., a user may be deleted while his collection of objects (schema) within the database remains intact, and can even be reassigned to another user. See a visual example of a simple database schema below: ![](media/image49.png) **Normalization** Normalization is the process of (re)organizing data in a database so that it meets two basic requirements: there is no data redundancy (all data is stored in only one place), and data dependencies are logical (all related data items are stored together). For instance, for a bank's database all customer static data, such as name, address and age, should be stored together. All account information, such as account holder, account type, account branch and so on, should also be stored together; it should also be stored separately from the customer static data. Normalization is important for many reasons, but chiefly because it enables databases to take up as little disk space as possible, resulting in increased performance. There are several incremental types of normalization, and they can get somewhat complex. **Constraints** In the RDBMS world, constraint refers to the exact same thing as in the real world. A constraint is a restriction on the type of data you can input into a certain column. Constraints are always defined on columns. A common constraint is the not-null constraint. It simply specifies that all rows in a table must have a value in the column defined as not null. **Transactions (Commits and Rollbacks)** Think of a bank building out its database systems. Imagine if it crashed right as a wire transfer was in progress. Big problems, right? This is the basic idea behind a transaction: All items in a series of changes need to be made together. In the case of a simple transfer, if you debit one account, you need to credit anther account. In relational databases, saving a transaction is known as a commit, and undoing any unsaved changes is known as a rollback. That is the basic definition, but it gets more complicated when you consider that databases typically have to serve several users simultaneously. Before the transactions is saved, what happens when other users query the same data? At what point do the other users see the saved data? All RDBMSs must be capable of satisfactorily answering these questions, and they do this through the commit/rollback features. Databases must also provide fault tolerance, even in case of disk failure. When data is committed, there must be a good-as-gold guarantee that the data is actually saved. Relational databases have ingenious ways of achieving this, such as two-phase commits and use of log files. **ACID** The term ACID here does not refer to psychedelic trip-inducing substances that help DBAs work better. Rather, it is an acronym describing four highly desirable properties of any RDBMS: - - - - **[LESSON PROPER FOR WEEK 2]** **Database Management System or DBMS** in short refers to the technology of storing and retrieving user's data with utmost efficiency along with appropriate security measures. This tutorial explains the basics of DBMS such as its architecture, data models, data schemas, data independence, E-R model, relation model, relational database design, and storage and file structure and much more. **[Why to Learn DBMS?]** Traditionally, data was organized in file formats. DBMS was a new concept then, and all the research was done to make it overcome the deficiencies in traditional style of data management. A modern DBMS has the following characteristics; **Real-world entity** − A modern DBMS is more realistic and uses real-world entities to design its architecture. It uses the behavior and attributes too. For example, a school database may use students as an entity and their age as an attribute. **Relation-based tables** − DBMS allows entities and relations among them to form tables. A user can understand the architecture of a database just by looking at the table names. **Isolation of data and application** − A database system is entirely different than its data. A database is an active entity, whereas data is said to be passive, on which the database works and organizes. DBMS also stores metadata, which is data about data, to ease its own process. **Less redundancy** − DBMS follows the rules of normalization, which splits a relation when any of its attributes is having redundancy in values. Normalization is a mathematically rich and scientific process that reduces data redundancy. **Consistency** − Consistency is a state where every relation in a database remains consistent. There exist methods and techniques, which can detect attempt of leaving database in inconsistent state. A DBMS can provide greater consistency as compared to earlier forms of data storing applications like file-processing systems. **Query Language** − DBMS is equipped with query language, which makes it more efficient to retrieve and manipulate data. A user can apply as many and as different filtering options as required to retrieve a set of data. Traditionally it was not possible where file-processing system was used. **[Applications of DBMS]** Database is a collection of related data and data is a collection of facts and figures that can be processed to produce information. Mostly data represents recordable facts. Data aids in producing information, which is based on facts. For example, if we have data about marks obtained by all students, we can then conclude about toppers and average marks. A database management system stores data in such a way that it becomes easier to retrieve, manipulate, and produce information. Following are the important characteristics and applications of DBMS. **ACID Properties** − DBMS follows the concepts of Atomicity, Consistency, Isolation, and Durability (normally shortened as ACID). These concepts are applied on transactions, which manipulate data in a database. ACID properties help the database stay healthy in multi-transactional environments and in case of failure. **Multiuser and Concurrent Access** − DBMS supports multi-user environment and allows them to access and manipulate data in parallel. Though there are restrictions on transactions when users attempt to handle the same data item, but users are always unaware of them. **Multiple views** − DBMS offers multiple views for different users. A user who is in the Sales department will have a different view of database than a person working in the Production department. This feature enables the users to have a concentrate view of the database according to their requirements. **Security** − Features like multiple views offer security to some extent where users are unable to access data of other users and departments. DBMS offers methods to impose constraints while entering data into the database and retrieving the same at a later stage. DBMS offers many different levels of security features, which enables multiple users to have different views with different features. For example, a user in the Sales department cannot see the data that belongs to the Purchase department. Additionally, it can also be managed how much data of the Sales department should be displayed to the user. Since a DBMS is not saved on the disk as traditional file systems, it is very hard for miscreants to break the code. **[Users]** A typical DBMS has users with different rights and permissions who use it for different purposes. Some users retrieve data and some back it up. The users of a DBMS can be broadly categorized as follows − - - - **DBMS Architecture** The design of a DBMS depends on its architecture. It can be centralized or decentralized or hierarchical. The architecture of a DBMS can be seen as either single tier or multi-tier. An n-tier architecture divides the whole system into related but independent **n** modules, which can be independently modified, altered, changed, or replaced. In 1-tier architecture, the DBMS is the only entity where the user directly sits on the DBMS and uses it. Any changes done here will directly be done on the DBMS itself. It does not provide handy tools for end-users. Database designers and programmers normally prefer to use single-tier architecture. If the architecture of DBMS is 2-tier, then it must have an application through which the DBMS can be accessed. Programmers use 2-tier architecture where they access the DBMS by means of an application. Here the application tier is entirely independent of the database in terms of operation, design, and programming. **3-tier Architecture** A 3-tier architecture separates its tiers from each other based on the complexity of the users and how they use the data present in the database. It is the most widely used architecture to design a DBMS. ![](media/image27.png) - - - **Multiple-tier database architecture** is highly modifiable, as almost all its components are independent and can be changed independently. **[Data Models]** Data models define how the logical structure of a database is modeled. Data Models are fundamental entities to introduce abstraction in a DBMS. Data models define how data is connected to each other and how they are processed and stored inside the system. The very first data model could be flat data-models, where all the data used are to be kept in the same plane. Earlier data models were not so scientific; hence they were prone to introduce lots of duplication and update anomalies. **Entity-Relationship Model** Entity-Relationship (ER) Model is based on the notion of real-world entities and relationships among them. While formulating real-world scenario into the database model, the ER Model creates entity set, relationship set, general attributes and constraints. ER Model is best used for the conceptual design of a database. ER Model is based on; - - These concepts are explained below. - - **Mapping cardinalities −** - - - - **Relational Model** The most popular data model in DBMS is the Relational Model. It is more scientific a model than others. This model is based on first-order predicate logic and defines a table as an **n-ary relation**. ![](media/image30.png) The main highlights of this model are; - - - - - **[Database Schema]** A database schema is the skeleton structure that represents the logical view of the entire database. It defines how the data is organized and how the relations among them are associated. It formulates all the constraints that are to be applied on the data. A database schema defines its entities and the relationship among them. It contains a descriptive detail of the database, which can be depicted by means of schema diagrams. It's the database designers who design the schema to help programmers understand the database and make it useful. A database schema can be divided broadly into two categories; - - **[Database Instance]** It is important that we distinguish these two terms individually. Database schema is the skeleton of database. It is designed when the database doesn\'t exist at all. Once the database is operational, it is very difficult to make any changes to it. A database schema does not contain any data or information. A database instance is a state of operational database with data at any given time. It contains a snapshot of the database. Database instances tend to change with time. A DBMS ensures that its every instance (state) is in a valid state, by diligently following all the validations, constraints, and conditions that the database designers have imposed. **[Data Independence]** A database system normally contains a lot of data in addition to users' data. For example, it stores data about data, known as metadata, to locate and retrieve data easily. It is rather difficult to modify or update a set of metadata once it is stored in the database. But as a DBMS expands, it needs to change over time to satisfy the requirements of the users. If the entire data is dependent, it would become a tedious and highly complex job. ![](media/image42.png) **Metadata** itself follows a layered architecture, so that when we change data at one layer, it does not affect the data at another level. This data is independent but mapped to each other. **[Logical Data Independence]** **Logical data** is data about database, that is, it stores information about how data is managed inside. For example, a table (relation) stored in the database and all its constraints, applied on that relation. **Logical data independence** is a kind of mechanism, which liberalizes itself from actual data stored on the disk. If we do some changes on table format, it should not change the data residing on the disk. **[Physical Data Independence]** All the schemas are logical, and the actual data is stored in bit format on the disk**. Physical data independence** is the power to change the physical data without impacting the schema or logical data. For example, in case we want to change or upgrade the storage system itself − suppose we want to replace hard-disks with SSD − it should not have any impact on the logical data or schemas. **[Commercial RDBMS Systems]** We have now covered most of the basics of relational databases. They are the most common database type out there, in addition to being one of the most important types of software, right up there with operating systems, office productivity and games. So, it will come as no surprise to hear that the large, successful RDBMS vendors are some of the titans of the software industry, and are also household names not only in the software world, but even in mainstream news and culture. Names like Oracle and Microsoft are ubiquitous and very familiar, even to non-IT types. Many RDBMS vendors also churn out both related and completely unrelated products. But a common thread for all of them is that the RDBMS is one of their most crucial product lines. They listen closely and work to gather feedback from the marketplace, which does not always happen in the software universe. However, for strategic business reasons, many of their products do not work well with competitors' offerings, or with other software. For instance, SQL Server from Microsoft is only available for the Windows operating system, which is also from Microsoft. There have been complaints that Oracle DB does not mesh as well with the Windows operating system as it does with Linux, and so on. An increasing trend in the industry is consolidation and bundling of the RDBMS with other software from the same manufacturer, such as a preferred operating system or other complementary software such as: - - - We will now take a quick look at some commercial RDB offerings and the companies behind them: **[Oracle]** Oracle is one of the behemoths of the RDBMS world. Founded by the charismatic, adventure-loving CEO Larry Ellison in 1977, today the company is a multibillion-dollar giant in the world of commercial databases thanks to its flagship product, Oracle DB. Oracle also produces a bewildering array of other products, from middleware to enterprise resource planning systems and customer relationship management (CRM) offerings. Many of these were not developed in-house, but instead came through acquisitions of other software companies such as PeopleSoft, Siebel Systems, Sun Microsystems and BEA Systems. Several of these products were integrated, with mixed success, into the Fusion middleware product. The Oracle DB is widely used in enterprise-level databases. It comes in different editions to meet different needs. Oracle DB is fully compliant with the SQL language, although it also maintains its proprietary version called SQL\*Plus. Oracle DB is the leading RDBMS, with a market share of 48.8 percent of the RDBMS market as at end of 2011. An interesting point to note about Oracle is that in 2009, it acquired Sun Microsystems, the license holder of MySQL, one of Oracle DB's key competitors. As a result, Oracle has two RDBMS offerings. However, Oracle DB and MySQL may not interfere destructively with each other, as they play in slightly different market spaces and cater to slightly different needs. **[Microsoft]** Microsoft is another big boy in the world of RDBMS software with its SQL Server product, although it is better known for its universal Windows operating system and Office suite of office-productivity programs. SQL Server was developed in conjunction with Sybase systems around 1989, but the two companies parted ways and developed separate products. Microsoft kept the SQL Server name and Sybase opted to rename its offering Adaptive Server Enterprise to avoid confusion with Microsoft's SQL Server. The RDBMS only runs on the Windows range of operating systems. SQL Server uses a proprietary query language called T-SQL, which is very similar to and compatible with the standard SQL. The RDBMS commands about 20 percent of market share as of the end of 2011, but has also been increasing its share in recent years. SQL Server and Oracle DB have a lot in common, ranging from the data structures to transaction processing methods and database objects. Like Oracle DB, SQL Server also supports advanced ETL (Extraction, Transformation, Loading) operations, which help in moving data to data warehouses. Both also offer advanced reporting functionality. **[Postgres]** Postgres, also known as PostgreSQL, is an open-source relational database that can also support database objects. It is not owned by any one person, but is maintained by the PostgreSQL Global Development Group, a dedicated group of volunteers managed and employed by companies in the open-source software-development field, such as RedHat and EnterpriseDB. Postgres is available on Linux, Windows and MacOS. **[MySQL]** MySQL is another open-source RDBMS. It is a full-featured database system sponsored by Swedish company MySQL AB, which is now owned by Oracle after its parent company Sun Microsystems was bought out by Oracle in 2010. MySQL is very popular for Web-based, back-end databases, either individually or as part of the Linux, Apache, MySQL, PHP (LAMP) stack used to deliver Web-centric applications. **[DB2]** Like SQL Server and Oracle DB, DB2 from IBM is a full-featured object RDBMS from a major player in the software industry. Originally developed in the early '80s exclusively for IBM's mainframes, it was later ported to other platforms, such as Linux, Unix, Windows (LUW) and IBM's own OS/2. It is a widely used commercial RDBMS, and also has a small free version for developers called Express-C. In 2009, IBM released version 9.7 of DB2, which closely mimics the features of Oracle DB, the market leader. This has helped it capture some sales by making it easy for Oracle-savvy database professionals to easily understand and start working on DB2. **[LESSON PROPER FOR WEEK 3]** **Entity Relational Model** The ER model defines the conceptual view of a database. It works around real-world entities and the associations among them. At view level, the ER model is considered a good option for designing databases. **[Entity]** An entity can be a real-world object, either animate or inanimate, that can be easily identifiable. For example, in a school database, students, teachers, classes, and courses offered can be considered as entities. All these entities have some attributes or properties that give them their identity. An entity set is a collection of similar types of entities. An entity set may contain entities with attribute sharing similar values. For example, a student set may contain all the students of a school; likewise, a teachers set may contain all the teachers of a school from all faculties. Entity sets need not be disjoint. **[Attributes]** Entities are represented by means of their properties, called **attributes**. All attributes have values. For example, a student entity may have name, class, and age as attributes. There exists a domain or range of values that can be assigned to attributes. For example, a student\'s name cannot be a numeric value. It has to be alphabetic. A student\'s age cannot be negative, etc. **[Types of Attributes]** - - - - - **[These attribute types can come together in a way like]**; - - - - **[Entity-Set and Keys]** Key is an attribute or collection of attributes that uniquely identifies an entity among entity set. For example, the roll number of a student makes him/her identifiable among students. - - - **[Relationship]** The association among entities is called a relationship. For example, an employee **works at** a department, a student **enrolls** in a course. Here, Works at and Enrolls are called relationships. **[Relationship Set]** A set of relationships of similar type is called a relationship set. Like entities, a relationship too can have attributes. These attributes are called **descriptive attributes**. **[Degree of Relationship]** The number of participating entities in a relationship defines the degree of the relationship. - - - **[Mapping Cardinalities]** **Cardinality** defines the number of entities in one entity set, which can be associated with the number of entities of other set via relationship set. - - - - **[Entity]** Entities are represented by means of rectangles. Rectangles are named with the entity set they represent. **[Attributes]** Attributes are the properties of entities. Attributes are represented by means of ellipses. Every ellipse represents one attribute and is directly connected to its entity (rectangle). ![](media/image67.png) If the attributes are **composite**, they are further divided in a tree like structure. Every node is then connected to its attribute. That is, composite attributes are represented by ellipses that are connected with an ellipse. **Multivalued** attributes are depicted by double ellipse. ![](media/image41.png) **Derived** attributes are depicted by dashed ellipse. **[Relationship]** Relationships are represented by diamond-shaped box. Name of the relationship is written inside the diamond-box. All the entities (rectangles) participating in a relationship, are connected to it by a line. **[Binary Relationship and Cardinality]** A relationship where two entities are participating is called a **binary relationship**. Cardinality is the number of instances of an entity from a relation that can be associated with the relation. - ![](media/image56.png) - - ![](media/image72.png) - **[Participation Constraints]** - - ![](media/image32.png) The ER Model has the power of expressing database entities in a conceptual hierarchical manner. As the hierarchy goes up, it generalizes the view of entities, and as we go deep in the hierarchy, it gives us the detail of every entity included. Going up in this structure is called **generalization**, where entities are clubbed together to represent a more generalized view. For example, a particular student named Mira can be generalized along with all the students. The entity shall be a student, and further, the student is a person. The reverse is called **specialization** where a person is a student, and that student is Mira. **[Generalization]** As mentioned above, the process of generalizing entities, where the generalized entities contain the properties of all the generalized entities, is called generalization. In generalization, a number of entities are brought together into one generalized entity based on their similar characteristics. For example, pigeon, house sparrow, crow and dove can all be generalized as Birds. **[Specialization]** Specialization is the opposite of generalization. In specialization, a group of entities is divided into sub-groups based on their characteristics. Take a group 'Person' for example. A person has name, date of birth, gender, etc. These properties are common in all persons, human beings. But in a company, persons can be identified as employee, employer, customer, or vendor, based on what role they play in the company. ![](media/image60.png) Similarly, in a school database, persons can be specialized as teacher, student, or a staff, based on what role they play in school as entities. **[Inheritance]** We use all the above features of ER-Model in order to create classes of objects in object-oriented programming. The details of entities are generally hidden from the user; this process known as **abstraction**. Inheritance is an important feature of Generalization and Specialization. It allows lower-level entities to inherit the attributes of higher-level entities. For example, the attributes of a Person class such as name, age, and gender can be inherited by lower-level entities such as Student or Teacher. **[Normalization]** If a database design is not perfect, it may contain anomalies, which are like a bad dream for any database administrator. Managing a database with anomalies is next to impossible. - - - Normalization is a method to remove all these anomalies and bring the database to a consistent state. **[First Normal Form]** First Normal Form is defined in the definition of relations (tables) itself. This rule defines that all the attributes in a relation must have atomic domains. The values in an atomic domain are indivisible units. ![](media/image1.png) We re-arrange the relation (table) as below, to convert it to First Normal Form. Each attribute must contain only a single value from its pre-defined domain. **[Second Normal Form]** Before we learn about the second normal form, we need to understand the following − - - If we follow second normal form, then every non-prime attribute should be fully functionally dependent on prime key attribute. That is, if X → A holds, then there should not be any proper subset Y of X, for which Y → A also holds true. ![](media/image71.png) We see here in Student\_Project relation that the prime key attributes are Stu\_ID and Proj\_ID. According to the rule, non-key attributes, i.e. Stu\_Name and Proj\_Name must be dependent upon both and not on any of the prime key attribute individually. But we find that Stu\_Name can be identified by Stu\_ID and Proj\_Name can be identified by Proj\_ID independently. This is called **partial dependency**, which is not allowed in Second Normal Form. We broke the relation in two as depicted in the above picture. So there exists no partial dependency. **[Third Normal Form]** For a relation to be in Third Normal Form, it must be in Second Normal form and the following must satisfy − - - - - We find that in the above Student\_detail relation, Stu\_ID is the key and only prime key attribute. We find that City can be identified by Stu\_ID as well as Zip itself. Neither Zip is a superkey nor is City a prime attribute. Additionally, Stu\_ID → Zip → City, so there exists **transitive dependency**. To bring this relation into third normal form, we break the relation into two relations as follows − ![](media/image16.png) **[Boyce-Codd Normal Form]** Boyce-Codd Normal Form (BCNF) is an extension of Third Normal Form on strict terms. BCNF states that − - In the above image, Stu\_ID is the super-key in the relation Student\_Detail and Zip is the super-key in the relation ZipCodes. So, Stu\_ID → Stu\_Name, Zip and Zip → City Which confirms that both the relations are in BCNF. [ ] [ ] [ ] **[Functional Dependency]** Functional dependency (FD) is a set of constraints between two attributes in a relation. Functional dependency says that if two tuples have same values for attributes A1, A2,\..., An, then those two tuples must have to have same values for attributes B1, B2, \..., Bn. Functional dependency is represented by an arrow sign (→) that is, X→Y, where X functionally determines Y. The left-hand side attributes determine the values of attributes on the right-hand side. **[Armstrong\'s Axioms]** If F is a set of functional dependencies then the closure of F, denoted as F^+^, is the set of all functional dependencies logically implied by F. Armstrong\'s Axioms are a set of rules, that when applied repeatedly, generates a closure of functional dependencies. - - - **[Trivial Functional Dependency]** - - - **[LESSON PROPER FOR WEEK 4]** [[IM101 Week 4.pdf]](https://drive.google.com/file/d/1bSb6eslx9jImMERtZSBZsRFnWq22RIAU/view?usp=drive_link) Page 1 of 15 Advance Database System Structured Query Language Basics I. LESSON PROPER What is SQL? SQL stands for Structured Query Language SQL lets you access and manipulate databases SQL became a standard of the American National Standards Institute (ANSI) in 1986, and of the International Organization for Standardization (ISO) in 1987 What Can SQL do? SQL can execute queries against a database SQL can retrieve data from a database SQL can insert records in a database SQL can update records in a database SQL can delete records from a database SQL can create new databases SQL can create new tables in a database SQL can create stored procedures in a database SQL can create views in a database SQL can set permissions on tables, procedures, and views Using SQL in Your Web Site To build a web site that shows data from a database, you will need: An RDBMS database program (i.e. MS Access, SQL Server, MySQL) To use a server-side scripting language, like PHP or ASP To use SQL to get the data you want To use HTML / CSS to style the page RDBMS RDBMS stands for Relational Database Management System. RDBMS is the basis for SQL, and for all modern database systems such as MS SQL Server, IBM DB2, Oracle, MySQL, and Microsoft Access. The data in RDBMS is stored in database objects called tables. A table is a collection of related data entries and it consists of columns and rows. Page 2 of 15 Advance Database System Structured Query Language Basics SQL Components: Data Definition Language (DDL): Deals with the structure of the database (e.g., creating and altering tables). Data Manipulation Language (DML): Deals with the data within the tables (e.g., querying and updating data). Data Control Language (DCL): Deals with permissions and access control (e.g., granting or revoking permissions). SQL Syntax 1\. Basic SQL Statements Most of the actions you need to perform on a database are done with SQL statements. SQL statements consist of keywords that are easy to understand. The following SQL statement returns all records from a table named \"Employees\": Keywords: Reserved words (e.g., SELECT, FROM, WHERE). Identifiers: Names of tables, columns, etc. Operators: Symbols used in expressions (e.g., =, \>, \ - **Common Table Expressions (CTEs):** **What is a CTE?** A Common Table Expression (CTE) is a temporary result set that you can reference within a SELECT, INSERT, UPDATE, or DELETE statement. It helps to simplify complex queries, enhance readability, and improve maintainability. Syntax: **Key Points:** - - **Basic CTE Example** **Scenario:** Retrieve a list of employees and their departments from the employees and departments tables. - - - - ![](media/image5.png) **Using CTEs to Simplify Complex Queries** **Multiple CTEs** CTEs can be chained together, allowing you to build upon previous CTEs. - **CTEs with Aggregation** You can use CTEs to perform aggregations and then use the results in another query. - ![](media/image52.png) **CTEs in Data Manipulation** CTEs can also be used in INSERT, UPDATE, and DELETE statements to organize and structure complex operations. - **Recursive CTEs** **1. Introduction to Recursive CTEs** Recursive CTEs are used to handle hierarchical or tree-structured data. They consist of two parts: - - Syntax: ![](media/image44.png) **Recursive CTE Example** **Scenario:** Retrieve all employees and their subordinates in a hierarchy. - - - **Advanced Queries and Joins:** **1. Inner Join** An Inner Join returns only the rows where there is a match between the columns being joined in both tables. If a row in one table does not have a corresponding row in the other table, it is excluded from the result set. Syntax: ![](media/image6.png) Example: Suppose we have two tables: employees and departments. This query returns a list of employees along with their department names, but only for employees who are associated with a department. **2. Left Join (Left Outer Join)** A Left Join (or Left Outer Join) returns all rows from the left table and the matched rows from the right table. If there is no match, NULL values are returned for columns from the right table. Syntax: ![](media/image23.png) Example: Suppose we have two tables: employees and departments. This query returns all employees and their department names. Employees who are not assigned to any department will have NULL values in the department\_name column. **3. Right Join (Right Outer Join)** A Right Join (or Right Outer Join) returns all rows from the right table and the matched rows from the left table. If there is no match, NULL values are returned for columns from the left table. Syntax: ![](media/image45.png) Example: This query returns all departments and their employees. Departments without any employees will have NULL values in the employee.name column. **4. Full Join (Full Outer Join)** A Full Join (or Full Outer Join) returns all rows when there is a match in either the left or right table. Rows that do not have matches in the other table will contain NULLs for columns from the non-matching table. Syntax: ![](media/image66.png) Example: This query returns all employees and all departments. Employees not linked to any department and departments without employees will have NULLs in the corresponding columns. **5. Cross Join** A Cross Join returns the Cartesian product of both tables. Each row from the first table is paired with every row from the second table. Syntax: ![](media/image59.png) Example: This query will produce a result set where every employee is paired with every department, potentially resulting in a large number of rows. **6. Self Join** A Self Join is a join where a table is joined with itself. This is useful for querying hierarchical data or comparing rows within the same table. Syntax: ![](media/image69.png) Example: Assume we have an employees table where each employee has a manager\_id pointing to another employee\'s id. This query lists each employee alongside their manager's name. **7. Join with Aggregation** This involves joining tables and then using aggregate functions (like COUNT, SUM, AVG) to summarize data. Syntax: ![](media/image10.png) Example: To count the number of employees in each department: This query counts how many employees are in each department and groups the results by department name. **Summary of Joins** - - - - - - - **Subqueries and Nested Queries** **1. Single-Row Subquery** A Single-Row Subquery returns at most one row of results. It is typically used in the WHERE clause to compare a column to a single value. Syntax: ![](media/image17.png) Example: Find employees who work in the department with the highest salary: In this example, the subquery retrieves the department ID with the highest salary, and the outer query fetches the names of employees in that department. **2. Multi-Row Subquery** A Multi-Row Subquery returns multiple rows of results. It is used with operators such as IN, ANY, or ALL to compare a column against a set of values. Syntax: ![](media/image46.png) Example: Find employees who work in departments that have more than 10 employees: Here, the subquery identifies departments with more than 10 employees, and the outer query retrieves employees from those departments. **3. Correlated Subquery** A Correlated Subquery references columns from the outer query. It executes once for each row processed by the outer query. Syntax: ![](media/image63.png) Example: Find employees who earn more than the average salary in their respective departments: In this case, the subquery uses the department\_id from the outer query to compare each employee's salary with the average salary of their department. **4. Subquery in SELECT Clause** A subquery in the SELECT clause is used to calculate a value for each row returned by the outer query. This is useful for including computed values based on related tables. Syntax: ![](media/image37.png) Example: Get employees along with the total number of employees in their department: Here, the subquery calculates the total number of employees in the same department for each employee listed by the outer query. **5. Nested Queries** Nested Queries involve placing one query inside another. They can be combinations of different types of subqueries used within the WHERE, SELECT, or FROM clauses. Syntax: ![](media/image38.png) Example: Find employees who work in the department that has the maximum average salary: Here, nested subqueries are used to find the department with the highest average salary, and then to find employees in that department. **6. Subquery with EXISTS** The EXISTS operator is used to check if the subquery returns any rows. It returns TRUE if the subquery returns one or more rows and FALSE otherwise. This is useful for checking the existence of rows that meet certain conditions. Syntax: ![](media/image35.png) Example: Find departments that have at least one employee: In this case, the subquery checks if there are any employees in each department, and the outer query returns departments where the subquery returns at least one row. **Summary** - - - - - - **Indexing** Indexing is a database optimization technique that improves the speed of data retrieval operations. Indexes work like a book's index, allowing the database to quickly locate and access the required data without scanning the entire table. **Types of Indexes:** **1. Single-Column Index** An index created on a single column of a table. It speeds up query performance by allowing rapid access to rows based on that column's values. Example: ![](media/image8.png) This index improves the performance of queries that filter or sort by the name column of the employees table. **2. Composite Index** An index that involves multiple columns. It is useful when queries filter or sort based on more than one column, allowing for efficient access to rows based on the combination of columns. Example: This index improves performance for queries that filter by both department\_id and salary. 3\. **Unique Index** An index that ensures all values in the indexed column(s) are unique. It helps enforce data integrity by preventing duplicate values. Example: ![](media/image24.png) This index ensures that all email addresses in the employees table are unique. **Query Optimization** Query optimization involves refining SQL queries to improve their execution speed and efficiency. It is essential for managing large datasets and complex queries. **1. EXPLAIN Command** The EXPLAIN command provides insight into how the database engine executes a query. It shows the execution plan, including which indexes are used and how tables are joined. Example: The output includes details about the index usage and join operations, helping you understand how the query is processed. **2. Optimize Queries** - Selecting all columns with SELECT \* retrieves unnecessary data, increasing I/O and processing time. Always specify the columns you need. Example: ![](media/image70.png) This query retrieves only the name and salary columns, avoiding the overhead of fetching all columns. - Ensure that indexes are used effectively by aligning them with the queries you run frequently. Indexes should match columns used in WHERE, JOIN, ORDER BY, and GROUP BY clauses.= Example: For a query that filters by department\_id and salary, ensure there's a composite index on these columns: - \`Use LIMIT or TOP to restrict the number of rows returned, especially for queries that involve sorting or complex joins. Example: ![](media/image39.png) This query retrieves only the top 10 highest-paid employees in department 10, reducing processing time. - Regularly review and refactor queries to improve performance. Use query analysis tools and the EXPLAIN command to identify bottlenecks and optimize your SQL code. **Summary** - - -

Use Quizgecko on...
Browser
Browser