unit V.pdf

Indexing Data is stored in the form of records and every record has a key field, which helps it to be recognize uniquely. Indexing is a data structure technique to efficiently retrieve records from the database on some attributes on which the indexing has been done. indexing in database is similar what we see in books, Indexing in DBMS o Indexing is used to optimize the performance of a database by minimizing the number of disk accesses required when a query is processed. o The index is a type of data structure. It is used to locate and access the data in a database table quickly. o It is defined based on the indexing attribute. Index structure, Indexes can be created using some database columns. o The first column of the database is the search key that contains a copy of the primary key or candidate key of the table. The values of the primary key are stored in sorted order so that the corresponding data can be accessed easily. o The second column of the database is the data reference. It contains a set of pointers holding the address of the disk block where the value of the particular key can be found. Ordered indices The indices are usually sorted to make searching faster. The indices which are sorted are known as ordered indices. Example: Suppose we have an employee table with thousands of record and each of which is 10 bytes long. If their IDs start with 1, 2, 3... and so on and we have to search student with ID-543. o In the case of a database with no index, we have to search the disk block from starting till it reaches 543. The DBMS will read the record after reading 543*10=5430 bytes. o In the case of an index, we will search using indexes and the DBMS will read the record after reading 542*2= 1084 bytes which are very less compared to the previous case. Indexing Methods: Primary Index o If the index is created on the basis of the primary key of the table, then it is known as primary indexing. These primary keys are unique to each record and contain 1:1 relation between the records. o As primary keys are stored in sorted order, the performance of the searching operation is quite efficient. o The primary index can be classified into two types: Dense index and Sparse index. Dense index o The dense index contains an index record for every search key value in the data file. It makes searching faster. o In this, the number of records in the index table is same as the number of records in the main table. o It needs more space to store index record itself. The index records have the search key and a pointer to the actual record on the disk. Sparse index o In the data file, index record appears only for a few items. Each item points to a block. o In this, instead of pointing to each record in the main table, the index points to the records in the main table in a gap. Clustering Index o A clustered index can be defined as an ordered data file. Sometimes the index is created on non-primary key columns which may not be unique for each record. o In this case, to identify the record faster, we will group two or more columns to get the unique value and create index out of them. This method is called a clustering index. o The records which have similar characteristics are grouped, and indexes are created for these group. Example: suppose a company contains several employees in each department. Suppose we use a clustering index, where all employees which belong to the same Dept_ID are considered within a single cluster, and index pointers point to the cluster as a whole. Here Dept_Id is a non-unique key. The previous schema is little confusing because one disk block is shared by records which belong to the different cluster. If we use separate disk block for separate clusters, then it is called better technique. Secondary Index In the sparse indexing, as the size of the table grows, the size of mapping also grows. These mappings are usually kept in the primary memory so that address fetch should be faster. Then the secondary memory searches the actual data based on the address got from mapping. If the mapping size grows then fetching the address itself becomes slower. In this case, the sparse index will not be efficient. To overcome this problem, secondary indexing is introduced. In secondary indexing, to reduce the size of mapping, another level of indexing is introduced. In this method, the huge range for the columns is selected initially so that the mapping size of the first level becomes small. Then each range is further divided into smaller ranges. The mapping of the first level is stored in the primary memory, so that address fetch is faster. The mapping of the second level and actual data are stored in the secondary memory (hard disk). For example: o If you want to find the record of roll 111 in the diagram, then it will search the highest entry which is smaller than or equal to 111 in the first level index. It will get 100 at this level. o Then in the second index level, again it does max (111) 10000; Thus, to make the system understand the user query, it needs to be translated in the form of relational algebra. We can bring this query in the relational algebra form as: o σsalary>10000 (πsalary (Employee)) o πsalary (σsalary>10000 (Employee)) After translating the given query, we can execute each relational algebra operation by using different algorithms. So, in this way, a query processing begins its working. Evaluation For this, with addition to the relational algebra translation, it is required to annotate the translated relational algebra expression with the instructions used for specifying and evaluating each operation. Thus, after translating the user query, the system executes a query evaluation plan. Query Evaluation Plan o In order to fully evaluate a query, the system needs to construct a query evaluation plan. o The annotations in the evaluation plan may refer to the algorithms to be used for the particular index or the specific operations. o Such relational algebra with annotations is referred to as Evaluation Primitives. The evaluation primitives carry the instructions needed for the evaluation of the operation. o Thus, a query evaluation plan defines a sequence of primitive operations used for evaluating a query. The query evaluation plan is also referred to as the query execution plan. o A query execution engine is responsible for generating the output of the given query. It takes the query execution plan, executes it, and finally makes the output for the user query. Optimization o The cost of the query evaluation can vary for different types of queries. Although the system is responsible for constructing the evaluation plan, the user does need not to write their query efficiently. o Usually, a database system generates an efficient query evaluation plan, which minimizes its cost. This type of task performed by the database system and is known as Query Optimization. o For optimizing a query, the query optimizer should have an estimated cost analysis of each operation. It is because the overall operation cost depends on the memory allocations to several operations, execution costs, and so on. Finally, after selecting an evaluation plan, the system evaluates the query and produces the output of the query. Data Mining o The process of extracting information to identify patterns, trends, and useful data that would allow the business to take the data-driven decision from huge sets of data is called Data Mining. o It is basically the process carried out for the extraction of useful information from a bulk of data or data warehouses. o In other words, we can say that Data Mining is the process of investigating hidden patterns of information to various perspectives for categorization into useful data, which is collected and assembled in particular areas such as data warehouses, efficient analysis, data mining algorithm, helping decision making and other data requirement to eventually cost-cutting and generating revenue. o Data mining is the act of automatically searching for large stores of information to find trends and patterns that go beyond simple analysis procedures. Data mining utilizes complex mathematical algorithms for data segments and evaluates the probability of future events. Data Mining is also called Knowledge Discovery of Data (KDD). o Data Mining is a process used by organizations to extract specific data from huge databases to solve business problems. It primarily turns raw data into useful information. o Data Mining can be applied to any type of data e.g. Data Warehouses, Transactional Databases, Relational Databases, Multimedia Databases, Spatial Databases, Time-series Databases, World Wide Web. Data Mining as a whole process The whole process of Data Mining consists of three main phases: Data Pre-processing – Data cleaning, integration, selection, and transformation takes place Data Extraction – Occurrence of exact data mining Data Evaluation and Presentation – Analyzing and presenting results Applications of Data Mining o Financial Analysis o Biological Analysis o Scientific Analysis o Intrusion Detection o Fraud Detection o Research Analysis

Document Details

Tags

Related

Full Transcript

Upgrade to continue