Unit I: Data Warehousing & Data Mining
Sri Vidya College of Engineering & Technology
Summary
This document provides an introduction to data warehousing and data mining. It discusses the overall architecture, components, and key characteristics of a data warehouse, differentiating it from online transaction processing (OLTP) systems.
INTRODUCTION

A data warehouse is a database designed to enable business intelligence activities: it exists to help users understand and enhance their organization's performance. It is designed for query and analysis rather than for transaction processing, and usually contains historical data derived from transaction data, though it can include data from other sources. Data warehouses separate the analysis workload from the transaction workload and enable an organization to consolidate data from several sources. This helps in:

Maintaining historical records
Analyzing the data to gain a better understanding of the business and to improve the business

In addition to a relational database, a data warehouse environment can include an extraction, transportation, transformation, and loading (ETL) solution, statistical analysis, reporting, data mining capabilities, client analysis tools, and other applications that manage the process of gathering data, transforming it into useful, actionable information, and delivering it to business users.

Data warehouses are distinct from online transaction processing (OLTP) systems. With a data warehouse you separate the analysis workload from the transaction workload. Data warehouses are therefore very much read-oriented systems: they have a far higher proportion of data reads than writes and updates. This enables far better analytical performance and avoids impacting your transaction systems. A data warehouse system can be optimized to consolidate data from many sources to achieve a key goal: it becomes your organization's "single source of truth". There is great value in having a consistent source of data that all users can look to; it prevents many disputes and enhances decision-making efficiency.

A common way of introducing data warehousing is to refer to the characteristics of a data warehouse as set forth by William Inmon:

Subject Oriented
Integrated
Nonvolatile
Time Variant

Subject Oriented
Data warehouses are designed to help you analyze data. For example, to learn more about your company's sales data, you can build a data warehouse that concentrates on sales. Using this data warehouse, you can answer questions such as "Who was our best customer for this item last year?" or "Who is likely to be our best customer next year?" This ability to define a data warehouse by subject matter, sales in this case, makes the data warehouse subject oriented.

Integrated
Integration is closely related to subject orientation. Data warehouses must put data from disparate sources into a consistent format. They must resolve such problems as naming conflicts and inconsistencies among units of measure. When they achieve this, they are said to be integrated.

Nonvolatile
Nonvolatile means that, once entered into the data warehouse, data should not change. This is logical because the purpose of a data warehouse is to enable you to analyze what has occurred.

Time Variant
A data warehouse's focus on change over time is what is meant by the term time variant. In order to discover trends and identify hidden patterns and relationships in the business, analysts need large amounts of data. This is very much in contrast to online transaction processing (OLTP) systems, where performance requirements demand that historical data be moved to an archive.
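As an aside, the "best customer for this item last year" question above maps directly onto a read-oriented, subject-oriented query. The following is only a minimal sketch, with a hypothetical single-table sales subject area (table and column names are invented for illustration):

```python
import sqlite3

# Hypothetical, simplified sales subject area: one fact table.
# In a real warehouse this would be a large historical table fed by ETL.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE sales (
    customer_id TEXT, item_id TEXT, sale_date TEXT, amount REAL)""")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [("C1", "I9", "2023-03-01", 120.0),
     ("C2", "I9", "2023-07-15", 340.0),
     ("C2", "I9", "2023-11-02", 80.0)])

# "Who was our best customer for this item last year?" is a purely
# read-oriented, subject-oriented query typical of warehouse workloads.
best = conn.execute("""
    SELECT customer_id, SUM(amount) AS total
    FROM sales
    WHERE item_id = 'I9' AND sale_date BETWEEN '2023-01-01' AND '2023-12-31'
    GROUP BY customer_id
    ORDER BY total DESC
    LIMIT 1""").fetchone()
print(best)   # ('C2', 420.0)
```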
Key Characteristics of a Data Warehouse
The key characteristics of a data warehouse are as follows:

Data is structured for simplicity of access and high-speed query performance.
End users are time-sensitive and desire speed-of-thought response times.
Large amounts of historical data are used.
Queries often retrieve large amounts of data, perhaps many thousands of rows.
Both predefined and ad hoc queries are common.
The data load involves multiple sources and transformations.

In general, fast query performance with high data throughput is the key to a successful data warehouse.

DATA WAREHOUSING COMPONENTS

Overall Architecture
The data warehouse architecture is based on a relational database management system server that functions as the central repository for informational data. Operational data and processing are completely separated from data warehouse processing. This central information repository is surrounded by a number of key components designed to make the entire environment functional, manageable and accessible both by the operational systems that source data into the warehouse and by end-user query and analysis tools.

Typically, the source data for the warehouse comes from the operational applications. As the data enters the warehouse, it is cleaned up and transformed into an integrated structure and format. The transformation process may involve conversion, summarization, filtering and condensation of data. Because the data contains a historical component, the warehouse must be capable of holding and managing large volumes of data as well as different data structures for the same database over time. The next sections look at the seven major components of data warehousing.

Data Warehouse Database
The central data warehouse database is the cornerstone of the data warehousing environment. This database is almost always implemented on relational database management system (RDBMS) technology. However, this kind of implementation is often constrained by the fact that traditional RDBMS products are optimized for transactional database processing. Certain data warehouse attributes, such as very large database size, ad hoc query processing and the need for flexible user view creation including aggregates, multi-table joins and drill-downs, have become drivers for different technological approaches to the data warehouse database. These approaches include:

Parallel relational database designs for scalability that include shared-memory, shared-disk, or shared-nothing models implemented on various multiprocessor configurations (symmetric multiprocessors or SMP, massively parallel processors or MPP, and/or clusters of uni- or multiprocessors).
An innovative approach to speed up a traditional RDBMS by using new index structures to bypass relational table scans.
Multidimensional databases (MDDBs) that are based on proprietary database technology; conversely, a dimensional data model can be implemented using a familiar RDBMS. Multidimensional databases are designed to overcome any limitations placed on the warehouse by the nature of the relational data model. MDDBs enable online analytical processing (OLAP) tools that architecturally belong to a group of data warehousing components jointly categorized as the data query, reporting, analysis and mining tools.
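To make the dimensional-model-on-an-RDBMS point concrete, here is a minimal sketch, with hypothetical table and column names, of a star schema implemented on an ordinary relational engine using Python's built-in sqlite3 module:

```python
import sqlite3

# Hypothetical star schema: one fact table surrounded by dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT, region TEXT);
CREATE TABLE dim_product  (product_key  INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_time     (time_key     INTEGER PRIMARY KEY, day TEXT, month TEXT, year INTEGER);

-- Fact table: foreign keys into each dimension plus additive measures.
CREATE TABLE sales_fact (
    customer_key INTEGER REFERENCES dim_customer(customer_key),
    product_key  INTEGER REFERENCES dim_product(product_key),
    time_key     INTEGER REFERENCES dim_time(time_key),
    quantity     INTEGER,
    amount       REAL
);
""")
print(conn.execute("SELECT name FROM sqlite_master WHERE type='table'").fetchall())
```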
Sourcing, Acquisition, Cleanup and Transformation Tools
A significant portion of the implementation effort is spent extracting data from operational systems and putting it in a format suitable for informational applications that run off the data warehouse. The data sourcing, cleanup, transformation and migration tools perform all of the conversions, summarizations, key changes, structural changes and condensations needed to transform disparate data into information that can be used by the decision support tool. They produce the programs and control statements, including the COBOL programs, MVS job-control language (JCL), UNIX scripts, and SQL data definition language (DDL), needed to move data into the data warehouse from multiple operational systems. These tools also maintain the meta data. The functionality includes:

Removing unwanted data from operational databases
Converting to common data names and definitions
Establishing defaults for missing data
Accommodating source data definition changes

The data sourcing, cleanup, extract, transformation and migration tools have to deal with some significant issues, including:

Database heterogeneity. DBMSs differ widely in data models, data access languages, data navigation, operations, concurrency, integrity, recovery, etc.
Data heterogeneity. This is the difference in the way data is defined and used in different models: homonyms, synonyms, unit compatibility (U.S. vs. metric), different attributes for the same entity and different ways of modeling the same fact.

These tools can save a considerable amount of time and effort. However, significant shortcomings do exist. For example, many available tools are generally useful only for simpler data extracts. Frequently, customized extract routines need to be developed for the more complicated data extraction procedures.
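A minimal sketch of the cleanup and transformation steps listed above (common names and definitions, defaults for missing data, removing unwanted data); the field names and mapping rules are hypothetical, and real sourcing/cleanup tools generate equivalent logic rather than requiring it to be hand-written:

```python
# Minimal ETL-style cleanup sketch. Field names and mapping rules are
# hypothetical; real sourcing/cleanup tools generate equivalent logic.
RENAME = {"cust_no": "customer_id", "custid": "customer_id", "amt": "amount"}
DEFAULTS = {"region": "UNKNOWN", "currency": "USD"}

def transform(record: dict) -> dict:
    out = {}
    for field, value in record.items():
        out[RENAME.get(field, field)] = value       # common data names
    for field, default in DEFAULTS.items():
        out.setdefault(field, default)              # defaults for missing data
    out.pop("internal_flag", None)                  # drop unwanted operational data
    return out

# Two source systems naming the same attribute differently:
print(transform({"cust_no": "C1", "amt": 120.0, "internal_flag": 1}))
print(transform({"custid": "C2", "amount": 80.0, "region": "EU"}))
```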
Meta Data
Meta data is data about data that describes the data warehouse. It is used for building, maintaining, managing and using the data warehouse. Meta data can be classified into:

Technical meta data, which contains information about warehouse data for use by warehouse designers and administrators when carrying out warehouse development and management tasks.
Business meta data, which contains information that gives users an easy-to-understand perspective of the information stored in the data warehouse.

Equally important, meta data provides interactive access to users to help them understand content and find data. One of the issues with meta data is that many data extraction tools' capabilities to gather meta data remain fairly immature. Therefore, there is often a need to create a meta data interface for users, which may involve some duplication of effort. Meta data management is provided via a meta data repository and accompanying software. Meta data repository management software, which typically runs on a workstation, can be used to map the source data to the target database; generate code for data transformations; integrate and transform the data; and control moving data to the warehouse.

As users' interactions with the data warehouse increase, their approaches to reviewing the results of their requests for information can be expected to evolve from relatively simple manual analysis for trends and exceptions to agent-driven initiation of the analysis based on user-defined thresholds. The definitions of these thresholds, the configuration parameters for the software agents using them, and the information directory indicating where the appropriate sources for the information can be found are all stored in the meta data repository as well.

Access Tools
The principal purpose of data warehousing is to provide information to business users for strategic decision-making. These users interact with the data warehouse using front-end tools. Many of these tools require an information specialist, although many end users develop expertise in the tools. Tools fall into four main categories: query and reporting tools, application development tools, online analytical processing tools, and data mining tools.

Query and reporting tools can be divided into two groups: reporting tools and managed query tools. Reporting tools can be further divided into production reporting tools and report writers. Production reporting tools let companies generate regular operational reports or support high-volume batch jobs such as calculating and printing paychecks. Report writers, on the other hand, are inexpensive desktop tools designed for end users. Managed query tools shield end users from the complexities of SQL and database structures by inserting a meta layer between users and the database. These tools are designed for easy-to-use, point-and-click operations that either accept SQL or generate SQL database queries.

Often, the analytical needs of the data warehouse user community exceed the built-in capabilities of query and reporting tools. In these cases, organizations will often rely on the tried-and-true approach of in-house application development using graphical development environments such as PowerBuilder, Visual Basic and Forte. These application development platforms integrate well with popular OLAP tools and access all major database systems, including Oracle, Sybase, and Informix.

OLAP tools are based on the concepts of dimensional data models and corresponding databases, and allow users to analyze the data using elaborate, multidimensional views. Typical business applications include product performance and profitability, effectiveness of a sales program or marketing campaign, sales forecasting and capacity planning. These tools assume that the data is organized in a multidimensional model.

A critical success factor for any business today is the ability to use information effectively. Data mining is the process of discovering meaningful new correlations, patterns and trends by digging into large amounts of data stored in the warehouse using artificial intelligence, statistical and mathematical techniques.

Data Marts
The concept of a data mart is causing a lot of excitement and attracting much attention in the data warehouse industry. Mostly, data marts are presented as an alternative to a data warehouse that takes significantly less time and money to build.
However, the term data mart means different things to different people. A rigorous definition of this term is a data store that is subsidiary to a data warehouse of integrated data. The data mart is directed at a partition of data (often called a subject area) that is created for the use of a dedicated group of users. A data mart might, in fact, be a set of denormalized, summarized, or aggregated data. Sometimes, such a set could be placed in the data warehouse rather than in a physically separate store of data. In most instances, however, the data mart is a physically separate store of data resident on a separate database server, often on a local area network serving a dedicated user group. Sometimes the data mart simply comprises relational OLAP technology that creates a highly denormalized dimensional model (e.g., star schema) implemented on a relational database. The resulting hypercubes of data are used for analysis by groups of users with a common interest in a limited portion of the database. These types of data marts, called dependent data marts because their data is sourced from the data warehouse, have a high value because, no matter how they are deployed and how many different enabling technologies are used, different users are all accessing the information views derived from the single integrated version of the data.

Data Warehouse Administration and Management
Data warehouses tend to be as much as 4 times as large as related operational databases, reaching terabytes in size depending on how much history needs to be saved. They are not synchronized in real time with the associated operational data but are updated as often as once a day if the application requires it. In addition, almost all data warehouse products include gateways to transparently access multiple enterprise data sources without having to rewrite applications to interpret and utilize the data. Furthermore, in a heterogeneous data warehouse environment, the various databases reside on disparate systems, thus requiring inter-networking tools. The need to manage this environment is obvious. Managing data warehouses includes security and priority management; monitoring updates from the multiple sources; data quality checks; managing and updating meta data; auditing and reporting data warehouse usage and status; purging data; replicating, subsetting and distributing data; backup and recovery; and data warehouse storage management.

Information Delivery System
The information delivery component is used to enable the process of subscribing for data warehouse information and having it delivered to one or more destinations according to some user-specified scheduling algorithm. In other words, the information delivery system distributes warehouse-stored data and other information objects to other data warehouses and end-user products such as spreadsheets and local databases. Delivery of information may be based on time of day or on the completion of an external event. The rationale for the delivery system component is that once the data warehouse is installed and operational, its users don't have to be aware of its location and maintenance. All they need is the report or an analytical view of data at a specific point in time.
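The subscription idea just described can be pictured with a toy sketch; the subscription records, report names and destinations below are entirely hypothetical, and a real delivery system would run such checks on a schedule or on completion of an external event such as a warehouse load:

```python
import datetime

# Hypothetical subscriptions: who gets which report, and when.
subscriptions = [
    {"user": "sales_team", "report": "weekly_sales_summary",
     "deliver_on": "Monday", "destination": "spreadsheet"},
    {"user": "cfo", "report": "daily_revenue",
     "deliver_on": "every_day", "destination": "email"},
]

def due_today(sub, today: datetime.date) -> bool:
    """Day-of-week style scheduling check for one subscription."""
    return sub["deliver_on"] in ("every_day", today.strftime("%A"))

today = datetime.date.today()
for sub in subscriptions:
    if due_today(sub, today):
        print(f"deliver {sub['report']} to {sub['user']} via {sub['destination']}")
```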
With the proliferation of the Internet and the World Wide Web, such a delivery system may leverage the convenience of the Internet by delivering warehouse-enabled information to thousands of end users via the ubiquitous worldwide network. In fact, the Web is changing the data warehousing landscape, since at a very high level the goals of both the Web and data warehousing are the same: easy access to information. The value of data warehousing is maximized when the right information gets into the hands of those individuals who need it, where they need it and when they need it most. However, many corporations have struggled with complex client/server systems to give end users the access they need. The issues become even more difficult to resolve when the users are physically remote from the data warehouse location. The Web removes a lot of these issues by giving users universal and relatively inexpensive access to data. Couple this access with the ability to deliver required information on demand, and the result is a web-enabled information delivery system that allows users dispersed across continents to perform sophisticated business-critical analysis and to engage in collective decision-making.

BUILDING A DATA WAREHOUSE

1. Business Considerations: Return on Investment

1. Approach
The subject-oriented nature of the data warehouse determines the scope of the information in the data warehouse. Organizations embarking on data warehousing development can choose one of two approaches:

Top-down approach: the organization has developed an enterprise data model, collected enterprise-wide business requirements, and decided to build an enterprise data warehouse with subset data marts.
Bottom-up approach: business priorities resulted in developing individual data marts, which are then integrated into the enterprise data warehouse.

2. Organizational Issues
The requirements and environments associated with the informational applications of a data warehouse are different. Therefore, an organization will need to employ different development practices than the ones it uses for operational applications.

2. Design Considerations
In general, a data warehouse's design point is to consolidate data from multiple, often heterogeneous, sources into a query database. The main factors include:

Heterogeneity of data sources, which affects data conversion, quality and timeliness
Use of historical data, which implies that data may be "old"
Tendency of the database to grow very large

Data content: Typically a data warehouse may contain detailed data, but the data is cleaned up and transformed to fit the warehouse model, and certain transactional attributes of the data are filtered out. The content and the structure of the data warehouse are reflected in its data model. The data model is a template for how information will be organized within the integrated data warehouse framework.

Meta data: Defines the contents and location of the data in the warehouse, the relationship between the operational databases and the data warehouse, and the business views of the warehouse data that are accessible by end-user tools.
The warehouse design should prevent any direct access to the warehouse data that does not use meta data definitions to gain that access.

Data distribution: As data volumes continue to grow, the database size may rapidly outgrow a single server. Therefore, it becomes necessary to decide how the data should be divided across multiple servers. The data placement and distribution design should consider several options, including data distribution by subject area, location, or time.

Tools: Data warehouse designers have to be careful not to sacrifice the overall design to fit a specific tool. Selected tools must be compatible with the given data warehousing environment and with each other.

Performance considerations: Rapid query processing is a highly desired feature that should be designed into the data warehouse.

Nine decisions in the design of a data warehouse:
1. Choosing the subject matter
2. Deciding what a fact table represents
3. Identifying and conforming the dimensions
4. Choosing the facts
5. Storing precalculations in the fact table
6. Rounding out the dimension tables
7. Choosing the duration of the database
8. The need to track slowly changing dimensions
9. Deciding the query priorities and the query modes

3. Technical Considerations
A number of technical issues are to be considered when designing and implementing a data warehouse environment. These issues include:

The hardware platform that would house the data warehouse
The database management system that supports the warehouse database
The communication infrastructure that connects the warehouse, data marts, operational systems, and end users
The hardware platform and software to support the meta data repository
The systems management framework that enables centralized management and administration of the entire environment

4. Implementation Considerations
A data warehouse cannot simply be bought and installed; its implementation requires the integration of many products within the data warehouse environment:

Access tools
Data extraction, cleanup, transformation, and migration
Data placement strategies
Meta data
User sophistication levels: casual users, power users, experts

MAPPING DATA WAREHOUSE TO A MULTIPROCESSOR ARCHITECTURE

1. Relational Database Technology for the Data Warehouse
The size of a data warehouse rapidly approaches the point where the search for better performance and scalability becomes a real necessity. The search pursues two goals:

Speed-up: the ability to execute the same request on the same amount of data in less time
Scale-up: the ability to obtain the same performance on the same request as the database size increases

1. Types of Parallelism
Parallel execution of tasks within SQL statements can be done in either of two ways:

Horizontal parallelism, which means that the database is partitioned across multiple disks and parallel processing occurs within a specific task, which is performed concurrently on different processors against different sets of data.
Vertical parallelism, which occurs among different tasks: all component query operations are executed in parallel in a pipelined fashion. In other words, an output from one task becomes an input to another task as soon as records become available (see the sketch below).
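A toy sketch of the pipelined (vertical) parallelism idea, assuming a hypothetical scan, filter and aggregate query plan; Python generators stand in for concurrently executing operators, each consuming records as soon as the previous stage produces them:

```python
# Toy pipeline: scan -> filter -> aggregate, each stage consuming records
# as soon as the previous one emits them (vertical/pipelined parallelism).
def scan(table):
    for row in table:          # in a parallel DBMS the scan would also be
        yield row              # split across disks (horizontal parallelism)

def filter_stage(rows, min_amount):
    for row in rows:
        if row["amount"] >= min_amount:
            yield row

def aggregate(rows):
    total = 0.0
    for row in rows:
        total += row["amount"]
    return total

table = [{"amount": 50.0}, {"amount": 200.0}, {"amount": 120.0}]
print(aggregate(filter_stage(scan(table), min_amount=100.0)))   # 320.0
```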
2. Data Partitioning
Data partitioning is a key requirement for effective parallel execution of database operations. It spreads data from database tables across multiple disks so that I/O operations such as read and write can be performed in parallel.

Random partitioning includes random data striping across multiple disks on a single server. In round-robin partitioning, each new record is placed on the next disk assigned to the database.

Intelligent partitioning assumes that the DBMS knows where a specific record is located and does not waste time searching for it across all disks. This partitioning allows a DBMS to fully exploit parallel architectures and also enables higher availability. Intelligent partitioning includes the following (a toy sketch appears at the end of this section):

Hash partitioning: a hash algorithm is used to calculate the partition number
Key range partitioning: partitions are based on ranges of the partition key
Schema partitioning: each table is placed on its own disk; useful for small reference tables
User-defined partitioning: tables are partitioned based on user-defined expressions

2. Database Architectures for Parallel Processing

1. Shared-Memory Architecture
Also called the shared-everything style, this is the traditional approach to implementing an RDBMS on SMP hardware, and it is simple to implement. The key point of this approach is that a single RDBMS server can potentially utilize all processors, access all memory, and access the entire database, thus providing the user with a consistent single system image.

2. Shared-Disk Architecture
This architecture implements the concept of shared ownership of the entire database between RDBMS servers, each of which is running on a node of a distributed-memory system. Each RDBMS server can read, write, update and delete records from the same shared database, which requires the system to implement a form of distributed lock manager (DLM).

Pinging: In the worst-case scenario, if all nodes are reading and updating the same data, the RDBMS and its DLM will have to spend a lot of resources synchronizing multiple buffer pools. This problem is called pinging.
Data skew: uneven distribution of data. Shared-disk architectures can reduce performance bottlenecks resulting from data skew.

3. Shared-Nothing Architecture
The data is partitioned across many disks, and the DBMS is "partitioned" across multiple co-servers, each of which resides on an individual node of the parallel system and owns its own disk and, thus, its own database partition. This architecture offers near-linear scalability. To support it, a DBMS must meet a number of requirements, including:

Support for function shipping
Parallel join strategies
Support for data repartitioning
Query compilation
Support for database transactions
Support for a single system image of the database environment

4. Combined Architecture
Interserver parallelism of the distributed-memory architecture means that each query is parallelized across multiple servers, while intraserver parallelism of the shared-memory architecture means that a query is parallelized within a server.
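As a toy illustration of the ideas in this section, the sketch below (keys, node count and values are hypothetical) combines hash partitioning with the shared-nothing, function-shipping style of execution: each node owns one hash partition, the same aggregation runs locally on every node, and only partial results are combined:

```python
import zlib

# Hypothetical shared-nothing setup: each "node" owns one hash partition.
NUM_NODES = 3
nodes = [[] for _ in range(NUM_NODES)]          # each node's private partition

def node_for(key: str) -> int:
    """Hash partitioning: the key alone determines the owning node."""
    return zlib.crc32(key.encode()) % NUM_NODES

# Load: every record goes only to the node that owns its key.
for key, amount in [("C1", 120.0), ("C2", 340.0), ("C3", 80.0), ("C1", 60.0)]:
    nodes[node_for(key)].append((key, amount))

# Query via "function shipping": the same local aggregation runs on every
# node against its own partition, and only partial results are combined.
def local_total(partition):
    return sum(amount for _, amount in partition)

print(sum(local_total(p) for p in nodes))        # 600.0
```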
3. Parallel RDBMS Features
Some of the demands placed on parallel DBMS vendors are:

Scope and techniques of parallel DBMS operations
Optimized implementation
Application transparency
The parallel environment
DBMS management tools
Price/performance

4. Alternative Technologies
In addition to parallel database technology, a number of vendors are working on other solutions for improving performance in data warehousing environments. These include:

Advanced database indexing products
Specialized RDBMSs designed specifically for data warehousing
Multidimensional databases

5. Parallel DBMS Vendors
Oracle
Informix
IBM
Sybase
Microsoft

DATA EXTRACTION, CLEANUP AND TRANSFORMATION TOOLS

1. Tool Requirements
The tools that enable sourcing of the proper data contents and formats from operational and external data stores into the data warehouse have to perform a number of important tasks, including:

Data transformation from one format to another on the basis of possible differences between the source and target platforms
Data consolidation and integration, which may include combining several source records into a single record to be loaded into the warehouse
Meta data synchronization and management, and calculations based on the application of the business rules that force certain transformations

2. Vendor Approaches
The tasks of capturing data from a source data system, cleaning and transforming it, and then loading the results into a target data system can be carried out either by separate products or by a single integrated solution. The main approaches are:

Code generators
Database data replication tools
Rule-driven dynamic transformation engines

3. Access to Legacy Data
The middleware strategy is the foundation for tools such as Enterprise/Access from Apertus Corporation. Such tools are typically organized into three layers:

The data layer provides data access and transaction services for management of corporate data assets.
The process layer provides services to manage automation and support for current business processes.
The user layer manages user interaction with process and/or data layer services. It allows the user interfaces to change independently of the underlying business processes.

META DATA

1. Meta Data: Definition
Meta data is one of the most important aspects of data warehousing. It is data about the data stored in the warehouse and its users. Meta data contains the following (a small illustrative example appears after the list):

The location and description of the warehouse system and data components
Names, definitions, structure and content of the data warehouse and end-user views
Identification of authoritative data sources
Integration and transformation rules used to populate the data warehouse; these include the mapping method from operational databases into the warehouse, and the algorithms used for conversion
Integration and transformation rules used to deliver data to end-user analytical tools
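As a purely illustrative example of what such meta data entries might look like (all field names and values are hypothetical), consider a source-to-target mapping record of the kind a meta data repository could hold, combining technical and business meta data:

```python
# Hypothetical meta data entry describing one warehouse column:
# technical meta data (source, transformation) plus business meta data.
meta_entry = {
    "warehouse_column": "sales_fact.amount_usd",
    "technical": {
        "source": "ORDERS.AMT",                 # authoritative data source
        "source_system": "legacy_order_entry",
        "transformation": "AMT * fx_rate(currency, 'USD')",
        "load_frequency": "daily",
    },
    "business": {
        "name": "Sale amount (USD)",
        "definition": "Total value of a sale converted to US dollars",
        "owner": "Sales operations",
    },
}

# A simple lookup an end-user tool might perform against the repository:
print(meta_entry["business"]["definition"])
```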
2. Meta Data Interchange Initiative

Introduction
The Meta Data Coalition was founded by a group of industry-leading vendors with the aim of defining a tactical set of standard specifications for the access and interchange of meta data between different software tools. What follows is an overview of Version 1.0 of the Meta Data Interchange Specification (MDIS), an initiative taken by the Meta Data Coalition.

Goals of the Meta Data Interchange Specification Initiative

Situation Analysis
The volatility of our global economy and an increasingly competitive business climate are driving companies to leverage their existing resources in new, creative, and more effective ways. Enterprise data, once viewed as merely fodder for the operational systems that ran the day-to-day mechanics of business, is now being recognized not only as one of these valuable resources but as a strategic business asset. However, as the rate of change continues to accelerate, in response to both business pressures and technological advancement, managing this strategic asset and providing timely, accurate, and manageable access to enterprise data becomes increasingly critical.

This need to find faster, more comprehensive and efficient ways to provide access to and manage enterprise data has given rise to a variety of new architectures and approaches, such as data warehouses, distributed client/server computing, and integrated enterprise-wide applications. In these new environments, meta data, or the information about the enterprise data, is emerging as a critical element in effective data management. Vendors as well as users have been quick to appreciate the value of meta data, but the rapid proliferation of data manipulation and management tools has resulted in almost as many different "flavors" and treatments of meta data as there are tools.

THE CHALLENGE
To enable full-scale enterprise data management, different tools must be able to freely and easily access, and in some cases manipulate and update, the meta data created by other tools and stored in a variety of different storage facilities. The only viable mechanism to enable disparate tools from independent vendors to exchange this meta data is to establish at least a minimum common denominator of interchange specifications and guidelines to which the different vendors' tools can comply.

Establishing and adhering to a core set of industry meta data interchange specifications will enable IS managers to select what they perceive as "best of breed" tools to build the infrastructure that best fits their unique environment needs. In choosing interchange-compliant tools, they can be assured of the accurate and efficient exchange of meta data essential to meeting their users' business information needs.

The Meta Data Coalition was established to bring industry vendors and users together to address a variety of difficult problems and issues with regard to exchanging, sharing, and managing meta data. It is intended as a coalition of interested parties with a common focus and shared goals, not a traditional standards body or regulatory group in any way.

Terminology and Basic Assumptions
The Meta Data Interchange Specification (MDIS) draws a distinction between:

The Application Metamodel - the tables, etc., used to "hold" the meta data for schemas, etc., for a particular application; for example, the set of tables used to store meta data in Composer may differ significantly from those used by the Bachman Data Analyst.
The MetaData Metamodel - the set of objects that the Meta Data Interchange Specification can be used to describe.
These represent the information that is common to (i.e., represented by) one or more classes of tools, such as data discovery tools, data extraction tools, replication tools, user query tools, database servers, etc. The meta data metamodel should be:

Independent of any application metamodel
Character-based so as to be hardware/platform-independent
Fully qualified so that the definition of each object is uniquely identified

Basic Assumptions
The Meta Data Coalition has made the following assumptions:

Because users' information needs are growing more complex, the IS organization would ideally like the interchange specification to support (to the greatest extent possible) the bidirectional interchange of meta data so that updates can be made in the most natural place. For example, the user might initially specify the source-to-target mapping between a legacy database and an RDBMS target in a CASE tool but, after using a data extraction tool to generate and execute programs to actually move the data, discover that the mapping was somehow incorrect. The most natural place to test out the "fix" to this problem is in the context of the data extraction tool. Once the correction is verified, one would like the interchange to update the metamodel in the CASE tool, rather than having to go to the CASE tool, change the mapping, and trigger the meta data interchange between the CASE tool and the data extraction tool before being able to test the new mapping.
Vendors would like to support the Meta Data Interchange Specification with a minimum amount of additional development.

In light of these assumptions, the meta data model must be sufficiently extensible to allow a vendor to store the entire metamodel for any application. In other words, MDIS should provide mechanisms for extending the meta data model so that additional (and possibly encrypted) information can be passed. An example of when a vendor might want encryption is the case of a tool that generates parameters for invoking some internal routine. Because these parameters might give other vendors information about what is considered a proprietary part of the tool, the vendor may wish to encrypt them.

If one assumed that all updates to the model occurred in the context of a single tool, e.g., the CASE tool in the example above, the MDIS would not benefit from "carrying along" any of the tool-specific meta data. However, as the above example indicates, this assumption does not reflect the "natural" meta data interchange flow. Consequently, some type of mechanism for providing extensions to the type of information exchanged by the interchange specification is necessary if one hopes to achieve bidirectional interchange between vendor applications.

The Meta Data Interchange Framework
For Version 1.0, the Meta Data Council is recommending the ASCII-based batch approach so that vendors can implement support for the specification with minimum overhead and the customer benefits from the availability of meta data interchange as quickly as possible.

ASCII Batch Approach
An ASCII batch approach relies on an ASCII file format that contains the description of the common meta data components and standardized access requirements that make up the interchange specification meta data model.
In this approach, the entire ASCII file containing the MDIS schema and access parameters is reloaded whenever a tool accesses the meta data through the specification API. This approach requires only the addition of a simple import/export function to the tools and does not require updating a tool in the event of meta data model changes, because the most up-to-date schema will always be available through the access framework. This minimizes the amount of retrofitting required for tools to remain compliant with the MDIS, because the burden of updates stays primarily within the framework itself.

The Meta Data Interchange Specification
There are two basic aspects of the proposed specification:

Those that pertain to the semantics and syntax used to represent the meta data to be exchanged. These items are those typically found in a specification document.
Those that pertain to the framework in which the specification will be used. This second set consists of two file-based semaphores that are used by the specification's import and export functions to help the user of the specification control consistency.

Components defining the semantics and syntax of the specification:

The Metamodel
The Meta Data Interchange Specification Metamodel describes the entities and relationships that are used to directly represent meta data in the MDIS. The goal in designing this metamodel is twofold:

To choose the set of entities and relationships that represents the objects that the majority of tools require.
To provide some mechanism for extensibility in case some tool requires the representation of some other type of object.

The Mechanism for Extending the Metamodel
The mechanism chosen to provide extensibility to the specification is analogous to the "properties" object found in LISP environments: a character field of arbitrary length that consists of a set of identifiers and a value, where the identifiers are used by the import function of the specification to locate and identify the private meta data in question and the value is the actual meta data itself. Note that because some tools may consider their private meta data proprietary, the actual value for this meta data may be encrypted.

The MDIS Access Framework
Version 1.0 of the MDIS includes information that supports a bidirectional flow of meta data while maintaining meta data consistency. Three types of information are required:

Versioning information in the header of the file containing the meta data
A Tool Profile, which describes what types of data elements a tool directly represents and/or updates
A Configuration Profile, which describes the "legal flow of meta data"

For example, although source-to-target mapping may be specified in the context of some analysis tool, once that meta data has been exported to ETI*EXTRACT and the mapping is changed because of errors found in the expected data, one may want to require that all future changes to the mapping originate in ETI*EXTRACT. If the configuration profile is set properly, the import function for ETI*EXTRACT would raise an error if asked to import a conversion specification from the analysis tool with a version number greater than the version number of the one originally imported from the mapping tool.
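A toy sketch of that consistency rule (tool names, version fields and the checking function are hypothetical, not the actual MDIS interfaces): an import is rejected when a mapping change arrives from a tool other than the designated owner with a version newer than the one originally imported:

```python
# Hypothetical configuration profile: mapping changes must originate here.
CONFIG_PROFILE = {"mapping_owner": "extract_tool", "imported_version": 3}

def import_mapping(source_tool: str, version: int) -> str:
    """Reject imports that violate the legal flow of meta data."""
    if (source_tool != CONFIG_PROFILE["mapping_owner"]
            and version > CONFIG_PROFILE["imported_version"]):
        return "ERROR: mapping changes must originate in the owning tool"
    CONFIG_PROFILE["imported_version"] = max(
        CONFIG_PROFILE["imported_version"], version)
    return "imported"

print(import_mapping("analysis_tool", 4))   # rejected
print(import_mapping("extract_tool", 4))    # imported
```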
The components of the meta data interchange standard framework are:

The standard meta data model
The standard access framework
The tool profile
The user configuration

3. Meta Data Repository
The data warehouse architecture framework, by design, represents a higher level of abstraction than the meta data interchange standard framework. The warehouse design should prevent any direct access to the warehouse data that does not use meta data definitions to gain that access. The framework provides the following benefits:

It provides a comprehensive suite of tools for enterprise-wide meta data management.
It reduces and eliminates information redundancy, inconsistency, and underutilization.
It simplifies management and improves organization, control, and accounting of information assets.
It increases identification, understanding, coordination and utilization of enterprise-wide information assets.
It provides effective data administration tools to better manage corporate information assets with a full-function data dictionary.
It increases flexibility, control, and reliability of the application development process and accelerates internal application development.
It leverages investment in legacy systems with the ability to inventory and utilize existing applications.
It provides a universal relational model for heterogeneous RDBMSs to interact and share information.
It enforces CASE development standards and eliminates redundancy with the ability to share and reuse meta data.

4. Meta Data Management
A frequently occurring problem in data warehousing is the inability to communicate to the end user what information resides in the data warehouse and how it can be accessed. The key to providing users and applications with a roadmap to the information stored in the warehouse is the meta data. It defines all data elements and their attributes, data sources and timing, and the rules that govern data use and data transformation. Meta data needs to be collected as the warehouse is designed and built, and its integrity must be enforced and redundancy controlled.