DATA_WAREHOUSING_AND_DATA_MINING.pdf
Data Warehousing And Data Mining (DECAP446)
Edited by: Sartaj Singh

CONTENT
Unit 1: Data Warehousing and Online Analytical Processing
Unit 2: Introduction to Data Mining
Unit 3: Data Warehousing Architecture
Unit 4: Installation and Development Environment Overview
Unit 5: Introduction to Mining Tools
Unit 6: Extracting Data Sets
Unit 7: Data Preprocessing
Unit 8: Data Pre-processing Using Rapid Miner
Unit 9: Association and Correlation Analysis
Unit 10: Clustering Algorithms and Cluster Analysis
Unit 11: Classification
Unit 12: Prediction and Classification Using Weka Tool
Unit 13: Clustering Methods Using Weka Tool
Unit 14: Applications of Data Warehousing and Data Mining
All units: Dr. Rajni Bhalla, Lovely Professional University

Unit 1: Data Warehousing and Online Analytical Processing
Dr. Rajni Bhalla, Lovely Professional University

CONTENTS
Objectives
Introduction
1.1 What Is a Data Warehouse?
1.2 The Need for a Separate Data Warehouse
1.3 Data Warehouse Models: Enterprise Warehouse, Data Mart, and Virtual Warehouse
1.4 Differences between Operational Database Systems and Data Warehouses
1.5 Data Warehouse Modeling: Data Cube
1.6 Conceptual Modeling of Data Warehouse
1.7 Concept Hierarchies
1.8 Measures: Their Categorization and Computation
1.9 OLAP Operations
1.10 Operations in the Multidimensional Data Model (OLAP)
1.11 Data Warehouse Design and Usage
1.12 From Online Analytical Processing to Multidimensional Data Mining
1.13 Data Warehouse Implementation
1.14 Indexing OLAP Data: Bitmap Index and Join Index
Summary
Keywords
Self Assessment
Review Questions
Answers: Self Assessment
Further Readings

Objectives
After studying this unit, you will be able to:
Understand the data warehouse concept.
Outline the difference between an operational database system and a data warehouse.
Learn how data cubes model n-dimensional data.
Examine how to index OLAP data with bitmap indexing and join indexing.
Explore different methods to compute data cubes efficiently.
Study the design and usage of data warehousing for information processing, analytical processing, and data mining.

Introduction
Data warehouses simplify and combine data in multidimensional space. The building of data warehouses includes data cleaning, data integration, and data transformation, and can be seen as a significant preprocessing step for data mining. Furthermore, data warehouses offer online analytical processing (OLAP) tools for the collaborative analysis of multidimensional data of diverse granularities, which simplifies effective data generalization and data mining.
Numerous other data mining tasks, such as association, classification, prediction, and clustering, can be combined with OLAP operations to improve interactive mining of knowledge at several levels of abstraction. Hence, the data warehouse has become an increasingly important platform for data analysis and OLAP and will provide an effective platform for data mining. Data warehousing and OLAP therefore form an essential step in the knowledge discovery process. This chapter focuses on an overview of data warehouse and OLAP technology.
1.1 What Is a Data Warehouse?
Data warehouses have been defined in many ways, making it hard to articulate a rigorous definition. Loosely speaking, a data warehouse refers to a data repository that is maintained separately from an organization's operational databases. Data warehouse systems allow for the integration of a variety of application systems. They support information processing by providing a solid platform of consolidated historic data for analysis.
According to William H. Inmon, a leading architect in the construction of data warehouse systems, "A data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management's decision-making process". This short but comprehensive definition presents the major features of a data warehouse. The four keywords (subject-oriented, integrated, time-variant, and non-volatile) differentiate data warehouses from other data source systems, such as relational database systems, transaction processing systems, and file systems. Let's take a closer look at each of these key features.
Subject-oriented: A data warehouse is organized around major subjects such as customer, supplier, product, and sales. Rather than focusing on the day-to-day operations and transaction processing of an organization, a data warehouse focuses on the modeling and analysis of data for decision-makers. Hence, data warehouses usually provide a simple and succinct view of particular subject issues by excluding data that are not useful in the decision support process.
Integrated: A data warehouse is usually constructed by integrating multiple heterogeneous sources, such as relational databases, flat files, and online transaction records. Data cleaning and data integration techniques are applied to ensure consistency in naming conventions, encoding structures, attribute measures, and so on.
Time-variant: Data is stored to provide information from a historic perspective (e.g., the past 5–10 years). Every key structure in the data warehouse contains, either implicitly or explicitly, a time element.
Non-volatile: A data warehouse is always a physically separate store of data transformed from the application data found in the operational environment. Due to this separation, a data warehouse does not require transaction processing, recovery, and concurrency control mechanisms. It usually requires only two operations in data accessing: initial loading of data and access of data.
Example: A typical data warehouse is organized around major subjects, such as customer, vendor, product, and sales, rather than concentrating on the day-to-day operations and transaction processing of an organization.
Features of a Data Warehouse
It is separate from the operational database.
It integrates data from heterogeneous systems.
It stores a huge amount of data, more historical than current.
It does not require data to be highly accurate.
Queries are generally complex. 2 Lovely Professional University Unit 1: Data Warehousing and Online Analytical Processing Notes The goal is to execute statistical queries and provide results that can influence decision-making in favor of the Enterprise. These systems are thus called Online Analytical Processing Systems (OLAP). 1.2 The need for a Separate Data Warehouse Because operational databases store huge amounts of data, you may wonder, “Why not perform online analytical processing directly on such databases instead of spending additional time and resources to construct a separate data warehouse?” A major reason for such a separation is to help promote the high performance of both systems. An operational database is designed and tuned from known tasks and workloads like indexing and hashing using primary keys, searching for particular records, and optimizing “canned” queries. On the other hand, data warehouse queries are often complex. They involve the computation of large data groups at summarized levels and may require the use of special data organization, access, and implementation methods based on multidimensional views. Processing OLAP queries in operational databases would substantially degrade the performance of operational tasks. Moreover, an operational database supports the concurrent processing of multiple transactions. Concurrency control and recovery mechanisms (e.g., locking and logging) are required to ensure the consistency and robustness of transactions. An OLAP query often needs read-only access to data records for summarization and aggregation. Concurrency control and recovery mechanisms, if applied for such OLAP operations, may jeopardize the execution of concurrent transactions and thus substantially reduce the throughput of an OLTP system. Finally, the separation of operational databases from data warehouses is based on the different structures, contents, and uses of the data in these two systems. Decision support requires historic data, whereas operational databases do not typically maintain historic data. In this context, the data in operational databases, though abundant, are usually far from complete for decision making. Decision support requires consolidation (e.g., aggregation and summarization) of data from heterogeneous sources, resulting in high-quality, clean, integrated data. In contrast, operational databases contain only detailed raw data, such as transactions, which need to be consolidated before analysis. Because the two systems provide quite different functionalities and require different kinds of data, it is presently necessary to maintain separate databases. 1.3 Data Warehouse Models: Enterprise Warehouse, Data Mart, and Virtual Warehouse From the architecture point of view, there are three data warehouse models: the enterprise warehouse, the data mart, and the virtual warehouse. Enterprise warehouse: An enterprise warehouse collects all of the information about subjects spanning the entire organization. It provides corporate-wide data integration, usually from one or more operational systems or external information providers, and is cross-functional in scope. It typically contains detailed data as well as summarized data and can range in size from a few gigabytes to hundreds of gigabytes, terabytes, or beyond. An enterprise data warehouse may be implemented on traditional mainframes, computer superservers, or parallel architecture platforms. It requires extensive business modeling and may take years to design and build. 
Data mart: A data mart contains a subset of corporate-wide data that is of value to a specific group of users. The scope is confined to specific selected subjects. For example, a marketing data mart may confine its subjects to customer, item, and sales. The data contained in data marts tend to be summarized. Depending on the source of data, data marts can be categorized into the following two classes:
1. Independent data marts are sourced from data captured from one or more operational systems or external information providers, or from data generated locally within a particular department or geographic area.
2. Dependent data marts are sourced directly from enterprise data warehouses.
Virtual warehouse: A virtual warehouse is a set of views over operational databases. For efficient query processing, only some of the possible summary views may be materialized. A virtual warehouse is easy to build but requires excess capacity on operational database servers.
A recommended method for the development of data warehouse systems is to implement the warehouse incrementally and evolutionarily, as shown in Figure 1. First, a high-level corporate data model is defined within a reasonably short period (such as one or two months) that provides a corporate-wide, consistent, integrated view of data among different subjects and potential usages. This high-level model, although it will need to be refined in the further development of enterprise data warehouses and departmental data marts, will greatly reduce future integration problems. Second, independent data marts can be implemented in parallel with the enterprise warehouse based on the same corporate data model set noted before. Third, distributed data marts can be constructed to integrate different data marts via hub servers. Finally, a multitier data warehouse is constructed where the enterprise warehouse is the sole custodian of all warehouse data, which is then distributed to the various dependent data marts.
Figure 1 A recommended approach for data warehouse development
1.4 Differences between Operational Database Systems and Data Warehouses
Because most people are familiar with commercial relational database systems, it is easy to understand what a data warehouse is by comparing these two kinds of systems. The major task of online operational database systems is to perform online transaction and query processing. These systems are called online transaction processing (OLTP) systems. They cover most of the day-to-day operations of an organization such as purchasing, inventory, manufacturing, banking, payroll, registration, and accounting. Data warehouse systems, on the other hand, serve users or knowledge workers in the role of data analysis and decision making. Such systems can organize and present data in various formats to accommodate the diverse needs of different users. These systems are known as online analytical processing (OLAP) systems. The major distinguishing features between OLTP and OLAP are summarized in Table 1.
Table 1 Difference between OLTP and OLAP systems
Feature | OLTP System | OLAP System
Characteristic | Operational processing | Informational processing
Users | Clerks, clients, and information technology professionals | Knowledge workers, including managers, executives, and analysts
System orientation | Customer-oriented; used for day-to-day transaction and query processing | Market-oriented; used for long-term informational requirements, data analysis, and decision support
Data contents | Manages current data that, typically, are too detailed to be easily used for decision making | Manages large amounts of historical data; provides facilities for summarization and aggregation; stores and manages information at different levels of granularity
Database design | Adopts an entity-relationship (ER) data model and an application-oriented database design | Adopts either a star or a snowflake model and a subject-oriented database design
View | Focuses mainly on the current data within an enterprise or department, without referring to historical data or data in different organizations | Often spans multiple versions of a database schema, due to the evolutionary process of an organization; also deals with information that originates from different organizations, integrating information from many data stores
Volume of data | Not very large | Huge; because of their volume, OLAP data are stored on multiple storage media
Access patterns | Consists mainly of short, atomic transactions; such a system requires concurrency control and recovery mechanisms | Mostly read-only operations (since most data warehouses store historical rather than up-to-date information), although many could be complex queries
Access mode | Read/write | Mostly read
Focus | Data in | Information out
Operations | Index/hash on a primary key | Lots of scans
Number of records accessed | Tens | Millions
Number of users | Thousands | Hundreds
DB size | 100 MB to GB | 100 GB to TB
Priority | High performance, high availability | High flexibility, end-user autonomy
Metric | Transaction throughput | Query response time
What exactly is the difference between an operational database and a data warehouse? Explain with the help of a suitable example.
1.5 Data Warehouse Modeling: Data Cube
The entity-relationship data model is commonly used in the design of relational databases, where a database schema consists of a set of entities and the relationships between them. Such a data model is appropriate for online transaction processing. The data warehouse, however, requires a concise, subject-oriented schema that facilitates OLAP. Data warehouses and OLAP tools are based on a multidimensional data model. This model views data in the form of a data cube. A data cube allows data to be modeled and viewed in multiple dimensions. It is defined by dimensions and facts. In general terms, dimensions are the perspectives or entities concerning which an organization wants to keep records. Each dimension may have a table associated with it, called a dimension table, which further describes the dimension. For example, a dimension table for an item may contain the attributes item name, brand, and type. Dimension tables can be specified by users or experts, or automatically generated and adjusted based on data distributions. A multidimensional data model is typically organized around a central theme, such as sales. This theme is represented by a fact table. Facts are numeric measures. Think of them as the quantities by which we want to analyze relationships between dimensions.
Examples of facts for a sales data warehouse include dollars sold (sales amount in dollars), units sold (number of units sold), and the amount budgeted. The fact table contains the names of the facts, or measures, as well as keys to each of the related dimension tables. You will soon get a clearer picture of how this works when we look at multidimensional schemas. Table 22-D View of Sales Data for AllElectronics According to time and item Note: The sales are from branches located in the city of Vancouver. The measure displayed is dollars sold (in thousands). Table 33-D View of Sales Data for AllElectronics According to time, item, and location Table 2and Table 3 show the data at different degrees of summarization. In the data warehousing research literature, a data cube like those shown in Figure 2 and Figure 3 is often referred to as a cuboid. Given a set of dimensions, we can generate a cuboid for each of the possible subsets of the given dimensions. The result would form a lattice of cuboids, each showing the data at a different level of summarization, or group-by. The lattice of cuboids is then referred to as a data cube. Figure 4 shows a lattice of cuboids forming a data cube for the dimensions time, item, location, and supplier. 6 Lovely Professional University Unit 1: Data Warehousing and Online Analytical Processing Notes Figure 2 A 3-D data cube representation of the data in Table 3 according to time, item, and location. Figure 3 A 4-D data cube representation of sales data, according to time, item, location, and supplier. The cuboid that holds the lowest level of summarization is called the base cuboid. For example, the 4-D cuboid in Figure 3 is the base cuboid for the given time, item, location, and supplier dimensions. Figure 2 is a 3-D (non-base) cuboid for time, item,and location summarized for all suppliers. The 0-D cuboid, which holds the highest level of summarization, is called the apex cuboid. In our example, this is the total sales, or dollars sold, summarized over all four dimensions. The apex cuboid is typically denoted by all.” Figure 4Lattice of cuboids, making up a 4-D data cube for time, item, location, and supplier. Example: “Date” can be grouped into “day”, “month”, “quarter”, “year” or “week”, which forms a lattice structure. Lovely Professional University 7 Notes Data Warehousing and Data Mining 1.6 Conceptual Modeling of Data Warehouse The most popular data model for a data warehouse is a multidimensional model, which can exist in the form of a star schema, a snowflake schema, or a fact constellation schema. Let’s look at each of these. Star schema: The most common modeling paradigm is the star schema, in which the data warehouse contains (1) a large central table (fact table) containing the bulk of the data, with no redundancy, and (2) a set of smaller attendant tables (dimension tables), one for each dimension. The schema graph resembles a starburst, with the dimension tables displayed in a radial pattern around the central fact table. A star schema for AllElectronics sales is shown in Figure 5. Sales are considered along four dimensions: time, item, branch, and location. The schema contains a central fact table for sales that contain keys to each of the four dimensions, along with two measures: dollars sold and units sold. To minimize the size of the fact table, dimension identifiers (e.g., time key and item key) are system-generated identifiers. 
Notice that in the star schema, each dimension is represented by only one table, and each table contains a set of attributes. Figure 5Star schema of a sales data warehouse. Example: Let, an organization sells products throughout the world. The main four major dimensions are time, location, time, and branch. Snowflake schema: The snowflake schema is a variant of the star schema model, where some dimension tables are normalized, thereby further splitting the data into additional tables. The resulting schema graph forms a shape similar to a snowflake. The major difference between the snowflake and star schema models is that the dimension tables of the snowflake model may be kept in the normalized form to reduce redundancies. Such a table is easy to maintain and saves storage space. However, this space savings is negligible in comparison to the typical magnitude of the fact table. Furthermore, the snowflake structure can reduce the effectiveness of browsing, since more joins will be needed to execute a query. Consequently, the system performance may be adversely impacted. Hence, although the snowflake schema reduces redundancy, it is not as popular as the star schema in data warehouse design. 8 Lovely Professional University Unit 1: Data Warehousing and Online Analytical Processing Notes Figure 6Snowflake schema of a sales data warehouse. A snowflake schema for AllElectronics sales is given in Figure 6. Here, the sales fact table is identical to that of the star schema in Figure 5. The main difference between the two schemas is in the definition of dimension tables. The single dimension table for an item in the star schema is normalized in the snowflake schema, resulting in new item and supplier tables. For example, the item dimension table now contains the attributes item key, item name, brand, type, and supplier key, where supplier key is linked to the supplier dimension table, containing supplier key and supplier type information. Similarly, the single dimension table for location in the star schema can be normalized into two new tables: location and city. The city key in the new location table links to the city dimension. Notice that, when desirable, further normalization can be performed on province or state and country in the snowflake schema shown in Figure 6. Fact constellation: Sophisticated applications may require multiple fact tables to share dimension tables. This kind of schema can be viewed as a collection of stars, and hence is called a galaxy schema or a fact constellation. Figure 7Fact constellation schema of a sales and shipping data warehouse. A fact constellation schema is shown in Figure 7. This schema specifies two fact tables, sales, and shipping. The sales table definition is identical to that of the star schema (Figure 5). The shipping table has five dimensions, or keys—item key, time key, shipper key, from location, and to location—and two measures—dollars cost and units shipped. A fact constellation schema allows dimension tables to be shared between fact tables. For example, the dimensions tables for time, item, and location are shared between the sales and shipping fact tables. 1.7 Concept Hierarchies A concept hierarchy defines a sequence of mappings from a set of low-level concepts to higher-level, more general concepts. Consider a concept hierarchy for the dimension location. City values for location include Lovely Professional University 9 Notes Data Warehousing and Data Mining Vancouver, Toronto, New York, and Chicago. 
Each city, however, can be mapped to the province or state to which it belongs. For example, Vancouver can be mapped to British Columbia, and Chicago to Illinois. The provinces and states can in turn be mapped to the country (e.g., Canada or the United States) to which they belong. These mappings form a concept hierarchy for the dimension location, mapping a set of low-level concepts (i.e., cities) to higher-level, more general concepts (i.e., countries). This concept hierarchy is illustrated in Figure 8. Figure 8A concept hierarchy for location. 1.8 Measures: Their Categorization and Computation “How are measures computed?” To answer this question, we first study how measures can be categorized. Note that a multidimensional point in the data cube space can be defined by a set dimension–value pairs; for example,( time =“Q1”, location = “Vancouver”, item =“computer”). A data cube measure is a numeric function that can be evaluated at each point in the data cube space. A measured value is computed for a given point by aggregating the data corresponding to the respective dimension-value pairs defining the given point.We will look at concrete examples of this shortly. Distributive: An aggregate function is distributive if it can be computed in a distributed manner as follows. Suppose the data are partitioned into n sets. We apply the function to each partition, resulting in n aggregate values. If the result derived by applying the function to the n aggregate values is the same as that derived by applying the function to the entire data set (without partitioning), the function can be computed in a distributed manner. For example, sum() can be computed for a data cube by first partitioning the cube into a set of sub-cubes, computing sum() for each sub-cube, and then summing up the counts obtained for each sub-cube. Hence, sum() is a distributive aggregate function. For the same reason, count(), min(), and max() are distributive aggregate functions. Algebraic: An aggregate function is algebraic if it can be computed by an algebraic function with M arguments (where M is a bounded positive integer), each of which is obtained by applying a distributive aggregate function. For example, avg() (average) can be computed by sum()/count(), where both sum() and count() are distributive aggregate functions. Similarly, it can be shown that min N() and max N() (which find the N minimum and N maximum values, respectively, in a given set) and standard deviation() are algebraic aggregate functions. A measure is algebraic if it is obtained by applying an algebraic aggregate function. Holistic: An aggregate function is holistic if there is no constant bound on the storage size needed to describe a sub-aggregate. That is, there does not exist an algebraic function with M arguments (where M is a constant) that characterizes the computation. Common examples of holistic functions include median(), mode(), and rank(). A measure is holistic if it is obtained by applying a holistic aggregate function. 10 Lovely Professional University Unit 1: Data Warehousing and Online Analytical Processing Notes 1.9 OLAP Operations “A data cube allows data to be modeled and viewed in multiple dimensions. It is defined by dimensions and facts. Dimensions are the entities concerning which an organization wants to keep records. Facts are numerical measures. It is the quantities by which we want to analyze relationships between dimensions. The data cube is used by the users of the decision support system to see their data. 
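To make the lattice-of-cuboids idea concrete, the following short Java sketch (not part of the original text; the dimension names time, item, location, and supplier are taken from Figure 4) enumerates every group-by that a 4-D data cube contains. Each subset of the dimensions is one cuboid, so the cube has 2^4 = 16 cuboids, from the full group-by down to the empty one.

import java.util.ArrayList;
import java.util.List;

/** Enumerates every cuboid (group-by) in the lattice for a set of dimensions. */
public class CuboidLattice {
    public static void main(String[] args) {
        String[] dims = {"time", "item", "location", "supplier"};
        int n = dims.length;
        // Each of the 2^n bit patterns selects one subset of the dimensions, i.e. one cuboid.
        for (int mask = (1 << n) - 1; mask >= 0; mask--) {
            List<String> cuboid = new ArrayList<>();
            for (int i = 0; i < n; i++) {
                if ((mask & (1 << i)) != 0) {
                    cuboid.add(dims[i]);
                }
            }
            if (cuboid.size() == n) {
                System.out.println("base cuboid : " + cuboid);   // lowest level of summarization
            } else if (cuboid.isEmpty()) {
                System.out.println("apex cuboid : ()  (all)");   // highest level of summarization
            } else {
                System.out.println(cuboid.size() + "-D cuboid : " + cuboid);
            }
        }
        System.out.println("total cuboids = " + (1 << n));       // 2^4 = 16
    }
}

Running the sketch lists the 16 group-by's, which is exactly the lattice drawn in Figure 4; the two special cuboids it labels are defined next.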
The cuboid that holds the lowest level of summarization is called the base cuboid. The 0-D cuboid, which holds the highest level of summarization, is called the apex cuboid.
Figure 9 A Lattice of Cuboids
1.10 Operations in the Multidimensional Data Model (OLAP)
In the multidimensional model, data are organized into multiple dimensions, and each dimension contains multiple levels of abstraction defined by concept hierarchies. This organization provides users with the flexibility to view data from different perspectives. Some OLAP data cube operations exist to materialize these different views, allowing interactive querying and analysis of the data at hand. Five basic OLAP commands are used to perform data retrieval from a data warehouse.
1. Roll-up (drill-up): The roll-up operation performs aggregation on a data cube either by climbing up a concept hierarchy or by dimension reduction. For example, suppose Delhi, New York, Patiala, and Los Angeles win 5, 2, 3, and 5 medals respectively; we can roll up on Location from cities to countries.
2. Drill-down: Drill-down is the reverse of roll-up. It navigates from a higher-level summary to more detailed, lower-level data. Drill-down can be performed either by stepping down a concept hierarchy for a dimension or by introducing a new dimension. For example, drill down on Location from countries to cities.
3. Slice and dice: The slice operation performs a selection on one dimension of the given cube, resulting in a subcube, and thus reduces the dimensionality of the cube. For example, a selection where Medal = 5 gives:
Location | Medal
Delhi | 5
Los Angeles | 5
The dice operation defines a subcube by performing a selection on two or more dimensions, for example a selection where Medal = 3 or Location = New York.
4. Pivot: Pivot is also known as rotate. It rotates the data axes to view the data from different perspectives.
Discuss an OLAP application and identify its characteristics.
1.11 Data Warehouse Design and Usage
"What can business analysts gain from having a data warehouse?" First, having a data warehouse may provide a competitive advantage by presenting relevant information from which to measure performance and make critical adjustments to help win over competitors. Second, a data warehouse can enhance business productivity because it can quickly and efficiently gather the information that accurately describes the organization. Third, a data warehouse facilitates customer relationship management because it provides a consistent view of customers and items across all lines of business, all departments, and all markets. Finally, a data warehouse may bring about cost reduction by tracking trends, patterns, and exceptions over long periods consistently and reliably.
To design an effective data warehouse we need to understand and analyze business needs and construct a business analysis framework. The construction of a large and complex information system can be viewed as the construction of a large and complex building, for which the owner, architect, and builder have different views. Four different views regarding a data warehouse design must be considered: the top-down view, the data source view, the data warehouse view, and the business query view.
The top-down view allows the selection of the relevant information necessary for the data warehouse. This information matches current and future business needs.
The data source view exposes the information being captured, stored, and managed by operational systems.
This information may be documented at various levels of detail and accuracy, from individual data source tables to integrated data source tables. Data sources are often modeled by traditional data modeling techniques, such as the entity-relationship model or CASE (computer-aided software engineering) tools. 12 Lovely Professional University Unit 1: Data Warehousing and Online Analytical Processing Notes The data warehouse view includes fact tables and dimension tables. It represents the information that is stored inside the data warehouse, including pre-calculated totals and counts, as well as information regarding the source, date, and time of origin, added to provide historical context. Finally, the business query view is the data perspective in the data warehouse from the end-user’s viewpoint. Building and using a data warehouse is a complex task because it requires business skills, technology skills, and program management skills. Regarding business skills, buildinga data warehouse involves understanding how systems store and manage their data, howto build extractors that transfer data from the operational system to the data warehouse,and how to build warehouse refresh software that keeps the data warehouse reasonablyup-to-date with the operating system’s data. Using a data warehouse involves understandingthe significance of the data it contains, as well as understanding and translatingthe business requirements into queries that can be satisfied by the data warehouse. Regarding technology skills, data analysts are required to understand how to make assessments from quantitative information and derive facts based on conclusions from historic information in the data warehouse. These skills include the ability to discover patterns and trends, extrapolate trends based on history and look for anomalies or paradigm shifts, and to present coherent managerial recommendations based on such analysis. Finally, program management skills involve the need to interface with many technologies, vendors, and end-users to deliver results in a timely and cost- effective manner. DataWarehouse Design Process Let’s look at various approaches to the data warehouse design process and the steps involved. A data warehouse can be built using a top-down approach, a bottom-up approach, or a combination of both. The top-down approach starts with overall design and planning. It is useful in cases where the technology is mature and well known, and where the business problems that must be solved are clear and well understood. The bottom-up approach starts with experiments and prototypes. This is useful in the early stage of business modeling and technology development. It allows an organization to move forward at considerably less expense and to evaluate the technological benefits before making significant commitments. In the combined approach, an organization can exploit the planned and strategic nature of the top-down approach while retaining the rapid implementation and opportunistic application of the bottom-up approach. In general, the warehouse design process consists of the following steps: 1. Choose a business process to model (e.g., orders, invoices, shipments, inventory, account administration, sales, or the general ledger). If the business process is organizational and involves multiple complex object collections, a data warehouse model should be followed. However, if the process is departmental and focuses on the analysis of one kind of business process, a data mart model should be chosen. 2. 
Choose the business process grain, which is the fundamental, atomic level of data to be represented in the fact table for this process (e.g., individual transactions, individual daily snapshots, and so on). 3. Choose the dimensions that will apply to each fact table record. Typical dimensions are time, item, customer, supplier, warehouse, transaction type, and status. 4. Choose the measures that will populate each fact table record. Typical measures are numeric additive quantities like dollars sold and units sold. Because data warehouse construction is a difficult and long-term task, its implementation scope should be clearly defined. The goals of an initial data warehouse implementation should be specific, achievable, and measurable. This involves determining the time and budget allocations, the subset of the organization that is to be modeled, the number of data sources selected, and the number and types of departments to be served. Once a data warehouse is designed and constructed, the initial deployment of the warehouse includes initial installation, roll-out planning, training, and orientation. Platform upgrades and Lovely Professional University 13 Notes Data Warehousing and Data Mining maintenance must also be considered. Data warehouse administration includes data refreshment, data source synchronization, planning for disaster recovery, managing access control and security, managing data growth, managing database performance, and data warehouse enhancement and extension. Scope management includes controlling the number and range of queries, dimensions, and reports; limiting the data warehouse’s size; or limiting the schedule, budget, or resources. Various kinds of data warehouse design tools are available. Data warehouse development tools provide functions to define and edit metadata repository contents (e.g., schemas, scripts, or rules), answer queries, output reports, and ship metadata to and from relational database system catalogs. Planning and analysis tools study the impact of schema changes and refresh performance when changing refresh rates or time windows. DataWarehouse Usage for Information Processing Data warehouses and data marts are used in a wide range of applications.There are three kinds of data warehouse applications: information processing, analytical processing, and data mining. Information processing supports querying, basic statistical analysis, and reporting using crosstabs, tables, charts, or graphs. A current trend in data warehouse information processing is to construct low-cost web-based accessing tools that are then integrated with web browsers. Analytical processing supports basic OLAP operations, including slice-and-dice, drill-down, roll-up, and pivoting. It generally operates on historic data in both summarized and detailed forms. The major strength of online analytical processing over information processing is themultidimensional data analysis of data warehouse data. Data mining supports knowledge discovery by finding hidden patterns and associations, constructing analytical models, performing classification and prediction, and presenting the mining results using visualization tools. “How does data mining relate to information processing and online analytical processing?” Information processing, based on queries, can find useful information. However, answers to such queries reflect the information directly stored in databases or computable by aggregate functions. They do not reflect sophisticated patterns or regularities buried in the database. 
Therefore, information processing is not data mining. Because data mining systems can also mine generalized class/concept descriptions, this raises some interesting questions: “Do OLAP systems perform data mining? Are OLAP systems are data mining systems?” The functionalities of OLAP and data mining can be viewed as disjoint: OLAP is a data summarization/aggregation tool that helps simplify data analysis, while data mining allows the automated discovery of implicit patterns and interesting knowledge hidden in large amounts of data. OLAP tools are targeted toward simplifying and supporting interactive data analysis, whereas the goal of data mining tools is to automate as much of the process as possible, while still allowing users to guide the process. In this sense, data mining goes one step beyond traditional online analytical processing. Data mining is not confined to the analysis of data stored in data warehouses. It may analyze data existing at more detailed granularities than the summarized data provided in a data warehouse. It may also analyze transactional, spatial, textual, and multimedia data that are difficult to model with current multidimensional database technology. In this context, data mining covers a broader spectrum than OLAP for data mining functionality and the complexity of the data handled. 1.12 From Online Analytical Processing to Multidimensional Data Mining The data mining field has conducted substantial research regarding mining on various data types, including relational data, data from data warehouses, transaction data, time-series data, spatial data, text data, and flat files. Multidimensional data mining (also known as exploratory multidimensional data mining, online analytical mining, or OLAM) integrates OLAP with data mining to uncover knowledge in multidimensional databases. Among the many different paradigms and architectures of data mining systems, multidimensional data mining is particularly important for the following reasons: High quality of data in data warehouses: Most data mining tools need to work on integrated, consistent, and cleaned data, which requires costly data cleaning, data integration, and data transformation as preprocessing steps. A data warehouse constructed by such preprocessing serves as a valuable source of high-quality data for OLAP as well as for data mining. Notice that data mining may serve as a valuable tool for data cleaning and data integration as well. Available information processing infrastructure surrounding data warehouses: Comprehensive information processing and data analysis infrastructures have been or will be systematically constructed 14 Lovely Professional University Unit 1: Data Warehousing and Online Analytical Processing Notes surrounding data warehouses, which include access, integration, consolidation, and transformation of multiple heterogeneous databases, ODBC/OLEDB connections, Web access, and service facilities, and reporting and OLAP analysis tools. It is prudent to make the best use of the available infrastructures rather than constructing everything from scratch. OLAP-based exploration of multidimensional data: Effective data mining needs exploratory data analysis. A user will often want to traverse through a database, select portions of relevant data, and analyze them at different granularities, and present knowledge/results in different forms. 
Multidimensional data mining provides facilities for mining on different subsets of data and at varying levels of abstraction—by drilling, pivoting, filtering, dicing, and slicing on a data cube and/or intermediate data mining results. This, together with data/knowledge visualization tools, greatly enhances the power and flexibility of data mining. Online selection of data mining functions: Users may not always know the specific kinds of knowledge they want to mine. By integrating OLAP with various data mining functions, multidimensional data mining provides users with the flexibility to select desired data mining functions and swap data mining tasks dynamically. 1.13 DataWarehouse Implementation Data warehouses contain huge volumes of data. OLAP servers demand that decision support queries be answered in the order of seconds. Therefore, it is crucial for data warehouse systems to support highly efficient cube computation techniques, access methods, and query processing techniques. Efficient Data Cube Computation: An Overview At the core of multidimensional data analysis is the efficient computation of aggregations across many sets of dimensions. In SQL terms, these aggregations are referred to as group-by’s. Each group-by can be represented by a cuboid, where the set of group-by’s forms a lattice of cuboids defining a data cube. In this subsection, we explore issues relating to the efficient computation of data cubes. The compute cube Operator and the Curse of Dimensionality One approach to cube computation extends SQL to include a compute cube operator. The compute cube operator computes aggregates over all subsets of the dimensions specified in the operation. This can require excessive storage space, especially for large numbers of dimensions. We start with an intuitive look at what is involved in the efficient computation of data cubes.Suppose that you want to create a data cube for allElectronics sales contain the following: city, item, year, and sales in dollars. You want to be able to analyze the data, with queries such as the following: “Compute the sum of sales, grouping by city and item.” “Compute the sum of sales, grouping by city.” “Compute the sum of sales, grouping by item.” What is the total number of cuboids, or group-by’s, that can be computed for this data cube? Taking the three attributes, city, item, and year, as the dimensions for the data cube, and sales in dollars as the measure, the total number of cuboids, or group by’s, that can be computed for this data cube is 2^3=8. The possible group-by’s are the following: {(city, item, year), (city, item), (city, year), (item, year), (city), (item), (year),( ),} where ( )means that the group-by is empty (i.e., the dimensions are not grouped). These group-by’s form a lattice of cuboids for the data cube, as shown in Figure 10. Lovely Professional University 15 Notes Data Warehousing and Data Mining Figure 10Lattice of cuboids, making up a 3-D data cube. The base cuboid contains all three dimensions, city, item, and year. It can return the total sales for any combination of the three dimensions. The apex cuboid, or 0-D cuboid, refers to the case where the group-by is empty. It contains the total sum of all sales. The base cuboid is the least generalized (most specific) of the cuboids. The apex cuboid is the most generalized (least specific) of the cuboids and is often denoted as all. If we start at the apex cuboid and explore downward in the lattice, this is equivalent to drilling down within the data cube. 
If we start at the base cuboid and explore upward, this is akin to rolling up.
Online analytical processing may need to access different cuboids for different queries. Therefore, it may seem like a good idea to compute in advance all or at least some of the cuboids in a data cube. A major challenge related to this pre-computation, however, is that the required storage space may explode if all the cuboids in a data cube are pre-computed, especially when the cube has many dimensions. The storage requirements are even more excessive when many of the dimensions have associated concept hierarchies, each with multiple levels. This problem is referred to as the curse of dimensionality.
"How many cuboids are there in an n-dimensional data cube?" If there were no hierarchies associated with each dimension, then the total number of cuboids for an n-dimensional data cube, as we have seen, is 2^n. However, in practice, many dimensions do have hierarchies. For example, time is usually explored not at only one conceptual level (e.g., year), but rather at multiple conceptual levels such as in the hierarchy "day < month < quarter < year".
[…] programs menu. For Mac: download the .dmg and add it to your Applications folder.
RapidMiner Products
There are many products of RapidMiner that are used to perform multiple operations. Some of the products are:
RapidMiner Studio: With RapidMiner Studio, one can access, load, and analyze both traditional structured data and unstructured data like text, images, and media. It can also extract information from these types of data and transform unstructured data into structured data.
RapidMiner Auto Model: Auto Model is an extension of RapidMiner Studio that accelerates the process of building and validating data models. You can customize the processes and put them into production based on your needs. Three main kinds of problems can be addressed with Auto Model: prediction, clustering, and outlier detection.
RapidMiner Turbo Prep: Data preparation is time-consuming, and RapidMiner Turbo Prep is designed to make the preparation of data much easier. It provides a user interface where your data is always visible front and center, where you can make changes step by step and instantly see the results, with a wide range of supporting functions to prepare the data for model building or presentation.
TOOL CHARACTERISTICS
Usability: Easy to use.
Tool orientation: The tool is designed for general-purpose analysis.
Data mining type: The tool supports structured data mining, text mining, image mining, audio mining, video mining, data gathering, and social network analysis.
Manipulation type: The tool is designed for data extraction, data transformation, data analysis, data visualization, data conversion, and data cleaning.
5.3 Features of RapidMiner
Application & Interface: RapidMiner Studio is a visual data science workflow designer accelerating the prototyping and validation of models.
Data Access: With RapidMiner Studio, you can access, load, and analyze any type of data – both traditional structured data and unstructured data like text, images, and media. It can also extract information from these types of data and transform unstructured data into structured data.
Data Exploration: Immediately understand the data and create a plan to prepare it; automatically extract statistics and key information.
Data Prep: The richness of the data preparation capabilities in RapidMiner Studio can handle any real-life data transformation challenge, so you can format and create the optimal data set for predictive analytics.
RapidMiner Studio can blend structured with unstructured data and then leverage all the data for predictive analysis. Any data preparation process can be saved for reuse. Modeling: RapidMiner Studio comes equipped with an un-paralleled set of modeling capabilities and machine learning algorithms for supervised and unsupervised learning. They are flexible, robust and allow you to focus on building the best possible models for any use case. 74 Lovely Professional University Unit 05: Introduction to Mining Tools Notes Validation:RapidMiner Studio provides the means to accurately and appropriately estimate model performance. Where other tools tend to too closely tie modeling and model validation, RapidMiner Studio follows a stringent modular approach which prevents information used in pre-processing steps from leaking from model training into the application of the model. This unique approach is the only guarantee that no overfitting is introduced and no overestimation of prediction performances can occur. Scoring:RapidMiner Studio makes the application of models easy, whether you are scoring them in the RapidMiner platform or using the resulting models in other applications. Code Control &Management:Unlike many other predictive analytics tools, RapidMiner Studio covers even the trickiest data science use cases without the need to program. Beyond all the great functionality for preparing data and building models, RapidMiner Studio has a set of utility-like process control operations that lets you build processes that behave like a program to repeat and loop over tasks, branch flows and call on system resources. RapidMiner Studio also supports a variety of scripting languages. 5.4 How does it work? You visually design a data mining process. A process is like a flow chart for mining operators. Figure 1: Working of RapidMiner All data that you load will be contained in an example set. Each example is described by Attributes (a.k.a. features) and all attributes have Value Types and their specific Roles. The Value types define how data is treated Numeric data has an order (2 is closer to 1 than to 5) Nominal data has no order (red is as different from green as from blue) Table 1: Different Value Type Value Type Description binominal Only two different values are permitted polynomial More than two different values are permitted integer Whole numbers, positive and negative real Real numbers, positive and negative date_time Date as well as time date Only date-time Only time Lovely Professional University 75 Notes Data Warehousing And Data Mining Roles define how the attribute is treated by the Operators. Table 2: Types of Roles Role Description Id A unique identifier, no two examples in an example set can have the same value Regular (default) Regular attribute that contains data Label The target attribute for classification tasks Weight The weight of the Examples concerning the label Cluster Created by RapidMiner as the result of a clustering task Prediction Created by RapidMiner as the result of a classification task The Repository This is where you store your data and processes.Only if you load data from the repository, RapidMiner can show you which attributes exist. Add data via the “Add Data” button or the “Store” operator. You can load data via drag ‘n’ drop or the “Retrieve” operator. 
If you have a question starting with “Why does RapidMiner not show me …?” Then the answer most likely is “Because you did not load your data into the Repository!” Figure2:Reository A repository is simply a folder that holds all of your RapidMiner data sets (we call them "ExampleSets), processes, and other file objects that you will create using RapidMiner Studio. This folder can be stored locally on your computer, or on a RapidMiner Server. Figure3: Store Operator 76 Lovely Professional University Unit 05: Introduction to Mining Tools Notes This operator stores an IO Object at a location in the data repository. The location of the object to be stored is specified through the repository entry parameter. The stored object can be used by other processes by using the Retrieve operator. Figure 4: Retrieve Operator Example: Themost simple process which can be created with RapidMiner: It just retrieves data from a repository and delivers it as a result to the user to allow for inspection. Benefits of RapidMiner The main benefits of RapidMiner are its robust features, user-friendly interface, and maximization of data usage. Learn more of its benefits below: Robust features and user-friendly interface RapidMiner’s tools and features offer powerful capabilities for the users while at the same time are presented through a user-friendly interface that allows users to perform productively in their works from the start. Thus, each of the tools’ robust components is easy to be operated. One feature of the system is the visual workflow designer which is a tool that provides a visual environment to the users. This environment is where analytics processes can be designed, created, and then deployed. Visual presentation and models can also be made and processed here. All of these can easily be done by the users because of the friendly environment. Maximize the usage of data The system provides the users with the right set of tools which makes relevant uses of even the most disorganized, uncluttered, and seemingly useless data. This can be accomplished by enabling the users and their team to structure data in an easy way for them to comprehend. To do this, RapidMiner offers capabilities that simplify data access and management that will empower users to load, access, and evaluate all types of data such as images and texts. Not only does the system allows the usage of any data but it also allows them to create models and plans out of them, which can then be used as a basis for decision making and formulation of strategies. RapidMiner has data exploration features, such as descriptive statistics and graphs and visualization, which allows users to get valuable insights out of the information they gained.RapidMiner is also powerful enough to provide analytics that is based on real-life data transformation settings. This means that users can manipulate their data any way they want since they have control of its formatting and the system. Because of this, they can create the optimal data set when performing predictive analytics. 
How to use RapidMiner Use the “Design Perspective” to create your Process See your current Process – “Process” Access your data and processes – “Repository” Add operators to the process – “Operators” Configure the operators – “Parameters” Learn about operators – “Help” Lovely Professional University 77 Notes Data Warehousing And Data Mining Figure 5:Design View Use the “Results Perspective” to inspect the output The “Data View” shows your example set The “Statistics View” contains metadata and statistics The “Visualizations View” allows you to visualize the data Figure 6:The Result View Finding an operator Once you get familiar with operator names, you can find them more easily using the filter at the top of the operator window. 78 Lovely Professional University Unit 05: Introduction to Mining Tools Notes Figure 7: Ways to find an Operator 5.5 Weka Weka is data mining software that uses a collection of machine learning algorithms. These algorithms can be applied directly to the data or called from the Java code.The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes. WEKA an open-source software that provides tools for data preprocessing, implementation of several Machine Learning algorithms, and visualization tools so that you can develop machine learning techniques and apply them to real-world data mining problems. What WEKA offers is summarized in the following diagram − Figure 8: Working of WEKA Lovely Professional University 79 Notes Data Warehousing And Data Mining If you observe the beginning of the flow of the image, you will understand that there are many stages in dealing with Big Data to make it suitable for machine learning − First, you will start with the raw data collected from the field. This data may contain several null values and irrelevant fields. You use the data preprocessing tools provided in WEKA to cleanse the data.Then, you would save the preprocessed data in your local store for applying ML algorithms. Next, depending on the kind of ML model that you are trying to develop you would select one of the options such as Classify, Cluster, or Associate. The Attributes Selection allows the automatic selection of features to create a reduced dataset.Note that under each category, WEKA provides the implementation of several algorithms. You would select an algorithm of your choice, set the desired parameters, and run it on the dataset. Then, WEKA would give you the statistical output of the model processing. It provides you a visualization tool to inspect the data.The various models can be applied to the same dataset. You can then compare the outputs of different models and select the best that meets your purpose.Thus, the use of WEKA results in quicker development of machine learning models on the whole. Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. 
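Because the text notes that Weka's algorithms "can either be applied directly to a dataset or called from your own Java code", here is a minimal, hedged sketch of the programmatic route using Weka's standard Java API. The file path data/weather.nominal.arff is an assumption for illustration; substitute any ARFF file, for example one from the data folder of the WEKA installation.

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

import java.util.Random;

public class WekaFromJava {
    public static void main(String[] args) throws Exception {
        // Load an ARFF file (the path is illustrative; any WEKA sample data set works).
        Instances data = new DataSource("data/weather.nominal.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);   // last attribute ("play") is the class label

        J48 tree = new J48();                           // C4.5-style decision tree learner
        tree.buildClassifier(data);
        System.out.println(tree);                       // textual view of the learned tree

        // 10-fold cross-validation, roughly what the Explorer's Classify tab reports.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new J48(), data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}

Running the class prints the learned decision tree and a cross-validation summary; the same steps can be reproduced interactively in the Explorer described below.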
The WEKA GUI Chooser application will start and you will see the following screen:

Figure 9: Weka GUI Chooser

The GUI Chooser allows you to run five different types of applications, as listed here:
Explorer
Experimenter
KnowledgeFlow
Workbench
Simple CLI

When you click on the Explorer button in the Applications selector, it opens the following screen:

Figure 10: Weka Explorer

At the top, you will see several tabs, as listed here:
Preprocess
Classify
Cluster
Associate
Select Attributes
Visualize

Under these tabs there are several pre-implemented machine learning algorithms. Let us look into each of them in detail now.

Preprocess Tab
Initially, when you open the Explorer, only the Preprocess tab is enabled. The first step in machine learning is to preprocess the data. In the Preprocess tab, you select the data file, process it, and make it fit for applying the various machine learning algorithms.

Classify Tab
The Classify tab provides several machine learning algorithms for the classification of your data. To list a few, you may apply algorithms such as Linear Regression, Logistic Regression, Support Vector Machines, Decision Trees, RandomTree, RandomForest, NaiveBayes, and so on. The list is extensive.

Cluster Tab
Under the Cluster tab, several clustering algorithms are provided, such as SimpleKMeans, FilteredClusterer, HierarchicalClusterer, and so on.

Associate Tab
Under the Associate tab, you will find Apriori, FilteredAssociator, and FPGrowth.

Select Attributes Tab
Select Attributes allows you to perform feature selection based on several evaluators and search methods, such as ClassifierSubsetEval, PrincipalComponents, etc.

Visualize Tab
The Visualize tab allows you to visualize your processed data for analysis.

WEKA provides several ready-to-use algorithms for testing and building your machine learning applications. To use WEKA effectively, you must have a sound knowledge of these algorithms: how they work, which one to choose under what circumstances, and what to look for in their output. In short, you need a solid foundation in machine learning to use WEKA effectively in building your applications.

We start with the first tab, which is used to preprocess the data. This step is common to all algorithms that you will apply to your data for building a model, and it is a common step for all subsequent operations in WEKA. For a machine learning algorithm to give acceptable accuracy, you must cleanse your data first, because the raw data collected from the field may contain null values, irrelevant columns, and so on.

First, you will learn to load the data file into the WEKA Explorer. The data can be loaded from the following sources:
Local file system
Web
Database

Analyze your data with WEKA Explorer using various learning schemes and interpret the received results. We will look at all three options for loading data in detail.

Loading Data from Local File System
Just under the machine learning tabs described above, you will find the following three buttons:
Open file...
Open URL...
Open DB...

Click on the Open file... button.
A directory navigator window opens, as shown in the following screen:

Figure 11: Opening a File using Explorer

Now, navigate to the folder where your data files are stored. The WEKA installation comes with many sample databases for you to experiment with; these are available in the data folder of the WEKA installation. For learning purposes, select any data file from this folder. The contents of the file will be loaded into the WEKA environment. We will soon learn how to inspect and process this loaded data; before that, let us look at how to load a data file from the Web.

Developed by the University of Waikato, New Zealand, Weka stands for Waikato Environment for Knowledge Analysis.

Loading Data from Web
Once you click on the Open URL... button, you will see a window as follows:

Figure 12: Loading Data from Web

We will open a file from a public URL. Type the following URL in the popup box:
https://storm.cis.fordham.edu/~gweiss/data-mining/weka-data/weather.nominal.arff
You may specify any other URL where your data is stored. The Explorer will load the data from the remote site into its environment.

Loading Data from DB
Once you click on the Open DB... button, you will see a window as follows:

Figure 13: Loading of Data Using DB

Set the connection string to your database, set up the query for data selection, process the query, and load the selected records into WEKA. WEKA supports a large number of file formats for the data; the types of files that it supports are listed in the drop-down list box at the bottom of the screen, as shown in the screenshot below.

Figure 14: Files Supported by Weka

As you will notice, it supports several formats, including CSV and JSON. The default file type is ARFF.

Arff Format
An ARFF file contains two sections: header and data. The header describes the attribute types; the data section contains a comma-separated list of data rows. As an example of the ARFF format, the weather data file loaded from the WEKA sample databases is shown below.

Figure 15: ARFF File

From the screenshot, you can infer the following points:
The @relation tag defines the name of the database.
The @attribute tag defines the attributes.
The @data tag starts the list of data rows, each containing comma-separated fields.
Attributes can take nominal values, as in the case of outlook shown here:
@attribute outlook {sunny, overcast, rainy}
Attributes can take real values, as in this case:
@attribute temperature real
You can also set a target or class variable, here called play:
@attribute play {yes, no}
The target assumes one of two nominal values, yes or no.

Create your own ARFF file and load it using Weka (a condensed sketch of such a file is shown after this section).

Figure 16: Opening a file using Weka

When you open the file, your screen looks as shown here:

Figure 17: Pre-processing of Data

Understanding Data
Let us first look at the highlighted Current relation sub-window. It shows the name of the database that is currently loaded. You can infer two points from this sub-window:
There are 14 instances – the number of rows in the table.
The table contains 5 attributes – the fields, which are discussed in the upcoming sections.
On the left side, notice the Attributes sub-window that displays the various fields in the database.
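To make the ARFF structure concrete, here is a condensed sketch of a weather-style file in the nominal form described above. It is illustrative rather than a verbatim copy of the sample that ships with WEKA: the relation name and the exact rows may differ from the installed file, and only a few of the 14 data rows are shown.

    @relation weather.nominal

    @attribute outlook {sunny, overcast, rainy}
    @attribute temperature {hot, mild, cool}
    @attribute humidity {high, normal}
    @attribute windy {TRUE, FALSE}
    @attribute play {yes, no}

    @data
    sunny,hot,high,FALSE,no
    sunny,hot,high,TRUE,no
    overcast,hot,high,FALSE,yes
    rainy,mild,high,FALSE,yes
    rainy,cool,normal,TRUE,no

Saving lines like these in a plain-text file with the .arff extension and opening it via Open file... is enough for the Explorer to recognise the five attributes; by default, the Classify tab then proposes the last attribute (play) as the class.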
Figure 18: Selected Relation Attributes

The weather database contains five fields: outlook, temperature, humidity, windy, and play. When you select an attribute from this list by clicking on it, further details of that attribute are displayed on the right-hand side. Let us select the temperature attribute first. When you click on it, you will see the following screen:

Figure 19: Statistics of Selected Attributes

In the Selected Attribute sub-window, you can observe the following:
The name and the type of the attribute are displayed. The type of the temperature attribute is Nominal.
The number of missing values is zero.
There are three distinct values and no unique value.
The table underneath this information shows the nominal values for this field as hot, mild, and cool. It also shows the count and weight (as a percentage) for each nominal value.
At the bottom of the window, you see a visual representation of the class values.

If you click on the Visualize All button, you will be able to see all features in one single window, as shown here:

Figure 20: Visualization of Selected Data

Removing Attributes
Often, the data that you want to use for model building comes with many irrelevant fields. For example, a customer database may contain the customer's mobile number, which is not relevant when analyzing their credit rating.

Figure 21: Attribute Removal

To remove attributes, select them and click on the Remove button at the bottom. The selected attributes will be removed from the database. After you have fully preprocessed the data, you can save it for model building. Next, you will learn to preprocess the data by applying filters to it.

Applying Filters
Some machine learning techniques, such as association rule mining, require categorical data. To illustrate the use of filters, we will use the weather.numeric.arff database, which contains two numeric attributes: temperature and humidity. We will convert these to nominal by applying a filter to our raw data. Click on the Choose button in the Filter sub-window and select the following filter:
weka→filters→supervised→attribute→Discretize

Figure 22: Application of Filters

After you are satisfied with the preprocessing of your data, save the data by clicking the Save... button. You will use this saved file for model building.

5.6 Comparison between RapidMiner and Weka
In this comparative study, we concentrate on two of the commonly used tools: RapidMiner and Weka.

Features for Comparative Study
The comparative study is based on the following features:
Usability
Speed
Visualization
Algorithms supported
Data Set Size
Memory Usage
Primary Usage
Interface Type Supported

Table 3: Comparison of RapidMiner and Weka
Usability: RapidMiner is easy to use; Weka is the easiest to use.
Speed: RapidMiner requires more memory to operate; Weka works faster on any machine.
Visualization: RapidMiner offers more options; Weka offers fewer options.
Algorithms supported: both support classification and clustering algorithms.
Data Set Size: RapidMiner supports both small and large data sets; Weka supports only small data sets.
Memory Usage: RapidMiner requires more memory; Weka requires less memory and hence works faster.
Primary Usage: RapidMiner is used for data mining and predictive analysis; Weka is used for machine learning.
Interface type supported: RapidMiner offers a GUI; Weka offers both a GUI and a CLI.

Summary
A perspective consists of a freely configurable selection of individual user interface elements, the so-called views. RapidMiner will sometimes also ask you automatically whether you want to switch to another perspective.
All work steps or building blocks for different data transformation or analysis tasks are called operators in RapidMiner. These operators are presented in groups in the Operators view on the left side.
One of the first steps in a process for data analysis is usually to load some data into the system. RapidMiner supports multiple methods for accessing datasets; it is always recommended to use the repository whenever possible instead of files.
Open Recent Process opens the process that is selected in the list below the actions. Alternatively, you can open this process by double-clicking inside the list.
WEKA supports many different standard data mining tasks such as data pre-processing, classification, clustering, regression, visualization, and feature selection.
The WEKA application gives novice users a tool to identify hidden information in database and file systems, with simple-to-use options and visual interfaces.

Keywords
Process: A connected set of Operators that helps you to transform and analyze your data.
Port: To build a process, you must connect the output from each Operator to the input of the next via a port.
Repository: Your central data storage entity. It holds connections, data, processes, and results, either locally or remotely.
Operators: The elements of a Process; each Operator takes input and creates output, depending on the choice of parameters.
Filters: Processes that transform instances and sets of instances are called filters.
New Process: Starts a new analysis process.

Self Assessment Questions
1) __________________ is the central RapidMiner perspective where all analysis processes are created and managed.
a) Design Perspective
b) Result Perspective
c) Welcome Perspective
d) All of the above

2) ____________________ opens the repository browser and allows you to select a process to be opened within the Design Perspective.
a) Open Template
b) Online Tutorial
c) Open Process
d) Open Recent Process

3) Which of the following contains a large number of operators for writing data and objects into external formats such as files, databases, etc.?
a) Data Transformation
b) Export
c) Evaluation
d) Import

4) RapidMiner comes with:
a) Over 125 mining algorithms
b) Over 100 data cleaning and preparation functions
c) Over 30 charts for data visualization
d) All of the above

5) ___________________ is designed to make the preparation of data much easier.
a) RapidMiner Auto Model
b) RapidMiner Turbo Prep
c) RapidMiner Studio
d) None

6) Which of the following kinds of problems can be resolved with Auto Model?
a) Prediction
b) Clustering
c) Outliers
d) All of the above

7) The ________________ allows the automatic selection of features to create a reduced dataset.
a) Attributes Selection
b) Cluster, or Associate
c) Classification
d) None

8) Which of the following is/are features of RapidMiner?
a) Application & Interface
b) Data Access
c) Data Exploration
d) All

9) Which of the following is/are features of RapidMiner?
a) Attribute selection
b) Experiments
c) Workflow and visualization
d) All

10) Which of the following is a data mining tool?
a) RapidMiner
b) Weka
c) Both a and b
d) None

11) Two fundamental goals of Data Mining are ________.
a) Analysis and Description
b) Data cleaning and organizing the data
c) Prediction and Description
d) Data cleaning and organizing the data

12) What does WEKA stand for?
a) Waikato Environment for Knowledge Analysis
b) Waikato Environmental for Knowledge Learning
c) Waikato Environment for Knowledge Learn
d) None

13) Which of the following statements about query tools is correct?
a) Tools developed to query the database
b) Attributes of a database table that can take only numerical values
c) Both a and b
d) None of the above

14) Which one of the following refers to a binary attribute?
a) The natural environment of a certain species
b) Systems that can be used without knowledge of internal operations
c) An attribute that takes only two values; in general, these values are 0 and 1 and can be coded as one bit
d) All of the above

15) Data mining tools that exist in the market are:
a) Weka
b) Rapid Miner
c) Orange
d) All of the above

Answers: Self Assessment
1. A 2. C 3. B 4. D 5. B
6. D 7. A 8. D 9. D 10. C
11. C 12. A 13. A 14. C 15. D

Review Questions
Q1) Explain the different perspectives available in RapidMiner Studio.
Q2) Differentiate between RapidMiner and Weka by considering different features.
Q3) Explain the various functions available under Explorer in Weka.
Q4) Elaborate on the RapidMiner GUI in detail.
Q5) Write down the different methods of creating a repository in RapidMiner.

Further Readings
Witten, I. H., Frank, E., Trigg, L. E., Hall, M. A., Holmes, G., & Cunningham, S. J. (1999). Weka: Practical machine learning tools and techniques with Java implementations.
Markov, Z., & Russell, I. (2006). An introduction to the WEKA data mining system. ACM SIGCSE Bulletin, 38(3), 367-368.
Kotu, V., & Deshpande, B. (2014). Predictive analytics and data mining: Concepts and practice with RapidMiner. Morgan Kaufmann.
Hofmann, M., & Klinkenberg, R. (Eds.). (2016). RapidMiner: Data mining use cases and business analytics applications. CRC Press.

Web Links
https://www.cs.waikato.ac.nz/ml/weka/
https://www.analyticsvidhya.com/learning-paths-data-science-business-analytics-business-intelligence-big-data/weka-gui-learn-machine-learning/
https://storm.cis.fordham.edu/~gweiss/data-mining/weka.html
https://docs.rapidminer.com/9.5/server/configure/connections/creating-other-conns.html
https://docs.rapidminer.com/latest/studio/connect/

Unit 06: Extracting Data Sets
Dr. Rajni Bhalla, Lovely Professional University

CONTENTS
Objectives
Introduction
6.1 Read Excel
6.2 Store Operator
6.3 Retrieve Operator
6.4 Graphical Representation of Data in RapidMiner
Summary
Keywords
Self Assessment Questions
Answer for Self Assessment
Review Questions
Further Readings

Objectives
After this lecture, you will be able to:
Know the process of accessing and loading information from the Repository into a Process using the Retrieve operator.
Implement the Store operator to store data and models.
Apply various methods of visualizing data.
Create a new repository and use an existing repository.

Introduction
Importing data into your repository is easy, whether you follow the guided dialogue or use the drag-and-drop feature. Simply drag the file from your file browser onto the canvas and follow the on-screen instructions.
Check that the data types are right and that the target or label attribute is correctly flagged. A significant difference from other ways of opening data is that this is an "import, then open" approach: the data is first imported into the repository and then opened from there.

6.1 Read Excel
The Read Excel operator reads an ExampleSet from the specified Excel file.

Figure 1: Read Excel operator

Data from Microsoft Excel spreadsheets can be loaded using this operator. It can read Excel 95, 97, 2000, XP, and 2003 data. The user must specify which of the workbook's spreadsheets is to be used as the data table. Each row must represent an example, and each column must represent an attribute of the table. Please keep in mind that the first row of the Excel sheet can be used for attribute names, which is controlled by a parameter. The data table can be placed anywhere on the sheet and can include formatting instructions, as well as empty rows and columns. Empty cells, or cells containing only "?", can be used to indicate missing data values in Excel.

The Design perspective is your creative canvas and the place where you will spend the majority of your time. You will combine operators into data mining processes here.

Figure 2: Design view

To make a new folder in the Repository view:
1. Right-click on Local Repository.
2. Select Create Folder from the context menu.
3. Give the new folder a name, such as Getting Started, and then click OK.
Repeat the procedure from the previous step to add a data folder and a processes folder. Your Repositories view should look similar to this:

Figure 3: Repository View

The easiest and shortest way to import an Excel file is to use the import configuration wizard from the Parameters panel. An alternative approach, which can take a little more time, is to set all of the parameters in the Parameters panel manually instead of using the wizard. Before building a process that uses the Excel file, please double-check that it can be read correctly. To get started with RapidMiner Studio, build a local repository on your computer.
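As with the Retrieve example in the previous unit, a configured Read Excel step is ultimately stored as part of the process XML. The sketch below shows roughly what a process with a single Read Excel operator might look like. It is an illustrative assumption rather than an export from a specific RapidMiner version: the file path and sheet number are invented, and the exact parameter keys (excel_file, sheet_number, first_row_as_names) may differ slightly between versions, so configure the operator through the import wizard and inspect the parameters it generates rather than typing them by hand.

    <?xml version="1.0" encoding="UTF-8"?>
    <process version="9.10.001">
      <operator activated="true" class="process" expanded="true" name="Process">
        <process expanded="true">
          <!-- Read Excel: loads an ExampleSet from a spreadsheet file -->
          <operator activated="true" class="read_excel" expanded="true" name="Read Excel">
            <!-- Illustrative values: adjust the path and sheet to your own workbook -->
            <parameter key="excel_file" value="C:/data/sales.xlsx"/>
            <parameter key="sheet_number" value="1"/>
            <parameter key="first_row_as_names" value="true"/>
          </operator>
          <connect from_op="Read Excel" from_port="output" to_port="result 1"/>
        </process>
      </operator>
    </process>

Running such a process delivers the loaded ExampleSet to the result port, where you can inspect it in the Results perspective before storing it in your repository with the Store operator.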