
Data Warehouse and Data Mining Final BA-1.pdf



DATA WAREHOUSE AND DATA MINING: UNIT II

OLTP, OR ONLINE TRANSACTION PROCESSING

OLTP, or Online Transaction Processing, refers to a class of systems that manage transaction-oriented applications, typically for data entry and retrieval transaction processing. Characteristics of OLTP systems:

1. Real-time processing: OLTP systems process transactions in real time; as soon as a transaction occurs, it is processed immediately.
2. Transactional integrity: They ensure the integrity and consistency of the data, typically through the ACID properties (Atomicity, Consistency, Isolation, Durability), which guarantee that transactions are processed reliably.
3. High volume of short transactions: OLTP systems are optimized to handle a large number of short, simple transactions such as insertions, updates, and deletions. Examples include banking transactions, order entry, and retail sales.
4. Concurrency control: They must efficiently handle many users concurrently accessing and modifying the data. Techniques such as locking and multiversion concurrency control (MVCC) are used to manage concurrent access.
5. Availability and reliability: OLTP systems are mission-critical and need to be highly available and reliable; downtime can lead to significant disruptions and losses.
6. Normalized database schema: OLTP databases are typically highly normalized to reduce data redundancy and ensure data integrity.

EXAMPLES OF OLTP APPLICATIONS

- Banking systems: processing transactions such as deposits, withdrawals, transfers, and payments.
- Retail point-of-sale (POS) systems: handling sales transactions, inventory updates, and customer information.
- Reservation systems: managing bookings for airlines, hotels, and car rentals.
- Order processing systems: managing customer orders, shipments, and invoices.
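A minimal sketch of transactional integrity, using Python's built-in sqlite3 module; the accounts table and the transfer scenario are invented for illustration. The point is atomicity: either both legs of the transfer commit, or neither does.

```python
import sqlite3

# Hypothetical bank accounts in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 500.0), (2, 100.0)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Move `amount` from account `src` to `dst` as one atomic transaction."""
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?",
                     (amount, src))
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?",
                     (amount, dst))

transfer(conn, 1, 2, 50.0)
print(conn.execute("SELECT * FROM accounts").fetchall())  # [(1, 450.0), (2, 150.0)]
```

If either UPDATE fails, the `with conn:` block rolls the whole transaction back, so the data never shows a half-completed transfer.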

ONLINE ANALYTICAL PROCESSING (OLAP)

Online analytical processing (OLAP) is software technology you can use to analyze business data from different points of view. Organizations collect and store data from multiple data sources, such as websites, applications, smart meters, and internal systems. OLAP combines and groups this data into categories to provide actionable insights for strategic planning.

IMPORTANCE OF OLAP

- Faster decision making
- Support for non-technical users
- An integrated view of data

OLAP APPLICATIONS

The various applications of OLAP in different functional areas:

- Marketing: market research analysis, sales forecasting, customer and promotion analysis
- Finance: budgeting, financial modeling, and performance analysis
- Sales: sales forecasting and analysis
- Manufacturing: production planning and defect analysis

FEATURES OF OLAP

The important features of OLAP are given as follows:
- Multi-dimensional view of data
- Time intelligence
- Support for complex calculations

OLAP BENEFITS

The key benefits offered by OLAP are:
- Increased productivity of end users.
- Retention of organizational control over the integrity of corporate data.
- Reduced backlog of application development for IT staff.
- Improved profitability and potential revenue.
- Reduced query drag and network traffic on the data warehouse or OLTP systems.

STRENGTHS OF OLAP

The major strengths of OLAP are listed below:
- It is a powerful visualization tool.
- OLAP tools are good for analyzing time series.
- It provides fast, interactive response times.
- OLAP tools are offered by many vendors.
- It can help in identifying outliers and clusters.
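A small sketch of the OLAP idea of grouping data into categories, assuming the pandas library is available; the sales records and column names are invented for illustration.

```python
import pandas as pd

# Hypothetical sales records: each row is one transaction (an OLTP-style fact).
sales = pd.DataFrame({
    "region":  ["North", "North", "South", "South", "South"],
    "quarter": ["Q1", "Q2", "Q1", "Q1", "Q2"],
    "revenue": [1200.0, 1500.0, 900.0, 400.0, 1100.0],
})

# An OLAP-style summary: aggregate revenue by region and quarter, giving a
# multi-dimensional view rather than individual records.
cube = sales.pivot_table(index="region", columns="quarter",
                         values="revenue", aggfunc="sum", fill_value=0)
print(cube)
```

The pivoted result is the kind of aggregated, multi-dimensional answer a manager consumes, in contrast to the record-at-a-time access of an OLTP system.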
COMPARISON BETWEEN OLTP AND OLAP

Type of users:
- OLTP: These systems are developed for office workers.
- OLAP: These systems are developed for decision makers.

Number of users:
- OLTP: The number of users can vary from hundreds to even thousands in a large organization.
- OLAP: OLAP systems may be accessible to dozens of users, or a selected group of managers in an organization.

Functions:
- OLTP: OLTP systems support the day-to-day activities of an organization and are mission-critical. They are more functionality driven.
- OLAP: OLAP systems support the decision-making functions of an organization using analytical investigations, and are management-critical.

Processing:
- OLTP: Involves repetitive processing.
- OLAP: Ad hoc, unstructured, heuristic processing.

Nature of query:
- OLTP: These systems deal with individual records; one record is processed at a time, e.g., the record of a customer who might be on the phone or in the store.
- OLAP: Many records are processed at a time, and these systems give aggregated data or a summary to a manager.

Design:
- OLTP: OLTP database systems are application-oriented.
- OLAP: OLAP systems are subject-oriented.

Database view:
- OLTP: Based on the relational model.
- OLAP: Supports a multi-dimensional view of data.

Type of queries:
- OLTP: Allows simple queries.
- OLAP: Allows complex queries.

Transaction throughput (number of transactions per second):
- OLTP: High transaction throughput.
- OLAP: Low transaction throughput.

DATA WAREHOUSE

A data warehouse is a historical database and should be considered the long-term memory of an organization. Historical data is not to be tampered with: no insertions, updates, or deletions are to be made. Usually it is used only for retrieval, such as verification and data analysis. Data warehouses are typically used for Online Analytical Processing (OLAP) to support management queries.

Tech giant IBM devised the concept of the data warehouse; they called them 'information warehouses'. Since then, data warehousing systems have evolved into one of the principal technologies employed by modern enterprises for better profitability.

HISTORICAL DEVELOPMENTS IN DATA WAREHOUSING

The present concept of data warehousing owes much to the American computer scientist Bill Inmon, known as the 'father of data warehousing'.

DEFINING DATA WAREHOUSING

According to Bill Inmon, a data warehouse can be defined as 'a subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management's decision-making process.'

TERMS USED IN THE DEFINITION

- Subject-oriented: Data warehouses are subject-oriented; they are built around the major data entities or subjects of an organization.
- Integrated: A data warehouse integrates (combines) data from multiple systems to provide a wide view of an enterprise's data; hence it is said to be 'integrated'.
- Time-variant: Unlike an operational data store, the data in a data warehouse is not always up to date; it contains historical data that is valid or accurate up to some point in time (or time interval).
- Non-volatile: A data warehouse is described as a long-term enterprise memory because of its non-volatile nature: the data is not updated in real time but is refreshed on a regular basis. Hence, new data is added as a supplement to the database rather than a replacement; the database continually absorbs this new data, incrementally integrating it with the previous data.
DATA WAREHOUSE ARCHITECTURE

Every data warehouse has three fundamental components:
- Load manager
- Warehouse manager
- Query manager (also called the data access manager)

FUNCTIONS OF THE LOAD MANAGER

- Data extraction
- Data transformation
- Data loading
- Scheduling and automation
- Performance optimization
- Data quality management
- Error handling and logging

FUNCTIONS OF THE WAREHOUSE MANAGER

1. ETL management
2. Data warehouse architecture
3. Data governance
4. Reporting and analytics support
5. Documentation and training

FUNCTIONS OF THE DATA ACCESS MANAGER

- Role-based access control (RBAC)
- Least-privilege principle
- Regular monitoring and auditing
- Data encryption and masking
- User training and awareness
- Collaboration and communication

LOAD MANAGER

The load manager is responsible for data collection from operational systems. It also converts the data into a usable form to be further utilized by the user. It includes all the programs and application interfaces required for extracting data from the operational systems, preparing it, and finally loading it into the data warehouse itself. It should perform the following tasks (a minimal sketch follows this list):
- Data identification
- Data validation for accuracy
- Data extraction from the original source
- Data cleansing
- Data formatting
- Data standardization (i.e., bringing data into conformity with some standard format)
- Consolidation of data from multiple sources into one place
- Establishment of data integrity using integrity constraints
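A minimal sketch of the extract-transform-load flow a load manager performs, in plain Python with sqlite3; the source table, field names, and cleansing rules are all hypothetical.

```python
import sqlite3

# Hypothetical operational source and warehouse target, both in-memory here.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER, customer TEXT, amount TEXT)")
source.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                   [(1, "  Alice ", "100.50"), (2, "BOB", "n/a"), (3, "carol", "75")])

warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE fact_orders (id INTEGER PRIMARY KEY, "
                  "customer TEXT NOT NULL, amount REAL NOT NULL)")

# Extract: pull raw rows from the operational system.
rows = source.execute("SELECT id, customer, amount FROM orders").fetchall()

# Transform: cleanse and standardize (trim and case-fold names, validate amounts).
clean = []
for oid, customer, amount in rows:
    try:
        clean.append((oid, customer.strip().title(), float(amount)))
    except ValueError:
        pass  # data validation: drop rows whose amount is not numeric

# Load: insert the conformed rows; the constraints enforce data integrity.
with warehouse:
    warehouse.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", clean)

print(warehouse.execute("SELECT * FROM fact_orders").fetchall())
# [(1, 'Alice', 100.5), (3, 'Carol', 75.0)]
```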
WAREHOUSE MANAGER

The warehouse manager is the main part of a data warehousing system, as it holds the massive amount of information from myriad sources. It organizes data so that it becomes easy for anyone to analyze it or find the required information. It is the core of the data warehouse itself. It maintains three levels of information: detailed, lightly summarized, and highly summarized. It also maintains metadata, i.e., data about data.

QUERY MANAGER

Finally, the query manager is the interface that connects end users with the information stored in the data warehouse through specialized end-user tools, known as data mining access tools. The market today is flooded with such tools, which have common functionalities and a provision for customizing more features specific to an enterprise. They fall into various categories such as query and reporting, statistics, and data discovery.

BENEFITS OF DATA WAREHOUSING

- Potential high ROI (return on investment): Investing in data warehousing is itself a very big investment, but past reports suggest ROI growth of up to 400% with data warehousing, making it valuable for business.
- Unbeatable competitive advantage: Implementing data warehousing can give companies a competitive edge over their rivals. With data warehousing, companies can discover previously unavailable facts and figures, trends, and untapped information. Such new revelations enhance the quality of decisions.
- High productivity in corporate decision making and business intelligence: Data warehousing combines data from multiple sources into meaningful information that managers can analyze and refer to in order to improve their decisions for the organization.
- Cost effectiveness: With data warehousing, it is possible to streamline the organization, thereby reducing overheads and so reducing product costs.
- Enhanced customer service: Data warehousing provides essential support when communicating with customers, and so helps improve customer satisfaction and retention.

PROBLEMS OR LIMITATIONS OF DATA WAREHOUSING

- Underestimation of resources for data ETL
- Erroneous source systems
- Required data not captured
- Increased end-user queries or demands
- Loss of information during data homogenization
- High demand for resources
- Data ownership
- Necessary maintenance
- Long-duration projects
- Complexity of integration

DATA MARTS

A department-specific data warehouse is termed a 'data mart': a small, localized data warehouse built for a single purpose, usually to cater to the needs of a group of users or a department in an organization. Thus, a data mart can be defined as 'a specialized, subject-oriented, integrated, time-variant, volatile data store in support of a specific subset of management's decisions.' In a simplified way, we can define data marts as 'a subset of the data warehouse that supports the requirements of a particular department or business function'.

DIFFERENCES BETWEEN DATA MART AND DATA WAREHOUSE

- Data marts usually focus on the data requirements of a specific department rather than the whole organization.
- Detailed information is not contained in data marts (unlike data warehouses).
- They are easy to navigate, transfer, and explore compared to data warehouses, which work on large volumes of data.

ADVANTAGES OF DATA MARTS

- With data marts the user gets pertinent, to-the-point data.
- Data marts respond quickly.
- Data operations such as cleaning, loading, transformation, and integration are far easier and cheaper, as data marts work with low volumes of data.
- Implementing and setting up a data mart is simpler than implementing a data warehouse for the whole enterprise.
- Compared to a data warehouse, implementing a data mart is a lot more economical.
- The potential users of a data mart can be grouped in a better way, rather than involving large numbers of unnecessary members.
- Data marts are designed on the basis that there is no need to serve the entire enterprise; therefore, a department can summarize, select, and structure its own data independently.
- Data marts allow each department to work on a specific piece of historical data rather than the whole data.
- Departments can customize software for their data mart as per their needs.
- Data marts are cost efficient.

LIMITATIONS OF DATA MARTS

- Once in operation, inherent limitations in design make it cumbersome to extend their scope to other departments.
- Data integration problems are encountered often.
- When a data mart develops multiple dimensions, scalability problems become common.

DATA MINING

Data mining is defined as follows: 'Data mining is a collection of techniques for efficient automated discovery of previously unknown, valid, novel, useful and understandable patterns in large databases. The patterns must be actionable so they may be used in an enterprise's decision making.'

From this definition, the important takeaways are:
- Data mining is a process of automated discovery of previously unknown patterns in large volumes of data.
- This large volume of data is usually the historical data of an organization, known as the data warehouse.
- Data mining deals with large volumes of data, in gigabytes or terabytes, and sometimes as much as zettabytes (in the case of big data).
- Patterns must be valid, novel, useful, and understandable.
- Data mining allows businesses to determine historical patterns to predict future behaviour.
- Although data mining is possible with smaller amounts of data, the bigger the data, the better the accuracy of prediction.
- There is considerable hype about data mining at present, and the Gartner Group has listed data mining as one of the top ten technologies to watch.

REASONS FOR THE POPULARITY OF DATA MINING

- Growth in the generation and storage of corporate data
- Need for sophisticated decision making
- Evolution of technology
- Availability of much cheaper storage, easier data collection, and better database management for data analysis and understanding
- Point-of-sale terminals and bar codes on many products, railway bookings, educational institutions, mobile phones, electronic gadgets, e-commerce, etc., all generate data
- Great volumes of data generated by the recent prevalence of Internet banking, ATMs, credit and debit cards; medical data and hospitals; automatic toll collection on toll roads and growing air travel; passports, visas, etc.
- Decline in the cost of hard drives
- Growth in worldwide disk capacity

Thus, the need for analyzing and synthesizing information is growing in the fiercely competitive business environment of today.

WHAT CAN DATA MINING DO AND NOT DO?

Data mining is a powerful tool that helps to determine the relationships and patterns within data. However, it does not work by itself, and it does not eliminate the need to understand the data, the analytical methods, and the business. Data mining extracts hidden information from the data, but it is not able to assess the value of that information. One should know the important patterns and relationships to work with data over time. In addition to discovering new patterns, data mining can also highlight empirical observations that are not instantly visible through simple observation. It is important to note that the relationships or patterns found through data mining are not necessarily causes of an action or behavior.

DATA MINING APPLICATIONS

Applications of data mining exist in almost every field. Some of the important applications in the finance, telecom, insurance, and retail sectors include loan/credit card approval, fraud detection, market segmentation, trend analysis, better marketing, market basket analysis, customer churn, and website design and promotion.

DATA MINING PROCESS

[Figure: the data mining process]

DATA MINING TECHNIQUES

Data mining can be classified into four major techniques, as given below:
- Predictive modeling (linear regression, logistic regression)
- Database segmentation (cluster analysis)
- Link analysis (associations discovery, sequential pattern discovery, similar time sequence discovery)
- Deviation detection

LINEAR REGRESSION

Linear regression analysis is used to predict the value of one variable based on the value of another. The variable you want to predict is called the dependent variable; the variable you use to predict it is called the independent variable. This form of analysis estimates the coefficients of the linear equation, involving one or more independent variables, that best predict the value of the dependent variable. Linear regression fits a straight line or surface that minimizes the discrepancies between predicted and actual output values. Linear regression is used in everything from the biological, behavioral, environmental, and social sciences to business.
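A minimal sketch of fitting such a line with scikit-learn, assuming that library is available; the advertising-spend data is invented for illustration. The fitted model has the form y = b0 + b1*x, where b0 is the intercept and b1 the estimated coefficient.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: monthly ad spend (independent) vs. sales (dependent).
ad_spend = np.array([[10.0], [20.0], [30.0], [40.0], [50.0]])  # in $1000s
sales    = np.array([25.0, 44.0, 67.0, 84.0, 107.0])           # in $1000s

# Fit the line that minimizes the squared discrepancies between
# predicted and actual sales.
model = LinearRegression().fit(ad_spend, sales)
print(f"intercept b0 = {model.intercept_:.2f}, slope b1 = {model.coef_[0]:.2f}")

# Predict sales for a planned spend of $35k.
print("predicted sales:", model.predict([[35.0]])[0])
```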
WHY LINEAR REGRESSION IS IMPORTANT

- Linear-regression models are relatively simple and provide an easy-to-interpret mathematical formula that can generate predictions.
- Linear regression can be applied to various areas of business and academic study.
- Linear-regression models have become a proven way to scientifically and reliably predict the future.
- Because linear regression is a long-established statistical procedure, the properties of linear-regression models are well understood, and the models can be trained very quickly.

KEY ASSUMPTIONS OF EFFECTIVE LINEAR REGRESSION

- The variables should be measured at a continuous level. Examples of continuous variables are time, sales, weight, and test scores.
- Use a scatterplot to find out quickly whether there is a linear relationship between the two variables.
- The observations should be independent of each other (that is, there should be no dependency).
- The data should have no significant outliers.
- Check for homoscedasticity, a statistical property in which the variances along the best-fit regression line remain similar all along that line.
- The residuals (errors) of the best-fit regression line should follow a normal distribution.

APPLICATIONS OF LINEAR REGRESSION IN BUSINESS DECISIONS

- Sales forecasting
- Pricing strategy
- Customer retention
- Risk management
- Inventory management
- Advertising and promotion effectiveness
- Human resources management
- Market trend analysis
- Financial analysis
- Operational efficiency

LOGISTIC REGRESSION

Logistic regression is a statistical method used for binary classification, where the outcome variable is categorical and typically has two possible outcomes (e.g., success/failure, yes/no, 1/0). It models the probability that a given input belongs to a particular category.

IMPORTANCE OF LOGISTIC REGRESSION

Logistic regression is essential in many business applications for several reasons:
- Binary outcomes: Many business decisions involve binary outcomes, making logistic regression highly relevant.
- Probability estimates: It provides probability estimates, which are crucial for risk assessment and decision making.
- Interpretability: The coefficients in logistic regression models are interpretable, allowing businesses to understand the impact of each predictor on the outcome.
- Efficiency: It is computationally efficient and can be implemented quickly, making it suitable for large datasets.

ASSUMPTIONS OF LOGISTIC REGRESSION

- Binary outcome: The dependent variable should be binary.
- Independence of observations: Each observation is independent of the others.
- Linearity of independent variables and log odds: The relationship between the independent variables and the log odds of the dependent variable is linear.
- No multicollinearity: The independent variables are not highly correlated with each other.
- Large sample size: Logistic regression requires a sufficiently large sample size to provide reliable results.

APPLICATIONS IN BUSINESS DECISIONS

- Customer churn prediction
- Marketing response
- Credit scoring
- Fraud detection
- Product recommendations
- Healthcare outcomes
- Employee attrition
- Risk management
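A minimal sketch of binary classification with scikit-learn, assuming that library is available; the churn data is invented for illustration. The model estimates p = 1 / (1 + e^-(b0 + b1*x)), the probability that an observation belongs to the positive class.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical churn data: months since last purchase vs. churned (1) or not (0).
months  = np.array([[1], [2], [3], [5], [8], [10], [12], [15]])
churned = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = LogisticRegression().fit(months, churned)

# Probability estimates, not just labels: useful for risk-based decisions.
for m in ([4], [9]):
    p = clf.predict_proba([m])[0, 1]
    print(f"months={m[0]}: P(churn) = {p:.2f}")
```

The probability output is what makes logistic regression valuable for decisions such as which customers to target with a retention offer.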
CLUSTER ANALYSIS

Clustering is defined as grouping a set of similar objects into classes or clusters. In other words, during cluster analysis the data is grouped into classes or clusters so that records within a cluster (intra-cluster) have high similarity with one another but are highly dissimilar to objects in other clusters.

APPLICATIONS OF CLUSTER ANALYSIS

- Marketing: It helps marketers find distinctive groups within their customer bases, and this knowledge helps them improve their targeted marketing programs.
- Land use: Clustering is used to identify areas of similar land use in databases of earth observations.
- Insurance: Clustering is helpful for recognizing groups of insurance policyholders with a high regular claim cost.
- City planning: It also helps in identifying clusters of houses based on house type, geographical location, and value.
- Earthquake studies: Clustering is helpful for the analysis of earthquakes, as it has been noticed that earthquake epicenters cluster along continental faults.
- Biology studies: Clustering helps in defining plant and animal classifications, identifying genes with similar functionality, and gaining insight into structures inherent in populations.
- Web discovery: Clustering is helpful in categorizing documents on the web for information discovery.
- Fraud detection: Clustering is also helpful in outlier-detection applications such as credit card fraud detection.

DESIRED FEATURES OF CLUSTERING

- Scalability: Clustering algorithms should be capable of handling small as well as large datasets smoothly.
- Ability to handle different types of attributes: Clustering algorithms should be able to handle different kinds of data such as binary, categorical, and interval-based (numerical) data.
- Independence from data input order: The clustering results should not depend on the ordering of the input data.
- Identification of clusters with different shapes: The clustering algorithm should be capable of identifying clusters of any shape.
- Ability to handle noisy data: Databases usually contain noisy, erroneous, or missing data, and the algorithm must be able to handle these.
- High performance: Ideally, the algorithm should need only one scan of the dataset, which reduces the cost of input/output operations.
- Interpretability: The results of clustering algorithms should be interpretable, logical, and usable.
- Ability to stop and resume: For a large dataset, it is desirable to be able to stop and resume the task, since the full task can take a huge amount of time and breaks may be necessary.
- Minimal user guidance: The clustering algorithm should not expect too much supervision from the analyst, because the analyst commonly has limited knowledge of the dataset.

In clustering, distance metrics play a vital role in measuring the similarity between objects; a small sketch using one such metric follows.
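A minimal sketch of database segmentation with k-means, assuming scikit-learn is available; the two-feature customer data is invented. K-means groups records by Euclidean distance, one of the distance metrics mentioned above.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customers described by (annual spend, visits per month).
customers = np.array([
    [200.0,  1], [220.0,  2], [250.0,  1],   # low-spend, infrequent
    [900.0,  9], [950.0, 10], [880.0,  8],   # high-spend, frequent
])

# Partition the records into 2 clusters; points within a cluster are close
# under Euclidean distance (high intra-cluster similarity).
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
print("labels:   ", km.labels_)
print("centroids:", km.cluster_centers_)
```

Each centroid summarizes a segment, which is exactly the kind of distinctive customer group the marketing application above relies on.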
LINK ANALYSIS

Link analysis aims to establish links, called associations, between individual records, or sets of records, in a database. There are three specializations of link analysis:
- Associations discovery
- Sequential pattern discovery
- Similar time sequence discovery

Associations discovery locates items that imply the presence of other items in the same event; association rules are used to define these associations (scored in the sketch after this paragraph). Sequential pattern discovery finds patterns between events such that the presence of one set of items is followed by another set of items in a database of events over a period of time. Similar time sequence discovery is used to determine whether links exist between two sets of data that are time-dependent.
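A minimal pure-Python sketch of how an association rule is scored; the basket data is invented. Support is the fraction of events containing both item sets, and confidence is the conditional probability P(rhs | lhs).

```python
# Hypothetical point-of-sale baskets (one set of items per transaction).
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]

def rule_stats(baskets, lhs, rhs):
    """Score the association rule lhs -> rhs with support and confidence."""
    n = len(baskets)
    both = sum(1 for b in baskets if lhs <= b and rhs <= b)
    lhs_count = sum(1 for b in baskets if lhs <= b)
    support = both / n                                    # P(lhs and rhs)
    confidence = both / lhs_count if lhs_count else 0.0   # P(rhs | lhs)
    return support, confidence

s, c = rule_stats(baskets, {"bread"}, {"milk"})
print(f"bread -> milk: support={s:.2f}, confidence={c:.2f}")
# bread -> milk: support=0.60, confidence=0.75
```

Rules with high enough support and confidence are the ones reported in market basket analysis.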
DECISION TREE

A decision tree is a graphical representation used to make decisions and predict outcomes. It uses a tree-like model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility.

Components of a decision tree:
- Root node: Represents the entire dataset and the starting point of the tree.
- Decision nodes: Nodes that represent decisions based on specific attributes; these nodes split into branches.
- Branches: Arrows connecting nodes, representing the decision rules.
- Leaf nodes (terminal nodes): Represent the final outcomes or decisions, providing the result of the decision path.

APPLICATIONS IN BUSINESS DECISIONS

- Customer segmentation: Identifying distinct customer groups based on purchasing behavior, demographics, or other attributes to target marketing efforts more effectively.
- Risk management: Assessing the likelihood of different risks and their impacts, aiding informed decisions about risk mitigation strategies.
- Credit scoring: Evaluating the creditworthiness of loan applicants by analyzing historical data and predicting the probability of default.
- Sales forecasting: Predicting future sales based on historical data, trends, and other relevant factors.
- Operational efficiency: Identifying inefficiencies in business processes and determining the best course of action to improve productivity.
- Product recommendations: Analyzing customer preferences and purchasing history to suggest relevant products or services.
- Strategic planning: Assisting strategic decision making by evaluating the potential outcomes of different business strategies and selecting the optimal path.

DEVIATION DETECTION

Deviation detection is a relatively new technique in terms of commercially available data mining tools. It is based on identifying outliers in the database, which indicate deviation from some previously known expectation or norm. This operation can be performed using statistical and visualization techniques. Applications of deviation detection include fraud detection in the use of credit cards and insurance claims, quality control, and defect tracing. (A small outlier-detection sketch appears at the end of this section.)

DATA MINING ISSUES AND CHALLENGES

Data quality
- Incomplete data: Missing values can lead to inaccurate models and predictions.
- Noisy data: Data with errors or outliers can distort analysis.
- Inconsistent data: Conflicting or duplicated data can complicate the mining process.

Data integration
- Heterogeneous data sources: Integrating data from various sources with different formats, structures, and semantics.
- Data cleaning: Ensuring data is accurate and relevant, which is time-consuming and resource-intensive.

Data privacy and security
- Privacy concerns: Protecting sensitive information from unauthorized access.
- Data security: Ensuring data is secure from breaches and attacks.

Data mining methodology and user interaction
- Scalability: Handling large volumes of data efficiently.
- High dimensionality: Managing datasets with a large number of attributes (features).
- Complex and dynamic data: Adapting to complex data types (e.g., multimedia, time series) and data that changes over time.
- User interface and visualization: Providing intuitive tools for users to interact with and understand the data mining results.

Performance issues
- Efficiency: Developing algorithms that can process large datasets quickly.
- Accuracy: Ensuring the models and patterns discovered are accurate and reliable.
- Interpretability: Making sure the results are understandable to non-technical stakeholders.

Data preprocessing
- Data cleaning: Removing noise and handling missing data.
- Data transformation: Normalizing or transforming data into a suitable format for mining.
- Data reduction: Reducing the volume of data while producing the same or similar analytical results.

Model overfitting and underfitting
- Overfitting: Models that are too complex may perform well on training data but poorly on new, unseen data.
- Underfitting: Models that are too simple may not capture the underlying trends in the data.

Evaluation and validation
- Model validation: Ensuring models are validated using appropriate techniques such as cross-validation.
- Performance metrics: Selecting the right metrics to evaluate model performance, which may vary with the problem context.

Ethical issues
- Bias and fairness: Ensuring the data and models do not perpetuate or amplify existing biases.
- Transparency: Making sure the data mining process is transparent and the results are explainable.

Staying current with technology and techniques
- Rapid advancements: Keeping up with the fast-paced development of data mining tools and techniques.
- Skill gap: Ensuring the availability of skilled professionals who can work with advanced data mining methods.

ADDRESSING THESE CHALLENGES

- Advanced algorithms: Developing robust algorithms that can handle noise, missing values, and large datasets efficiently.
- Data preprocessing techniques: Improving methods for cleaning, integrating, and transforming data.
- Security measures: Implementing strong security protocols to protect data privacy and integrity.
- Visualization tools: Creating intuitive visualization tools to help users understand and interact with data mining results.
- Regulations and ethical standards: Adhering to regulatory requirements and ethical standards to ensure fair and unbiased outcomes.
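Returning to the deviation-detection technique described above, a minimal sketch of statistical outlier detection using z-scores; the claim amounts and the threshold are illustrative assumptions.

```python
import statistics

# Hypothetical insurance claim amounts; the last one deviates from the norm.
claims = [120.0, 135.0, 110.0, 142.0, 128.0, 131.0, 990.0]

mean = statistics.mean(claims)
stdev = statistics.stdev(claims)

# Flag claims whose z-score exceeds 2. With a small sample, a single extreme
# value inflates the standard deviation, so a modest threshold is used here.
outliers = [x for x in claims if abs(x - mean) / stdev > 2]
print("outliers:", outliers)  # outliers: [990.0]
```

Flagged values like these are the candidates a fraud or quality-control analyst would investigate further.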
