Prasad V. Potluri Siddhartha Institute of Technology
DATA MINING TECHNIQUES
UNIT I: Data Mining: Introduction, Need of Data Mining, Definition, KDD process, Kinds of Data, Kinds of Patterns, Applications, Major Issues in Data Mining. Data: Introduction, Data Objects, Attribute Types-Nominal, Binary, Interval-Scaled, Ratio-Scaled, Discrete versus Continuous Attributes, Similarity and distance measures

Data Mining
Data mining is the process of extracting knowledge or insights from large amounts of data using various statistical and computational techniques. The data can be structured, semi-structured or unstructured, and can be stored in various forms such as databases, data warehouses, and data lakes.
The primary goal of data mining is to discover hidden patterns and relationships in the data that can be used to make informed decisions or predictions. This involves exploring the data using techniques such as clustering, classification, regression analysis, association rule mining, and anomaly detection.
Data mining has a wide range of applications across industries, including marketing, finance, healthcare, and telecommunications. For example, in marketing, data mining can be used to identify customer segments and target marketing campaigns, while in healthcare, it can be used to identify risk factors for diseases and develop personalized treatment plans.

Use Cases of Data Mining
Data mining has a wide range of applications and use cases across many industries and domains. Some of the most common use cases include:
Market Basket Analysis: Market basket analysis is a common use case of data mining in the retail and e-commerce industries. It involves analyzing data on customer purchases to identify items that are frequently purchased together, and using this information to make recommendations or suggestions to customers.
Fraud Detection: Data mining is widely used in the financial industry to detect and prevent fraud.
It involves analyzing data on transactions and customer behavior to identify patterns or anomalies that may indicate fraudulent activity.
Customer Segmentation: Data mining is commonly used in the marketing and advertising industries to segment customers into different groups based on their characteristics and behavior. This information can then be used to tailor marketing and advertising campaigns to specific segments of customers.
Predictive Maintenance: Data mining is increasingly used in the manufacturing and industrial sectors to predict when equipment or machinery is likely to fail or require maintenance. It involves analyzing data on the performance and usage of equipment to identify patterns that can indicate potential failures, and using this information to schedule maintenance and prevent downtime.
Network Intrusion Detection: Data mining is used in the cybersecurity industry to detect network intrusions and prevent cyber attacks. It involves analyzing data on network traffic and behavior to identify patterns that may indicate an attempted intrusion, and using this information to alert security teams and prevent attacks.

Need of Data Mining
Data mining is the process of searching and analyzing a large batch of raw data in order to identify patterns and extract useful information. Companies use data mining software to learn more about their customers. It can help them to develop more effective marketing strategies, increase sales, and decrease costs. Data mining relies on effective data collection, warehousing, and computer processing.
Data mining gives businesses a competitive advantage by helping to find insights in the data from digital transactions. By understanding customer behavior in greater depth, companies can create new products, services, or marketing techniques.
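As an illustration of the market basket analysis use case listed earlier, the following minimal sketch counts how often each pair of items appears in the same basket and reports the pair's support (the fraction of baskets containing both items). The transactions and the helper name `pair_support` are hypothetical, chosen only for this example; real systems typically use dedicated association rule algorithms such as Apriori rather than this brute-force pair count.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions: each set is one customer's basket.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "milk", "eggs"},
]


def pair_support(baskets):
    """Return the fraction of baskets that contain each unordered item pair."""
    counts = Counter()
    for basket in baskets:
        # Sorting makes ("bread", "milk") and ("milk", "bread") the same key.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: n / len(baskets) for pair, n in counts.items()}


support = pair_support(transactions)
top_pair = max(support, key=support.get)
print(top_pair, support[top_pair])  # ('bread', 'milk') 0.6
```

Here bread and milk appear together in 3 of the 5 baskets, so a retailer might recommend one when a customer buys the other.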
Here are some of the advantages that data mining can bring to a business:
Optimize pricing: By using data mining to analyze different pricing variables, such as demand, elasticity, distribution and brand perception, businesses can set prices at a level that maximizes profit.
Optimize marketing: Data mining allows businesses to segment their customers by behavior and need. In turn, this allows them to deliver personalized ads which perform better and are more relevant to customers.
Greater productivity: Analyzing employee behavior patterns can feed into HR initiatives to improve employee engagement and productivity.
Greater efficiency: From customer buying patterns to supplier pricing behavior, businesses can use data mining and data analysis to improve efficiency and reduce costs.
Increased customer retention: Data mining can uncover insights which help you understand your customers in greater depth. In turn, this can improve your interactions with customers, increasing retention.
Improved products and services: Using data mining to locate and fix areas where quality falls short can decrease product returns.

Marketing
Businesses can use data mining to improve their marketing activity. For example, insights from data mining can be used to understand where prospects see ads, what demographics to target, where to place digital ads, and what marketing strategies work best with customers.

Manufacturing
For companies which produce their own goods, data mining can be used to analyze the cost of raw materials, whether materials are being used most efficiently, how time is spent along the manufacturing process, and what barriers impact the process. Data mining can also support just-in-time fulfillment by predicting when new supplies should be ordered or when equipment needs to be replaced.

Fraud detection
The purpose of data mining is to find patterns, trends, and correlations that link data points together.
An organization can use data mining to identify outliers or correlations that should not exist. For example, a business may analyze its cash flow and find recurring payments to an unknown account.

Kinds of Data/Types of Data
Data mining can be performed on the following types of data:
Relational Database: A relational database is a collection of multiple data sets formally organized by tables, records, and columns, from which data can be accessed in various ways without having to reorganize the database tables. Tables convey and share information, which facilitates data searchability, reporting, and organization.
Data Warehouses: A data warehouse is a technology that collects data from various sources within the organization to provide meaningful business insights. The huge amount of data comes from multiple places such as Marketing and Finance. The extracted data is used for analytical purposes and helps in decision-making for a business organization. The data warehouse is designed for the analysis of data rather than transaction processing.
Data Repositories: A data repository generally refers to a destination for data storage. However, many IT professionals use the term more narrowly to refer to a specific kind of setup within an IT structure, for example, a group of databases in which an organization keeps various kinds of information.
Object-Relational Database: A combination of an object-oriented database model and a relational database model is called an object-relational model. It supports classes, objects, inheritance, etc. One of the primary objectives of the object-relational data model is to close the gap between the relational database and the object-oriented practices frequently used in programming languages such as C++, Java, and C#.
Transactional Database: A transactional database refers to a database management system (DBMS) that can undo a database transaction if it is not performed appropriately. Although this was a unique capability a long time ago, today most relational database systems support transactional database activities.

Kinds of Patterns That Can Be Mined
Different types of data can be mined in data mining. However, the data must contain patterns for mining to yield helpful information. Based on data mining functionality, patterns are classified into two categories: descriptive patterns, which characterize properties of the existing data, and predictive patterns, which are used to make predictions about new data.

Advantages of Data Mining
Data mining enables organizations to obtain knowledge-based data.
Data mining enables organizations to make profitable adjustments in operation and production.
Compared with other statistical data applications, data mining is cost-efficient.
Data mining helps the decision-making process of an organization.
It facilitates the automated discovery of hidden patterns as well as the prediction of trends and behaviors.
It can be introduced into new systems as well as existing platforms.
It is a quick process that makes it easy for new users to analyze enormous amounts of data in a short time.

Disadvantages of Data Mining
Organizations may sell useful customer data to other organizations for money. As per one report, American Express has sold data on its customers' credit card purchases to other organizations.
Much data mining analytics software is difficult to operate and needs advanced training to work with.
Different data mining tools operate in distinct ways because of the different algorithms used in their design. Therefore, selecting the right data mining tool is a very challenging task.
Data mining techniques are not perfectly accurate, which may lead to severe consequences in certain conditions.

The Data Mining Process
To be most effective, data analysts generally follow a certain flow of tasks along the data mining process. Without this structure, an analyst may encounter an issue in the middle of their analysis that could have easily been prevented had they prepared for it earlier. The data mining process is usually broken into the following steps.
Step 1: Understand the Business
Before any data is touched, extracted, cleaned, or analyzed, it is important to understand the underlying entity and the project at hand. What are the goals the company is trying to achieve by mining data? What is their current business situation?
Step 2: Understand the Data
Once the business problem has been clearly defined, it's time to start thinking about data. This includes what sources are available, how they will be secured and stored, how the information will be gathered, and what the final outcome or analysis may look like. This step also includes determining the limits of the data, storage, security, and collection, and assessing how these constraints will affect the data mining process.
Step 3: Prepare the Data
Data is gathered, uploaded, extracted, or calculated. It is then cleaned, standardized, scrubbed for outliers, assessed for mistakes, and checked for reasonableness. During this stage of data mining, the data may also be checked for size, as an oversized collection of information may unnecessarily slow computations and analysis.
Step 4: Build the Model
With a clean data set in hand, it's time to crunch the numbers. Data scientists use data mining techniques to search for relationships, trends, associations, or sequential patterns. The data may also be fed into predictive models to assess how previous bits of information may translate into future outcomes.
Step 5: Evaluate the Results
The data-centered aspect of data mining concludes by assessing the findings of the data model or models.
The outcomes from the analysis may be aggregated, interpreted, and presented to decision-makers who have largely been excluded from the data mining process to this point. In this step, organizations can choose to make decisions based on the findings.
Step 6: Implement Change and Monitor
The data mining process concludes with management taking steps in response to the findings of the analysis. The company may decide the information was not strong enough or the findings were not relevant, or the company may strategically pivot based on the findings. In either case, management reviews the ultimate impact on the business and starts future data mining loops by identifying new business problems or opportunities.

How Data Mining Works
Data mining involves exploring and analyzing large blocks of information to glean meaningful patterns and trends. It is used in credit risk management, fraud detection, and spam filtering. It is also a market research tool that helps reveal the sentiment or opinions of a given group of people.
The data mining process breaks down into four steps:
1. Data is collected and loaded into data warehouses on site or on a cloud service.
2. Business analysts, management teams, and information technology professionals access the data and determine how they want to organize it.
3. Custom application software sorts and organizes the data.
4. The end user presents the data in an easy-to-share format, such as a graph or table.

Applications of Data Mining
Common applications of data mining in various industries include the following.
Scientific Analysis: Scientific simulations generate large volumes of data every day. This includes data collected from nuclear laboratories, data about human psychology, etc. Data mining techniques are capable of analyzing these data. Examples of scientific analysis:
Sequence analysis in bioinformatics
Classification of astronomical objects
Medical decision support
Intrusion Detection: A network intrusion refers to any unauthorized activity on a digital network. Network intrusions often involve stealing valuable network resources. Data mining techniques play a vital role in detecting intrusions, network attacks, and anomalies. These techniques help in selecting and refining useful and relevant information from large data sets, and in classifying relevant data for an intrusion detection system. An intrusion detection system raises alarms when network traffic indicates an intrusion into the system. For example:
Detect security violations
Misuse detection
Anomaly detection
Business Transactions: Every transaction in the business world is recorded for perpetuity. Such transactions are usually time-related and can be inter-business deals or intra-business operations. Using this data effectively and within a reasonable time frame for competitive decision-making is the most important problem to solve for businesses that struggle to survive in a highly competitive world. Data mining helps to analyze these business transactions and to identify marketing approaches and support decision-making. Examples:
Direct mail targeting
Stock trading
Customer segmentation
Market Basket Analysis: Market basket analysis is a technique for the careful study of the purchases made by a customer in a supermarket. It identifies patterns of items that customers frequently purchase together. Companies can use this analysis to promote deals, offers, and sales, and data mining techniques make this analysis possible. Example: data mining concepts are used in sales and marketing to provide better customer service, to improve cross-selling opportunities, and to increase direct mail response rates.
Education: To analyze the education sector, data mining uses the Educational Data Mining (EDM) method. This method generates patterns that can be used by both learners and educators.
Using EDM, we can perform educational tasks such as:
Predicting student admission in higher education
Predicting student profiles
Predicting student performance
Research: Data mining techniques can perform prediction, classification, clustering, association, and grouping of data in the research area. The rules generated by data mining are used to derive results. In most technical research in data mining, we create a training model and a testing model. The training/testing approach is a strategy to measure the accuracy of the proposed model. It is called train/test because we split the data set into two sets: a training data set and a testing data set. The training data set is used to build the model, whereas the testing data set is used to evaluate it. Examples:
Classification of uncertain data
Information-based clustering
Decision support systems
Healthcare and Insurance: A pharmaceutical company can examine its recent sales force activity and its outcomes to improve the targeting of high-value physicians and to determine which marketing activities will have the greatest effect in the coming months. In the insurance sector, data mining can help to predict which customers will buy new policies, identify behavior patterns of risky customers, and identify fraudulent behavior of customers. Examples:
Claims analysis, i.e. which medical procedures are claimed together
Identifying successful medical therapies for different illnesses
Transportation: A diversified transportation company with a large direct sales force can apply data mining to identify the best prospects for its services. A large consumer goods company can apply data mining to improve its sales process to retailers. Examples:
Determine the distribution schedules among outlets
Analyze loading patterns
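The train/test strategy described under Research above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a recommended workflow: the data set is synthetic, the "model" is a deliberately trivial majority-label predictor, and the helper names `split_train_test`, `majority_label`, and `accuracy` are hypothetical names introduced only for this example.

```python
import random


def split_train_test(records, test_fraction=0.25, seed=0):
    """Shuffle a copy of the records and split it into (train, test) lists."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded for a repeatable split
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def majority_label(rows):
    """A trivial 'model': always predict the most common training label."""
    labels = [label for _, label in rows]
    return max(set(labels), key=labels.count)


def accuracy(predicted_label, rows):
    """Fraction of rows whose true label equals the constant prediction."""
    return sum(label == predicted_label for _, label in rows) / len(rows)


# Synthetic labeled data: (feature, label) pairs, 4 "low" and 16 "high".
data = [(x, "high" if x >= 4 else "low") for x in range(20)]

train, test = split_train_test(data)
model = majority_label(train)       # fitted on the training set only
score = accuracy(model, test)       # evaluated on the held-out test set
print(len(train), len(test), model, score)
```

The key design point is that the test records never influence the model, so the reported accuracy estimates how the model behaves on unseen data.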
Financial/Banking Sector: A credit card company can leverage its vast warehouse of customer transaction data to identify customers most likely to be interested in a new credit product. Example:
Credit card fraud detection

Major Issues in Data Mining
Data mining, the process of extracting knowledge from data, has become increasingly important as the amount of data generated by individuals, organizations, and machines has grown exponentially. Data mining is not an easy task, as the algorithms used can get very complex and data is not always available in one place. It needs to be integrated from various heterogeneous data sources.

Mining Methodology and User Interaction
This refers to the following kinds of issues:
Mining different kinds of knowledge in databases: Different users may be interested in different kinds of knowledge. Therefore, data mining needs to cover a broad range of knowledge discovery tasks.
Interactive mining of knowledge at multiple levels of abstraction: The data mining process needs to be interactive because interactivity allows users to focus the search for patterns, providing and refining data mining requests based on the returned results.
Data mining query languages and ad hoc data mining: A data mining query language that allows the user to describe ad hoc mining tasks should be integrated with a data warehouse query language and optimized for efficient and flexible data mining.
Presentation and visualization of data mining results: Once patterns are discovered, they need to be expressed in high-level languages and visual representations. These representations should be easily understandable.
Handling noisy or incomplete data: Data cleaning methods are required to handle noise and incomplete objects while mining data regularities. Without data cleaning methods, the accuracy of the discovered patterns will be poor.
Pattern evaluation: The patterns discovered should be interesting. Patterns may be uninteresting because they represent common knowledge or lack novelty, so they must be evaluated.

Data: Data is a collection of data objects and their attributes.
Attribute: An attribute is a property or characteristic of an object. An attribute is also known as a variable, field, characteristic, or feature.
Data object: A collection of attributes describes an object. An object is also known as a record, point, case, sample, entity, or instance.
Data Set: A data set is an organized collection of data. Data sets are generally associated with a unique body of work and typically cover one topic at a time.

Types of attributes:
The initial phase of data preprocessing involves categorizing attributes into different types, which serves as a foundation for subsequent data processing steps. Attributes can be broadly classified into two main types, qualitative (categorical) and quantitative (numeric).
1]Nominal Data: This type of data is also referred to as categorical data. Nominal data represents data that is qualitative and cannot be measured or compared with numbers. In nominal data, the values represent a category, and there is no inherent order or hierarchy. Examples of nominal data include gender, race, religion, and occupation. Nominal data is used in data mining for classification and clustering tasks.
2]Ordinal Data: This type of data is also categorical, but with an inherent order or hierarchy. Ordinal data represents qualitative data that can be ranked in a particular order. For instance, education level can be ranked from primary to tertiary, and social status can be ranked from low to high. In ordinal data, the distance between values is not uniform. This means that it is not possible to say that the difference between high and medium social status is the same as the difference between medium and low social status. Ordinal data is used in data mining for ranking and classification tasks.
3]Binary Data: This type of data has only two possible values, often represented as 0 or 1.
Binary data is commonly used in classification tasks, where the target variable has only two possible outcomes. Examples of binary data include yes/no, true/false, and pass/fail. Binary data is used in data mining for classification and association rule mining tasks.
Symmetric: In a symmetric attribute, both values or states are considered equally important or interchangeable. For example, in the attribute "Gender" with values "Male" and "Female," neither value holds precedence over the other, and they are considered equally significant for analysis purposes.
Asymmetric: An asymmetric attribute indicates that the two values or states are not equally important or interchangeable. For instance, in the attribute "Result" with values "Pass" and "Fail," the states are not of equal importance; passing may hold greater significance than failing in certain contexts, such as academic grading or certification exams.
4]Interval Data: This type of data represents quantitative data with equal intervals between consecutive values. Interval data has no absolute zero point, and therefore ratios of values are not meaningful. Examples of interval data include temperature in degrees Celsius or Fahrenheit, IQ scores, and calendar dates. Interval data is used in data mining for clustering and prediction tasks.
5]Ratio Data: This type of data is similar to interval data, but with an absolute zero point. In ratio data, it is possible to compute ratios of two values, which makes meaningful comparisons possible (for example, one weight can be twice another). Examples of ratio data include height, weight, and income. Ratio data is used in data mining for prediction and association rule mining tasks.
6]Text Data: This type of data represents unstructured data in the form of text. Text data can be found in social media posts, customer reviews, and news articles. Text data is used in data mining for sentiment analysis, text classification, and topic modeling tasks.

Discrete data vs Continuous data

3. Jaccard Index: The Jaccard index measures the similarity of two sets of items as the size of their intersection divided by the size of their union. The corresponding Jaccard distance is one minus the Jaccard index.
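A minimal sketch of the Jaccard index and the corresponding Jaccard distance follows. The two example baskets and the function names are hypothetical, and treating two empty sets as identical (index 1.0) is an assumption made here to avoid division by zero, not something the notes specify.

```python
def jaccard_index(a, b):
    """Size of the intersection divided by the size of the union of two sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumption: two empty sets are treated as identical
    return len(a & b) / len(a | b)


def jaccard_distance(a, b):
    """Dissimilarity of two sets: one minus the Jaccard index."""
    return 1.0 - jaccard_index(a, b)


basket_1 = {"bread", "milk", "eggs"}
basket_2 = {"bread", "milk", "butter"}

# 2 shared items (bread, milk) out of 4 distinct items overall.
print(jaccard_index(basket_1, basket_2))     # 0.5
print(jaccard_distance(basket_1, basket_2))  # 0.5
```

An index of 1 means the sets are identical and 0 means they share nothing, so the distance runs the opposite way.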