Data Warehousing and Mining
Uploaded by PreferableHafnium
Summary
This document discusses data warehousing and mining, focusing on the crucial role of data preprocessing in data mining. It details common steps like data cleaning, integration, transformation, and reduction, emphasizing their importance for ensuring data quality and analysis accuracy.
Full Transcript
### What Is Data Mining?

#### The Scope of Data Mining

### Tasks of Data Mining

### Architecture of Data Mining

Knowledge Base:

Data Mining Engine:

Pattern Evaluation Module:

User Interface:

### Data Mining Process:

#### State the problem and formulate the hypothesis

#### Collect the data

#### Preprocessing the data

Data Preprocessing in Data Mining
=================================

**Data preprocessing is an important step in the data mining process. It refers to the cleaning, transforming, and integrating of raw data in order to make it ready for analysis. The goal of data preprocessing is to improve the quality of the data and to make it more suitable for the specific data mining task.**

### Some common steps in data preprocessing include:

**Data Cleaning: This involves identifying and correcting errors or inconsistencies in the data, such as missing values, outliers, and duplicates. Various techniques can be used for data cleaning, such as imputation, removal, and transformation.**

**Data Integration: This involves combining data from multiple sources to create a unified dataset. Data integration can be challenging as it requires handling data with different formats, structures, and semantics. Techniques such as record linkage and data fusion can be used for data integration.**

**Data Transformation: This involves converting the data into a suitable format for analysis. Common techniques used in data transformation include normalization, standardization, and discretization. Normalization is used to scale the data to a common range, while standardization is used to transform the data to have zero mean and unit variance.
Discretization is used to convert continuous data into discrete categories.**

**Data Reduction: This involves reducing the size of the dataset while preserving the important information. Data reduction can be achieved through techniques such as feature selection and feature extraction. Feature selection involves selecting a subset of relevant features from the dataset, while feature extraction involves transforming the data into a lower-dimensional space while preserving the important information.**

**Data Discretization: This involves dividing continuous data into discrete categories or intervals. Discretization is often used in data mining and machine learning algorithms that require categorical data. Discretization can be achieved through techniques such as equal width binning, equal frequency binning, and clustering.**

**Data Normalization: This involves scaling the data to a common range, such as between 0 and 1 or -1 and 1. Normalization is often used to handle data with different units and scales. Common normalization techniques include min-max normalization, z-score normalization (which rescales to zero mean and unit variance rather than to a fixed range), and decimal scaling.**

**Data preprocessing plays a crucial role in ensuring the quality of data and the accuracy of the analysis results. The specific steps involved in data preprocessing may vary depending on the nature of the data and the analysis goals.**

**By performing these steps, the data mining process becomes more efficient and the results become more accurate.**

**Preprocessing in Data Mining: \
Data preprocessing is a data mining technique used to transform raw data into a useful and efficient format.**

![](media/image4.png)

**Steps Involved in Data Preprocessing:**

**1. Data Cleaning: \
Raw data can have many irrelevant and missing parts. Data cleaning is done to handle them. It involves the handling of missing data, noisy data, etc.**

**2.
Data Transformation: \
This step transforms the data into forms appropriate for the mining process.**

**3. Data Reduction: \
Data reduction is a crucial step in the data mining process that involves reducing the size of the dataset while preserving the important information. This is done to improve the efficiency of data analysis and to avoid overfitting of the model. Some common steps involved in data reduction are:**

**Feature Selection: This involves selecting a subset of relevant features from the dataset. Feature selection is often performed to remove irrelevant or redundant features from the dataset. It can be done using various techniques such as correlation analysis and mutual information. (Principal component analysis, by contrast, constructs new features and therefore belongs under feature extraction.)**

**Feature Extraction: This involves transforming the data into a lower-dimensional space while preserving the important information. Feature extraction is often used when the original features are high-dimensional and complex. It can be done using techniques such as principal component analysis (PCA), linear discriminant analysis (LDA), and non-negative matrix factorization (NMF).**

**Sampling: This involves selecting a subset of data points from the dataset. Sampling is often used to reduce the size of the dataset while preserving the important information. It can be done using techniques such as random sampling, stratified sampling, and systematic sampling.**

**Clustering: This involves grouping similar data points together into clusters. Clustering is often used to reduce the size of the dataset by replacing similar data points with a representative centroid. It can be done using techniques such as k-means, hierarchical clustering, and density-based clustering.**

**Compression: This involves compressing the dataset while preserving the important information. Compression is often used to reduce the size of the dataset for storage and transmission purposes.
It can be done using techniques such as wavelet compression, JPEG compression, and gzip compression.**

**In the observational setting, data are usually "collected" from existing databases, data warehouses, and data marts. Data preprocessing usually includes at least two common tasks:**

1. **Outlier detection (and removal) -- Outliers are unusual data values that are not consistent with most observations. Commonly, outliers result from measurement errors or coding and recording errors, and sometimes they are natural, abnormal values. Such nonrepresentative samples can seriously affect the model produced later. There are two strategies for dealing with outliers:**

   a. **Detect and eventually remove outliers as a part of the preprocessing phase, or**

   b. **Develop robust modeling methods that are insensitive to outliers.**

2. **Scaling, encoding, and selecting features -- Data preprocessing includes several steps such as variable scaling and different types of encoding. For example, one feature with the range \[0, 1\] and another with the range \[−100, 1000\] will not have the same weights in the applied technique; they will also influence the final data-mining results differently. Therefore, it is recommended to scale them and bring both features to the same weight for further analysis. Also, application-specific encoding methods usually achieve**

#### Estimate the model

#### Interpret the model and draw conclusions

##### **The Data Mining Process**

6.
**Classification of Data Mining Systems:**

- **Database Technology**
- **Statistics**
- **Machine Learning**
- **Information Science**
- ![](media/image6.png) **Visualization**

#### Some Other Classification Criteria:

- **Classification according to kind of databases mined**
- **Classification according to kind of knowledge mined**
- **Classification according to kinds of techniques utilized**
- ![](media/image6.png) **Classification according to applications adapted**

#### Classification according to kind of databases mined

#### Classification according to kind of knowledge mined

###### Classification according to kinds of techniques utilized

###### Classification according to applications adapted

### Major Issues In Data Mining:

### Knowledge Discovery in Databases (KDD)

##### **Architecture of KDD**

**Unit-II**

**Data Warehouse:**

### Data Warehouse Design Process:

### A Three Tier Data Warehouse Architecture:

![](media/image8.png)

#### Tier-1:

#### Tier-2:

#### Tier-3:

3. **Data Warehouse Models:**

#### Enterprise warehouse:

#### Data mart:

#### Virtual warehouse:

4. **Meta Data Repository:**

### OLAP (Online Analytical Processing):

- **Consolidation (Roll-Up)**
- **Drill-Down**
- **Slicing And Dicing**

### Types of OLAP:

#### Relational OLAP (ROLAP):

#### Multidimensional OLAP (MOLAP):

#### Hybrid OLAP (HOLAP):

Data Warehouse Implementation
=============================

**Implementing a data warehouse involves the following steps:**

**1. Requirements analysis and capacity planning: The first process in data warehousing involves defining enterprise needs, defining architectures, carrying out capacity planning, and selecting the hardware and software tools. This step includes consulting senior management as well as the various stakeholders.**

**2.
Hardware integration: Once the hardware and software have been selected, they need to be put together by integrating the servers, the storage methods, and the user software tools.**

**3. Modeling: Modeling is a significant stage that involves designing the warehouse schema and views. This may involve using a modeling tool if the data warehouse is sophisticated.**

**4. Physical modeling: For the data warehouse to perform efficiently, physical modeling is needed. This involves designing the physical data warehouse organization, data placement, data partitioning, deciding on access techniques, and indexing.**

**5. Sources: The information for the data warehouse is likely to come from several data sources. This step involves identifying and connecting the sources using a gateway, ODBC drivers, or another wrapper.**

**6. ETL: The data from the source systems will need to go through an ETL phase. The process of designing and implementing the ETL phase may involve identifying suitable ETL tool vendors and purchasing and implementing the tools. This may include customizing the tools to suit the needs of the enterprise.**

**7. Populate the data warehouse: Once the ETL tools have been agreed upon, testing the tools will be needed, perhaps using a staging area. Once everything is working adequately, the ETL tools may be used to populate the warehouse given the schema and view definitions.**

**8. User applications: For the data warehouse to be helpful, there must be end-user applications. This step involves designing and implementing the applications required by the end-users.**

**9. Roll-out the warehouse and applications: Once the data warehouse has been populated and the end-client applications tested, the warehouse system and the applications may be rolled out for the user community to use.**

Implementation Guidelines
-------------------------

![Data Warehouse Implementation](media/image10.png)

**1.
Build incrementally: Data warehouses must be built incrementally. Generally, it is recommended that a data mart be created with one particular project in mind; once it is implemented, several other sections of the enterprise may also want to implement similar systems. An enterprise data warehouse can then be implemented in an iterative manner, allowing all data marts to extract information from the data warehouse.**

**2. Need a champion: A data warehouse project must have a champion who is willing to carry out considerable research into the expected costs and benefits of the project. Data warehousing projects require input from many units in an enterprise and therefore need to be driven by someone who is capable of interacting with people across the enterprise and can actively persuade colleagues.**

**3. Senior management support: A data warehouse project must be fully supported by senior management. Given the resource-intensive nature of such projects and the time they can take to implement, a warehouse project calls for a sustained commitment from senior management.**

**4. Ensure quality: Only data that has been cleaned and is of a quality accepted by the organization should be loaded into the data warehouse.**

**5. Corporate strategy: A data warehouse project must fit with corporate strategy and business goals. The purpose of the project must be defined before the project begins.**

**6. Business plan: The financial costs (hardware, software, and peopleware), expected benefits, and a project plan for a data warehouse project must be clearly outlined and understood by all stakeholders. Without such understanding, rumors about expenditure and benefits can become the only sources of information, undermining the project.**

**7. Training: Data warehouse projects must not overlook data warehouse training requirements.
For a data warehouse project to be successful, the users must be trained to use the warehouse and to understand its capabilities.**

**8. Adaptability: The project should build in flexibility so that changes may be made to the data warehouse if and when required. Like any system, a data warehouse will need to change as the needs of the enterprise change.**

**9. Joint management: The project must be handled by both IT and business professionals in the enterprise. To ensure proper communication with the stakeholders, and that the project is targeted at assisting the enterprise's business, business professionals must be involved in the project along with technical professionals.**

What is Data Cube?
==================

**Data grouped or combined in multidimensional matrices is called a data cube. The data cube method has a few alternative names or variants, such as "multidimensional databases," "materialized views," and "OLAP (On-Line Analytical Processing)."**

**The general idea of this approach is to materialize certain expensive computations that are frequently requested.**

**For example, a relation with the schema sales (part, supplier, customer, sale-price) can be materialized into a set of eight views as shown in the figure, where psc indicates a view consisting of aggregate function values (such as total-sales) computed by grouping on the three attributes part, supplier, and customer, p indicates a view composed of the corresponding aggregate function values calculated by grouping on part alone, etc.**

**A data cube is created from a subset of attributes in the database. Specific attributes are chosen to be measure attributes, i.e., the attributes whose values are of interest. Other attributes are selected as dimensions or functional attributes.
The measure attributes are aggregated according to the dimensions.**

**For example, XYZ may create a sales data warehouse to keep records of the store's sales for the dimensions time, item, branch, and location. These dimensions enable the store to keep track of things like monthly sales of items, and the branches and locations at which the items were sold. Each dimension may have a table associated with it, known as a dimension table, which describes the dimension. For example, a dimension table for items may contain the attributes item\_name, brand, and type.**

**The data cube method is an interesting technique with many applications. Data cubes can be sparse in many cases because not every cell in each dimension may have corresponding data in the database.**

**Techniques should be developed to handle sparse cubes efficiently.**

**If a query contains constants at even lower levels than those provided in a data cube, it is not clear how to make the best use of the precomputed results stored in the data cube.**

**The multidimensional model views data in the form of a data cube. OLAP tools are based on the multidimensional data model. Data cubes usually model n-dimensional data.**

**A data cube enables data to be modeled and viewed in multiple dimensions. A multidimensional data model is organized around a central theme, like sales and transactions. A fact table represents this theme. Facts are numerical measures. Thus, the fact table contains measures (such as Rs\_sold) and keys to each of the related dimension tables.**

**Dimensions are the perspectives that define a data cube. Facts are generally quantities, which are used for analyzing the relationships between dimensions.**

![What is Data Cube](media/image12.png)

**Example: In the 2-D representation, we will look at the All Electronics sales data for items sold per quarter in the city of Vancouver.
The measure displayed is dollars sold (in thousands).**

3-Dimensional Cuboids
---------------------

**Suppose we would like to view the sales data with a third dimension. For example, suppose we would like to view the data according to time and item, as well as location, for the cities Chicago, New York, Toronto, and Vancouver. The measure displayed is dollars sold (in thousands). These 3-D data are shown in the table. The 3-D data of the table are represented as a series of 2-D tables.**

![What is Data Cube](media/image14.png)

**Conceptually, we may represent the same data in the form of a 3-D data cube, as shown in the figure.**

**Let us suppose that we would like to view our sales data with an additional fourth dimension, such as supplier.**

**In data warehousing, data cubes are n-dimensional. The cuboid which holds the lowest level of summarization is called the base cuboid.**

**For example, the 4-D cuboid in the figure is the base cuboid for the given time, item, location, and supplier dimensions.**

![What is Data Cube](media/image16.png)

**The figure shows a 4-D data cube representation of sales data, according to the dimensions time, item, location, and supplier. The measure displayed is dollars sold (in thousands).**

**The topmost 0-D cuboid, which holds the highest level of summarization, is known as the apex cuboid. In this example, this is the total sales, or dollars sold, summarized over all four dimensions.**

**The lattice of cuboids forms a data cube. The figure shows the lattice of cuboids making up a 4-D data cube for the dimensions time, item, location, and supplier.
Each cuboid represents a different degree of summarization.**

**Data Cube Computation-**

Data Cube Computations
======================

**Data Cube Generalization**

Introduction to Data Generalization
-----------------------------------

Clustering
----------

### Example

**from sklearn.cluster import KMeans**

**\# Load customer data (load\_customer\_data is a placeholder for your own loader)**

**customer\_data = load\_customer\_data()**

**\# Use k-means clustering to group customers into 3 clusters**

**kmeans = KMeans(n\_clusters=3)**

**kmeans.fit(customer\_data)**

**\# View the cluster label assigned to each customer**

**print(kmeans.labels\_)**

Sampling
--------

### Example

**import random**

**\# Load customer data (assumed here to be a list of records)**

**customer\_data = load\_customer\_data()**

**\# Select a random sample of 1000 customers, without replacement**

**sample\_size = 1000**

**random\_sample = random.sample(customer\_data, sample\_size)**

**\# Perform analysis on the sample (analyze\_sample is a placeholder)**

**results = analyze\_sample(random\_sample)**

**\# Use the results to make inferences about the larger population**

**infer\_population\_trends(results, sample\_size, len(customer\_data))**

Dimensionality Reduction
------------------------

### Example

**from sklearn.decomposition import PCA**

**\# Load dataset (load\_dataset is a placeholder for your own loader)**

**data = load\_dataset()**

**\# Use PCA to reduce the number of features to 3**

**pca = PCA(n\_components=3)**

**pca.fit(data)**

**\# View the transformed data**

**print(pca.transform(data))**

Other Basic Approaches of Data Generalization
---------------------------------------------

### Data Cube Approach

### Example

**\# Load sales data**

**sales\_data = load\_sales\_data()**

**\# Create a data cube with dimensions for time, location, and product type**

**\# (create\_data\_cube and slice are placeholder names, not a specific library API)**

**data\_cube = create\_data\_cube(sales\_data, \[\'time\', \'location\', \'product\_type\'\])**

**\# View sales data for a specific time period, location, and product type**

**sales\_slice = data\_cube.slice(time=\'Q1 2021\', location=\'New York\', product\_type=\'Clothing\')**

**print(sales\_slice)**

### Attribute-Oriented Induction
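As a rough illustration of the idea behind attribute-oriented induction, the sketch below drops an identifying attribute, replaces each remaining value with a higher-level concept, and merges tuples that become identical while keeping a count. The concept hierarchies (`CITY_TO_COUNTRY`, `age_to_group`), the sample records, and the function name `generalize` are all assumptions made for this example; they are not taken from these notes or from any library.

```python
from collections import Counter

# Hypothetical concept hierarchy: city -> country.
CITY_TO_COUNTRY = {
    "Vancouver": "Canada",
    "Toronto": "Canada",
    "Chicago": "USA",
    "New York": "USA",
}


def age_to_group(age):
    """Hypothetical concept hierarchy: numeric age -> age group."""
    if age < 30:
        return "young"
    if age < 60:
        return "middle-aged"
    return "senior"


def generalize(records):
    """Generalize each record and merge identical generalized tuples.

    The 'name' attribute is removed because it has no higher-level concept;
    'age' and 'city' are each replaced by their higher-level concept.
    """
    counts = Counter(
        (age_to_group(r["age"]), CITY_TO_COUNTRY[r["city"]]) for r in records
    )
    return [
        {"age_group": group, "country": country, "count": n}
        for (group, country), n in sorted(counts.items())
    ]


customers = [
    {"name": "Ann", "age": 24, "city": "Vancouver"},
    {"name": "Bob", "age": 27, "city": "Toronto"},
    {"name": "Cid", "age": 45, "city": "Chicago"},
    {"name": "Dee", "age": 52, "city": "New York"},
    {"name": "Eve", "age": 61, "city": "Chicago"},
]

for row in generalize(customers):
    print(row)
```

With these five sample records, the generalized relation has three tuples, and the counts still add up to five, so no record is lost in the summary.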
### Example

**\# Load customer data**

**customer\_data = load\_customer\_data()**

**\# Use attribute-oriented induction to classify customers into different segments**

**\# (classify\_customers is a placeholder for your own implementation)**

**segments = classify\_customers(customer\_data)**

**\# View the resulting segments**

**print(segments)**

### Association Rule Mining:

#### Problem Definition:

Example:

  ------- ------- -------
  **1**   **1**   **0**
  **2**   **0**   **1**
  **3**   **0**   **0**
  **4**   **1**   **1**
  **5**   **1**   **0**
  ------- ------- -------

#### Important concepts of Association Rule Mining:

### Market basket analysis:

Example:

3. **Frequent Pattern Mining:**

Based on the completeness of patterns to be mined:

Based on the levels of abstraction involved in the rule set:

Based on the number of data dimensions involved in the rule:

Based on the types of values handled in the rule:

Based on the kinds of rules to be mined:

Based on the kinds of patterns to be mined:

4\. Efficient Frequent Itemset Mining Methods:
------------------------------------------

### Finding Frequent Itemsets Using Candidate Generation: The Apriori Algorithm

Example:

1. **In the first iteration of the algorithm, each item is a member of the set of candidate 1-itemsets, C1. The algorithm simply scans all of the transactions in order to count the number of occurrences of each item.**
2. **Suppose that the minimum support count required is 2, that is, min\_sup = 2. The set of frequent 1-itemsets, L1, can then be determined. It consists of the candidate 1-itemsets satisfying minimum support. In our example, all of the candidates in C1 satisfy minimum support.**
3. **To discover the set of frequent 2-itemsets, L2, the algorithm uses the join L1 ⋈ L1 to generate a candidate set of 2-itemsets, C2. No candidates are removed from C2 during the prune step because each subset of the candidates is also frequent.**
4. **Next, the transactions in D are scanned and the support count of each candidate itemset in C2 is accumulated.**
5.
**The set of frequent 2-itemsets, L2, is then determined, consisting of those candidate 2-itemsets in C2 having minimum support.**
6. **To generate the set of candidate 3-itemsets, C3, the join step first gives C3 = L2 ⋈ L2 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}}. Based on the Apriori property that all subsets of a frequent itemset must also be frequent, we can determine that the four latter candidates cannot possibly be frequent.**
7. **The transactions in D are scanned in order to determine L3, consisting of those candidate 3-itemsets in C3 having minimum support.**
8. **The algorithm uses L3 ⋈ L3 to generate a candidate set of 4-itemsets, C4.**

![](media/image44.png)

### Generating Association Rules from Frequent Itemsets:

Example:

![](media/image47.png)

5. **Mining Multilevel Association Rules:**

### Approaches For Mining Multilevel Association Rules:

#### Uniform Minimum Support:

![](media/image49.png)

#### Reduced Minimum Support:

#### Group-Based Minimum Support:

### Mining Multidimensional Association Rules from Relational Databases and Data Warehouses:

### Mining Quantitative Association Rules:

### From Association Mining to Correlation Analysis:

![](media/image52.jpeg)

**Constraint Based Association Mining-**

Chapter-4
=========

### Classification and Prediction:

### Issues Regarding Classification and Prediction:

#### Preparing the Data for Classification and Prediction:

Data cleaning:

Relevance analysis:

Data Transformation And Reduction
#### Comparing Classification and Prediction Methods:

- Accuracy:
- Speed:
- Robustness:
- Scalability:
- Interpretability:

### Classification by Decision Tree Induction:

- **Each internal node denotes a test on an attribute.**
- **Each branch represents an outcome of the test.**
- **Each leaf node holds a class label.**
- **The topmost node in a tree is the root node.**

#### Algorithm For Decision Tree Induction:

![](media/image54.png)

- **Data partition**
- **Attribute list**
- **Attribute selection method**

A is discrete-valued:

A is continuous-valued:

A is discrete-valued and a binary tree must be produced:

**(a) If A is discrete-valued; (b) if A is continuous-valued; (c) if A is discrete-valued and a binary tree must be produced.**

### Bayesian Classification:

##### **Bayesian classification is based on Bayes' theorem.**

4. **Bayes' Theorem:**

![](media/image57.png)

### Naïve Bayesian Classification:

1. **Let D be a training set of tuples and their associated class labels. As usual, each tuple is represented by an n-dimensional attribute vector, X = (x1, x2,..., xn), depicting n measurements made on the tuple from n attributes, respectively, A1, A2,..., An.**
2. **Suppose that there are m classes, C1, C2,..., Cm. Given a tuple, X, the classifier will predict that X belongs to the class having the highest posterior probability, conditioned on X.**
3. **As P(X) is constant for all classes, only P(X\|Ci)P(Ci) need be maximized. If the class**
4. **Given data sets with many attributes, it would be extremely computationally expensive to compute P(X\|Ci). In order to reduce computation in evaluating P(X\|Ci), the naive assumption of class conditional independence is made. This presumes that the values of the attributes are conditionally independent of one another, given the class label of the tuple.
Thus,**

   - **If *A~k~* is categorical, then *P*(*x~k~*\|*Ci*) is the number of tuples of class *Ci* in *D* having the value**
   - **If *A~k~* is continuous-valued, then we need to do a bit more work, but the calculation is pretty straightforward.**

5. **In order to predict the class label of *X*, *P*(*X*\|*Ci*)*P*(*Ci*) is evaluated for each class *Ci*. The classifier predicts that the class label of tuple *X* is the class *Ci* if and only if**

![](media/image62.png)

### A Multilayer Feed-Forward Neural Network:

![](media/image56.png)

##### **The backpropagation algorithm performs learning on a multilayer feed-forward neural network.**

### Classification by Backpropagation:

Advantages:

Initialize the weights:

Propagate the inputs forward:

![](media/image64.png)

Backpropagate the error:

![](media/image66.png)

![](media/image68.png)

Algorithm:

![](media/image70.jpeg)

5. **k-Nearest-Neighbor Classifier:**

Support Vector Machine Algorithm
================================

**Support Vector Machine, or SVM, is one of the most popular supervised learning algorithms and is used for classification as well as regression problems. However, it is primarily used for classification problems in machine learning.**

**The goal of the SVM algorithm is to create the best line or decision boundary that can segregate n-dimensional space into classes so that we can easily put a new data point in the correct category in the future. This best decision boundary is called a hyperplane.**

**SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are called support vectors, and hence the algorithm is termed Support Vector Machine. Consider the diagram below, in which two different categories are separated by a decision boundary or hyperplane:**

![Support Vector Machine Algorithm](media/image73.png)

**Example: SVM can be understood with the example that we have used in the KNN classifier.
Suppose we see a strange cat that also has some features of dogs, and we want a model that can accurately identify whether it is a cat or a dog. Such a model can be created by using the SVM algorithm. We first train our model with lots of images of cats and dogs so that it can learn the different features of cats and dogs, and then we test it with this strange creature. Because SVM creates a decision boundary between the two classes (cat and dog) and chooses the extreme cases (support vectors), it will look at the extreme cases of cat and dog. On the basis of the support vectors, it will classify the creature as a cat. Consider the below diagram:**

**The SVM algorithm can be used for face detection, image classification, text categorization, etc.**

Types of SVM
------------

**SVM can be of two types:**

- **Linear SVM: Linear SVM is used for linearly separable data. If a dataset can be classified into two classes by using a single straight line, then such data is termed linearly separable data, and the classifier used is called a Linear SVM classifier.**
- **Non-linear SVM: Non-linear SVM is used for non-linearly separable data. If a dataset cannot be classified by using a straight line, then such data is termed non-linear data, and the classifier used is called a Non-linear SVM classifier.**

Hyperplane and Support Vectors in the SVM algorithm:
----------------------------------------------------

**Hyperplane: There can be multiple lines/decision boundaries to segregate the classes in n-dimensional space, but we need to find the best decision boundary that helps to classify the data points. This best boundary is known as the hyperplane of SVM.**

**The dimensions of the hyperplane depend on the number of features present in the dataset, which means if there are 2 features (as shown in the image), then the hyperplane will be a straight line.
And if there are 3 features, then the hyperplane will be a two-dimensional plane.**

**We always create the hyperplane that has the maximum margin, which means the maximum distance between the hyperplane and the nearest data points of each class.**

**Support Vectors:**

**The data points or vectors that are closest to the hyperplane and which affect the position of the hyperplane are termed support vectors. Since these vectors support the hyperplane, they are called support vectors.**

How does SVM work?
------------------

**Linear SVM:**

**The working of the SVM algorithm can be understood by using an example. Suppose we have a dataset that has two tags (green and blue), and the dataset has two features, x1 and x2. We want a classifier that can classify the pair (x1, x2) of coordinates as either green or blue. Consider the below image:**

![Support Vector Machine Algorithm](media/image75.png)

**Since this is a 2-D space, we can easily separate these two classes by just using a straight line. But there can be multiple lines that can separate these classes. Consider the below image:**

**Hence, the SVM algorithm helps to find the best line or decision boundary; this best boundary or region is called a hyperplane. The SVM algorithm finds the points from both classes that lie closest to the boundary. These points are called support vectors. The distance between the support vectors and the hyperplane is called the margin, and the goal of SVM is to maximize this margin. The hyperplane with maximum margin is called the optimal hyperplane.**

![Support Vector Machine Algorithm](media/image77.png)

**Non-Linear SVM:**

**If data is linearly arranged, then we can separate it by using a straight line, but for non-linear data, we cannot draw a single straight line. Consider the below image:**

**So to separate these data points, we need to add one more dimension. For linear data, we have used two dimensions x and y, so for non-linear data, we will add a third dimension z.
It can be calculated as:**

**z = x^2^ + y^2^**

**By adding the third dimension, the sample space becomes as shown in the image below:**

![Support Vector Machine Algorithm](media/image79.png)

**SVM now divides the dataset into classes in the following way. Consider the image below:**

(Image: Support Vector Machine Algorithm)

**Since we are in 3-D space, the boundary looks like a plane parallel to the x-axis. If we convert it to 2-D space with z = 1, it becomes:**

![Support Vector Machine Algorithm](media/image81.png)

**Hence, for this non-linear data, we get a circle of radius 1 as the decision boundary.**

**Association Classification**

### Association Rule learning in Data Mining:

**Association rule learning is a machine learning method for discovering interesting relationships between variables in large databases. It is designed to detect strong rules in the database based on some measures of interestingness. For any given multi-item transaction, association rules aim to determine how or why certain items are linked.**

**Association rules are created by searching the data for common if-then patterns and using the criteria of support and confidence to identify the key relationships. Support shows how frequently an itemset appears in the data, while confidence is the proportion of times an if-then statement is found to be true. A third criterion called lift is often used to compare expected and actual confidence: lift is the ratio of the actual confidence to the confidence that would be expected if the items were independent. Association rules are computed from itemsets made up of two or more items, and they usually consist of rules that are well represented by the data.**

**There are different types of data mining techniques that can be used to find out the specific analysis and result like Classification analysis, Clustering analysis, and multivariate analysis.
Association rules are mainly used to analyze and predict customer behavior.**

### Associative Classification in Data Mining:

**Bing Liu et al. were the first to propose associative classification, defining a model whose rule is "the right-hand side is constrained to be the attribute of the classification class". An associative classifier is a supervised learning model that uses association rules to assign a target value.**

**The model generated by the associative classifier and used to label new records consists of association rules that produce class labels. These rules can therefore be thought of as a list of "if-then" clauses: if a record meets certain criteria (specified on the left side of the rule, also known as the antecedent), it is labeled (or scored) according to the rule's class on the right. Most associative classifiers read the list of rules sequentially and apply the first matching rule to label new records. Associative classifier rules inherit some metrics from association rules, such as support and confidence, which can be used to rank or filter the rules in the model and evaluate their quality.**

### Types of Associative Classification:

**There are different types of associative classification methods. Some of them are given below.**

**1. CBA (Classification Based on Associations): It uses association rule techniques to classify data, which can be more accurate than traditional classification techniques. It is sensitive to the minimum support threshold: when a low minimum support threshold is specified, a large number of rules are generated.**

**2. CMAR (Classification based on Multiple Association Rules): It uses an efficient FP-tree, which consumes less memory and space compared to Classification Based on Associations. However, the FP-tree will not always fit in main memory, especially when the number of attributes is large.**

**3. 
CPAR (Classification based on Predictive Association Rules): CPAR combines the advantages of associative classification and traditional rule-based classification. It uses a greedy algorithm to generate rules directly from the training data. Furthermore, it generates and tests more rules than traditional rule-based classifiers to avoid missing important rules.**

**Other Classification Methods:**

### Genetic Algorithms:

### Fuzzy Set Approaches:

Example:

**Regression Analysis:**

**Linear Regression:**

![](media/image83.png)

#### Multiple Linear Regression:

#### Nonlinear Regression:

Transformation of a polynomial regression model to a linear regression model:

**x~1~ = x, x~2~ = x^2^, x~3~ = x^3^**

### Classifier Accuracy:

![](media/image85.png)

### Cluster Analysis:

#### Applications:

#### Typical Requirements of Clustering in Data Mining:

- Scalability:
- Ability to deal with different types of attributes:
- Discovery of clusters with arbitrary shape:
- Minimal requirements for domain knowledge to determine input parameters:
- Ability to deal with noisy data:
- Incremental clustering and insensitivity to the order of input records:
- High dimensionality:
- Constraint-based clustering:
- Interpretability and usability:

### Major Clustering Methods:

- **Partitioning Methods**
- **Hierarchical Methods**
- **Density-Based Methods**
- **Grid-Based Methods**
- **Model-Based Methods**

**Partitioning Methods:**

**Hierarchical Methods:**

- **The agglomerative approach, also called the bottom-up approach, starts with each object forming a separate group.
It successively merges the objects or groups that are close to one another, until all of the groups are merged into one or until a termination condition holds.**
- **The divisive approach, also called the top-down approach, starts with all of the objects in the same cluster. In each successive iteration, a cluster is split up into smaller clusters, until eventually each object is in one cluster, or until a termination condition holds.**
- **Perform careful analysis of object "linkages" at each hierarchical partitioning, such as in Chameleon, or**
- **Integrate hierarchical agglomeration and other approaches by first using a hierarchical agglomerative algorithm to group objects into microclusters, and then performing macroclustering on the microclusters using another clustering method such as iterative relocation.**

#### Density-based methods:

- **Most partitioning methods cluster objects based on the distance between objects. Such methods can find only spherical-shaped clusters and encounter difficulty at discovering clusters of arbitrary shapes.**
- **Other clustering methods have been developed based on the notion of density. Their general idea is to continue growing the given cluster as long as the density in the neighborhood exceeds some threshold; that is, for each data point within a given cluster, the neighborhood of a given radius has to contain at least a minimum number of points. Such a method can be used to filter out noise (outliers) and discover clusters of arbitrary shape.**
- **DBSCAN and its extension, OPTICS, are typical density-based methods that grow clusters according to a density-based connectivity analysis. DENCLUE is a method that clusters objects based on the analysis of the value distributions of density functions.**
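The density-based idea described above (keep growing a cluster while each point's neighborhood of a given radius contains at least a minimum number of points) can be illustrated with a small DBSCAN-style sketch. This is a simplified illustration under stated assumptions, not a reference implementation: the names `region_query`, `dbscan`, `eps`, and `min_pts`, the brute-force neighbor search, and the sample points are all chosen here for the example.

```python
from math import dist


def region_query(points, i, eps):
    """Return indices of all points within distance eps of points[i] (including i)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters and -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE            # may later be claimed as a border point
            continue
        cluster += 1                     # points[i] is a core point: start a new cluster
        labels[i] = cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster      # noise reachable from a core point becomes a border point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also a core point: keep growing
    return labels


# Hypothetical sample data: two dense groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

The isolated point is labeled -1, which matches the remark above that density-based methods can filter out noise (outliers).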
#### Grid-Based Methods:

- **Grid-based methods quantize the object space into a finite number of cells that form a grid structure.**
- **All of the clustering operations are performed on the grid structure, i.e., on the quantized space. The main advantage of this approach is its fast processing time, which is typically independent of the number of data objects and dependent only on the number of cells in each dimension in the quantized space.**
- **STING is a typical example of a grid-based method. WaveCluster applies wavelet transformation for clustering analysis and is both grid-based and density-based.**

#### Model-Based Methods:

- **Model-based methods hypothesize a model for each of the clusters and find the best fit of the data to the given model.**
- **A model-based algorithm may locate clusters by constructing a density function that reflects the spatial distribution of the data points.**
- **It also leads to a way of automatically determining the number of clusters based on standard statistics, taking "noise" or outliers into account and thus yielding robust clustering methods.**

### Tasks in Data Mining:

- **Clustering High-Dimensional Data**
- **Constraint-Based Clustering**

**Clustering High-Dimensional Data:**

**Constraint-Based Clustering:**

### Classical Partitioning Methods:

- **The *k*-Means Method**
- **The *k*-Medoids Method**

**Centroid-Based Technique: The *k*-Means Method:**

![](media/image87.png)

**The *k*-means partitioning algorithm:**

### The *k*-Medoids Method:

- Case 1:
- Case 2:
- Case 3:
- Case 4:

**Four cases of the cost function for *k*-medoids clustering**

**The *k*-medoids algorithm for partitioning based on medoid or central objects.**

![](media/image92.png)

### Hierarchical Clustering Methods:

#### Agglomerative hierarchical clustering:

#### Divisive hierarchical clustering:
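As a hedged illustration of the agglomerative (bottom-up) approach described under Hierarchical Methods, the sketch below starts with every object in its own group and repeatedly merges the two closest groups. Several choices here are assumptions made for the example and are not taken from the notes: single linkage (distance between the closest pair of members) as the closeness measure, "stop when `k` groups remain" as the termination condition, and the function names and sample points.

```python
from math import dist


def single_link(a, b):
    """Distance between two clusters, taken as the closest pair of members (single linkage)."""
    return min(dist(p, q) for p in a for q in b)


def agglomerative(points, k):
    """Merge the two closest clusters repeatedly until only k clusters remain."""
    clusters = [[p] for p in points]          # bottom-up: every object starts alone
    while len(clusters) > k:                  # assumed termination condition: k clusters left
        pairs = [(i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]))
        clusters[i].extend(clusters.pop(j))   # j > i, so index i stays valid after the pop
    return clusters


# Hypothetical sample data: two tight pairs and one distant point.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
result = agglomerative(points, k=3)
print(result)  # [[(0, 0), (0, 1)], [(5, 5), (5, 6)], [(10, 0)]]
```

Calling `agglomerative(points, k=1)` runs the merging all the way to a single group, which corresponds to the "until all of the groups are merged into one" case. The pairwise search is quadratic per merge, so this sketch is only suitable for small inputs.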
**Constraint-Based Cluster Analysis:**

- Constraints on individual objects:
- Constraints on the selection of clustering parameters:
- Constraints on distance or similarity functions:
- User-specified constraints on the properties of individual clusters:
- Semi-supervised clustering based on partial supervision:

**Outlier Analysis:**

### Types of outlier detection:

- **Statistical Distribution-Based Outlier Detection**
- **Distance-Based Outlier Detection**
- **Density-Based Local Outlier Detection**
- **Deviation-Based Outlier Detection**

**Statistical Distribution-Based Outlier Detection:**

- Inherent alternative distribution:
- Mixture alternative distribution:
- Slippage alternative distribution:
- Block procedures:
- Consecutive procedures:

**Distance-Based Outlier Detection:**

- Index-based algorithm:
- Nested-loop algorithm:
- Cell-based algorithm:

#### Density-Based Local Outlier Detection:

#### Deviation-Based Outlier Detection:

#### Sequential Exception Technique:

![](media/image99.png)

- Exception set:
- Dissimilarity function:
- Cardinality function:
- Smoothing factor:

**Social Media Mining**

**Social media is a great source of information and a platform for communication. Businesses and individuals can make more of it than simply sharing photos and videos. The platform gives users the freedom to connect easily with their target group. Both new groups and established businesses find it difficult to stand out in the competitive social media industry, but through social media platforms users can market or develop their brand or content and share it with others.**

**Social media mining combines social media platforms, social network analysis, and data mining to provide a convenient and consistent platform for learners, professionals, scientists, and project managers to understand the fundamentals and potentials of social media mining.
It discusses various problems arising from social media data and presents fundamental concepts, emerging issues, and effective algorithms for data mining and network analysis. It covers multiple degrees of difficulty that build knowledge and help in applying ideas, principles, and techniques in distinct social media mining situations.**

**As per the "Global Digital Report," the total number of active users on social media platforms worldwide in 2019 was 2.41 billion, growing by up to 9% year-on-year. With the universal use of social media platforms via the internet, a huge amount of data is accessible. Social media platforms touch many fields of study, such as sociology, business, psychology, entertainment, politics, news, and other cultural aspects of societies. Applying data mining to social media can provide exciting views on human behavior and human interaction. Data mining can be used in combination with social media to understand users' opinions about a subject, identify groups of individuals within a population, study how groups change over time, find influential people, or even suggest a product or activity to an individual.**

![Social Media Mining](media/image101.png)

**For example, the 2008 presidential election in the United States marked an unprecedented use of social media platforms. Social media platforms, including Facebook and YouTube, played a vital role in raising funds and getting candidates' messages to voters. Researchers extracted blog data to demonstrate correlations between the amount of social media use by candidates and the winner of the 2008 presidential campaign.**

**Web Mining**

**Web Mining is the application of [Data Mining](https://www.geeksforgeeks.org/data-mining/) techniques to automatically discover and extract information from Web documents and services. The main purpose of web mining is discovering useful information from the World-Wide Web and its usage patterns.**

**Applications of Web Mining:**

**Web mining is the process of discovering patterns, structures, and relationships in web data. It involves using data mining techniques to analyze web data and extract valuable insights. The applications of web mining are wide-ranging and include:**

**Personalized marketing: Web mining can be used to analyze customer behavior on websites and social media platforms. This information can be used to create personalized marketing campaigns that target customers based on their interests and preferences.**

**E-commerce: Web mining can be used to analyze customer behavior on e-commerce websites. This information can be used to improve the user experience and increase sales by recommending products based on customer preferences.**

**Search engine optimization: Web mining can be used to analyze search engine queries and search engine results pages (SERPs). This information can be used to improve the visibility of websites in search engine results and increase traffic to the website.**

**Fraud detection: Web mining can be used to detect fraudulent activity on websites. This information can be used to prevent financial fraud, identity theft, and other types of online fraud.**

**Sentiment analysis: Web mining can be used to analyze social media data and extract sentiment from posts, comments, and reviews. This information can be used to understand customer sentiment towards products and services and make informed business decisions.**

**Web content analysis: Web mining can be used to analyze web content and extract valuable information such as keywords, topics, and themes. This information can be used to improve the relevance of web content and optimize search engine rankings.**

**Customer service: Web mining can be used to analyze customer service interactions on websites and social media platforms.
This information can be used to improve the quality of customer service and identify areas for improvement.**

**Healthcare: Web mining can be used to analyze health-related websites and extract valuable information about diseases, treatments, and medications. This information can be used to improve the quality of healthcare and inform medical research.**

**Process of Web Mining:**

***Web Mining Process***

**Web mining can be broadly divided into three types of mining techniques: Web Content Mining, Web Structure Mining, and Web Usage Mining.**

![](media/image103.png)

***Categories of Web Mining***

1. Web Content Mining
2. Web Structure Mining
3. Web Usage Mining

**Comparison Between Data Mining and Web Mining:**

| **Points** | **Data Mining** | **Web Mining** |
|---|---|---|
| **Definition** | Data Mining is the process that attempts to discover patterns and hidden knowledge in large data sets in any system. | Web Mining is the application of data mining techniques to automatically discover and extract information from web documents. |
| **Application** | Data Mining is very useful for web page analysis. | Web Mining is very useful for a particular website and e-service. |
| **Target Users** | Data scientists and data engineers. | Data scientists along with data analysts. |
| **Access** | Data Mining accesses data privately. | Web Mining accesses data publicly. |
| **Structure** | Data Mining gets information from explicit structure. | Web Mining gets information from structured, unstructured, and semi-structured web pages. |
| **Problem Type** | Clustering, classification, regression, prediction, optimization, and control. | Web content mining, web structure mining. |
| **Tools** | It includes tools like machine learning algorithms. | Special tools for web mining are Scrapy, PageRank, and Apache logs. |
| **Skills** | It includes approaches for data cleansing, machine learning algorithms, statistics, and probability. | It includes application-level knowledge and data engineering with mathematical modules like statistics and probability. |
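The comparison above lists PageRank among the tools used in web mining; it is a link-analysis method that fits under web structure mining. The following is a minimal power-iteration sketch of the idea, not a production implementation. The function name `pagerank`, the damping factor of 0.85 (a commonly used value, assumed here), the fixed iteration count, and the tiny link graph with its page names are all hypothetical choices made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Estimate page ranks by power iteration.

    links maps each page to the list of pages it links to; every linked
    page must also appear as a key.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}            # start from a uniform distribution
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if outgoing:
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share     # pass rank along each outgoing link
            else:
                for target in pages:              # dangling page: spread its rank evenly
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical four-page site; "orphan" links out but nothing links to it.
links = {
    "home": ["about", "blog"],
    "about": ["home"],
    "blog": ["home", "about"],
    "orphan": ["home"],
}
ranks = pagerank(links)
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(f"{page}: {score:.3f}")
```

In this toy graph the page with the most incoming links ("home") ends up with the highest score, while the page with no incoming links keeps only the baseline share of (1 - damping) / n.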