Questions and Answers
Which property of a distance measure states that the distance is always non-negative?
The Mahalanobis distance can be zero only when the two points are identical.
True
What is the covariance matrix given in the Mahalanobis distance example?
[[0.3, 0.2], [0.2, 0.3]]
The property of similarity that shows it is the same regardless of the order of objects is called _______.
Match the following terms to their respective properties:
Which of the following describes the purpose of predictive modeling?
Association rules are primarily concerned with predicting future trends.
What is one example of a class attribute mentioned in the predictive modeling section?
In data mining, __________ refers to identifying unusual observations that differ from the majority of the data.
Match the data mining tasks with their descriptions:
Which attribute would most likely have a categorical value?
What type of analysis is performed when finding relationships between variables?
What is the main purpose of regression analysis?
The object catalog mentioned has a size of 150 GB.
How many new high red-shift quasars were found?
Regression is extensively studied in __________ and neural network fields.
Match the types of data with their respective sizes:
What types of characteristics are included in the classification of galaxies?
Regression analysis can only be applied to linear relationships.
What are two examples of predictions made using regression analysis?
The study of __________ includes investigating the stages of formation of galaxies.
Which of the following is NOT an example of regression analysis?
Which of the following characteristics of data can complicate the recognition of the proper attribute type?
High dimensional data brings several challenges in data analysis.
What type of data consists of a collection of records, each with a fixed set of attributes?
The __________ data type consists of documents represented as term vectors.
Match the type of data with its correct description:
Which property of data refers to how the scale can affect pattern recognition?
The presence of some attributes in sparse data is always sufficient for analysis.
What is the main challenge related to sparsity in data?
Data objects with the same fixed set of attributes can be visualized in a ________-dimensional space.
Which of the following is NOT a type of data set mentioned?
What is an example of noise in data?
Outliers are always considered noise in data analysis.
What are two reasons for missing values in a dataset?
Duplicate data issues often arise when merging data from __________ sources.
Match the following data quality problems with their definitions:
What can be done to handle missing values in a dataset?
Having multiple email addresses for the same person is an example of duplicate data.
What is a similarity measure in data analysis?
To clean duplicate data, a process known as __________ is utilized.
Which of the following is NOT described as a data quality problem?
What is the main purpose of the mutual information approach as described?
An indicator variable can take on the value of -1 when both objects have a value of 0 for a symmetric attribute.
What are the two main reasons for using weights when calculating similarities between attributes?
In data preprocessing, the process of __________ refers to reducing the number of attributes or objects combined into a single attribute.
Match the following data preprocessing techniques to their descriptions:
What is a primary reason for the strong competitive pressure in data mining?
Data mining is primarily concerned with collecting data rather than analyzing it.
Name one type of data that is extensively collected by e-commerce websites.
In data mining, unusual observations that differ from the majority of the data are referred to as __________.
Match the following types of data with their descriptions:
What challenge is often faced in data analysis due to high dimensional data?
Give one example of how data can be gathered from social networking sites.
Which of the following is a characteristic that can complicate the recognition of the proper attribute type?
The presence of sparse data means that all attributes in the dataset are equally important.
What is meant by 'dimensionality' in the context of data analysis?
Data represented as __________ provides a multidimensional view of objects based on their attributes.
Match the types of data sets with their definitions:
Which operation is meaningful for categorical data?
High dimensional data provides fewer challenges for data analysis compared to low dimensional data.
What characterizes document data in data mining?
The scale at which data is analyzed refers to its __________.
Which characteristic of data refers to the amount of space occupied by the dataset?
What type of data consists of a collection of records with a fixed set of attributes?
What is an example of a transaction in a grocery store?
The __________ refers to the quality of the data that can negatively impact data processing efforts.
Match the following types of data with their examples:
Which of these data types can represent relationships between variables?
The average monthly temperature is an example of spatio-temporal data.
In data mining, __________ involves identifying unusual observations that differ from the majority of the data.
Poor data quality can lead to which of the following issues?
What distance measure is defined as the maximum difference between any component of the vectors?
The Euclidean distance is always less than or equal to the Manhattan distance for any two points.
Write the formula for Mahalanobis distance.
The Hamming distance is a special case of the ______ distance, applicable to binary vectors.
Match the following types of distances with their appropriate descriptions:
Which distance measure would be most appropriate for identifying outliers in a dataset with correlated features?
The Mahalanobis distance can be greater than the Euclidean distance for the same pair of points.
What does the covariance matrix (Σ) represent in the Mahalanobis distance formula?
For binary vectors, the _______ distance can be used to calculate the number of different bits.
If two points are identical, what will the Mahalanobis distance equal?
Study Notes
Data Mining: Introduction
- Data mining is the process of extracting implicit, previously unknown, and potentially useful information from data.
- Large-scale data growth exists in commercial and scientific databases. This is due to advancements in data generation and collection technologies.
- The new data mining mantra is to gather whatever data possible, whenever and wherever possible.
- Data mining is often used to gain a competitive advantage, create more personalized experiences, and solve large-scale scientific problems.
- Data mining is a key component of data science.
Why Data Mining?
- Commercial viewpoints: Vast amounts of data are being collected and processed, and companies often use these data to gain a competitive edge. Businesses collect data in order to develop new products and services, personalize customer experiences, improve customer relationship management, and enhance productivity.
- Scientific viewpoints: In scientific settings, data mining is used in hypothesis formation and analysis of massive datasets from remote sensors, telescopes, and scientific simulations.
- Societal problems: Data mining provides a tool to improve health care, predict climate change effects, and develop ways to reduce world hunger and poverty.
What is Data Mining?
- Data mining is a non-trivial process of extracting implicit, previously unknown, and potentially useful patterns from large amounts of data using automatic or semi-automatic means.
- Key steps in data mining usually include data preprocessing, data mining, and postprocessing.
Origins of Data Mining
- Draws from statistics, machine learning, pattern recognition, and database systems.
- Traditional techniques are often unable to extract patterns from large-scale data.
- Modern techniques leverage database technology, parallel computing, and distributed computing.
Data Mining Tasks
- Prediction methods: Use some variables to predict unknown or future values of other variables.
- Description methods: Find interpretable patterns in data to better describe the data.
- Data mining tasks include clustering, predictive modeling/classification, association rule discovery, and anomaly detection.
Predictive Modeling: Classification
- Find a model for the class attribute as a function of values for other attributes.
- Predicting creditworthiness is one use of this technique.
Classification Example
- Classifiers learn from a training set of data to predict the class for unknown data records.
Classification: Applications
- The applications of classification involve many areas, including credit card fraud detection, intrusion detection, identification of tumor cells, and categorizing news stories.
Classification: Application 1 – Fraud Detection
- Using data regarding customer credit card transactions to predict fraudulent cases, this process labels transactions as fraudulent or legitimate.
Classification: Application 2 – Churn Prediction
- Uses detailed transaction records to predict which customers are likely to be lost to a competitor. This is commonly applied in telecommunications.
Classification: Application 3 – Sky Survey Cataloging
- Used to predict the class (stars or galaxies) of sky objects in telescopic survey images from the Palomar Observatory.
Classifying Galaxies
- Data size includes 72 million stars, 20 million galaxies, an object catalog (9 GB), and an image database (150 GB).
Regression
- Predict a continuous variable based on other variables.
- Methodologies include studying linear and nonlinear models in statistics and using neural networks.
- Examples include predicting sales from advertising expenditure and predicting wind velocity from other measured variables.
Clustering
- Finding groups of similar objects based on their characteristics.
- Clustering minimizes intra-cluster distances while maximizing inter-cluster distances.
Applications of Cluster Analysis
- Custom profiling for targeted marketing
- Group related documents
- Group similar genes and proteins
- Group stocks with similar price fluctuations
- Reduce the size of large datasets
Clustering: Application 1 – Market Segmentation
- Subdividing a market into customer subsets that can be targeted for a unique marketing mix.
- Customers' geographic and lifestyle attributes are analyzed to identify clusters of similar customers.
Clustering: Application 2 – Document Clustering
- Grouping similar documents based on frequently occurring terms within those documents.
Association Rule Discovery: Definition
- Finding dependency rules between items in a dataset.
- Based on the occurrences of items.
Association Analysis: Applications
- Rules can promote sales and help manage inventory.
- Rules are useful in telecommunication alarm diagnosis.
- Rules are also useful in medical informatics.
Association Analysis: Example
- An example subspace differential coexpression pattern from a lung cancer dataset.
Deviation/Anomaly/Change Detection
- This technique detects significant deviations from normal behavior.
- Examples include detecting credit card fraud, network intrusion, and detecting changes in the global forest cover using sensor networks.
Motivating Challenges
- Scalability
- High dimensionality
- Heterogeneous and complex data
- Data ownership and distribution
- Non-traditional analysis
Data Mining: Data
- Attributes and objects
- Types of data
- Data quality
- Similarity and distance
- Data preprocessing
What is Data?
- Collection of data objects and their attributes.
- An attribute is a property or characteristic of an object.
- An object is described by a collection of attributes.
Attribute Values
- Values that are assigned to attributes.
- Height can be measured in either feet or meters.
- Data types and values have different properties.
Measurement of Length
- How an attribute is measured can matter.
Types of Attributes
- Nominal (zipcodes, eye color)
- Ordinal (rankings)
- Interval (calendar dates)
- Ratio (temperature in Kelvin)
Properties of Attribute Values
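- Distinctness: values can be compared for equality (=, ≠).
- Order: values can be ranked (<, >).
- Differences are meaningful (+, -).
- Ratios are meaningful (×, ÷).
- Nominal attributes have only distinctness; ordinal attributes add order; interval attributes add meaningful differences; ratio attributes have all four properties.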
Difference Between Ratio and Interval
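- Is it physically meaningful to say that a temperature of 10° is twice that of 5°? Only on a ratio scale such as Kelvin, which has a true zero; on the Celsius and Fahrenheit scales (interval attributes) the ratio has no physical meaning.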
Attribute Types
- Categorical attributes (nominal, ordinal)
- Numerical attributes (interval, ratio)
Discrete and Continuous Attributes
- Discrete attributes: Finite or countably infinite set of values.
- Continuous attributes: Real numbers as values, so the set of possible values can be very large.
Asymmetric Attributes
- Only presence of value is considered important (e.g., words in documents, items in customer transactions).
Critiques of the attribute categorization
- Asymmetric, cyclic, and multivariate attributes
- Partial order, and relational attributes
- Data is approximate, noisy, and incomplete.
Key messages for attribute types
- The type of operations suitable for attributes depends on the characteristics of the data.
- Various properties of data must be considered.
- Use data types for the most accurate analysis.
Important Characteristics of Data
- Dimensionality (number of attributes)
- Sparsity (how sparse the data is)
- Resolution (granularity or scale)
- Size (amount of data)
Types of Data Sets
- Record data
- Data Matrix
- Document data
- Transaction data
- Graph data
- World Wide Web
- Molecular Structures
- Ordered data
- Spatial data
- Temporal data
- Sequential data
- Genetic sequence data
Record Data
- Record data has fixed attributes.
Data Matrix
- Data objects can be thought of as points in a multi-dimensional space.
- Each dimension represents a distinct attribute.
Document Data
- Each document is represented as a term vector, and each term is a component (attribute) of the vector.
- The value of a component represents the number of times the term occurs in the document.
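As a rough illustration of the term-vector idea, the sketch below counts how often each vocabulary term occurs in a document. The vocabulary, the two documents, and the helper name `term_vector` are hypothetical and chosen only for this example; real pipelines usually add tokenization and stop-word handling.

```python
from collections import Counter

def term_vector(text, vocabulary):
    """Return term counts for `text`, one component per vocabulary term."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

# Hypothetical vocabulary and documents, used only for illustration.
vocabulary = ["team", "coach", "play", "ball", "score"]
doc1 = "team play ball team score"
doc2 = "coach play play"

print(term_vector(doc1, vocabulary))  # [2, 0, 1, 1, 1]
print(term_vector(doc2, vocabulary))  # [0, 1, 2, 0, 0]
```

Most components are zero for any realistic vocabulary, which is why document data is typically sparse.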
Transaction Data
- A special type of data where each transaction is a set of items.
- Example is a grocery store transaction.
Graph Data
- Generic graphs, molecular structures, and linked web pages are examples of graph data.
Ordered Data
- Sequences of transactions
- Genomic data
- Spatio-temporal data
Data Quality
- Poor data quality negatively impacts data mining processes.
- Noise and outliers, wrong data, fake data, missing values, and duplicate data are examples of data quality problems.
Noise
- For objects, noise is an extraneous object; for attributes, noise is a modification of the original values.
- Noise distorts the data.
Outliers
- Data objects with unusual characteristics.
Missing Values
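- Values can be missing for many reasons, for example because the information was not collected or because the attribute does not apply to every object.
- Handle them by imputing (estimating) the missing values or by removing the affected entries.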
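The two options for missing values can be sketched as follows. This is a minimal example that assumes missing entries are encoded as `None` and uses mean imputation, which is only one of several possible imputation choices; the function names and the `ages` list are hypothetical.

```python
def drop_missing(values):
    """Remove entries whose value is missing (None)."""
    return [v for v in values if v is not None]

def impute_mean(values):
    """Replace each missing entry with the mean of the observed entries."""
    observed = drop_missing(values)
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

ages = [20, None, 30, 40, None]  # hypothetical attribute with gaps
print(drop_missing(ages))  # [20, 30, 40]
print(impute_mean(ages))   # [20, 30.0, 30, 40, 30.0]
```

Removing entries shrinks the dataset, while imputation keeps every object at the cost of introducing estimated values.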
Duplicate Data
- Duplicate data objects can complicate data mining.
Similarity and Dissimilarity Measures
- Similarity is a numerical measure of how alike two objects are; it is higher when objects are more alike and often falls in the range [0, 1].
- Dissimilarity is a numerical measure of how different two objects are.
Similarity/Dissimilarity for Simple Attributes
- Calculating similarities and dissimilarities for each attribute type.
Euclidean Distance
- A way to calculate distance between two objects given all attributes are numeric.
Minkowski Distance
- Measures the distance between two objects described by one or more numeric attributes.
- Generalizes Euclidean distance; the parameter r selects the specific distance (for example, r = 2 gives Euclidean distance).
Minkowski Distance: Examples
- City block (Manhattan, taxicab; r = 1) and Euclidean (r = 2) distance are special cases of Minkowski distance.
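A minimal sketch of the Minkowski distance with parameter r may help here. It assumes two equal-length numeric vectors and treats r = infinity as the supremum distance (the maximum difference over components); the function name and sample points are illustrative only.

```python
import math

def minkowski(x, y, r):
    """Minkowski distance between equal-length numeric vectors x and y."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(r):
        return max(diffs)  # r -> infinity: supremum distance
    return sum(d ** r for d in diffs) ** (1 / r)

p, q = (0, 2), (3, 1)  # illustrative points
print(minkowski(p, q, 1))                    # 4.0 (city block)
print(round(minkowski(p, q, 2), 3))          # 3.162 (Euclidean)
print(minkowski(p, q, math.inf))             # 3 (supremum)
```

For the same pair of points the value shrinks as r grows, so the Euclidean distance never exceeds the city block distance.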
Mahalanobis Distance
- Accounts for the correlation between attributes and the variability of the data.
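The sketch below computes a Mahalanobis distance in two dimensions using the covariance matrix [[0.3, 0.2], [0.2, 0.3]] from the quiz above. It assumes the square-root form d(x, y) = sqrt((x - y)^T Σ^-1 (x - y)); some presentations omit the square root. The three points and the helper name `mahalanobis_2d` are illustrative assumptions.

```python
import math

def mahalanobis_2d(x, y, cov):
    """Mahalanobis distance between 2-D points for a 2x2 covariance matrix."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    # Inverse of a 2x2 matrix: swap the diagonal, negate the off-diagonal, divide by det.
    inv = [[d / det, -b / det], [-c / det, a / det]]
    dx, dy = x[0] - y[0], x[1] - y[1]
    quad = dx * (inv[0][0] * dx + inv[0][1] * dy) + dy * (inv[1][0] * dx + inv[1][1] * dy)
    return math.sqrt(quad)

cov = [[0.3, 0.2], [0.2, 0.3]]
A, B, C = (0.5, 0.5), (0.0, 1.0), (1.5, 1.5)  # illustrative points
print(round(mahalanobis_2d(A, B, cov), 3))  # 2.236
print(round(mahalanobis_2d(A, C, cov), 3))  # 2.0
print(round(math.dist(A, B), 3), round(math.dist(A, C), 3))  # 0.707 1.414
```

B is closer to A than C is in Euclidean terms, yet farther in Mahalanobis terms, because the step from A to B runs against the positive correlation encoded in the covariance matrix.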
Common Properties of a Distance
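- Non-negativity: d(x, y) ≥ 0 for all x and y, with d(x, y) = 0 only if x = y
- Symmetry: d(x, y) = d(y, x)
- Triangle inequality: d(x, z) ≤ d(x, y) + d(y, z)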
Common Properties of a Similarity
- Reflexivity: s(x,y) = 1 if x = y
- Symmetry: s(x,y) = s(y,x)
Similarity Between Binary Vectors
- SMC (Simple Matching) and Jaccard Coefficient measure similarity between two objects with only binary attributes.
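To make the difference between the two coefficients concrete, here is a small sketch. It assumes the usual definitions: SMC counts both 1-1 and 0-0 matches, while Jaccard ignores 0-0 matches. The vectors and the function name are illustrative.

```python
def smc_and_jaccard(x, y):
    """Return (SMC, Jaccard) for two equal-length binary vectors."""
    f11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    f00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    mismatches = len(x) - f11 - f00
    smc = (f11 + f00) / len(x)
    # Jaccard ignores 0-0 matches; define it as 0.0 when no 1 appears at all.
    jaccard = f11 / (f11 + mismatches) if (f11 + mismatches) else 0.0
    return smc, jaccard

x = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
print(smc_and_jaccard(x, y))  # (0.7, 0.0)
```

The vectors look similar under SMC only because of shared zeros, which is why Jaccard is usually preferred for asymmetric binary attributes.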
Cosine Similarity
- Measuring similarity between two objects considering the angle between the vectors.
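A short sketch of cosine similarity follows, assuming the standard definition cos(x, y) = (x · y) / (‖x‖ ‖y‖) and non-zero vectors. The two term-count vectors are illustrative.

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two non-zero numeric vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]  # illustrative term counts
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]
print(round(cosine_similarity(d1, d2), 4))  # 0.315
```

Because only the angle matters, scaling either vector (for example, a longer document with the same term proportions) leaves the similarity unchanged.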
Correlation measures the linear relationship between objects
- Correlation measures the strength of the linear relationship between the attribute values of two objects.
Visually Evaluating Correlation
- Scatter plots are used to visually assess the relationship between two attributes.
Drawback of correlation
- The correlation can be zero, but there may still be a nonlinear relationship.
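The drawback can be seen in a tiny example. The sketch below assumes Pearson's correlation and uses illustrative data where y is exactly x squared, so y is fully determined by x even though the linear correlation is zero.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

x = [-3, -2, -1, 0, 1, 2, 3]
y = [v * v for v in x]  # perfect, but non-linear, dependence on x
print(pearson(x, y))  # 0.0
```

A scatter plot of these points would show a clear parabola, which is one reason to inspect the data visually rather than rely on the coefficient alone.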
Correlation vs Cosine vs Euclidean Distance
- Different proximity measures can be appropriate for different scenarios depending on data types.
Comparison of Proximity Measures
- Properties of a proximity measure that are important from different application perspectives.
Information Based Measures
- Information theory is a well-developed concept that can be used to develop similarity measures.
- Such measures can capture non-linear relationships in the data.
Information and Probability
- The more certain an outcome is, the less information it provides (and vice-versa).
- Entropy is a commonly used tool for information measure.
Entropy
- A tool for measuring uncertainty.
- For a discrete variable with outcome probabilities p_i, entropy is H = -Σ p_i log2(p_i), measured in bits; the more evenly spread the probabilities, the higher the entropy.
Entropy Examples
- Example of calculating entropy for coins, and dice.
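As a sketch of the coin and dice cases, the code below assumes the standard Shannon entropy in bits, H = -Σ p_i log2(p_i), and skips zero-probability outcomes. The probability lists are illustrative.

```python
import math

def entropy(probabilities):
    """Shannon entropy in bits of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

print(entropy([0.5, 0.5]))               # 1.0 (fair coin)
print(round(entropy([0.9, 0.1]), 3))     # 0.469 (biased coin)
print(round(entropy([1 / 6] * 6), 3))    # 2.585 (fair six-sided die)
```

The biased coin carries less information per toss than the fair coin because its outcome is more predictable.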
Entropy for Sample Data: Example
- An example of calculating entropy for hair color data.
Entropy for Sample Data
- Calculating the entropy of some attribute given the number of times each observation occurs.
Mutual Information
- Measuring the amount of information that one variable provides about another variable.
- This measure is useful for determining if the variables are related and if one variable can be used to predict the other.
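One way to estimate mutual information from paired observations is the identity I(X; Y) = H(X) + H(Y) - H(X, Y), with entropies estimated from observed frequencies. The sketch below assumes that approach; the sample lists and function names are hypothetical.

```python
import math
from collections import Counter

def entropy_of(values):
    """Entropy in bits estimated from the observed frequencies of `values`."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def mutual_information(xs, ys):
    """I(X; Y) = H(X) + H(Y) - H(X, Y) for paired observations."""
    return entropy_of(xs) + entropy_of(ys) - entropy_of(list(zip(xs, ys)))

xs = ["a", "a", "b", "b"]
dependent = ["u", "u", "v", "v"]    # fully determined by xs
independent = ["u", "v", "u", "v"]  # unrelated to xs

print(mutual_information(xs, dependent))    # 1.0
print(mutual_information(xs, independent))  # 0.0
```

A value of zero means knowing one variable tells you nothing about the other; the maximum is reached when one variable determines the other.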
Mutual Information Examples
- Calculate mutual information for example data using the provided dataset/formula.
Maximal Information Coefficient
- Applies mutual information to two continuous variables, capturing relationships that need not be linear.
General Approach for Combining Similarities
- Creating a composite similarity measure when attributes are of different types.
Using Weights to Combine Similarities
- Assigning weights to attributes so that the combined similarity is more meaningful when the attributes are not all of the same type.
Data Preprocessing Methods
- Aggregation, sampling, discretization/binarization, attribute transformation, dimensionality reduction, feature subset selection, and feature creation.
Aggregation
- Combining attributes or objects.
- Reduce the number of attributes or objects.
Data Aggregation Example
- Examples of aggregation for various data points, showing data scaling and stability.
Example: Precipitation in Australia
- Example of data preprocessing.
- Precipitation patterns in Australia demonstrate how data are aggregated by month or year.
Sampling
- A technique to reduce data size by only including a fraction of the data points.
- The resulting sample must be representative of the original set.
- For samples to be representative of the original dataset, they need to capture the same properties/distributions.
Sample Size
- The size of the sample affects the final result.
Types of Sampling
- Simple random sampling, sampling with replacement, sampling without replacement, stratified sampling.
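The sampling schemes listed above can be sketched with Python's standard `random` module. The helper names, the fixed seed, and the toy dataset are assumptions made for illustration; the stratified version here simply draws the same number of items from each group, which is one of several ways to stratify.

```python
import random
from collections import defaultdict

def sample_without_replacement(data, k, rng):
    """Each item can be picked at most once."""
    return rng.sample(data, k)

def sample_with_replacement(data, k, rng):
    """The same item can be picked more than once."""
    return rng.choices(data, k=k)

def stratified_sample(data, key, per_group, rng):
    """Partition the data by `key`, then draw a random sample from each partition."""
    groups = defaultdict(list)
    for item in data:
        groups[key(item)].append(item)
    sample = []
    for members in groups.values():
        sample.extend(rng.sample(members, min(per_group, len(members))))
    return sample

rng = random.Random(0)  # fixed seed so runs are repeatable
data = [("rare", i) for i in range(5)] + [("common", i) for i in range(95)]
strat = stratified_sample(data, key=lambda item: item[0], per_group=3, rng=rng)
print(sorted(label for label, _ in strat))
```

Stratification guarantees that the rare group appears in the sample, which simple random sampling of six items from this dataset would often fail to do.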
Attribute Transformation
- An attribute transformation is a function that maps the entire set of values of an attribute to a new set of replacement values (for example, normalization or standardization).
- This is used to account for variability differences among attributes.
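Two common transformations that put attributes on comparable scales are sketched below: min-max normalization to [0, 1] and standardization to zero mean and unit variance. The sketch assumes the population standard deviation and a non-constant attribute; the `incomes` values are hypothetical.

```python
import math

def min_max(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]

incomes = [20000, 40000, 60000]  # hypothetical attribute values
print(min_max(incomes))                             # [0.0, 0.5, 1.0]
print([round(z, 3) for z in standardize(incomes)])  # [-1.225, 0.0, 1.225]
```

Without such rescaling, an attribute measured in large units (such as income) would dominate distance calculations over one measured in small units (such as age).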
Example: Sample Time Series of Plant Growth
- Data about precipitation in Australia is aggregated, and analyzed to show examples of how data may be better analyzed when aggregated.
Seasonality Accounts for Much Correlation
- Seasonality in the data can affect the relationships between attributes.
Curse of Dimensionality
- Data become increasingly sparse as dimensionality grows, which makes accurate analysis harder.
Dimensionality Reduction
- Ways to reduce the number of attributes (dimensions) in a dataset.
- PCA and SVD are examples of dimensionality reduction techniques.
Dimensionality Reduction: PCA
- PCA is an example of a dimensionality reduction technique.
- This method finds the best projection of the data to preserve the important information/variability.
Feature Subset Selection
- Selecting a subset of attributes based on their importance to the model.
Feature Creation
- Creating new features that are more informative than the existing features.
Mapping Data to a New Space
- Using a mapping function to transform the data with the intention of better capturing important relations.
Classification Techniques - Base Classifiers
- Decision tree, rule-based, nearest neighbor, Naive Bayes, Support Vector Machines, Neural Networks.
Ensemble Classifiers
- Boosting, bagging, and random forests are examples of ensemble classifiers.
Example of a Decision Tree
- Example of a Decision Tree model and its processes.
Apply Model to Test Data
- How to use the model to classify data points with unknown classes.
Another Example of Decision Tree
- Another example of a decision tree classification model
Decision Tree Classification Task
- Example of a decision tree classification model and its components.
Model Overfitting
- Overfitting occurs when a model fits the training data too closely and therefore fails to generalize to unseen data.
Classification Errors
- Training errors and test errors measure errors found with training and test sets of data.
Generalization Errors
- Expected errors in the model on randomly selected data from the original distribution.
Example Data Set
- Example of a dataset containing two classes and noisy instances that is used for testing classification models.
Increasing Number of Nodes in Decision Trees
- Graph showing how model error changes as the number of nodes increases.
Decision Tree with 4 nodes
- Graph showing how the model would classify data points with a relatively small number of nodes.
Decision Tree with 50 nodes
- Graph showing how the model would classify data points with a relatively large number of nodes.
Model Underfitting and Overfitting
- Analyzing how training and testing errors change as a function of model complexity.
Model Overfitting – Impact of Training Data Size
- Demonstrating that larger training dataset sizes can decrease error rates for models of varying complexity.
Data Mining Classification: Alternative Techniques
- Rule-based and nearest-neighbor classifiers used in classification.
Rule-Based Classifier
- Classification using IF-THEN rules.
Rule-Based Classifier (Example)
- Example of a rule-based classifier for animal classification, based on their attributes
Application of Rule-Based Classifier
- Defining how rules can cover instances.
Rule Coverage and Accuracy
- Coverage: Fraction of records that satisfy the rule's antecedent.
- Accuracy: Fraction of the records satisfying the antecedent that also satisfy the consequent.
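Coverage and accuracy of a single rule can be computed directly from a set of records. In the sketch below the records, the attribute names, and the rule "status = Single → class = No" are all made up for illustration.

```python
def coverage_and_accuracy(records, antecedent, predicted_class):
    """Return (coverage, accuracy) of the rule `antecedent -> predicted_class`."""
    covered = [r for r in records if antecedent(r)]
    coverage = len(covered) / len(records)
    correct = sum(1 for r in covered if r["class"] == predicted_class)
    accuracy = correct / len(covered) if covered else 0.0
    return coverage, accuracy

# Hypothetical training records: 4 of 10 are Single, and 2 of those have class "No".
records = (
    [{"status": "Single", "class": "No"}] * 2
    + [{"status": "Single", "class": "Yes"}] * 2
    + [{"status": "Married", "class": "No"}] * 4
    + [{"status": "Divorced", "class": "Yes"}] * 2
)

coverage, accuracy = coverage_and_accuracy(
    records, lambda r: r["status"] == "Single", "No"
)
print(coverage, accuracy)  # 0.4 0.5
```

A rule can have high accuracy but low coverage (it is right, but rarely applies), or the reverse, so both numbers are usually reported together.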
How does Rule-based Classifier Work?
- Describing how a rule-based classifier works for a given set of data examples.
Characteristics of Rule Sets: Strategy 1
- Mutually exclusive and exhaustive rules: the rules are independent of each other, so every record is covered by at most one rule (mutually exclusive) and by at least one rule (exhaustive).
Characteristics of Rule Sets: Strategy 2
- Rules are not necessarily mutually exclusive; they may overlap.
- A record may trigger more than one rule, and some records may trigger no rule at all; a default class is useful for the uncovered records.
Ordered Rule Set
- Rules ordered by priority in a rule set.
- A test record is assigned the class of the highest-ranked rule it triggers; if it triggers no rule, it is assigned the default class.
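As a rough sketch of how an ordered rule set (a decision list) might be applied, the function below returns the class of the first rule that fires and falls back to a default class otherwise. The rules, attribute names, and the function name `classify_with_ordered_rules` are illustrative assumptions.

```python
def classify_with_ordered_rules(attributes, ordered_rules, default_class):
    """Return the class of the first (highest-priority) rule the record triggers."""
    for antecedent, label in ordered_rules:
        if all(attributes.get(name) == value for name, value in antecedent.items()):
            return label
    # No rule fired, so fall back to the default class.
    return default_class


# Hypothetical rules, listed from highest to lowest priority.
rules = [
    ({"gives_birth": "no", "can_fly": "yes"}, "bird"),
    ({"gives_birth": "yes"}, "mammal"),
    ({"lives_in_water": "yes"}, "fish"),
]

# This record triggers both the second and the third rule; the second wins.
dolphin_like = {"gives_birth": "yes", "can_fly": "no", "lives_in_water": "yes"}
# This record triggers no rule and receives the default class.
turtle_like = {"gives_birth": "no", "can_fly": "no", "lives_in_water": "no"}

dolphin_class = classify_with_ordered_rules(dolphin_like, rules, "unknown")
turtle_class = classify_with_ordered_rules(turtle_like, rules, "unknown")
```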
Rule Ordering Schemes
- Rule-based ordering: individual rules are ranked by their quality.
- Class-based ordering: rules that predict the same class are grouped together, and the groups are ordered by class.
Building Classification Rules
- Direct methods develop rules from data directly.
- Indirect methods develop rules based on pre-trained models like decision trees.
Direct Method: Sequential Covering
- A direct method of producing rules from data, starting with an empty rule, growing rules based on information gain, and removing training records when covered.
Example of Sequential Covering
- Visual examples of a sequential covering algorithm on sample data.
Rule Growing
- Common strategies in rule growing.
Rule Evaluation
- Estimating the information gain of one rule given prior rule knowledge and data.
Direct Method: RIPPER
- RIPPER is a direct method of learning rules from data to perform classification.
Direct Method: RIPPER
- Optimizing the rule set using the MDL principle.
- Re-evaluate the current rule set using 2 alternatives to minimize MDL.
Indirect Methods
- Techniques to derive rules from other classification models (e.g., decision trees, neural networks).
Indirect Method: C4.5rules
- Develop rules from an unpruned decision tree.
Indirect Method: C4.5rules
- Subsets of rules based on class ordering and measures of description length.
Example
- Example data used for practicing a rule-based classifier, based on identifying animals from their attributes.
C4.5 versus C4.5rules versus RIPPER
- Comparing a decision tree learner (C4.5) with two rule learners (C4.5rules and RIPPER) based on their predictions on a set of animal sample data.
Advantages of Rule-Based Classifiers
- Comparatively easier to interpret than other models.
- Can be efficient in computational terms.
- Can also handle redundant/irrelevant features.
Data Mining Classification: Alternative Techniques
- A cover of instances occurs when the attributes of the instance satisfy the condition of the rule.
Rule-Based Classifier
- Classifies records using if-then rules
Rule-Based Classifier (Example)
- Example classifier for classifying animal types based on their features.
Application of Rule-Based Classifier
- Using a set of learned rules for classifying new instances.
Nearest Neighbor Classifier
- Basic strategy: classifying instances based on their proximity to known instances. Methods include calculating distance metrics, and weighting the neighbors based on distance/proximity to the new instance.
Nearest-Neighbor Classifiers
- Classifying unknown instances by determining the class labels for those similar to the unclassified data point.
- Calculating the distance between points to use for classification.
Nearest Neighbor Classification...
- Data preprocessing is often helpful for the distance measurements to ensure different features don't have a dominating effect.
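One common form of such preprocessing is min-max scaling, which maps every attribute to the range [0, 1]. The sketch below assumes numeric attributes; the height and income values and the helper name `min_max_scale` are made up for illustration.

```python
def min_max_scale(rows):
    """Rescale every column of a list of numeric tuples to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(column) for column in columns]
    highs = [max(column) for column in columns]
    return [
        tuple(
            (value - low) / (high - low) if high > low else 0.0  # constant column -> 0
            for value, low, high in zip(row, lows, highs)
        )
        for row in rows
    ]


# Hypothetical records: (height in metres, income in dollars).
# Without scaling, income would dominate any Euclidean distance.
rows = [(1.5, 30000.0), (1.8, 90000.0), (1.6, 60000.0)]
scaled = min_max_scale(rows)
```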
Nearest Neighbor Classification...
- Choosing k matters. If k is too small, the prediction is sensitive to noise and outliers; if k is too large, the neighborhood may include points from other classes.
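The effect of k can be seen in a small sketch of a k-nearest-neighbor classifier using Euclidean distance and an unweighted majority vote. The training points, including the deliberately noisy one, are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train, query, k):
    """Predict the majority class among the k training points closest to query."""
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical training set: ((x, y), class label).
train = [
    ((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
    ((3.0, 3.0), "B"), ((3.2, 2.9), "B"),
    ((1.1, 1.0), "B"),  # a noisy "B" point inside the "A" region
]
query = (1.1, 1.02)

prediction_k1 = knn_predict(train, query, k=1)  # follows the single noisy neighbor
prediction_k3 = knn_predict(train, query, k=3)  # the noisy neighbor is outvoted
```

With k = 1 the noisy point decides the label; with k = 3 the two nearby "A" points outvote it.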
Nearest-neighbor classifiers
- These classifiers are local classifiers.
- They can produce decision boundaries of arbitrary shapes.
Nearest Neighbor Classification...
- Missing values are a challenge for nearest-neighbor methods because a distance cannot be computed directly between records with missing attribute values.
- Different methods for handling missing data exist.
K-NN Classifiers...
- Irrelevant and redundant attributes add noise to the data, thus possibly giving biased results.
- Removing these may improve the results.
K-NN Classifiers: Handling attributes that are interacting
- Linear and nonlinear variables may complicate the decision boundaries if not handled correctly.
Handling attributes that are interacting
- Attributes/variables that have interacting relationships may produce complicated classification regions.
- The possible solution is to create a more sophisticated classifier or analysis.
Improving KNN Efficiency
- Methods for improving efficiency of KNN computations are useful for improving on speed and scalability.
Bayesian Classifiers
- Classification using probabilistic methods.
Bayes Classifier
- Probability-based method for classification where posterior probabilities are calculated based on Bayes' theorem.
Using Bayes Theorem for Classification
- A way to estimate the posterior probabilities, using Bayes' theorem.
- The goal is to find the class value that maximizes the posterior probability.
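Written out, with X denoting the attribute values of a record and Y its class (notation assumed here for illustration), Bayes' theorem and the resulting decision rule are:

```latex
P(Y \mid X) = \frac{P(X \mid Y)\, P(Y)}{P(X)},
\qquad
\hat{y} = \arg\max_{y} \; P(X \mid Y = y)\, P(Y = y)
```

The denominator P(X) is the same for every class, so it can be dropped when only the maximizing class is needed.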
Example Data
- Examples of a data record, including relevant attribute values.
- The purpose is to provide examples of a dataset.
Example Data
- Using examples of data to estimate the probability for each scenario (like evade or not evade).
Conditional Independence
- X is conditionally independent of Y given Z if, once Z is known, the probability of X does not depend on Y: P(X | Y, Z) = P(X | Z).
Naïve Bayes Classifier
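- Assumes the attributes are conditionally independent of each other given the class, so the class-conditional probability of a record is the product of the per-attribute probabilities: P(X1, ..., Xd | Y) = P(X1 | Y) × ... × P(Xd | Y).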
Naïve Bayes on Example Data
- Using given data and Bayes' Theorem to estimate probabilities given new data instances ("x").
Estimate Probabilities from Data
- Approximating the probabilities from categorical attributes, as well as continuous attributes.
Estimate Probabilities from Data
- Using probability density estimation for continuous values, such as income distributions.
Naïve Bayes Classifier
- Classifies a new data point using the maximum a posteriori (MAP) rule: the class label with the highest posterior probability is assigned.
- A simple classification algorithm that can be used for various classification tasks.
Issues with Naïve Bayes Classifier
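- If an attribute value never occurs together with a class in the training data, its estimated conditional probability is zero, and the whole product for that class becomes zero regardless of the other attributes.
- Smoothed estimates, such as the Laplace estimate or the m-estimate, can be used to overcome this issue.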
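The zero-probability problem and one possible remedy, Laplace smoothing, can be sketched for categorical attributes as follows. The records, attribute names, and function names are hypothetical; `alpha` is the smoothing constant, and `alpha = 0` gives the unsmoothed estimate.

```python
from collections import Counter, defaultdict


def train_naive_bayes(records):
    """Collect the counts needed to estimate P(class) and P(value | class)."""
    class_counts = Counter(label for _, label in records)
    value_counts = defaultdict(Counter)  # (attribute, class) -> counts of values
    values_seen = defaultdict(set)       # attribute -> distinct values observed
    for attributes, label in records:
        for name, value in attributes.items():
            value_counts[(name, label)][value] += 1
            values_seen[name].add(value)
    return class_counts, value_counts, values_seen


def posterior_scores(model, attributes, alpha=1.0):
    """Return unnormalized posteriors P(class) * prod P(value | class) per class."""
    class_counts, value_counts, values_seen = model
    total = sum(class_counts.values())
    scores = {}
    for label, n_label in class_counts.items():
        score = n_label / total  # prior P(class)
        for name, value in attributes.items():
            count = value_counts[(name, label)][value]
            # Laplace estimate: add alpha to each count, one per distinct value.
            score *= (count + alpha) / (n_label + alpha * len(values_seen[name]))
        scores[label] = score
    return scores


# Hypothetical training records: (attributes, class label).
records = [
    ({"refund": "yes", "status": "single"}, "no"),
    ({"refund": "no", "status": "married"}, "no"),
    ({"refund": "no", "status": "single"}, "yes"),
    ({"refund": "no", "status": "divorced"}, "yes"),
    ({"refund": "yes", "status": "married"}, "no"),
]
model = train_naive_bayes(records)
query = {"refund": "yes", "status": "single"}

unsmoothed = posterior_scores(model, query, alpha=0.0)  # class "yes" collapses to 0
smoothed = posterior_scores(model, query, alpha=1.0)    # every class stays positive
map_class = max(smoothed, key=smoothed.get)             # MAP decision
```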
Example of a Naive Bayes Classifier
- Demonstrating how a Naive Bayes classifier would work in classifying values if used on a dataset.
Naïve Bayes (Summary)
- Summary of properties of Naive Bayes Classifier.
Naïve Bayes
- Analyzing how Naive Bayes performs on datasets containing multi-class data.
Bayesian Belief Networks
- Representing probabilistic relationships graphically.
Conditional Independence
- Independence relationships between attributes represented in a directed acyclic graph (DAG).
Conditional Independence
- Conditional independence properties of Bayesian Networks.
Probability Tables
- Each node in the network has a probability table: a node with no parents stores its prior probability, and a node with parents stores its conditional probability given each combination of parent values.
Example of Bayesian Belief Network
- Example Bayesian belief networks for determining if a person has heart disease given their attributes.
Example of Inferencing using BBN
- Using the provided Bayesian belief network structure and example data set to determine probability of a heart disease given the attributes.
Data Mining Classification: Artificial Neural Networks
- A classifier represented as a network of interacting "nodes."
- Each node represents a processing unit.
Artificial Neural Networks (ANN)
- A very general classification algorithm useful for handling nonlinear problems or relationships between variables.
Basic Architecture of Perceptron
- A very simple neural network with a single node, used to classify points based on their position in relation to a given hyperplane/decision boundary.
Perceptron Example
- Examples of input data instances, and how a perceptron might perform classification given such examples.
Perceptron Learning Rule
- The perceptron is trained iteratively: after each training instance is classified, the weights are adjusted according to whether the prediction matches the correct class label.
Perceptron Learning Rule
- Each update changes the weights in proportion to the prediction error, scaled by a learning rate; the process repeats until a stopping criterion is met.
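A minimal sketch of this learning rule, assuming class labels of -1 and +1 and the update w ← w + λ(y − ŷ)x, is shown below. The learning rate, the epoch limit, and the logical-OR training set are illustrative choices.

```python
def predict(weights, bias, x):
    """Output +1 if the weighted sum is non-negative, otherwise -1."""
    activation = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if activation >= 0 else -1


def train_perceptron(data, learning_rate=0.1, max_epochs=100):
    """Update the weights after each misclassified instance."""
    weights = [0.0] * len(data[0][0])
    bias = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in data:
            error = y - predict(weights, bias, x)  # 0 when the prediction is correct
            if error != 0:
                weights = [w + learning_rate * error * xi for w, xi in zip(weights, x)]
                bias += learning_rate * error
                mistakes += 1
        if mistakes == 0:  # stopping criterion: a full pass with no errors
            break
    return weights, bias


# Logical OR with -1/+1 labels: a linearly separable problem.
or_data = [((0, 0), -1), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
weights, bias = train_perceptron(or_data)
```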
Example of Perceptron Learning
- Visual example of perceptron learning and weight updates; weights are updated until a stopping criterion is met.
Perceptron Learning
- Perceptron learning is guaranteed to converge to a solution provided the data are linearly separable.
Nonlinearly Separable Data
- A single perceptron cannot correctly classify data that are not linearly separable (the XOR problem is the classic example), because its decision boundary is always a hyperplane.
Multi-layer Neural Network
- A multi-layered neural network (that includes at least one hidden layer) is more expressive than a single perceptron node, and is able to classify nonlinearly separable problems
Multi-layer Neural Network
- Using more than one hidden layer allows the network to capture complex interactions between the input attributes.
Why Multiple Hidden Layers?
- Activations at multiple layers can provide a better understanding of complex relationships than only considering initial inputs and/or output.
- Deeper networks can learn better representations within the data, thus providing higher quality classifications than those from a shallower network.
Multi-Layer Network Architecture
- A diagram which describes a multi-layer neural network with separate input and output layers with multiple hidden layers between them.
- This arrangement allows more complex/non-linear decisions to be made.
Activation Functions
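- An activation function is applied to the weighted sum of a node's inputs to produce the node's output; common choices include the sigmoid, tanh, and ReLU functions.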
- The choice of function depends on the dataset and the application.
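For illustration, three widely used activation functions and their use inside a single node can be sketched as follows. The function `node_output` and the example weights are hypothetical.

```python
import math


def sigmoid(z):
    """Squash z into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def tanh(z):
    """Squash z into the interval (-1, 1)."""
    return math.tanh(z)


def relu(z):
    """Pass positive values through unchanged and clip negatives to zero."""
    return max(0.0, z)


def node_output(weights, bias, inputs, activation):
    """Apply the activation function to the weighted sum of the inputs."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)


# The weighted sum here is 0.5 * 1.0 + (-0.5) * 1.0 + 0.0 = 0.0.
example_sigmoid = node_output([0.5, -0.5], 0.0, [1.0, 1.0], sigmoid)
example_relu = node_output([0.5, -0.5], 0.0, [1.0, 1.0], relu)
```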
Learning Multi-layer Neural Network
- The process of training a neural network by using gradient descent to adjust its weights and biases so that the loss on the training data decreases.
Gradient Descent
- An iterative method that repeatedly moves the weights of an ANN in the direction that reduces a loss function computed over the training points; it may settle in a local rather than a global minimum.
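The idea can be sketched on the simplest possible model, a single linear unit y = w·x + b trained with a mean squared error loss. The data, learning rate, and number of steps below are illustrative assumptions; a real network would apply the same update to every weight.

```python
def gradient_descent(points, learning_rate=0.05, steps=2000):
    """Fit y = w * x + b by repeatedly stepping against the loss gradient."""
    w, b = 0.0, 0.0
    n = len(points)
    for _ in range(steps):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in points) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in points) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Hypothetical training points lying exactly on the line y = 2x + 1.
points = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
w, b = gradient_descent(points)
```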
Computing Gradients
- Computing the derivatives of the loss function with respect to each weight by applying the chain rule.
Backpropagation Algorithm
- A method that computes the gradients layer by layer, working backwards from the output layer of the ANN towards the input layer.
Design Issues in ANN
- Considerations and attributes that affect an ANN's design, including number of nodes in the layers and the initial weights (and/or biases) in the network.
Characteristics of ANN
- Summarizes the properties of ANNs: they are universal approximators and can handle redundant or irrelevant attributes, but they are sensitive to noise and have high computational complexity.
Deep Learning Trends
- Describes different trends, like improvements in computing resources and techniques.
Support Vector Machines
- A classification method that finds the separating hyperplane with the maximum margin, that is, the largest distance to the nearest training points of each class.
Support Vector Machines
- Finding a linear hyperplane to separate the data (when possible).
Learning Linear SVM
- Using Lagrange multipliers to solve the optimization problem for finding weights and bias for the linear hyperplane.
Example of Linear SVM
- Example of a training dataset and how a linear SVM might perform classification, resulting in a computed linear hyperplane.
Learning Linear SVM
- Methods and attributes (such as support vectors) that affect the decision boundary in this method.
Support Vector Machines
- What happens when the data points are not linearly separable?
Support Vector Machines
- How do you find the best separating hyperplane for non-linearly separable data?
Nonlinear Support Vector Machines
- Techniques that transform the data into a higher-dimensional space in which the classes become linearly separable.
Learning Nonlinear SVM
- Methods and properties that can be seen to ensure data can be transformed and analyzed in high dimensional spaces.
Kernel Trick *
- A strategy to avoid calculating the transformation into a potentially high-dimensional space.
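A small numerical check makes the idea concrete. For the degree-2 polynomial kernel K(x, z) = (x·z)², chosen here as an illustrative example, the kernel value computed in the original 2-D space equals the dot product of the explicitly transformed 3-D vectors. The mapping `phi` below is the standard one for this kernel.

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(p * q for p, q in zip(a, b))


def phi(x):
    """Explicit feature map for the kernel (x . z)^2 on 2-D inputs."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


def poly_kernel(x, z):
    """Degree-2 polynomial kernel evaluated in the original space."""
    return dot(x, z) ** 2


x = (1.0, 2.0)
z = (3.0, -1.0)

explicit = dot(phi(x), phi(z))  # transform first, then take the dot product
implicit = poly_kernel(x, z)    # never leaves the original 2-D space
```

Both routes give the same value, but the kernel route never builds the higher-dimensional vectors.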
Example of Nonlinear SVM
- Visual example of how a nonlinear SVM may perform classification given non-linear data
Learning Nonlinear SVM
- Advantages of using a kernel function for nonlinear SVMs, and limitations.
Characteristics of SVM
- Various properties of the SVM algorithm, including high computational complexity.
Data Mining Classification: Imbalanced Class Problem
- Classification problems in which one class dominates the other class/classes.
Class Imbalance Problem
- Challenges in using data mining techniques for scenarios that have one class greatly dominating the other classes.
Confusion Matrix
- A table that counts a classifier's predictions against the actual classes, giving the numbers of true positives, false negatives, false positives, and true negatives.
Accuracy
- The fraction of instances in a given dataset that are classified correctly.
Problem with Accuracy
- Accuracy is misleading on imbalanced datasets: a model that always predicts the dominant class achieves high accuracy while never detecting the rare class.
Which model is better?
- Comparing various classification models based on their performance measures on different imbalanced datasets using several performance measures.
Alternative Measures
- Ways to evaluate classification models beyond just accuracy (e.g., precision, recall, F-measure).
Alternative Measures
- F-measure, precision, recall on data using a confusion matrix.
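These measures follow directly from the confusion-matrix counts. The sketch below uses a hypothetical imbalanced dataset of 1,000 records with 50 positives to show how accuracy can look good while recall is poor.

```python
def accuracy(tp, fn, fp, tn):
    """Fraction of all records that are classified correctly."""
    return (tp + tn) / (tp + fn + fp + tn)


def precision_recall_f1(tp, fp, fn):
    """Return precision, recall, and their harmonic mean (F-measure)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical counts: 50 actual positives, 950 actual negatives.
tp, fn, fp, tn = 10, 40, 10, 940

acc = accuracy(tp, fn, fp, tn)                          # 0.95
precision, recall, f1 = precision_recall_f1(tp, fp, fn)  # 0.5, 0.2, about 0.286
```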
ROC (Receiver Operating Characteristic)
- A method for evaluating a classifier's performance in terms of true positive rate (TPR) and false positive rate (FPR).
ROC Curve
- Plotting true positive rates (TPR) against false positive rates (FPR).
ROC Curve Example
- Example graphs showing the ROC curve and its tradeoffs between different classifier types.
How to Construct an ROC curve
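- The procedure for constructing an ROC curve: sort the instances by classifier score, then compute the TPR and FPR obtained at each score threshold and plot the resulting points.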
Using ROC for Model Comparison
- Using area under the curve (AUC) to compare models.
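A minimal sketch of ROC construction and AUC computation is given below. It assumes each instance has a classifier score (higher means more likely positive) and a 0/1 label, and that both classes are present; the scores and labels are hypothetical.

```python
def roc_points(scores, labels):
    """Return the (FPR, TPR) points obtained by lowering the score threshold."""
    pairs = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    positives = sum(labels)
    negatives = len(labels) - positives
    tp = fp = 0
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for i, (score, label) in enumerate(pairs):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after all instances sharing this score are counted.
        if i == len(pairs) - 1 or pairs[i + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2 for (x1, y1), (x2, y2) in zip(points, points[1:]))


# Hypothetical classifier scores and true labels (1 = positive class).
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1, 1, 0, 1, 0, 0]

curve = roc_points(scores, labels)
area = auc(curve)  # 8 of the 9 positive-negative pairs are ranked correctly
```

An AUC of 1.0 corresponds to a perfect ranking, and 0.5 to random guessing.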
Dealing with Imbalanced Classes - Summary
- Practical strategies for handling imbalanced classes, and which evaluation measures are best suited for use with such data.
Which Classifier is better?
- Comparing different classifiers based on their performance measures.
Building Classifiers with Imbalanced Training Set
- Strategies for overcoming issues with imbalanced datasets.
Data Mining Cluster Analysis: Basic Concepts
- Cluster analysis is a data mining technique that divides data objects into groups; different notions of a cluster exist (for example contiguity-based, prototype-based, and density-based).
- A partitional clustering divides the data into non-overlapping subsets.
- A hierarchical clustering is a set of nested clusters organized as a tree.
What is Cluster Analysis?
- Cluster analysis places objects into groups given that the objects in the group are similar (or related) to one another, and different from other groups/objects.
Applications of Cluster Analysis
- Group related documents, genes and proteins with similar functionality, or stocks and products.
- Summarize large datasets.
Notion of a Cluster can be Ambiguous
- There can be various definitions for the notion of a cluster.
- It matters depending on the domain and the desired results.
Types of Clusterings
- Partitional clustering (dividing data objects into disjoint subsets) and Hierarchical clustering (a series of nested clusters) are two types of clustering techniques.
Partitional Clustering
- A division of data points into non-overlapping subsets.
Hierarchical Clustering
- A set of nested clusters organized in a hierarchical tree.
Other Distinctions Between Sets of Clusters
- Exclusive vs. Non-exclusive clusters
- Exclusive clustering (point only belongs to one cluster); non-exclusive clusters (point can belong to multiple)
- Partial vs. complete clustering: a partial clustering assigns only some of the data to clusters, which is useful when you do not want to cluster all of the data.
- Determining which of these is better depends on the data and desired result.
Types of Clusters
- Well-separated clusters
- Prototype-based clusters
- Contiguity-based clusters
- Density-based clusters
- Described by an Objective Function (that you use to minimize or maximize to get good clusters/results).
Types of Clusters: Well-Separated
- A cluster is a set where every point in that cluster is closer to other points in the same cluster, than to those points not in that cluster.
Types of Clusters: Prototype-Based
- A cluster is a set of objects where any object in a given cluster is closer to that cluster's prototype/center than to any other cluster's center.
Types of Clusters: Contiguity-Based
- A cluster is a set of points in which every point is closer to at least one other point in the cluster than to any point outside the cluster.
Types of Clusters: Density-Based
- A cluster is a dense region of points that is separated from other dense regions by regions of low density.
Types of Clusters: Objective Function
- Clusters are defined as the groups that minimize or maximize an objective function.
- Finding the optimal clustering by enumerating all possible groupings is NP-hard in most cases.
- Global objectives try to maximize a function encompassing the entire dataset, while local strategies look for immediate/nearby improvements.
Characteristics of the Input Data Are Important
- The type of proximity or density measure is important as that affects the algorithm's performance and accuracy.
- Characteristics of data including sparseness, attribute types, relationships between attributes, noise, outliers and distribution matter.
Clustering Algorithms
- Common clustering algorithms include k-means, hierarchical clustering, and density-based clustering.
K-means Clustering
- A Partitional clustering algorithm.
- Each cluster has an associated centroid (center point).
- Each data point is assigned to the cluster with the closest centroid.
Example of K-means Clustering
- Iterative steps of the k-means algorithm.
K-means Clustering – Details
- Simple iterative algorithm.
- Initial centroids are chosen randomly; the clusters and their centroids change in each iteration until a desired threshold/set of criteria (such as relatively few points changing clusters) is reached.
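- The computational cost of K-means clustering is O(n * K * I * d), where n is the number of points, K the number of clusters, I the number of iterations, and d the number of attributes.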
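The iteration described above can be sketched in a few lines. This version assumes Euclidean distance, random initial centroids drawn from the data, and a stopping rule of "no point changes cluster"; the function name, seed, and toy points are illustrative.

```python
import math
import random


def kmeans(points, k, max_iterations=100, seed=0):
    """Basic K-means: alternate between assigning points and updating centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random initial centroids
    assignment = None
    for _ in range(max_iterations):
        # Assignment step: each point goes to its closest centroid.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:  # no point changed cluster: stop
            break
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(coord) / len(members) for coord in zip(*members))
    return centroids, assignment


# Hypothetical data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, assignment = kmeans(points, k=2)
```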
K-means Objective Function
- Sum of squared errors (SSE) objective function measures clustering error.
- SSE represents the sum of squared distances between each point in a given cluster, and that cluster's centroid.
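As a small worked sketch, the SSE of a clustering can be computed by taking each cluster's mean as its centroid and summing the squared Euclidean distances. The two clusters below are hypothetical.

```python
import math


def sse(clusters):
    """Sum of squared distances from each point to the centroid of its cluster."""
    total = 0.0
    for members in clusters:
        centroid = tuple(sum(coord) / len(members) for coord in zip(*members))
        total += sum(math.dist(p, centroid) ** 2 for p in members)
    return total


# Hypothetical clustering with two clusters.
clusters = [
    [(0.0, 0.0), (2.0, 0.0)],              # centroid (1, 0): contributes 1 + 1
    [(5.0, 5.0), (5.0, 7.0), (5.0, 9.0)],  # centroid (5, 7): contributes 4 + 0 + 4
]
total_sse = sse(clusters)
```

Given two clusterings of the same data with the same K, the one with the lower SSE is preferred.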
Two different K-means Clusterings
- Demonstrating how different initializations can lead to different clustering outcomes given some dataset instances.
Importance of Choosing Initial Centroids
- Randomly chosen initial centroids are not guaranteed to produce the best clustering; a poor choice can lead to a poor result.
Problems with Selecting Initial Points
- The chance of choosing a good/ideal centroid from each cluster is small, especially if the number of clusters is large.
- Badly placed initial centroids sometimes fail to readjust themselves across the true clusters during the iterations.
10 Clusters Example
- Example of clusters and how centroids affect clustering results in more complex/challenging datasets
Solutions to Initial Centroids Problem
- Multiple runs: running K-means several times with different initial centroids and keeping the best result may work better than a single run.
- Selecting most-separated points.
- K-means++
- Bisecting K-means
K-means++
- The first centroid is chosen at random; each subsequent centroid is chosen at random from the data points with probability proportional to the point's squared distance to the nearest centroid already selected.
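The selection step can be sketched as follows, using the squared distance to the nearest already-chosen centroid as the sampling weight, as in the standard K-means++ formulation. The function name, seed, and data are illustrative, and the sketch assumes the data contain at least k distinct points.

```python
import math
import random


def kmeans_pp_init(points, k, seed=0):
    """Pick k initial centroids, favoring points far from those already chosen."""
    rng = random.Random(seed)
    centroids = [rng.choice(points)]  # first centroid: uniformly at random
    while len(centroids) < k:
        # Weight of each point: squared distance to its nearest chosen centroid.
        # Already-chosen points get weight 0 and cannot be picked again.
        weights = [min(math.dist(p, c) ** 2 for c in centroids) for p in points]
        centroids.append(rng.choices(points, weights=weights, k=1)[0])
    return centroids


# Hypothetical data: three loose groups of points.
points = [
    (0.0, 0.0), (0.5, 0.2), (0.1, 0.4),
    (10.0, 10.0), (10.3, 9.8),
    (20.0, 0.0), (19.7, 0.5),
]
initial_centroids = kmeans_pp_init(points, k=3)
```

The resulting centroids would then be passed to ordinary K-means as its starting point.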
Bisecting K-means
- A hierarchical method for creating K clusters.
- It repeatedly selects a cluster and splits it in two until there are K clusters.
Limitations of K-means
- K-means can have trouble when clusters differ in size or density, when they have non-globular shapes, and when the data contain outliers.
Limitations of K-means: Differing Sizes, Densities, and Non-globular Shapes, Outliers
- Demonstrations of how K-means clustering might perform poorly given certain types of datasets
Overcoming K-means Limitations
- Generating many clusters (potentially many more than the desired number).
- Post-processing to merge the resulting clusters may improve the final clustering for challenging datasets.
Description
Test your knowledge on key concepts of data mining and predictive modeling. This quiz covers distance measures, covariance matrices, similarity properties, and various data mining tasks. See how well you understand the core principles essential for effective data analysis.