Data Mining & Predictive Modeling Concepts

Questions and Answers

Which property of a distance measure states that the distance is always non-negative?

Non-negativity (correct)
Maximum Similarity
Symmetry
Triangle Inequality

The Mahalanobis distance can be zero only when the two points are identical.

True (A)

What is the covariance matrix given in the Mahalanobis distance example?

[[0.3, 0.2], [0.2, 0.3]]

The property of similarity that shows it is the same regardless of the order of objects is called _______.

Symmetry

Match the following terms to their respective properties:

Non-negativity = d(x, y) ≥ 0, with d(x, y) = 0 iff x = y
Triangle Inequality = d(x, z) ≤ d(x, y) + d(y, z)
Symmetry (Distance) = d(x, y) = d(y, x)
Symmetry (Similarity) = s(x, y) = s(y, x)

Which of the following describes the purpose of predictive modeling?

To find a model for class attributes as a function of other attributes (D)

Association rules are primarily concerned with predicting future trends.

False (B)

What is one example of a class attribute mentioned in the predictive modeling section?

Credit Worthy

In data mining, __________ refers to identifying unusual observations that differ from the majority of the data.

Anomaly Detection

Match the data mining tasks with their descriptions:

Clustering = Grouping similar data points
Predictive Modeling = Assessing outcomes based on input features
Anomaly Detection = Identifying unusual data points
Association Rules = Finding relationships between variables

Which attribute would most likely have a categorical value?

Credit Worthy (A), Employed (D)

What type of analysis is performed when finding relationships between variables?

Association Analysis

What is the main purpose of regression analysis?

To predict a value of a continuous variable based on other variables (D)

The object catalog mentioned has a size of 150 GB.

False (B)

How many new high red-shift quasars were found?

16

Regression is extensively studied in __________ and neural network fields.

statistics

Match the types of data with their respective sizes:

Object Catalog = 9 GB
Image Database = 150 GB
Stars = 72 million
Galaxies = 20 million

What types of characteristics are included in the classification of galaxies?

Characteristics of light waves received (B)

Regression analysis can only be applied to linear relationships.

False (B)

What are two examples of predictions made using regression analysis?

Predicting sales amounts and wind velocities

The study of __________ includes investigating the stages of formation of galaxies.

classifying galaxies

Which of the following is NOT an example of regression analysis?

Identifying galaxies (B)

Which of the following characteristics of data can complicate the recognition of the proper attribute type?

Noisiness (C)

High dimensional data brings several challenges in data analysis.

True (A)

What type of data consists of a collection of records, each with a fixed set of attributes?

Record Data

The __________ data type consists of documents represented as term vectors.

Document

Match the type of data with its correct description:

Record Data = Consists of a collection of records with fixed attributes
Graph Data = Represented as networks or interconnected structures
Ordered Data = Data that is arranged based on a sequence or order
Matrix Data = Data represented in a multi-dimensional matrix format

Which property of data refers to how the scale can affect pattern recognition?

Resolution (A)

The presence of some attributes in sparse data is always sufficient for analysis.

False (B)

What is the main challenge related to sparsity in data?

Only presence counts

Data objects with the same fixed set of attributes can be visualized in a ________-dimensional space.

multi

Which of the following is NOT a type of data set mentioned?

Generalized (B)

What is an example of noise in data?

A person's voice being distorted on a poor phone call. (D)

Outliers are always considered noise in data analysis.

False (B)

What are two reasons for missing values in a dataset?

Information is not collected or attributes may not be applicable to all cases.

Duplicate data issues often arise when merging data from __________ sources.

heterogeneous

Match the following data quality problems with their definitions:

Noise = Modification of original values
Outliers = Data objects with significantly different characteristics
Missing values = Absence of certain data
Duplicate data = Repetitive or nearly identical entries

What can be done to handle missing values in a dataset?

Estimate the missing values. (B)

Having multiple email addresses for the same person is an example of duplicate data.

True (A)

What is a similarity measure in data analysis?

A numerical measure of how alike two data objects are.

To clean duplicate data, a process known as __________ is utilized.

data cleaning

Which of the following is NOT described as a data quality problem?

Low storage capacity (D)

What is the main purpose of the mutual information approach as described?

To measure the degree of association between two continuous variables (C)

An indicator variable can take on the value of -1 when both objects have a value of 0 for a symmetric attribute.

False (B)

What are the two main reasons for using weights when calculating similarities between attributes?

To treat attributes differently and to incorporate their importance into the similarity measure.

In data preprocessing, the process of __________ refers to combining two or more attributes (or objects) into a single attribute (or object).

aggregation

Match the following data preprocessing techniques to their descriptions:

Aggregation = Combining multiple attributes into fewer ones
Dimensionality Reduction = Reducing the number of variables under consideration
Sampling = Selecting a subset of data for analysis
Discretization = Converting continuous data into discrete bins

What is a primary reason for the strong competitive pressure in data mining?

To provide better, customized services (B)

Data mining is primarily concerned with collecting data rather than analyzing it.

False (B)

Name one type of data that is extensively collected by e-commerce websites.

Purchase data

In data mining, unusual observations that differ from the majority of the data are referred to as __________.

outliers

Match the following types of data with their descriptions:

Structured Data = Data that is organized in a predefined manner
Unstructured Data = Data that does not have a pre-defined format
Semi-Structured Data = Data that has some organizational properties but does not fit into a conventional database
Temporal Data = Data that varies with time

What challenge is often faced in data analysis due to high dimensional data?

Difficulty in pattern recognition (D)

Give one example of how data can be gathered from social networking sites.

User engagement metrics

Which of the following is a characteristic that can complicate the recognition of the proper attribute type?

Noise in data (C)

The presence of sparse data means that all attributes in the dataset are equally important.

False (B)

What is meant by 'dimensionality' in the context of data analysis?

The number of attributes in a dataset.

Data represented as __________ provides a multidimensional view of objects based on their attributes.

data matrix

Match the types of data sets with their definitions:

Record Data = A collection of records with a fixed set of attributes
Graph Data = Data that represents connections, like the World Wide Web
Ordered Data = Data that has a natural order, such as temporal data
Document Data = Data represented as term vectors

Which operation is meaningful for categorical data?

Calculating the mode (D)

High dimensional data provides fewer challenges for data analysis compared to low dimensional data.

False (B)

What characterizes document data in data mining?

Each document is represented as a term vector.

The scale at which data is analyzed refers to its __________.

resolution

Which characteristic of data refers to the amount of space occupied by the dataset?

Size (B)

What type of data consists of a collection of records with a fixed set of attributes?

Structured Data (D)

What is an example of a transaction in a grocery store?

Bread, Coke, Milk

The __________ refers to the quality of the data that can negatively impact data processing efforts.

data quality

Match the following types of data with their examples:

Transaction Data = Shopping items purchased
Graph Data = Connections between entities
Ordered Data = Sequences of transactions
Spatio-Temporal Data = Monthly temperature records

Which of these data types can represent relationships between variables?

Graph Data (D)

The average monthly temperature is an example of spatio-temporal data.

True (A)

In data mining, __________ involves identifying unusual observations that differ from the majority of the data.

anomaly detection

Poor data quality can lead to which of the following issues?

Erroneous conclusions (B)

What distance measure is defined as the maximum difference between any component of the vectors?

Supremum distance (C)

The Euclidean distance is always less than or equal to the Manhattan distance for any two points.

True (A)

Write the formula for Mahalanobis distance.

Mahalanobis distance = ((x - y)^T Σ^(-1) (x - y))^0.5

The Hamming distance is a special case of the ______ distance, applicable to binary vectors.

Manhattan

Match the following types of distances with their appropriate descriptions:

L1 = City block (Manhattan) distance
L2 = Euclidean distance
L∞ = Supremum distance
Mahalanobis = Distance accounting for covariance

Which distance measure would be most appropriate for identifying outliers in a dataset with correlated features?

Mahalanobis distance (B)

The Mahalanobis distance can be greater than the Euclidean distance for the same pair of points.

True (A)

What does the covariance matrix (Σ) represent in the Mahalanobis distance formula?

The covariance matrix represents the variance and covariance of the data points, indicating the relationships between their dimensions.

For binary vectors, the _______ distance can be used to calculate the number of different bits.

Hamming

If two points are identical, what will the Mahalanobis distance equal?

0 (C)

Flashcards

Noise in Data

Extraneous information in the data that doesn't reflect the true value. For attributes, it refers to changes to the original values.

Outliers

Data points that are significantly different from the rest of the data, often due to errors or exceptional cases.

Missing Values

Data values that are missing or incomplete.

Duplicate Data

Data entries that are exact or near-exact duplicates of each other.

Similarity Measure

A measure that quantifies how similar two data objects are, often expressed as a number.

Data Mining Definition

Data mining is a process of discovering meaningful patterns and knowledge from large datasets. It aims to extract insights and understand hidden relationships in data.

What is clustering in data mining?

Clustering is a data mining task that groups similar data points together. It aims to identify clusters of similar objects based on their characteristics.

What is Association Rule Mining?

Association rule mining focuses on discovering interesting relationships or patterns between different attributes in a dataset. These relationships are expressed as rules, like "If a customer buys bread, they are likely to buy milk too."

What is predictive modeling?

Predictive modeling is a machine learning technique that uses existing data to predict future outcomes. It builds models based on historical data to forecast trends.

What is classification in data mining?

Classification is a type of predictive modeling that aims to assign data points to predefined categories or classes. For example, classifying emails into spam or not spam.

What is anomaly detection?

Anomaly detection is the process of identifying data points that deviate significantly from the expected behavior or patterns within a dataset. These anomalies can indicate errors, fraud, or unusual events.

Regression

A type of data mining task that uses a statistical model to predict a specific continuous target variable based on other variables. This model can be linear or nonlinear.

Anomaly Detection

Finding objects with unusual or extreme properties. This often involves searching for things like distant galaxies or rare astronomical objects.

Data Mining

A process used to automatically identify patterns and trends in data, leading to insights that would be difficult or time-consuming to spot manually.

Object Catalog

A large collection of data (e.g., galaxies, stars) organized in a structured manner, often used for analysis and research purposes.

Time Series Prediction

A process used to predict the future value of a variable, based on historical patterns and trends in the data.

Red Shift

The shift of an object's light toward longer (redder) wavelengths. For distant objects, a larger redshift generally indicates a greater distance from Earth.

Quasar

An extremely luminous type of celestial object. Quasars with very high redshift values are very far away from us and likely formed in the early universe.

Galaxy Class

The characteristics of a galaxy, often used to understand its past evolution and formation.

Image Features

The information extracted from images, such as the shapes, colors, and brightness of celestial objects. It's used to understand the object's properties.

Mahalanobis Distance

The Mahalanobis distance is a measure of distance between two points, considering their correlation. It uses the covariance matrix to normalize the distance calculation, making it robust to different scales and directions. It's often used for outlier detection and classification.

Metric

A metric is a distance function that meets specific criteria, allowing for meaningful comparisons and analysis. It ensures that distances are non-negative, symmetric, and satisfy the triangle inequality.

Triangle Inequality

The triangle inequality states that the direct distance between two points x and z never exceeds the distance travelled through an intermediate point y: d(x, z) ≤ d(x, y) + d(y, z). In other words, a detour through a third point can never be shorter than the direct route.

Similarity

A similarity metric is a function that measures how similar two objects are, typically ranging from 0 to 1, where 1 represents complete similarity and 0 represents complete dissimilarity. It can be symmetric, meaning the similarity between A and B is the same as the similarity between B and A.

Covariance Matrix

The covariance matrix is a square matrix that summarizes how the features (variables) in a dataset vary together. Its off-diagonal elements are the covariances between each pair of features, and its diagonal elements are the variances of the individual features. It plays a role in understanding the relationships between variables and is used in various statistical and machine learning methods.

Record Data

Data that consists of a collection of records, each of which has a fixed set of attributes.

Data Matrix

A data representation where objects with the same set of numeric attributes are considered points in a multi-dimensional space, forming a matrix with objects as rows and attributes as columns.

Document Data

Each document is transformed into a vector where each component represents a term (word) and its value corresponds to the frequency of that term in the document.

Dimensionality

The number of attributes or features used to describe a data object.

Sparsity

A property of data in which most attribute values are absent or zero, so that only the presence of an attribute is significant.

Resolution

The level of detail or granularity at which data is measured or recorded.

Size

The overall size or volume of data under consideration.

Graph Data

Data that is organized in a graph-like structure, where nodes represent entities and edges represent relationships between them.

Ordered Data

Data that has an inherent order or sequence, often associated with time or space.

Incomplete Attribute Types

Attribute types that are incomplete, meaning they lack one or more properties like ordering, distinction, or ratios.

Large-scale Data

A large amount of data collected and stored, often from various sources.

Gather Whatever Data You Can

The idea that collecting as much data as possible, whenever and wherever it is available, can lead to valuable insights, even if the original purpose of collection was different.

Competitive Pressure

The pressure to provide better and more customized services to customers to gain an advantage in the marketplace.

Customer Relationship Management (CRM)

The use of data mining to improve customer relationships, such as understanding their needs and providing personalized services.

Computer Power Advancements

The falling cost of computers and their growing processing power make data mining more accessible and efficient.

Data Preprocessing

The process of transforming data from its raw form into a structured format suitable for analysis and interpretation.

Transaction Data

A special type of data where each transaction involves a set of items. For example, in a grocery store, the products purchased by a customer during one trip make up a transaction, while the individual products are the items.

Set Data

Data where each item in a set is categorized. There's no order or sequence, just a collection of related items.

Data Size

The overall size or volume of data that is being analyzed or used. Data can be small, like a few records, or massive, like a database.

Data Dimensionality

The number of attributes used to describe each data point. High dimensionality makes analysis more complex.

Data Resolution

The degree of detail or granularity in which data is measured or recorded. High resolution means more specific data.

Mutual Information

A measure of the statistical dependence between two random variables, often used to find novel associations in large datasets.

Clustering

The process of grouping similar data points together based on their characteristics.

Association Rule Mining

A technique that discovers interesting relationships or patterns between different attributes in a dataset, often expressed as rules like "If a customer buys bread, they are likely to buy milk too."

Aggregation

Combines multiple attributes into a single one, often used for data reduction or to change the scale of data.

Manhattan Distance (L1 Norm)

A way to measure the dissimilarity between two data points. It's calculated as the sum of the absolute differences between corresponding components of the two points.

Euclidean Distance (L2 Norm)

This distance is calculated as the square root of the sum of squared differences between corresponding components of two points.

Supremum Distance (Lmax Norm, L∞ Norm)

This distance is calculated as the maximum absolute difference between corresponding components of two points.

Study Notes

Data Mining: Introduction

Data mining is the process of extracting implicit, previously unknown, and potentially useful information from data.
Large-scale data growth exists in commercial and scientific databases. This is due to advancements in data generation and collection technologies.
The new data mining mantra is to gather whatever data possible, whenever and wherever possible.
Data mining is often used to gain a competitive advantage, create more personalized experiences, and solve large-scale scientific problems.
Data mining is a key component of data science.

Why Data Mining?

Commercial viewpoints: Vast amounts of data are being collected and processed, and companies often use this data to gain a competitive edge. Businesses collect data to develop new products and services, personalize customer experiences, improve customer relationship management, and enhance productivity.
Scientific viewpoints: In scientific settings, data mining is used in hypothesis formation and analysis of massive datasets from remote sensors, telescopes, and scientific simulations.
Societal problems: Data mining provides a tool to improve health care, predict climate change effects, and develop ways to reduce world hunger and poverty.

What is Data Mining?

Data mining is a non-trivial process of extracting implicit, previously unknown, and potentially useful patterns from large amounts of data using automatic or semi-automatic means.
Key steps in data mining usually include data preprocessing, data mining, and postprocessing.

Origins of Data Mining

Draws from statistics, machine learning, pattern recognition, and database systems.
Traditional techniques often are unable to extract patterns from large-scale data.
Modern techniques leverage database technology, parallel computing, and distributed computing.

Data Mining Tasks

Prediction methods: Use some variables to predict unknown or future values of other variables.
Description methods: Find interpretable patterns in data to better describe the data.
Data mining tasks include clustering, predictive modeling/classification, association rule discovery, and anomaly detection.

Predictive Modeling: Classification

Find a model for the class attribute as a function of values for other attributes.
Predicting creditworthiness is one use of this technique.

Classification Example

Classifiers learn from a training set of data to predict the class for unknown data records.
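
As a minimal sketch of learning from a training set, the snippet below uses a 1-nearest-neighbor rule: an unseen record receives the class of the closest training record. The notes do not name a specific classifier, so the choice of nearest neighbor, the attribute names, and all values are illustrative assumptions.

```python
import math

# Hypothetical training records: (annual income in $1000s, years employed) -> credit worthy?
training_set = [
    ((25.0, 1.0), "no"),
    ((40.0, 3.0), "no"),
    ((85.0, 10.0), "yes"),
    ((120.0, 8.0), "yes"),
]

def predict(record):
    """Return the class label of the training record closest to `record`."""
    nearest = min(training_set, key=lambda item: math.dist(item[0], record))
    return nearest[1]

# Classify two records that were not part of the training set.
print(predict((90.0, 7.0)))
print(predict((30.0, 2.0)))
```

In practice the attributes would usually be rescaled first, since income dominates the distance here.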

Classification: Applications

The applications of classification involve many areas, including credit card fraud detection, intrusion detection, identification of tumor cells, and categorizing news stories.

Classification: Application 1 – Fraud Detection

Uses data on customers' credit card transactions to label each transaction as fraudulent or legitimate.

Classification: Application 2 – Churn Prediction

Uses detailed transactional records to predict which customers are likely to be lost to competitors. This is commonly used in telecommunications settings.

Classification: Application 3 – Sky Survey Cataloging

Used to predict the class (stars or galaxies) of sky objects in telescopic survey images from the Palomar Observatory.

Classifying Galaxies

Data size includes 72 million stars, 20 million galaxies, an object catalog (9 GB), and an image database (150 GB).

Regression

Predict a continuous variable based on other variables.
Methodologies include studying linear and nonlinear models in statistics and using neural networks.
Examples include predicting sales amounts from advertising expenditure and predicting wind velocities.
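
To make the idea of predicting a continuous variable concrete, here is a small sketch of ordinary least squares for a single input variable. The function name `fit_line` and the data are made up for illustration; a linear model is only one of the options mentioned above.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up data: advertising spend vs. sales, chosen so that sales = 2 * spend + 1.
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [3.0, 5.0, 7.0, 9.0, 11.0]

slope, intercept = fit_line(ad_spend, sales)
predicted = slope * 6.0 + intercept  # predicted sales for a spend of 6.0
print(slope, intercept, predicted)
```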

Clustering

Finding groups of similar objects based on their characteristics.
Clustering aims to minimize intra-cluster distances while maximizing inter-cluster distances.
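
One way to see this in action is k-means, sketched below. The notes do not name a clustering algorithm, so k-means, the fixed starting centroids, and the sample points are all assumptions made for illustration.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain k-means: repeatedly assign points to the nearest centroid, then update centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            idx = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[idx].append(p)
        # New centroid = coordinate-wise mean of the cluster (kept unchanged if the cluster is empty).
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Made-up 2-D points forming two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(centroids)
print(clusters)
```

Starting centroids are fixed here to keep the run deterministic; real implementations usually pick them randomly.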

Applications of Cluster Analysis

Custom profiling for targeted marketing
Group related documents
Group similar genes and proteins
Group stocks with similar price fluctuations
Reduce the size of large datasets

Clustering: Application 1 – Market Segmentation

Subdividing a market into customer subsets that can be targeted for a unique marketing mix.
Customers' geographical differences and lifestyle factors are analyzed to identify clusters of similar customers.

Clustering: Application 2 – Document Clustering

Grouping similar documents based on frequently occurring terms within those documents.

Association Rule Discovery: Definition

Finding dependency rules between items in a dataset.
Based on the occurrences of items.
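
A common way to quantify such rules is with support (the fraction of transactions containing an itemset) and confidence (how often the consequent appears when the antecedent does). These two measures are standard in association analysis but are not named in the notes above, and the grocery baskets below are made up for illustration.

```python
# Hypothetical grocery transactions, each a set of items.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Of the transactions containing `antecedent`, the fraction that also contain `consequent`."""
    return support(antecedent | consequent) / support(antecedent)

# Rule under test: {Diaper, Milk} -> {Beer}
rule_support = support({"Diaper", "Milk", "Beer"})
rule_confidence = confidence({"Diaper", "Milk"}, {"Beer"})
print(rule_support, rule_confidence)
```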

Association Analysis: Applications

Rules can promote sales and help manage inventory.
Rules are useful in telecommunication alarm diagnosis.
Rules are also useful in medical informatics.

Association Analysis: Example

An example subspace differential coexpression pattern from a lung cancer dataset.

Deviation/Anomaly/Change Detection

This technique detects significant deviations from normal behavior.
Examples include detecting credit card fraud, detecting network intrusions, and detecting changes in the global forest cover using sensor networks.
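
A very simple way to flag deviations from normal behavior is a z-score test, sketched below. The notes do not prescribe a detection method, so the z-score approach, the threshold of 2.0, and the transaction amounts are assumptions for illustration only.

```python
import statistics

def find_anomalies(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    return [v for v in values if abs(v - mean) / stdev > threshold]

# Made-up card transaction amounts with one unusually large purchase.
amounts = [20.0, 22.0, 19.0, 21.0, 23.0, 20.0, 18.0, 250.0]
anomalies = find_anomalies(amounts)
print(anomalies)
```

Note that an extreme value inflates the mean and standard deviation it is judged against, so robust statistics such as the median are often preferred in practice.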

Motivating Challenges

Scalability
High dimensionality
Heterogeneous and complex data
Data ownership and distribution
Non-traditional analysis

Data Mining: Data

Attributes and objects
Types of data
Data quality
Similarity and distance
Data preprocessing

What is Data?

Collection of data objects and their attributes.
An attribute is a property or characteristic of an object.
An object is described by a collection of attributes.

Attribute Values

Values that are assigned to attributes.
Height can be measured in either feet or meters.
Data types and values have different properties.

Measurement of Length

How an attribute is measured can matter.

Types of Attributes

Nominal (zip codes, eye color)
Ordinal (rankings)
Interval (calendar dates)
Ratio (temperature in Kelvin)

Properties of Attribute Values

Distinctness: values can be tested for equality (=, ≠).
Order: values can be ordered (<, >).
Differences are meaningful (+, −).
Ratios are meaningful (×, ÷).

Difference Between Ratio and Interval

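Is it physically meaningful to say that a temperature of 10° is twice that of 5°? On an interval scale such as Celsius or Fahrenheit it is not, because the zero point is arbitrary; on a ratio scale such as Kelvin, ratios are meaningful.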
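
A quick numeric check illustrates the point. The helper name `celsius_to_kelvin` is chosen here for illustration; the conversion offset of 273.15 is the standard one.

```python
def celsius_to_kelvin(celsius):
    """Convert a Celsius temperature to Kelvin."""
    return celsius + 273.15

# The ratio of the raw Celsius numbers suggests "twice as hot"...
ratio_celsius = 10.0 / 5.0

# ...but on the Kelvin (ratio) scale the two temperatures differ by under 2%.
ratio_kelvin = celsius_to_kelvin(10.0) / celsius_to_kelvin(5.0)

print(ratio_celsius, ratio_kelvin)
```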

Attribute Types

Categorical attributes (nominal, ordinal)
Numerical attributes (interval, ratio)

Discrete and Continuous Attributes

Discrete attributes: Finite or countably infinite set of values.
Continuous attributes: Real numbers (can take a very large set of values).

Asymmetric Attributes

Only presence of value is considered important (e.g., words in documents, items in customer transactions).

Critiques of the attribute categorization

Asymmetry, Cyclic, and multivariate attributes
Partial order, and relational attributes
Data is approximate, noisy, and incomplete.

Key messages for attribute types

The type of operations suitable for attributes depends on the characteristics of the data.
Various properties of data must be considered.
Use data types for the most accurate analysis.

Important Characteristics of Data

Dimensionality (number of attributes)
Sparsity (only presence counts)
Resolution (granularity or scale)
Size (amount of data)

Types of Data Sets

Record data: data matrix, document data, transaction data
Graph data: World Wide Web, molecular structures
Ordered data: spatial data, temporal data, sequential data, genetic sequence data

Record Data

Record data has fixed attributes.

Data Matrix

Data objects can be thought of as points in a multi-dimensional space.
Each dimension represents a distinct attribute.

Document Data

Each document becomes a term vector, and each term is a component (attribute) of the vector.
The value of a component represents the number of times the term occurs in the document.
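
The sketch below builds such a term vector by counting word occurrences. The vocabulary, the sample sentence, and the whitespace tokenization are simplifying assumptions for illustration.

```python
from collections import Counter

def term_vector(text, vocabulary):
    """Count how often each vocabulary term occurs in `text` (lowercased, split on whitespace)."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

# Hypothetical vocabulary and document.
vocabulary = ["team", "coach", "play", "ball", "score"]
doc = "Team play wins the ball and the team will score"

vector = term_vector(doc, vocabulary)
print(vector)
```

Terms outside the vocabulary are ignored, and most entries of a real term vector are zero, which is why document data is typically sparse.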

Transaction Data

A special type of data where each transaction is a set of items.
Example is a grocery store transaction.

Graph Data

Generic graphs, molecular structures, and linked web pages are examples of graph data.

Ordered Data

Sequences of transactions
Genomic data
Spatio-temporal data

Data Quality

Poor data quality negatively impacts data mining processes.

Data Quality ...

Noise and outliers, wrong data, fake data, missing values, and duplicate data are examples of data quality problems.

Noise

For objects, noise is an extraneous object.
For attributes, noise is a modification of the original values that distorts the data.

Outliers

Data objects with unusual characteristics.

Missing Values

Missing features, due to many reasons.
Impute missing values, or simply remove entries.
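
Both options can be sketched in a few lines. Mean imputation is only one possible way to estimate missing values, and the `ages` list with `None` marking a missing entry is a made-up example.

```python
def impute_mean(values):
    """Replace each missing entry (None) with the mean of the known entries."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]

def drop_missing(values):
    """Remove entries whose value is missing (None)."""
    return [v for v in values if v is not None]

# Hypothetical attribute with two missing values.
ages = [30.0, None, 40.0, 50.0, None]

imputed = impute_mean(ages)
dropped = drop_missing(ages)
print(imputed)
print(dropped)
```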

Duplicate Data

Duplicate data objects can complicate data mining.

Similarity and Dissimilarity Measures

Similarity is a numerical measure of how alike two objects are; it often falls in the range [0, 1].
Dissimilarity is a numerical measure of how different two objects are.

Similarity/Dissimilarity for Simple Attributes

Calculating similarities and dissimilarities for each attribute type.

Euclidean Distance

A way to calculate distance between two objects given all attributes are numeric.

Minkowski Distance

Calculates the distance between objects described by more than one attribute.
A generalization of Euclidean distance; the parameter r determines which distance is computed.

Minkowski Distance: Examples

City block (Manhattan, taxicab) distance (r = 1) and Euclidean distance (r = 2) are special cases of the Minkowski distance.
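
The relationship between these distances can be sketched in a few lines of Python. This is a minimal illustration, not code from the lesson: the function name `minkowski` and the two sample points are assumptions. With r = 1 it returns the city block distance, with r = 2 the Euclidean distance, and with r set to infinity the largest single-attribute difference.

```python
def minkowski(x, y, r):
    """Minkowski distance between two equal-length numeric vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    if r == float("inf"):
        # Limit r -> infinity: the largest absolute attribute difference.
        return max(abs(a - b) for a, b in zip(x, y))
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1.0 / r)


p1, p2 = (0, 2), (3, 1)                  # hypothetical points
city_block = minkowski(p1, p2, 1)        # r = 1
euclidean = minkowski(p1, p2, 2)         # r = 2
supremum = minkowski(p1, p2, float("inf"))
```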

Mahalanobis Distance

Accounts for the correlation between attributes and the variability of the data.
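
As a rough sketch, the Mahalanobis distance for two-dimensional points can be computed by hand-inverting the 2x2 covariance matrix. The function name `mahalanobis_2d` and the points A, B, and C are assumptions made for illustration; the covariance matrix is the one quoted for the lesson's Mahalanobis example.

```python
import math


def mahalanobis_2d(x, y, cov):
    """Mahalanobis distance for 2-D points given a 2x2 covariance matrix."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix must be invertible")
    # Inverse of a 2x2 matrix.
    inv = ((d / det, -b / det), (-c / det, a / det))
    dx, dy = x[0] - y[0], x[1] - y[1]
    # Quadratic form (x - y)^T * inv * (x - y).
    q = dx * (inv[0][0] * dx + inv[0][1] * dy) + dy * (inv[1][0] * dx + inv[1][1] * dy)
    return math.sqrt(q)


cov = ((0.3, 0.2), (0.2, 0.3))
A, B, C = (0.5, 0.5), (0.0, 1.0), (1.5, 1.5)   # hypothetical points
d_ab = mahalanobis_2d(A, B, cov)
d_ac = mahalanobis_2d(A, C, cov)
```

With these assumed points, B is closer to A than C is in Euclidean terms, yet its Mahalanobis distance is larger, because the step from A to B runs against the positive correlation in the data.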

Common Properties of a Distance

Non-negativity
Symmetry
Triangle inequality

Common Properties of a Similarity

Reflexivity: s(x,y) = 1 if x = y
Symmetry: s(x,y) = s(y,x)

Similarity Between Binary Vectors

SMC (Simple Matching) and Jaccard Coefficient measure similarity between two objects with only binary attributes.
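
The two coefficients differ only in whether 0-0 matches count. A minimal sketch, assuming the usual definitions SMC = (f11 + f00) / (f11 + f00 + f10 + f01) and Jaccard = f11 / (f11 + f10 + f01); the function name and the sample vectors are illustrative assumptions.

```python
def smc_and_jaccard(x, y):
    """Return (SMC, Jaccard) for two equal-length binary vectors."""
    f11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    f00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    f10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    f01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    smc = (f11 + f00) / (f11 + f00 + f10 + f01)
    # Jaccard ignores 0-0 matches; define it as 0 when neither vector has a 1.
    non_zero = f11 + f10 + f01
    jaccard = f11 / non_zero if non_zero else 0.0
    return smc, jaccard


x = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]   # hypothetical binary vectors
y = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
smc, jaccard = smc_and_jaccard(x, y)
```

For these vectors SMC is high because of the seven shared zeros, while Jaccard is zero because the vectors share no 1s. That difference is why Jaccard suits asymmetric binary attributes.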

Cosine Similarity

Measuring similarity between two objects considering the angle between the vectors.
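
A minimal sketch of cosine similarity, computed as the dot product divided by the product of the vector lengths. The function name and the two term-count vectors are assumptions for illustration.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_x * norm_y)


d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]   # hypothetical term counts
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]
similarity = cosine_similarity(d1, d2)
```

Because only the angle matters, scaling a vector (for example, doubling every term count) leaves the result unchanged.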

Correlation measures the linear relationship between objects

Measuring similarity using correlation measures the linear relationship between two objects given attributes.

Visually Evaluating Correlation

Scatter plots are used to visually assess the relationship between two attributes.

Drawback of correlation

The correlation can be zero, but there may still be a nonlinear relationship.
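
This drawback is easy to reproduce. In the sketch below (the function name and data are assumptions), y is completely determined by x through y = x², yet the Pearson correlation is zero because the relationship is not linear.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sy = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sx * sy)


x = [-3, -2, -1, 0, 1, 2, 3]
y = [v * v for v in x]          # y depends on x, but not linearly
r_nonlinear = pearson(x, y)
r_linear = pearson(x, [2 * v + 1 for v in x])
```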

Correlation vs Cosine vs Euclidean Distance

Different proximity measures can be appropriate for different scenarios depending on data types.

Comparison of Proximity Measures

Properties of a proximity measure that are important from different application perspectives.

Information Based Measures

Information theory is a well-developed concept that can be used to develop similarity measures.
Such measures can handle non-linear relationships in the data.

Information and Probability

The more certain an outcome is, the less information it provides (and vice-versa).
Entropy is a commonly used tool for information measure.

Entropy

A tool for measuring uncertainty.
For a discrete variable, entropy quantifies how much information, on average, an outcome of the variable provides.

Entropy Examples

Example of calculating entropy for coins, and dice.

Entropy for Sample Data: Example

An example of calculating entropy for hair color data.

Entropy for Sample Data

Calculating the entropy of an attribute from the number of times each value is observed.
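
A minimal sketch of that calculation, assuming entropy is measured in bits (log base 2) and that the probability of each value is estimated as its count divided by the total. The function name is an assumption.

```python
import math


def entropy_from_counts(counts):
    """Entropy in bits of a discrete attribute, given how often each value occurs."""
    total = sum(counts)
    h = 0.0
    for c in counts:
        if c > 0:                      # a value that never occurs contributes 0
            p = c / total
            h -= p * math.log2(p)
    return h


fair_coin = entropy_from_counts([1, 1])    # two equally likely outcomes
fair_die = entropy_from_counts([1] * 6)    # six equally likely outcomes
certain = entropy_from_counts([5, 0])      # only one outcome ever observed
```

The fair coin gives 1 bit, the fair die gives log2(6) bits, and the certain outcome gives 0 bits, matching the idea that more certain outcomes carry less information.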

Mutual Information

Measuring the amount of information that one variable provides about another variable.
This measure is useful for determining if the variables are related and if one variable can be used to predict the other.
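
One common way to compute mutual information is I(X, Y) = H(X) + H(Y) - H(X, Y), where H(X, Y) is the entropy of the paired values. The sketch below estimates each entropy from observed frequencies; the function names and the sample data are assumptions.

```python
import math
from collections import Counter


def entropy(values):
    """Entropy in bits, estimated from the observed frequency of each value."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def mutual_information(xs, ys):
    """I(X, Y) = H(X) + H(Y) - H(X, Y), estimated from paired observations."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


xs = [0, 0, 1, 1]
mi_identical = mutual_information(xs, [0, 0, 1, 1])     # y tells us x exactly
mi_independent = mutual_information(xs, [0, 1, 0, 1])   # y tells us nothing about x
```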

Mutual Information Examples

Calculate mutual information for example data using the provided dataset/formula.

Maximal Information Coefficient

Applies mutual information to two continuous variables, capturing relationships that need not be linear.

General Approach for Combining Similarities

Creating a composite similarity measure when the attributes are of different types.

Using Weights to Combine Similarities

Giving weight to different attributes to create a more meaningful result given the attributes aren't the same type.

Data Preprocessing Methods

Aggregation, sampling, discretization/binarization, attribute transformation, dimensionality reduction, feature subset selection, feature creation.

Aggregation

Combining attributes or objects.
Reduce the number of attributes or objects.

Data Aggregation Example

Examples of aggregation for various data points, showing data scaling and stability.

Example: Precipitation in Australia

Example of data preprocessing.
Precipitation patterns in Australia demonstrate how data are aggregated by month or year.

Sampling

A technique to reduce data size by only including a fraction of the data points.
The resulting sample must be representative of the original set.

Sampling...

For samples to be representative of the original dataset, they need to capture the same properties/distributions.

Sample Size

The size of the sample affects the final result.

Types of Sampling

Simple random sampling, sampling with replacement, sampling without replacement, stratified sampling.
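
These sampling schemes can be sketched with Python's standard `random` module. The helper `stratified_sample`, its parameters, and the toy data are assumptions for illustration; this version draws the same fraction from every group, which is only one of several ways to stratify.

```python
import random
from collections import defaultdict


def stratified_sample(records, key, fraction, rng):
    """Draw the same fraction (at least one record) from each group."""
    groups = defaultdict(list)
    for rec in records:
        groups[key(rec)].append(rec)
    sample = []
    for members in groups.values():
        k = max(1, round(fraction * len(members)))
        sample.extend(rng.sample(members, k))
    return sample


rng = random.Random(0)                        # fixed seed so the sketch is repeatable
data = list(range(100))
without_replacement = rng.sample(data, 10)    # each item can be picked at most once
with_replacement = rng.choices(data, k=10)    # the same item may be picked again
```

Stratified sampling helps when some groups are rare: a plain random sample could miss them entirely, whereas each stratum is guaranteed a share here.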

Attribute Transformation

Transform attributes by a series of functions to better adjust for properties that help the data.
This is used to account for variability differences among attributes.
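
Standardization (z-scores) is one widely used transformation for this purpose; the lesson does not say which function it has in mind, so treat the following as an assumed example. It rescales an attribute to mean 0 and standard deviation 1 so that attributes measured on different scales become comparable.

```python
import math


def standardize(values):
    """Rescale values to mean 0 and standard deviation 1 (z-scores)."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)   # population std
    if std == 0:
        raise ValueError("cannot standardize a constant attribute")
    return [(v - mean) / std for v in values]


z = standardize([2, 4, 4, 4, 5, 5, 7, 9])   # hypothetical attribute values
```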

Example: Sample Time Series of Plant Growth

Data about precipitation in Australia is aggregated, and analyzed to show examples of how data may be better analyzed when aggregated.

Seasonality Accounts for Much Correlation

Seasonality in the data can affect the relationships between attributes.

Curse of Dimensionality

As dimensionality increases, data become increasingly sparse in the space they occupy, which makes them harder to analyze accurately.

Dimensionality Reduction

Ways to reduce the number of attributes (dimensions) in a dataset.
PCA and SVD are examples of dimensionality reduction techniques.

Dimensionality Reduction: PCA

PCA is an example of a dimensionality reduction technique.
This method finds the best projection of the data to preserve the important information/variability.

Feature Subset Selection

Selecting a subset of attributes based on their importance to the model.

Feature Creation

Creating new features that are more informative than the existing features.

Mapping Data to a New Space

Using a mapping function to transform the data with the intention of better capturing important relations.

Classification Techniques - Base Classifiers

Decision tree, rule-based, nearest neighbor, Naive Bayes, Support Vector Machines, Neural Networks.

Ensemble Classifiers

Boosting, bagging, and random forests are examples of ensemble classifiers.

Example of a Decision Tree

Example of a Decision Tree model and its processes.

Apply Model to Test Data

How to use the model to classify data points with unknown classes.

Another Example of Decision Tree

Another example of a decision tree classification model

Decision Tree Classification Task

Example of a decision tree classification model and its components.

Model Overfitting

Overfitting occurs when a model fits the training data too closely and therefore fails to generalize to unseen data.

Classification Errors

Training errors and test errors measure errors found with training and test sets of data.

Generalization Errors

Expected errors in the model on randomly selected data from the original distribution.

Example Data Set

Example of a dataset containing two classes and noisy instances that is used for testing classification models.

Increasing Number of Nodes in Decision Trees

Graph showing how model error changes as the number of nodes increases.

Decision Tree with 4 nodes

Graph showing how the model would classify data points with a relatively small number of nodes.

Decision Tree with 50 nodes

Graph showing how the model would classify data points with a relatively large number of nodes.

Model Underfitting and Overfitting

Analyzing how training and testing errors change as a function of model complexity.

Model Overfitting – Impact of Training Data Size

Demonstrating that larger training dataset sizes can decrease error rates for models of varying complexity.

Data Mining Classification: Alternative Techniques

Rule-based and nearest-neighbor classifiers used in classification.

Rule-Based Classifier

Classification using IF-THEN rules.

Rule-Based Classifier (Example)

Example of a rule-based classifier for animal classification, based on their attributes

Application of Rule-Based Classifier

Defining how rules can cover instances.

Rule Coverage and Accuracy

Coverage: Fraction of all instances that satisfy the rule's antecedent.
Accuracy: Fraction of the instances satisfying the antecedent that also satisfy the consequent.
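
Both measures can be computed by filtering the records twice. In this sketch the function name, the record layout, and the ten sample records are hypothetical; the rule being evaluated is (status = Single) -> class = No.

```python
def coverage_and_accuracy(records, antecedent, consequent):
    """records: list of dicts; antecedent and consequent: predicates on one record."""
    covered = [r for r in records if antecedent(r)]
    coverage = len(covered) / len(records)
    # Accuracy is measured only over the records the rule covers.
    accuracy = sum(1 for r in covered if consequent(r)) / len(covered) if covered else 0.0
    return coverage, accuracy


pairs = [("Single", "No"), ("Single", "No"), ("Single", "Yes"), ("Single", "Yes"),
         ("Married", "No"), ("Married", "No"), ("Married", "No"), ("Married", "No"),
         ("Divorced", "Yes"), ("Divorced", "No")]
records = [{"status": s, "cls": c} for s, c in pairs]   # hypothetical data
coverage, accuracy = coverage_and_accuracy(
    records,
    antecedent=lambda r: r["status"] == "Single",
    consequent=lambda r: r["cls"] == "No",
)
```

Here the rule covers 4 of the 10 records, and 2 of those 4 have the predicted class.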

How does Rule-based Classifier Work?

Describing how a rule-based classifier works for a given set of data examples.

Characteristics of Rule Sets: Strategy 1

This strategy uses mutually exclusive and exhaustive rules. Mutually exclusive: the rules are independent of each other, so every record is covered by at most one rule. Exhaustive: every record is covered by at least one rule.

Characteristics of Rule Sets: Strategy 2

Rules are not necessarily mutually exclusive; they may overlap.
A record may trigger more than one rule, and every record need not have a rule; default class may be useful for this type of rule set.

Ordered Rule Set

Rules ordered by priority in a rule set.
A test record is assigned the class of the highest-ranked rule it triggers; if no rule fires, it is assigned the default class.

Rule Ordering Schemes

A rule-based ordering system that ranks rules based on quality.
A class-based ordering scheme organizes rules in terms of their class.

Building Classification Rules

Direct methods develop rules from data directly.
Indirect methods develop rules based on pre-trained models like decision trees.

Direct Method: Sequential Covering

A direct method of producing rules from data, starting with an empty rule, growing rules based on information gain, and removing training records when covered.

Example of Sequential Covering

Visual examples of a sequential covering algorithm on sample data.

Rule Growing

Common strategies in rule growing.

Rule Evaluation

Estimating the information gain of one rule given prior rule knowledge and data.

Direct Method: RIPPER

RIPPER is a direct method of learning rules from data to perform classification.

Direct Method: RIPPER

Optimizing the rule set using the MDL principle.
Re-evaluate the current rule set using 2 alternatives to minimize MDL.

Indirect Methods

Techniques to derive rules from other classification models (e.g., decision trees, neural networks).

Indirect Method: C4.5rules

Develop rules from an unpruned decision tree.

Indirect Method: C4.5rules

Subsets of rules based on class ordering and measures of description length.

Example

Example data used for practicing a rule-based classifier, based on identifying animals from their attributes.

C4.5 versus C4.5rules versus RIPPER

Comparing C4.5 (a decision tree algorithm) with the rule learners C4.5rules and RIPPER, based on their predictions on a set of animal sample data.

Advantages of Rule-Based Classifiers

Comparatively easier to interpret than other models.
Can be efficient in computational terms.
Can also handle redundant/irrelevant features.

Data Mining Classification: Alternative Techniques

A cover of instances occurs when the attributes of the instance satisfy the condition of the rule.

Rule-Based Classifier

Classifies records using if-then rules

Rule-Based Classifier (Example)

Example classifier for classifying animal types based on their features.

Application of Rule-Based Classifier

Using a set of learned rules for classifying new instances.

Nearest Neighbor Classifier

Basic strategy: classifying instances based on their proximity to known instances. Methods include calculating distance metrics, and weighting the neighbors based on distance/proximity to the new instance.

Nearest-Neighbor Classifiers

Classifying unknown instances by determining the class labels for those similar to the unclassified data point.
Calculating the distance between points to use for classification.

Nearest Neighbor Classification...

Data preprocessing is often helpful for the distance measurements to ensure different features don't have a dominating effect.

Nearest Neighbor Classification...

Selecting a k-value for the nearest neighbors method is important. Small k-values can be affected by noise and outliers, while large k-values might include data points in different classification classes.
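
A minimal k-nearest-neighbor sketch using Euclidean distance and an unweighted majority vote. The function name and the training points are assumptions; `math.dist` requires Python 3.8 or later.

```python
import math
from collections import Counter


def knn_predict(train, query, k):
    """train: list of (feature_vector, label). Returns the majority label of the k nearest."""
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


train = [((0, 0), "A"), ((0, 1), "A"), ((1, 0), "A"),
         ((5, 5), "B"), ((5, 6), "B"), ((6, 5), "B")]   # hypothetical training data
label_near_a = knn_predict(train, (0.5, 0.5), k=3)
label_near_b = knn_predict(train, (5.2, 5.1), k=3)
```

Trying k = 1 and k = 6 on the same data shows the trade-off described above: k = 1 follows a single neighbor, while k = 6 pulls in every point of both classes.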

Nearest-neighbor classifiers

These classifiers are local classifiers.
They can produce decision boundaries of arbitrary shapes.

Nearest Neighbor Classification...

Missing values are a challenge when performing nearest neighbor computations, often leading to incomplete evaluation.
Different methods for handling missing data exist.

K-NN Classifiers...

Irrelevant and redundant attributes add noise to the data, thus possibly giving biased results.
Removing these may improve the results.

K-NN Classifiers: Handling attributes that are interacting

Linear and nonlinear variables may complicate the decision boundaries if not handled correctly.

Handling attributes that are interacting

Attributes/variables that have interacting relationships may produce complicated classification regions.
The possible solution is to create a more sophisticated classifier or analysis.

Improving KNN Efficiency

Methods for improving efficiency of KNN computations are useful for improving on speed and scalability.

Bayesian Classifiers

Classification using probabilistic methods.

Bayes Classifier

Probability-based method for classification where posterior probabilities are calculated based on Bayes' theorem.

Using Bayes Theorem for Classification

A way to estimate the posterior probabilities, using Bayes' theorem.
The goal is to find the class value that maximizes the posterior probability.

Example Data

Examples of a data record, including relevant attribute values.
The purpose is to provide examples of a dataset.

Example Data

Using examples of data to estimate the probability for each scenario (like evade or not evade).

Conditional Independence

X and Y are conditionally independent given Z when, once Z is known, the probability of X does not depend on Y: P(X | Y, Z) = P(X | Z).

Naïve Bayes Classifier

Assumptions are made about independence between attributes to improve probability estimates given a target class.

Naïve Bayes on Example Data

Using given data and Bayes' Theorem to estimate probabilities given new data instances ("x").

Estimate Probabilities from Data

Approximating the probabilities from categorical attributes, as well as continuous attributes.

Estimate Probabilities from Data

Using probability density estimation for continuous values, such as income distributions.

Naïve Bayes Classifier

Classifies new data points using the maximum a posteriori (MAP) rule: the class label with the highest posterior probability is assigned.
A simple classification algorithm that can be used for various classification tasks.

Issues with Naïve Bayes Classifier

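If one conditional probability estimate is zero (for example, because an attribute value never occurs with a class in the training data), the whole product for that class becomes zero.
Smoothed estimates such as the Laplace estimate or the m-estimate can be used to overcome this issue.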
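
The sketch below shows one such remedy, Laplace (add-one) smoothing, in a small categorical Naive Bayes classifier. It is an assumed illustration, not the lesson's worked example: the function names, the tiny weather-style dataset, and the choice of Laplace smoothing are all mine.

```python
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """rows: tuples of categorical values; labels: one class label per row."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)   # (attribute index, class) -> value counts
    domains = defaultdict(set)            # attribute index -> distinct values seen
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            domains[i].add(value)
    return class_counts, value_counts, domains


def predict(model, row):
    """Return the class with the highest (unnormalized) posterior probability."""
    class_counts, value_counts, domains = model
    total = sum(class_counts.values())
    best_label, best_score = None, -1.0
    for label, n_c in class_counts.items():
        score = n_c / total                                   # prior P(class)
        for i, value in enumerate(row):
            # Laplace estimate: (count + 1) / (n_c + number of distinct values)
            score *= (value_counts[(i, label)][value] + 1) / (n_c + len(domains[i]))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"), ("rainy", "cool")]
labels = ["no", "no", "yes", "yes"]                           # hypothetical data
model = train_naive_bayes(rows, labels)
```

Without the added 1, the combination ("rainy", "hot") would score zero for both classes, because "hot" never occurs with "yes" and "rainy" never occurs with "no". With smoothing, both classes keep a small non-zero score and can still be compared.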

Example of a Naive Bayes Classifier

Demonstrating how a Naive Bayes classifier would work in classifying values if used on a dataset.

Naïve Bayes (Summary)

Summary of properties of Naive Bayes Classifier.

Naïve Bayes

Analyzing how Naive Bayes performs on datasets containing multi-class data.

Bayesian Belief Networks

Representing probabilistic relationships graphically.

Conditional Independence

Independence relationships between attributes represented in a directed acyclic graph (DAG).

Conditional Independence

Conditional independence properties of Bayesian Networks.

Probability Tables

Probability tables defining how each attribute in a set of attributes relates to a specific class or classification.

Example of Bayesian Belief Network

Example Bayesian belief networks for determining if a person has heart disease given their attributes.

Example of Inferencing using BBN

Using the provided Bayesian belief network structure and example data set to determine probability of a heart disease given the attributes.

Data Mining Classification: Artificial Neural Networks

A classifier represented as a network of interacting "nodes."
Each node represents a processing unit.

Artificial Neural Networks (ANN)

A very general classification algorithm useful for handling nonlinear problems or relationships between variables.

Basic Architecture of Perceptron

A very simple neural network with a single node, used to classify points based on their position in relation to a given hyperplane/decision boundary.

Perceptron Example

Examples of input data instances, and how a perceptron might perform classification given such examples.

Perceptron Learning Rule

Learning the perceptron network can be performed iteratively by adjusting the weights after an instance has been tested according to its performance/accuracy against the correct classification label.

Perceptron Learning Rule

The prediction error is scaled by a given learning rate to update the weights, and the process repeats until a stopping criterion is met.
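
A minimal sketch of this update, assuming labels of -1 and +1, a sign activation, and the update w <- w + rate * (y - predicted) * x. The function names, the learning rate, and the choice of the logical AND function as training data are assumptions.

```python
def train_perceptron(samples, learning_rate=0.1, max_epochs=100):
    """samples: list of (inputs, label) with label in {-1, +1}."""
    n = len(samples[0][0])
    weights, bias = [0.0] * n, 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in samples:
            activation = sum(w * xi for w, xi in zip(weights, x)) + bias
            predicted = 1 if activation > 0 else -1
            error = y - predicted                 # 0 when the prediction is correct
            if error != 0:
                mistakes += 1
                weights = [w + learning_rate * error * xi for w, xi in zip(weights, x)]
                bias += learning_rate * error
        if mistakes == 0:                         # stopping criterion: an error-free pass
            break
    return weights, bias


def perceptron_predict(weights, bias, x):
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else -1


# Logical AND is linearly separable, so training reaches an error-free pass.
and_samples = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
weights, bias = train_perceptron(and_samples)
```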

Example of Perceptron Learning

Visual example of perceptron learning and weight updates; weights are updated until a stopping criterion is met.

Perceptron Learning

Perceptron learning will converge to a solution given the data is linearly separable.

Nonlinearly Separable Data

A single perceptron cannot correctly classify data that are not linearly separable (the XOR function is a standard example).

Multi-layer Neural Network

A multi-layered neural network (that includes at least one hidden layer) is more expressive than a single perceptron node, and is able to classify nonlinearly separable problems

Multi-layer Neural Network

More than one hidden layer implies that the network captures complex interactions between attribute/input features.

Why Multiple Hidden Layers?

Activations at multiple layers can provide a better understanding of complex relationships than only considering initial inputs and/or output.
Deeper networks can learn better representations within the data, thus providing higher quality classifications than those from a shallower network.

Multi-Layer Network Architecture

A diagram which describes a multi-layer neural network with separate input and output layers with multiple hidden layers between them.
This arrangement allows more complex/non-linear decisions to be made.

Activation Functions

An activation function (for example sigmoid, tanh, or ReLU) is applied to the weighted sum of a node's inputs to produce its output.
The choice of function depends on the dataset and the application.

Learning Multi-layer Neural Network

The process of learning a neural network by using gradient descent to adjust the weights and biases.

Gradient Descent

Method used for adjustment of weights in ANNs given loss functions across all training points to reach an optimal solution.
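
The core update is w <- w - rate * gradient. The sketch below applies it to a deliberately simple one-parameter loss rather than a real network; the function name, the loss L(w) = (w - 3)^2, the learning rate, and the step count are all assumptions chosen to keep the example small.

```python
def gradient_descent(grad, w0, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient, starting from w0."""
    w = w0
    for _ in range(steps):
        w -= learning_rate * grad(w)
    return w


# Hypothetical loss L(w) = (w - 3)^2 has gradient 2 * (w - 3) and its minimum at w = 3.
w_min = gradient_descent(lambda w: 2 * (w - 3), w0=0.0)
```

In a neural network the same step is applied to every weight and bias, with the gradient of the loss summed or averaged over the training points.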

Computing Gradients

The calculation/computations for finding derivatives using chain rule for a given loss function.

Backpropagation Algorithm

A method for calculating gradients "backwards" for an ANN's layers given all training data.

Design Issues in ANN

Considerations and attributes that affect an ANN's design, including number of nodes in the layers and the initial weights (and/or biases) in the network.

Characteristics of ANN

ANNs are universal approximators but are sensitive to noise; they can handle redundant or irrelevant attributes but have high computational complexity.

Deep Learning Trends

Describes different trends, like improvements in computing resources and techniques.

Support Vector Machines

A classification method that finds the hyperplane maximizing the margin, that is, the distance between the decision boundary and the closest points of each class.

Support Vector Machines

Finding a linear hyperplane to separate the data (when possible).

Learning Linear SVM

Using Lagrange multipliers to solve the optimization problem for finding weights and bias for the linear hyperplane.

Example of Linear SVM

Example of a training dataset and how a linear SVM might perform classification, resulting in a computed linear hyperplane.

Learning Linear SVM

Methods and attributes (such as support vectors) that affect the decision boundary in this method.

Support Vector Machines

What happens when the data points are not linearly separable?

Support Vector Machines

How do you find the best separating hyperplane for non-linearly separable data?

Nonlinear Support Vector Machines

Techniques converting data to a higher dimensional space such that data has a linear relationship between classes.

Learning Nonlinear SVM

Methods and properties that can be seen to ensure data can be transformed and analyzed in high dimensional spaces.

Kernel Trick *

A strategy to avoid calculating the transformation into a potentially high-dimensional space.
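
The idea can be checked numerically. In this sketch (function names and sample points are assumptions), the degree-2 polynomial kernel K(x, y) = (x · y)^2 on 2-D points gives the same value as a dot product in an explicitly constructed 3-D feature space, without ever building that space.

```python
import math


def phi(x):
    """Explicit degree-2 feature map for a 2-D point."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def poly_kernel(x, y):
    """K(x, y) = (x . y)^2, computed in the original 2-D space."""
    return dot(x, y) ** 2


x, y = (1.0, 2.0), (3.0, 4.0)          # hypothetical points
via_kernel = poly_kernel(x, y)
via_mapping = dot(phi(x), phi(y))
```

For higher-degree or Gaussian kernels the explicit feature space becomes very large or infinite, which is where the saving matters.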

Example of Nonlinear SVM

Visual example of how a nonlinear SVM may perform classification given non-linear data

Learning Nonlinear SVM

Advantages of using a kernel function for nonlinear SVMs, and limitations.

Characteristics of SVM

Various properties of the SVM algorithm, including high computational complexity.

Data Mining Classification: Imbalanced Class Problem

Classification problems in which one class dominates the other class/classes.

Class Imbalance Problem

Challenges in using data mining techniques for scenarios that have one class greatly dominating the other classes.

Confusion Matrix

A table used to estimate the different types of errors made by classification models.

Accuracy

The ratio of correctly classified instances to all instances in a given dataset.

Problem with Accuracy

Measuring accuracy on imbalanced datasets is unreliable because accuracy measurements may lead you to incorrect conclusions/results given one class may dominate the others.

Which model is better?

Comparing various classification models based on their performance measures on different imbalanced datasets using several performance measures.

Alternative Measures

Ways to evaluate classification models beyond just accuracy (e.g., precision, recall, F-measure).

Alternative Measures

F-measure, precision, recall on data using a confusion matrix.
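
These measures follow directly from the four confusion-matrix counts. The function name and the two hypothetical classifiers below are assumptions; the dataset is imagined to have 10 positive and 990 negative instances.

```python
def classification_measures(tp, fp, fn, tn):
    """Return (accuracy, precision, recall, F-measure) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return accuracy, precision, recall, f_measure


# Classifier 1 predicts "negative" for everything.
always_negative = classification_measures(tp=0, fp=0, fn=10, tn=990)
# Classifier 2 finds all 10 positives at the cost of 10 false alarms.
finds_positives = classification_measures(tp=10, fp=10, fn=0, tn=980)
```

Both classifiers reach 99% accuracy, but the first has zero recall and zero F-measure, which is why accuracy alone is misleading on imbalanced data.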

ROC (Receiver Operating Characteristic)

A method for evaluating a classifier's performance in terms of true positive rate (TPR) and false positive rate (FPR).

ROC Curve

Plotting true positive rates (TPR) against false positive rates (FPR).

ROC Curve Example

Example graphs showing the ROC curve and its tradeoffs between different classifier types.

How to Construct an ROC curve

Method for constructing an ROC curve: sort instances by classifier score and plot the (FPR, TPR) point for each threshold.

Using ROC for Model Comparison

Using area under the curve (AUC) to compare models.
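
A minimal sketch of both steps, assuming each instance has a score (higher means more likely positive) and a label of 1 or 0. The function names are assumptions, and the area is approximated with the trapezoidal rule.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points, sweeping the threshold from high to low.

    labels are 1 (positive) or 0 (negative); both classes must be present.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    points, tp, fp = [(0.0, 0.0)], 0, 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Instances with the same score cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


perfect = auc(roc_points([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))
mixed = auc(roc_points([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0]))
```

An AUC of 1 means every positive is ranked above every negative, while 0.5 corresponds to random ranking.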

Dealing with Imbalanced Classes - Summary

Practical strategies for handling imbalanced classes, and which evaluation measures are best suited for use with such data.

Which Classifier is better?

Comparing different classifiers based on their performance measures.

Building Classifiers with Imbalanced Training Set

Strategies for overcoming issues with imbalanced datasets.

Data Mining Cluster Analysis: Basic Concepts

Cluster analysis is a data mining technique that groups data objects into clusters; cluster types include contiguity-based, prototype-based, and density-based clusters.
Partitional clustering divides the data into non-overlapping subsets.
Hierarchical clustering produces a set of nested clusters organized as a tree.

What is Cluster Analysis?

Cluster analysis places objects into groups given that the objects in the group are similar (or related) to one another, and different from other groups/objects.

Applications of Cluster Analysis

Group related documents, genes and proteins with similar functionality, or stocks or products.
Summarize large datasets.

Notion of a Cluster can be Ambiguous

There can be various definitions for the notion of a cluster.
Which definition is appropriate depends on the domain and the desired results.

Types of Clusterings

Partitional clustering (dividing data objects into disjoint subsets) and Hierarchical clustering (a series of nested clusters) are two types of clustering techniques.

Partitional Clustering

A division of data points into non-overlapping subsets.

Hierarchical Clustering

A set of nested clusters organized in a hierarchical tree.

Other Distinctions Between Sets of Clusters

Exclusive vs. Non-exclusive clusters
Exclusive clustering (point only belongs to one cluster); non-exclusive clusters (point can belong to multiple)
Partial vs. Complete clusters: in a partial clustering, only some of the data are assigned to clusters.
Determining which of these is better depends on the data and desired result.

Types of Clusters

Well-separated clusters
Prototype-based clusters
Contiguity-based clusters
Density-based clusters
Described by an Objective Function (that you use to minimize or maximize to get good clusters/results).

Types of Clusters: Well-Separated

A cluster is a set of points in which every point is closer to every other point in the cluster than to any point outside it.

Types of Clusters: Prototype-Based

A cluster is a set of objects where any object in a given cluster is closer to that cluster's prototype/center than to any other cluster's center.

Types of Clusters: Contiguity-Based

A cluster is a set of points in which each point is closer to at least one other point in the cluster than to any point outside the cluster.

Types of Clusters: Density-Based

A cluster is a dense region of points that is separated from other dense regions by regions of low density.

Types of Clusters: Objective Function

Objective functions are used to find the clusters that best satisfy a stated criterion.
Finding the optimal clustering is NP-hard in most cases.
Global objectives optimize a function over the entire dataset, while local strategies look for immediate/nearby improvements.

Characteristics of the Input Data Are Important

The type of proximity or density measure is important as that affects the algorithm's performance and accuracy.
Characteristics of data including sparseness, attribute types, relationships between attributes, noise, outliers and distribution matter.

Clustering Algorithms

Common clustering algorithms include k-means, hierarchical clustering, and density-based clustering.

K-means Clustering

A partitional clustering algorithm.
Each cluster has an associated centroid (center point).
Each data point is assigned to the cluster with the closest centroid.

Example of K-means Clustering

Iterative steps of the k-means algorithm.

K-means Clustering – Details

Simple iterative algorithm.
Initial centroids are chosen randomly; the clusters and their centroids change in each iteration until a desired threshold/set of criteria (such as relatively few points changing clusters) is reached.
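The computational cost of K-means clustering is O(n * K * I * d), where n is the number of points, K the number of clusters, I the number of iterations, and d the number of attributes.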
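
The iteration can be sketched as follows. This is a minimal illustration with assumed names and parameters: it uses Euclidean distance, picks the initial centroids at random from the data, and stops when no point changes cluster. `math.dist` requires Python 3.8 or later.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Return (centroids, assignment) for a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)            # random initial centroids
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its closest centroid.
        new_assignment = [
            min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points
        ]
        if new_assignment == assignment:         # no point changed cluster
            break
        assignment = new_assignment
        # Update step: move each centroid to the mean of its points.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:                          # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, assignment


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]   # hypothetical data
centroids, assignment = kmeans(points, k=2)
```

Each pass computes n * K distances over d attributes, and there are I passes, which gives the overall cost term by term.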

K-means Objective Function

Sum of squared errors (SSE) objective function measures clustering error.
SSE represents the sum of squared distances between each point in a given cluster, and that cluster's centroid.
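
As a small sketch of the definition, assuming Euclidean distance and a list that maps each point to the index of its centroid (the function name and sample values are illustrative):

```python
def sse(points, centroids, assignment):
    """Sum of squared Euclidean distances from each point to its assigned centroid."""
    return sum(
        sum((a - b) ** 2 for a, b in zip(p, centroids[j]))
        for p, j in zip(points, assignment)
    )


points = [(0, 0), (2, 0), (10, 0), (12, 0)]       # hypothetical data
good = sse(points, [(1, 0), (11, 0)], [0, 0, 1, 1])
poor = sse(points, [(0, 0), (8, 0)], [0, 1, 1, 1])
```

Given two clusterings of the same data with the same K, the one with the smaller SSE is preferred.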

Two different K-means Clusterings

Demonstrating how different initializations can lead to different clustering outcomes given some dataset instances.

Importance of Choosing Initial Centroids

Initializations of centroids for random data are not guaranteed to produce the best/ideal result.

Problems with Selecting Initial Points

The chance of choosing a good/ideal centroid from each cluster is small, especially if the number of clusters is large.
The initial centroids sometimes don't readjust themselves as expected given the data/data analysis.

10 Clusters Example

Example of clusters and how centroids affect clustering results in more complex/challenging datasets

Solutions to Initial Centroids Problem

Multiple runs: Running K-means several times with different random initial centroids and keeping the best result can give better clusters than a single run.
Selecting most-separated points.
K-means++
Bisecting K-means

K-means++

Each next centroid is chosen at random from the data points, with probability proportional to the squared distance from the point to its closest already-selected centroid.
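
A minimal sketch of this seeding step, following the standard K-means++ formulation in which the selection weight is the squared distance to the nearest centroid chosen so far. The function name and data are assumptions, and the sketch assumes k does not exceed the number of distinct points.

```python
import random


def kmeans_pp_init(points, k, seed=0):
    """Choose k initial centroids from points using K-means++ style weighting."""
    rng = random.Random(seed)
    centroids = [rng.choice(points)]             # first centroid: uniformly at random
    while len(centroids) < k:
        # Squared distance from each point to its closest chosen centroid.
        d2 = [
            min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids)
            for p in points
        ]
        # Points far from every chosen centroid are proportionally more likely.
        centroids.append(rng.choices(points, weights=d2, k=1)[0])
    return centroids


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]   # hypothetical data
initial = kmeans_pp_init(points, k=3)
```

A point that is already a centroid has weight zero, so it cannot be selected twice.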

Bisecting K-means

A hierarchical method for creating K clusters.
It repeats splitting clusters into two until there are K clusters.

Limitations of K-means

K-means can have trouble with differing sizes, non-globular shapes, and datasets containing outliers.

Limitations of K-means: Differing Sizes, Densities, and Non-globular Shapes, Outliers

Demonstrations of how K-means clustering might perform poorly given certain types of datasets

Overcoming K-means Limitations

Generate many clusters (potentially far more than the desired number).
Post-process by merging these clusters, which may improve the final result on challenging datasets.

Questions and Answers

What is the main purpose of regression analysis?

The object catalog mentioned has a size of 150 GB.

How many new high red-shift quasars were found?

Regression is extensively studied in __________ and neural network fields.

Match the types of data with their respective sizes:

What types of characteristics are included in the classification of galaxies?

Regression analysis can only be applied to linear relationships.

What are two examples of predictions made using regression analysis?

The study of __________ includes investigating the stages of formation of galaxies.

Which of the following is NOT an example of regression analysis?

Which of the following characteristics of data can complicate the recognition of the proper attribute type?

High dimensional data brings several challenges in data analysis.

What type of data consists of a collection of records, each with a fixed set of attributes?

The __________ data type consists of documents represented as term vectors.

Match the type of data with its correct description:

Which property of data refers to how the scale can affect pattern recognition?

The presence of some attributes in sparse data is always sufficient for analysis.

What is the main challenge related to sparsity in data?

Data objects with the same fixed set of attributes can be visualized in a ________-dimensional space.

Which of the following is NOT a type of data set mentioned?

What is an example of noise in data?

Outliers are always considered noise in data analysis.

What are two reasons for missing values in a dataset?

Duplicate data issues often arise when merging data from __________ sources.

Match the following data quality problems with their definitions:

What can be done to handle missing values in a dataset?

Having multiple email addresses for the same person is an example of duplicate data.

What is a similarity measure in data analysis?

To clean duplicate data, a process known as __________ is utilized.

Which of the following is NOT described as a data quality problem?

What is the main purpose of the mutual information approach as described?

An indicator variable can take on the value of -1 when both objects have a value of 0 for a symmetric attribute.

What are the two main reasons for using weights when calculating similarities between attributes?

In data preprocessing, __________ refers to combining two or more attributes (or objects) into a single attribute (or object), which reduces the number of attributes or objects.

Match the following data preprocessing techniques to their descriptions:

What is a primary reason for the strong competitive pressure in data mining?

Data mining is primarily concerned with collecting data rather than analyzing it.

Name one type of data that is extensively collected by e-commerce websites.

In data mining, unusual observations that differ from the majority of the data are referred to as __________.

Match the following types of data with their descriptions:

What challenge is often faced in data analysis due to high-dimensional data?

Give one example of how data can be gathered from social networking sites.

Which of the following is a characteristic that can complicate the recognition of the proper attribute type?

The presence of sparse data means that all attributes in the dataset are equally important.

What is meant by 'dimensionality' in the context of data analysis?

Data represented as __________ provides a multidimensional view of objects based on their attributes.

Match the types of data sets with their definitions:

Which operation is meaningful for categorical data?

High-dimensional data poses fewer challenges for data analysis than low-dimensional data.

What characterizes document data in data mining?

The scale at which data is analyzed refers to its __________.

Which characteristic of data refers to the amount of space occupied by the dataset?

What type of data consists of a collection of records with a fixed set of attributes?

What is an example of a transaction in a grocery store?

The __________ refers to the quality of the data that can negatively impact data processing efforts.

Match the following types of data with their examples:

Which of these data types can represent relationships between variables?

The average monthly temperature is an example of spatio-temporal data.

In data mining, __________ involves identifying unusual observations that differ from the majority of the data.

Poor data quality can lead to which of the following issues?

What distance measure is defined as the maximum absolute difference between corresponding components of two vectors?

The Euclidean distance is always less than or equal to the Manhattan distance for any two points.

Write the formula for the Mahalanobis distance.

The Hamming distance is a special case of the ______ distance, applicable to binary vectors.
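
As a study aid for the distance questions above, the following is a minimal sketch of the measures they mention, not a reference implementation. The function names and sample points are assumptions chosen for illustration. The covariance matrix is the one quoted in the lesson's Mahalanobis example, and the Mahalanobis helper handles only the 2x2 case, assuming the usual definition as the square root of (x - y) Σ⁻¹ (x - y)ᵀ.

```python
import math


def minkowski(x, y, r):
    """Minkowski distance of order r (r=1 is Manhattan, r=2 is Euclidean)."""
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1.0 / r)


def supremum(x, y):
    """Supremum (L-max) distance: largest absolute component difference."""
    return max(abs(a - b) for a, b in zip(x, y))


def hamming(x, y):
    """Hamming distance: number of positions where two vectors differ."""
    return sum(a != b for a, b in zip(x, y))


def mahalanobis_2d(x, y, cov):
    """Mahalanobis distance for 2-D points, given a 2x2 covariance matrix."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    # Closed-form inverse of a 2x2 matrix.
    inv = ((d / det, -b / det), (-c / det, a / det))
    dx, dy = x[0] - y[0], x[1] - y[1]
    # Quadratic form (x - y) * inv(cov) * (x - y)^T.
    q = dx * (inv[0][0] * dx + inv[0][1] * dy) + dy * (inv[1][0] * dx + inv[1][1] * dy)
    return math.sqrt(q)


# Illustrative points (assumed, not taken from the lesson).
p, q = (0, 0), (3, 4)
euclidean = minkowski(p, q, 2)   # 5.0
manhattan = minkowski(p, q, 1)   # 7.0
l_max = supremum(p, q)           # 4

# For binary vectors, Hamming and Manhattan distances coincide.
u, v = (1, 0, 1, 1), (0, 0, 1, 0)
binary_hamming = hamming(u, v)
binary_manhattan = minkowski(u, v, 1)

# Covariance matrix quoted in the lesson's Mahalanobis example.
sigma = ((0.3, 0.2), (0.2, 0.3))
d_mahal = mahalanobis_2d((0.5, 0.5), (1.5, 1.5), sigma)
```

For the sample points, the supremum distance is no larger than the Euclidean distance, which in turn is no larger than the Manhattan distance; this ordering holds for any pair of points.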