79 Questions
What is an attribute in the context of data mining?
A property or characteristic of an object
Which term is synonymous with 'attribute' in data mining?
Variable
What are attribute values in the context of data mining?
Numbers or symbols assigned to an attribute
How are attributes and attribute values distinguished?
Same attribute can be mapped to different attribute values
What is an example of an attribute value distinction provided in the text?
Height measured in feet or meters
In the context of data mining, what does 'ID' refer to?
A unique identifier for an object
What is the distinction between 'ID' and 'age' in terms of attribute values?
ID has no limit but age has a maximum and minimum value
What is the purpose of measuring an attribute in data mining?
To describe the properties of objects
Which type of attribute provides enough information to order objects?
Ordinal
What does a ratio attribute capture?
All four properties: distinctness, order, meaningful differences (addition), and meaningful ratios (multiplication)
What type of attribute is temperature in Celsius or Fahrenheit?
Interval
What type of attribute is eye color?
Nominal
What type of attribute is temperature in Kelvin?
Ratio
Which attribute type has real numbers as attribute values?
Continuous
What type of attribute is ID numbers?
Nominal
What type of attribute is calendar dates?
Interval
What type of attribute is monetary quantities?
Ratio
Which attribute type captures only distinctness?
Nominal
What type of attribute is counts and age?
Ratio
What type of attribute is height in {tall, medium, short}?
Ordinal
What does standardization in statistics refer to?
Subtracting off the means and dividing by the standard deviation
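As a minimal sketch of the standardization described above, the snippet below computes z-scores with Python's standard library. The function name `standardize` and the sample heights are illustrative assumptions, and the population standard deviation (`pstdev`) is one possible choice; the sample standard deviation could be used instead.

```python
from statistics import mean, pstdev


def standardize(values):
    """Return z-scores: subtract the mean, then divide by the standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        raise ValueError("standard deviation is zero; z-scores are undefined")
    return [(v - mu) / sigma for v in values]


# Hypothetical heights in centimetres, used only for illustration.
heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
z_scores = standardize(heights_cm)
print(z_scores)
```

After standardization the values have mean 0 and standard deviation 1, so attributes measured on different scales become comparable.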
What is the range for the similarity measure?
[0, 1]
What is the formula for Euclidean Distance?
$dist = \sqrt{\sum_{k=1}^{n} (p_k - q_k)^2}$, where $n$ is the number of attributes and $p_k$, $q_k$ are the $k$-th attributes of objects $p$ and $q$
What is the generalization of Euclidean Distance?
Minkowski Distance
What is the parameter 'r' for the Minkowski Distance when it represents the 'supremum' distance?
∞
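The relationship between these distances can be sketched in a few lines of Python. Euclidean distance is the r = 2 case of the Minkowski distance, r = 1 gives the city-block distance, and r = ∞ gives the supremum distance (the largest single-attribute difference). The function name `minkowski` and the sample points are illustrative assumptions.

```python
import math


def minkowski(p, q, r):
    """Minkowski distance between two points with the same number of attributes.

    r = 1 gives city-block distance, r = 2 gives Euclidean distance,
    and r = math.inf gives the supremum distance.
    """
    if len(p) != len(q):
        raise ValueError("points must have the same number of attributes")
    diffs = [abs(a - b) for a, b in zip(p, q)]
    if math.isinf(r):
        # Limit as r grows: only the largest difference matters.
        return max(diffs)
    return sum(d ** r for d in diffs) ** (1.0 / r)


# Hypothetical points whose attribute differences are 3 and 4.
p, q = (0.0, 2.0), (3.0, 6.0)
print(minkowski(p, q, 1))         # city-block: 3 + 4
print(minkowski(p, q, 2))         # Euclidean: sqrt(9 + 16)
print(minkowski(p, q, math.inf))  # supremum: max(3, 4)
```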
What is the range for dissimilarity measure?
[0, ∞)
Under the transformation s = 1/(1 + d), which similarity values correspond to dissimilarity values of 0, 1, 10, 100?
1, 0.5, 0.09, 0.01, respectively
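The similarity values in this answer are consistent with the transformation s = 1/(1 + d), which maps a dissimilarity d in [0, ∞) to a similarity in (0, 1]. A minimal sketch follows; the function name is an illustrative assumption.

```python
def similarity_from_dissimilarity(d):
    """Map a non-negative dissimilarity d to a similarity via s = 1 / (1 + d)."""
    if d < 0:
        raise ValueError("dissimilarity must be non-negative")
    return 1.0 / (1.0 + d)


for d in (0, 1, 10, 100):
    # Rounded to two decimals to match the values quoted in the answer.
    print(d, round(similarity_from_dissimilarity(d), 2))
```

Note that the mapping is nonlinear: large dissimilarities are compressed toward 0, so differences among them become hard to distinguish.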
What is the measure of plant growth used by ecosystem scientists?
Net Primary Production (NPP)
What does proximity refer to?
Both similarity and dissimilarity measures
What is the minimum dissimilarity value?
0
What is the upper limit for dissimilarity measure?
There is no upper limit
When is standardization necessary for Euclidean Distance?
When scales differ
What is the purpose of aggregation in data preprocessing?
Data reduction and change of scale
What is the key principle for effective sampling?
Using a sample will work almost as well as using the entire data set, if the sample is representative
What is the main reason for employing sampling in data mining?
Processing the entire set of data of interest is too expensive or time consuming
What is the purpose of dimensionality reduction?
Avoid curse of dimensionality and reduce time and memory required by data mining algorithms
What technique is used for dimensionality reduction?
Principal Component Analysis (PCA)
What does PCA aim to find?
A projection that captures the largest amount of variation in data
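To make the idea concrete, here is a minimal sketch of PCA restricted to two-dimensional data, where the direction of largest variation has a closed form. It is an illustration under stated assumptions, not a general implementation: the function name and the sample points are hypothetical, and practical work would normally rely on a numerical library.

```python
import math


def first_principal_component(points):
    """Return a unit vector along which 2-D points vary the most."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Entries of the 2x2 covariance matrix (population form).
    cxx = sum((x - mean_x) ** 2 for x, _ in points) / n
    cyy = sum((y - mean_y) ** 2 for _, y in points) / n
    cxy = sum((x - mean_x) * (y - mean_y) for x, y in points) / n
    # Closed-form angle of the leading eigenvector of a symmetric 2x2 matrix.
    theta = 0.5 * math.atan2(2.0 * cxy, cxx - cyy)
    return (math.cos(theta), math.sin(theta))


# Hypothetical points lying on the line y = x.
points = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (4.0, 4.0)]
direction = first_principal_component(points)
print(direction)
```

Projecting each centered point onto `direction` would reduce the data from two attributes to one while keeping as much variance as a single direction can capture.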
What is the purpose of data cleaning?
Dealing with duplicate data issues
What is the major issue when merging data from heterogeneous sources?
The data set may include data objects that are duplicates or almost duplicates of one another
What is the purpose of feature subset selection?
To reduce the number of attributes
What is the aim of discretization and binarization?
To transform continuous attributes into discrete or binary values
What is the purpose of attribute transformation?
To convert data into a more suitable form for analysis
What does the curse of dimensionality refer to?
Data becomes increasingly sparse as dimensionality increases
What does feature subset selection aim to achieve?
Remove redundant or irrelevant attributes to reduce data dimensionality
What is involved in feature creation?
Creating new attributes to capture important information more efficiently
How can data be mapped to a new space?
Through techniques like Fourier and wavelet transforms
What does discretization involve?
Converting a continuous attribute into an ordinal attribute, commonly used in classification
What does binarization involve?
Mapping a continuous or categorical attribute into one or more binary variables, commonly used for association analysis
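One possible binarization of a categorical attribute is sketched below: each distinct category becomes its own 0/1 variable. The function name `binarize` and the sample values are illustrative assumptions, and other encodings are possible.

```python
def binarize(values):
    """Map a categorical attribute to one 0/1 variable per distinct category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


# Hypothetical categorical attribute values.
quality = ["good", "poor", "ok", "good"]
categories, rows = binarize(quality)
print(categories)
for value, row in zip(quality, rows):
    print(value, row)
```

Each row contains exactly one 1, marking the category that the original value belonged to.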
What is attribute transformation?
Mapping the entire set of attribute values to a new set using functions like x^k, log(x), e^x, |x|, standardization, and normalization
What is normalization?
An attribute transformation technique that adjusts attributes for differences in frequency of occurrence, mean, variance, and range
What does the Iris Plant data set contain?
Three flower types and four non-class attributes
How can discretization be illustrated using the Iris data set?
Different petal width and length values imply different flower types
How can discretization be done?
Using unsupervised or supervised approaches, finding breaks in the data values with or without using class labels
What are discretization approaches provided as visual examples in the text?
Equal interval width, equal frequency, and k-means approaches
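The two simplest unsupervised approaches named above can be sketched as follows. The function names and the sample petal widths are made up for illustration (they are not taken from the Iris data set), and the k-means approach is omitted for brevity.

```python
def equal_width_bins(values, k):
    """Assign each value to one of k intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    if width == 0:
        return [0 for _ in values]
    # The maximum value would fall just past the last interval, so clamp it.
    return [min(int((v - lo) / width), k - 1) for v in values]


def equal_frequency_bins(values, k):
    """Assign values to k bins that hold roughly the same number of values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(values)
    return bins


# Hypothetical petal widths in centimetres.
widths = [0.2, 0.3, 0.4, 1.0, 1.3, 1.5, 1.8, 2.5]
print(equal_width_bins(widths, 3))
print(equal_frequency_bins(widths, 4))
```

Equal width is sensitive to outliers because a single extreme value stretches every interval, while equal frequency adapts to the data but, in this simple form, may place tied values in different bins.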
What type of data involves records with sets of items, like products purchased at a store?
Transaction data
Which type of data is represented as term vectors with the frequency of terms in the document?
Document data
What are some important characteristics of data mentioned in the text?
Dimensionality, sparsity, resolution, and size
What type of data involves sequences of transactions, genomic sequence data, and spatio-temporal data?
Ordered data
What is the term for the modification of original values in data?
Noise
Which type of data quality problem refers to data objects with significantly different characteristics?
Outliers
What type of data quality problem can be due to non-collection or inapplicability?
Missing values
What does data matrix represent data objects as?
Points in multi-dimensional space
Which type of data involves generic graphs, molecules, and webpages?
Graph-based data
What can poor data quality negatively impact?
Data processing efforts and revenue
Which types of data sets are listed alongside record data and the data matrix?
Document data, transaction data, graph-based data, and ordered data
What are some characteristics of data mentioned in the text?
Dimensionality, sparsity, resolution, and size
What type of data involves records with sets of items, like products purchased at a store?
Transaction data
What does noise refer to in the context of data quality problems?
Modification of original values
What is represented as term vectors with the frequency of terms in the document?
Document data
What type of data involves sequences of transactions, genomic sequence data, and spatio-temporal data?
Ordered data
What are some important characteristics of data mentioned in the text?
Dimensionality, sparsity, resolution, size
What type of data sets include generic graphs, molecules, and webpages?
Graph-based data
What type of attribute is temperature in Kelvin?
Ratio attribute
What is an example of a data quality problem mentioned in the text?
Noise
What does a data matrix represent data objects as?
Points in multi-dimensional space
What does sparsity refer to in the context of data?
Small number of non-zero elements
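Sparsity matters in practice because only the non-zero elements need to be stored and processed. The sketch below illustrates this with a hypothetical term-count vector for a document; the function names are assumptions made for the example.

```python
def to_sparse(vector):
    """Keep only the non-zero entries as an {index: value} mapping."""
    return {i: v for i, v in enumerate(vector) if v != 0}


def sparsity(vector):
    """Return the fraction of entries that are zero."""
    return sum(1 for v in vector if v == 0) / len(vector)


# Hypothetical term counts for one document over a ten-term vocabulary.
term_counts = [0, 3, 0, 0, 1, 0, 0, 0, 2, 0]
print(to_sparse(term_counts))
print(sparsity(term_counts))
```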
What type of attribute is counts and age?
Ratio attribute
What type of data involves a collection of records with fixed attributes?
Record data
Study Notes
Data Dimensionality Reduction Techniques
- Feature subset selection is used to reduce data dimensionality by removing redundant or irrelevant attributes.
- Feature creation involves creating new attributes to capture important information more efficiently, using methods such as feature extraction, construction, and mapping data to a new space.
- Mapping data to a new space can be achieved through techniques like Fourier and wavelet transforms.
- Discretization involves converting a continuous attribute into an ordinal attribute, commonly used in classification.
- The Iris Plant data set, available from the UCI Machine Learning Repository, contains three flower types and four non-class attributes.
- Discretization can be illustrated using the Iris data set, where different petal width and length values imply different flower types.
- Discretization can be done using unsupervised or supervised approaches, finding breaks in the data values with or without using class labels.
- Binarization maps a continuous or categorical attribute into one or more binary variables, commonly used for association analysis.
- Attribute transformation involves mapping the entire set of attribute values to a new set, using functions like x^k, log(x), e^x, |x|, standardization, and normalization.
- Normalization is an attribute transformation technique that adjusts attributes for differences in frequency of occurrence, mean, variance, and range.
- The text provides visual examples of discretization approaches, including equal interval width, equal frequency, and k-means approaches.
- Attribute transformation and discretization do not reduce the number of attributes themselves, but they are common preprocessing steps that prepare data for various data mining tasks.
Data Mining: Types of Data and Data Quality
- Association analysis uses asymmetric attributes
- Types of data sets include record data, data matrix, document data, transaction data, graph-based data, and ordered data
- Important characteristics of data include dimensionality, sparsity, resolution, and size
- Record data consists of a collection of records with fixed attributes
- Data matrix represents data objects as points in multi-dimensional space
- Document data is represented as term vectors with the frequency of terms in the document
- Transaction data involves records with sets of items, like products purchased at a store
- Graph data examples include generic graphs, molecules, and webpages
- Ordered data includes sequences of transactions, genomic sequence data, and spatio-temporal data
- Poor data quality can negatively impact data processing efforts and lead to significant revenue loss
- Data quality problems include noise, outliers, and missing values
- Noise refers to the modification of original values, while outliers are data objects with significantly different characteristics. Missing values can be due to non-collection or inapplicability, and can be handled by eliminating data objects or estimating missing values.
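The two ways of handling missing values mentioned above, eliminating data objects and estimating the missing values, can be sketched as follows. This is a minimal illustration: the function names and records are hypothetical, `None` is assumed to mark a missing value, and the mean is only one possible estimate.

```python
from statistics import mean


def drop_missing(records, key):
    """Eliminate data objects whose value for `key` is missing."""
    return [r for r in records if r.get(key) is not None]


def fill_with_mean(records, key):
    """Estimate missing values for `key` using the mean of the known values."""
    known = [r[key] for r in records if r.get(key) is not None]
    estimate = mean(known)
    return [
        {**r, key: estimate} if r.get(key) is None else dict(r)
        for r in records
    ]


# Hypothetical records; the age of the second object was not collected.
records = [
    {"id": 1, "age": 20},
    {"id": 2, "age": None},
    {"id": 3, "age": 40},
]
kept = drop_missing(records, "age")
filled = fill_with_mean(records, "age")
print(kept)
print(filled)
```

Dropping objects is simple but discards information, while estimation keeps every object at the cost of introducing values that were never observed.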
Test your knowledge of data dimensionality reduction techniques with this quiz. Explore feature subset selection, feature creation, mapping data to a new space, discretization, binarization, attribute transformation, and more. See how these techniques are applied to the Iris Plant data set and their significance in data mining tasks.