## 95 Questions

What is one of the reasons for the enormous data growth in commercial and scientific databases?

Advances in data generation and collection technologies

Which company is mentioned as having petabytes of web data?

Yahoo

What is one of the examples of the competitive pressure mentioned in the text?

Providing better, customized services for an edge in Customer Relationship Management

What is the new mantra (slogan) mentioned in the text regarding data gathering?

Gather whatever data you can whenever and wherever possible

What is the primary purpose of data mining?

To automate analysis of massive datasets

Which fields contribute ideas to data mining?

Machine learning, AI, and statistics

What are examples of classification tasks in data mining?

Categorizing news stories and predicting tumor cells

What are some applications of classification tasks in data mining?

Fraud detection in credit card transactions and churn prediction for telephone customers

What is the primary goal of data mining in the context of sky survey cataloging?

To predict the class of sky objects based on telescopic survey images

What are the tasks involved in data mining?

Prediction methods and finding human-interpretable patterns in data

What is the volume of earth science data archived by NASA EOSDIS per year?

Over petabytes

What does data mining involve?

The nontrivial extraction of potentially useful information from data

What are the potential opportunities through data mining?

Improving productivity and solving societal problems

What are the examples of tasks in data mining?

Prediction methods and finding human-interpretable patterns in data

What is the focus of classification tasks in data mining?

Identifying intruders in cyberspace and predicting credit worthiness

What is the primary use of data mining in fraud detection?

Detecting fraud in credit card transactions

Which of the following is an application of association rule discovery in data mining?

Market-basket analysis

What is the primary purpose of deviation/anomaly/change detection in data mining?

Detect significant deviations from normal behavior

What is the main focus of clustering in data mining?

Finding groups of similar objects

Which feature is the class model based on?

Success stories

What is the primary use of regression in data mining?

Predict continuous valued variables

What are the motivating challenges in data mining?

Scalability, high dimensionality, heterogeneous and complex data

What is an example of association analysis mentioned in the text?

Subspace differential coexpression pattern enriched with the TNF/NF-κB signaling pathway related to lung cancer

What is an application of clustering in data mining?

Custom profiling for targeted marketing

What does market segmentation aim to achieve in data mining?

Subdividing markets

What is the primary application of deviation/anomaly/change detection in data mining?

Credit card fraud detection

What is an example of a regression task in data mining?

Predicting sales amounts

What are the applications of association rule discovery in data mining?

Market-basket analysis, telecommunication alarm diagnosis, medical informatics
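
Dependency rules for market-basket analysis are commonly scored with support and confidence. Neither measure is named in this quiz, so the sketch below is an assumption about how such rules might be evaluated. The function names and the toy baskets are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate how often the consequent appears when the antecedent does."""
    base = support(transactions, antecedent)
    if base == 0:
        raise ValueError("antecedent never occurs in the transactions")
    both = support(transactions, set(antecedent) | set(consequent))
    return both / base


# Toy baskets invented for illustration only.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]

print(support(baskets, {"diapers", "beer"}))       # 0.5
print(confidence(baskets, {"diapers"}, {"beer"}))  # about 0.667
```

A rule such as {diapers} → {beer} would typically be kept only if both scores exceed thresholds chosen by the analyst.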

Which type of attribute captures only the order properties of length?

Ordinal

What type of attribute has distinctness, order, and addition properties?

Interval

Which type of attribute has all four properties: distinctness, order, addition, and multiplication?

Ratio

Which type of attribute is represented by a permutation of values?

Nominal

What type of attribute is represented by a transformation of the form new_value = a * old_value + b?

Interval
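
As an illustration of the transformation new_value = a * old_value + b, the sketch below converts Celsius to Fahrenheit (a = 9/5, b = 32). The temperature example is an assumption chosen for illustration; the quiz itself does not name one.

```python
def linear_transform(old_value, a, b):
    """Apply new_value = a * old_value + b."""
    return a * old_value + b


celsius = [10.0, 20.0, 30.0]
fahrenheit = [linear_transform(c, 9 / 5, 32) for c in celsius]

# Equal differences stay equal after the transformation ...
gap_low = fahrenheit[1] - fahrenheit[0]
gap_high = fahrenheit[2] - fahrenheit[1]

# ... but ratios are not preserved: 20/10 is 2, while 68/50 is 1.36.
ratio_before = celsius[1] / celsius[0]
ratio_after = fahrenheit[1] / fahrenheit[0]

print(fahrenheit, gap_low, gap_high, ratio_before, ratio_after)
```

This is why differences are meaningful for interval attributes while ratios are not.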

What type of attribute is typically represented as floating-point variables?

Continuous

Which type of attribute has only a finite or countably infinite set of values?

Discrete

What type of attribute has real numbers as attribute values?

Continuous

Which type of attribute is regarded as important only in the presence of a non-zero attribute value?

Asymmetric

What type of attribute provides enough information to order objects but does not have the property of multiplication?

Ordinal

What type of attribute encompasses only the order properties of length?

Ordinal

Which type of attribute is represented by a transformation of the form new_value = f(old_value) where f is a monotonic function?

Ordinal

What is the purpose of aggregation in data preprocessing?

Data reduction

What is the key principle for effective sampling in data mining?

Using a sample that is representative of the original data

What is the main drawback of high dimensionality in data mining?

Increased sparsity of data

What is the primary goal of dimensionality reduction in data mining?

Avoiding curse of dimensionality

What technique in dimensionality reduction aims to capture the largest amount of variation in data?

Principal Component Analysis (PCA)

What is the major issue when merging data from heterogeneous sources?

Duplicate data

What is the purpose of feature subset selection in data preprocessing?

Reduce dimensionality

What is the technique for combining two or more attributes into a single attribute in data preprocessing?

Aggregation

What is the primary reason for using sampling in data mining?

Processing the entire data set is too expensive or time consuming

What is the main purpose of data cleaning in data preprocessing?

Dealing with duplicate data issues

What is the characteristic of a sample that makes it representative of the original data set?

Having approximately the same properties of interest as the original data

What is the purpose of stratified sampling in data mining?

Balancing representation of different groups in the sample
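
A minimal sketch of one stratified sampling variant, assuming an equal number of objects is drawn from each group. Proportional allocation is another common choice. The function name, its parameters, and the toy records are hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(records, key, per_group, seed=0):
    """Draw up to per_group records at random from each group."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    groups = defaultdict(list)
    for record in records:
        groups[key(record)].append(record)

    sample = []
    for label in sorted(groups):
        members = groups[label]
        sample.extend(rng.sample(members, min(per_group, len(members))))
    return sample


# Imbalanced toy data: 95 records in group "a", 5 in group "b".
records = [("a", i) for i in range(95)] + [("b", i) for i in range(5)]
sample = stratified_sample(records, key=lambda r: r[0], per_group=3)
counts = {label: sum(1 for r in sample if r[0] == label) for label in ("a", "b")}
print(counts)  # {'a': 3, 'b': 3}
```

A simple random sample of six records from the same data would often contain no "b" records at all, which is the situation stratification is meant to avoid.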

Which type of data set represents data objects as points in a multi-dimensional space?

Data matrix

What does noise refer to in the context of data quality problems?

Modification of original values

What is the primary characteristic of data that involves dimensionality, sparsity, resolution, and size?

Dimensionality

What type of data involves sequences of transactions, genomic sequence data, and spatio-temporal data?

Ordered data

In association analysis, what type of attributes does it use?

Asymmetric attributes

What does transaction data involve?

Sets of items in each record

What is an example of graph data mentioned in the text?

A molecule

What does missing values in data quality problems refer to?

Information not being collected

What does document data represent?

Each document as a term vector
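
The term-vector idea can be sketched as follows, assuming whitespace tokenization and raw term counts as the vector components. Both are simplifying assumptions, and the sample documents are invented.

```python
from collections import Counter


def term_vectors(documents):
    """Return a shared vocabulary and one count vector per document."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Counter returns 0 for terms the document does not contain.
        vectors.append([counts[term] for term in vocabulary])
    return vocabulary, vectors


vocabulary, vectors = term_vectors(["team play team", "play ball"])
print(vocabulary)  # ['ball', 'play', 'team']
print(vectors)     # [[0, 1, 2], [1, 1, 0]]
```

Each position in a vector corresponds to one term in the vocabulary, so all documents share the same dimensionality.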

What type of data set consists of a collection of records, each with a fixed set of attributes?

Record data

What are the important characteristics of data mentioned in the text?

Dimensionality, sparsity, resolution, and size

What is the impact of poor data quality on data processing efforts mentioned in the text?

Negatively affects data processing efforts

Which technique involves creating new attributes to capture essential information more efficiently?

Feature Creation

What does discretization involve?

Converting continuous attributes into ordinal attributes

Which technique maps continuous or categorical attributes into one or more binary variables?

Binarization
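
A minimal sketch of binarization for a categorical attribute, assuming one binary variable per distinct value (often called one-hot encoding). The function name and the sample values are hypothetical.

```python
def binarize(values):
    """Map each categorical value to a row of 0/1 indicator variables."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows


categories, rows = binarize(["poor", "good", "ok", "good"])
print(categories)  # ['good', 'ok', 'poor']
print(rows)        # [[0, 0, 1], [1, 0, 0], [0, 1, 0], [1, 0, 0]]
```

Exactly one indicator is set in each row, which suits methods that look only at the presence of a value.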

What does attribute transformation involve?

Mapping the entire set of attribute values to a new set of replacement values

What are the various methods for discretization without using class labels?

Equal interval width, equal frequency, and K-means approaches
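
The first two unsupervised approaches can be sketched as below. The function names and sample data are hypothetical, and the handling of ties and bin edges is a simplifying assumption.

```python
def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0] * len(values)
    # The maximum value would land in bin n_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def equal_frequency_bins(values, n_bins):
    """Assign values to bins that hold roughly the same number of points."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, index in enumerate(order):
        bins[index] = rank * n_bins // len(values)
    return bins


data = [1, 2, 3, 4, 5, 6, 7, 8, 100]
print(equal_width_bins(data, 3))      # [0, 0, 0, 0, 0, 0, 0, 0, 2]
print(equal_frequency_bins(data, 3))  # [0, 0, 0, 1, 1, 1, 2, 2, 2]
```

The single outlier (100) pushes almost every point into the first equal-width bin, while equal-frequency binning keeps the bins balanced.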

What is used for discretization using class labels?

Entropy-based approach
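
One way an entropy-based approach might work is to pick the split point that minimizes the weighted entropy of the class labels on either side. The sketch below finds a single split only; repeating it on each resulting interval is a common extension. All names and data are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def best_split(values, labels):
    """Return (threshold, weighted entropy) of the purest two-way split."""
    pairs = sorted(zip(values, labels))
    best_threshold, best_score = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = (len(left) * entropy(left)
                 + len(right) * entropy(right)) / len(pairs)
        if score < best_score:
            best_threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best_score = score
    return best_threshold, best_score


values = [1.0, 1.5, 2.0, 5.0, 5.5, 6.0]
labels = ["a", "a", "a", "b", "b", "b"]
threshold, score = best_split(values, labels)
print(threshold)  # 3.5
```

Here the split at 3.5 separates the two classes perfectly, so the weighted entropy is zero.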

What does attribute transformation include?

Standardization and normalization techniques

What is the Iris Sample Data Set primarily composed of?

Three flower types and four non-class attributes

What is the primary purpose of feature subset selection?

Eliminating redundant and irrelevant features

What does mapping data to a new space involve?

Employing techniques like Fourier and wavelet transforms

What is the process of converting continuous attributes into ordinal attributes often used in?

Classification

What is binarization typically used for?

Association analysis

What is another term for an attribute in the context of data mining?

Field

What is the primary purpose of discretization in data mining?

To transform continuous attributes into categorical attributes

Which term describes a collection of attributes that describe an object in data mining?

Record

What is the distinction between attributes and attribute values?

Same attribute can be mapped to different attribute values

What is the purpose of measurement of length in data mining?

To ensure consistency in attribute measurement

What is the term for the numbers or symbols assigned to an attribute in data mining?

Attribute values

What is the primary use of attribute values in data mining?

To represent the properties or characteristics of data objects

What term is used in data mining to describe a property or characteristic of an object?

Attribute

What is the formula for Euclidean Distance?

$dist = \sqrt{\sum_{k=1}^{n} (p_k - q_k)^2}$
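
The formula translates directly into code. This is a minimal sketch in which p and q are equal-length sequences of numeric attribute values. The function name and the sample points are hypothetical.

```python
import math


def euclidean_distance(p, q):
    """Return the Euclidean distance between two equal-length points."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of attributes")
    return math.sqrt(sum((pk - qk) ** 2 for pk, qk in zip(p, q)))


# Points invented for illustration only.
print(euclidean_distance((0, 0), (3, 4)))  # 5.0
```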

What is the purpose of standardization in statistics?

To remove the mean and divide by the standard deviation
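
A minimal sketch of standardization (z-scores). It assumes the population standard deviation is used; some texts use the sample standard deviation instead. The function name and data are hypothetical.

```python
import statistics


def standardize(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    if std == 0:
        raise ValueError("cannot standardize a constant attribute")
    return [(v - mean) / std for v in values]


# Toy data with mean 5 and population standard deviation 2.
z_scores = standardize([2, 4, 4, 4, 5, 5, 7, 9])
print(z_scores)
```

After the transformation the values have mean 0 and standard deviation 1, so attributes measured on different scales become comparable.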

What is the range of similarity values?

[0,1]

What is the transformation equation for dissimilarity values of 0, 1, 10, 100 to similarity values?

s = 1 / (1 + d), which gives similarity values of 1, 0.5, 0.09, and 0.01, respectively
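
One transformation consistent with these values is s = 1 / (1 + d). The sketch below checks it against the four dissimilarities; the function name is hypothetical.

```python
def dissimilarity_to_similarity(d):
    """Map a dissimilarity in [0, inf) to a similarity in (0, 1]."""
    if d < 0:
        raise ValueError("dissimilarity must be non-negative")
    return 1 / (1 + d)


similarities = [round(dissimilarity_to_similarity(d), 2) for d in (0, 1, 10, 100)]
print(similarities)  # [1.0, 0.5, 0.09, 0.01]
```

The mapping is nonlinear: larger dissimilarities are compressed toward 0, so the original spacing between values is not preserved.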

What is the formula for Minkowski Distance?

$dist = \left(\sum_{k=1}^{n} |p_k - q_k|^r\right)^{1/r}$

What type of distance is calculated by setting r = 1 in the Minkowski Distance formula?

City block (Manhattan, taxicab, L1 norm) distance
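
A short sketch of the Minkowski distance with the parameter r exposed, showing that r = 1 gives the city block distance and r = 2 gives the Euclidean distance. The function name and the sample points are hypothetical.

```python
def minkowski_distance(p, q, r):
    """Return the Minkowski distance of order r between points p and q."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of attributes")
    if r < 1:
        raise ValueError("r must be at least 1")
    return sum(abs(pk - qk) ** r for pk, qk in zip(p, q)) ** (1 / r)


p, q = (0, 0), (3, 4)
print(minkowski_distance(p, q, 1))  # 7.0 (city block)
print(minkowski_distance(p, q, 2))  # 5.0 (Euclidean)
```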

What does proximity refer to in the context of data mining?

Similarity or dissimilarity

What is the purpose of normalizing using monthly Z Score in the context of plant growth data?

To subtract off monthly mean and divide by monthly standard deviation

What is the correlation value between Atlanta and Sao Paulo in the original time series?

-0.5739

What is the distance between points p2 and p4 in the Euclidean Distance matrix?

5.099

What is the maximum value of dissimilarity for two data objects?

∞

## Study Notes

Introduction to Data Mining

- NASA EOSDIS archives over petabytes of earth science data per year from remote sensors on a satellite.
- Data mining helps scientists in automated analysis of massive datasets and hypothesis formation.
- Opportunities to improve productivity and solve societal problems exist through data mining.
- Data mining involves the nontrivial extraction of potentially useful information from data.
- Data mining draws ideas from machine learning, AI, pattern recognition, statistics, and database systems.
- Data mining tasks include prediction methods and finding human-interpretable patterns in data.
- Classification tasks involve predicting credit worthiness and identifying intruders in cyberspace.
- Examples of classification tasks include categorizing news stories and predicting tumor cells.
- Classification applications include fraud detection in credit card transactions and churn prediction for telephone customers.
- Sky survey cataloging uses data mining to predict the class of sky objects based on telescopic survey images.

- The class model is based on features like success stories, early class stages of formation, intermediate and late data sizes, object catalog, and image database.
- Regression is used to predict continuous valued variables based on other variables, with examples like predicting sales amounts and time series prediction of stock market indices.
- Clustering involves finding groups of similar objects, with applications in custom profiling for targeted marketing, grouping related documents for browsing, and reducing the size of large data sets.
- Market segmentation and document clustering are applications of clustering, aimed at subdividing markets and finding groups of similar documents, respectively.
- Association rule discovery involves producing dependency rules to predict the occurrence of an item based on occurrences of other items, with applications in market-basket analysis, telecommunication alarm diagnosis, and medical informatics.
- An example of association analysis is the subspace differential coexpression pattern, enriched with the TNF/NF-κB signaling pathway, related to lung cancer.
- Deviation/anomaly/change detection is used to detect significant deviations from normal behavior, with applications in credit card fraud detection, network intrusion detection, and identifying abnormal behavior from sensor networks.
- The motivating challenges in data mining include scalability, high dimensionality, heterogeneous and complex data, data ownership and distribution, and non-traditional analysis.

Dimensionality Reduction Techniques in Data Mining

- Feature Subset Selection is a method to reduce data dimensionality by eliminating redundant and irrelevant features.
- Feature Creation involves creating new attributes to capture essential information more efficiently, including feature extraction, feature construction, and mapping data to a new space.
- Mapping Data to a New Space employs techniques like Fourier and wavelet transforms to represent data in a different domain.
- Discretization is the process of converting continuous attributes into ordinal attributes, often used in classification.
- The Iris Sample Data Set, available from the UCI Machine Learning Repository, consists of three flower types and four non-class attributes.
- Discretization in the Iris Example involves determining the best discretization method, which can be unsupervised or supervised.
- Binarization maps continuous or categorical attributes into one or more binary variables, typically used for association analysis.
- Attribute Transformation refers to functions that map the entire set of attribute values to a new set of replacement values, including simple functions and normalization techniques.
- Various methods for discretization without using class labels include equal interval width, equal frequency, and K-means approaches.
- An entropy-based approach is used for discretization using class labels, categorizing attributes for different groups of points.
- Attribute Transformation includes standardization and normalization techniques to adjust for differences among attributes in terms of frequency of occurrence, mean, variance, and range.

Test your knowledge of data mining with this quiz covering topics such as classification, regression, clustering, association rule discovery, and deviation detection. Explore the various applications and challenges in data mining while gaining an understanding of its significance in analyzing large datasets.
