Module 3: Data Preprocessing: Reduction and Transformation

Questions and Answers

Which of the following is the primary goal of data reduction techniques?

  • To complicate the dataset so that it becomes unreadable.
  • To remove all the data from the data warehouse.
  • To decrease the processing time and storage space needed. (correct)
  • To increase the volume of the dataset for better analysis.

Why is data reduction a crucial step in data preprocessing?

  • Data warehouses store terabytes of data, so complex data analysis takes a long time to run. (correct)
  • Terabytes of data do not require complex data analysis.
  • Data analysis is simpler to run on complete data sets as they are.
  • So the data warehouse only stores kilobytes of data.

Which of the following is a key goal of dimensionality reduction techniques?

  • To create new, more complex attributes.
  • To increase the number of attributes in the dataset.
  • To eliminate irrelevant features and reduce noise. (correct)
  • To increase the time and space required for analysis.

Which of the following is true about the 'curse of dimensionality'?

  • It states that the possible combinations of subspaces grow exponentially. (correct)

In the context of dimensionality reduction, what does 'feature subset selection' aim to achieve?

  • Selecting a subset of the original attributes. (correct)

What is the primary purpose of Wavelet Transforms in data preprocessing?

  • To transform data while preserving relative distances at various resolutions. (correct)

Which of the following is a critical condition for applying the Discrete Wavelet Transform (DWT)?

  • The length of input data must be an integer power of 2. (correct)

What is a key characteristic of Wavelet Decomposition in the context of data compression?

  • Many small detail coefficients can be replaced by 0s. (correct)

Why are hat-shape filters emphasized in Wavelet Transform?

  • To suppress weaker information at the boundaries. (correct)

What is the main goal of Principal Component Analysis (PCA)?

  • Finding a projection that captures the largest amount of variation in data. (correct)

How are 'redundant attributes' defined in the context of attribute subset selection?

  • They duplicate much or all of the information contained in one or more other attributes. (correct)

In attribute selection, what is a key difference between 'irrelevant' and 'redundant' attributes?

  • Irrelevant attributes contain no useful information; redundant attributes duplicate existing information. (correct)

When using heuristic search methods for attribute selection, why is it important to choose attributes by significance tests?

  • Because the best single attribute is chosen under the attribute independence assumption, significance tests guide each selection step. (correct)

In the context of attribute creation, what is the main purpose of 'attribute extraction'?

  • To derive new attributes that better capture important information. (correct)

What is data discretization?

  • Transforming a continuous attribute into a set of intervals. (correct)

Which of the following is characteristic of parametric methods for numerosity reduction?

  • They assume the data fits some model, estimate the model parameters, and store only the parameters. (correct)

What is the main purpose of regression in the context of data reduction?

  • Fitting the data to a model to estimate its parameters, so that only the parameters need to be stored. (correct)

What is the main idea behind using histograms for numerosity reduction?

  • Data is divided into buckets, and the average (sum) for each bucket is stored. (correct)

In data reduction, which characteristic makes clustering an effective method?

  • The data being naturally clustered. (correct)

What should be considered when using sampling for data reduction?

  • Choosing a representative subset of the data set, so that mining the sample yields approximately the same results as mining the entire dataset. (correct)

Which of the following is true of 'simple random sampling'?

  • There is an equal probability of selecting any particular item. (correct)

What is the purpose of 'stratified sampling'?

  • To draw samples proportionally from each partition of the dataset. (correct)

Which of the following is not a data transformation method?

  • Sampling (correct)

In data transformation, what does 'normalization' aim to achieve?

  • Scaling values to fall within a specified range. (correct)

What characteristic is unique to z-score normalization?

  • It is useful when the actual minimum and maximum of attribute A are unknown, or when outliers dominate the min-max normalization. (correct)

How is data discretization defined?

  • Dividing the range of a continuous attribute into intervals. (correct)

In the context of data discretization, what is the key difference between 'supervised' and 'unsupervised' methods?

  • Supervised methods use class labels; unsupervised methods do not. (correct)

When discretizing data through 'binning,' what distinguishes 'equal-width' from 'equal-depth' partitioning?

  • Equal-width partitions the range into equal-sized intervals; equal-depth ensures each interval contains approximately the same number of samples. (correct)

Which of the following is applied in concept hierarchy generation?

  • Analysis of the number of distinct values. (correct)

How is similarity defined in the context of data proximity measures?

  • A numerical measure of how alike two data objects are. (correct)

How are similarity and dissimilarity related?

  • Proximity refers to either similarity or dissimilarity. (correct)

What is the mode of a data matrix?

  • Two modes. (correct)

What is the mode of a dissimilarity matrix?

  • A single mode. (correct)

What is the significance of parameter 'p' in proximity measure for nominal attributes?

  • p is the total number of variables. (correct)

In the context of binary attributes, what does the Jaccard coefficient measure?

  • Similarity for asymmetric binary attributes. (correct)

What does a contingency table measure for binary data?

  • The counts of matching and mismatching value pairs between two binary objects. (correct)

In the formula for the Z-score, what do μ and σ represent, respectively?

  • Population mean and population standard deviation. (correct)

What does 'h' represent in the Minkowski distance between two p-dimensional data objects?

  • h is the order (norm). (correct)

What distance does d(i, j) = |x_i1 − x_j1| + |x_i2 − x_j2| + ... + |x_ip − x_jp| represent?

  • Manhattan distance. (correct)

Which scenario is the Minkowski distance most suitable for measuring?

  • Measuring the greatest difference between any attributes of two vectors (the supremum norm, as h → ∞). (correct)

What does the dot ( • ) represent in cosine similarity?

  • The dot product of the two vectors. (correct)

What is the main use of the cosine similarity measure?

  • Measuring the similarity of documents represented as term-frequency vectors. (correct)

Which data preprocessing task involves concept hierarchy climbing?

  • Discretization. (correct)

What is the primary requirement for the length of input data when applying Discrete Wavelet Transform (DWT)?

  • It must be an integer power of 2. (correct)

In the context of heuristic attribute selection, what is the key assumption behind choosing the best single attribute?

  • Attribute independence. (correct)

Given a dataset to be discretized, what is the primary distinction between binning and K-means clustering in unsupervised data discretization?

  • Clustering considers the data distribution for better results, while binning divides data into equal intervals without regard to distribution. (correct)

In the context of data transformation, how does 'attribute/feature construction' contribute to the preprocessing stage?

  • By creating entirely new attributes from the original set. (correct)

How are ordinal variables typically handled to measure dissimilarity?

  • They are first ranked and scaled, and then treated as interval-scaled variables. (correct)

What is a crucial consideration when applying clustering for data reduction purposes?

  • The data should exhibit a clustered structure. (correct)

How does the 'curse of dimensionality' primarily impact data analysis?

  • It causes data to become increasingly sparse, and distance measures become meaningless, which drastically affects the performance of clustering and outlier analysis. (correct)

What is the implication of selecting 'samples without replacement' in the context of data sampling?

  • Once an item is selected, it is removed from the population and cannot be selected again. (correct)

What does the Cosine Similarity measure primarily capture in text analysis?

  • The angle between two document vectors, irrespective of their magnitude. (correct)

How does Wavelet Transform handle outlier data points compared to mean or median smoothing?

  • Wavelet transforms are more effective at identifying and removing outliers. (correct)

What is the main purpose of applying a 'hat-shape' filter in Wavelet Transform?

  • To emphasize regions where data points cluster. (correct)

When is z-score normalization particularly useful compared to min-max normalization?

  • When the actual minimum and maximum of the attribute are unknown, or when there are outliers that dominate the min-max normalization. (correct)

In the context of data preprocessing, what does 'concept hierarchy generation' for nominal data involve?

  • Defining a sequence of attributes from general to specific, such as country to street, with the intention of clustering categorical data. (correct)

What is the primary goal of Principal Component Analysis (PCA) in the context of data reduction?

  • To project the data onto a new space that captures the largest amount of variance in the data. (correct)

When dealing with mixed attributes types in a dataset, what is a common approach to calculate the overall distance between data objects?

  • Use a weighted formula that combines the effects of the different attribute types. (correct)

Flashcards

Data Reduction

Obtaining a reduced representation of the dataset that is much smaller in volume while preserving analytical results.

Dimensionality Reduction

Reduces the number of attributes by removing unimportant ones.

Wavelet Transform

A mathematical tool that decomposes signals into frequency sub-bands; useful for image compression and for preserving relative object distances at different resolutions.

Discrete Wavelet Transform (DWT)

A wavelet transform used for linear signal processing and multiresolution analysis; stores only a small fraction of the strongest wavelet coefficients.

Principal Component Analysis (PCA)

A projection technique that captures the largest amount of variation in data and projects the original data into a smaller space, resulting in dimensionality reduction.

Attribute Subset Selection

Reducing data dimensionality by identifying and removing redundant or irrelevant attributes.

Attribute Creation

Creating new attributes that better capture important information in a dataset.

Numerosity Reduction

Reducing data volume by choosing smaller alternative data representations, like parametric or non-parametric methods.

Parametric Methods

Data reduction using models to fit the data, storing only the model parameters (plus possible outliers) instead of the data itself.

Non-Parametric Methods

Data reduction methods that don't assume a data model, like histograms, clustering, and sampling.

Linear Regression

Modeling data to fit a straight line; often uses the least-squares method.

Multiple Regression

A regression allowing a response variable to be modeled as a linear function of a multidimensional feature vector.

Log-Linear Models

Approximating discrete multidimensional probability distributions for dimensionality reduction and data smoothing.

Histogram Analysis

Dividing data into buckets and storing the average or sum for each bucket.

Clustering

Partitioning data sets into clusters based on similarity and storing only cluster representations.

Sampling

Obtaining a small, representative subset to represent the entire dataset.

Simple Random Sampling

Each item has an equal chance of being selected in sampling.

Sampling without replacement

An object is removed from the population once it is sampled, so it cannot be selected again.

Sampling with replacement

A selected object is returned to the population, so it may be drawn more than once.

Stratified Sampling

Partitioning the dataset and drawing samples from each partition proportionally.

Data Transformation

Mapping values to a new set of values, e.g., by smoothing noise, constructing new attributes, aggregating, normalizing, or discretizing.

Smoothing

Reducing noise from data.

Normalization

Scaling values to fall within a smaller specified range, like min-max or z-score normalization.

Min-Max Normalization

Linearly rescales values from the original range [minA, maxA] into a new range [new_minA, new_maxA].

Z-score Normalization

Normalizes a value using the attribute's mean and standard deviation: v' = (v − μA)/σA.

Discretization

Dividing a continuous attribute range into intervals.

Nominal Attributes

Attributes whose values come from an unordered set.

Ordinal Attributes

Attributes whose values come from an ordered set.

Numeric Attributes

Real-number data.

Equal-Width Discretization

Divides the attribute range into N intervals of equal size.

Equal-Depth Discretization

Divides the range into N intervals, each containing approximately the same number of samples.

Discretization with Classification

A supervised discretization method that uses class labels to determine split points, e.g., decision-tree analysis.

Automatic Concept Hierarchy Generation

Generates a concept hierarchy automatically, e.g., by placing attributes with more distinct values at lower levels.

Similarity

A numerical measure of how alike two data objects are.

Dissimilarity

Measurement of how different two data objects are.

Data Matrix

Data represented as an n × p table of n data points with p dimensions.

Dissimilarity Matrix

A triangular n × n table storing the pairwise distances d(i, j) between data objects.

Proximity Measure for Nominal Attributes

Simple matching over attributes that take two or more unordered states: d(i, j) = (p − m)/p, where m is the number of matches and p the total number of variables.

Proximity Measure for Binary Attributes

Computes proximity from a contingency table of the objects' binary values; symmetric and asymmetric (e.g., Jaccard) variants exist.

Dissimilarity Binary Variables

A method of measuring dissimilarity that depends on whether the binary attributes are symmetric or asymmetric.

Z Score

The number of standard deviations a raw score lies above or below the population mean: z = (x − μ)/σ.

Minkowski Distance

A popular distance measure: d(i, j) = (Σk |x_ik − x_jk|^h)^(1/h), where h is the order (norm).

City block

Manhattan (L1) distance: the sum of absolute differences across dimensions, like walking city blocks.

Euclidean Distance

The L2 distance: the square root of the sum of squared differences between two vectors.

Supremum norm

The L∞ (Chebyshev) distance: the maximum difference between any attributes of the two vectors.

Ordinal Variables

Ordinal attributes are replaced by their ranks and scaled onto [0, 1], then treated as interval-scaled.

Attributes of Mixed Types

A dataset may contain attributes of all types; a weighted formula combines the effects of the different types.

Cosine Measure

Similarity measured as the cosine of the angle between two vectors: cos(d1, d2) = (d1 · d2)/(‖d1‖‖d2‖).

Study Notes

  • The lecture discusses various methodologies and approaches to prepare and clean data for analysis.
  • The topics include data reduction, data transformation, data discretization, and measuring data similarity and dissimilarity.
  • The lecture also covers preprocessing practice using Weka.

Data Reduction Strategies

  • Data reduction aims to obtain a smaller, reduced data representation that produces similar analytical results.
  • Databases and data warehouses may store terabytes of data, making complex data analysis time-consuming.
  • Data reduction strategies include:
    • Dimensionality reduction (removing unimportant attributes)
    • Wavelet transforms
    • Principal Components Analysis (PCA)
    • Feature subset selection
    • Feature creation
    • Numerosity reduction (regression, log-linear models, histograms, clustering, sampling, etc.)
    • Data cube aggregation
    • Data compression

Dimensionality Reduction

  • The "curse of dimensionality" refers to increased data sparsity as dimensionality increases.
  • Density and distance between points, which are critical to clustering and outlier analysis, become less meaningful.
  • The possible combinations of subspaces grow exponentially.
  • Dimensionality reduction avoids the curse of dimensionality, helps eliminate irrelevant features and noise, reduces time and space requirements, and allows easier visualization.
  • Dimensionality reduction techniques include wavelet transforms, Principal Component Analysis, and supervised/nonlinear techniques (e.g., feature selection).

Wavelet Transform

  • A wavelet transform decomposes a signal into different frequency sub-bands.
  • It's applicable to n-dimensional signals.
  • Wavelet transforms preserve the relative distance between objects at different levels of resolution.
  • Wavelet transforms allow natural clusters to become more distinguishable and are used for image compression.

Wavelet Transformation Details

  • Discrete Wavelet Transform (DWT) is used for linear signal processing and multi-resolution analysis.
  • DWT provides compressed approximations by storing a small fraction of the strongest wavelet coefficients.
  • DWT is similar to Discrete Fourier Transform (DFT) but provides better lossy compression and is localized in space.
  • The length of input data must be an integer power of 2; otherwise, it is padded with 0s.
  • Each transform applies two functions: one smooths the data (e.g., a sum or weighted average), the other captures weighted differences.
  • The two functions are applied recursively to pairs of data points, halving the data length at each iteration.

Wavelet Decomposition Example

  • A math tool for decomposing functions
  • An example data set S = [2, 2, 0, 2, 3, 5, 4, 4] can be transformed; the sketch below works through this example.
  • Small detail coefficients in wavelet decomposition can be replaced by 0s for compression while retaining significant coefficients.
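A minimal sketch of the (unnormalized) Haar decomposition applied to the example series; the pairwise average/difference convention used here is an assumption, since the lesson does not spell out its normalization:

```python
def haar_decompose(data):
    """Unnormalized Haar wavelet decomposition.

    Repeatedly replaces each pair (a, b) by its average (a + b) / 2 and
    detail coefficient (a - b) / 2 until a single overall average remains.
    The input length must be an integer power of 2 (pad with 0s otherwise).
    """
    coeffs, current = [], list(data)
    while len(current) > 1:
        pairs = list(zip(current[::2], current[1::2]))
        coeffs = [(a - b) / 2 for a, b in pairs] + coeffs  # finer details rightmost
        current = [(a + b) / 2 for a, b in pairs]
    return current + coeffs  # [overall average, detail coefficients ...]

S = [2, 2, 0, 2, 3, 5, 4, 4]
print(haar_decompose(S))  # [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
```

Replacing the many small detail coefficients with 0s and keeping only the strongest ones gives exactly the lossy compression described above.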

Haar Wavelet Transform

  • Haar wavelet transform is a type of wavelet transform.
  • It can be represented by a hierarchical decomposition structure ("error tree").

Wavelet Transform Advantages

  • Uses hat-shape filters: emphasizes regions where data points cluster and suppresses weaker information at the boundaries.
  • Effective removal of outliers, with reduced sensitivity to noise and input order.
  • Multi-resolution: can detect clusters of arbitrary shape.
  • Efficient, with complexity O(N).
  • Particularly useful when the dimensionality of the data is low.

PCA

  • Principal Component Analysis finds a projection that captures the largest amount of variation in data.
  • PCA projects the original data into a much smaller space, thereby reducing dimensionality.
  • Eigenvectors of the covariance matrix define this new space.
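The idea can be sketched in a few lines of numpy; the toy data and variable names here are illustrative, not from the lesson:

```python
import numpy as np

def pca(X, k):
    """Project an n x p data matrix X onto its top-k principal components."""
    Xc = X - X.mean(axis=0)                 # center each attribute
    cov = np.cov(Xc, rowvar=False)          # p x p covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh: covariance is symmetric
    order = np.argsort(eigvals)[::-1]       # sort by decreasing variance
    W = eigvecs[:, order[:k]]               # eigenvectors define the new space
    return Xc @ W                           # n x k reduced representation

X = np.random.default_rng(0).normal(size=(100, 5))
print(pca(X, 2).shape)  # (100, 2)
```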

Attribute Subset Selection

  • Serves as another method to reduce dimensionality of data.
  • Redundant attributes duplicate information contained in other attributes, including outright duplicate data.
  • Irrelevant attributes provide no useful information for the data mining task at hand, e.g., a student's ID number when predicting GPA.

Heuristic Attribute Selection

  • There are 2^d possible attribute combinations for d attributes.
  • Typical heuristic attribute selection methods include:
    • Best single attribute selection under the attribute independence assumption, chosen by significance tests.
    • Best step-wise feature selection: repeatedly add the best remaining single attribute (a sketch follows this list).
    • Step-wise attribute elimination: repeatedly remove the worst remaining attribute.
    • Combined attribute selection and elimination.
    • Optimal branch and bound: attribute elimination with backtracking.
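A hedged sketch of the step-wise (forward) selection idea; the `score` function is an assumed placeholder for whatever evaluation the course uses (e.g., a significance test or cross-validated accuracy):

```python
def forward_select(features, score, max_features):
    """Greedy step-wise feature selection.

    score(subset) evaluates an attribute subset (score([]) is the baseline).
    At each step the single attribute that improves the score most is added,
    mirroring 'best single attribute' selection under independence.
    """
    selected, remaining = [], list(features)
    while remaining and len(selected) < max_features:
        best = max(remaining, key=lambda f: score(selected + [f]))
        if score(selected + [best]) <= score(selected):
            break                           # no attribute improves the score
        selected.append(best)
        remaining.remove(best)
    return selected
```

Step-wise elimination is the mirror image: start from all attributes and repeatedly drop the one whose removal hurts the score least.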

Attribute Creation (Feature Generation)

  • The main goal is to create new attributes that capture important information more effectively than the originals.
  • Methodologies:
    • Attribute extraction (domain-specific).
    • Mapping data to a new space (e.g., Fourier or wavelet transformation).
    • Attribute construction, e.g., combining existing features or discretizing data.

Numerosity Reduction

  • Reduces data volume by choosing alternative, smaller forms of data representation.
  • Parametric methods (e.g., regression):
    • Assume the data fits some model, estimate the model parameters, and store only the parameters, discarding the data (except possible outliers).
  • Non-parametric methods:
    • Do not assume models; major families are histograms, clustering, and sampling.

Regression and Log-Linear Models

  • Linear regression models data to fit a straight line, using the least-squares method.
  • Multiple regression models a response variable Y as a linear function of a multidimensional feature vector.
  • Log-linear models approximate discrete multidimensional probability distributions.

Regression Analysis Details

  • It is a collective name for the modeling and analysis of numerical data, including a dependent variable (response variable) and one or more independent variables (explanatory variables).
  • Parameters are estimated to give a "best fit."
  • Commonly, the best fit is evaluated using the least squares method.
  • It's used for prediction (including forecasting time-series data), inference, hypothesis testing, and causal relationship modeling.

Regression Analysis and Log-Linear Models

  • In linear regression, the data are modeled by a straight line Y = wX + b; the two regression coefficients w and b are estimated from the data, e.g., by least squares, as sketched below.
  • Multiple regression uses Y = b0 + b1X1 + b2X2; many nonlinear functions can be transformed into this form.
  • Log-linear models approximate discrete multidimensional probability distributions.
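As a concrete illustration of parametric numerosity reduction, the sketch below fits Y = wX + b by least squares on synthetic data (the data is made up for the example); only the two coefficients need to be stored:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=200)
Y = 3.0 * X + 2.0 + rng.normal(scale=1.0, size=200)  # noisy line

w, b = np.polyfit(X, Y, deg=1)  # least-squares estimates of slope and intercept
print(f"stored parameters: w={w:.2f}, b={b:.2f}")    # close to 3.0 and 2.0
```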

Histogram Analysis

  • Divides data into buckets and stores average or aggregate sums for each bucket.
  • Partitioning rules include:
    • Equal-width: buckets cover ranges of equal size.
    • Equal-frequency (equal-depth): each bucket contains roughly the same number of values (see the sketch below).
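A small sketch of equal-width bucketing with numpy, storing only each bucket's count and sum rather than the raw values; the data and bucket count are illustrative:

```python
import numpy as np

values = np.random.default_rng(2).integers(1, 101, size=1_000)

edges = np.linspace(1, 101, 11)          # 10 equal-width buckets over [1, 101)
bucket = np.digitize(values, edges) - 1  # bucket index per value
counts = np.bincount(bucket, minlength=10)
sums = np.bincount(bucket, weights=values, minlength=10)
for i in range(10):
    print(f"[{edges[i]:3.0f}, {edges[i+1]:3.0f}): n={counts[i]:3d}  sum={sums[i]:6.0f}")
```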

Clustering

  • Partition data sets into clusters based on similarity.
  • Store cluster representation (e.g., centroid and diameter).
  • Can be very effective if the data is naturally clustered, but not if it is 'smeared'.
  • Hierarchical clusterings can also be stored.
  • There are many choices of clustering definitions and algorithms.
  • Cluster analysis is studied in depth later in the course.

Sampling Data

  • Sampling obtains a small representative set to stand in for the whole data set.
  • It allows a mining algorithm to run at a cost potentially sublinear in the size of the data.
  • It works best if the chosen sample accurately represents the whole data set.
  • Simple random sampling may perform poorly in the presence of skew.
  • Adaptive sampling methods, such as stratified sampling, work better.

Types of Sampling

  • Simple random sampling: each item has an equal probability of being selected.
  • Sampling without replacement produces no duplicates: once selected, an item is removed from the population.
  • Sampling with replacement can produce duplicates: a selected item is returned to the population.
  • Stratified sampling partitions the data set and draws samples from each partition proportionally, which helps manage skewed data (see the sketch below).
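A sketch of the three schemes using the standard library; the population and stratum boundaries are invented for illustration:

```python
import random

data = list(range(1, 101))

# Simple random sampling without replacement: no duplicates possible.
srswor = random.sample(data, 10)

# Simple random sampling with replacement: duplicates possible.
srswr = [random.choice(data) for _ in range(10)]

# Stratified sampling: partition into strata, then draw from each
# proportionally, so skewed groups stay represented.
strata = {"low": [x for x in data if x <= 80], "high": [x for x in data if x > 80]}
sample = []
for stratum in strata.values():
    k = round(10 * len(stratum) / len(data))  # proportional allocation
    sample.extend(random.sample(stratum, k))
print(srswor, srswr, sample, sep="\n")
```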

Data Transformation methods

  • A function maps the entire set of values of an attribute to a new set of replacement values.
  • Smoothing: remove noise from the data.
  • Attribute/feature construction: new attributes constructed from the given ones.
  • Aggregation: summarization.
  • Normalization: scale values to fall within a smaller, specified range.
  • Discretization: concept hierarchy climbing.

Normalization

  • Min-max normalization linearly maps a value v from the original range [minA, maxA] to a new range [new_minA, new_maxA]: v' = (v − minA)/(maxA − minA) × (new_maxA − new_minA) + new_minA.
  • Z-score normalization uses the mean and standard deviation: v' = (v − μA)/σA.
    • Useful when the actual minimum and maximum are unknown, or when outliers dominate the min-max normalization.
  • Both are sketched below.
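Both normalizations in a few lines of plain Python; the income figures follow a common textbook example, and the helper functions are a sketch, not a library API:

```python
def min_max(values, new_min=0.0, new_max=1.0):
    """Min-max: map [min_A, max_A] linearly onto [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score(values):
    """Z-score: v' = (v - mean) / std (population standard deviation)."""
    mu = sum(values) / len(values)
    sigma = (sum((v - mu) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mu) / sigma for v in values]

incomes = [12_000, 73_600, 98_000, 54_000]
print(min_max(incomes))  # 73,600 maps to about 0.716 on [0, 1]
print(z_score(incomes))
```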

Standardizing Data

  • The z-score standardizes a raw value using the mean and standard deviation: z = (x − μ)/σ.
  • An alternative is to standardize using the mean absolute deviation instead of the standard deviation, which is more robust to outliers.

Discretization Types

  • The three attribute types are nominal, ordinal, and numeric.
  • Discretization divides the range of a continuous attribute into intervals:
    • Interval labels can then be used to replace actual data values.
    • This reduces data size.
    • It can proceed by splitting (top-down) or merging (bottom-up).
    • It prepares the data for further analysis, e.g., classification.

Discretization Methods

  • Data discretization methods include splitting methods such as binning, decision-tree analysis, and correlation (e.g., χ²) analysis.

Simple Discretization

  • Equal-width (distance) partitioning divides the range into N uniform intervals.
    • If A and B are the lowest and highest values of the attribute, the interval width is W = (B − A)/N.
  • Equal-depth (frequency) partitioning divides the range into N intervals, each containing approximately the same number of samples.
    • Managing categorical attributes can be tricky.
  • Both schemes are sketched below.
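A sketch of both partitionings on a familiar sorted price list (the values follow a common textbook example; the helper names are ours):

```python
def equal_width_cuts(values, n):
    """Cut points for n intervals of width W = (B - A) / n."""
    a, b = min(values), max(values)
    w = (b - a) / n
    return [a + i * w for i in range(1, n)]

def equal_depth_bins(sorted_values, n):
    """n bins holding roughly the same number of samples (remainder dropped)."""
    size = len(sorted_values) // n
    return [sorted_values[i * size:(i + 1) * size] for i in range(n)]

prices = sorted([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
print(equal_width_cuts(prices, 3))  # cut points at 14.0 and 24.0
print(equal_depth_bins(prices, 3))  # three bins of four values each
```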

Discretization Details

  • Binning first sorts the data and partitions it into (typically equal-frequency) bins.
  • Values can then be smoothed by bin means or by bin boundaries, as sketched below.
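Continuing the example, a sketch of both smoothing rules over equal-depth bins (the bin contents follow from the previous sketch):

```python
def smooth_by_means(bins):
    """Replace every value in a bin by that bin's mean."""
    return [[sum(b) / len(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by whichever bin boundary (min or max) is closer."""
    out = []
    for b in bins:
        lo, hi = b[0], b[-1]  # bins are sorted
        out.append([lo if v - lo <= hi - v else hi for v in b])
    return out

bins = [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(smooth_by_means(bins))       # [[9.0]*4, [22.75]*4, [29.25]*4]
print(smooth_by_boundaries(bins))  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```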

Discretization Without Classes

  • Unsupervised options include binning and clustering.
  • Clustering considers the data distribution and generally leads to better results.

Discretization With Classes

  • Classification-based discretization (e.g., decision-tree analysis) uses class labels and splits intervals top-down, recursively.
  • Correlation-based discretization (e.g., ChiMerge) works bottom-up, merging adjacent intervals with low χ² values.
  • Top-down splitting and bottom-up merging are both in use.

Concept Hierarchy Generation

  • Hierarchies can be specified by a total or partial ordering of attributes at the schema level by users or experts (e.g., street < city < state < country).
  • They can also be specified by explicit grouping of data.
  • They can be generated automatically by analyzing the number of distinct values per attribute.

Automatic Hierarchy Generation

  • The attribute with the most distinct values is placed at the lowest level of the hierarchy.
  • Exceptions occur, e.g., weekday, month, quarter, year.

Data Similarity

  • Similarity is a numerical measure of how alike two data objects are.
  • Dissimilarity (e.g., distance) is a numerical measure of how different two data objects are.
  • Proximity refers to either similarity or dissimilarity.

Data and Dissimilarity Matrix

  • A data matrix stores n data points with p dimensions in an n × p table; it has two modes (rows are objects, columns are attributes).
  • A dissimilarity matrix stores the pairwise distances d(i, j) in a triangular n × n table; it has a single mode.

Proximity Measure for Nominal Attributes

  • A nominal attribute can take two or more states, which may be coded as numbers or names.
  • Simple matching: d(i, j) = (p − m)/p, where m is the number of matching attributes and p the total number of variables (see the sketch below).
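A minimal sketch of simple matching; the attribute values are illustrative:

```python
def nominal_dissimilarity(obj_i, obj_j):
    """Simple matching: d(i, j) = (p - m) / p, with m matches out of p variables."""
    p = len(obj_i)
    m = sum(a == b for a, b in zip(obj_i, obj_j))
    return (p - m) / p

print(nominal_dissimilarity(["red", "A", "excellent"],
                            ["red", "B", "fair"]))  # (3 - 1) / 3, about 0.67
```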

Binary Attributes and Dissimilarity

  • Dissimilarity between binary objects is computed from a contingency table of their values.
  • Symmetric binary attributes weight 1-1 and 0-0 matches equally; asymmetric measures ignore 0-0 matches.
  • In the worked example, the values Y and P are coded as 1 and N as 0 (see the sketch below).
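A sketch of the contingency-table computation; the two binary vectors are shaped like the standard textbook patient example, using the Y/P-as-1, N-as-0 coding:

```python
def binary_dissimilarity(x, y, symmetric=True):
    """Dissimilarity from the 2x2 contingency table of two binary vectors.

    q: both 1   r: 1 in x only   s: 1 in y only   t: both 0
    symmetric:  d = (r + s) / (q + r + s + t)
    asymmetric: d = (r + s) / (q + r + s), ignoring 0-0 matches;
    the Jaccard similarity coefficient is q / (q + r + s).
    """
    q = sum(a == 1 and b == 1 for a, b in zip(x, y))
    r = sum(a == 1 and b == 0 for a, b in zip(x, y))
    s = sum(a == 0 and b == 1 for a, b in zip(x, y))
    t = sum(a == 0 and b == 0 for a, b in zip(x, y))
    return (r + s) / (q + r + s + t) if symmetric else (r + s) / (q + r + s)

jack = [1, 0, 1, 0, 0, 0]   # Y N P N N N
mary = [1, 0, 1, 0, 1, 0]   # Y N P N P N
print(binary_dissimilarity(jack, mary, symmetric=False))  # 1/3, about 0.33
```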

Data Standardization

  • The z-score expresses how many standard deviations a value lies from the mean.

Distance in Numeric Data

  • Distance between numeric objects is defined over all p dimensions, most commonly via the Minkowski distance.

Cases for Distance

  • h = 1: Manhattan (city-block, L1) distance; for binary vectors this equals the Hamming distance, the number of differing bits.
  • h = 2: Euclidean (L2) distance.
  • h → ∞: supremum (L∞) distance, the greatest difference between any attributes of the vectors (see the sketch below).
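All three cases fall out of one function; the example points are illustrative:

```python
def minkowski(x, y, h):
    """Minkowski distance of order h between two p-dimensional points.

    h = 1 gives Manhattan (L1), h = 2 Euclidean (L2),
    and h = float('inf') the supremum (L_max) distance.
    """
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if h == float("inf"):
        return max(diffs)
    return sum(d ** h for d in diffs) ** (1 / h)

x1, x2 = (1, 2), (3, 5)
print(minkowski(x1, x2, 1))             # 5 (Manhattan)
print(minkowski(x1, x2, 2))             # 3.605... (Euclidean)
print(minkowski(x1, x2, float("inf")))  # 3 (supremum)
```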

Ordinal Variables

  • Values may be discrete or continuous, but their order is what matters (e.g., rank).
  • They can be treated like interval-scaled variables: replace each value by its rank r ∈ {1, ..., M}, then map the rank onto [0, 1] via z = (r − 1)/(M − 1), as sketched below.
  • Dissimilarity is then computed with interval-scaled methods.
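A one-function sketch of the rank-and-scale mapping; the three quality levels are an assumed example:

```python
def ordinal_to_interval(value, ordered_states):
    """Map an ordinal value to [0, 1] via z = (r - 1) / (M - 1),
    where r is the value's rank and M the number of states."""
    r = ordered_states.index(value) + 1
    M = len(ordered_states)
    return (r - 1) / (M - 1)

states = ["fair", "good", "excellent"]
print([ordinal_to_interval(v, states) for v in states])  # [0.0, 0.5, 1.0]
```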

Mixed Types

  • A dataset may contain attributes of all types: nominal, symmetric binary, asymmetric binary, numeric, and ordinal.
  • A weighted formula combines the effects of the different attribute types.
  • Numeric attributes are normalized; ordinal attributes are replaced by their ranks and scaled.

Cosine Similarity

  • A document can be represented by thousands of attributes, each recording the frequency of a particular term (a term-frequency vector).
  • These feature vectors are typically long and sparse.
  • Cosine similarity measures the angle between two such vectors, irrespective of their magnitude (see the sketch below).
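A plain-Python sketch; the two term-frequency vectors follow a common textbook example, so the 0.94 result is what that example yields:

```python
import math

def cosine_similarity(d1, d2):
    """cos(d1, d2) = (d1 . d2) / (||d1|| * ||d2||)."""
    dot = sum(a * b for a, b in zip(d1, d2))
    return dot / (math.sqrt(sum(a * a for a in d1)) *
                  math.sqrt(sum(b * b for b in d2)))

# Term-frequency vectors: counts of each vocabulary term per document.
d1 = [5, 0, 3, 0, 2, 0, 0, 2, 0, 0]
d2 = [3, 0, 2, 0, 1, 1, 0, 1, 0, 1]
print(round(cosine_similarity(d1, d2), 2))  # 0.94
```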
