Module 3: Data Preprocessing: Reduction and Transformation

Questions and Answers

Which of the following is the primary goal of data reduction techniques?

  • To complicate the dataset so that it becomes unreadable.
  • To remove all the data from the data warehouse.
  • To decrease the processing time and storage space needed. (correct)
  • To increase the volume of the dataset for better analysis.

Why is data reduction a crucial step in data preprocessing?

  • Data warehouses store terabytes of data, so complex data analysis takes a long time to run. (correct)
  • Terabytes of data do not require complex data analysis.
  • Data analysis is simpler to run on complete data sets as they are.
  • So the data warehouse only stores kilobytes of data.

Which of the following is a key goal of dimensionality reduction techniques?

  • To create new, more complex attributes.
  • To increase the number of attributes in the dataset.
  • To eliminate irrelevant features and reduce noise. (correct)
  • To increase the time and space required for analysis.

Which of the following is true about the 'curse of dimensionality'?

  • It states that the possible combinations of subspaces grow exponentially. (correct)

In the context of dimensionality reduction, what does 'feature subset selection' aim to achieve?

  • Selecting a subset of the original attributes. (correct)

What is the primary purpose of Wavelet Transforms in data preprocessing?

  • To transform data while preserving relative distances at various resolutions. (correct)

Which of the following is a critical condition for applying the Discrete Wavelet Transform (DWT)?

  • The length of input data must be an integer power of 2. (correct)

What is a key characteristic of Wavelet Decomposition in the context of data compression?

  • Many small detail coefficients can be replaced by 0s. (correct)

Why are hat-shape filters emphasized in Wavelet Transform?

  • To suppress weaker information at the boundaries. (correct)

What is the main goal of Principal Component Analysis (PCA)?

  • Finding a projection that captures the largest amount of variation in data. (correct)

How are 'redundant attributes' defined in the context of attribute subset selection?

  • They duplicate much or all of the information contained in one or more other attributes. (correct)

In attribute selection, what is a key difference between 'irrelevant' and 'redundant' attributes?

  • Irrelevant attributes contain no useful information; redundant attributes duplicate existing information. (correct)

When using heuristic search methods for attribute selection, why is it important to choose attributes by significance tests?

  • Because the best single attribute is chosen under the attribute independence assumption, significance tests guide each selection step. (correct)

In the context of attribute creation, what is the main purpose of 'attribute extraction'?

  • To derive new attributes that better capture important information. (correct)

What is data discretization?

  • Transforming a continuous attribute into a set of intervals. (correct)

Which of the following is characteristic of parametric methods for numerosity reduction?

  • They assume the data fits some model, estimate the model parameters, and store only the parameters. (correct)

What is the main purpose of regression in the context of data reduction?

  • Fitting the data to a model to estimate its parameters, so that only the parameters need to be stored. (correct)

What is the main idea behind using histograms for numerosity reduction?

  • Data is divided into buckets, and the average (sum) for each bucket is stored. (correct)

In data reduction, which characteristic makes clustering an effective method?

  • The data being naturally clustered. (correct)

What should be considered when using sampling for data reduction?

  • Choosing a representative subset of the data set, so that mining the sample yields approximately the same results as mining the entire dataset. (correct)

Which of the following is true of 'simple random sampling'?

  • There is an equal probability of selecting any particular item. (correct)

What is the purpose of 'stratified sampling'?

  • To draw samples proportionally from each partition of the dataset. (correct)

Which of the following is not a data transformation method?

  • Sampling (correct)

In data transformation, what does 'normalization' aim to achieve?

  • Scaling values to fall within a specified range. (correct)

What characteristic is unique to z-score normalization?

  • It is useful when the actual minimum and maximum of attribute A are unknown, or when outliers dominate the min-max normalization. (correct)

How is data discretization defined?

  • Dividing the range of a continuous attribute into intervals. (correct)

In the context of data discretization, what is the key difference between 'supervised' and 'unsupervised' methods?

  • Supervised methods use class labels; unsupervised methods do not. (correct)

When discretizing data through 'binning,' what distinguishes 'equal-width' from 'equal-depth' partitioning?

  • Equal-width partitions the range into equal-sized intervals; equal-depth ensures each interval contains approximately the same number of samples. (correct)

Which of the following is applied in concept hierarchy generation?

  • Analysis of the number of distinct values. (correct)

How is similarity defined in the context of data proximity measures?

  • A numerical measure of how alike two data objects are. (correct)

How are similarity and dissimilarity related?

  • Proximity refers to either similarity or dissimilarity. (correct)

What is the mode of a data matrix?

  • Two modes. (correct)

What is the mode of a dissimilarity matrix?

  • A single mode. (correct)

What is the significance of parameter 'p' in proximity measure for nominal attributes?

  • p is the total number of variables. (correct)

In the context of binary attributes, what does the Jaccard coefficient measure?

  • Similarity for asymmetric binary attributes. (correct)

What does a contingency table measure for binary data?

  • The counts of matching and mismatching value pairs between two binary objects. (correct)

In the formula for the Z-score, what do μ and σ represent, respectively?

  • Population mean and population standard deviation. (correct)

What does 'h' represent in the Minkowski distance between two p-dimensional data objects?

  • h is the order (norm). (correct)

What distance does d(i, j) = |x_i1 − x_j1| + |x_i2 − x_j2| + ... + |x_ip − x_jp| represent?

  • Manhattan distance. (correct)

Which scenario is the Minkowski distance most suitable for measuring?

  • Measuring the greatest difference between any attributes of two vectors (the supremum norm, as h → ∞). (correct)

What does the dot ( • ) represent in cosine similarity?

  • The dot product of the two vectors. (correct)

What is the main use of the cosine similarity measure?

  • Measuring the similarity of documents represented as term-frequency vectors. (correct)

Which data preprocessing task involves concept hierarchy climbing?

  • Discretization. (correct)

What is the primary requirement for the length of input data when applying Discrete Wavelet Transform (DWT)?

  • It must be an integer power of 2. (correct)

In the context of heuristic attribute selection, what is the key assumption behind choosing the best single attribute?

  • Attribute independence. (correct)

Given a dataset to be discretized, what is the primary distinction between binning and K-means clustering in unsupervised data discretization?

  • Clustering considers the data distribution for better results, while binning divides data into equal intervals without regard to distribution. (correct)

In the context of data transformation, how does 'attribute/feature construction' contribute to the preprocessing stage?

  • By creating entirely new attributes from the original set. (correct)

How are ordinal variables typically handled to measure dissimilarity?

  • They are first ranked and scaled, and then treated as interval-scaled variables. (correct)

What is a crucial consideration when applying clustering for data reduction purposes?

  • The data should exhibit a clustered structure. (correct)

How does the 'curse of dimensionality' primarily impact data analysis?

  • It causes data to become increasingly sparse, and distance measures become meaningless, which drastically affects the performance of clustering and outlier analysis. (correct)

What is the implication of selecting 'samples without replacement' in the context of data sampling?

  • Once an item is selected, it is removed from the population and cannot be selected again. (correct)

What does the Cosine Similarity measure primarily capture in text analysis?

  • The angle between two document vectors, irrespective of their magnitude. (correct)

How does Wavelet Transform handle outlier data points compared to mean or median smoothing?

  • Wavelet transforms are more effective at identifying and removing outliers. (correct)

What is the main purpose of applying a 'hat-shape' filter in Wavelet Transform?

  • To emphasize regions where data points cluster. (correct)

When is z-score normalization particularly useful compared to min-max normalization?

  • When the actual minimum and maximum of the attribute are unknown, or when there are outliers that dominate the min-max normalization. (correct)

In the context of data preprocessing, what does 'concept hierarchy generation' for nominal data involve?

  • Defining a sequence of attributes from general to specific, such as country to street, with the intention of clustering categorical data. (correct)

What is the primary goal of Principal Component Analysis (PCA) in the context of data reduction?

  • To project the data onto a new space that captures the largest amount of variance in the data. (correct)

When dealing with mixed attributes types in a dataset, what is a common approach to calculate the overall distance between data objects?

  • Use a weighted formula that combines the effects of the different attribute types. (correct)

Flashcards

Data Reduction

Obtaining a reduced representation of the dataset that is much smaller in volume while preserving analytical results.

Dimensionality Reduction

Reduces the number of attributes by removing unimportant ones.

Wavelet Transform

A mathematical tool that decomposes signals into frequency sub-bands; useful for image compression and for preserving relative object distances at different resolutions.

Discrete Wavelet Transform (DWT)

A wavelet transform used for linear signal processing and multiresolution analysis; stores only a small fraction of the strongest wavelet coefficients.

Principal Component Analysis (PCA)

A projection technique that captures the largest amount of variation in data and projects the original data into a smaller space, resulting in dimensionality reduction.

Attribute Subset Selection

Reducing data dimensionality by identifying and removing redundant or irrelevant attributes.

Attribute Creation

Creating new attributes that better capture important information in a dataset.

Numerosity Reduction

Reducing data volume by choosing smaller alternative data representations, like parametric or non-parametric methods.

Parametric Methods

Data reduction using models to fit the data, storing only the model parameters (plus possible outliers) instead of the data itself.

Non-Parametric Methods

Data reduction methods that don't assume a data model, like histograms, clustering, and sampling.

Linear Regression

Modeling data to fit a straight line; often uses the least-squares method.

Multiple Regression

A regression allowing a response variable to be modeled as a linear function of a multidimensional feature vector.

Log-Linear Models

Approximating discrete multidimensional probability distributions for dimensionality reduction and data smoothing.

Histogram Analysis

Dividing data into buckets and storing the average or sum for each bucket.

Clustering

Partitioning data sets into clusters based on similarity and storing only cluster representations.

Sampling

Obtaining a small, representative subset to represent the entire dataset.

Simple Random Sampling

Each item has an equal chance of being selected in sampling.

Sampling without replacement

An object is removed from the population once it is sampled, so it cannot be selected again.

Sampling with replacement

A selected object is returned to the population, so it may be drawn more than once.

Stratified Sampling

Partitioning the dataset and drawing samples from each partition proportionally.

Data Transformation

Mapping values to a new set of values, e.g., by smoothing noise, constructing new attributes, aggregating, normalizing, or discretizing.

Smoothing

Reducing noise from data.

Normalization

Scaling values to fall within a smaller specified range, like min-max or z-score normalization.

Min-Max Normalization

Linearly rescales values from the original range [minA, maxA] into a new range [new_minA, new_maxA].

Z-score Normalization

Normalizes a value using the attribute's mean and standard deviation: v' = (v − μA)/σA.

Discretization

Dividing a continuous attribute range into intervals.

Nominal Attributes

Attributes whose values come from an unordered set.

Ordinal Attributes

Attributes whose values come from an ordered set.

Numeric Attributes

Real-number data.

Equal-Width Discretization

Divides the attribute range into N intervals of equal size.

Equal-Depth Discretization

Divides the range into N intervals, each containing approximately the same number of samples.

Discretization with Classification

A supervised discretization method that uses class labels to determine split points, e.g., decision-tree analysis.

Automatic Concept Hierarchy Generation

Generates a concept hierarchy automatically, e.g., by placing attributes with more distinct values at lower levels.

Similarity

A numerical measure of how alike two data objects are.

Dissimilarity

Measurement of how different two data objects are.

Data Matrix

Data represented as an n × p table of n data points with p dimensions.

Dissimilarity Matrix

A triangular n × n table storing the pairwise distances d(i, j) between data objects.

Proximity Measure for Nominal Attributes

Simple matching over attributes that take two or more unordered states: d(i, j) = (p − m)/p, where m is the number of matches and p the total number of variables.

Proximity Measure for Binary Attributes

Computes proximity from a contingency table of the objects' binary values; symmetric and asymmetric (e.g., Jaccard) variants exist.

Dissimilarity Binary Variables

A method of measuring dissimilarity that depends on whether the binary attributes are symmetric or asymmetric.

Z Score

The number of standard deviations a raw score lies above or below the population mean: z = (x − μ)/σ.

Minkowski Distance

A popular distance measure: d(i, j) = (Σk |x_ik − x_jk|^h)^(1/h), where h is the order (norm).

City block

Manhattan (L1) distance: the sum of absolute differences across dimensions, like walking city blocks.

Euclidean Distance

The L2 distance: the square root of the sum of squared differences between two vectors.

Supremum norm

The L∞ (Chebyshev) distance: the maximum difference between any attributes of the two vectors.

Ordinal Variables

Ordinal attributes are replaced by their ranks and scaled onto [0, 1], then treated as interval-scaled.

Attributes of Mixed Types

A dataset may contain attributes of all types; a weighted formula combines the effects of the different types.

Cosine Measure

Similarity measured as the cosine of the angle between two vectors: cos(d1, d2) = (d1 · d2)/(‖d1‖‖d2‖).

Study Notes

  • The lecture discusses various methodologies and approaches to prepare and clean data for analysis.
  • The topics include data reduction, data transformation, data discretization, and measuring data similarity and dissimilarity.
  • The lecture also covers preprocessing practice using Weka.

Data Reduction Strategies

  • Data reduction aims to obtain a smaller, reduced data representation that produces similar analytical results.
  • Databases and data warehouses may store terabytes of data, making complex data analysis time-consuming.
  • Data reduction strategies include:
    • Dimensionality reduction (removing unimportant attributes)
    • Wavelet transforms
    • Principal Components Analysis (PCA)
    • Feature subset selection
    • Feature creation
    • Numerosity reduction (regression, log-linear models, histograms, clustering, sampling, etc.)
    • Data cube aggregation
    • Data compression

Dimensionality Reduction

  • The "curse of dimensionality" refers to increased data sparsity as dimensionality increases.
  • Density and distance between points, which are critical to clustering and outlier analysis, become less meaningful.
  • The possible combinations of subspaces grow exponentially.
  • Dimensionality reduction avoids the curse of dimensionality, helps eliminate irrelevant features and noise, reduces time and space requirements, and allows easier visualization.
  • Dimensionality reduction techniques include wavelet transforms, Principal Component Analysis, and supervised/nonlinear techniques (e.g., feature selection).

Wavelet Transform

  • A wavelet transform decomposes a signal into different frequency sub-bands.
  • It's applicable to n-dimensional signals.
  • Wavelet transforms preserve the relative distance between objects at different levels of resolution.
  • Wavelet transforms allow natural clusters to become more distinguishable and are used for image compression.

Wavelet Transformation Details

  • Discrete Wavelet Transform (DWT) is used for linear signal processing and multi-resolution analysis.
  • DWT provides compressed approximations by storing a small fraction of the strongest wavelet coefficients.
  • DWT is similar to Discrete Fourier Transform (DFT) but provides better lossy compression and is localized in space.
  • The length of input data must be an integer power of 2; otherwise, it is padded with 0s.
  • Each transform applies two functions: one smooths the data (e.g., a sum or weighted average), the other captures weighted differences.
  • The two functions are applied recursively to pairs of data points, halving the data length at each iteration.

Wavelet Decomposition Example

  • A math tool for decomposing functions
  • An example data set S = [2, 2, 0, 2, 3, 5, 4, 4] can be transformed; the sketch below works through this example.
  • Small detail coefficients in wavelet decomposition can be replaced by 0s for compression while retaining significant coefficients.
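A minimal sketch of the (unnormalized) Haar decomposition applied to the example series; the pairwise average/difference convention used here is an assumption, since the lesson does not spell out its normalization:

```python
def haar_decompose(data):
    """Unnormalized Haar wavelet decomposition.

    Repeatedly replaces each pair (a, b) by its average (a + b) / 2 and
    detail coefficient (a - b) / 2 until a single overall average remains.
    The input length must be an integer power of 2 (pad with 0s otherwise).
    """
    coeffs, current = [], list(data)
    while len(current) > 1:
        pairs = list(zip(current[::2], current[1::2]))
        coeffs = [(a - b) / 2 for a, b in pairs] + coeffs  # finer details rightmost
        current = [(a + b) / 2 for a, b in pairs]
    return current + coeffs  # [overall average, detail coefficients ...]

S = [2, 2, 0, 2, 3, 5, 4, 4]
print(haar_decompose(S))  # [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
```

Replacing the many small detail coefficients with 0s and keeping only the strongest ones gives exactly the lossy compression described above.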

Haar Wavelet Transform

  • Haar wavelet transform is a type of wavelet transform.
  • It can be represented by a hierarchical decomposition structure ("error tree").

Wavelet Transform Advantages

  • Uses hat-shape filters: emphasizes regions where data points cluster and suppresses weaker information at the boundaries.
  • Effective removal of outliers, with reduced sensitivity to noise and input order.
  • Multi-resolution: can detect clusters of arbitrary shape.
  • Efficient, with complexity O(N).
  • Particularly useful when the dimensionality of the data is low.

PCA

  • Principal Component Analysis finds a projection that captures the largest amount of variation in data.
  • PCA projects the original data into a much smaller space, thereby reducing dimensionality.
  • Eigenvectors of the covariance matrix define this new space.
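The idea can be sketched in a few lines of numpy; the toy data and variable names here are illustrative, not from the lesson:

```python
import numpy as np

def pca(X, k):
    """Project an n x p data matrix X onto its top-k principal components."""
    Xc = X - X.mean(axis=0)                 # center each attribute
    cov = np.cov(Xc, rowvar=False)          # p x p covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh: covariance is symmetric
    order = np.argsort(eigvals)[::-1]       # sort by decreasing variance
    W = eigvecs[:, order[:k]]               # eigenvectors define the new space
    return Xc @ W                           # n x k reduced representation

X = np.random.default_rng(0).normal(size=(100, 5))
print(pca(X, 2).shape)  # (100, 2)
```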

Attribute Subset Selection

  • Serves as another method to reduce dimensionality of data.
  • Redundant attributes duplicate information contained in other attributes, including outright duplicate data.
  • Irrelevant attributes provide no useful information for the data mining task at hand, e.g., a student's ID number when predicting GPA.

Heuristic Attribute Selection

  • There are 2^d possible attribute combinations for d attributes.
  • Typical heuristic attribute selection methods include:
    • Best single attribute selection under the attribute independence assumption, chosen by significance tests.
    • Best step-wise feature selection: repeatedly add the best remaining single attribute (a sketch follows this list).
    • Step-wise attribute elimination: repeatedly remove the worst remaining attribute.
    • Combined attribute selection and elimination.
    • Optimal branch and bound: attribute elimination with backtracking.
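A hedged sketch of the step-wise (forward) selection idea; the `score` function is an assumed placeholder for whatever evaluation the course uses (e.g., a significance test or cross-validated accuracy):

```python
def forward_select(features, score, max_features):
    """Greedy step-wise feature selection.

    score(subset) evaluates an attribute subset (score([]) is the baseline).
    At each step the single attribute that improves the score most is added,
    mirroring 'best single attribute' selection under independence.
    """
    selected, remaining = [], list(features)
    while remaining and len(selected) < max_features:
        best = max(remaining, key=lambda f: score(selected + [f]))
        if score(selected + [best]) <= score(selected):
            break                           # no attribute improves the score
        selected.append(best)
        remaining.remove(best)
    return selected
```

Step-wise elimination is the mirror image: start from all attributes and repeatedly drop the one whose removal hurts the score least.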

Attribute Creation (Feature Generation)

  • The main goal is to create new attributes that capture important information more effectively than the originals.
  • Methodologies:
    • Attribute extraction (domain-specific).
    • Mapping data to a new space (e.g., Fourier or wavelet transformation).
    • Attribute construction, e.g., combining existing features or discretizing data.

Numerosity Reduction

  • Reduces data volume by choosing alternative, smaller forms of data representation.
  • Parametric methods (e.g., regression):
    • Assume the data fits some model, estimate the model parameters, and store only the parameters, discarding the data (except possible outliers).
  • Non-parametric methods:
    • Do not assume models; major families are histograms, clustering, and sampling.

Regression and Log-Linear Models

  • Linear regression models data to fit a straight line, using the least-squares method.
  • Multiple regression models a response variable Y as a linear function of a multidimensional feature vector.
  • Log-linear models approximate discrete multidimensional probability distributions.

Regression Analysis Details

  • It is a collective name for the modeling and analysis of numerical data, including a dependent variable (response variable) and one or more independent variables (explanatory variables).
  • Parameters are estimated to give a "best fit."
  • Commonly, the best fit is evaluated using the least squares method.
  • It's used for prediction (including forecasting time-series data), inference, hypothesis testing, and causal relationship modeling.

Regression Analysis and Log-Linear Models

  • In linear regression, the data are modeled by a straight line Y = wX + b; the two regression coefficients w and b are estimated from the data, e.g., by least squares, as sketched below.
  • Multiple regression uses Y = b0 + b1X1 + b2X2; many nonlinear functions can be transformed into this form.
  • Log-linear models approximate discrete multidimensional probability distributions.
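As a concrete illustration of parametric numerosity reduction, the sketch below fits Y = wX + b by least squares on synthetic data (the data is made up for the example); only the two coefficients need to be stored:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=200)
Y = 3.0 * X + 2.0 + rng.normal(scale=1.0, size=200)  # noisy line

w, b = np.polyfit(X, Y, deg=1)  # least-squares estimates of slope and intercept
print(f"stored parameters: w={w:.2f}, b={b:.2f}")    # close to 3.0 and 2.0
```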

Histogram Analysis

  • Divides data into buckets and stores average or aggregate sums for each bucket.
  • Partitioning rules include:
    • Equal-width: buckets cover ranges of equal size.
    • Equal-frequency (equal-depth): each bucket contains roughly the same number of values (see the sketch below).
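A small sketch of equal-width bucketing with numpy, storing only each bucket's count and sum rather than the raw values; the data and bucket count are illustrative:

```python
import numpy as np

values = np.random.default_rng(2).integers(1, 101, size=1_000)

edges = np.linspace(1, 101, 11)          # 10 equal-width buckets over [1, 101)
bucket = np.digitize(values, edges) - 1  # bucket index per value
counts = np.bincount(bucket, minlength=10)
sums = np.bincount(bucket, weights=values, minlength=10)
for i in range(10):
    print(f"[{edges[i]:3.0f}, {edges[i+1]:3.0f}): n={counts[i]:3d}  sum={sums[i]:6.0f}")
```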

Clustering

  • Partition data sets into clusters based on similarity.
  • Store cluster representation (e.g., centroid and diameter).
  • Can be very effective if the data is naturally clustered, but not if it is 'smeared'.
  • Hierarchical clusterings can also be stored.
  • There are many choices of clustering definitions and algorithms.
  • Cluster analysis is studied in depth later in the course.

Sampling Data

  • Sampling obtains a small representative set to stand in for the whole data set.
  • It allows a mining algorithm to run at a cost potentially sublinear in the size of the data.
  • It works best if the chosen sample accurately represents the whole data set.
  • Simple random sampling may perform poorly in the presence of skew.
  • Adaptive sampling methods, such as stratified sampling, work better.

Types of Sampling

  • Simple random sampling: each item has an equal probability of being selected.
  • Sampling without replacement produces no duplicates: once selected, an item is removed from the population.
  • Sampling with replacement can produce duplicates: a selected item is returned to the population.
  • Stratified sampling partitions the data set and draws samples from each partition proportionally, which helps manage skewed data (see the sketch below).
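A sketch of the three schemes using the standard library; the population and stratum boundaries are invented for illustration:

```python
import random

data = list(range(1, 101))

# Simple random sampling without replacement: no duplicates possible.
srswor = random.sample(data, 10)

# Simple random sampling with replacement: duplicates possible.
srswr = [random.choice(data) for _ in range(10)]

# Stratified sampling: partition into strata, then draw from each
# proportionally, so skewed groups stay represented.
strata = {"low": [x for x in data if x <= 80], "high": [x for x in data if x > 80]}
sample = []
for stratum in strata.values():
    k = round(10 * len(stratum) / len(data))  # proportional allocation
    sample.extend(random.sample(stratum, k))
print(srswor, srswr, sample, sep="\n")
```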

Data Transformation methods

  • A function maps the entire set of values of an attribute to a new set of replacement values.
  • Smoothing: remove noise from the data.
  • Attribute/feature construction: new attributes constructed from the given ones.
  • Aggregation: summarization.
  • Normalization: scale values to fall within a smaller, specified range.
  • Discretization: concept hierarchy climbing.

Normalization

  • Min-max normalization linearly maps a value v from the original range [minA, maxA] to a new range [new_minA, new_maxA]: v' = (v − minA)/(maxA − minA) × (new_maxA − new_minA) + new_minA.
  • Z-score normalization uses the mean and standard deviation: v' = (v − μA)/σA.
    • Useful when the actual minimum and maximum are unknown, or when outliers dominate the min-max normalization.
  • Both are sketched below.
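Both normalizations in a few lines of plain Python; the income figures follow a common textbook example, and the helper functions are a sketch, not a library API:

```python
def min_max(values, new_min=0.0, new_max=1.0):
    """Min-max: map [min_A, max_A] linearly onto [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score(values):
    """Z-score: v' = (v - mean) / std (population standard deviation)."""
    mu = sum(values) / len(values)
    sigma = (sum((v - mu) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mu) / sigma for v in values]

incomes = [12_000, 73_600, 98_000, 54_000]
print(min_max(incomes))  # 73,600 maps to about 0.716 on [0, 1]
print(z_score(incomes))
```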

Standardizing Data

  • The z-score standardizes a raw value using the mean and standard deviation: z = (x − μ)/σ.
  • An alternative is to standardize using the mean absolute deviation instead of the standard deviation, which is more robust to outliers.

Discretization Types

  • The three attribute types are nominal, ordinal, and numeric.
  • Discretization divides the range of a continuous attribute into intervals:
    • Interval labels can then be used to replace actual data values.
    • This reduces data size.
    • It can proceed by splitting (top-down) or merging (bottom-up).
    • It prepares the data for further analysis, e.g., classification.

Discretization Methods

  • Data discretization methods include splitting methods such as binning, decision-tree analysis, and correlation (e.g., χ²) analysis.

Simple Discretization

  • Equal-width (distance) partitioning divides the range into N uniform intervals.
    • If A and B are the lowest and highest values of the attribute, the interval width is W = (B − A)/N.
  • Equal-depth (frequency) partitioning divides the range into N intervals, each containing approximately the same number of samples.
    • Managing categorical attributes can be tricky.
  • Both schemes are sketched below.
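A sketch of both partitionings on a familiar sorted price list (the values follow a common textbook example; the helper names are ours):

```python
def equal_width_cuts(values, n):
    """Cut points for n intervals of width W = (B - A) / n."""
    a, b = min(values), max(values)
    w = (b - a) / n
    return [a + i * w for i in range(1, n)]

def equal_depth_bins(sorted_values, n):
    """n bins holding roughly the same number of samples (remainder dropped)."""
    size = len(sorted_values) // n
    return [sorted_values[i * size:(i + 1) * size] for i in range(n)]

prices = sorted([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
print(equal_width_cuts(prices, 3))  # cut points at 14.0 and 24.0
print(equal_depth_bins(prices, 3))  # three bins of four values each
```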

Discretization Details

  • Binning first sorts the data and partitions it into (typically equal-frequency) bins.
  • Values can then be smoothed by bin means or by bin boundaries, as sketched below.
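Continuing the example, a sketch of both smoothing rules over equal-depth bins (the bin contents follow from the previous sketch):

```python
def smooth_by_means(bins):
    """Replace every value in a bin by that bin's mean."""
    return [[sum(b) / len(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by whichever bin boundary (min or max) is closer."""
    out = []
    for b in bins:
        lo, hi = b[0], b[-1]  # bins are sorted
        out.append([lo if v - lo <= hi - v else hi for v in b])
    return out

bins = [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(smooth_by_means(bins))       # [[9.0]*4, [22.75]*4, [29.25]*4]
print(smooth_by_boundaries(bins))  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```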

Discretization Without Classes

  • Unsupervised options include binning and clustering.
  • Clustering considers the data distribution and generally leads to better results.

Discretization With Classes

  • Classification-based discretization (e.g., decision-tree analysis) uses class labels and splits intervals top-down, recursively.
  • Correlation-based discretization (e.g., ChiMerge) works bottom-up, merging adjacent intervals with low χ² values.
  • Top-down splitting and bottom-up merging are both in use.

Concept Hierarchy Generation

  • Hierarchies can be specified by a total or partial ordering of attributes at the schema level by users or experts (e.g., street < city < state < country).
  • They can also be specified by explicit grouping of data.
  • They can be generated automatically by analyzing the number of distinct values per attribute.

Automatic Hierarchy Generation

  • The attribute with the most distinct values is placed at the lowest level of the hierarchy.
  • Exceptions occur, e.g., weekday, month, quarter, year.

Data Similarity

  • Similarity is a numerical measure of how alike two data objects are.
  • Dissimilarity (e.g., distance) is a numerical measure of how different two data objects are.
  • Proximity refers to either similarity or dissimilarity.

Data and Dissimilarity Matrix

  • A data matrix stores n data points with p dimensions in an n × p table; it has two modes (rows are objects, columns are attributes).
  • A dissimilarity matrix stores the pairwise distances d(i, j) in a triangular n × n table; it has a single mode.

Proximity Measure for Nominal Attributes

  • A nominal attribute can take two or more states, which may be coded as numbers or names.
  • Simple matching: d(i, j) = (p − m)/p, where m is the number of matching attributes and p the total number of variables (see the sketch below).
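A minimal sketch of simple matching; the attribute values are illustrative:

```python
def nominal_dissimilarity(obj_i, obj_j):
    """Simple matching: d(i, j) = (p - m) / p, with m matches out of p variables."""
    p = len(obj_i)
    m = sum(a == b for a, b in zip(obj_i, obj_j))
    return (p - m) / p

print(nominal_dissimilarity(["red", "A", "excellent"],
                            ["red", "B", "fair"]))  # (3 - 1) / 3, about 0.67
```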

Binary Attributes and Dissimilarity

  • Dissimilarity between binary objects is computed from a contingency table of their values.
  • Symmetric binary attributes weight 1-1 and 0-0 matches equally; asymmetric measures ignore 0-0 matches.
  • In the worked example, the values Y and P are coded as 1 and N as 0 (see the sketch below).
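A sketch of the contingency-table computation; the two binary vectors are shaped like the standard textbook patient example, using the Y/P-as-1, N-as-0 coding:

```python
def binary_dissimilarity(x, y, symmetric=True):
    """Dissimilarity from the 2x2 contingency table of two binary vectors.

    q: both 1   r: 1 in x only   s: 1 in y only   t: both 0
    symmetric:  d = (r + s) / (q + r + s + t)
    asymmetric: d = (r + s) / (q + r + s), ignoring 0-0 matches;
    the Jaccard similarity coefficient is q / (q + r + s).
    """
    q = sum(a == 1 and b == 1 for a, b in zip(x, y))
    r = sum(a == 1 and b == 0 for a, b in zip(x, y))
    s = sum(a == 0 and b == 1 for a, b in zip(x, y))
    t = sum(a == 0 and b == 0 for a, b in zip(x, y))
    return (r + s) / (q + r + s + t) if symmetric else (r + s) / (q + r + s)

jack = [1, 0, 1, 0, 0, 0]   # Y N P N N N
mary = [1, 0, 1, 0, 1, 0]   # Y N P N P N
print(binary_dissimilarity(jack, mary, symmetric=False))  # 1/3, about 0.33
```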

Data Standardization

  • The z-score expresses how many standard deviations a value lies from the mean.

Distance in Numeric Data

  • Distance between numeric objects is defined over all p dimensions, most commonly via the Minkowski distance.

Cases for Distance

  • h = 1: Manhattan (city-block, L1) distance; for binary vectors this equals the Hamming distance, the number of differing bits.
  • h = 2: Euclidean (L2) distance.
  • h → ∞: supremum (L∞) distance, the greatest difference between any attributes of the vectors (see the sketch below).
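All three cases fall out of one function; the example points are illustrative:

```python
def minkowski(x, y, h):
    """Minkowski distance of order h between two p-dimensional points.

    h = 1 gives Manhattan (L1), h = 2 Euclidean (L2),
    and h = float('inf') the supremum (L_max) distance.
    """
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if h == float("inf"):
        return max(diffs)
    return sum(d ** h for d in diffs) ** (1 / h)

x1, x2 = (1, 2), (3, 5)
print(minkowski(x1, x2, 1))             # 5 (Manhattan)
print(minkowski(x1, x2, 2))             # 3.605... (Euclidean)
print(minkowski(x1, x2, float("inf")))  # 3 (supremum)
```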

Ordinal Variables

  • Values may be discrete or continuous, but their order is what matters (e.g., rank).
  • They can be treated like interval-scaled variables: replace each value by its rank r ∈ {1, ..., M}, then map the rank onto [0, 1] via z = (r − 1)/(M − 1), as sketched below.
  • Dissimilarity is then computed with interval-scaled methods.
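A one-function sketch of the rank-and-scale mapping; the three quality levels are an assumed example:

```python
def ordinal_to_interval(value, ordered_states):
    """Map an ordinal value to [0, 1] via z = (r - 1) / (M - 1),
    where r is the value's rank and M the number of states."""
    r = ordered_states.index(value) + 1
    M = len(ordered_states)
    return (r - 1) / (M - 1)

states = ["fair", "good", "excellent"]
print([ordinal_to_interval(v, states) for v in states])  # [0.0, 0.5, 1.0]
```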

Mixed Types

  • A dataset may contain attributes of all types: nominal, symmetric binary, asymmetric binary, numeric, and ordinal.
  • A weighted formula combines the effects of the different attribute types.
  • Numeric attributes are normalized; ordinal attributes are replaced by their ranks and scaled.

Cosine Similarity

  • A document can be represented by thousands of attributes, each recording the frequency of a particular term (a term-frequency vector).
  • These feature vectors are typically long and sparse.
  • Cosine similarity measures the angle between two such vectors, irrespective of their magnitude (see the sketch below).
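A plain-Python sketch; the two term-frequency vectors follow a common textbook example, so the 0.94 result is what that example yields:

```python
import math

def cosine_similarity(d1, d2):
    """cos(d1, d2) = (d1 . d2) / (||d1|| * ||d2||)."""
    dot = sum(a * b for a, b in zip(d1, d2))
    return dot / (math.sqrt(sum(a * a for a in d1)) *
                  math.sqrt(sum(b * b for b in d2)))

# Term-frequency vectors: counts of each vocabulary term per document.
d1 = [5, 0, 3, 0, 2, 0, 0, 2, 0, 0]
d2 = [3, 0, 2, 0, 1, 1, 0, 1, 0, 1]
print(round(cosine_similarity(d1, d2), 2))  # 0.94
```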
