Importance of Data Preprocessing

Questions and Answers

Which of the following best describes why data preprocessing is a crucial step in data analysis?

  • It mitigates issues arising from incomplete, noisy, and inconsistent data, leading to more reliable insights. (correct)
  • It guarantees compliance with regulatory standards, regardless of data quality.
  • It ensures the data perfectly fits the analytical model chosen.
  • It automatically corrects any errors in the data, regardless of the source or nature of the error.

A dataset contains customer ages, some of which are recorded as negative values. What type of data quality issue does this represent, and which preprocessing step is most appropriate to address it?

  • Noise; replace negative ages with plausible values or remove the entries. (correct)
  • Inconsistency; standardize the age format.
  • Incompleteness; use imputation with the mean age.
  • Noise; apply data smoothing using binning.

In the context of multi-dimensional data quality, which aspect focuses on whether data is applicable and beneficial for the task at hand, offering additional utility beyond its basic attributes?

  • Value Added (correct)
  • Completeness
  • Interpretability
  • Accuracy

Which data preprocessing task is characterized by consolidating data from various sources, followed by addressing redundancies and inconsistencies, to provide a unified view?

  • Data Integration (correct)

A machine learning model performs poorly because the dataset contains a feature with values ranging from -1,000,000 to +1,000,000. Which data preprocessing technique is most appropriate to address this issue?

  • Data Transformation (correct)

Which data preprocessing technique aims to simplify data representation while preserving data integrity?

  • Data Reduction (correct)

A dataset containing income values is being prepared for analysis. Applying a logarithmic transformation to income values is an example of what?

  • Data Transformation (correct)

Which of the following best describes the purpose of dispersion analysis?

  • To better understand data by examining central tendency, variation, and spread (correct)

What is a key distinction between using the 'mean' and the 'median' as measures of central tendency, particularly in datasets with outliers?

  • The median is not affected by extreme values, whereas the mean is. (correct)

Suppose a dataset has the values 2, 3, 5, 6, 99. The mean is 23, the median is 5, and the mode does not exist. Which measure of central tendency best represents the data, and why?

  • The median, because it is not affected by outliers. (correct)

What inherent challenge arises from grouping individual data into classes, when calculating measures of central tendency?

  • It introduces the potential for significant distortion, because only the central value of each class and the frequency of values inside each class are taken into account. (correct)

When is using standard deviation most effective?

  • With interval and ratio data. (correct)

In statistical analyses, why is it important to understand the different scales of measurement (Nominal, Ordinal, Interval, and Ratio) when determining which measure of central tendency to use?

  • Because different scales of measurement dictate which statistical operations are meaningful, influencing the appropriateness of the mean, median, and mode. (correct)

Explain the key difference between interval and ratio scales of measurement.

  • Ratio scales allow for the calculation of meaningful ratios, while interval scales do not, due to an arbitrary zero point. (correct)

You are working with geographic data representing land use types, and wish to quickly locate the common land use. Which measure of central tendency would be most appropriate?

  • Mode (correct)

Which of the following is an example of ordinal data?

  • Customer satisfaction ratings (e.g., Very Satisfied, Satisfied, Neutral, Dissatisfied, Very Dissatisfied) (correct)

Which scenario exemplifies a dataset with 'incomplete' data?

  • A survey where respondents were not required to answer all questions, leading to missing entries for certain attributes. (correct)

Which scenario is an example of 'inconsistent data'?

  • A database containing customer addresses in varying formats. (correct)

A scientist is measuring the mass of a chemical compound, but the scale isn't properly calibrated and adds or subtracts 0.1 grams to each measurement. What kind of error is this, and how can it be addressed?

  • Systematic error; recalibrate the scale. (correct)

Which of the following is NOT a common cause of noisy data?

  • Consistent and standardized naming conventions (correct)

Which approach to handling missing data involves estimating the missing values based on other features present in the dataset?

  • Imputation (correct)

In what circumstances is ignoring missing data tuples most appropriate?

  • When the class label is missing in a classification task. (correct)

What is a significant drawback of deleting observations with missing data?

  • It may introduce bias if the data is not missing at random. (correct)

Which of the following falls into the category of imputing missing data?

  • Cold-deck imputation (correct)

How does hot-deck imputation address the issue of missing data?

  • By replacing missing values with the values of similar cases within the same dataset. (correct)

What is the primary goal when using distribution-based imputation techniques?

  • To capture the "observed" empirical distribution of the data. (correct)

In statistical imputation, what is done with the 'missing' value and the 'features' of the dataset?

  • Consider the "missing" value as the output and the rest of the features as input. (correct)

What is the main idea behind predictive imputation?

  • Let a classifier model the underpinnings of the missingness mechanism. (correct)

Which scenario is a cause of incorrect values?

  • Faulty data collection instruments (correct)

Which of the following is an option for handling noisy data?

  • Combined computer and human inspection (correct)

What is the fundamental principle behind binning as a method for data smoothing?

  • Sorting data and partitioning it into bins, then smoothing the values within each bin. (correct)

What distinguishes equal-width binning from equal-depth binning?

  • Equal-width binning divides the range of values into equal intervals, while equal-depth binning aims to have the same number of samples in each bin. (correct)

Consider a dataset with widely varying values. What problem does Data Discretization address?

  • Outliers that dominate the presentation (correct)

What does filling in missing values with the mean involve?

  • Using a commonly occurring value or the average value. (correct)

When handling missing values, what is the difference between 'removing the attribute' and 'creating a new attribute'?

  • Creating a new attribute adds a flag column marking missingness, whereas removing the attribute discards information. (correct)

Flashcards

What is incomplete data?

Attribute values are missing, lacking certain attributes of interest, or only aggregate data is contained.

What is noisy data?

Data contains errors or outliers.

What is inconsistent data?

Data containing discrepancies in codes or names.

What is data cleaning?

To fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies.

What is data integration?

Integrating multiple databases, data cubes, or files.

What is data transformation?

Normalization and aggregation of data.

What is data reduction?

Obtaining reduced representation in volume while producing the same or similar analytical results.

What are measures of central tendency?

A measure of the location of the middle or the center of a distribution.

What is 'Mean'?

The most commonly used measure of central tendency; average of all observations.

What is 'Median'?

The value of a variable such that half of the observations are above, and half are below.

What is 'Mode'?

The most frequently occurring value in a distribution.

What is continuous data?

Data that can include any value (i.e., real numbers).

What is discrete data?

Data only consisting of discrete values; numbers between are not defined.

What is grouped data?

Raw individual data is categorized into several classes, and then analyzed

What is individual data?

The raw individual data is analyzed without being grouped.

What is nominal scale data?

Data that can simply be broken down into categories.

What is dichotomous data?

Data that has just two types.

What is ordinal scale?

Ordinal scale data can be categorized and can be placed in an order.

What is the interval scale?

Takes the notion of ranking items in order one step further: the distances between adjacent points on the scale are equal.

What is the ratio scale?

Similar to the interval scale, but with the addition of having a meaningful zero value.

What is Value Imputation?

A data cleaning task to fill in missing values.

What is cold-deck imputation?

Fill in missing values using the mean or another measure of central tendency.

What is hot-deck imputation?

Identify the most similar case to the one with the missing value and impute its value.

What is distribution-based imputation?

Assign a value based on the probability distribution of the non-missing values.

What is statistical imputation?

Build a regressor that predicts the missing value, treating it as the target and the remaining features as inputs.

What is predictive imputation?

Let a classifier model the underpinnings of the missingness mechanism.

What is noisy data?

Random error or variance in a measured variable.

What is binning?

First sort data and partition into (equal-frequency) bins.

What is smoothing?

After the data is sorted and partitioned into bins, values are smoothed using bin means, bin medians, or bin boundaries.

How to 'handle' missing values?

Ignore the tuple containing the missing data (one of several options).

Study Notes

Data Preprocessing

  • Data preprocessing covers the techniques used to prepare data for data analytics

Why Data Preprocessing?

  • Real-world data is often dirty, meaning it is incomplete, noisy, or inconsistent
  • Incomplete data lacks attribute values or contains aggregated data
  • Noisy data contains errors or outliers
  • Inconsistent data contains discrepancies in codes or names
  • Data can be "dirty" due to issues during collection, human/computer error, or transmission errors
  • Data preprocessing is important because quality data leads to quality data mining results

Why is Data Preprocessing Important?

  • Quality decisions must be based on quality data
  • Duplicate or missing data can cause incorrect statistics
  • Data warehouses need consistent integration of quality data
  • Data extraction, cleaning, and transformation are the majority of the work in building a data warehouse

Multi-Dimensional measure of Data Quality

  • Data quality can be measured by accuracy, completeness, consistency, timeliness, believability, value added, interpretability, and accessibility.
  • Broad categories include intrinsic, contextual, representational, and accessibility

Major Tasks in Data Preprocessing

  • Data cleaning involves filling missing values, smoothing noisy data, identifying/removing outliers, and resolving inconsistencies
  • Data integration involves integrating multiple databases, data cubes, or files
  • Data transformation includes normalization and aggregation (see the sketch after this list)
  • Data reduction obtains a reduced representation in volume while producing similar analytical results
  • Data discretization is a part of data reduction with particular importance for numerical data
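As a quick illustration of the transformation task (a minimal sketch added here, not part of the original notes; the sample values are hypothetical), the following shows min-max normalization and a logarithmic transformation, the two operations the quiz questions above refer to:

```python
import numpy as np

# Hypothetical feature spanning a huge range, as in the quiz question above.
feature = np.array([-1_000_000.0, -250_000.0, 0.0, 500_000.0, 1_000_000.0])

# Min-max normalization: rescale values into [0, 1].
normalized = (feature - feature.min()) / (feature.max() - feature.min())
print(normalized)  # [0.    0.375 0.5   0.75  1.   ]

# Log transformation: compress a right-skewed variable such as income.
income = np.array([20_000.0, 35_000.0, 50_000.0, 120_000.0, 1_000_000.0])
print(np.log(income).round(2))  # roughly [9.9, 10.46, 10.82, 11.7, 13.82]
```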

Mining Data Descriptive Characteristics

  • Understanding data better involves understanding central tendency, variation, and spread
  • Data dispersion measures include median, max, min, quantiles, outliers, and variance (see the sketch after this list)
  • Numerical dimensions correspond to sorted intervals
  • Data dispersion is analyzed with multiple granularities of precision using boxplots or quantile analysis
  • Dispersion analysis on computed measures involves folding measures into numerical dimensions, using boxplots or quantile analysis
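To make these dispersion measures concrete, here is a minimal sketch (added; not from the original notes, with hypothetical sample values) computing quartiles, the interquartile range, variance, and a boxplot-style outlier flag with NumPy:

```python
import numpy as np

values = np.array([2, 3, 5, 6, 8, 9, 11, 14, 99])  # hypothetical data with one outlier

q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

print(values.min(), median, values.max())  # 2 8.0 99
print(q1, q3, iqr)                         # 5.0 11.0 6.0
print(values.var(ddof=1))                  # sample variance

# Boxplot rule of thumb: values beyond 1.5 * IQR from the quartiles are outliers.
outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]
print(outliers)  # [99]
```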

Measures of Central Tendency

  • Central tendency measures the location of the middle or center of a distribution
  • Common measures include mean, median and mode

Measures of Central Tendency - Mean

  • The mean, also known as the average, is calculated by summing all the scores and dividing by the number of scores
  • Each observation is equally significant
  • The Sample Mean is the sum of all x values divided by n
  • The Population Mean is the sum of all x values divided by N
  • The advantage of the mean is that it is sensitive to any change within the observations
  • The disadvantage of the mean is that it is very sensitive to outliers
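Written out explicitly (a standard rendering added here for clarity, assuming the usual notation), the sample and population means above are:

$$
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i
\qquad\qquad
\mu = \frac{1}{N}\sum_{i=1}^{N} x_i
$$

where $\bar{x}$ is the sample mean over the $n$ sampled observations and $\mu$ is the population mean over all $N$ members of the population.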

Weighted Mean

  • A weighted mean can be calculated using a weighting factor
  • Population can be the weighting factor, as in this mean-income example (worked below):
    • Population A: mean income $23,000, population 100,000
    • Population B: mean income $20,000, population 50,000
    • Population C: mean income $25,000, population 150,000
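Using population as the weighting factor, the worked calculation for the three groups above (added here; the income/population column labels are inferred from context) is:

$$
\bar{x}_w = \frac{\sum_i w_i x_i}{\sum_i w_i}
= \frac{100{,}000 \cdot 23{,}000 + 50{,}000 \cdot 20{,}000 + 150{,}000 \cdot 25{,}000}{100{,}000 + 50{,}000 + 150{,}000}
= \frac{7{,}050{,}000{,}000}{300{,}000}
= 23{,}500
$$

so the population-weighted mean income is $23,500, compared with an unweighted mean of about $22,667.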

Measures of Central Tendency - Median

  • The median is the value that divides the distribution into two equal-sized groups
  • If the number of observations is odd, the median is the middle value
  • If the number of observations is even, the median is the average of the two middle values
  • The median is not affected by extreme values (outliers) at the ends of the distribution

Measures of Central Tendency - Mode

  • The mode is the most frequently occurring value in the distribution
  • The mode is the only measure of central tendency that can be used with nominal data
  • The mode can be located quickly
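The quiz dataset above (2, 3, 5, 6, 99) makes a handy check of all three measures; a minimal sketch using only Python's standard library:

```python
import statistics
from collections import Counter

data = [2, 3, 5, 6, 99]  # dataset from the quiz question above

print(sum(data) / len(data))    # 23.0 -- the mean, pulled upward by the outlier 99
print(statistics.median(data))  # 5    -- the median, unaffected by the outlier

# Every value occurs exactly once, so this dataset has no mode.
print(Counter(data).most_common())  # [(2, 1), (3, 1), (5, 1), (6, 1), (99, 1)]
```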

Characteristics of Data

  • Not all data is the same
  • There are limitations as to what you can and can’t do with a data set, depending on the characteristics of the data

Continuous vs. Discrete Data

  • Continuous data can take on any value (i.e., real numbers)
    • E.g., 1, 1.43, and 3.1415926
    • Distance, tree height, amount of precipitation, etc.
  • Discrete data consists only of discrete (whole-number) values; numbers in between are not defined
    • E.g., 1, 2, 3
    • E.g., number of vegetation types

Grouped vs. Individual Data

  • The distinction concerns the effects of grouping individual data
  • E.g., a family income value vs. income classes or groupings
  • E.g., an elevation of 1000 m vs. elevation classes or groupings
  • In grouped data, the raw individual data is categorized into several classes, and then analyzed
  • Grouping may introduce significant distortion
  • Grouping reduces the amount of information

Scales of Measurement

  • Data is the plural of datum; data are generated by recording measurements
  • When a measure is qualitative, measurement involves categorization, i.e., assigning an item to one of a set of types
  • Quantitative measurements use numbers

Type of Data

  • Data used in statistical analyses can be divided into four types: nominal, ordinal, interval, and ratio
  • Nominal scale data are broken down into categories of names or types.
  • Dichotomous nominal data has just two types
    • Yes/No, Female/Male etc
  • Multichotomous data has more than two types
    • Vegetation types, soil types, counties, eye color, etc.
  • The nominal scale is not a true scale, because its categories cannot be ranked or ordered in any way

Ordinal Scale

  • Data can be categorized and placed in order.
  • Assigned a relative importance value
  • Star system restaurant rankings - 5 stars > 4 stars; 4 stars > 3 stars etc

Interval Scale

  • Takes the notion of ranking items in order one step further: the distances between adjacent points on the scale are equal
  • E.g., the Fahrenheit scale: degrees are equal and can be subtracted (e.g., 90° − 10° = 80°), but there is no absolute zero point, so you cannot create ratios or multiply values

Ratio Scale

  • This is similar to the interval scale, but with the addition of having a meaningful zero value, which allows us to compare values using multiplication and division operations
  • E.g., precipitation, weights, heights, etc.
  • EG - Rain - 2 inches is twice as much as 1 inch
  • EG - Age - 100 years is twice as old as 50 years

Which is Better: Mean, Median, or Mode?

  • The mean is selected by default; its key advantage is that it is sensitive to any change in the observations
  • Its key disadvantage is that it is very sensitive to outliers
  • The mean requires interval or ratio data
  • The median can be determined for ordinal data
  • The mode can be used with nominal, ordinal, interval, and ratio data; it is the only measure usable with nominal data

Data Cleaning

  • Data cleaning is a big problem in data warehousing
  • Importance: data cleaning is regarded as the number one problem in data warehousing
  • Data cleaning tasks:
    • Fill in missing values
    • Identify outliers and smooth out noisy data
    • Correct inconsistent data
    • Resolve redundancy caused by data integration

Handling Missing Data

  • Ignore the tuple: usually done when the class label is missing; not effective when the percentage of missing values per attribute varies considerably
  • Fill in manually: too time-consuming and often infeasible
  • Fill in automatically, e.g., with a global constant
  • Imputation:
    • Delete observations
    • Hot-deck imputation
    • Cold-deck imputation
    • Predictive imputation

Imputing the Missing Data

  • Delete missing observations: can be acceptable if the amount of missing data is small
  • Cold-deck imputation: fill in with a measure of central tendency (see the pandas sketch after this list)
  • Hot-deck imputation: may identify more than one similar case, in which case the identified values can be averaged
  • Distribution-based imputation: tries to capture the observed empirical distribution of the data
  • Statistical imputation: considers the missing value as the target and the remaining features as inputs
  • Predictive imputation: lets a classifier model the missingness mechanism
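To ground these options, here is a minimal pandas sketch (added; not from the original notes, with hypothetical column names) combining the 'create new attribute' flag idea from the quiz with cold-deck-style mean imputation as defined above:

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with some missing ages.
df = pd.DataFrame({
    "age": [25, 32, np.nan, 41, np.nan, 29],
    "income": [30_000, 45_000, 52_000, 61_000, 38_000, 43_000],
})

# "Create new attribute": keep a flag column recording which rows were missing.
df["age_missing"] = df["age"].isna()

# Cold-deck imputation as defined in these notes: fill missing values with a
# measure of central tendency (here, the mean of the observed ages, 31.75).
df["age"] = df["age"].fillna(df["age"].mean())

print(df)
```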

Noisy Data

  • Noise: random error or variance in a measured variable
  • Other data problems that require cleaning:
    • Duplicate records
    • Incomplete data

Handling Noisy Data

  • Binning: first sort the data and partition it into bins, then smooth by bin means, bin boundaries, etc.
  • Regression: smooth by fitting the data to regression functions
  • Clustering: detect and remove potential outliers

Simple Discretization Methods: Binning

  • Equal-width partitioning divides the range into intervals of equal size (a uniform grid); it is straightforward, but outliers may dominate the presentation and skewed data is not handled well
  • Equal-depth (equal-frequency) partitioning places approximately the same number of samples in each bin; it scales well, but managing categorical attributes can be tricky (see the sketch after this list)
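A minimal sketch (added; not in the original notes) contrasting the two schemes with pandas and then smoothing by bin means; the sample values follow a classic textbook binning example:

```python
import pandas as pd

# Sorted sample prices, a commonly used binning example.
values = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Equal-width binning: split the value range [4, 34] into 3 equal intervals.
equal_width = pd.cut(values, bins=3)

# Equal-depth (equal-frequency) binning: about the same number of samples per bin.
equal_depth = pd.qcut(values, q=3)

# Smoothing by bin means: replace each value with the mean of its bin.
smoothed = values.groupby(equal_depth, observed=True).transform("mean")

print(pd.DataFrame({"value": values, "bin": equal_depth, "smoothed": smoothed}))
# Each equal-depth bin holds 4 values; smoothed values are 9.0, 22.75, and 29.25.
```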
