Module 2: Data Preprocessing 1: Improving Data Quality

Questions and Answers

Which of the following is NOT a dimension of data quality that preprocessing aims to improve?

  • Accessibility, referring to how easily different users can get access to required data. (correct)
  • Completeness, ensuring that no information is missing from the dataset.
  • Accuracy, ensuring the data is correct and free from errors.
  • Believability, reflecting the degree to which users trust the integrity of the data.

A dataset contains customer addresses, but several entries have incomplete street numbers. This is an example of which data quality issue?

  • Inconsistency
  • Incompleteness (correct)
  • Inaccuracy
  • Timeliness

Imagine that a value of -10 has been entered for the salary attribute. This is an example of:

  • Complete data
  • Intentional data
  • Noisy data (correct)
  • Consistent data

In a customer database, one table lists customer names as 'Robert' while another lists the same customer as 'Bob.' This situation exemplifies:

  • Inconsistency (correct)

Which task involves filling in missing values, smoothing noisy data, identifying outliers, and resolving inconsistencies within a dataset?

  • Data Cleaning (correct)

What is the primary goal of data transformation in the context of data preprocessing?

  • Converting data into appropriate forms to improve mining results. (correct)

What is the main purpose of data reduction techniques in data preprocessing?

  • To obtain a smaller representation of the dataset while preserving its integrity. (correct)

Which major task in data preprocessing involves merging data from various sources, such as different databases or files?

  • Data Integration (correct)

In handling missing values, when is it LEAST effective to simply ignore the sample with the missing value?

  • When the percentage of ignored samples becomes excessively high. (correct)

Which of the following methods is suitable for automatically filling in missing values in a dataset?

  • Using a global constant such as 'unknown'. (correct)

What is the purpose of binning techniques in handling noisy data?

  • To smooth data by averaging or using median values within bins. (correct)

Which of the following approaches can be used to handle noisy data by fitting data into regression functions?

  • Regression. (correct)

What is 'data scrubbing' in the context of data discrepancy detection?

  • Using simple domain knowledge to detect errors and make corrections. (correct)

What is the role of ETL tools in data migration and integration?

  • Allowing users to specify transformations through a graphical user interface. (correct)

What potential problem arises specifically during data integration when merging data from multiple sources?

  • Entity identification (correct)

Why might attribute values differ for the same real-world entity when integrating data from multiple sources?

  • Different attribute representations, scales, or units (correct)

What is a key consideration in handling redundancy during data integration?

  • Object identification and derivable data. (correct)

In the context of data integration, what does 'derivable data' refer to?

  • Data that can be inferred or calculated from other attributes in the dataset (correct)

Which statistical method can be used to detect redundancies between nominal attributes in data integration?

  • Chi-square test (correct)

How does correlation analysis help in handling redundancy during data integration?

  • By measuring how strongly one attribute implies another (correct)

In a contingency table for the chi-square test, what do the rows and columns typically represent?

  • Distinct values of the two attributes being analyzed (correct)

In the context of the chi-square test for nominal data, what does it mean if the test rejects the hypothesis?

  • It suggests a significant association between attributes A and B. (correct)

What does the covariance between two numeric attributes indicate?

  • The degree of linear relationship, positive or negative, between the attributes. (correct)

What does a covariance of zero between two random variables typically imply?

  • No linear relationship between the variables. (correct)

How is the chi-square value computed when performing the chi-square test?

  • Calculated using observed and expected frequencies in a contingency table. (correct)

What is the purpose of calculating degrees of freedom in the chi-square test?

  • To use the chi-square distribution table to determine statistical significance. (correct)

What does WEKA provide for data preprocessing?

  • A variety of tools for transforming datasets, without writing program code. (correct)

In the context of covariance, what does a positive covariance value between two stocks suggest?

  • The two stocks tend to rise or fall together. (correct)

What is the purpose of Weka's Explorer interface in the context of data mining?

  • To visually explore datasets and apply preprocessing techniques. (correct)

Data integration combines data from multiple sources into a coherent store. What task is involved in this process?

  • Schema Integration (correct)

Which of the following routines works to 'clean' the data through filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies?

  • Data cleaning. (correct)

What is the purpose of Data Transformation?

  • Convert the data into appropriate forms for better mining results. (correct)

What is the purpose of Data Reduction?

  • Get a reduced representation of the data set. (correct)

What should you use for nominal data when handling redundancy in data integration?

  • We use the X² (chi-square) test. (correct)

What can you use for numeric attributes when handling redundancy in data integration?

  • Correlation coefficient and covariance. (correct)

What do data cleaning routines work to do?

  • Clean the data by filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies. (correct)

What is noisy data?

  • Random error in data (correct)

What is the purpose of clustering when dealing with noisy data?

  • To detect and remove outliers. (correct)

What does Data discrepancy detection typically involve?

  • All of the above. (correct)

What is Object identification in handling redundancy in data integration?

  • The same attribute or object may have different names in different databases. (correct)

What is the meaning of independence regarding covariance?

  • Cov(A,B) = 0 (correct)

According to the content, what tasks does WEKA provide?

  • Ability to preprocess a dataset and train a model (correct)

Flashcards

Data Accuracy

Ensuring data is correct; incorrect attribute values may stem from faulty instruments or human error.

Data Completeness

Full information is available. Incomplete data may occur because values are unavailable.

Data Consistency

Data from all sources is consistent. Different user assessments or time zones can cause inconsistency.

Data Timeliness

Whether the data are up to date; time differences and delays in data syncing reduce timeliness.

Data Believability

Reflects how much the data are trusted by users.

Data Interpretability

Reflects how easily the data are understood.

Data Cleaning

Cleaning data by filling missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies.

Data Integration

Integrating data from multiple sources (databases, data cubes, or files).

Data Reduction

Obtains a reduced representation of the data set that is much smaller in volume but produces the same mining results

Data Transformation

Convert the data into appropriate forms for better mining results.

Incomplete Data

Lacking attribute values or containing only aggregate data.

Noisy Data

Data containing noise, errors, or outliers.

Inconsistent Data

Data containing discrepancies in codes or names.

Noise in Data

Random error in data.

Binning for Noisy Data

First sort data and partition into (equal-frequency) bins, then smooth by bin means, smooth by bin median, smooth by bin boundaries, etc.

Regression for Noisy Data

Smooth by fitting the data into regression functions.

Clustering

Detect and remove outliers in data.

Combined Inspection

Detect suspicious values and check by a human.

Using Metadata

Use metadata (e.g., domain, range, dependency, distribution) to find anomalies.

Data scrubbing

Use simple domain knowledge (e.g., postal code, spell-check) to detect errors and make corrections.

Data Auditing

Analyzing data to discover rules and relationships in order to detect violators.

Data migration tools

Allow transformations to be specified.

ETL tools

Allow users to specify transformations through a graphical user interface.

Data Integration

Combines data from multiple sources into a coherent data store.

Schema Integration

Integrate metadata from different sources.

Entity identification

Identify real world entities from multiple data sources.

Object Identification

The same attribute or object may have different names in different databases.

Derivable Data

One attribute may be a derived attribute in another table

Correlation Analysis

Analysis of the correlation relationship between two attributes, A and B.

Contingency Table

A table in which each combination of distinct values of the two attributes has its own cell of observed counts, from which the X² value can be computed.

Chi-squared test

A measure of how strongly one attribute implies the other.

Covariance

Similar to correlation; measures how two numeric attributes vary together.

Positive Covariance

Both attributes tend to be larger than their expected values when Cov(A,B) > 0.

Negative Covariance

If A is larger than its expected value, B is likely to be smaller than its expected value.

Independence

If A and B are independent, Cov(A,B) = 0 (but the converse is not true).

WEKA Tool

Open-source tool that provides implementations of machine learning algorithms.


Study Notes

  • The key learning outcomes for the week are to recognize major tasks in data preprocessing and to perform data cleaning and integration.

Data Quality and Preprocessing

  • Preprocessing improves data quality, especially for accuracy, which refers to the correctness of data. Incorrect attribute values can arise from faulty instruments or human error.
  • Improves completeness, meaning that full information is available.
  • Improves consistency: data is the same across all sources.
  • Improves timeliness: reduces time differences and delays in data updates.
  • Improves believability: how much the data are trusted by users.
  • Improves interpretability: how easily the data are understood.

Major Tasks in Data Preprocessing

  • Data cleaning fills in missing values, smooths noisy data, identifies/removes outliers, and resolves inconsistencies.
  • Data integration integrates data from multiple sources like databases, data cubes, or files.
  • Data reduction obtains a smaller representation of the dataset that still produces the same results.
  • Data transformation converts data into appropriate forms to improve mining results.

Data Cleaning

  • Real-world data is often dirty, with potentially incorrect information due to instrument faults or human/computer errors.
  • Incomplete data lacks attribute values or contains only aggregate data. For example, Occupation="" represents missing data.
  • Noisy data contains noise, errors, or outliers, like Salary="-10".
  • Inconsistent data contains discrepancies in codes or names, e.g., Age="42" whereas Birthday="03/07/2010". Simple rule checks can flag such records (see the sketch after this list).
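
A minimal sketch of rule-based checks that flag the three kinds of dirty records above; the DataFrame, column names, and rules are illustrative assumptions, not from the lesson:

```python
import pandas as pd

# Illustrative records exhibiting the dirty-data patterns above.
df = pd.DataFrame({
    "occupation": ["engineer", ""],   # "" marks missing data (incomplete)
    "salary": [52000.0, -10.0],       # -10 is noise (an impossible value)
    "age": [35, 42],
    "birth_year": [1990, 2010],       # age 42 vs. a 2010 birthday (inconsistent)
})

incomplete = df["occupation"] == ""
noisy = df["salary"] < 0
inconsistent = (2025 - df["birth_year"]) != df["age"]  # assumes current year 2025

print(df[incomplete | noisy | inconsistent])  # flags the second record
```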

Incomplete Data

  • Data is not always available.
  • Missing data can occur because of:
    • Equipment malfunction
    • Deletion due to inconsistency with other recorded data
    • Data entry errors or misunderstanding
    • Data considered irrelevant at the time of entry
    • No record of the history or changes of the data

Handling Missing Values

  • Ignoring the sample is not effective when the percentage of samples with missing values is too high.
  • Filling in manually is tedious and often infeasible.
  • Filling in automatically (see the sketch after this list) can use:
    • A global constant, like "unknown"
    • The attribute mean, for all samples or for samples of the same class
    • The most probable value, determined with statistical models
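
A minimal sketch of the automatic-fill strategies using pandas; the DataFrame and column names are illustrative assumptions:

```python
import pandas as pd

# Hypothetical records with missing values.
df = pd.DataFrame({
    "occupation": ["engineer", None, "teacher", None],
    "salary": [52000.0, 48000.0, None, 61000.0],
})

# Global constant for a nominal attribute.
df["occupation"] = df["occupation"].fillna("unknown")

# Attribute mean for a numeric attribute.
df["salary"] = df["salary"].fillna(df["salary"].mean())

print(df)
```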

Noisy Data

  • Noise is random error in data.
  • Incorrect attribute values arise from faulty instruments, data entry/transmission problems, technology limitations, and naming convention inconsistencies.
  • Other data problems which require cleaning include incomplete data, duplicate records, and inconsistent data.

Handling Noisy Data

  • Binning first sorts the data and partitions it into (equal-frequency) bins, then smooths by bin means, bin medians, or bin boundaries (see the sketch after this list).
  • Regression smooths the data by fitting it into regression functions.
  • Clustering detects and removes outliers.
  • Combined computer and human inspection deals with possible outliers by detecting suspicious values and checking manually.
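
A minimal sketch of equal-frequency binning with smoothing by bin means; the data values and bin count are illustrative assumptions:

```python
import numpy as np

# First sort the illustrative data (e.g., prices).
data = np.sort(np.array([4, 8, 15, 21, 21, 24, 25, 28, 34]))

# Partition into equal-frequency bins.
bins = np.array_split(data, 3)

# Smooth by bin means: every value is replaced by its bin's mean.
smoothed = np.concatenate([np.full(len(b), b.mean()) for b in bins])
print(smoothed)  # [ 9.  9.  9. 22. 22. 22. 29. 29. 29.]
```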

Data Cleaning as a Process

  • Data discrepancy detection involves:
    • Utilizing metadata (domain, dependency, distribution)
    • Checking field overloading, uniqueness/consecutive/null rules.
    • Data scrubbing using domain knowledge
    • Using data auditing for outlier detection via rules and relationship analysis.
  • Data migration tools allow transformations to be specified; ETL tools let users specify transformations through a graphical user interface.
  • The cleaning process works best when iterative and interactive, using tools such as Potter's Wheel.

Data Integration

  • Data integration combines data from multiple sources into a coherent store.
  • Schema integration: integrating metadata from different sources, e.g., matching A.cust-id with B.cust-#.
  • Entity identification involves identifying real-world entities across sources.
  • Data value conflicts must be detected and resolved: attribute values for the same entity may differ across sources, possibly due to different scales such as metric vs. British units (see the sketch after this list).
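
A minimal sketch of schema integration and unit-conflict resolution with pandas; the tables, keys, and columns are illustrative assumptions:

```python
import pandas as pd

# Source A uses "cust_id" and metric units; source B uses "cust_no" and pounds.
a = pd.DataFrame({"cust_id": [1, 2], "height_cm": [170.0, 182.0]})
b = pd.DataFrame({"cust_no": [1, 2], "weight_lb": [154.0, 176.0]})

# Schema integration: map B's key onto A's naming convention.
b = b.rename(columns={"cust_no": "cust_id"})

# Resolve the unit conflict: convert pounds to kilograms.
b["weight_kg"] = b["weight_lb"] * 0.45359237

merged = a.merge(b[["cust_id", "weight_kg"]], on="cust_id")
print(merged)
```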

Handling Redundancy in Data Integration

  • Redundancy often occurs when data is integrated from multiple sources.
  • Object identification issues and derivable data can contribute to it.
  • Integration should reduce redundancies to improve mining speed and quality.

Redundancy Detection

  • Correlation analysis can be used for redundancy detection.
  • Use a chi-square (X²) test for nominal (categorical) attributes to assess their correlation, focusing on the differences between observed and expected counts to identify dependencies.
  • Apply correlation coefficients and covariance to numeric attributes to determine how values vary together, i.e., the strength and direction of their relationship (see the sketch after this list).
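
A minimal sketch of correlation-based redundancy detection for two numeric attributes; the data are illustrative assumptions:

```python
import numpy as np

# Two numeric attributes where one is derivable from the other,
# e.g., the same temperature stored in Celsius and Fahrenheit.
celsius = np.array([2.0, 3.0, 5.0, 4.0, 6.0])
fahrenheit = 1.8 * celsius + 32

# Pearson correlation coefficient; |r| close to 1 signals redundancy.
r = np.corrcoef(celsius, fahrenheit)[0, 1]
print(f"r = {r:.3f}")  # r = 1.000, so one attribute is redundant
```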

X² Correlation Test

  • The test utilizes a contingency table that lists data tuples described by distinct values of attributes A and B.
  • X² = Σ (observed − expected)² / expected
  • The statistic tests the hypothesis that A and B are independent; the test has (r − 1) × (c − 1) degrees of freedom, and the computed value is compared against the chi-square distribution at a chosen significance level.

Correlation Example

  • In a surveyed group (n = 1500), 250 men prefer fiction and 50 prefer non-fiction, while 200 women prefer fiction and 1000 prefer non-fiction.
  • For 1 degree of freedom, the X² value needed to reject the independence hypothesis is 10.828.
  • The computed value X² = 507.93 far exceeds this, indicating a strong correlation between gender and preferred reading type (see the sketch after this list).
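
A minimal sketch reproducing this X² computation with scipy, assuming the contingency table from the example:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: men, women; columns: fiction, non-fiction (counts from the example).
observed = np.array([[250, 50],
                     [200, 1000]])

# correction=False disables Yates' continuity correction so the result
# matches the plain X² formula used in the lesson.
chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(f"X2 = {chi2:.2f}, dof = {dof}")  # X2 = 507.94 (the lesson rounds to 507.93), dof = 1
```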

Covariance

  • The covariance and correlation coefficient measure the extent to which two numeric attributes change together.
  • Cov(A,B) = Σ (a_i − mean(A))(b_i − mean(B)) / n, where n is the number of tuples.
  • A positive covariance means that if A is larger than its expected value, B tends to be larger than its expected value as well; a negative covariance means B tends to be smaller.
  • If A and B are independent, Cov(A,B) = 0, but the converse is not true.

Covariance Example

  • Two stocks (A, B) have the following values in one week: (2, 5), (3, 8), (5, 10), (4, 11), (6, 14).
  • The mean of A is 4; the mean of B is 9.6.
  • Cov(A,B) = ((2−4)(5−9.6) + (3−4)(8−9.6) + (5−4)(10−9.6) + (4−4)(11−9.6) + (6−4)(14−9.6)) / 5 = 20/5 = 4.
  • Since Cov(A,B) = 4 > 0, the two stocks tend to rise together (see the sketch after this list).
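
A minimal sketch checking this with numpy; note that np.cov defaults to the sample covariance (dividing by n − 1), so bias=True is needed to match the lesson's population formula:

```python
import numpy as np

a = np.array([2.0, 3.0, 5.0, 4.0, 6.0])     # stock A prices
b = np.array([5.0, 8.0, 10.0, 11.0, 14.0])  # stock B prices

# Population covariance (divide by n), matching the lesson's formula.
cov_ab = np.cov(a, b, bias=True)[0, 1]
print(cov_ab)  # 4.0
```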

Weka Tool

  • WEKA provides many algorithms, datasets, and tools for transforming datasets, with the ability to preprocess, train, and analyze without writing program code.
  • The complete Data Mining with WEKA course can be viewed at: https://www.youtube.com/user/WekaMOOC
