Data Preprocessing and Data Sets

Questions and Answers

How does inconsistent data manifest in real-world datasets, and what makes it challenging for data mining?

Inconsistent data appears as discrepancies in codes or names within a dataset, making it difficult to ensure the reliability and accuracy of data mining outcomes.

Explain how data preprocessing enhances the reliability of data mining results.

Data preprocessing ensures quality data through cleaning, integration, and transformation, which leads to higher-quality mining results by handling missing values and inconsistencies.

Describe scenarios where aggregate data, while present, can still be considered a form of incomplete data.

If only aggregate values are available and the underlying granular attribute values are missing, the data is still incomplete: detailed analyses that require individual data points cannot be performed.

How do data warehouses benefit from consistent integration of high-quality data?

Consistent integration ensures that data in the warehouse is reliable and uniform, which facilitates more accurate analysis and decision-making.

Compare relational records and transaction data, referring to their structure and use cases.

Relational records are structured into tables with rows representing data objects and columns representing attributes, suiting structured datasets, whereas transaction data captures interactions over time, like purchase history, and is useful in market basket analysis.

In the context of a sales database, distinguish between data objects and attributes, providing examples.

Data objects are entities like customers or products, and attributes describe these objects, such as customer names or product prices; rows are data objects, columns are attributes.

What is the practical difference between nominal and numeric attributes in data analysis, and when would you use each?

Nominal attributes are used for categorical data like colors, while numeric attributes represent quantitative values, guiding different analysis techniques, like frequency counting vs. statistical calculations.

Explain the importance of understanding the type of attribute (e.g., nominal, numeric) when selecting data analysis techniques.

Understanding attribute types directs the choice of suitable analytical methods, ensuring you apply relevant techniques, like frequency analysis for nominal attributes or statistical measures for numeric attributes.

Describe situations where binary attributes are not equally important, and explain why this distinction matters.

In medical testing, a positive result for a disease might be more critical than a negative one; this asymmetric attribute importance affects how outcomes are weighted and managed.

Explain how ordinal attributes are different from nominal and numeric attributes.

Ordinal attributes possess a meaningful order or ranking, unlike nominal attributes, which are categorical without order; and unlike numeric attributes, they lack consistent intervals between values.

Differentiate between quantitative and interval-scaled numeric attributes. Give their distinguishing characteristics.

Quantitative attributes represent magnitude (how much of a quantity), whereas interval-scaled attributes are measured on a scale of equal-sized units.

Explain the difference between discrete and continuous attributes.

Discrete attributes have a finite or countably infinite set of values (e.g., zip codes), while continuous attributes have real number values (e.g., temperature).

In data analysis, what are 'measures of central tendency,' and why are they useful?

Measures of central tendency (mean, median, mode) indicate the typical value in a dataset, providing a quick overall summary of the data's distribution and characteristics.

Explain why a trimmed mean can be more useful than a simple average in certain data analysis scenarios.

A trimmed mean reduces the impact of outliers by discarding extreme values before averaging, giving a more representative measure of the bulk of the values in a dataset.

Describe how to calculate the median for both odd and even numbered datasets. Why is the median useful?

For an odd number of values, the median is the middle value; for an even number, it is the average of the two middle values. The median is useful because it is not sensitive to extreme values.

Define quartiles and inter-quartile range (IQR), and explain their significance in understanding data dispersion.

Quartiles divide data into four equal parts; the IQR (Q3 − Q1) measures the spread of the middle 50% of the data, helping to assess variability and identify potential outliers.

How does a 'five number summary' add value to basic measures like mean and standard deviation?

It gives five descriptive points: the min, Q1, median, Q3, and max of a dataset. Together these give a quick overview of the data's spread and insight into skew and outliers.

Outline the key components of a boxplot and detail how each communicates information about the dataset.

Boxplots show quartiles, the median, and outliers: the ends of the box mark the quartiles, the median is marked inside the box, whiskers extend to the min/max, and points beyond the whiskers are outliers, displaying the data's spread and skewness.

What criteria define a data point as an outlier? Why is it important to identify outliers in your data?

Outliers lie significantly outside the distribution of the other data points, often indicating measurement errors or genuine anomalies. Identifying and handling them improves subsequent statistical analysis and modeling accuracy.

Describe potential real-world causes of outliers in experimental data.

Outliers can arise from human error, failed equipment, anomalous input conditions, or warm-up effects. Identifying the specific cause enhances data accuracy.

Contrast histograms and boxplots. When is using a histogram preferable to a boxplot?

Histograms display frequency distributions as tabulated bars, while boxplots display summary elements (quartiles, median, outliers). A histogram is preferable when you need more detail about distribution shape, such as multiple modes.

How does the representation of data value differ between a bar chart and a histogram?

In a bar chart, the height of the bar represents the data value, whereas in a histogram the area of the bar represents the value, a crucial distinction when category widths are nonuniform.

What are the common shapes (skewness) of data distributions, and where do the mean, median, and mode fall in each?

The three common shapes are symmetric, positively skewed, and negatively skewed. In symmetric data, the mean, median, and mode are equal. In positively skewed data, mean > median > mode. In negatively skewed data, mean < median < mode.

Compare equiwidth and equidepth binning techniques in histograms, discussing scenarios where one might be preferred.

Equiwidth binning uses uniform-width bucket ranges, which is simple. Equidepth binning creates buckets that each hold a constant number of data points, which is useful when the data is skewed.

How is a quantile plot helpful in assessing the distribution of a dataset?

A quantile plot sorts data in increasing order and plots it against cumulative frequency, showing overall behavior and unusual occurrences and indicating how values are distributed.

Describe the primary use and advantage of using a scatter plot in data analysis.

Scatter plots display relationships between two variables, showing clusters and outliers and giving a good feel for bivariate data.

Explain the meaning of a correlation coefficient and its range of values.

The correlation coefficient indicates the degree of linear relationship between two variables, ranging from -1 to 1; a value of 0 means the variables are uncorrelated (no linear relationship).

How do scatter plots visually represent uncorrelated data?

In scatter plots of uncorrelated data, points appear arbitrarily scattered without any clear pattern or trend, indicating that the variables have no linear relationship.

What is the main goal of data cleaning? Give a specific example.

Data cleaning improves data quality by correcting errors, handling missing values, and resolving inconsistencies, e.g., standardizing date formats.

Explain how moving averages help to recover missing values in time series data, and describe its variants.

Moving averages estimate missing values by averaging previous data points, smoothing out fluctuations; variants include the simple (unweighted) and weighted moving average.

Define data normalization and list two common methods used for normalization.

Data normalization scales data to a specific range to prevent attributes with larger ranges from dominating the analysis. Commonly used methods are min-max and z-score normalization.

Briefly describe min-max normalization. Give an example.

Min-max normalization scales data to a range such as [0,1] based on the min/max values, preserving the relationships among values. An example is scaling house prices that range from $100K to $1M into [0,1].

In what scenarios are data transformations such as the natural log or square root beneficial?

Several data mining methods require variables to be approximately normally distributed. Transformations like the natural log and square root reduce skew, improving model reliability.

What is the main goal of data reduction techniques, and why is it important in data preprocessing?

Data reduction shrinks the dataset so that subsequent analysis runs faster while maintaining the data's integrity, which is essential for extremely large data sets.

Define what dimensionality reduction is, and describe how feature selection contributes to it.

Dimensionality reduction reduces the number of variables considered in data mining, yielding fewer patterns that are easier to interpret. Feature selection contributes by choosing a minimum set of relevant features.

What are data sampling techniques in data mining, and why are they used?

Data sampling chooses a reduced subset that mirrors the main characteristics of the full data, allowing a mining algorithm to deal with large amounts of data while lowering processing cost.

Contrast SRSWR vs SRSWOR sampling techniques.

SRSWOR (simple random sample without replacement) removes each selected item from the population, so it cannot be drawn again; SRSWR (simple random sample with replacement) returns each item after selection, so it may be drawn more than once.

For data sampling, explain stratified sampling.

Stratified sampling partitions the data into subpopulations (strata) and samples from each, adapting to known skew so that each class's percentage in the sample approximates its percentage in the overall data.

In data preprocessing, what makes data discretization an important part of reducing data, especially for numerical data?

Data discretization reduces data by converting numeric values into a small number of intervals, simplifying the data and often making patterns easier to interpret.
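
To make this concrete, here is a minimal Python sketch of equal-width discretization (an illustration only; the function name, bin count, and sample ages are invented for this example):

```python
import numpy as np

def discretize_equal_width(values, n_bins=3):
    """Map numeric values to equal-width interval labels."""
    values = np.asarray(values, dtype=float)
    edges = np.linspace(values.min(), values.max(), n_bins + 1)
    # Assign each value to a bin using the interior edges.
    idx = np.digitize(values, edges[1:-1])
    labels = [f"[{edges[i]:.1f}, {edges[i + 1]:.1f})" for i in range(n_bins)]
    return [labels[i] for i in idx]

ages = [22, 25, 31, 38, 45, 52, 67]
print(discretize_equal_width(ages))  # ['[22.0, 37.0)', ..., '[52.0, 67.0)']
```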

Flashcards

Why data preprocessing?

Data in the real world often lacks values, contains errors (noisy), and has discrepancies.

What is an attribute?

An attribute is a data field representing a characteristic of a data object, such as customer_ID, name, or address.

What is a data object?

A data object represents an entity in a database; rows are data objects, and columns are object attributes.

Nominal attribute

Nominal attributes are categories, states, or 'names of things'.

Binary attribute

Binary is a nominal attribute with only 2 states (0 or 1).

Ordinal attribute

Ordinal attributes have a meaningful order, but the magnitude between values is unknown.

Discrete Attribute

Discrete attributes have a finite or countably infinite set of values.

Continuous Attribute

Continuous attributes have real numbers as attribute values.

Mean, Median, Mode

The mean is the average of the data, the median is the middle value, and the mode is the data value that occurs most frequently.

What is a boxplot?

Boxplots display quartiles, outliers, and data distribution, with the box representing the IQR and whiskers extending to min/max.

What does variance measure?

Variance measures how far a data set is spread out.

What are outliers?

Outliers are data points that come from a different distribution than the bulk of the data.

What is a histogram?

Histograms are a graphical display of tabulated frequencies, and the area of each bar denotes the value.

Equiwidth bucketing

Equiwidth divides buckets so each bucket range is uniform.

Equidepth bucketing

Equidepth creates buckets so that the frequency of each bucket is constant.

Quantile plot

Quantile plots indicate that approximately 100f% of the data are ≤ x_i.

Scatter Plot

A scatter plot lets you see clusters of points and outliers.

Correlation Coefficient

Correlation coefficient measures how strongly two variables are related.

Data Cleaning

Data cleaning tasks include filling in missing values, identifying or removing outliers, smoothing noisy data, and correcting inconsistent data.

Moving Average

A simple moving average is the unweighted mean of the previous n data points; a weighted moving average applies weights to those points.

Data Normalization Goal

The goal of data normalization is to scale data to a smaller range, like 0 to 1.

Min-Max Normalization

With Min-max normalization, the new value will fall between a new specified min and max.

Z-Score Normalization

Z-score normalization transforms values based on mean and standard deviation.

What is data reduction?

Data reduction obtains a reduced representation of the dataset that yields the same or similar analytical results.

Feature Selection

Feature selection finds a minimum set of features useful for data mining, reducing the number of patterns.

Sampling

Sampling allows a mining algorithm to handle large datasets by working on a representative subset of the data.

Study Notes

  • Data preprocessing is required because real-world data is imperfect, often incomplete, noisy, and inconsistent.
  • Without quality data, data mining yields poor results, and quality decisions rely on consistent data integration.

Types of Data Sets

  • Relational records are rows of attribute values such as NAME, AGE, INCOME, and CREDIT_RATING, e.g., (Mike, <=30, low, fair).
  • Transaction data lists TID and Items, like 1, Bread, Coke, Milk.
  • Data matrix is a numerical matrix.
  • Document data includes text documents represented as term-frequency vectors.

Data Objects

  • Data sets are composed of data objects, with each data object representing an entity.
  • Examples include customers and sales in a sales database, or patients and treatments in a medical database.
  • Data objects are also referred to as samples, examples, instances, data points, objects, or tuples.
  • Data objects are described by attributes, and database rows correspond to data objects while columns correspond to attributes.

Attributes

  • Attributes, also known as dimensions, features, or variables, are data fields representing characteristics of a data object (e.g., customer_ID, name, address).
  • Attribute types can be nominal or numeric.

Nominal Attributes

  • Nominal attributes are categories, states, or "names of things" (e.g., hair color: auburn, black, blond, brown, grey, red, white).
  • Binary attributes are a special type of nominal attribute with only two states (0 and 1).
  • Symmetric binary attributes have equally important outcomes, like gender.
  • Asymmetric binary attributes have unequally important outcomes, such as medical test results, with a convention of assigning 1 to the more important outcome.
  • Ordinal attributes have a meaningful order, but the magnitude between values is not known (e.g., Size = {small, medium, large}, grades).

Numeric Attributes

  • Quantitative attributes are integer- or real-valued.
  • Interval attributes are measured on a scale of equal-sized units and have order, for example, calendar dates, but no true zero-point.

Discrete and Continuous Attributes

  • Discrete attributes have a finite or countably infinite set of values (e.g., zip codes, profession).
  • Binary attributes are a special case of discrete attributes.
  • Continuous attributes have real numbers as attribute values (e.g., temperature, height, weight).
  • Real values are measured and represented using a finite number of digits, and continuous attributes are typically represented as floating-point variables.

Basic Statistical Descriptions of Data

  • Data is better understood through its central tendency, variation, and spread.
  • Data dispersion characteristics include the median, max, min, quantiles, outliers, and variance.
  • Numerical dimensions correspond to sorted intervals.
  • Data dispersion is analyzed at multiple granularities using tools like boxplots or quantile analysis on sorted intervals.

Measuring the Central Tendency

  • Mean is an algebraic measure, calculated differently for samples versus populations and can be weighted or trimmed to handle extreme values.
  • Median is the middle value for an odd number of values, or the average of the middle two values for an even number of values, and can be estimated by interpolation for grouped data.
  • Mode is the value that occurs most frequently; data can be unimodal, bimodal, or trimodal.
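
As an illustrative sketch of these measures (assuming NumPy and SciPy are available; the sample data is invented):

```python
import statistics
import numpy as np
from scipy.stats import trim_mean

data = [10, 12, 12, 13, 14, 15, 100]            # 100 is an extreme value

print("mean:        ", np.mean(data))            # ~25.1, pulled up by the outlier
print("trimmed mean:", trim_mean(data, 0.2))     # 13.2, drops 20% from each end first
print("median:      ", np.median(data))          # 13.0, the middle value
print("mode:        ", statistics.mode(data))    # 12, the most frequent value
```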

Measuring the Dispersion of Data

  • Quartiles, outliers, and boxplots are essential for analyzing data dispersion.
  • Quartiles divide sorted data into four equal parts; Q1 is the 25th percentile and Q3 the 75th percentile.
  • The inter-quartile range is calculated as IQR = Q3 - Q1.
  • A five-number summary includes the min, Q1, median, Q3, and max values.
  • Boxplots visualize the data, marking the median, drawing whiskers, and identifying outliers.
  • Variance and standard deviation measure data dispersion.
  • Variance is calculated with slightly different formulas for samples and populations and can be computed scalably.
  • Standard deviation is the square root of the variance.
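
A minimal NumPy sketch of these dispersion measures, with invented data and the common 1.5×IQR outlier rule of thumb:

```python
import numpy as np

data = np.array([30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 110])

q1, med, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1
print("five-number summary:", data.min(), q1, med, q3, data.max())
print("IQR:", iqr)

# Rule of thumb: flag points beyond 1.5 * IQR from the quartiles.
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print("outlier candidates:", data[(data < low) | (data > high)])  # [110]

print("sample variance:", data.var(ddof=1))   # ddof=1 gives the sample formula
print("sample std dev: ", data.std(ddof=1))
```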

Boxplot Analysis

  • Provides a five-number summary of a distribution.
  • Data is represented in a box whose ends are at the first and third quartiles, so the height of the box equals the IQR.
  • The median is marked by a line within the box.
  • Whiskers are two lines extending outside the box to the minimum and maximum values.
  • Outliers are plotted individually beyond a specified threshold.

Outliers

  • Outliers are data points from a different distribution than the bulk of the data.
  • They can arise from operator blunders, equipment failures, day-to-day or batch-to-batch effects, anomalous input conditions, or warm-up effects.

Histograms

  • Histograms are graphical displays of tabulated frequencies shown as bars, where the X-axis shows values and the Y-axis shows frequencies.
  • Histograms show the proportion of cases falling into each of several categories and look like bar charts.
  • A histogram differs from a bar chart in that the area of the bar, not its height, denotes the value, an essential distinction when the categories are not of uniform width.
  • The categories are specified as non-overlapping intervals of some variable, and the categories (bars) must be adjacent.
  • Histograms visually reveal the positions of the median, mean, and mode for symmetric, positively skewed, and negatively skewed data.
  • Histograms can tell more than boxplots because they show the shape of the underlying data distribution.

Histograms: Buckets

  • A single-value bucket contains exactly one value.
  • Buckets can also denote a continuous range of values.
  • Buckets are determined, and values partitioned, in two common ways (see the sketch below):
    • Equiwidth: the width of each bucket range is uniform
    • Equidepth: buckets are created so that the frequency of each bucket is constant
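
A small sketch contrasting the two schemes on deliberately skewed, invented data (assuming NumPy):

```python
import numpy as np

# Deliberately skewed data: most values are small, a few are large.
data = np.array([1, 2, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144])

# Equiwidth: bucket boundaries are evenly spaced over the value range.
equiwidth_edges = np.linspace(data.min(), data.max(), 5)

# Equidepth: boundaries are quantiles, so each bucket holds about the same count.
equidepth_edges = np.percentile(data, [0, 25, 50, 75, 100])

print("equiwidth counts:", np.histogram(data, bins=equiwidth_edges)[0])  # very uneven
print("equidepth counts:", np.histogram(data, bins=equidepth_edges)[0])  # ~3 per bucket
```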

Equiwidth Histograms

  • Equiwidth histograms can illustrate the properties of the normal distribution curve, for example:
    • From μ−σ to μ+σ: contains about 68% of the measurements (μ: mean, σ: standard deviation)
    • From μ−2σ to μ+2σ: contains about 95% of them
    • From μ−3σ to μ+3σ: contains about 99.7% of them

Graphic Displays of Basic Statistical Descriptions

  • Boxplot: displays a graphic representation of the five-number summary
  • Histogram: displays values on the x-axis and frequencies on the y-axis
  • Quantile plot: each value x_i is paired with f_i, indicating that approximately 100·f_i% of the data are ≤ x_i
  • Scatter plot: each pair of values is treated as a pair of coordinates and plotted as a point

Quantile Plot

  • Quantile Plot displays all data allowing viewers to assess the overall behavior and unusual occurrences.
  • It plots quantile information: the data is sorted in increasing order, and each point's f-value indicates that approximately 100f% of the data are below or equal to that value.
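
A minimal sketch of a quantile plot, assuming NumPy and Matplotlib are available; the f-value convention (i − 0.5)/n is one common choice, and the data is invented:

```python
import numpy as np
import matplotlib.pyplot as plt

data = np.array([12, 7, 3, 19, 8, 15, 5, 11, 9, 14])

x = np.sort(data)                        # sort values in increasing order
n = len(x)
f = (np.arange(1, n + 1) - 0.5) / n      # f-value: ~100f% of the data are <= x_i

plt.plot(f, x, marker="o")
plt.xlabel("f-value (cumulative fraction of data)")
plt.ylabel("data value")
plt.title("Quantile plot")
plt.show()
```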

Scatter Plot

  • Provides a first look at bivariate data to see the clusters of points, outliers, etc.
  • Each pair of values is treated as a pair of coordinates and plotted as points.

Correlation

  • The correlation coefficient, also called Pearson's product-moment coefficient, is used for measuring correlation.
  • If r(A,B) > 0, A and B are positively correlated: A's values increase as B's increase.
  • The higher the coefficient, the stronger the correlation.
  • If r(A,B) = 0, A and B are uncorrelated (no linear relationship); this alone does not establish independence.
  • If r(A,B) < 0, they are negatively correlated.
  • In summary, the coefficient measures the strength of the linear relationship on a scale from -1 to 1.
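
A short sketch computing Pearson's coefficient by hand and checking it against NumPy's built-in (the sample values are invented):

```python
import numpy as np

def pearson_r(a, b):
    """Pearson's product-moment correlation coefficient."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    ac, bc = a - a.mean(), b - b.mean()
    return (ac * bc).sum() / np.sqrt((ac ** 2).sum() * (bc ** 2).sum())

hours = [1, 2, 3, 4, 5]
score = [52, 57, 61, 68, 72]
print(pearson_r(hours, score))           # close to +1: strong positive correlation
print(np.corrcoef(hours, score)[0, 1])   # NumPy's built-in gives the same value
```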

More on Outliers

  • An outlier is a data value of low probability; it is unusual or unexpected.
  • In a scatterplot, outliers are points that fall outside of the overall pattern

Major Tasks in Data Preprocessing

  • Data cleaning: Fill in missing values, smooth noisy data, detect/remove outliers, and resolve inconsistencies.
  • Data integration: Integrate multiple databases, data cubes, or files.
  • Data transformation: Normalize and aggregate data.
  • Data reduction: Obtain a smaller volume representation while preserving analytical results.
  • Data discretization: Reduce data, especially for numerical data.

Data Cleaning

  • Involves filling missing values, identifying outliers, smoothing noisy data, and correcting inconsistent data.

Recover Missing Values

  • A simple moving average is the unweighted mean of the previous n data points in a time series.
  • A weighted moving average is a weighted mean of the previous data points.
  • A weighted moving average is more responsive to recent changes (e.g., in market data) than a simple one.
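
A minimal sketch of both variants, with invented prices; the average over the most recent window could serve as an estimate for a missing next point:

```python
import numpy as np

def simple_moving_average(series, n):
    """Unweighted mean over each window of n consecutive points."""
    return np.convolve(series, np.ones(n) / n, mode="valid")

def weighted_moving_average(series, weights):
    """Weighted mean over each window; the last weight applies to the newest point."""
    w = np.asarray(weights, dtype=float)
    return np.convolve(series, w[::-1] / w.sum(), mode="valid")

prices = np.array([10.0, 11.0, 12.0, 11.5, 13.0, 12.5])
print(simple_moving_average(prices, 3))             # [11.0, 11.5, ~12.17, ~12.33]
print(weighted_moving_average(prices, [1, 2, 3]))   # recent points count more
```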

Data Transformation

  • Smoothing to remove noise
  • Aggregation is summarization, data cube construction
  • Normalization scaling data to fall within a small, specified range
  • Attribute/feature construction to construct new attributes from existing ones

Data Transformation: Normalization

  • Min-max normalization uses the function v' = (v − min) / (max − min) × (new_max − new_min) + new_min.
  • Z-score normalization uses the function v' = (v − mean) / standard_deviation.
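
A minimal sketch of both normalizations, assuming NumPy; the income figures are invented:

```python
import numpy as np

def min_max_normalize(v, new_min=0.0, new_max=1.0):
    """Scale values linearly into [new_min, new_max]."""
    v = np.asarray(v, dtype=float)
    return (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

def z_score_normalize(v):
    """Center on the mean and scale by the standard deviation."""
    v = np.asarray(v, dtype=float)
    return (v - v.mean()) / v.std()

incomes = np.array([30_000, 45_000, 60_000, 120_000])
print(min_max_normalize(incomes))   # scaled into [0, 1]
print(z_score_normalize(incomes))   # mean 0, unit standard deviation
```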

Transformation to Achieve Normality

  • Transforming data can help achieve normality, because several data mining methods require that the variables be normally distributed.
  • Z-score standardization might not achieve normality; the distribution may still be skewed.
  • Apply a transformation instead: natural log, square root, or inverse square root (see the sketch below).
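
A hedged illustration using synthetic right-skewed (lognormal) data, assuming NumPy and SciPy:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=1000)   # right-skewed sample

print("skewness before:  ", stats.skew(skewed))
print("after natural log:", stats.skew(np.log(skewed)))   # near 0: roughly symmetric
print("after square root:", stats.skew(np.sqrt(skewed)))  # skew reduced, not removed
```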

Data Reduction Strategies

  • Data reduction is important because complex data analysis/mining can take a long time to run on large data sets.
  • It obtains a reduced representation of the dataset that yields the same (or almost the same) analytical results.
  • Achieved with dimensionality reduction or sampling

Dimensionality Reduction

  • Feature selection: selecting the minimum set of features useful for data mining
  • It reduces the number of patterns, which makes them easier to understand.

Sampling

  • Sampling allows a mining algorithm to handle large amounts of data by choosing a representative subset
  • Simple random sampling can have very poor performance in the presence of skew
  • Stratified or adaptive sampling offers better approximation to a real percentage of each class in the overall database
  • Stratified sampling is especially well suited to skewed data.
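
A minimal sketch of SRSWOR, SRSWR, and proportionate stratified sampling on an invented, skewed dataset (assuming NumPy; the 10% sample fraction is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(42)
data = np.arange(100)                      # stand-in dataset of 100 record IDs
labels = np.array([0] * 90 + [1] * 10)     # skewed classes: 90% vs 10%

# SRSWOR: simple random sampling without replacement (no duplicates).
srswor = rng.choice(data, size=10, replace=False)

# SRSWR: simple random sampling with replacement (duplicates possible).
srswr = rng.choice(data, size=10, replace=True)

# Stratified: sample each class in proportion to its share of the data,
# so the rare class is still represented.
stratified = np.concatenate([
    rng.choice(data[labels == c],
               size=max(1, round(0.1 * (labels == c).sum())),
               replace=False)
    for c in np.unique(labels)
])
print("SRSWOR:    ", srswor)
print("SRSWR:     ", srswr)
print("stratified:", stratified)
```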
