Questions and Answers
How does inconsistent data manifest in real-world datasets, and what makes it challenging for data mining?
Inconsistent data appears as discrepancies in codes or names within a dataset, making it difficult to ensure the reliability and accuracy of data mining outcomes.
Explain how data preprocessing enhances the reliability of data mining results.
Data preprocessing ensures quality data through cleaning, integration, and transformation, which leads to higher-quality mining results by handling missing values and inconsistencies.
Describe scenarios where aggregate data, while present, can still be considered a form of incomplete data.
Aggregate data is incomplete when the underlying granular attribute values are missing, because analyses that require individual data points cannot be performed.
How do data warehouses benefit from consistent integration of high-quality data?
Compare relational records and transaction data, referring to their structure and use cases.
In the context of a sales database, distinguish between data objects and attributes, providing examples.
What is the practical difference between nominal and numeric attributes in data analysis, and when would you use each?
Explain the importance of understanding the type of attribute (e.g., nominal, numeric) when selecting data analysis techniques.
Describe situations where binary attributes are not equally important, and explain why this distinction matters.
Explain how ordinal attributes are different from nominal and numeric attributes.
Differentiate between quantitative and interval-scaled numeric attributes. Give their distinguishing characteristics.
Explain the difference between discrete and continuous attributes.
In data analysis, what are 'measures of central tendency,' and why are they useful?
Explain why a trimmed mean can be more useful than a simple average in certain data analysis scenarios.
Describe how to calculate the median for datasets with an odd or an even number of values. Why is the median useful?
Define quartiles and inter-quartile range (IQR), and explain their significance in understanding data dispersion.
How does a 'five number summary' add value to basic measures like mean and standard deviation?
Outline the key components of a boxplot and detail how each communicates information about the dataset.
What criteria define a data point as an outlier? Why is it important to identify outliers in your data?
Describe potential real-world causes of outliers in experimental data.
Contrast histograms and boxplots. When is using a histogram preferable to a boxplot?
How does the representation of data values differ between a bar chart and a histogram?
What are the common shapes (skewness) of data distributions, and where do the mean, median, and mode fall in each?
Compare equiwidth and equidepth binning techniques in histograms, discussing scenarios where one might be preferred.
How is a quantile plot helpful in assessing the distribution of a dataset?
Describe the primary use and advantage of using a scatter plot in data analysis.
Explain the meaning of a correlation coefficient and its range of values.
How do scatter plots visually represent uncorrelated data?
What is the main goal of data cleaning? Give a specific example.
Explain how moving averages help to recover missing values in time series data, and describe their variants.
Define data normalization and list two common methods used for normalization.
Briefly describe min-max normalization. Give an example.
In what scenarios are data transformations such as the natural log or square root beneficial?
What is the main goal of data reduction techniques, and why is it important in data preprocessing?
Define what dimensionality reduction is, and describe how feature selection contributes to it.
What are data sampling techniques in data mining, and why are they used?
Contrast SRSWR vs SRSWOR sampling techniques.
For data sampling, explain stratified sampling.
In data preprocessing, what makes data discretization an important part of reducing data, especially for numerical data?
Flashcards
Why data preprocessing?
Real-world data is often incomplete (missing values), noisy (contains errors), and inconsistent (contains discrepancies).
What is an attribute?
An attribute is a data field representing a characteristic of a data object, such as customer_ID, name, or address.
What is a data object?
A data object represents an entity in a database; rows are data objects, and columns are object attributes.
Nominal attribute
Categories, states, or "names of things," such as hair color (auburn, black, blond, brown, grey, red, white).
Binary attribute
A nominal attribute with only two states (0 and 1); it may be symmetric or asymmetric.
Ordinal attribute
An attribute whose values have a meaningful order, but where the magnitude between successive values is not known (e.g., size = {small, medium, large}).
Discrete Attribute
An attribute with a finite or countably infinite set of values (e.g., zip codes, profession).
Continuous Attribute
An attribute whose values are real numbers (e.g., temperature, height, weight), typically stored as floating-point variables.
Mean, Median, Mode
The three common measures of central tendency: the average value, the middle value, and the most frequent value.
What is a boxplot?
A graphic display of a five-number summary: the box spans Q1 to Q3 with the median marked inside, whiskers extend to the extremes, and outliers are plotted individually.
What does variance measure?
The dispersion of data around the mean; the standard deviation is the square root of the variance.
What are outliers?
Data values of low probability: unusual or unexpected points that fall outside the overall pattern.
What is a histogram?
A graph of tabulated frequencies shown as bars, with values on the x-axis and frequencies on the y-axis.
Equiwidth bucketing
Partitioning values into buckets whose range widths are uniform.
Equidepth bucketing
Partitioning values into buckets that each contain roughly the same number of values.
Quantile plot
A plot pairing each sorted value xᵢ with fᵢ to indicate that approximately 100·fᵢ % of the data are ≤ xᵢ.
Scatter Plot
A plot in which each pair of values is treated as a pair of coordinates; it gives a first look at bivariate data, clusters, and outliers.
Correlation Coefficient
Pearson's product-moment coefficient, which measures linear correlation on a scale from -1 to 1.
Data Cleaning
Filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies.
Moving Average
The (possibly weighted) mean of previous data points in a time series, used for smoothing and for recovering missing values.
Data Normalization Goal
Scaling data so that it falls within a small, specified range.
Min-Max Normalization
v' = (v − min) / (max − min) × (new_max − new_min) + new_min.
Z-Score Normalization
v' = (v − mean) / standard deviation.
What is data reduction?
Obtaining a reduced representation of the dataset that preserves the same analytical results.
Feature Selection
Selecting the minimum set of features useful for mining; a key technique for dimensionality reduction.
Sampling
Choosing a representative subset of the data so that a mining algorithm can handle large datasets.
Study Notes
- Data preprocessing is required because real-world data is imperfect, often incomplete, noisy, and inconsistent.
- Without quality data, data mining yields poor results, and quality decisions rely on consistent data integration.
Types of Data Sets
- Relational records are rows of attribute values such as NAME, AGE, INCOME, and CREDIT RATING (e.g., Mike, <=30, low, fair).
- Transaction data lists a TID together with a set of items (e.g., TID 1: Bread, Coke, Milk).
- Data matrix is a numerical matrix.
- Document data includes text documents represented as term-frequency vectors.
Data Objects
- Data sets are composed of data objects, with each data object representing an entity.
- Examples include customers and sales in a sales database, or patients and treatments in a medical database.
- Data objects are also referred to as samples, examples, instances, data points, objects, or tuples.
- Data objects are described by attributes, and database rows correspond to data objects while columns correspond to attributes.
Attributes
- Attributes, also known as dimensions, features, or variables, are data fields representing characteristics of a data object (e.g., customer_ID, name, address).
- Attribute types can be nominal or numeric.
Nominal Attributes
- Nominal attributes are categories, states, or "names of things" (e.g., hair color: auburn, black, blond, brown, grey, red, white).
- Binary attributes are a special type of nominal attribute with only two states (0 and 1).
- Symmetric binary attributes have equally important outcomes, like gender.
- Asymmetric binary attributes have unequally important outcomes, such as medical test results, with the convention of assigning 1 to the most important outcome.
- Ordinal attributes have a meaningful order, but the magnitude between successive values is not known (e.g., size = {small, medium, large}, grades).
Numeric Attributes
- Quantitative attributes are integer- or real-valued.
- Interval-scaled attributes are measured on a scale of equal-sized units and have an order but no true zero point (e.g., calendar dates).
Discrete and Continuous Attributes
- Discrete attributes have a finite or countably infinite set of values (e.g., zip codes, profession).
- Binary attributes are a special case of discrete attributes.
- Continuous attributes have real numbers as attribute values (e.g., temperature, height, weight).
- Real values are measured and represented using a finite number of digits, and continuous attributes are typically represented as floating-point variables.
Basic Statistical Descriptions of Data
- Data is better understood through central tendency, variation, and spread.
- Data dispersion characteristics include median, max, min, quantiles, outliers, and variance.
- Numerical dimensions correspond to sorted intervals, and dispersion is analyzed at multiple granularities using tools such as boxplots or quantile analysis.
Measuring the Central Tendency
- Mean is an algebraic measure, calculated differently for samples versus populations and can be weighted or trimmed to handle extreme values.
- Median is the middle value for an odd number of values, or the average of the middle two values for an even number of values, and can be estimated by interpolation for grouped data.
- Mode is the value that occurs most frequently; data can be unimodal, bimodal, or trimodal (all three measures are computed in the sketch below).
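A minimal Python sketch of these measures, using made-up values; trimmed_mean is a hypothetical helper, since the standard library does not provide one:

```python
from statistics import mean, median, mode

# Illustrative values only; note the extreme point 110
data = [30, 36, 47, 50, 52, 52, 56, 60, 70, 110]

def trimmed_mean(values, p=0.1):
    """Mean after dropping the lowest and highest fraction p of the values."""
    s = sorted(values)
    k = int(len(s) * p)
    return mean(s[k:len(s) - k])

print(mean(data))          # 56.3: pulled upward by the extreme value 110
print(median(data))        # 52.0: average of the two middle values (even n)
print(mode(data))          # 52: the most frequently occurring value
print(trimmed_mean(data))  # 52.875: 10% trimmed mean, less sensitive to 110
```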
Measuring the Dispersion of Data
- Quartiles, outliers, and boxplots are essential for analyzing data dispersion.
- Quartiles divide the sorted data into four equal parts; Q1 is the 25th percentile and Q3 is the 75th percentile.
- The inter-quartile range is calculated as IQR = Q3 - Q1.
- A five-number summary includes the min, Q1, median, Q3, and max values.
- Boxplots visualize data by marking the median, drawing whiskers, and identifying outliers.
- Variance and standard deviation measure data dispersion (see the sketch below).
- Variance is calculated with slightly different formulas for samples and populations, and it can be computed scalably in a single pass over the data.
- Standard deviation is the square root of the variance.
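The measures above can be computed directly with numpy; the data values below are illustrative:

```python
import numpy as np

data = np.array([6, 7, 15, 36, 39, 40, 41, 42, 43, 47, 49])  # illustrative values

q1, med, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1                                                # inter-quartile range
five_number_summary = (data.min(), q1, med, q3, data.max())

sample_variance = data.var(ddof=1)      # sample formula: divides by n - 1
population_variance = data.var(ddof=0)  # population formula: divides by n
std_dev = np.sqrt(sample_variance)      # square root of the variance

print(five_number_summary, iqr, std_dev)
```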
Boxplot Analysis
- Provides a five-number summary of a distribution.
- Data is represented in a box whose ends sit at the first and third quartiles, so the height of the box equals the IQR.
- The median is marked by a line within the box.
- Whiskers are two lines extending from the box toward the minimum and maximum values.
- Outliers are plotted individually beyond a specified threshold (see the sketch below).
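A minimal matplotlib sketch; the values are made up, and the flier threshold noted in the comment is matplotlib's default convention of 1.5 × IQR:

```python
import matplotlib.pyplot as plt

data = [6, 7, 15, 36, 39, 40, 41, 42, 43, 47, 49, 120]  # 120 will appear as an outlier

fig, ax = plt.subplots()
# Box spans Q1..Q3, a line marks the median, whiskers reach the most extreme
# points within 1.5 * IQR of the box, and anything beyond is drawn individually.
ax.boxplot(data)
ax.set_ylabel("value")
plt.show()
```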
Outliers
- Outliers are data points from a different distribution than the bulk of the data.
- They can arise from operator blunders, equipment failures, day-to-day or batch-to-batch effects, anomalous input conditions, or warm-up effects.
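Whatever their cause, outliers can be flagged programmatically. A commonly used convention, the same one boxplots apply by default, marks points beyond 1.5 × IQR from the quartiles; a minimal numpy sketch with illustrative values:

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Flag points outside [Q1 - k*IQR, Q3 + k*IQR]; k = 1.5 is the usual choice."""
    values = np.asarray(values)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return values[(values < q1 - k * iqr) | (values > q3 + k * iqr)]

print(iqr_outliers([36, 39, 40, 41, 42, 43, 47, 49, 120]))  # -> [120]
```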
Histograms
- Histograms are graphical displays of tabulated frequencies shown as bars, with values on the x-axis and frequencies on the y-axis.
- Histograms show the proportion of cases falling into each of several categories and superficially resemble bar charts.
- A histogram differs from a bar chart in that the area of the bar, not its height, denotes the value; the distinction matters when the categories are not of uniform width.
- The categories are specified as non-overlapping intervals of some variable, and the bars must be adjacent.
- Histograms visually represent the median, mean, and mode for symmetric, positively, and negatively skewed data.
- Histograms can tell more than boxplots because they show the shape of the underlying data distribution.
Histograms: Buckets
- A singleton bucket contains a single value.
- Buckets can denote a continuous range of values.
- Buckets are determined and values partitioned in two common ways (contrasted in the sketch below):
- Equiwidth: the width of each bucket range is uniform.
- Equidepth: buckets are created so that each contains roughly the same number of values.
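A minimal numpy sketch contrasting the two schemes on skewed synthetic data; the exponential sample and the choice of five buckets are illustrative assumptions:

```python
import numpy as np

data = np.random.default_rng(0).exponential(scale=10, size=1000)  # skewed toy data

# Equiwidth: 5 buckets whose range widths are uniform
counts, edges = np.histogram(data, bins=5)

# Equidepth: 5 buckets with roughly equal frequency, using quantile-based edges
eq_edges = np.quantile(data, np.linspace(0, 1, 6))
eq_counts, _ = np.histogram(data, bins=eq_edges)

print(counts)     # varies widely under skew: most points land in the first bucket
print(eq_counts)  # roughly 200 points per bucket, but bucket widths vary
```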
Equiwidth Histograms
- An equiwidth histogram of normally distributed data reflects the properties of the normal distribution curve:
- From μ−σ to μ+σ: about 68% of the measurements (μ: mean, σ: standard deviation)
- From μ−2σ to μ+2σ: about 95% of the measurements
- From μ−3σ to μ+3σ: about 99.7% of the measurements
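These percentages can be checked empirically; a small numpy sketch using simulated standard-normal data:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=0.0, scale=1.0, size=100_000)  # mu = 0, sigma = 1

for k in (1, 2, 3):
    frac = np.mean(np.abs(x) <= k)  # share of points within k standard deviations
    print(f"within {k} sigma: {frac:.3f}")  # approximately 0.683, 0.954, 0.997
```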
Graphic Displays of Basic Statistical Descriptions
- Boxplot: Displays a graphic representation of a five-number summary
- Histogram: displays values on the x-axis and their frequencies on the y-axis.
- Quantile Plot: each value xᵢ is paired with fᵢ to indicate that approximately 100·fᵢ % of the data are ≤ xᵢ.
- Scatter Plot: each pair of values is treated as a pair of coordinates and plotted as a point.
Quantile Plot
- Quantile Plot displays all data allowing viewers to assess the overall behavior and unusual occurrences.
- Plots quantile information: the data are sorted in increasing order, and each value xᵢ is paired with fᵢ, indicating that approximately 100·fᵢ % of the data are less than or equal to xᵢ (see the sketch below).
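A minimal sketch of a quantile plot; the plotting positions fᵢ = (i − 0.5)/n are a common convention, and the data values are illustrative:

```python
import numpy as np
import matplotlib.pyplot as plt

data = np.sort(np.array([13, 15, 16, 16, 19, 20, 21, 22, 25, 30]))
n = len(data)
f = (np.arange(1, n + 1) - 0.5) / n  # f_i = (i - 0.5) / n

plt.plot(f, data, marker="o")
plt.xlabel("f (fraction of data <= value)")
plt.ylabel("data value")
plt.show()
```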
Scatter Plot
- Provides a first look at bivariate data to see the clusters of points, outliers, etc.
- Each pair of values is treated as a pair of coordinates and plotted as points.
Correlation
- The correlation coefficient, also called Pearson's product-moment coefficient, is used for measuring correlation.
- If rA,B > 0, A and B are positively correlated: A's values increase as B's increase.
- The higher the value, the stronger the correlation.
- If rA,B = 0, A and B are uncorrelated (there is no linear relationship).
- If rA,B < 0, they are negatively correlated.
- In summary, the coefficient measures linear association on a scale from -1 to 1 (see the sketch below).
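A small numpy/matplotlib sketch contrasting correlated and uncorrelated synthetic data; the way b is constructed from a is an illustrative assumption:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
a = rng.normal(size=200)
b = 0.8 * a + rng.normal(scale=0.5, size=200)  # built to correlate with a
c = rng.normal(size=200)                       # generated independently of a

r_ab = np.corrcoef(a, b)[0, 1]  # strongly positive
r_ac = np.corrcoef(a, c)[0, 1]  # near 0: the scatter shows a shapeless cloud

fig, axes = plt.subplots(1, 2)
axes[0].scatter(a, b)
axes[0].set_title(f"r = {r_ab:.2f}")
axes[1].scatter(a, c)
axes[1].set_title(f"r = {r_ac:.2f}")
plt.show()
```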
More on Outliers
- An outlier is a data value of low probability; it is unusual or unexpected.
- In a scatterplot, outliers are points that fall outside of the overall pattern
Major Tasks in Data Preprocessing
- Data cleaning: Fill in missing values, smooth noisy data, detect/remove outliers, and resolve inconsistencies.
- Data integration: Integrate multiple databases, data cubes, or files.
- Data transformation: Normalize and aggregate data.
- Data reduction: Obtain a smaller volume representation while preserving analytical results.
- Data discretization: Reduce data, especially for numerical data.
Data Cleaning
- Involves filling missing values, identifying outliers, smoothing noisy data, and correcting inconsistent data.
Recover Missing Values
- A simple moving average is the unweighted mean of previous data points in a time series.
- A weighted moving average is a weighted mean of previous data points.
- A weighted moving average is more responsive to recent changes (e.g., market movements) than a simple moving average, as the sketch below illustrates.
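A minimal numpy sketch of both variants; the price series and the 1-2-3 weights are illustrative assumptions:

```python
import numpy as np

prices = np.array([10.0, 11.0, 12.0, 13.0, 15.0, 14.0, 16.0])

# Simple moving average: unweighted mean over a 3-point window
window = 3
sma = np.convolve(prices, np.ones(window) / window, mode="valid")

# Weighted moving average: recent points get larger weights (weights sum to 1)
w = np.array([1.0, 2.0, 3.0]) / 6.0               # oldest -> newest
wma = np.convolve(prices, w[::-1], mode="valid")  # reversed so the newest gets 3/6

print(sma)  # first value: (10 + 11 + 12) / 3 = 11.0
print(wma)  # first value: (1*10 + 2*11 + 3*12) / 6 = 11.33, closer to the newest point
```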
Data Transformation
- Smoothing: remove noise from the data.
- Aggregation: summarization, data cube construction.
- Normalization: scaling data to fall within a small, specified range.
- Attribute/feature construction: building new attributes from existing ones.
Data Transformation: Normalization
- Min-max normalization uses the function v' = (v − min) / (max − min) × (new_max − new_min) + new_min.
- Z-score normalization uses the function v' = (v − mean) / standard_deviation. Both are illustrated in the sketch below.
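A minimal Python sketch of both formulas. The income figures mirror the classic textbook-style example (observed range [12,000, 98,000], mean 54,000, standard deviation 16,000) and are used purely for illustration:

```python
def min_max(v, old_min, old_max, new_min=0.0, new_max=1.0):
    """v' = (v - min) / (max - min) * (new_max - new_min) + new_min"""
    return (v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min

def z_score(v, mean, std):
    """v' = (v - mean) / standard deviation"""
    return (v - mean) / std

print(min_max(73_600, 12_000, 98_000))   # 0.716: income rescaled into [0, 1]
print(z_score(73_600, 54_000, 16_000))   # 1.225: 1.225 std devs above the mean
```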
Transformation to Achieve Normality
- Transforming data can achieve normality, because several data mining methods require that the variables are normally distributed.
- Z-score standardization might not achieve normality; the distribution may still be skewed.
- Instead, apply a transformation such as the natural log, square root, or inverse square root (see the sketch below).
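A small scipy sketch showing how these transforms reduce skew; the lognormal sample is an illustrative assumption:

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(3)
x = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)  # right-skewed, all positive

print(skew(x))           # strongly positive skew
print(skew(np.sqrt(x)))  # square root reduces the skew
print(skew(np.log(x)))   # log of lognormal data is normal: skew near 0
```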
Data Reduction Strategies
- Data reduction is important because complex data analysis/mining can take a very long time to run on large data sets.
- Obtains a reduced representation of the dataset while preserving the same analytical results
- Achieved with dimensionality reduction or sampling
Dimensionality Reduction
- Feature selection: selecting the minimum set of features useful for data mining.
- Reduces the number of patterns, making them easier to understand.
Sampling
- Sampling allows a mining algorithm to handle large amounts of data by choosing a representative subset
- Simple random sampling can have very poor performance in the presence of skew
- Stratified or adaptive sampling offers a better approximation of the true percentage of each class in the overall database.
- Stratified sampling is therefore well suited to skewed data (see the sketch below).
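A minimal standard-library sketch of the three schemes; the population, the strata, and the 10% sampling rate are illustrative assumptions:

```python
import random

population = list(range(100))

# SRSWOR: simple random sampling without replacement (no duplicates possible)
srswor = random.sample(population, k=10)

# SRSWR: simple random sampling with replacement (the same item can recur)
srswr = random.choices(population, k=10)

# Stratified sampling: draw from each class (stratum) in proportion to its size,
# so a rare class is not lost the way it can be under simple random sampling.
strata = {"common": list(range(95)), "rare": list(range(95, 100))}
sample = []
for label, members in strata.items():
    k = max(1, round(len(members) * 0.10))  # keep ~10% of every stratum
    sample.extend(random.sample(members, k))
```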