Questions and Answers
In the context of data mining, why is it crucial to distinguish between data, information, and knowledge?
- Because information is obsolete in modern data mining.
- Because each represents a different level of abstraction and understanding, impacting how it is used. (correct)
- Because data is always more valuable than knowledge.
- Because they are interchangeable terms that can confuse the analysis process.
What is the primary characteristic of nominal attributes?
- They can be measured on a continuous scale.
- Their values represent categories or names with no inherent order. (correct)
- They always have a binary (0 or 1) value.
- Mathematical operations can be meaningfully performed on them.
In the context of binary attributes, what does it mean for an attribute to be asymmetric?
- The outcomes of the states are equally important.
- It cannot be used as a Boolean variable.
- Both of its states (0 or 1) are equally valuable and carry the same weight.
- The outcomes of the states are not equally important. (correct)
Which measure of central tendency is most susceptible to being skewed by extreme values (outliers) in a dataset?
Why is the median often preferred over the mean when describing the central tendency of a skewed dataset?
How are quartiles used to assess the spread and shape of a data distribution?
What does a boxplot typically display?
Under what condition is the standard deviation equal to zero?
Considering the knowledge pyramid, which of the following is closest to the 'wisdom' component?
Which of the following best describes the role of data engineering in the context of diverse data types?
What distinguishes interval-scaled attributes from ratio-scaled attributes?
What is the primary purpose of statistical description of data?
A dataset contains salary information for employees. If the mean salary is significantly higher than the median salary, what can you infer about the distribution of salaries?
How is the interquartile range (IQR) calculated?
What is the purpose of 'trimming' a dataset when calculating the mean?
What is the empirical relationship between mean, median, and mode for moderately skewed unimodal data?
How does an outlier affect a boxplot?
What does a high standard deviation indicate about a dataset?
What is a 'quantile' in the context of data distribution?
In the formula for calculating the median from grouped data, what does width represent?
Consider a dataset of customer ages. If some customers' ages are not recorded, which measure of central tendency might be least affected by the missing data?
Which of the following methods is best for comparing the distribution of unit prices across several branches of a store?
Why is it often unrealistic to expect a single ML system to handle all types of data?
What does the 'range' signify as a measure of data dispersion?
In the context of data objects, what is an attribute vector?
Which statistical measure is most suitable for finding the central tendency in data that is grouped into intervals, especially when the exact values are unknown?
What is a key advantage of using boxplots for outlier detection?
An analyst observes that the mean test score of a class is strongly influenced by a few students with exceptionally high scores. Which of the following actions would best mitigate this effect when reporting typical performance?
If a unimodal dataset is asymmetrical or skewed, what does this imply about the relationship between measures of central tendency?
Flashcards
What is Data?
Discrete, objective facts about a task or representations of a phenomenon.
Knowledge Pyramid
A hierarchy of data, information, knowledge, and wisdom; distinguishing between these levels is central to data mining.
Data Objects and Attributes
Data sets are made up of data objects, which are described by attributes.
What is an Attribute?
Types of Attributes
Nominal Attributes
Binary attributes
Ordinal Attributes
Numeric Attributes
Interval-Scaled Attributes
Ratio-Scaled Attributes
Statistical Description
Central Tendency
Dispersion
Midrange
Median
Mode
4-Quantiles
Outliers
Five-Number Summary
Boxplots
Variance and Standard Deviation
Study Notes
- DAT203 AI and Machine Learning, Unit III: Data Engineering
What is Data
- Data are discrete, objective facts about a task, or representations of a phenomenon
- Application systems generate a spectrum of new data types, like structured data and data warehouse data
- Organisational data infrastructure becomes overwhelmed with various data types, such as semi-structured and unstructured data
- Current ML systems cannot handle all types of data, because both the data types and the intended goals of data mining tasks are so diverse
- Discovering knowledge from structured, semi-structured or unstructured interconnected data poses challenges for turning raw data into value
- Domain-dedicated or application-dedicated data mining systems are built for in-depth mining of specific kinds of data
Knowledge Pyramid
- Distinguishing between data, information, and knowledge is central to data mining
- Although the terms are often used interchangeably, data, information, and knowledge are semantically different
Data Objects and Attribute Types
- Data sets are made up of data objects, described by attributes
- A data object represents an entity; in a sales database, the objects may be customers, store items, or sales
- Data objects can be referred to as samples, examples, instances, or data points/objects
- An attribute is a data field representing a characteristic of a data object; the term is used interchangeably with dimension, feature, and variable
- Dimension is used in data warehousing, and feature is used in machine learning, while statisticians prefer variable
- Customer objects can include attributes such as customer_id, name, and address
- Observed values for a given attribute are known as observations
- A set of attributes used to describe a given object is called an attribute vector (or feature vector), that is, one row of variable values
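As a minimal sketch of the terminology above, a customer data object can be modelled as a record whose fields are its attributes. The `Customer` class and the sample values are hypothetical and serve only as an illustration; the attribute names customer_id, name, and address come from the notes.

```python
from dataclasses import dataclass, astuple


@dataclass
class Customer:
    """A data object; each field is one attribute (dimension, feature, variable)."""
    customer_id: int
    name: str
    address: str


# Hypothetical sample values, for illustration only.
customer = Customer(customer_id=1, name="Ada", address="12 Example Street")

# The attribute (feature) vector is the row of observed values for this object.
attribute_vector = astuple(customer)
print(attribute_vector)
```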
Attribute Types
- An attribute's type is determined by the set of possible values it can take: nominal, binary, ordinal, or numeric
Nominal Attributes
- Nominal means "relating to names": the values are symbols or names of things, each representing some kind of category, code, or state
- Also called categorical attributes; their values have no meaningful order
- Examples: hair_colour and marital_status
- The symbols or names can also be represented with numbers (codes)
- Such numbers should not be used quantitatively
Binary Attributes
- A nominal attribute with only two categories or states: 0 or 1
- 0 typically means that the attribute is absent, and 1 means it is present
- Binary attributes are called Boolean when the two states correspond to true and false
- A binary attribute is symmetric if both states are equally valuable and carry the same weight
- A binary attribute is asymmetric if the outcomes of its states are not equally important
Ordinal Attributes
- Possible values have a meaningful order or ranking
- The magnitude of the difference between successive values is not known
- For example, drink_size values have a meaningful sequence, but the values alone cannot tell us how much bigger "large" is than "medium"
- Ordinal attributes are useful for subjective assessments of qualities that cannot be measured objectively
- The central tendency of an ordinal attribute can be represented by its mode and its median, but the mean cannot be defined
- Nominal, binary, and ordinal attributes are qualitative
- Drink size (small, medium, large), grades, professional ranks, army ranks, and satisfaction levels are all examples of ordinal attributes
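The point that an ordinal attribute has a mode and a median but no mean can be sketched in Python. The ranking `ORDER` and the observations are hypothetical; `median_low` is used because it always returns an observed rank rather than an average of two ranks.

```python
from statistics import mode, median_low

# Assumed ranking of the ordinal attribute drink_size, smallest first.
ORDER = ["small", "medium", "large"]
rank = {label: position for position, label in enumerate(ORDER)}

# Hypothetical observations.
sizes = ["small", "large", "medium", "medium", "large", "medium", "small"]

most_common = mode(sizes)

# Map labels to ranks, take the middle rank, and map it back to a label.
middle = ORDER[median_low([rank[s] for s in sizes])]

print(most_common, middle)
# A mean of the ranks would assume equal spacing between sizes, which is unknown.
```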
Numeric Attributes
- Quantitative: a measurable quantity represented by integer or real values
- Can be interval or ratio scaled
Interval Scaled
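- Measured on a scale of equal-size units
- Values are ordered, and the difference between two values can be computed and compared
- Values can be positive, 0, or negative
- Temperature in Celsius or Fahrenheit is an interval-scaled attribute: neither scale has a true zero point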
Ratio Scaled
- A numeric attribute with an inherent (true) zero point, indicating the complete absence of the quantity
- A value can be expressed as a multiple (or ratio) of another value
- Values are ordered, and differences between values can be computed
- Years of experience, number of words in a document, and monetary quantities are all examples of ratio-scaled attributes
- Height and weight are ratio-scaled: they are measured from 0 and never fall below it
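The difference between the two scales can be illustrated with temperature. This is a small sketch with hypothetical readings; the Kelvin conversion is not part of the notes and is included only because Kelvin has a true zero point, which makes ratios meaningful.

```python
def celsius_to_kelvin(celsius: float) -> float:
    """Convert to Kelvin, a temperature scale with a true zero point."""
    return celsius + 273.15


a_c, b_c = 10.0, 20.0  # hypothetical readings in degrees Celsius

# Differences are meaningful on an interval scale.
diff = b_c - a_c

# The Celsius ratio is 2.0, but 20 °C is not "twice as warm" as 10 °C,
# because 0 °C is an arbitrary point rather than an absence of temperature.
naive_ratio = b_c / a_c

# On the ratio-scaled Kelvin scale the ratio is meaningful (about 1.035).
kelvin_ratio = celsius_to_kelvin(b_c) / celsius_to_kelvin(a_c)

print(diff, naive_ratio, round(kelvin_ratio, 3))
```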
Statistical Description of Data
- Statistical description can be used to identify data properties
- Measures of central tendency locate the middle or centre of a data distribution, that is, where most values fall
- Measures of dispersion describe how the data are spread out
- Common dispersion measures are the range, quartiles, and interquartile range; the five-number summary and boxplots; and the variance and standard deviation of the data
- These descriptions help identify which data values should be treated as noise or outliers
Measuring the Central Tendency
- Suppose an attribute X, such as salary, has been recorded for a set of objects. Where would most of its values fall?
- Central tendency is measured using the mean, median, mode, and midrange
- The most common and effective numeric measure of the centre of a data set is the arithmetic mean
- It corresponds to the SQL aggregate function avg()
- Each value in a data set can be associated with a weight that reflects its significance or importance
- The result is called the weighted arithmetic mean (weighted average)
- Although the mean is the single most useful quantity, it is not always the best way of measuring the centre of the data
- The mean is sensitive to extreme (outlier) values, which skew it
- The trimmed mean is the mean obtained after removing values at the high and low extremes
- Avoid trimming too large a portion at both ends, as this can discard valuable information
- For skewed (asymmetric) data, the median is a better measure of the centre
- The median is the middle value in a set of ordered data values
- If the number of observations is even, the median is not unique: it can be any value between the two middlemost values
- By convention, we assign the average of the two middlemost values as the median
- If the number of observations is odd, the median is the middlemost value
- The median is expensive to compute when there is a large number of observations
- For numeric attributes, the median can be approximated cheaply
- Assume the data are grouped into intervals and the frequency (count of values) of each interval is known
- Identify the median interval, that is, the interval that contains the median frequency
- The median of the whole data set can then be approximated by interpolation within that interval
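The measures above can be sketched in Python. The function names and all sample values are hypothetical. The notes do not spell out the interpolation formula; `grouped_median` assumes the standard textbook form, median ≈ L1 + ((N/2 − cumulative frequency below the median interval) / frequency of the median interval) × width, where L1 is the lower boundary of the median interval and width is its width.

```python
from statistics import mean, median


def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the sorted values at each end."""
    data = sorted(values)
    k = int(len(data) * proportion)  # number of values cut from each end
    return mean(data[k:len(data) - k])


def grouped_median(lower, n, cum_freq_below, freq_median, width):
    """Interpolated median for grouped data.

    lower          : lower boundary of the median interval (L1)
    n              : total number of values (N)
    cum_freq_below : summed frequency of all intervals below the median interval
    freq_median    : frequency of the median interval
    width          : width of the median interval
    """
    return lower + (n / 2 - cum_freq_below) / freq_median * width


# Hypothetical salaries (in thousands) with one extreme value.
salaries = [30, 32, 35, 36, 40, 42, 45, 50, 55, 235]

print(mean(salaries))                # 60, pulled upward by the outlier 235
print(median(salaries))              # 41.0, average of the two middle values
print(trimmed_mean(salaries, 0.10))  # 41.875, after dropping 30 and 235
print(weighted_mean([70, 80, 90], [1, 1, 2]))  # 82.5

# Hypothetical grouped data: [20, 30) holds 5 values, [30, 40) holds 10,
# and [40, 50) holds 5. N = 20, so the median interval is [30, 40).
print(grouped_median(lower=30, n=20, cum_freq_below=5, freq_median=10, width=10))  # 35.0
```

In this example the mean (60) sits well above the median (41.0), which is the typical signature of a distribution skewed by a few very large values.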
Mode
- The mode is the value that occurs most frequently in the data set
- Applies to both qualitative and quantitative attributes
- The greatest frequency may correspond to several different values, resulting in more than one mode
- Data sets with one, two, or three modes are called unimodal, bimodal, and trimodal respectively
- In general, a data set with two or more modes is multimodal
- For unimodal numeric data that are moderately skewed, the following empirical relation holds: mean - mode ≈ 3 * (mean - median)
- So for a moderately skewed unimodal frequency curve, the mode can be approximated when the mean and median are known
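A short sketch of both ideas, using hypothetical values: `statistics.multimode` returns every most-frequent value, and `approx_mode` (a hypothetical helper) simply rearranges the empirical relation above to solve for the mode.

```python
from statistics import multimode


def approx_mode(mean_value, median_value):
    """Rearranged from: mean - mode ≈ 3 * (mean - median)."""
    return mean_value - 3 * (mean_value - median_value)


# Hypothetical data in which 52 and 70 each occur twice: a bimodal data set.
modes = multimode([52, 70, 30, 52, 70, 47])
print(modes)  # [52, 70]

# With a mean of 58 and a median of 54, the relation estimates the mode as 46.
print(approx_mode(58, 54))  # 46
```

The estimate is only an approximation and applies to moderately skewed unimodal data; it is not a substitute for counting frequencies when the raw values are available.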
Midrange
- The midrange assesses central tendency as the average of the largest and smallest values in the data set
- It can be computed using the SQL aggregate functions max() and min()
- In a unimodal frequency curve with a perfectly symmetric data distribution, the mean, median, and mode all fall at the same centre value
- In most real applications, data are not symmetric; they are skewed, and the mean, median, and mode differ
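The contrast between symmetric and skewed data can be checked numerically. This is a minimal sketch with hypothetical values; `midrange` is a helper defined here, not a library function.

```python
from statistics import mean, median, mode


def midrange(values):
    """Average of the largest and smallest values."""
    return (max(values) + min(values)) / 2


symmetric = [1, 2, 3, 3, 3, 4, 5]   # hypothetical, perfectly symmetric
skewed = [1, 2, 3, 3, 3, 4, 40]     # same data with one extreme value

for data in (symmetric, skewed):
    print(mean(data), median(data), mode(data), midrange(data))
# symmetric: 3 3 3 3.0   (all four measures coincide)
# skewed:    8 3 3 20.5  (mean and midrange are pulled toward the extreme value)
```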
Measure of Dispersion of Data
- Common measures are the range, quartiles, and the interquartile range
- Let x1, x2, ..., xN be a set of observations for some numeric attribute X
- The range is the difference between the largest and smallest values
- With the data sorted in ascending order, we can pick data points that split the distribution into equal-size consecutive sets
- These data points are called quantiles
- Quantiles are points taken at regular intervals of a data distribution, dividing it into essentially equal-size consecutive sets
- The kth q-quantile is the value x such that at most k/q of the data values are less than x and at most (q - k)/q are greater than x
- k is an integer with 0 < k < q, so there are q - 1 q-quantiles
- The 2-quantile is the data point dividing the lower and upper halves of the distribution
- It corresponds to the median
- The 4-quantiles are the three data points that split the distribution into four equal parts
- Each part holds one quarter of the data
- They are more commonly called quartiles
Quartiles
- The 100-quantiles, which split the data into 100 equal-size consecutive sets, are known as percentiles
- Quartiles indicate a distribution's centre, spread, and shape
- The first quartile (Q1) is the 25th percentile; it cuts off the lowest 25% of the data
- The second quartile (Q2) is the 50th percentile, which is the median
- The third quartile (Q3) is the 75th percentile; it cuts off the lowest 75% (or highest 25%) of the data
- The distance between the first and third quartiles gives the range covered by the middle half of the data, called the interquartile range (IQR)
- IQR = Q3 - Q1
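Quartiles and the IQR can be computed with Python's standard library, as in the sketch below. The data are hypothetical. Note that several conventions exist for locating quantiles in small samples, so other tools may return slightly different cut points; this example uses the default method of `statistics.quantiles`.

```python
from statistics import quantiles

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]  # hypothetical observations

# n=4 requests the three cut points of the 4-quantiles (quartiles).
q1, q2, q3 = quantiles(data, n=4)
iqr = q3 - q1

print(q1, q2, q3)  # 3.0 6.0 9.0
print(iqr)         # 6.0

# n=100 would return the 99 percentile cut points in the same way.
```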
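Outlier Calculation
- Sort the elements of the list in ascending order
- Find the median; when the number of elements is even, it is the mean of the two central elements
- Find Q1 and Q3 in the same way, as the medians of the lower and upper halves (for a half with six elements, this is the mean of the 3rd and 4th elements)
- Apply the IQR rule to Q1 and Q3 to identify suspected outliers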
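The IQR rule can be sketched as follows. This is an illustration under stated assumptions, not a definitive procedure: it assumes the common rule of thumb that flags values lying at least 1.5 × IQR above Q3 or below Q1, and it takes the quartiles from the default method of `statistics.quantiles`, which may differ slightly from a hand calculation on small samples. The function name and data are hypothetical.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return the values lying at least k * IQR beyond the quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low_fence = q1 - k * iqr
    high_fence = q3 + k * iqr
    return [x for x in values if x <= low_fence or x >= high_fence]


# Hypothetical data with one suspicious value.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 50]

# Here Q1 = 3 and Q3 = 9, so IQR = 6 and the fences are -6 and 18.
print(iqr_outliers(data))  # [50]
```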
Five-number Summary
- Used as the basis for boxplots and for visualising outliers
- Rule of thumb for identifying suspected outliers: single out values that fall at least 1.5 × IQR above the third quartile or below the first quartile
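- The five-number summary is written in the order minimum, first quartile (Q1), median, third quartile (Q3), maximum
Boxplots
- A boxplot displays the five-number summary graphically: the ends of the box are at the quartiles, so the box length is the interquartile range
- The line within the box marks the median
- Two whiskers extend outside the box towards the smallest and largest observations
- The whiskers reach the extreme low and high observations only if these lie less than 1.5 × IQR beyond the quartiles; otherwise, the whiskers terminate at the most extreme observations within 1.5 × IQR of the quartiles, and the remaining values are plotted individually as outliers
- Boxplots can be used to compare several sets of compatible data
Variance and Standard Deviation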
- Variance and standard deviation are measures of the dispersion of data
- A low standard deviation means the observations tend to be very close to the mean
- A high standard deviation means the data are spread out over a large range of values
- The standard deviation σ of the observations is the square root of the variance σ²
- σ measures spread about the mean and should be used only when the mean is chosen as the measure of centre
- σ = 0 only when there is no spread, that is, when all observations have the same value; otherwise σ > 0
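These properties can be verified with a short sketch. The data are hypothetical, and the example assumes the population form of the variance (dividing by N) via `statistics.pvariance` and `statistics.pstdev`; the sample form (dividing by N - 1) is available as `variance` and `stdev`.

```python
from math import sqrt
from statistics import pstdev, pvariance

spread = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations, mean = 5
constant = [5, 5, 5, 5]            # every observation is identical

variance = pvariance(spread)  # mean of squared deviations from the mean
sigma = pstdev(spread)        # square root of the variance

print(variance, sigma)   # 4 2.0
print(pstdev(constant))  # 0.0, because there is no spread
```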