DAT203 AI/ML Unit III: Data Engineering

Questions and Answers

In the context of data mining, why is it crucial to distinguish between data, information, and knowledge?

Because information is obsolete in modern data mining.
Because each represents a different level of abstraction and understanding, impacting how it is used. (correct)
Because data is always more valuable than knowledge.
Because they are interchangeable terms that can confuse the analysis process.

What is the primary characteristic of nominal attributes?

They can be measured on a continuous scale.
Their values represent categories or names with no inherent order. (correct)
They always have a binary (0 or 1) value.
Mathematical operations can be meaningfully performed on them.

In the context of binary attributes, what does it mean for an attribute to be asymmetric?

The outcomes of the states are equally important.
It cannot be used as a Boolean variable.
Both of its states (0 or 1) are equally valuable and carry the same weight.
The outcomes of the states are not equally important. (correct)

Which measure of central tendency is most susceptible to being skewed by extreme values (outliers) in a dataset?

Mean

Why is the median often preferred over the mean when describing the central tendency of a skewed dataset?

The median is far less sensitive to extreme values than the mean.
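The two answers above can be checked with a minimal Python sketch that uses only the standard library. The salary figures are made up for illustration and do not come from the lesson.

```python
from statistics import mean, median

# Hypothetical salaries in thousands (illustrative values, not from the lesson).
salaries = [30, 32, 35, 36, 40]
with_outlier = salaries + [400]  # add one extreme value

# How far each measure moves once the outlier is included.
mean_shift = mean(with_outlier) - mean(salaries)
median_shift = median(with_outlier) - median(salaries)

print(f"mean:   {mean(salaries):.1f} -> {mean(with_outlier):.1f}")
print(f"median: {median(salaries):.1f} -> {median(with_outlier):.1f}")
```

In this sample the single extreme value drags the mean up by roughly 61, while the median moves by only 0.5.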

How are quartiles used to assess the spread and shape of a data distribution?

They divide the data into four equal parts, indicating the spread and skewness around the median.

What does a boxplot typically display?

Minimum, first quartile, median, third quartile, and maximum values.
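As a rough sketch, the five values a boxplot displays can be computed with Python's standard `statistics` module. The helper name `five_number_summary` and the sample data are hypothetical, and the quartile convention (`method="inclusive"`) is an assumption; other conventions can give slightly different Q1 and Q3.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a sample of two or more values."""
    ordered = sorted(values)
    # quantiles(..., n=4) returns the three cut points Q1, median, Q3.
    # method="inclusive" is one convention among several (an assumption here).
    q1, q2, q3 = quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, q2, q3, ordered[-1]

# Hypothetical data (illustrative values only).
summary = five_number_summary([1, 2, 3, 4, 5, 6, 7, 8, 9])
print(summary)
```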

Under what condition is the standard deviation equal to zero?

When all observations in the dataset have the same value.
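One way to see why, assuming the usual population form of the standard deviation for N observations with mean \bar{x} (the sample form divides by N - 1 and leads to the same conclusion):

```latex
\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(x_i - \bar{x}\right)^2}
\qquad\text{so}\qquad
\sigma = 0 \iff x_i = \bar{x}\ \text{for every } i
```

Each squared term is non-negative, so the sum can only be zero when every observation equals the mean, that is, when all observations are identical.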

Considering the knowledge pyramid, which of the following is closest to the 'wisdom' component?

Knowing 'why' certain phenomena occur.

Which of the following best describes the role of data engineering in the context of diverse data types?

To manage and transform various data types, facilitating effective data mining despite heterogeneity.

What distinguishes interval-scaled attributes from ratio-scaled attributes?

Ratio-scaled attributes have an absolute zero point, while interval-scaled attributes do not.
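A small Python sketch may make the distinction concrete. Temperature is used as the example here (an illustration chosen for this note, not taken from the lesson): Celsius is interval-scaled, while Kelvin has an absolute zero and is ratio-scaled.

```python
def celsius_to_kelvin(celsius):
    """Convert Celsius (interval scale) to Kelvin (ratio scale, absolute zero at 0 K)."""
    return celsius + 273.15

cool, warm = 10.0, 20.0  # two temperatures in degrees Celsius

# Differences are meaningful on both scales and agree with each other.
diff_celsius = warm - cool
diff_kelvin = celsius_to_kelvin(warm) - celsius_to_kelvin(cool)

# Ratios are only meaningful on the ratio scale.
ratio_celsius = warm / cool  # 2.0, but 20 °C is not "twice as hot" as 10 °C
ratio_kelvin = celsius_to_kelvin(warm) / celsius_to_kelvin(cool)  # about 1.035

print(diff_celsius, round(diff_kelvin, 2), ratio_celsius, round(ratio_kelvin, 3))
```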

What is the primary purpose of statistical description of data?

To identify general properties of the data, such as central tendency and dispersion.

A dataset contains salary information for employees. If the mean salary is significantly higher than the median salary, what can you infer about the distribution of salaries?

The distribution is positively skewed.

How is the interquartile range (IQR) calculated?

By subtracting the first quartile (Q1) from the third quartile (Q3).

What is the purpose of 'trimming' a dataset when calculating the mean?

To reduce the effect of extreme values on the mean.
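A minimal sketch of trimming, assuming the same fraction of values is dropped from each end of the sorted data. The function name `trimmed_mean` and the sample scores are hypothetical.

```python
from statistics import mean

def trimmed_mean(values, fraction):
    """Mean after dropping `fraction` of the values from each end of the sorted data.

    `fraction` must satisfy 0 <= fraction < 0.5.
    """
    if not 0 <= fraction < 0.5:
        raise ValueError("fraction must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * fraction)  # number of values to drop at each end
    return mean(ordered[k:len(ordered) - k])

scores = [1, 2, 3, 4, 100]           # hypothetical data with one extreme value
plain = mean(scores)                 # the extreme value pulls this up
trimmed = trimmed_mean(scores, 0.2)  # drops 1 and 100, averages [2, 3, 4]
print(plain, trimmed)
```

Trimming too large a fraction discards legitimate data, so the fraction is usually kept small.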

What is the empirical relationship between mean, median, and mode for moderately skewed unimodal data?

$\text{mean} - \text{mode} \approx 3 \times (\text{mean} - \text{median})$
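Rearranging this relation gives a quick estimate of the mode when only the mean and median are known:

```latex
\text{mean} - \text{mode} \approx 3\,(\text{mean} - \text{median})
\quad\Longrightarrow\quad
\text{mode} \approx 3\,\text{median} - 2\,\text{mean}
```

For example, with illustrative values mean = 58 and median = 54, the estimate is mode ≈ 3(54) - 2(58) = 46. This is only an approximation and applies to moderately skewed unimodal data.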

How does an outlier affect a boxplot?

It is plotted as an individual point beyond the 'whiskers'.
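The sketch below shows one common way such points are identified: a value is flagged when it lies more than 1.5 × IQR below Q1 or above Q3, where IQR = Q3 - Q1. The 1.5 multiplier is a widely used convention assumed here, not something the lesson specifies; the function name `iqr_outliers` and the prices are hypothetical.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR below Q1 or above Q3.

    k=1.5 is a common boxplot convention, assumed here; the lesson does not
    specify a multiplier.
    """
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1  # interquartile range: Q3 minus Q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical unit prices with one extreme value.
prices = [10, 12, 13, 14, 15, 16, 18, 60]
flagged = iqr_outliers(prices)
print(flagged)
```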

What does a high standard deviation indicate about a dataset?

The data points are spread out over a large range of values.
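As an illustration, the two hypothetical samples below share the same mean but differ sharply in spread. The sketch uses the population standard deviation (`pstdev`) from Python's standard library.

```python
from statistics import mean, pstdev

tight = [48, 49, 51, 52]   # clustered near the mean of 50
spread = [10, 30, 70, 90]  # same mean of 50, widely dispersed

for name, data in [("tight", tight), ("spread", spread)]:
    print(f"{name:6s} mean = {mean(data):.1f}  pstdev = {pstdev(data):.2f}")
```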

What is a 'quantile' in the context of data distribution?

A point taken at regular intervals of a data distribution, dividing it into equal-size consecutive sets.

In the formula for calculating the median from grouped data, what does width represent?

The width of the median interval (class size).
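The formula itself is not reproduced in this quiz, so the sketch below assumes the common textbook interpolation form: median ≈ L1 + ((N/2 - cumulative frequency below the median interval) / frequency of the median interval) × width, where L1 is the lower boundary of the median interval. The function name `grouped_median` and the class data are hypothetical.

```python
def grouped_median(classes):
    """Approximate the median of grouped data by linear interpolation.

    `classes` is a list of (lower_boundary, width, frequency) tuples,
    sorted by lower_boundary. Assumes the common textbook formula:
        median ~ L1 + ((N/2 - cum_freq_below) / freq_median) * width
    """
    n = sum(freq for _, _, freq in classes)
    if n == 0:
        raise ValueError("classes must contain at least one observation")
    half = n / 2
    cumulative = 0  # total frequency of all intervals below the current one
    for lower, width, freq in classes:
        if cumulative + freq >= half:
            # This is the median interval: interpolate within it.
            return lower + (half - cumulative) / freq * width
        cumulative += freq

# Hypothetical age groups: (lower boundary, width, frequency).
ages = [(0, 10, 5), (10, 10, 8), (20, 10, 12), (30, 10, 5)]
estimate = grouped_median(ages)
print(round(estimate, 2))
```

Here N = 30, so the median interval is the one starting at 20 (cumulative frequency below it is 13), and the estimate is 20 + (15 - 13) / 12 × 10.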

Consider a dataset of customer ages. If some customers' ages are not recorded, which measure of central tendency might be least affected by the missing data?

Median

Which of the following methods is best for comparing the distribution of unit prices across several branches of a store?

Creating boxplots for each branch.

Why is it often unrealistic to expect a single ML system to handle all types of data?

Due to the diversity of data types and intended goals of data mining tasks.

What does the 'range' signify as a measure of data dispersion?

The difference between the largest and smallest values in a dataset.

In the context of data objects, what is an attribute vector?

A row of variable values: the set of attribute values that describes a single object.

Which statistical measure is most suitable for finding the central tendency in data that is grouped into intervals, especially when the exact values are unknown?

The median.

What is a key advantage of using boxplots for outlier detection?

They visually represent the IQR, helping to quickly identify points far from the central cluster.

An analyst observes that the mean test score of a class is strongly influenced by a few students with exceptionally high scores. Which of the following actions would best mitigate this effect when reporting typical performance?

Report the median and trimmed mean.

If a unimodal dataset is asymmetrical or skewed, what does this imply about the relationship between measures of central tendency?

Its mean, median, and mode will all be distinct.

Flashcards

What is Data?

Discrete, objective facts about a task or representations of a phenomenon.

Knowledge Pyramid

A hierarchy used in data mining to distinguish four levels of understanding: data, information, knowledge, and wisdom.