DAT203 AI/ML Unit III: Data Engineering

Questions and Answers

In the context of data mining, why is it crucial to distinguish between data, information, and knowledge?

  • Because information is obsolete in modern data mining.
  • Because each represents a different level of abstraction and understanding, impacting how it is used. (correct)
  • Because data is always more valuable than knowledge.
  • Because they are interchangeable terms that can confuse the analysis process.

What is the primary characteristic of nominal attributes?

  • They can be measured on a continuous scale.
  • Their values represent categories or names with no inherent order. (correct)
  • They always have a binary (0 or 1) value.
  • Mathematical operations can be meaningfully performed on them.

In the context of binary attributes, what does it mean for an attribute to be asymmetric?

  • The outcomes of the states are equally important.
  • It cannot be used as a Boolean variable.
  • Both of its states (0 or 1) are equally valuable and carry the same weight.
  • The outcomes of the states are not equally important. (correct)

Which measure of central tendency is most susceptible to being skewed by extreme values (outliers) in a dataset?

  • Mean (correct)

Why is the median often preferred over the mean when describing the central tendency of a skewed dataset?

  • The median is not influenced by extreme values. (correct)

How are quartiles used to assess the spread and shape of a data distribution?

  • They divide the data into four equal parts, indicating the spread and skewness around the median. (correct)

What does a boxplot typically display?

  • Minimum, first quartile, median, third quartile, and maximum values. (correct)

Under what condition is the standard deviation equal to zero?

  • When all observations in the dataset have the same value. (correct)

Considering the knowledge pyramid, which of the following is closest to the 'wisdom' component?

  • Knowing 'why' certain phenomena occur. (correct)

Which of the following best describes the role of data engineering in the context of diverse data types?

  • To manage and transform various data types, facilitating effective data mining despite heterogeneity. (correct)

What distinguishes interval-scaled attributes from ratio-scaled attributes?

  • Ratio-scaled attributes have an absolute zero point, while interval-scaled attributes do not. (correct)

What is the primary purpose of statistical description of data?

  • To identify general properties of the data, such as central tendency and dispersion. (correct)

A dataset contains salary information for employees. If the mean salary is significantly higher than the median salary, what can you infer about the distribution of salaries?

  • The distribution is positively skewed. (correct)

How is the interquartile range (IQR) calculated?

  • By subtracting the first quartile (Q1) from the third quartile (Q3). (correct)

What is the purpose of 'trimming' a dataset when calculating the mean?

  • To reduce the effect of extreme values on the mean. (correct)

What is the empirical relationship between mean, median, and mode for moderately skewed unimodal data?

  • mean - mode ≈ 3 × (mean - median) (correct)

How does an outlier affect a boxplot?

  • It is plotted as an individual point beyond the 'whiskers'. (correct)

What does a high standard deviation indicate about a dataset?

  • The data points are spread out over a large range of values. (correct)

What is a 'quantile' in the context of data distribution?

  • A point taken at regular intervals of a data distribution, dividing it into equal-size consecutive sets. (correct)

In the formula for calculating the median from grouped data, what does width represent?

  • The width of the median interval (class size). (correct)

Consider a dataset of customer ages. If some customers' ages are not recorded, which measure of central tendency might be least affected by the missing data?

  • Median (correct)

Which of the following methods is best for comparing the distribution of unit prices across several branches of a store?

  • Creating boxplots for each branch. (correct)

Why is it often unrealistic to expect a single ML system to handle all types of data?

  • Due to the diversity of data types and intended goals of data mining tasks. (correct)

What does the 'range' signify as a measure of data dispersion?

  • The difference between the largest and smallest values in a dataset. (correct)

In the context of data objects, what is an attribute vector?

  • A fancy way of saying 'row of variable values' representing a set of attributes describing an object. (correct)

Which statistical measure is most suitable for finding the central tendency in data that is grouped into intervals, especially when the exact values are unknown?

  • The median. (correct)

What is a key advantage of using boxplots for outlier detection?

  • They visually represent the IQR, helping to quickly identify points far from the central cluster. (correct)

An analyst observes that the mean test score of a class is strongly influenced by a few students with exceptionally high scores. Which of the following actions would best mitigate this effect when reporting typical performance?

  • Report the median and trimmed mean. (correct)

If a unimodal dataset is asymmetrical or skewed, what does this imply about the relationship between measures of central tendency?

  • Its mean, median, and mode will all be distinct. (correct)

Flashcards

What is Data?

Discrete, objective facts about a task or representations of a phenomenon.

Knowledge Pyramid

Distinguishes between data, information, and knowledge in data mining: Data, Information, Knowledge and Wisdom.

Data Objects and Attributes

Data sets are made up of data objects, which are described by attributes.

What is an Attribute?

A data field representing a characteristic or feature of a data object (dimension in warehousing).

Types of Attributes

Attribute types are determined by the set of possible values: nominal, binary, ordinal, or numeric.

Nominal Attributes

Nominal means relating to names. Values are symbols or names of things, each representing a category or code.

Binary attributes

A nominal attribute with only two categories or states: 0 or 1. Referred to as Boolean if the two states correspond to true or false.

Ordinal Attributes

Attribute where values have a meaningful order or ranking, but the magnitude between successive values is not known.

Numeric Attributes

Quantitative attribute as integer or real values, either interval-scaled or ratio-scaled.

Interval-Scaled Attributes

Measures on a scale of equal-size units; distance between two points is equal, values can be positive, 0, or negative.

Ratio-Scaled Attributes

Numeric attribute with an actual zero-point, meaning complete absence of measurable variable.

Statistical Description

Identifies properties of the data such as location (central tendency) and spread (dispersion).

Central Tendency

Measures the location of the middle or centre of a distribution.

Dispersion

Measures how data are spread out from the middle (range, quartiles).

Midrange

Average of the largest and smallest values in a data set.

Median

The middle value in a set of ordered data values. A better measure of centre than the mean for skewed data.

Mode

The value that occurs most frequently in the set.

4-Quantiles

Split the data distribution into four equal parts. More commonly referred to as quartiles.

Outliers

Values that are very different from the others and fall outside the normal expected range. A common practice is to flag values that lie at least 1.5 × IQR above the third quartile or below the first quartile as suspected outliers.

Five-Number Summary

Consists of the minimum, Q1, median (Q2), Q3, and maximum values, written in that order.

Boxplots

Visualize data distribution using box (interquartile range) and lines (whiskers).

Variance and Standard Deviation

Measures of data dispersion. High values show that the data are spread out over a large range; low values show that the data lie close to the mean.

Study Notes

  • DAT203 AI and Machine Learning, Unit III- Data Engineering

What is Data

  • Data are discrete, objective facts about a task, or representations of a phenomenon
  • Application systems generate a spectrum of new data types, like structured data and data warehouse data
  • Organisational data infrastructure becomes overwhelmed with various data types, such as semi-structured and unstructured data
  • Current ML systems can't deal with all types of data because of the diversity of data and intended goals of data mining tasks
  • Discovering knowledge from structured, semi-structured or unstructured interconnected data poses challenges for turning raw data into value
  • Data mining systems that are domain or application dedicated are constructed for granular mining of specific kinds of data

Knowledge Pyramid

  • Distinguishing between data, information, and knowledge is central to data mining
  • Although often used as synonyms, data, information, and knowledge are semantically different

Data Objects and Attribute Types

  • Data sets are made up of data objects, described by attributes
  • Data objects represent an entity which can be customers, store items, or sales in a sales database
  • Data objects can be referred to as samples, examples, instances, or data points/objects
  • An attribute is a data field representing a characteristic of a data object; the term is used interchangeably with dimension, feature, and variable
  • Dimension is used in data warehousing, and feature is used in machine learning, while statisticians prefer variable
  • Customer objects can include attributes such as customer_id, name, and address
  • Observed values for a given attribute are known as observations
  • The set of attributes used to describe a given object is called an attribute vector (or feature vector), that is, a row of variable values

Attribute Types

  • The type of an attribute is determined by its set of possible values: nominal, binary, ordinal, or numeric

Nominal Attributes

  • Nominal means relating to names: such attributes take symbols or names of things as values, each representing a category, code, or state
  • Also referred to as categorical; their values have no meaningful order
  • Examples: hair_colour, marital_status
  • The symbols or names of things can also be represented with numbers
  • Numbers should not be used quantitatively for these attributes

Binary Attributes

  • A nominal attribute with only two categories or states: 0 or 1
  • 0 means that the attribute is absent, and 1 means it is present
  • Binary attributes are Boolean when the two states correspond to true or false
  • A binary attribute is symmetric when both states are equally valuable and carry the same weight
  • It is asymmetric when the outcomes of the states are not equally important

Ordinal Attributes

  • Possible values have a meaningful order or ranking
  • The magnitude between successive values is not known
  • For example, drink_size values have a meaningful sequence, but the values cannot tell us how much bigger "large" is than "medium"
  • Ordinal attributes are useful for subjective assessments of qualities that cannot be measured objectively
  • Central tendency can be represented by the mode and the median, but the mean cannot be defined
  • Nominal, binary, and ordinal attributes are qualitative
  • Drink size (small, medium, large), grades, professional ranks, army ranks, & satisfaction levels are all examples of ordinal attributes
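
The point that an ordinal attribute has a median and a mode but no mean can be illustrated with a minimal sketch. The ranking in `SIZE_ORDER` and the `orders` sample are hypothetical values made up for illustration.

```python
from statistics import median_low, mode

# Assumed ranking of the ordinal values (hypothetical example data).
SIZE_ORDER = ["small", "medium", "large"]
RANK = {label: position for position, label in enumerate(SIZE_ORDER)}


def ordinal_median(values):
    """Return the middle label by rank (the lower middle one for an even count)."""
    ranks = [RANK[value] for value in values]
    return SIZE_ORDER[median_low(ranks)]


orders = ["small", "large", "medium", "medium", "large", "small", "medium"]
typical_by_median = ordinal_median(orders)  # uses only the ordering
typical_by_mode = mode(orders)              # uses only the frequencies
print(typical_by_median, typical_by_mode)
```

The ranks are used only for ordering. Averaging them would assume that the gap between "small" and "medium" equals the gap between "medium" and "large", which an ordinal attribute does not guarantee.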

Numeric Attributes

  • Quantitative and measurable with integer or real values
  • Can be interval or ratio scaled

Interval Scaled

  • Measures on a scale of equal-size units
  • Equal distance between two points
  • Positive, 0, or negative values
  • Temperature is an interval-scaled attribute

Ratio Scaled

  • A numeric attribute with a true zero-point, which indicates the complete absence of the measured variable
  • A value can be expressed as a multiple or ratio of another value
  • Values are ordered, and differences between values can be computed
  • Years of experience, number of words in a document, and monetary quantities are all examples of ratio-scaled attributes
  • Height and weight are ratio variables: they are measured from 0 and never fall below it
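
The difference between the two scales can be shown with temperature. This is a minimal sketch that assumes Celsius as the interval-scaled example and Kelvin as its ratio-scaled counterpart; the two readings are made-up values.

```python
def celsius_to_kelvin(celsius):
    """Kelvin has a true zero point, so ratios of Kelvin values are meaningful."""
    return celsius + 273.15


cool_c, warm_c = 10.0, 20.0

# Differences are meaningful on an interval scale.
difference = warm_c - cool_c

# The Celsius ratio is arithmetic only: 20 °C is not "twice as warm" as 10 °C.
naive_ratio = warm_c / cool_c

# On the ratio scale the same two temperatures differ by only a few percent.
kelvin_ratio = celsius_to_kelvin(warm_c) / celsius_to_kelvin(cool_c)

print(difference, naive_ratio, round(kelvin_ratio, 4))
```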

Statistical Description of Data

  • Statistical descriptions identify general properties of the data
  • Measures of central tendency describe the location of the middle or centre of a distribution, that is, where most values fall
  • Measures of dispersion describe how the data are spread out
  • Common dispersion measures are the range, quartiles, and interquartile range, the five-number summary, boxplots, and the variance and standard deviation of the data
  • These descriptions help highlight which data values should be treated as noise or outliers

Measuring the Central Tendency

  • Suppose an attribute x, such as salary, has been recorded for a set of objects: where would most of the values fall?
  • Central tendency is measured using the mean, median, mode, and midrange
  • The most common and effective numeric measure is the arithmetic mean
  • It corresponds to the SQL aggregate function avg()
  • Each value in a data set can be associated with a weight that reflects its significance or importance
  • The result is called the weighted arithmetic mean, or weighted average
  • Although the mean is the single most useful measure, it is not always the best way of measuring the centre of the data
  • Extreme (outlier) values skew the mean
  • The trimmed mean is the mean obtained after removing values at the high and low extremes
  • Avoid trimming too large a portion, since that discards valuable information
  • For skewed data, the median is a better measure of the centre
  • The median is the middle value of a set of ordered data values
  • With an odd number of values, the median is the middlemost value
  • With an even number of observations the median is not unique: it can be any value between the two middlemost values
  • By convention, the median is then taken as the average of the two middlemost values
  • The median is expensive to compute for a large number of observations
  • For numeric attributes, the median can be approximated instead
  • Assume the data are grouped into intervals and the frequency of each interval is known
  • Identify the median interval, the interval that contains the median frequency
  • Approximate the median by interpolation: median ≈ L1 + ((N/2 - F) / f) × width, where L1 is the lower boundary of the median interval, N is the number of values, F is the sum of the frequencies of all intervals below the median interval, f is the frequency of the median interval, and width is the width of the median interval
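
The measures in this section can be sketched in a few lines of Python. This is an illustrative sketch, not a reference implementation: the function names, the `salaries` values (in thousands), and the `age_groups` intervals are assumptions made for the example. The grouped median uses interpolation inside the median interval: lower boundary + ((N/2 - sum of frequencies below) / frequency of the median interval) × width.

```python
from statistics import mean, median


def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the values from each end."""
    ordered = sorted(values)
    cut = int(len(ordered) * proportion)  # number of values removed per end
    return mean(ordered[cut:len(ordered) - cut])


def grouped_median(intervals):
    """Approximate the median from (lower, upper, frequency) intervals."""
    total = sum(freq for _, _, freq in intervals)
    if total == 0:
        raise ValueError("intervals contain no observations")
    below = 0  # sum of frequencies of the intervals below the current one
    for lower, upper, freq in intervals:
        if below + freq >= total / 2:
            # Interpolate inside the median interval.
            return lower + (total / 2 - below) / freq * (upper - lower)
        below += freq


# Hypothetical salary values; 110 is an extreme value.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
print(mean(salaries), median(salaries), trimmed_mean(salaries, 0.1))

# Hypothetical grouped data: (lower bound, upper bound, frequency).
age_groups = [(0, 10, 5), (10, 20, 15), (20, 30, 20), (30, 40, 10)]
print(grouped_median(age_groups))
```

In this sample the single extreme value pulls the mean above the median, while trimming one value from each end moves the mean back toward the median.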

Mode

  • The mode is the value that occurs most frequently in the set
  • Applies to both qualitative and quantitative attributes
  • The greatest frequency may correspond to several different values, resulting in more than one mode
  • Data sets with one, two, or three modes are called unimodal, bimodal, and trimodal, respectively
  • In general, a data set with two or more modes is multimodal
  • For unimodal numeric data that are moderately skewed, the empirical relation is: mean - mode ≈ 3 × (mean - median)
  • So for a unimodal frequency curve, the mode can be approximated when the mean and median are known
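
Rearranging the empirical relation gives mode ≈ 3 × median - 2 × mean. The sketch below, which uses made-up sample values and a hypothetical helper name `approximate_mode`, finds the exact mode(s) and compares them with that approximation.

```python
from statistics import mean, median, multimode


def approximate_mode(values):
    """Empirical estimate for moderately skewed unimodal data.

    From mean - mode ≈ 3 * (mean - median) it follows that
    mode ≈ 3 * median - 2 * mean.
    """
    return 3 * median(values) - 2 * mean(values)


# Hypothetical unimodal, right-skewed sample.
sample = [2, 3, 3, 3, 4, 5, 6, 8, 11]
exact_modes = multimode(sample)       # every value with the highest frequency
estimate = approximate_mode(sample)   # rough estimate, not the exact mode
print(exact_modes, estimate)

# A sample in which two values share the highest frequency is bimodal.
print(multimode([1, 1, 2, 2, 3]))
```

The estimate is only a rule of thumb, so it will generally differ from the exact mode, as it does for this small sample.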

Midrange

  • Used to assess central tendency using the average of the largest and smallest values
  • Can be computed using SQL aggregate functions
  • In a unimodal frequency curve with perfect symmetric data distribution, the mean, mode, and median are all at the same center value
  • In most applications, data are not symmetrical and are instead skewed
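
A short sketch of the midrange, and of how the measures of centre coincide for symmetric data but separate for skewed data. Both samples are hypothetical.

```python
from statistics import mean, median, mode


def midrange(values):
    """Average of the largest and smallest values."""
    return (max(values) + min(values)) / 2


# Perfectly symmetric, unimodal sample: all measures agree.
symmetric = [1, 2, 3, 3, 3, 4, 5]
print(mean(symmetric), median(symmetric), mode(symmetric), midrange(symmetric))

# Sample with one extreme high value: the measures move apart.
skewed = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
print(mean(skewed), median(skewed), midrange(skewed))
```

In the second sample the mean exceeds the median, which is the pattern expected for a positively skewed distribution. The midrange moves furthest because it depends only on the two extreme values.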

Measure of Dispersion of Data

  • Range, quartiles, and interquartile range are measures of dispersion
  • The range is the difference between the largest and smallest values
  • Let x1, x2, ..., xN be the observed values of some numeric attribute X
  • With the data sorted in ascending order, certain data points can be picked to split the distribution into equal-size consecutive sets
  • These data points are called quantiles
  • Quantiles are points taken at regular intervals of a data distribution, dividing it into essentially equal-size consecutive sets
  • The kth q-quantile is the value x such that at most k/q of the data values are less than x and at most (q - k)/q are greater than x
  • k is an integer with 0 < k < q, so there are q - 1 q-quantiles
  • The 2-quantile is the data point that divides the data into lower and upper halves
  • It corresponds to the median
  • The 4-quantiles are the three data points that split the data into four equal parts
  • Each part represents one quarter of the data
  • They are more commonly referred to as quartiles

Quartiles

  • Quartiles indicate a distribution's centre, spread, and shape
  • The first quartile (Q1) is the 25th percentile
  • The second quartile is the 50th percentile, which is the median
  • The third quartile (Q3) is the 75th percentile
  • Quantiles that split the data into 100 equal-sized consecutive sets are known as percentiles
  • The distance between the first and third quartiles gives the range covered by the middle half of the data, called the interquartile range (IQR)
  • IQR = Q3 - Q1
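
Quartiles, percentiles, and the IQR can be computed with Python's standard library, as in the sketch below. The sample is hypothetical. Note that several conventions exist for interpolating quantiles, so other tools may return slightly different quartile values for the same data; this sketch uses the default method of `statistics.quantiles`.

```python
from statistics import quantiles

# Hypothetical sample of 11 values.
sample = [2, 4, 4, 5, 7, 8, 9, 10, 12, 13, 15]

# n=4 returns the three cut points that split the data into four parts.
q1, q2, q3 = quantiles(sample, n=4)
iqr = q3 - q1

# n=100 returns the 99 percentile cut points; index 24 is the 25th percentile.
percentile_points = quantiles(sample, n=100)
p25, p50, p75 = percentile_points[24], percentile_points[49], percentile_points[74]

print(q1, q2, q3, iqr)
print(p25, p50, p75)
```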

Outlier Calculation

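  • Work from the sorted list of elements
  • Find the median, taking the mean of the two central elements when the count is even
  • Apply the IQR rule: find Q1 and Q3 in the same way (a quartile may be the mean of two adjacent elements, for example the 3rd and 4th) and flag values that fall far outside them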
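
A minimal sketch of the five-number summary and of the common rule of thumb that treats values lying at least 1.5 × IQR above Q3 or below Q1 as suspected outliers. The function names and the sample are assumptions made for illustration, and the quartiles follow the default interpolation of `statistics.quantiles`, which is only one of several conventions.

```python
from statistics import quantiles


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    q1, q2, q3 = quantiles(values, n=4)
    return min(values), q1, q2, q3, max(values)


def suspected_outliers(values, factor=1.5):
    """Values lying at least `factor` * IQR below Q1 or above Q3."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low_fence = q1 - factor * iqr
    high_fence = q3 + factor * iqr
    return [x for x in values if x <= low_fence or x >= high_fence]


# Hypothetical salary values; 110 is an extreme value.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
print(five_number_summary(salaries))
print(suspected_outliers(salaries))
```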

Five-number Summary

  • Written in the order minimum, 1st quartile, median, 3rd quartile, maximum
  • Used for boxplots and for visualising outliers
  • A rule of thumb for identifying suspected outliers is to single out values that fall at least 1.5 × IQR above the third quartile or below the first quartile

Boxplots

  • Graphically, the ends of the box are at the quartiles, so the length of the box is the interquartile range
  • The line in the box marks the median
  • Whiskers extend outside the box towards the smallest and largest observations
  • The whiskers reach the extreme low and high observations only if those values lie less than 1.5 × IQR beyond the quartiles; otherwise they terminate at the most extreme observations within 1.5 × IQR of the quartiles, and the remaining values are plotted individually as outliers
  • Boxplots can be used to compare several sets of compatible data

Variance and Standard Deviation

  • Variance and standard deviation are measures of the dispersion of data
  • A low standard deviation means the observations tend to be very close to the mean
  • A high standard deviation means the data are spread out over a large range of values
  • The standard deviation of the observations is the square root of the variance
  • σ measures spread about the mean and should be used only when the mean is chosen as the measure of centre
  • σ is 0 only when there is no spread, that is, when all observations have the same value. Otherwise, σ > 0
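
The relationship between variance and standard deviation can be sketched as follows. This sketch assumes the population form, which divides by N; some texts and tools divide by N - 1 for a sample instead. The data values are hypothetical.

```python
from math import sqrt


def variance(values):
    """Population variance: the mean squared deviation from the mean."""
    centre = sum(values) / len(values)
    return sum((x - centre) ** 2 for x in values) / len(values)


def standard_deviation(values):
    """Square root of the variance."""
    return sqrt(variance(values))


spread_out = [2, 4, 4, 4, 5, 5, 7, 9]
no_spread = [7, 7, 7, 7]

print(variance(spread_out), standard_deviation(spread_out))
print(standard_deviation(no_spread))  # zero, because every value equals the mean
```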
