Machine Learning Unit-2 Quiz
48 Questions

Questions and Answers

What is the term used for the 100-quantiles?

Percentiles

What is the purpose of quartiles?

Quartiles provide information about a distribution's center, spread, and shape.

What is the most widely used form of quantiles?

Median, quartiles, and percentiles

What is the relationship between the first quartile (Q1) and the 25th percentile?

They are the same.

What does the third quartile (Q3) represent?

The 75th percentile

What does the interquartile range (IQR) measure?

The spread of the middle half of the data

What is the formula for calculating the interquartile range (IQR)?

IQR = Q3 - Q1

Given the salary data (in thousands of dollars): 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110. What is the IQR for this data?

18
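
The answer of 18 follows from the median-of-halves convention, in which Q1 and Q3 are the medians of the lower and upper halves of the sorted data. A minimal Python sketch of that calculation is shown here; the helper name `quartiles_by_halves` is an illustrative choice, not something from the lesson, and other quartile conventions (such as interpolation-based ones) can give a different IQR for the same data.

```python
from statistics import median

def quartiles_by_halves(values):
    """Return (Q1, median, Q3) using the median-of-halves convention."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]            # values below the median position
    upper = data[half + n % 2:]    # skips the middle value when n is odd
    return median(lower), median(data), median(upper)

salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
q1, q2, q3 = quartiles_by_halves(salaries)
iqr = q3 - q1
print(q1, q3, iqr)  # 48.5 66.5 18.0
```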

What is the median of the weights of the boxes of raisins?

The median is 32 grams.

Identify the first quartile (Q1) from the weights of the boxes of raisins.

The first quartile (Q1) is 29 grams.

What is the maximum weight of the boxes of raisins?

The maximum weight is 38 grams.

Calculate the range of the weights of the boxes of raisins.

The range is 13 grams.

Construct the lower hinge for the box plot from the given monetary data.

The lower hinge (minimum) is $19.

What is the third quartile (Q3) for the hourly collections from the Salvation Army kettle?

The third quartile (Q3) is $34.

Using the hourly collections, what is the sum of the values between Q1 and Q3?

The sum is $61.

How many data points lie above the median in the hourly collections?

There are 6 data points above the median.

What distinguishes ordinal data from nominal data?

Ordinal data can be arranged in a meaningful order based on assigned values, while nominal data cannot.

Why can't mathematical operations be performed on nominal data?

Mathematical operations cannot be performed on nominal data because it represents categories without inherent numerical value.

What is one key feature of interval data that makes it different from ratio data?

Interval data does not have a true zero point, whereas ratio data does.

Identify an example of ordinal data and explain its order.

An example of ordinal data is customer satisfaction levels like ‘Very Happy’, ‘Happy’, and ‘Unhappy’, which can be ranked from best to worst.

What types of central tendency measures can be utilized with ordinal data?

Mode and median can be used with ordinal data, but the mean cannot be calculated.

How are ratio data characterized in terms of mathematical operations?

Ratio data can be added, subtracted, multiplied, and divided, so measures of central tendency such as the mean and measures of dispersion such as the standard deviation can be calculated.

Provide an example of quantitative data and explain why it is considered measurable.

An example of quantitative data is height, which can be measured in units like centimeters.

What makes the difference between measuring temperature in interval data and ratio data?

In interval data, such as temperature in degrees Celsius, there is no true zero, so you cannot say one temperature is 'twice' another; in ratio data, such as weight, there is an absolute zero allowing such comparisons.
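
A small Python sketch can make the "twice as hot" point concrete. It assumes the standard Celsius-to-Kelvin offset of 273.15 (Kelvin is a ratio scale with an absolute zero); the two example readings are arbitrary.

```python
def celsius_to_kelvin(celsius):
    """Convert an interval-scale Celsius reading to the ratio-scale Kelvin."""
    return celsius + 273.15

warm_c, cool_c = 20.0, 10.0
naive_ratio = warm_c / cool_c   # 2.0, but this depends on where 0 °C was placed
true_ratio = celsius_to_kelvin(warm_c) / celsius_to_kelvin(cool_c)
print(naive_ratio, round(true_ratio, 3))  # 2.0 1.035
```

Dividing Celsius values suggests a factor of two, while the same temperatures on a scale with a true zero differ by only about 3.5%.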

How do outliers affect the mean in a dataset?

Outliers can drastically shift the mean, causing it to misrepresent the data.
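
As a rough illustration, the sketch below reuses the salary figures from the IQR question and compares the mean and median with and without the largest value (110). Treating 110 as the outlier is an assumption made for the example.

```python
from statistics import fmean, median

salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
without_top = salaries[:-1]  # drop the 110 value

print(fmean(salaries), median(salaries))                  # 58.0 54.0
print(round(fmean(without_top), 2), median(without_top))  # 53.27 52
```

Removing one value moves the mean by almost 5 but the median by only 2, which is why the median is described as more robust to outliers.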

What is the weighted mean and why is it useful?

The weighted mean is an average in which each value is multiplied by an assigned weight; it is useful when values do not contribute equally to the result.

Calculate the final weighted average for three exams with scores 80, 80, and 95 and weights 40%, 40%, and 20%.

The final weighted average is 83.
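
The arithmetic is 80 × 0.40 + 80 × 0.40 + 95 × 0.20 = 32 + 32 + 19 = 83. A minimal Python sketch of the same calculation follows; the function name `weighted_mean` is an illustrative choice.

```python
def weighted_mean(values, weights):
    """Weighted average: sum(w * x) / sum(w)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

scores = [80, 80, 95]
weights = [40, 40, 20]  # percentages; only their relative sizes matter
print(weighted_mean(scores, weights))  # 83.0
```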

What does a small deviation between mean and median suggest about a dataset?

It suggests that the dataset is less likely to have significant outliers.

Define the mode in a dataset and provide an example.

The mode is the most common value in a dataset; for example, in 1, 1, 2, 2, 2, the mode is 2.

What is the significance of missing values in an attribute like horsepower?

Missing values can lead to an incomplete analysis and affect overall statistical conclusions.

How does examining measures of data spread contribute to understanding a dataset?

Examining data spread provides insights into variability and the distribution of data points.

Why might certain attributes have significant deviations between mean and median?

Significant deviations often indicate the presence of outliers or skewed data distributions.

What is a potential disadvantage of ignoring tuples with missing values in a dataset?

Ignoring tuples leads to the loss of potentially useful information from other attributes in those tuples.

Why might manually filling in missing values be impractical?

It is time-consuming and may not be feasible for large datasets with many missing values.

What is the risk of using a global constant to fill in missing values?

It may create a misleading pattern in the data: because all missing values share the same constant, they can be mistaken for a meaningful category.

When should the mean be used to replace a missing value, and when should the median be favored?

The mean should be used for symmetric (for example, normal) data distributions, while the median is better for skewed distributions.

How can the mean or median be utilized for filling missing values within classes?

Missing values can be replaced with the mean or median of that class, such as mean income for a specific credit risk category.
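
A minimal sketch of class-based filling in Python, using the credit-risk and income example from the answer. The records, the class labels "low" and "high", and the use of `None` to mark a missing value are all hypothetical.

```python
from collections import defaultdict
from statistics import fmean

# Hypothetical records: (credit_risk, income in thousands); None marks a missing value.
records = [
    ("low", 60), ("low", 80), ("low", None),
    ("high", 20), ("high", 30), ("high", None),
]

# Collect the observed incomes per class.
observed = defaultdict(list)
for risk, income in records:
    if income is not None:
        observed[risk].append(income)

class_means = {risk: fmean(vals) for risk, vals in observed.items()}

# Fill each missing income with the mean of its own class.
filled = [(risk, class_means[risk] if income is None else income)
          for risk, income in records]
print(filled[2], filled[5])  # ('low', 70.0) ('high', 25.0)
```

Swapping `fmean` for `statistics.median` gives the median-based variant, which may be preferable when incomes within a class are skewed.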

What methods can be used to determine the most probable value for filling in missing data?

Regression, inference-based tools using Bayesian methods, or decision tree induction can be employed.

Define min-max normalization in the context of data scaling.

Min-max normalization scales data to a specified range, typically [0.0, 1.0], by transforming values from an original range.

What is a potential drawback of using the most probable value approach for missing data?

It may oversimplify the data relationships and fail to account for variations within the dataset.

What is the interquartile range if the first quartile is 64 and the third quartile is 77?

The interquartile range is 13.

How do you calculate the first quartile when the sample size is odd, for example, the lower half consists of values 64 and 64?

The first quartile is the mean of the two middle values, calculated as (64+64)/2 = 64.

What does a box plot visually represent in terms of data distribution?

A box plot visually shows the distribution of numerical data, including the quartiles and the median.

What marks the mid-point of a data set in a box plot?

The median marks the mid-point of the data and is represented by the line dividing the box.

What is indicated by the lower whisker in a box plot?

The lower whisker indicates scores outside the middle 50%, specifically the lowest scores excluding outliers.

In a dataset, if 75% of scores fall below a certain value, what is this value called?

This value is called the upper quartile or the third quartile.

What is the five number summary in statistics?

The five number summary consists of the minimum, first quartile, median, third quartile, and maximum.

How can you determine if there are outliers in a dataset based on a box plot?

Outliers are plotted as individual points beyond the ends of the whiskers.

Flashcards

Nominal Data

Categorical data without a natural order or ranking.

Ordinal Data

Categorical data with a clear, ordered relationship among values.

Quantitative Data

Numeric data that can be measured and has numerical values.

Interval Data

Numeric data where the difference between values is meaningful, but no true zero exists.

Ratio Data

Numeric data with a true zero point, allowing for all mathematical operations.

Mode

The value that appears most frequently in a dataset.

Median

The middle value when data is ordered from least to greatest.

Mean

The average value calculated by summing all values and dividing by the number of values.

Outliers

Values that are significantly higher or lower than the rest of the data.

Weighted Mean

An average where some values contribute more than others based on assigned weights.

Deviation

The difference between a data point and the mean or another measure like median.

Dispersion of Data

The extent to which data values spread out from the mean or median.

Missing Values

Data points that are not recorded or available in a dataset.

First Quartile (Q1)

The median of the lower half of data, where 25% fall below this value.

Third Quartile (Q3)

The median of the upper half of data, where 75% fall below this value.

Interquartile Range (IQR)

The difference between Q3 and Q1, showing the range of the middle 50% of data.

Box Plot

A graphical representation showing the distribution of data based on quartiles.

Five Number Summary

A summary of data consisting of minimum, Q1, median, Q3, and maximum values.

Upper Whisker

Represents scores outside the middle 50% and extends to the maximum value.

Lower Whisker

Represents scores outside the middle 50% and extends to the minimum value.

Ignoring tuples

Handling missing values by omitting incomplete tuples; often ineffective because the other attribute values in those tuples are lost.

Manual filling

Manually entering missing values, which is time-consuming and impractical for large datasets.

Global constant replacement

Replacing missing values with a uniform constant, like 'Unknown', risking misinterpretation.

Central tendency

Using mean or median values to replace missing data, depending on distribution shape.

Class-based filling

Replacing missing values with the mean or median for that classification category.

Most probable value

Filling in missing data with the most likely value, often determined by regression, Bayesian inference, or decision tree induction.

Decision trees

A method used to predict missing values based on other dataset attributes.

Min-max normalization

Scaling data to a specified range, typically [0, 1], using original min and max values.

Minimum (Min)

The smallest value in a dataset.

Maximum (Max)

The largest value in a dataset.

Quartiles

Values that divide a dataset into four equal parts, specifically Q1, Median, and Q3.

Percentiles

Values that divide a dataset into 100 equal parts.

Second Quartile (Q2)

The 50th percentile; also known as the median.

Measuring Spread

Using quartiles to indicate a distribution's spread and shape.

Dataset Example

Using salary data to find quartiles and IQR.

Study Notes

Machine Learning 702AI0C012 Unit-2: Data Exploration, Pre-processing and Visualization

  • This unit covers data exploration, pre-processing, and visualization techniques for machine learning.
  • Topics include missing value treatment, handling categorical data (mapping ordinal features, encoding class labels, one-hot encoding for nominal features), outlier detection and treatment, feature engineering (variable transformation, variable creation, and feature selection), and data visualization (box plots).
  • The workflow of machine learning involves:
    • Input data
    • Preparing to model (data exploration and pre-processing)
    • Learning (model selection, training, and tuning)
    • Performance evaluation (testing & validating models)
    • Performance improvement (refining models & using ensembling techniques like bagging and boosting).

Data Types in Machine Learning

  • Data is categorized as qualitative (categorical) or quantitative (numerical).
  • Categorical data comprises:
    • Nominal data: unordered categories like blood type, nationality, and gender.
    • Ordinal data: ordered categories like customer satisfaction levels, grades, and hardness of metal.
  • Numerical data includes:
    • Interval data: data with meaningful intervals, like temperature or date, but no true zero point.
    • Ratio data: data with a true zero point, like height, weight, and salary.
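
The distinction between nominal and ordinal data matters when encoding categories for a model. The sketch below shows one plausible way to do both in plain Python: an order-preserving integer mapping for an ordinal feature and one-hot encoding for a nominal feature. The integer codes and the sample values are assumptions made for illustration.

```python
# Ordinal feature: map categories to integers that preserve their order.
# The specific integer codes are an assumption; only the ordering matters.
satisfaction_order = {"Unhappy": 0, "Happy": 1, "Very Happy": 2}
responses = ["Happy", "Unhappy", "Very Happy"]
ordinal_codes = [satisfaction_order[r] for r in responses]

# Nominal feature: one-hot encode, since no order exists between categories.
blood_types = ["A", "O", "B", "O"]
categories = sorted(set(blood_types))          # ['A', 'B', 'O']
one_hot = [[int(value == c) for c in categories] for value in blood_types]

print(ordinal_codes)  # [1, 0, 2]
print(one_hot)        # [[1, 0, 0], [0, 0, 1], [0, 1, 0], [0, 0, 1]]
```

One-hot encoding avoids implying that, say, blood type "O" is greater than "A", which an integer mapping would suggest.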

Data Attributes

  • Discrete attributes take a finite or countably infinite number of values: nominal ones such as roll number, street number, and gender have finitely many values, while numeric ones such as counts and student ranks can be countably infinite.
  • Binary attributes are discrete attributes with only two values (e.g., male/female, yes/no).
  • Continuous attributes can take on any real number (e.g., length, height, price).

Descriptive Statistics

  • Measures of central tendency:
    • Mean: the average of all data values.
    • Median: the middle value when data is ordered.
    • Mode: the most frequently occurring value.
  • Measures of dispersion:
    • Range: the difference between the largest and smallest values.
    • Interquartile range (IQR): the difference between the 75th and 25th percentiles.
    • Standard deviation: a measure of how spread out the data is from the mean.
    • Variance: the average squared deviation from the mean (the square of the standard deviation).
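
These measures can be computed with Python's standard `statistics` module. The sketch below uses the small dataset from the mode question (1, 1, 2, 2, 2) and the population forms of variance and standard deviation; choosing the population rather than the sample formulas is an assumption.

```python
from statistics import fmean, median, mode, pstdev, pvariance

data = [1, 1, 2, 2, 2]

print(fmean(data))             # 1.6
print(median(data))            # 2
print(mode(data))              # 2
print(max(data) - min(data))   # range: 1
print(pvariance(data))         # population variance
print(pstdev(data))            # population standard deviation (square root of the variance)
```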

Data Visualization - Box Plots

  • Box plots display data through quartiles, median, minimum, maximum, and outliers to show data distribution and skewness.
  • Box plots aid in visualizing the five-number summary (minimum, first quartile, median, third quartile, maximum), which shows the center, spread & shape of data.
  • They are helpful in identifying outliers, dispersion, the median, and signs of skewness in given data.
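
The numbers behind a box plot can be sketched in a few lines of Python. The example uses the salary data from the quiz, the median-of-halves quartile convention, and the common rule that points more than 1.5 × IQR beyond the quartiles are flagged as outliers. That 1.5 × IQR rule is an assumption here, since the notes do not state which rule they use.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves convention."""
    data = sorted(values)
    n = len(data)
    lower = data[: n // 2]
    upper = data[n // 2 + n % 2:]
    return data[0], median(lower), median(data), median(upper), data[-1]

salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
lo, q1, med, q3, hi = five_number_summary(salaries)
iqr = q3 - q1

# Assumed convention: fences sit 1.5 * IQR beyond the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in salaries if x < lower_fence or x > upper_fence]

print((lo, q1, med, q3, hi))     # (30, 48.5, 54.0, 66.5, 110)
print(lower_fence, upper_fence)  # 21.5 93.5
print(outliers)                  # [110]
```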

Handling Missing Values

  • Main categories of methods to handle missing values:
    • Skipping: discarding data points or features with missing values.
    • Imputation: replacing missing values with estimated values based on other data points or features.
  • Methods for imputation:
    • Global constant
    • Central tendency (mean or median)
    • Most probable value (estimated by regression, Bayesian inference, or decision tree induction)
  • The choice of method depends on the context and on which analysis strategy minimizes information loss and skewing of the results.
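
The simpler strategies above can be sketched as follows. The horsepower readings, the `None` marker for missing values, and the sentinel constant -1 are all hypothetical choices for illustration.

```python
from statistics import fmean, median

# Hypothetical horsepower readings; None marks a missing value.
horsepower = [90, 95, None, 100, 105, None, 250]
observed = [x for x in horsepower if x is not None]

def impute(values, fill):
    """Replace every None in `values` with `fill`."""
    return [fill if x is None else x for x in values]

skipped = observed                                # skipping: drop the incomplete entries
with_constant = impute(horsepower, -1)            # global constant (sentinel chosen arbitrarily)
with_mean = impute(horsepower, fmean(observed))   # mean is pulled up by the 250 reading
with_median = impute(horsepower, median(observed))

print(fmean(observed), median(observed))  # 128.0 100
```

Because the 250 reading skews the data, the median (100) is arguably a more representative fill value here than the mean (128).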

Normalization

  • Techniques to convert different scales of data to a consistent scale:
    • Min-max normalization: linearly rescales data to a target range, typically 0 to 1.
    • Z-score normalization: standardizes data to mean = 0, std dev = 1.
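
Both techniques can be written directly from their definitions: min-max maps x to (x - min) / (max - min), scaled to the target range, and z-score maps x to (x - mean) / standard deviation. The following is a minimal sketch; the function names are illustrative, and using the population standard deviation is an assumption.

```python
from statistics import fmean, pstdev

def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so the smallest maps to new_min and the largest to new_max."""
    old_min, old_max = min(values), max(values)
    if old_max == old_min:
        raise ValueError("cannot min-max normalize constant data")
    span = old_max - old_min
    return [(x - old_min) / span * (new_max - new_min) + new_min for x in values]

def z_score_normalize(values):
    """Centre values on mean 0 and scale to population standard deviation 1."""
    mu, sigma = fmean(values), pstdev(values)
    if sigma == 0:
        raise ValueError("cannot z-score constant data")
    return [(x - mu) / sigma for x in values]

data = [10, 20, 30, 40, 50]
print(min_max_normalize(data))  # [0.0, 0.25, 0.5, 0.75, 1.0]
z = z_score_normalize(data)     # mean of z is 0, standard deviation is 1
```

In practice, the minimum, maximum, mean, and standard deviation would typically be computed on training data only and then reused to transform new data.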

Outlier Treatment

  • Three approaches to treat outliers aside from removal:
    • Trimming: reducing the effect of outliers (for example, by assigning them lower weights)
    • Mean/Median: replace outlier values using mean/median
    • Log transformation: transform the variable to reduce skewness
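
Two of these treatments can be sketched on the salary data from the quiz. The cutoff of 93.5 is a hypothetical threshold (Q3 + 1.5 × IQR under median-of-halves quartiles), and the natural logarithm is one possible choice of log base.

```python
import math
from statistics import median

salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# Median replacement: swap values above an assumed cutoff for the median of the rest.
cutoff = 93.5  # hypothetical threshold, see the note above
typical = median(x for x in salaries if x <= cutoff)
replaced = [typical if x > cutoff else x for x in salaries]

# Log transformation: compresses large values, which reduces right skew.
logged = [math.log(x) for x in salaries]

print(typical)        # 52
print(max(replaced))  # 70
```

Log transformation only applies to positive values; data containing zeros or negatives would need a shift or a different transform.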


Description

Test your knowledge on data exploration, pre-processing, and visualization techniques in machine learning. This unit dives into essential topics such as handling missing values, categorical data, outlier detection, and various visualization methods. Assess your understanding of the machine learning workflow and its critical components.