Machine Learning - Data Preprocessing

Questions and Answers

Which of the following best describes the relationship between data and information?

  • Information is raw and unorganized, while data is processed and structured.
  • Information is used to create data.
  • Data is processed and formatted to become information. (correct)
  • Data and information are interchangeable terms with the same meaning.

In the hierarchy of data, information, and knowledge, what characterizes the transition from data to information?

  • An increase in quantity and an increase in the degree of abstraction.
  • An increase in quantity and a decrease in the degree of abstraction.
  • A decrease in quantity and a decrease in the degree of abstraction.
  • A decrease in quantity and an increase in the degree of abstraction. (correct)

Which of the following scenarios exemplifies the importance of data quality in machine learning?

  • Ignoring noise and outliers in the data to speed up the training process.
  • Using a complex algorithm to compensate for missing values in the dataset.
  • An intelligent algorithm failing to produce reliable results due to inconsistent data. (correct)
  • Achieving high accuracy with a small dataset containing numerous errors.

What percentage of data scientists' work is typically dedicated to data preparation?

80%

What is a key consideration when assessing the validity of a dataset?

Determining whether the dataset is appropriate for the intended use.

What is 'dimensionality' in the context of a dataset?

The number of attributes or features in the dataset.

Which of the following describes a potential consequence of the 'curse of dimensionality'?

The amount of training data required grows exponentially with dimensionality.

What is the primary goal of data preprocessing?

To transform raw data into an understandable and usable format.

Which of the following is an example of data aggregation?

Combining daily sales data to create monthly sales figures.

What is the primary purpose of sampling in the context of data mining?

To select a representative subset of data for analysis.

In the context of sampling, what does it mean for a sample to be 'representative'?

It mirrors the key properties and characteristics of the original dataset.

What is a potential drawback of using a very small sample size?

Inability to accurately detect underlying patterns in the data.

What is the main reason for performing dimensionality reduction on a dataset?

To facilitate data visualization and reduce computational requirements.

What is a potential consequence of excessively reducing the dimensionality of a dataset?

Important information is lost, potentially distorting results.

Which of the following is a characteristic of high data precision coupled with high bias?

Measurements are closely clustered but centered far from the true value.

What is the difference between feature selection and feature extraction in dimensionality reduction?

Feature extraction creates entirely new features, while feature selection picks a subset of the old ones.

In the context of feature subset selection, what are redundant features?

Features that are highly correlated with each other, duplicating information.

What is the goal of 'feature creation' in data preprocessing?

To automatically generate new features that capture important information more efficiently.

What is Wavelet transformation primarily used for in feature engineering?

To decompose signals into different frequency components, which is useful for feature extraction.

What does data discretization involve?

Transforming continuous variables into discrete or nominal variables.

What is the purpose of attribute transformation?

To map the original attribute values to new transformed values.

In data preprocessing, what does 'scaling' refer to?

The process of mapping all features to the same scale or range.

What is the purpose of Min-Max Normalization?

To scale all numeric variables between 0 and 1.

How is the new value $x'$ calculated using Min-Max Normalization, where the original value is $x$, and the new range is $[a,b]$?

$x' = a + \frac{(x - \min(x))(b - a)}{\max(x) - \min(x)}$

How is Unit Length scaling typically achieved?

By dividing each value in a vector by the magnitude of the vector.

What does 'binning' accomplish in data preprocessing?

Grouping numerical or continuous values into a small number of categories.

During text preprocessing, what is Tokenization used for?

Splitting the text into a sequence of individual words.

In text preprocessing, what is the purpose of removing stop words?

<p>To eliminate common words that do not carry important meaning. (C)</p> Signup and view all the answers

What is the difference between stemming and lemmatization?

Lemmatization accounts for word meaning.

In image preprocessing, what does 'thresholding' achieve?

Creating a binary image by setting pixel values above a threshold to one value and those below to another.

Which of the following represents a morphological operation used in Image Processing?

Image erosion.

What is the effect of 'erosion' in image processing?

It shrinks bright regions and enlarges dark regions.

In audio preprocessing, what does STFT stand for?

Short-Time Fourier Transform.

In audio processing, what is the purpose of Mel-Frequency Cepstral Coefficients (MFCCs)?

To transform audio signals into a format that mimics human auditory perception.

What is Zero-Crossing Rate?

The number of times a signal crosses zero amplitude within a time frame.

Which types of sound can noise reduction methods remove?

These methods may not always be able to remove all types of sound.

What is the difference between high-pass and low-pass filtering?

High-pass filtering attenuates frequency components below a cutoff frequency, while low-pass filtering attenuates those above it.

Which is a key characteristic of noise cancellation algorithms?

Unlike passive methodologies, noise cancellation uses active noise reduction.

Flashcards

What is Data?

Descriptions of things, events, activities, and transactions that have no inherent meaning and don't directly affect users.

What is Information?

Data that has been processed, formatted, and organized into a form that is meaningful and beneficial for decision-making.

Examples of bad data.

Errors, noise, outliers, duplicate data, missing values, inconsistent data, timeliness, relevance, etc.

Data Quality Issues

Mistakes in measurement, anomalous data points.

Precision vs. Bias

Closeness of measurements; systematic deviation from the true value.

Dimensionality

Number of attributes in a dataset

Curse of Dimensionality

As dimensionality grows, data thins out, making analysis difficult.

Data Preprocessing

It involves transforming raw data into an understandable and usable format.

Aggregation

Combining multiple attributes or objects into a single attribute.

Sampling

Selecting a subset that is representative of the larger dataset.

Simple Random Sampling

Each item has an equal chance of being selected during sampling.

Sampling without Replacement

A sampling technique where selected items are removed from the population.

Sampling with Replacement

A sampling method where items are not removed, allowing multiple selections.

Dimensionality Reduction

A technique in which data dimensions are reduced, potentially losing information.

Stratified Sampling

The data is divided into several parts and sampling is done randomly from each part.

Curse of Dimensionality

Data has too many features for available data points.

Feature Selection

Selecting relevant features to reduce dimensionality.

Feature Extraction

Creating new features from existing ones to improve model performance.

Information Gain

A measure of a feature's usefulness for predicting the class; dimensionality reduction should retain features with high information gain.

What is Binarization?

Converting numerical variables into binary variables.

What is Feature Subset Selection?

Choosing the most relevant subset of features from the data.

Feature Creation

Creating new attributes that represent the information present in the dataset.

What are the methods for mapping data to a new space?

Techniques such as the Fourier and Wavelet transforms, which map data into a new space.

Discretization

Converting continuous data into discrete buckets or intervals.

What is Attribute Transformation in data preprocessing?

Functions that map attribute values to new values.

What is Similarity?

A numeric measure of how alike two objects are.

What is Dissimilarity?

A numeric measure of how different two objects are.

Measuring distance

Mathematical calculations such as Euclidean, Minkowski, and Mahalanobis distance.

Scaling: Min-Max Normalization

Scales values to the range [a, b].

What is Binning?

Replacing numeric values with categories.

Data Cleaning for Text Data

URL removal, removal of irrelevant characters, lowercase conversion, tokenization, stopword removal, stemming and lemmatization.

What is Tokenization?

Splitting text into individual tokens (words).

Stopword Removal

Removing common words that carry little meaning from the text.

What is Stemming and Lemmatization?

Reducing words to their root or dictionary form; lemmatization accounts for word meaning.

Scaling is applied to what other data type?

Image data.

Common techniques used in Audio Data

Short-Time Fourier Transform (STFT), Mel-Frequency Cepstral Coefficients (MFCCs)

Noise Reduction in Audio Data Preprocessing

Active Noise Cancellation algorithms, Filtering Methods.

What is low-pass filtering?

Filters that allow low-frequency components to pass while attenuating high-frequency components.

Study Notes

  • Machine Learning involves data preprocessing.

Data and Information

  • Data is descriptions of objects, events, activities, and transactions without inherent meaning or direct user impact.
  • Information is data that has been processed and formatted into a meaningful form, beneficial for decision-making.

Data Importance

  • The quality of data significantly affects algorithm performance.
  • Bad data leads to algorithm failure.
  • Around 80% of data scientists' work involves data preparation.
  • Errors, noise, outliers, duplicate data, and missing values are examples of data issues.

Data Quality

  • Examples of data quality issues include noise and outliers, missing values, duplicate data, wrong data, and fake data.
  • Data quality can be checked using variance and covariance calculations.
  • Large quantities of data are prone to measurement errors and outliers.
  • Precision reflects the closeness of repeated measurements.
  • Bias marks the systematic deviation from the true value.
  • Variance measures the spread of the data and is inversely related to precision.
  • Covariance measures how two random variables change together and is applied to calculate variable correlations.

Dataset

  • Datasets can include databases, IoT sensors, text, audio, speech, images, and video.
  • Important dataset considerations are the validity, size relative to the case, the minimum data required for machine learning, the distribution quality, and the number of attributes.
  • Dataset types are Record, Graph, and Ordered.

Data Points vs. Dimensionality

  • Dimensionality is the number of attributes.
  • Traditional data analysis methods assume many more data points than dimensions.
  • Many datasets feature high dimensions yet few data objects.

Curse of Dimensionality

  • As dimensionality grows, data sparsity increases.
  • A set of n binary variables has 2^n joint states.
  • Learning requires extensive training data, and predictive power decreases as dimensionality increases.
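
To make the growth concrete, here is a tiny plain-Python sketch (the feature counts are arbitrary) of how the number of joint states explodes:

```python
# Joint states of n binary variables: 2^n.
# Covering each state with even one training example
# quickly becomes infeasible as n grows.
for n in [5, 10, 20, 30]:
    print(f"{n} binary features -> {2 ** n:,} joint states")
# 5 binary features -> 32 joint states
# 10 binary features -> 1,024 joint states
# 20 binary features -> 1,048,576 joint states
# 30 binary features -> 1,073,741,824 joint states
```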

Data Preprocessing

  • It transforms raw data into an understandable format.
  • It resolves data that is incomplete, inconsistent, or lacking in certain trends.
  • It addresses data likely to include many errors, and it reduces dimensionality.
  • Data preprocessing techniques fall into two categories:
  • Those based on the data objects (records), such as aggregation and sampling.
  • Those based on the attributes, such as dimensionality reduction and feature subset selection.

Topics in Data Preprocessing

  • Aggregation
  • Sampling
  • Dimensionality Reduction
  • Feature Subset Selection
  • Feature Creation
  • Discretization and Binarization
  • Attribute Transformation
  • Similarity and Dissimilarity calculation using:
  • Euclidean distance
  • Minkowski distance
  • Mahalanobis Distance
  • Simple Matching
  • Jaccard Coefficients
  • Cosine
  • Tanimoto
  • Correlation

Aggregation

  • Aggregation combines two or more attributes or objects into one. Goals:
  • Reducing data in terms of both attributes and objects.
  • Altering scales, e.g., combining city, province and country attributes.
  • Acquiring more "stable" data due to reduced variability.
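
As a minimal sketch of aggregation (daily records rolled up to monthly figures, as in the quiz above), assuming pandas; the column names and values are illustrative:

```python
import pandas as pd

# Hypothetical daily sales records.
daily = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=90, freq="D"),
    "sales": range(90),
})

# Aggregation: combine daily objects into monthly totals,
# reducing the number of objects and changing the time scale.
# "ME" = month-end in recent pandas; older versions use "M".
monthly = daily.resample("ME", on="date")["sales"].sum()
print(monthly)
```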

Sampling

  • Sampling is a primary technique for selecting data for investigation and analysis.
  • Sampling is used in data mining even when the complete dataset is available, because processing all of it can be too expensive.
  • Principles: effective sampling's output is as good as using the entire dataset, and samples must be representative, mirroring the original data's properties.
  • Types of sampling:
  • Simple random sampling: each item has an equal selection probability.
  • Sampling without replacement: selected items are removed and cannot be re-picked.
  • Sampling with replacement: items are not removed and can be picked multiple times.
  • Stratified sampling: the data is split into parts and samples are drawn randomly from each part.
  • Sample size should be carefully determined; a larger sample increases the probability that it is representative.
  • Adaptive/progressive sampling starts with a small sample size.
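
A minimal NumPy sketch of simple random sampling with and without replacement (the population is illustrative); for stratified sampling, scikit-learn's train_test_split(..., stratify=y) draws proportionally from each part:

```python
import numpy as np

rng = np.random.default_rng(seed=42)
population = np.arange(1000)  # illustrative population

# Without replacement: each item can be selected at most once.
sample_wo = rng.choice(population, size=100, replace=False)

# With replacement: items stay in the pool and can repeat.
sample_w = rng.choice(population, size=100, replace=True)

print(len(set(sample_wo)))  # always 100 distinct items
print(len(set(sample_w)))   # usually fewer than 100 distinct items
```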

Dimensionality Reduction

  • Needed because datasets can have numerous features, hindering data analysis and visualization.
  • Objectives include avoiding the Curse of Dimensionality, cutting memory use and processing time, aiding data visualization and eliminating irrelevant data or noise.
  • Involves feature selection and extraction.
  • Decreasing dimensions increases data structure density, risking information loss; expanding dimensions causes sparsity.
  • A feature that behaves the same across all classes, such as the 'Tinggi' (Height) feature in the lecture example, carries no discriminative information and can be dropped.
  • Techniques include feature subset selection, extraction, reduction via transformation, discretization and binarization.
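
As one illustration of reduction via transformation, a sketch using scikit-learn's PCA to project the four-feature Iris dataset onto two derived features (the dataset choice is illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)  # 150 objects, 4 features

pca = PCA(n_components=2)          # extract 2 new, derived features
X_2d = pca.fit_transform(X)

print(X_2d.shape)                     # (150, 2) - easy to visualize
print(pca.explained_variance_ratio_)  # variance retained per component
```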

Feature Subset Selection

  • Used for reducing dimensions while avoiding redundant features.
  • Features could be redundant, duplicating most information in other attributes, or irrelevant - lacking usable information for tasks.
  • Techniques are Brute-force, Embedded, Filter and Wrapper
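
A minimal sketch of the Filter technique with scikit-learn: each feature is scored against the target independently of any model, and only the top k are kept (the dataset and k are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Filter approach: score features with an ANOVA F-test,
# then keep the two highest-scoring ones.
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)

print(selector.get_support())  # boolean mask of the kept features
print(X_selected.shape)        # (150, 2)
```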

Feature Creation

  • It constructs new attributes that represent the important information on the dataset.
  • Modalities:
  • Feature extraction: extracting interesting characteristics from the raw data.
  • Domain-specific construction.
  • Mapping data to a new space: Fourier and Wavelet transforms.
  • Feature construction.
  • Feature combination.

Data Transformations

  • Transformations include Fourier and Wavelet transforms.

Discretization

  • Discretization can be applied using the class label (supervised discretization).
  • An entropy-based approach is a common supervised technique.
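
A minimal sketch of the entropy-based idea on a single numeric feature with a class label: candidate cut points are tried, and the one minimizing the weighted class entropy of the two resulting intervals is kept (the toy data is illustrative):

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of an array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_cut(x, y):
    """Cut point on x that minimizes weighted class entropy."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    best_cut_point, best_score = None, np.inf
    for i in range(1, len(x)):
        if x[i] == x[i - 1]:
            continue  # no boundary between equal values
        score = (i * entropy(y[:i]) + (len(y) - i) * entropy(y[i:])) / len(y)
        if score < best_score:
            best_cut_point, best_score = (x[i] + x[i - 1]) / 2, score
    return best_cut_point

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([0, 0, 0, 1, 1, 1])
print(best_cut(x, y))  # 6.5 - cleanly separates the two classes
```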

Attribute Transformation

  • This maps attribute values to new values, such that each original value can be identified with one of the new values.
  • Common transformations include:
  • Simple functions such as $x^k$, $\log(x)$, $e^x$, and $|x|$.
  • Standardization and normalization.
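
A short NumPy sketch of these transformations (the input values are illustrative):

```python
import numpy as np

x = np.array([1.0, 10.0, 100.0, 1000.0])

# Simple function transformations.
print(np.log10(x))  # log(x): compresses a wide range of magnitudes
print(x ** 2)       # x^k with k = 2
print(np.exp(np.array([0.0, 1.0, 2.0])))  # e^x on small values
print(np.abs(np.array([-3.0, 3.0])))      # |x|

# Standardization: zero mean, unit variance.
z = (x - x.mean()) / x.std()
print(z)
```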

Similarity and Dissimilarity

  • Similarity is a numerical measure of how alike two objects are; dissimilarity measures the differences between two objects.
  • Both can be measured using distance.
  • Similarity values typically fall in the range [0, 1], depending on the data type.

Techniques for Measuring Distance

  • Euclidean Distance
  • Minkowski distance
  • Mahalanobis Distance
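
A minimal SciPy sketch of the three distances (the vectors and data are illustrative; Mahalanobis additionally needs the inverse covariance matrix of the data):

```python
import numpy as np
from scipy.spatial import distance

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 6.0, 3.0])

print(distance.euclidean(u, v))       # straight-line distance: 5.0
print(distance.minkowski(u, v, p=1))  # p=1 is Manhattan distance: 7.0

# Mahalanobis: accounts for feature correlations via the
# inverse covariance matrix (VI) of a dataset.
X = np.random.default_rng(0).normal(size=(100, 3))
VI = np.linalg.inv(np.cov(X, rowvar=False))
print(distance.mahalanobis(u, v, VI))
```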

Scaling Methods

  • Min-Max Normalization
  • Mean Normalization
  • Standardization (Z-score Normalization)
  • Scaling to Unit Length
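
A minimal NumPy sketch of all four scaling methods, using the same min-max formula as the quiz question above (the input vector is illustrative):

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# Min-max normalization to a target range [a, b]
# (a=0, b=1 gives the usual [0, 1] scaling).
a, b = 0.0, 1.0
x_minmax = a + (x - x.min()) * (b - a) / (x.max() - x.min())

# Mean normalization: center on the mean, divide by the range.
x_meannorm = (x - x.mean()) / (x.max() - x.min())

# Standardization (z-score): zero mean, unit variance.
x_zscore = (x - x.mean()) / x.std()

# Scaling to unit length: divide by the vector's magnitude.
x_unit = x / np.linalg.norm(x)

print(x_minmax, x_meannorm, x_zscore, x_unit, sep="\n")
```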

Scaling Effect

  • Scaling Feature_2 with min-max normalization maps it to the range [0, 1].
  • The process can change the Euclidean distance from one record to another.
  • Record closeness can change drastically.

Binning

  • Binning groups values into categories; for example, it can replace raw salary values with bands.
  • Three example categories are Low, Mid, and High.
  • Binning bolsters model robustness but sacrifices precision.
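
A minimal pandas sketch of binning salaries into the three categories (the values and bin edges are illustrative):

```python
import pandas as pd

# Hypothetical salary values.
salaries = pd.Series([2500, 4000, 7500, 12000, 30000, 55000])

# Replace exact salaries with three coarse bands.
bins = [0, 5000, 20000, float("inf")]
labels = ["Low", "Mid", "High"]
bands = pd.cut(salaries, bins=bins, labels=labels)
print(bands)  # Low, Low, Mid, Mid, High, High
```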

Data Preprocessing for Text Data

  • Steps include removing URLs, removing irrelevant characters (such as numbers and punctuation), converting text to lowercase, tokenization, removing stopwords, stemming and lemmatization, removing short words, and converting tokens back into a string.
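
A minimal sketch of this pipeline using only the standard library; the stopword set and length threshold are toy stand-ins for a real stopword list and stemmer (e.g., from NLTK):

```python
import re

STOPWORDS = {"the", "a", "an", "is", "to", "and", "of"}  # toy list

def preprocess(text):
    text = re.sub(r"https?://\S+", "", text)    # remove URLs
    text = re.sub(r"[^a-zA-Z\s]", " ", text)    # drop numbers/punctuation
    text = text.lower()                         # lowercase
    tokens = text.split()                       # tokenization
    tokens = [t for t in tokens if t not in STOPWORDS]  # stopwords
    tokens = [t for t in tokens if len(t) > 2]  # remove short words
    return " ".join(tokens)                     # tokens back to a string

print(preprocess("The model is trained at https://example.com in 2024!"))
# -> "model trained"
```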

Image Data Preparation

  • Processes include cropping, resizing, image scaling, dimensionality reduction, morphological transformations, and thresholding.

Image Data Transformations Processes

  • Erosion shrinks bright regions and enlarges dark regions.
  • Dilation shrinks dark regions and enlarges bright regions.
  • Opening is erosion followed by dilation, removing small bright spots.
  • Closing is dilation followed by erosion, removing small dark spots.
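
A hedged sketch of thresholding and all four morphological operations, assuming OpenCV (cv2) is installed; the image path is hypothetical:

```python
import cv2
import numpy as np

# Hypothetical grayscale input image.
img = cv2.imread("input.png", cv2.IMREAD_GRAYSCALE)

# Thresholding: pixels above 127 become 255, the rest become 0.
_, binary = cv2.threshold(img, 127, 255, cv2.THRESH_BINARY)

kernel = np.ones((3, 3), np.uint8)
eroded = cv2.erode(binary, kernel)    # shrinks bright regions
dilated = cv2.dilate(binary, kernel)  # enlarges bright regions
opened = cv2.morphologyEx(binary, cv2.MORPH_OPEN, kernel)   # erosion, then dilation
closed = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel)  # dilation, then erosion
```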

Audio Data Features

  • Common audio features include Short-time Fourier Transform (STFT), Mel-Frequency Cepstral Coefficients (MFCCs), and zero-crossing rate.
  • Noise reduction and enhancement use filtering methods, spectral subtraction, and noise cancellation algorithms.

Short-Time Fourier Transform (STFT)

  • STFT is a mathematical technique used to analyze and represent the time-varying frequency content of a signal.

Mel-Frequency Cepstral Coefficients (MFCCs)

  • MFCCs are derived from the Mel-frequency scale and cepstral analysis of an audio signal's power spectrum; they are used in speech and audio pattern recognition and classification.
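
A hedged sketch of extracting all three features, assuming the librosa library; the audio file path is hypothetical:

```python
import librosa

# Hypothetical audio file (librosa resamples to 22,050 Hz by default).
y, sr = librosa.load("speech.wav")

# STFT: time-varying frequency content as a complex spectrogram.
S = librosa.stft(y)

# MFCCs: 13 coefficients modeled on human auditory perception.
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

# Zero-crossing rate per analysis frame.
zcr = librosa.feature.zero_crossing_rate(y)

print(S.shape, mfccs.shape, zcr.shape)
```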

Filtering methods (e.g., low-pass, high-pass)

  • Aim to emphasize or attenuate specific signal components to reduce noise.
  • Low-Pass Filtering allows low-frequency components to pass while attenuating high-frequency components.
  • High-Pass Filtering allows high-frequency components to pass while attenuating low-frequency components.
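
A minimal SciPy sketch of both filters on a toy signal containing a 50 Hz tone plus 2,000 Hz "noise" (the cutoff and sampling rate are illustrative):

```python
import numpy as np
from scipy import signal

fs = 8000  # sampling rate in Hz
t = np.arange(0, 1.0, 1 / fs)
x = np.sin(2 * np.pi * 50 * t) + 0.5 * np.sin(2 * np.pi * 2000 * t)

# Low-pass: keep components below the 500 Hz cutoff.
sos_lp = signal.butter(4, 500, btype="low", fs=fs, output="sos")
x_low = signal.sosfilt(sos_lp, x)   # mostly the 50 Hz tone remains

# High-pass: keep components above the 500 Hz cutoff.
sos_hp = signal.butter(4, 500, btype="high", fs=fs, output="sos")
x_high = signal.sosfilt(sos_hp, x)  # mostly the 2000 Hz tone remains
```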

Noise Cancellation Algorithms

  • Active noise cancellation (ANC) reduces or eliminates unwanted background noise.
