Data Preprocessing

Questions and Answers

In data preprocessing, what is the primary characteristic of 'useless' data?

  • Contains a high number of decimal places.
  • Represents continuous values within a specific range.
  • Is temporal in nature.
  • Has no statistical relevance to the problem being solved. (correct)

How can a data domain expert contribute to the detection of useless data?

  • By converting data into binary format.
  • By calculating the mean and median of the dataset.
  • By normalizing the dataset to a standard scale.
  • By using specialized knowledge to identify irrelevant features. (correct)

What type of data is best described as having only two possible values?

  • Continuous data
  • Binary data (correct)
  • Categorical data
  • Temporal data

Which method is most effective for detecting binary variables in a dataset?

Answer: Counting the unique values.

Which term best describes 'interval data' that is measured on a continuous quantified scale?

Answer: Continuous.

Which of the following is a common characteristic of continuous data that aids in its detection?

Answer: Having a high count of unique values.

How does ordinal data differ from nominal data regarding the 'distance' between values?

Answer: For ordinal data the order matters, but the 'distance' between values is not quantified.

What is a key distinction between categorical ordinal and categorical nominal data?

Answer: Ordinal data has a meaningful order, while nominal data does not.

What is the primary characteristic of text data in the context of machine learning preprocessing?

Answer: Text data consists of words, sentences, or documents, requiring additional processing.

Which preprocessing tasks are commonly associated with Natural Language Processing (NLP)?

Answer: Tokenization, stemming, and stop-word removal.

What does 'tokenization' refer to in the context of Natural Language Processing (NLP)?

Answer: Dividing a text into individual words or terms.

What is the main purpose of 'stemming' in Natural Language Processing (NLP)?

Answer: Reducing words to their root form.
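
The three NLP steps above can be sketched without any NLP library; this is a deliberately naive illustration (the stop-word list and suffix rules are made up, and real stemmers such as Porter's are far more careful):

```python
# Minimal, dependency-free sketch of tokenization, stop-word removal,
# and a toy suffix-stripping "stemmer".
text = "The runners were running quickly through the parks"

tokens = text.lower().split()                       # tokenization: split into words
stop_words = {"the", "were", "through", "a", "an"}  # tiny illustrative stop list
filtered = [t for t in tokens if t not in stop_words]

def stem(word):
    # Naive stemming: strip a few common suffixes.
    for suffix in ("ing", "ly", "ers", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print([stem(t) for t in filtered])  # ['runn', 'runn', 'quick', 'park']
```

Note how even crude stemming maps both "runners" and "running" to the same root "runn", which is exactly the grouping effect stemming is after.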

What characteristic defines temporal data?

Answer: It relates to dates, times, or sequences.

What must be done to raw data to make it usable for machine learning algorithms?

Answer: It must be preprocessed.

What is the first step in data preprocessing?

Answer: Acquire the data.

What is the role of the Pandas library?

Answer: Pandas reads and manages the data.

What does the iloc[] method do?

Answer: It extracts the independent variables.

What should be considered when deciding to delete data?

Answer: Whether more than 75% of the data is missing.

Why is it important to handle missing values?

Answer: Otherwise the model can draw inaccurate conclusions.

What is the process of imputing data?

Answer: Replacing missing values with the median, mean, or a constant value.

How can One-Hot Encoding impact the ML model?

Answer: It can cause an issue of correlation between variables, therefore producing a faulty output.

What does the number of dummy variables equal?

Answer: The number of categories.

When is LabelEncoder() used?

Answer: When the data has only two categories.

What characterizes the dataset used as a training set?

Answer: The output is already known.

Into which two sets is a dataset split?

Answer: A training set and a test set.

What is a typical ratio that the dataset gets split into?

Answer: 70:30 or 80:20.

What is imported from sklearn.model_selection to split the dataset?

Answer: train_test_split.

What type of encoding is typically used for encoding continuous data?

Answer: Normalizing.

Why do we use feature scaling?

Answer: All of the above: it limits the range of variables, allows comparison on a common scale, and prevents higher values from dominating.

What does feature scaling accomplish?

Answer: It decreases the influence of higher values.

What can cause a model to deliver incorrect results?

Answer: Values that do not have the same scale.

What are the feature scaling methods?

Answer: Standardization and normalization.

During preprocessing, how should the standardization fit and transform functions be used?

Answer: Fit on the training set only, then transform both sets, so there is no data leaking.

What does data binning/bucketing accomplish?

Answer: It groups data.

When can binning be applied to numeric values?

Answer: To convert numeric values to categorical values.

What does data smoothing accomplish?

Answer: It removes noise from the data.

What method is used to calculate the size for each bin?

Answer: Binning by frequency.

What is an alternative technique of data binning?

Answer: Sampling.

Flashcards

Useless Data

Data that has no statistical relevance to the problem being solved, often with high cardinality.

Binary Data

Data with only two possible values.

Continuous Data

Data measured on a continuous, quantified scale; also called 'interval data'.

Categorical Ordinal Data

Categorical data where the order of categories matters, but the 'distance' between values is not quantified.

Categorical Nominal Data

Categorical data where the order of categories is irrelevant.

Text Data

A word, sentence, or document; requires text encoding for analysis.

Natural Language Processing (NLP)

Applying machine learning to linguistics.

Tokenization

Dividing text into individual units.

Bag of Words

Technique to represent text data, where the frequency of words is identified.

Stemming

Reducing words to their root form.

Stop Words Removal

Removing commonly used words from the data, such as 'the'.

Temporal Data

Data with a time component.

Data Preprocessing

Preparing raw data to make it suitable for machine learning.

Acquiring the Dataset

The first step in data preprocessing, which involves gathering data from various sources.

Libraries NumPy, Pandas, Matplotlib

Essential Python libraries used for data manipulation, mathematical operations, and plotting.

Handling Missing Values

Addressing missing entries in your data

Imputing Data

Techniques for using domain knowledge, statistics, or models to replace missing data entries

Deleting Rows

Omitting rows where there are missing values

Unbalanced Data

The target dataset doesn't contain a proportional representation of the classes.

Encoding Categorical Data

Assigning numerical labels to categories so that machines can process them.

Mapping ordinal values

Assign numeric labels to categories when there is an underlying order to the classes.

One Hot Encoding

Creating new features (columns) to represent categorical variables, each column represents a category.

Splitting the dataset

Divides a dataset into two groups of data to train and assess performance.

Feature scaling

Transforms data to the same scale.

Standardization

Method to standardize independent variables of a dataset within a specific range.

Min-Max normalization

Scales and translates each feature individually such that it is in the range of zero and one.

Data binning

Groups data into bins/buckets by replacing values with the interval they fall into.

Binning by distance

Define edges and labels to compute the value ranges associated with a category.

Binning by frequency

Calculates the size of each bin so that each bin contains the same number of observations.

Data sampling

Each bin's value is replaced by the mean, median or max/min.

Study Notes

  • Data preprocessing prepares raw data for machine learning.
  • It is the first and crucial step while creating a machine learning model.

Why Data Preprocessing Is Needed

  • Real-world data often contains noise, missing values, or unusable formats.
  • These issues prevent the data from being directly used for machine learning models.

Steps for Data Preprocessing

  • Acquire the dataset.
  • Import relevant libraries (e.g., Numpy, Pandas, Matplotlib).
  • Import the dataset into the working environment or notebook.
  • Identify and handle missing values.
  • Encode categorical data.
  • Split the dataset into training and testing sets.
  • Apply feature scaling.

Acquiring the Dataset

  • Involves gathering data from multiple sources into a combined format.
  • Dataset formats vary based on the use case (e.g., business or medical).
  • Datasets are typically stored in CSV, HTML, or XLSX file formats.

Data Types

  • Useless data has no statistical relevance to the problem being solved.
  • Indexes, IDs, account numbers, names, and email addresses can be considered useless data.
  • Binary data has only two possible values and may include binary classification labels.
  • Data domain experts and unique value counts can be used for binary data detection.
  • Continuous, or 'interval', data is measured on a continuous quantified scale, such as temperature or salary.
  • For categorical ordinal data, the order of the data matters, but the 'distance' between values is not quantified.
  • Color, species and drink preference are included as nominal categorical data.
  • Temporal data includes entries such as dates, time or order.
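
The detection heuristics above (unique-value counts for binary data, high cardinality for ID-like or continuous data) can be sketched on a made-up DataFrame; the column names and values here are purely illustrative:

```python
import pandas as pd

# Hypothetical toy dataset.
df = pd.DataFrame({
    "customer_id": [101, 102, 103, 104],      # ID column: unique per row
    "purchased": [0, 1, 0, 1],                # binary: exactly two values
    "temperature": [21.5, 19.8, 23.1, 20.4],  # continuous: many unique values
})

unique_counts = df.nunique()
binary_cols = unique_counts[unique_counts == 2].index.tolist()

# Columns unique in every row may be 'useless' IDs or genuinely continuous
# data; a domain expert (or the dtype) is needed to tell them apart.
high_card_cols = unique_counts[unique_counts == len(df)].index.tolist()

print(binary_cols)     # ['purchased']
print(high_card_cols)  # ['customer_id', 'temperature']
```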

Importing Libraries

  • Numpy: Fundamental package for scientific calculation in Python, used for mathematical operations and multidimensional arrays.
  • Pandas: Open-source Python library for data manipulation and analysis, used for importing and managing datasets.
  • Matplotlib: Python 2D plotting library for creating various types of charts.
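
A minimal sketch of the conventional import aliases for these three libraries:

```python
import numpy as np               # mathematical operations, multidimensional arrays
import pandas as pd              # importing and managing datasets
import matplotlib.pyplot as plt  # 2D plotting

# Quick sanity check: build a NumPy array and wrap it in a Pandas DataFrame.
arr = np.array([[1, 2], [3, 4]])
df = pd.DataFrame(arr, columns=["a", "b"])
print(df.shape)  # (2, 2)
```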

Sample Dataset Characteristics

  • Features three independent variables: Country, Age, Salary.
  • Features one dependent variable: Purchased.
  • Contains missing values in Age and Salary
  • Includes a categorical variable: Country

Importing the Dataset

  • It's recommended to save the Python file in the directory with the dataset.
  • read_csv() from Pandas imports a CSV file.
  • It is necessary to separate the dataset into independent and dependent variables.
  • Employ the iloc[] feature in the Pandas library to isolate the independent variables.
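
A small sketch of these two steps; an inline string stands in for the CSV file on disk, and the values mirror the sample dataset described above:

```python
import io
import pandas as pd

# Stand-in for a file such as "Data.csv" in the working directory.
csv_data = io.StringIO(
    "Country,Age,Salary,Purchased\n"
    "France,44,72000,No\n"
    "Spain,27,48000,Yes\n"
    "Germany,30,54000,No\n"
)
dataset = pd.read_csv(csv_data)  # normally: pd.read_csv("Data.csv")

# iloc[] separates independent (X) and dependent (y) variables by position.
X = dataset.iloc[:, :-1].values  # all rows, every column except the last
y = dataset.iloc[:, -1].values   # all rows, the last column only

print(X.shape, y.shape)  # (3, 3) (3,)
```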

Handling Missing Values

  • Identify and correctly handle missing values to avoid inaccurate inferences.
  • Methods include:
    • Deleting rows: removing rows with missing values, but use with caution.
    • Imputing data: replacing missing values using mean, median, or constant values.

Imputing Data (Handling Missing Values)

  • Imputation can add variance but negates data loss.
  • Often yields better results as compared to omitting rows or columns.
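
A minimal sketch of mean imputation using Pandas alone (median or a constant work the same way through fillna); the values are made up:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Age": [44.0, 27.0, np.nan, 38.0],
    "Salary": [72000.0, 48000.0, 54000.0, np.nan],
})

# Replace each missing entry with the mean of its column.
imputed = df.fillna(df.mean(numeric_only=True))

print(imputed.isna().sum().sum())  # 0 — no missing values remain
```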

Encoding Categorical Data

  • Converts categorical information into numerical data.
  • Machine learning models rely on numerical calculations.

Ordinal Data Mapping

  • The column satisfaction is ordinal.
  • Because order matters in this column, the mapping should reflect this order.
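
A sketch of such an ordinal mapping; the satisfaction values here are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({"satisfaction": ["Low", "High", "Medium", "Low"]})

# An explicit mapping preserves the order Low < Medium < High.
order = {"Low": 0, "Medium": 1, "High": 2}
df["satisfaction_encoded"] = df["satisfaction"].map(order)

print(df["satisfaction_encoded"].tolist())  # [0, 2, 1, 0]
```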

One-Hot Encoding (Nominal Data)

  • Nominal Data is not ordered.
  • Dummy Encoding is used to eliminate this issue, generating dummy variables with 0 or 1 to represent the presence of a category.
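
A sketch of dummy encoding with Pandas; `drop_first=True` removes one dummy column to avoid the correlation issue mentioned above:

```python
import pandas as pd

df = pd.DataFrame({"Country": ["France", "Spain", "Germany", "Spain"]})

# One 0/1 column per category; dropping the first keeps the dummies independent.
dummies = pd.get_dummies(df["Country"], drop_first=True)

print(list(dummies.columns))  # ['Germany', 'Spain']
```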

Encoding Continuous Data

  • Binning.
  • Mapping
  • OneHotEncoding (nominal data)

Splitting the Dataset

  • Datasets are divided into training and test sets.

Purpose of Splitting

  • Improves model performance.
  • Training sets "teach" the model.
  • Test sets evaluate the model's ability to generalize.

Train Test Split

  • Datasets are commonly split using a 70:30 or 80:20 ratio between the training and testing sets.
  • Four elements are present in the code.
    • X_train holds the features of the training data.
    • X_test holds the features of the testing data.
    • y_train holds the dependent variable for the training data.
    • y_test holds the dependent variable for the testing data.
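
The four elements can be produced with scikit-learn's train_test_split; the toy arrays here are placeholders:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # ten samples, two features each
y = np.arange(10)                 # one label per sample

# 80:20 split; random_state makes the shuffle reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```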

Feature Scaling (Normalization) and Binning

  • Marks the end of data preprocessing in machine learning.

Feature Scaling

  • Standardizes independent variables within a specific range.
  • Limits the range of variables for comparison.
  • Prevents algorithms from being unduly influenced by higher values.

Standardisation and Normalisation Methods

  • Standardization applies transformations.
  • Normalization scales values between 0 and 1.

Feature Scaling in Specific Sample Datasets

  • The age and salary columns may not share the same scale; feature scaling addresses this, since otherwise salary values can easily dominate age values and deliver incorrect results.
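
A sketch of both methods with scikit-learn, using made-up age/salary values; note that the scalers are fitted on the training set only and merely applied to the test set, so no test information leaks into training:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X_train = np.array([[25.0, 40000.0], [35.0, 60000.0], [45.0, 80000.0]])
X_test = np.array([[30.0, 50000.0]])

# Standardization: zero mean, unit variance per column.
std = StandardScaler().fit(X_train)  # fit on training data only
X_train_std = std.transform(X_train)
X_test_std = std.transform(X_test)   # reuse the training statistics

# Min-Max normalization: each training column scaled into [0, 1].
mm = MinMaxScaler().fit(X_train)
X_train_mm = mm.transform(X_train)

print(X_train_mm.min(), X_train_mm.max())  # 0.0 1.0
```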

Data Binning

  • A type of data smoothing.
  • Groups data into bins or buckets, replacing the values within an interval with a single representative value, which converts numeric values to categorical ones.
  • Improves accuracy in models.

Binning Techniques

  • Binning by distance: define the edges and labels of each bin.
  • Binning by frequency: calculate the size of each bin so that each bin contains the same number of observations, dividing the dataset into equal portions.
  • Binning by sampling (mean, median, boundary): reduce the number of samples by replacing groups of similar contiguous values with a representative value.
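
The first two techniques map directly onto Pandas' cut and qcut; the ages below are illustrative:

```python
import pandas as pd

ages = pd.Series([22, 25, 31, 38, 45, 52, 61, 70])

# Binning by distance: fixed edges define each category.
by_distance = pd.cut(ages, bins=[20, 40, 60, 80],
                     labels=["young", "middle", "senior"])

# Binning by frequency: every bin receives the same number of observations.
by_frequency = pd.qcut(ages, q=4, labels=["q1", "q2", "q3", "q4"])

print(by_distance.tolist()[:4])  # the first four ages all fall in 'young'
```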
