Questions and Answers
Which of the following data types is most likely to be removed during preprocessing due to 'high cardinality'?
- Continuous data such as temperature readings.
- Data with no statistical relevance, like names or unique IDs. (correct)
- Categorical data representing customer satisfaction levels (e.g., low, medium, high).
- Binary data representing a True/False condition.
In the context of data preprocessing, why might 'useless data' such as indexes or IDs be removed from a dataset?
- To ensure all data is on a continuous, quantified scale.
- To improve model performance by reducing statistical irrelevance. (correct)
- To convert numerical data into categorical data.
- To reduce the dimensionality of temporal data.
A dataset contains a feature indicating whether a customer clicked on an advertisement ('Yes' or 'No'). What type of data is this?
- Binary (correct)
- Categorical Nominal
- Continuous
- Temporal
Which type of data is characterized by being measured on a continuous quantified scale, also known as 'interval data'?
What distinguishes categorical ordinal data from categorical nominal data?
A dataset includes a 'customer satisfaction' column with values like 'Not Satisfied', 'Slightly Satisfied', 'Satisfied', and 'Very Satisfied'. What type of data is this?
In a dataset, the 'eye_color' column contains values such as 'blue', 'green', and 'brown'. This is an example of what type of data?
Which of the following is a primary characteristic of text data in the context of data preprocessing?
What initial step is performed in Natural Language Processing (NLP) after acquiring the text data?
In NLP, what is the purpose of 'stemming'?
What is the main function of 'stop words removal' in text preprocessing?
In data preprocessing, what does 'temporal data' primarily refer to?
Why is data preprocessing a crucial step in machine learning?
A dataset contains missing values. What is a common initial approach for identifying and handling these missing values?
Which of the following is NOT typically a step in data preprocessing?
Why is it important to identify and handle missing values in a dataset?
What should you ensure when you choose to delete a particular row with a null value?
In the context of handling missing data, what does imputing data involve?
Which of the following is a valid technique for imputing missing numerical data?
A dataset has missing values, and a domain expert advises replacing the missing values in the 'age' column with the average age. What imputation method is being used?
Why are machine learning models primarily based on mathematical equations?
What is the purpose of encoding categorical data?
Which encoding method is most suitable for categorical data where the order of categories matters?
A column 'education_level' has values like 'High School', 'Bachelor’s', 'Master’s', and 'Doctorate'. Which encoding method should be used?
In 'One-Hot Encoding (Nominal Data)', if there are 5 categories in a column, how many new columns will be created?
What happens if nominal data is mapped as ordinal data?
Which of the following accurately describes how to use Dummy Variables effectively?
Why is splitting a dataset into training and testing sets a crucial step in data preprocessing?
What is the main purpose of having a test set after training a machine learning model on a training dataset?
What is the approximate ratio of training to test set sizes?
In machine learning, what does feature scaling primarily aim to do?
Why should we consider using feature scaling in a machine learning model?
What are the two common methods for the process of standardization in machine learning?
Which formula represents Min-Max Normalization?
Binning is a technique used for what purpose?
What is the primary intention of data smoothing?
During data preprocessing, what does binning by distance involve?
What is the focus of binning by frequency?
Which one is a data binning technique?
Which of the following describes data binning by boundary?
Flashcards
Useless Data
Data that has no statistical relevance to the problem being solved; often characterized by high cardinality.
Binary Data
Data with only two possible values.
Continuous Data
Data measured on a continuous, quantified scale, also known as 'interval data'.
Categorical Ordinal Data
Data where the order of the data matters, but the 'distance' between values is not quantified.
Categorical Nominal Data
Categorical data where the order is irrelevant.
Text Data
A word, sentence, or document that can be analyzed using NLP.
Temporal Data
A field representing date, time, or order.
Data Preprocessing
Process of preparing raw data to make it suitable for machine learning.
Acquiring the Dataset
First step in data preprocessing; involves gathering data from multiple sources and combining them in a proper format.
Numpy Library
The fundamental package for scientific calculation in Python.
Pandas Library
Open-source Python library for data manipulation and analysis.
Matplotlib Library
Python 2D plotting library used to plot any type of chart.
Dropping Useless Data
Removing data columns that aren't statistically relevant to the problem being solved.
Impute Data
Techniques for filling in missing values: substitute values are entered in place of the missing data.
Drop NA
Removal of rows containing null values.
Most Frequent
An imputation strategy in which the imputer computes the most frequent value in a column and uses it to replace missing values.
Encoding the Data
Converting data into numeric form using techniques such as mapping, one-hot encoding, and binning (binning converts numeric values into bins).
Nominal Mapping
Mapping categories to integers is appropriate only if the variable is ordinal, which implies a natural ordering and ranking of the categories.
Encoded Data
Nominal data is not ordered.
Training Data
The subset of the dataset used to train the model; the dataset is usually divided in a 70:30 or 80:20 ratio.
Feature Scaling
Method to standardize the independent variables of a dataset within a specific range.
Standardization
A feature scaling method that rescales values to have zero mean and unit standard deviation.
Min-Max Normalization
A feature scaling method that rescales values to a fixed range, typically 0 to 1, using the minimum and maximum values.
Data Binning
Grouping data into bins/buckets: values contained in a small interval are replaced with a single representative value for that interval.
Binning by Distance
Binning that groups values into equal-width intervals by defining edges for each bin.
Binning by Frequency
Binning that divides the data so that each bin contains the same number of observations.
Study Notes
Data Preprocessing
- Data preprocessing prepares raw data, rendering it suitable for machine learning.
- This is a crucial initial step in creating a machine learning model.
- Real-world data contains noise and missing values, and may be in an unusable format.
Steps for Data Preprocessing
- Acquire the dataset
- Import crucial libraries
- Import the dataset
- Identify and handle missing values
- Encode categorical data
- Split the dataset
- Feature scaling
Acquiring the Dataset
- This is the initial step in data preprocessing for machine learning.
- A dataset comprises data gathered from multiple and disparate sources, then combined into a proper format.
- Dataset formats differ according to use cases; business datasets differ from medical ones, containing different data.
- Data should be put in CSV, HTML, or XLSX file formats.
Importing Libraries
- Numpy is a Python package for scientific calculation
- It is used for mathematical operations.
- Numpy adds multidimensional arrays and matrices.
- Pandas is a Python library for data manipulation and analysis.
- Pandas is used for importing and managing data sets.
- Matplotlib is a Python 2D plotting library.
- Matplotlib is used to plot various types of charts.
Code Example
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
Sample Dataset Characteristics
- Dataset stored in "Data.csv" file.
- It includes 10 instances/examples.
- Three independent variables: 'Country', 'Age', and 'Salary'.
- One dependent variable: 'Purchased'.
- Two missing values, one each in 'Age' and 'Salary' independent variables.
- 'Country' is a categorical variable.
Importing the Dataset
- Save your Python file in the directory containing the data set.
- The read_csv() function of the pandas library reads a CSV file.
- Separating independent and dependent variables is required for every machine learning model.
- Employ the iloc[] function from the pandas library to extract the independent variables.
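A minimal sketch of these two steps, assuming the Data.csv layout described above ('Country', 'Age', 'Salary' as the independent variables and 'Purchased' as the last, dependent column):
dataset = pd.read_csv('Data.csv')   # read the CSV file into a DataFrame
X = dataset.iloc[:, :-1].values     # independent variables: every column except the last
y = dataset.iloc[:, -1].values      # dependent variable: the last column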
Code - Deleting Rows with NaN Values
X = pd.DataFrame(X)   # convert the NumPy array to a DataFrame
X = X.dropna()        # drop every row that contains a null value
X = np.array(X)       # convert back to a NumPy array
Identifying and Handling Missing Values
- Handle missing values correctly during data preprocessing.
- Failure to handle missing values leads to incorrect conclusions and inferences.
- Methods to handle missing data include:
  - Deleting a particular row
  - Imputing the data by:
    - Replacing with the mean
    - Replacing with the median
    - Replacing with the most frequently occurring value
    - Replacing with a constant value
Deleting a Particular Row
- Remove rows with null values or with >75% missing values in a particular column.
- Only recommended when the dataset has adequate samples.
- Ensure removal doesn't introduce bias.
Impute Data
- Imputation adds variance, but it can efficiently negate the data loss that deletion would cause.
- Imputation can yield better results.
Code - Replacing NAN (Most Frequent)
from sklearn.impute import SimpleImputer
# Replace NaNs with the most frequent value in each of the first three columns
imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
imputer.fit(X[:, 0:3])
X[:, 0:3] = imputer.transform(X[:, 0:3])
Code - Replacing NAN (Median/Mean)
from sklearn.impute import SimpleImputer
# Replace NaNs with the column median (use strategy='mean' for the mean instead)
imputer = SimpleImputer(missing_values=np.nan, strategy='median')
imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])
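The same SimpleImputer API also covers the constant-value strategy from the list above; a minimal sketch (the fill value 0 is an arbitrary choice for illustration):
from sklearn.impute import SimpleImputer
# Replace NaNs with a fixed constant supplied via fill_value
imputer = SimpleImputer(missing_values=np.nan, strategy='constant', fill_value=0)
imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])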
Encoding The Data
- Data types include continuous, ordinal, and nominal data.
- Associated encoding methods include binning, mapping, and one-hot encoding.
- Categorical data has specific categories in a dataset.
- Machine learning models rely on mathematical equations.
- Numbers must therefore be used in the equations.
- How to encode:
  - Categorical data: ordinal data mapping; one-hot encoding (nominal data)
  - Continuous data: binning; normalization
Mapping (Ordinal data)
- Categorical columns include eye_color (nominal), Satisfaction (ordinal), and Upsell (nominal).
- Satisfaction is ordinal.
- Order matters in the satisfaction column.
Mapping (Ordinal data) Code:
from sklearn.preprocessing import LabelEncoder
# Replace each satisfaction category with an integer code
labelencoder = LabelEncoder()
dataset['satisfaction'] = labelencoder.fit_transform(dataset['satisfaction'])
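Note that LabelEncoder assigns integer codes in alphabetical order, which does not necessarily match the intended ranking. A minimal alternative sketch using an explicit mapping (the category names are assumed from the satisfaction example earlier in this material):
# Map each category to an integer that preserves the intended order
satisfaction_order = {'Not Satisfied': 0, 'Slightly Satisfied': 1,
                      'Satisfied': 2, 'Very Satisfied': 3}
dataset['satisfaction'] = dataset['satisfaction'].map(satisfaction_order)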
One-Hot Encoding (Nominal Data)
- Nominal data lacks order.
- ML models may assume correlations when mapping nominal data as ordinal data, producing faulty output.
- Use Dummy Encoding to eliminate this issue.
- Dummy variables use 0 or 1 to indicate absence or presence of a specific categorical effect.
- In dummy encoding, 1 indicates presence while the other dummy variables become 0.
- Number of columns equals the number of categories.
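A minimal sketch of dummy encoding with pandas, assuming the nominal eye_color column from the example above; a similar result can be obtained with scikit-learn's OneHotEncoder:
# One new 0/1 column per category ('blue', 'green', 'brown', ...)
dataset = pd.get_dummies(dataset, columns=['eye_color'])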
Splitting the Dataset
- Split data into separate training and test sets
- It enhances the performance of your machine learning model.
- Training involves feeding data to the model; testing runs the model after training.
- Training data and test data enable the model to identify correlations within the data.
- The training set denotes the subset of the data used to train the ML model.
- The test set is used to evaluate the outcomes of the trained ML model on unseen data.
- Usually, the dataset is split in a 70:30 or 80:20 ratio.
Splitting Data Code
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
Code includes the variables:
- X_train holds the features of the training data
- X_test holds the features of the test data
- y_train holds the dependent variable of the training data
- y_test holds the dependent variable of the test data
train_test_split() Function Parameters:
- Arrays of data.
- test_size specifies the dividing ratio between the training and test sets.
- random_state sets the seed for the random number generator, so the output is always the same when it is set to a fixed value (such as zero).
Encoding the Continuous Data
- Encoding can be either binning, or normalization.
Feature Scaling
- Marks the end of the data preprocessing in Machine Learning.
- Method to standardize independent variables of a dataset within a specific range.
- Feature scaling limits the range of variables so you can compare them on common ground.
- It prevents algorithms from being influenced by higher-magnitude values.
- Feature scaling can be either standardization or normalization.
Code - Standardization
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train[:, 3:] = sc.fit_transform(X_train[:, 3:])
X_test[:, 3:] = sc.transform(X_test[:, 3:])
- To standardize data of the test set, mean and standard deviation values of the training set are used (no data leaking).
- The transform() function (rather than fit_transform()) is used on the test set, so the scaler is not re-fitted on the test data.
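For reference, standardization rescales each value as z = (x - mean) / (standard deviation), with the mean and standard deviation taken from the training set.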
Code - Min Max normalization
from sklearn.preprocessing import MinMaxScaler
sc = MinMaxScaler()
X_train[:, 3:] = sc.fit_transform(X_train[:, 3:])
X_test[:, 3:] = sc.transform(X_test[:, 3:])
- To standardize the data of the test set, the max and min values of the training set are used, so there is no data leaking.
- The transform() function (rather than fit_transform()) is used on the test set.
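For reference, Min-Max normalization rescales each value as x' = (x - min) / (max - min), which maps the training set's minimum to 0 and its maximum to 1.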
Data Binning
- Data binning (bucketing) groups data into bins/buckets: it replaces values contained in a small interval with a single representative value for that interval.
- Binning can be applied to:
  - Convert numeric values to categorical values: binning by distance, binning by frequency
  - Reduce the number of numeric values: quantization (or sampling)
Binning
- Binning is a technique for data smoothing.
- Data smoothing is employed to remove noise from data.
Three techniques are used for data smoothing:
- binning
- regression
- outlier analysis
Binning by Distance Procedures
- Import the dataset
- Compute the range of values and find the edges of intervals/bins
- Define labels
- Convert numeric values into categorical labels
- Plot the histogram to see the distribution
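A minimal sketch of this procedure using pandas' cut() on the 'Age' column of the sample dataset (the bin count and labels are illustrative assumptions):
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

dataset = pd.read_csv('Data.csv')                                    # 1. import the dataset
edges = np.linspace(dataset['Age'].min(), dataset['Age'].max(), 4)  # 2. edges for 3 equal-width bins
labels = ['young', 'middle-aged', 'senior']                          # 3. define labels
dataset['age_bin'] = pd.cut(dataset['Age'], bins=edges,
                            labels=labels, include_lowest=True)      # 4. numeric -> categorical
dataset['age_bin'].value_counts().plot(kind='bar')                   # 5. see the distribution
plt.show()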
Binning by Frequency Procedures
- Import the dataset
- Define the labels
- Use qcut of the pandas library for data binning
- Plot the histogram to see the distribution
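A minimal sketch using pandas' qcut() on the 'Salary' column (the labels are illustrative); qcut picks the edges so that each bin holds roughly the same number of observations:
import pandas as pd
import matplotlib.pyplot as plt

dataset = pd.read_csv('Data.csv')                      # 1. import the dataset
labels = ['low', 'medium', 'high']                     # 2. define the labels
dataset['salary_bin'] = pd.qcut(dataset['Salary'],
                                q=3, labels=labels)    # 3. equal-frequency binning with qcut
dataset['salary_bin'].value_counts().plot(kind='bar')  # 4. see the distribution
plt.show()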
Binning and Sampling Techniques
- Binning by mean: each value in a bin is replaced by the mean value of the bin.
- Binning by median: each value in a bin is replaced by the bin's median value.
- Binning by boundary: each value in a bin is replaced by the closest boundary value, i.e. the minimum or maximum value of the bin.
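A minimal sketch of binning by mean and binning by boundary on a small sorted sample (the values are illustrative):
import pandas as pd

values = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
bins = pd.qcut(values, q=3)                         # three equal-frequency bins

# Binning by mean: every value becomes the mean of its bin
by_mean = values.groupby(bins).transform('mean')    # use 'median' for binning by median

# Binning by boundary: every value snaps to the closer of its bin's min or max
def snap_to_boundary(group):
    lo, hi = group.min(), group.max()
    return group.apply(lambda v: lo if v - lo <= hi - v else hi)

by_boundary = values.groupby(bins).transform(snap_to_boundary)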