Data Preprocessing for Machine Learning

Questions and Answers

Which of the following data types is most likely to be removed during preprocessing due to 'high cardinality'?

  • Continuous data such as temperature readings.
  • Data with no statistical relevance, like names or unique IDs. (correct)
  • Categorical data representing customer satisfaction levels (e.g., low, medium, high).
  • Binary data representing a True/False condition.

In the context of data preprocessing, why might 'useless data' such as indexes or IDs be removed from a dataset?

  • To ensure all data is on a continuous, quantified scale.
  • To improve model performance by reducing statistical irrelevance. (correct)
  • To convert numerical data into categorical data.
  • To reduce the dimensionality of temporal data.

A dataset contains a feature indicating whether a customer clicked on an advertisement ('Yes' or 'No'). What type of data is this?

  • Binary (correct)
  • Categorical Nominal
  • Continuous
  • Temporal

Which type of data is characterized by being measured on a continuous quantified scale, also known as 'interval data'?

  • Continuous data (correct)

What distinguishes categorical ordinal data from categorical nominal data?

  • Ordinal data has a meaningful order or ranking, while nominal data doesn't. (correct)

A dataset includes a 'customer satisfaction' column with values like 'Not Satisfied', 'Slightly Satisfied', 'Satisfied', and 'Very Satisfied'. What type of data is this?

  • Ordinal categorical data (correct)

In a dataset, the 'eye_color' column contains values such as 'blue', 'green', and 'brown'. This is an example of what type of data?

  • Nominal data (correct)

Which of the following is a primary characteristic of text data in the context of data preprocessing?

  • It consists of words, sentences, or documents. (correct)

What initial step is performed in Natural Language Processing (NLP) after acquiring the text data?

  • Tokenization (correct)

In NLP, what is the purpose of 'stemming'?

  • Reducing words to their root form. (correct)

What is the main function of 'stop words removal' in text preprocessing?

  • To eliminate common words that don't provide much value. (correct)

In data preprocessing, what does 'temporal data' primarily refer to?

  • Data related to dates, times, or order. (correct)

Why is data preprocessing a crucial step in machine learning?

  • It prepares raw data to be suitable for machine learning. (correct)

A dataset contains missing values. What is a common initial approach for identifying and handling these missing values?

  • Consult a data domain expert for informed decisions. (correct)

Which of the following is NOT typically a step in data preprocessing?

  • Writing the research paper based on the results (correct)

Why is it important to identify and handle missing values in a dataset?

  • Failing to handle missing values might lead to incorrect outcomes. (correct)

What should you ensure when you choose to delete a particular row with a null value?

  • The deletion does not add bias to our values. (correct)

In the context of handling missing data, what does imputing data involve?

  • Replacing missing values with estimated or calculated values. (correct)

Which of the following is a valid technique for imputing missing numerical data?

  • Replacing with a constant value (correct)

A dataset has missing values, and a domain expert advises replacing the missing values in the 'age' column with the average age. What imputation method is being used?

  • Replacing with the mean (correct)

Why are machine learning models primarily based on mathematical equations?

  • To allow numbers to be manipulated in the equations (correct)

What is the purpose of encoding categorical data?

  • Transforming categorical data into a numerical format. (correct)

Which encoding method is most suitable for categorical data where the order of categories matters?

  • Mapping (Ordinal) (correct)

A column 'education_level' has values like 'High School', 'Bachelor’s', 'Master’s', and 'Doctorate'. Which encoding method should be used?

  • Ordinal mapping (correct)

In 'One-Hot Encoding (Nominal Data)', if there are 5 categories in a column, how many new columns will be created?

  • 5 (correct)

What happens if nominal data is mapped as ordinal data?

  • The ML model may assume some correlation between the nominal data. (correct)

Which of the following accurately describes dummy variables?

  • They take the value 0 or 1 to indicate the absence or presence of a specific categorical effect that can shift the outcome. (correct)

Why is splitting a dataset into training and testing sets a crucial step in data preprocessing?

  • To enhance the performance of our machine learning model. (correct)

What is the main purpose of having a test set after training a machine learning model on a training dataset?

  • To evaluate the trained ML model. (correct)

What is a typical ratio of training set to test set?

  • 70:30 (correct)

In machine learning, what does feature scaling primarily aim to do?

  • Standardize all values to be within a single scale. (correct)

Why should we consider using feature scaling in a machine learning model?

  • To compare each dataset on a set of common grounds (correct)

What are the two common methods of feature scaling in machine learning?

  • Standardization and Normalization (correct)

Which formula represents Min-Max Normalization?

  • $X' = \frac{X - \min(X)}{\max(X) - \min(X)}$ (correct)

What is binning a technique for?

  • Data smoothing (correct)

What is the primary intention of data smoothing?

  • To remove noise from data. (correct)

During data preprocessing, what does binning by distance involve?

  • Defining the edges of intervals/bins. (correct)

What is the focus of binning by frequency?

  • It calculates the bin sizes (correct)

Which one is a data binning technique?

  • Sampling (correct)

Which of the following describes data binning by boundary?

  • Each bin value is closest to its boundary value. (correct)

Flashcards

Useless Data

Data that has no statistical relevance to the problem being solved; often includes high cardinality.

Binary Data

Data with only two possible values.

Continuous Data

Data measured on a continuous, quantified scale, also known as 'interval data'.

Categorical Ordinal Data

Data where the order of the data matters, but the 'distance' between values is not quantified.

Categorical Nominal Data

Categorical data where the order is irrelevant.

Text Data

A word, sentence, or document that can be analyzed using NLP.

Temporal Data

A field representing date, time, or order.

Data Preprocessing

Process of preparing raw data to make it suitable for machine learning.

Acquiring the Dataset

First step in data preprocessing; involves gathering data from multiple sources and combining them in a proper format.

Numpy Library

The fundamental package for scientific calculation in Python.

Pandas Library

Open-source Python library for data manipulation and analysis.

Matplotlib Library

Python 2D plotting library used to plot any type of charts in Python.

Dropping Useless Data

Removing data columns that aren't statistically relevant to the problem being solved.

Impute data

Techniques that fill in missing values by substituting estimated or calculated values.

Drop NA

Removal of rows that contain null values.

Most Frequent

The imputer computes the most frequent value in a column and uses it to replace the missing values.

Encoding the data

Techniques such as mapping and one-hot encoding convert categorical values into numbers; binning converts numeric values into bins.

Nominal Mapping

Mapping categories to integers is appropriate only if the variable is ordinal, which implies a natural ordering and ranking of the categories; nominal data should not be mapped this way.

Encoded Data

Nominal data is not ordered.

Training Data

The dataset is typically divided into training and test sets in a 70:30 or 80:20 ratio.

Feature scaling

Method to standardize the independent variables of a dataset within a specific range.

Standardization

A feature scaling technique that rescales each independent variable to have zero mean and unit standard deviation.

Min Max normalization

A feature scaling technique that rescales each independent variable to a fixed range, typically [0, 1], using its minimum and maximum values.

Data Binning

Data binning (or bucketing) groups data into bins/buckets: it replaces the values contained in a small interval with a single representative value for that interval.

Binning by distance

Defining the edges of each bin.

Binning by frequency

Calculates the size of each bin so that each bin contains the same number of observations.

Study Notes

Data Preprocessing

  • Data preprocessing prepares raw data, rendering it suitable for machine learning.
  • This is a crucial initial step in creating a machine learning model.
  • Real-world data contains noise and missing values, and may be in an unusable format.

Steps for Data Preprocessing

  • Acquire the dataset
  • Import crucial libraries
  • Import the dataset
  • Identify and handle missing values
  • Encode categorical data
  • Split the dataset
  • Feature scaling

Acquiring the Dataset

  • This is the initial step in data preprocessing for machine learning.
  • A dataset comprises data gathered from multiple and disparate sources, then combined into a proper format.
  • Dataset formats differ according to use cases; business datasets differ from medical ones, containing different data.
  • Datasets are commonly stored in CSV, HTML, or XLSX file formats.

Importing Libraries

  • NumPy is a Python package for scientific calculation.
  • It is used for mathematical operations.
  • NumPy adds support for multidimensional arrays and matrices.
  • Pandas is a Python library for data manipulation and analysis.
  • Pandas is used for importing and managing data sets.
  • Matplotlib is a Python 2D plotting library.
  • Matplotlib is used to plot various types of charts.

Code Example

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Sample Dataset Characteristics

  • Dataset stored in "Data.csv" file.
  • It includes 10 instances/examples.
  • Three independent variables: 'Country', 'Age', and 'Salary'.
  • One dependent variable: 'Purchased'.
  • Two missing values, one each in 'Age' and 'Salary' independent variables.
  • 'Country' is a categorical variable.

Importing the Dataset

  • Save your Python file in the directory containing the dataset.
  • The read_csv() function of the pandas library reads a CSV file.
  • Separating the independent and dependent variables is required for every machine learning model.
  • Use the iloc[] indexer from the pandas library to extract the independent variables.
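
The import step above can be sketched as follows. This is a minimal illustration, not the lesson's own code: the CSV contents are invented stand-ins for "Data.csv", and an in-memory buffer replaces the file so the snippet runs on its own. It assumes the dependent variable is the last column.

```python
import io
import pandas as pd

# Hypothetical stand-in for Data.csv; the values are invented for illustration.
csv_text = """Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,54000,No
Spain,38,,No
Germany,,61000,Yes
"""

# read_csv accepts a path or a file-like object; StringIO avoids needing a file on disk.
dataset = pd.read_csv(io.StringIO(csv_text))

# All columns except the last are the independent variables (features).
X = dataset.iloc[:, :-1].values
# The last column is the dependent variable (target).
y = dataset.iloc[:, -1].values

print(X.shape, y.shape)
```

With a real file, the call would simply be `pd.read_csv("Data.csv")`.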

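Deleting Rows with NaN Values Code

# Convert to a DataFrame so that dropna() is available
X = pd.DataFrame(X)
# Drop every row that contains at least one NaN
X = X.dropna()
# Convert back to a NumPy array (the matching entries of y must be dropped as well)
X = np.array(X)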

Identifying and Handling Missing Values

  • Handle missing values correctly during data preprocessing.
  • Failure to handle missing values leads to incorrect conclusions and inferences.
  • Methods to handle missing data include:
    • Deleting a particular row
    • Imputing the data by:
      • Replacing with the mean
      • Replacing with the median
      • Replacing with the most frequently occurring value
      • Replacing with a constant value

Deleting a Particular Row

  • Remove a row that has a null value, or drop a column in which more than 75% of the values are missing.
  • Only recommended when the dataset has adequate samples.
  • Ensure removal doesn't introduce bias.

Impute Data

  • Imputation can add variance to the dataset, but it efficiently offsets the loss of data that deletion would cause.
  • Imputation can therefore yield better results than deleting rows.
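
The handling options listed above can be compared side by side with plain pandas. This is a small sketch on invented values; the series names are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Invented numeric column with one missing value.
ages = pd.Series([44.0, 27.0, np.nan, 38.0, 35.0])

dropped = ages.dropna()                      # delete the row
mean_filled = ages.fillna(ages.mean())       # replace with the mean
median_filled = ages.fillna(ages.median())   # replace with the median
constant_filled = ages.fillna(0.0)           # replace with a constant

# Invented categorical column; the most frequent value fills the gap.
countries = pd.Series(["France", "Spain", None, "Spain"])
mode_filled = countries.fillna(countries.mode()[0])

print(mean_filled[2], median_filled[2], mode_filled[2])
```

Which strategy is appropriate depends on the column and, as the notes say, on advice from a domain expert.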

Code - Replacing NaN (Most Frequent)

from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
imputer.fit(X[:, 0:3])
X[:, 0:3] = imputer.transform(X[:, 0:3])

Code - Replacing NaN (Median/Mean)

from sklearn.impute import SimpleImputer
# Use strategy='mean' instead to replace missing values with the column mean
imputer = SimpleImputer(missing_values=np.nan, strategy='median')
imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])

Encoding The Data

  • Data types include continuous, ordinal, and nominal data.
  • Associated encoding methods include binning, mapping and one hot encoding.

Encoding the Data

  • Categorical data has specific categories in a dataset.
  • Machine learning models rely on mathematical equations.
  • These equations operate on numbers, so categorical values must be converted to a numeric form.
  • How to encode
    • Categorical data
      • Ordinal data mapping
      • One hot encoding (nominal data)
    • Continuous data
      • Binning
      • Normalization

Mapping (Ordinal data)

  • Categorical columns include eye_color (Nominal), Satisfaction (Ordinal), and Upsell (Nominal).
  • Satisfaction is ordinal.
  • Order matters in the satisfaction column.

Mapping (Ordinal data) Code:

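from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()
# Note: LabelEncoder assigns integer codes in sorted (alphabetical) label order,
# which may not match the intended ranking of the categories.
dataset['satisfaction'] = labelencoder.fit_transform(dataset['satisfaction'])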
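
When the ranking of the categories has to be preserved exactly, one alternative is an explicit dictionary mapping. The sketch below is an assumption-laden illustration rather than the lesson's method: the rows and the `order` dictionary are invented, and the category names are taken from the quiz example above.

```python
import pandas as pd

# Invented rows using the satisfaction levels from the quiz example.
dataset = pd.DataFrame({
    "satisfaction": ["Satisfied", "Not Satisfied", "Very Satisfied", "Slightly Satisfied"]
})

# Hypothetical mapping: the integers encode the intended ranking explicitly.
order = {
    "Not Satisfied": 0,
    "Slightly Satisfied": 1,
    "Satisfied": 2,
    "Very Satisfied": 3,
}

dataset["satisfaction_code"] = dataset["satisfaction"].map(order)
print(dataset)
```

Any category missing from the dictionary would become NaN, which makes unexpected labels easy to spot.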

One-Hot Encoding (Nominal Data)

  • Nominal data lacks order.
  • ML models may assume a correlation or order among categories when nominal data is mapped as ordinal data, producing faulty output.
  • Use dummy encoding to eliminate this issue.
  • Dummy variables take the value 0 or 1 to indicate the absence or presence of a specific categorical effect.
  • In dummy encoding, the column for the category that is present holds 1 and the other columns hold 0.
  • The number of columns equals the number of categories.
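
A minimal sketch of dummy (one-hot) encoding, using pandas' `get_dummies` on an invented `eye_color` column. The data frame is hypothetical; only the column name comes from the notes.

```python
import pandas as pd

# Invented nominal column.
df = pd.DataFrame({"eye_color": ["blue", "green", "brown", "blue"]})

# One new 0/1 column per category; dtype=int gives integers instead of booleans.
dummies = pd.get_dummies(df["eye_color"], prefix="eye_color", dtype=int)

print(dummies)
```

Three categories produce three columns, and every row contains exactly one 1.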

Splitting the Dataset

  • Split the data into separate training and test sets.
  • It enhances the performance of your machine learning model.
  • Training feeds data to the model; testing runs the trained model on data it has not seen.
  • The model learns correlations from the training data, and the test data checks whether they hold on unseen examples.
  • The training set is the subset of the data used to train the ML model.
  • The test set is the subset used to evaluate the trained ML model.
  • Usually, the dataset is split in a 70:30 or 80:20 ratio.

Splitting Data Code

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

The code includes these variables:

  • X_train holds the features of the training data
  • X_test holds the features of the test data
  • y_train holds the dependent variable of the training data
  • y_test holds the dependent variable of the test data

train_test_split() Function Parameters:

  • Arrays of data.
  • test_size specifies the dividing ratio between the training and test sets.
  • random_state sets the seed of the random number generator so that the split is the same on every run.
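
The effect of these parameters can be checked on a toy array. This is a sketch with invented data: 10 rows with `test_size=0.2` give 8 training rows and 2 test rows, and repeating the call with the same `random_state` reproduces the split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Invented data: row i of X is [2*i, 2*i + 1] and y[i] = i.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

# Same seed, same split.
X_train2, X_test2, _, _ = train_test_split(X, y, test_size=0.2, random_state=1)

print(X_train.shape, X_test.shape)
print(np.array_equal(X_test, X_test2))
```

Features and targets are shuffled together, so each row of X_train still lines up with its entry in y_train.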

Encoding the Continuous Data

  • Encoding can be either binning, or normalization.

Feature Scaling

  • Marks the end of the data preprocessing in Machine Learning.
  • Method to standardize independent variables of a dataset within a specific range.
  • Feature Scaling Limits the range of variables so you can compare them on common ground.
  • It prevents features with larger values from dominating the algorithm.

Feature Scaling

  • Feature scaling can be standardization, or normalization.

Code - Standardization

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train[:, 3:] = sc.fit_transform(X_train[:, 3:])
X_test[:, 3:] = sc.transform(X_test[:, 3:])
  • To standardize the test set, the mean and standard deviation of the training set are used, so there is no data leakage.
  • transform() is used on the test set instead of fit_transform(), so the scaler is not refit on the test data.

Code - Min Max normalization

from sklearn.preprocessing import MinMaxScaler
sc = MinMaxScaler()
X_train[:, 3:] = sc.fit_transform(X_train[:, 3:])
X_test[:, 3:] = sc.transform(X_test[:, 3:])
  • To normalize the test set, the minimum and maximum values of the training set are used, so there is no data leakage.
  • transform() is used on the test set instead of fit_transform(), so the scaler is not refit on the test data.
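
To make the arithmetic behind both scalers concrete, the sketch below applies the two formulas by hand with NumPy. The numbers are invented, and the statistics are computed on the training values only and then reused for the test values, mirroring the fit/transform split.

```python
import numpy as np

# Invented single-feature data.
train = np.array([[20.0], [30.0], [40.0]])
test = np.array([[25.0], [50.0]])

# Standardization: subtract the training mean, divide by the training std.
mean, std = train.mean(axis=0), train.std(axis=0)
train_std = (train - mean) / std
test_std = (test - mean) / std

# Min-max normalization: X' = (X - min) / (max - min), using training min and max.
lo, hi = train.min(axis=0), train.max(axis=0)
train_mm = (train - lo) / (hi - lo)
test_mm = (test - lo) / (hi - lo)

print(train_mm.ravel(), test_mm.ravel())
```

Because the test set is scaled with training statistics, a test value outside the training range (50 here) lands outside [0, 1]; that is expected.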

Data Binning

  • Data binning (or bucketing) groups data into bins/buckets: it replaces the values contained in a small interval with a single representative value for that interval.
  • Binning can be applied to
    • convert numeric values to categorical values
      • binning by distance
      • binning by frequency
    • Reduce numeric values: Quantization (or sampling)

Binning

  • Binning is a technique for data smoothing.
  • Data smoothing is employed to remove noise from data. Three techniques are used for data smoothing:
    • binning
    • regression
    • outlier analysis

Binning by Distance Procedures

  • Import the dataset
  • Compute the range of values and find the edges of intervals/bins
  • Define labels
  • Convert numeric values into categorical labels
  • Plot the histogram to see the distribution
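
The procedure above might look like this with pandas' `cut`. It is a sketch on invented values; the choice of three bins and the label names are assumptions, and the histogram step is replaced by a simple count per bin.

```python
import numpy as np
import pandas as pd

# Invented numeric values.
values = pd.Series([1, 7, 12, 18, 25, 30])

# Compute the range and derive the edges of 3 equal-width bins.
edges = np.linspace(values.min(), values.max(), 4)

# Define labels and convert the numeric values into categorical labels.
labels = ["low", "medium", "high"]
binned = pd.cut(values, bins=edges, labels=labels, include_lowest=True)

# A count per bin stands in for the histogram.
print(binned.value_counts())
```

`include_lowest=True` keeps the minimum value inside the first bin, since intervals are otherwise open on the left.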

Binning by Frequency Procedures

  • Import the dataset
  • Define the labels
  • Use qcut of the pandas library for data binning
  • Plot the histogram to see the distribution
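
A short sketch of frequency-based binning with `qcut`, on invented values that include one large outlier. The label names and the choice of three bins are assumptions made for illustration.

```python
import pandas as pd

# Invented values; the outlier (100) would distort equal-width bins.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])

# qcut picks the edges from quantiles so each bin holds the same number of values.
labels = ["low", "medium", "high"]
binned = pd.qcut(values, q=3, labels=labels)

print(binned.value_counts())
```

Each bin receives three observations, even though the bins cover very different value ranges.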

Binning and Sampling Techniques:

  • Binning by mean: Each value in a bin is replaced by the mean value of the bin.
  • Binning by median: Each value in a bin is replaced by the median value of the bin.
  • Binning by boundary: Each value in a bin is replaced by the closest boundary value, i.e. the maximum or minimum value of the bin.
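
The three smoothing variants can be worked through on a small example. This is a sketch with invented, already-sorted data split into three equal-frequency bins of four values; ties in the boundary rule are sent to the lower boundary, which is an assumption.

```python
import numpy as np

# Invented data, sorted and split into 3 bins of 4 values each.
data = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34], dtype=float)
bins = np.sort(data).reshape(3, 4)

# Binning by mean: every value becomes its bin's mean.
by_mean = np.repeat(bins.mean(axis=1, keepdims=True), 4, axis=1)

# Binning by median: every value becomes its bin's median.
by_median = np.repeat(np.median(bins, axis=1, keepdims=True), 4, axis=1)

# Binning by boundary: every value becomes the nearer of its bin's min or max.
lo = bins.min(axis=1, keepdims=True)
hi = bins.max(axis=1, keepdims=True)
by_boundary = np.where(bins - lo <= hi - bins, lo, hi)

print(by_mean[:, 0])
print(by_median[:, 0])
print(by_boundary)
```

For the first bin (4, 8, 9, 15) the mean is 9, the median is 8.5, and the boundary version is 4, 4, 4, 15.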
