Questions and Answers
What function is used to find missing values in a DataFrame?
NaN is equal to itself.
False
Which function in NumPy checks if an element is NaN?
np.isnan()
To replace missing values in a DataFrame, you can use the function df.______().
Match the following Pandas functions with their purposes:
Which of the following statements is correct about np.nan?
Pandas uses None to represent missing values in DataFrames.
What command would you use to remove all rows with missing values from a DataFrame?
What is the purpose of the 'pd.melt' function in the given content?
The 'Profit' column in the melted DataFrame contains values from both 'New' and 'Old' models.
What are the identifier columns used in the 'pd.melt' function?
Match the following DataFrame components with their description:
What will be the output of the line 'pd.melt(df, id_vars='Type', value_vars=['New', 'Old'])'?
Values from the 'New' column of the original DataFrame are repeated for each corresponding 'Old' value in the melted DataFrame.
Which method can be used to impute missing data?
The outcome of the dropna function is to drop rows that contain NaN values.
What is the primary purpose of scaling and standardization in data preprocessing?
Data can be encoded as _____ vectors using one-of-K encoding.
Match the following terms with their definitions:
Which of the following libraries is scikit-learn built on top of?
Scikit-learn requires data to be in a format other than Numpy or Pandas DataFrame.
What happens when performing arithmetic operations on a DataFrame that contains NaN values?
Using the method dropna with axis set to 0 will remove columns that contain NaN values.
What method is used to fill NaN values in a DataFrame with the mean of a column?
The command df.dropna(axis=0) will remove the rows that have __________ values.
Match the following pandas methods with their descriptions:
What is the purpose of the command df.fillna(df['Num'].mean())?
What will the command df['Num'].sum() return if there are missing values in the 'Num' column?
Mathematical operators in Pandas will ignore NaN values.
What function can replace missing values in a DataFrame with the mean of a specific column?
To drop rows with missing values in a DataFrame, you would use the function df.dropna(axis=__).
Which of the following is NOT a way to handle missing values in Pandas?
Match the missing value terms with their meanings:
Numpy can perform operations on non-numerical missing data types like 'None'.
What binary value will be produced for an input of 0.5 using the Binarizer with a threshold of 0.6?
What happens if you attempt to sum a column in a DataFrame containing NaN using only NumPy functions?
The Binarizer function can take any threshold value, not just 0.6.
What is the purpose of encoding categorical features as integers?
The encoded size values are: 'S' = ______, 'M' = ______, 'L' = ______.
ML models only work with boolean input values.
What is the output of the Binarizer when the input is greater than 0.6?
Study Notes
Data Tidying and Data Preprocessing
- Data moves from raw sources (databases, web, Excel, text files, APIs) to tidying, then to tabular data.
- Tidying arranges data for analysis, using techniques like split-apply-combine, groupby, melt, and pivot.
- Preprocessing transforms raw data into a format suitable for analysis or machine learning.
- Data analysis uses statistics and visualizations to understand data, including machine learning and optimization.
- As a rule of thumb, about 80% of the work is tidying and preprocessing, and about 20% is analysis.
- A dataset is a collection of numerical and/or categorical values.
- A variable groups values measuring the same attribute, criteria, or feature. All values share the same type and units.
- An observation groups the values of multiple variables for a single object, person, item, etc.
- Example data shows a table with Type, Model, and Profit columns. This illustrates 3 variables and 6 observations, representing sales profit for different models.
- Long format data has one column per variable. Wide format has more than one column containing the same variable. Long format is better for machine learning.
- Tidy data properties:
- Each variable is in a single column.
- Each row is a complete observation.
- Each table represents a type of observational unit.
- Wide format to long format (melt/unpivot). Long format to wide format (pivot).
- Pandas is a Python library for data manipulation. Website: http://pandas.pydata.org/ and documentation: http://pandas.pydata.org/pandas-docs/stable/
- Pandas is conventionally imported with: import pandas as pd
- Data Wrangling/Tidying arranges data for analysis, using techniques like Split-Apply-Combine, groupby, melt, and pivot.
- Data Munging/Preprocessing transforms raw data into an appropriate format for data analysis or machine learning.
- Data Analysis uses statistics and visualizations to understand data, often using machine learning, optimization, etc.
- Common Data Preprocessing steps: Textual transformations (e.g., using regular expressions), handling missing values (dropping, replacing), scaling, standardization, binarization, encoding categorical features (as integers, binary vectors).
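As a rough illustration of the melt and pivot steps mentioned in these notes, the sketch below converts a small wide table to long format and back. The data values and the Type labels are made up for illustration; only the column names Type, New, Old, Model, and Profit come from the quiz material.

```python
import pandas as pd

# Hypothetical wide-format table: one profit column per model.
wide = pd.DataFrame({
    "Type": ["A", "B", "C"],
    "New": [10, 20, 30],
    "Old": [5, 15, 25],
})

# Wide -> long: 'New' and 'Old' become values of a single 'Model' column,
# and their numbers are collected in a single 'Profit' column.
long = pd.melt(
    wide,
    id_vars="Type",
    value_vars=["New", "Old"],
    var_name="Model",
    value_name="Profit",
)

# Long -> wide: pivot spreads 'Model' back into separate columns.
wide_again = long.pivot(index="Type", columns="Model", values="Profit").reset_index()
wide_again.columns.name = None  # drop the leftover 'Model' axis label

print(long)
print(wide_again)
```

The long table has 3 variables (Type, Model, Profit) and 6 observations, one row per Type and Model combination.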
Missing Values
- Missing data can include unknown, lost, or wrong data, or empty cells in Excel/CSV files. Incorrect indexing often causes missing values (NaN).
- Missing values (NaN) are handled differently in NumPy and Pandas:
  - NumPy mathematical functions propagate missing values as NaN.
  - Special NumPy functions, like np.nansum(), ignore missing values.
  - In NumPy, missing values can be replaced with a particular value (e.g., 0).
  - In Pandas, handle missing values with methods like df.isna(), df.fillna(), and df.dropna(). Use .mean() to compute a replacement value from the average of the columns or rows (axis = 0 or 1).
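The NumPy and Pandas behaviours listed above can be sketched as follows. The array, the DataFrame, and the column name Other are assumptions made for illustration; Num is the column name used in the quiz questions.

```python
import numpy as np
import pandas as pd

values = np.array([1.0, np.nan, 3.0])  # hypothetical data with one missing value

# NaN is not equal to itself, so use np.isnan() to detect it.
nan_equals_itself = np.nan == np.nan            # False
nan_mask = np.isnan(values)                     # [False, True, False]

# Ordinary NumPy reductions propagate NaN; the nan-aware variants skip it.
plain_sum = np.sum(values)                      # nan
safe_sum = np.nansum(values)                    # 4.0

# Replace NaN with a fixed value (0 here).
zero_filled = np.where(np.isnan(values), 0.0, values)

# Pandas: detect, fill, and drop missing values.
df = pd.DataFrame({"Num": [1.0, np.nan, 3.0], "Other": [4.0, 5.0, 6.0]})
missing_mask = df.isna()                        # True where a value is missing
mean_filled = df.fillna(df["Num"].mean())       # mean of 'Num' is 2.0 (NaN skipped)
rows_dropped = df.dropna(axis=0)                # removes rows containing NaN
cols_dropped = df.dropna(axis=1)                # removes columns containing NaN
pandas_sum = df["Num"].sum()                    # 4.0, Pandas skips NaN by default
```

Note that df.fillna(df['Num'].mean()) fills every missing cell in the DataFrame with the mean of Num, not only the cells in that column.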
Scaling (MinMaxScaler)
- Scaling data to a specific range (e.g., [0, 1]) is often necessary because different ranges across columns can affect results.
- MinMaxScaler scales data to a specified range (e.g., [0, 1]).
- This normalization is useful in many machine learning algorithms.
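A minimal sketch of min-max scaling, assuming scikit-learn's MinMaxScaler and a made-up two-column array whose columns have very different ranges. The manual formula is included only to show what the scaler computes for each column.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data: the second column has a much larger range than the first.
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])

scaler = MinMaxScaler(feature_range=(0, 1))
X_scaled = scaler.fit_transform(X)   # each column now spans [0, 1]

# Equivalent manual computation per column: (x - min) / (max - min)
X_manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

print(X_scaled)
```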
Standardisation
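- Standardisation, also called Z-score normalization, rescales each feature to have a mean of 0 and a standard deviation of 1. It shifts and scales the data without changing the shape of its distribution, so it does not by itself make the data normally distributed.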
- Useful when models assume data is centered around 0 (e.g., RBF kernel in support vector machines (SVM)) or when features have very different variances.
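Standardisation can be sketched with scikit-learn's StandardScaler, which is an assumption here since the notes do not name a specific class. The array is made up, and the manual z-score is shown for comparison.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data with very different column variances.
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])

scaler = StandardScaler()
X_std = scaler.fit_transform(X)

# Equivalent manual z-score per column: (x - mean) / std
# (NumPy's default std is the population standard deviation, which is
# also what StandardScaler uses.)
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_std.mean(axis=0))  # approximately 0 for each column
print(X_std.std(axis=0))   # approximately 1 for each column
```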
L1/L2 Normalisation
- Normalises individual samples (rows) so that each has unit norm (a norm equal to 1).
- L1 and L2 are the standard norms used when comparing similarity/distance between samples in vector spaces.
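A short sketch of row-wise normalisation, assuming scikit-learn's normalize function (the notes do not name one) and a made-up array. Each row is divided by its own L1 or L2 norm.

```python
import numpy as np
from sklearn.preprocessing import normalize

# Hypothetical samples, one per row.
X = np.array([[3.0, 4.0],
              [1.0, 1.0]])

X_l1 = normalize(X, norm="l1")  # each row's absolute values sum to 1
X_l2 = normalize(X, norm="l2")  # each row has Euclidean length 1

print(X_l1)
print(X_l2)
```

For the first row, the L2 norm is 5, so the normalised row is [0.6, 0.8].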
Binarisation
- Convert numerical values to binary (0, 1) based on a threshold: values above the threshold become 1, all others become 0.
- Helpful for some machine learning models requiring boolean input values.
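The thresholding described above can be sketched with scikit-learn's Binarizer, using the threshold of 0.6 from the quiz questions. The input values are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Hypothetical inputs: below, equal to, and above the threshold.
X = np.array([[0.5, 0.6, 0.7]])

binarizer = Binarizer(threshold=0.6)
X_bin = binarizer.fit_transform(X)

# Only values strictly greater than the threshold map to 1.
print(X_bin)
```

The threshold is a free parameter, so any other cut-off can be passed in the same way.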
Encoding Categorical Features
- Many data processing operations require numerical input, so categorical data must be encoded as numerical values.
- Use LabelEncoder (from sklearn.preprocessing) to encode categories as integers (e.g., country codes such as US, ES, and UK mapped to 0, 1, and 2). LabelEncoder assigns the integers in sorted label order.
- Use OneHotEncoder to create one-hot encoded variables (e.g., CN, ES, UK, US represented as independent binary features in new columns).
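A minimal sketch of both encoders, assuming a made-up list of country codes. The integer assigned to each label follows the sorted order of the labels seen during fitting, and the one-hot result is converted to a dense array only for display.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical categorical feature.
countries = ["US", "ES", "UK", "CN", "US"]

# Integer encoding: labels are numbered in sorted order (CN=0, ES=1, UK=2, US=3).
label_encoder = LabelEncoder()
labels = label_encoder.fit_transform(countries)

# One-hot encoding: one binary column per category.
# OneHotEncoder expects a 2-D input, hence the reshape.
onehot_encoder = OneHotEncoder()
onehot = onehot_encoder.fit_transform(
    np.array(countries).reshape(-1, 1)
).toarray()

print(label_encoder.classes_, labels)
print(onehot_encoder.categories_[0])
print(onehot)
```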
Split-Apply-Combine
- Split the data into groups based on a common criterion or on the index (row).
- Apply an operation or calculation to each group.
- Combine the per-group results into a single result.
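The three steps can be written out by hand to make them explicit. This is only a sketch on made-up data; the column names Type and Profit follow the example table in these notes, and the final groupby line is included to show that Pandas performs the same steps in one call.

```python
import pandas as pd

# Hypothetical data: two profit figures per type.
df = pd.DataFrame({
    "Type": ["A", "A", "B", "B", "C", "C"],
    "Profit": [10, 5, 20, 15, 30, 25],
})

# Split: one sub-DataFrame per unique value of 'Type'.
groups = {key: df[df["Type"] == key] for key in df["Type"].unique()}

# Apply: compute the mean profit within each group.
means = {key: group["Profit"].mean() for key, group in groups.items()}

# Combine: collect the per-group results into one Series.
result = pd.Series(means, name="Profit")

# The same three steps in a single groupby call.
shortcut = df.groupby("Type")["Profit"].mean()

print(result)
```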
Groupby
- The result of .groupby() (on a specific column) is not a standard DataFrame; it is a grouped object.
- A grouped object such as groupedby_type holds a separate DataFrame for each unique value of the groupby column.
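A short sketch of what the grouped object looks like in practice. The data is made up, and the assumption here is that groupedby_type was produced by grouping on a Type column.

```python
import pandas as pd

# Hypothetical data.
df = pd.DataFrame({
    "Type": ["A", "A", "B", "B", "C", "C"],
    "Profit": [10, 5, 20, 15, 30, 25],
})

groupedby_type = df.groupby("Type")

# The grouped object is not a DataFrame.
is_dataframe = isinstance(groupedby_type, pd.DataFrame)  # False

# Iterating yields (group key, sub-DataFrame) pairs.
for key, sub_df in groupedby_type:
    print(key, sub_df.shape)

# A single group can be retrieved as an ordinary DataFrame.
group_a = groupedby_type.get_group("A")

# Number of rows in each group.
sizes = groupedby_type.size()
```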
Description
Explore the essential concepts of data tidying and preprocessing in this quiz. Understand the techniques used for arranging and transforming data for effective analysis and machine learning. Test your knowledge on the differences between variables and observations as well as the importance of each step in data analysis.