Questions and Answers
When importing libraries in Python for data analysis, what benefit does using an abbreviation (e.g., as pd) provide?
- It automatically updates the library to the latest version.
- It reduces the memory footprint of the library.
- It speeds up coding by allowing you to avoid typing the full library name each time. (correct)
- It allows the code to run faster by skipping unnecessary functions.
In the context of data preparation, what does 'reading data' primarily involve?
- Loading data from external sources into a Python environment.
- Cleaning and transforming data to ensure it is in a usable format.
- Visualizing data to identify patterns and anomalies.
- Accessing data and loading it into memory for use within a Python program. (correct)
Which of the following is NOT a common issue encountered during data preparation?
- Missing values
- Imbalanced datasets
- Perfectly balanced data (correct)
- Outliers
A researcher is investigating income inequality and omits data from time periods when inequality was low, focusing only on periods of high inequality. What type of missingness does this represent?
When handling missing data, if a data analyst chooses to impute missing values for a qualitative variable, which of the following methods is most appropriate?
In a dataset of customer purchases, a few customers have recorded purchases that are ten times higher than the average purchase amount. These are noted to follow the overall trend of the data. What are these data primarily classified as?
In a dataset analyzing the factors determining house prices, one house is priced significantly higher than similar houses in the same location due to an unrecorded historical significance. How should this be addressed?
What initial step helps to identify if abnormal data points may be present?
Which plot would be most appropriate to visualize the distribution of a single numerical variable to check for outliers?
What should you do if you remove abnormal data?
In the context of imbalanced datasets, what does the term 'class' refer to?
Which of the following imbalance ratios would be considered acceptable?
What is the primary goal of Random Under-Sampling (RUS) in dealing with imbalanced datasets?
When is Random Under-Sampling (RUS) most suitable for addressing imbalanced datasets?
What is a key limitation of using Random Under-Sampling (RUS) to handle imbalanced datasets?
In Random Over-Sampling (ROS), what approach is used to balance the dataset?
When is it most appropriate to use Random Over-Sampling (ROS)?
What is a potential drawback of using Random Over-Sampling (ROS) to address imbalanced datasets?
What is the purpose of the code df.isnull().sum()?
What is the outcome of running the line of code df.dropna(axis=1)?
In Python, what function would you use to import a specific tool (e.g., RandomForestClassifier) from a library (e.g., sklearn.ensemble)?
To identify the number of missing values per row in a Pandas DataFrame, which code snippet would be most suitable, assuming the DataFrame is called df?
How do you install the 'lasio' library when you are not able to import it regularly?
Which pandas function is appropriate for reading a CSV file?
The interquartile range (IQR) is calculated using which of the following?
In order to create a new dataset containing only the values within the limits, which of the following should you implement?
What could be a concern when using the same instances multiple times?
A survey asks respondents to report their income, but higher-income individuals are less likely to respond. This missingness is most accurately classified as:
A dataset has a variable where ages are recorded, but some entries show '150' years. Assuming this is likely a data entry error, the most appropriate initial action would be to:
After applying Random Under-Sampling, you notice that the performance of your fraud detection model decreased significantly. Which action should you take?
Which of the following explains why third-party cookies may need to be enabled during data import?
What is the correct syntax to create a new codesheet?
During data exploration, you notice a patient's age is recorded as -2. The best approach to handle this anomaly is to:
You create x and y variables and then specify: x_ros, y_ros = ros.fit_resample(x, y). What transformation are you trying to achieve?
In anomaly detection, one potential strategy is to establish flagging thresholds using which statistical measure?
Consider a dataset for predicting hospital readmission, where less than 5% of patients are readmitted. Randomly under-sampling the majority class is chosen. What adverse outcome is most liable to affect the model?
A credit card company uses a fraud detection system. By default, these systems flag a small proportion of transactions as fraud when they are in fact legitimate. After undersampling, what may occur?
Which python library is most known for its powerful and flexible tools for data manipulation and analysis?
You are tasked with creating a function that calculates the upper and lower bounds for outlier detection using the IQR method. Assume Q1 and Q3 are already defined. Which code snippet is most efficient?
You are working on a dataset with 10,000 rows and after handling missing values, you realize a critical column has 9,999 rows with the same value and only 1 unique, valid entry. What action would be most justifiable?
Your team is preparing a large dataset of patient records for machine learning analysis but discovers that lab values were measured using different scales. What do you propose to your colleagues so they avoid this "garbage in, garbage out" scenario?
Flashcards
Why import libraries first?
First, ensure necessary libraries are imported; otherwise, code lines must be written manually.
How to import a library
Use 'import library_name as abbreviation' to shorten code.
Main Python libraries
Pandas: Data analysis and manipulation. NumPy: Numerical operations. Matplotlib: Plotting.
How to install a library
Data importing
Data reading
How to read data
Challenge with raw data
Steps for data preparation
Missing data
Missing Not at Random (MNAR)
Missing At Random (MAR)
Missing Completely At Random (MCAR)
Missing data can be informative?
How to handle missing data?
Removing the variable
Removing the observation
How to impute missing data?
Finding missing values total?
Missing values per variable?
Abnormal data points
Extreme values
What are outliers?
Influential points
Tools for detecting data abnormality?
When to use under-sampling?
Problems with under-sampling
Random Over-sampling is good for:
ROS problems
Imbalanced dataset
Study Notes
- Chapter 1 focuses on data preparation.
- The plan includes code sheet creation, library management, data importing, data preparation, and at-home practice.
Code Sheet
- Code sheets can be new or existing.
- Creating a new code sheet involves logging into a Google account, accessing Google Colab, and creating a new notebook.
- Using Chrome is recommended to avoid problems.
- Opening an existing code sheet also requires logging into a Google account and uploading a file to Google Colab.
Libraries
- It's important to import all necessary libraries at the beginning.
- Doing this avoids manually writing each line of code.
- The syntax to import a library is import library_name as abbreviation_library_name.
- Standard libraries are pandas, numpy, matplotlib.pyplot, and seaborn.
- It is not required to import all libraries immediately; import them as needed.
- Individual tools from a library can be imported using syntax like from sklearn.ensemble import RandomForestClassifier.
- Libraries used are available in Google Colab, so no installation is required.
- It's possible to install a library using !pip install library_name.
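The import and installation notes above can be collected into a single preamble cell. This is a minimal sketch using the conventional abbreviations; it assumes seaborn and scikit-learn are available (both are pre-installed in Google Colab):

```python
# Standard imports for data analysis, using the conventional abbreviations.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# A specific tool can also be imported directly from a library.
from sklearn.ensemble import RandomForestClassifier

# If a library is missing (e.g., lasio), it can be installed in Colab with:
# !pip install lasio
```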
Data Import
- Data is imported with the coding sheet and libraries available.
- A simple way to import data is to run from google.colab import files followed by files.upload().
- Upload the file directly from the laptop.
- If third-party cookies are disabled in the browser, an error message will appear.
- It's possible to fix this by going to browser settings and whitelisting https://[*.]googleusercontent.com:443.
Read Data
- Data importing refers to loading data from external sources like CSV/Excel files into Python.
- Reading data specifically means accessing and loading it into memory for use in Python.
- Reading data is typically a part of the importing process.
- Pandas provides powerful and flexible tools for data manipulation.
- Datasets are named "df" for ease of referencing.
- The reading function depends on the type of file containing the data; the file types covered are CSV and Excel.
- When displaying data, it's possible to show all rows, a specified number of rows, or the default number of rows (5).
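As a sketch of the reading step, the snippet below simulates a CSV file in memory with io.StringIO; in Colab you would pass the uploaded file's name to pd.read_csv instead:

```python
import io

import pandas as pd

# A small CSV file simulated in memory; the column names and values
# are hypothetical, chosen only for illustration.
csv_text = "name,age\nAlice,34\nBob,29\nCarol,41\n"
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())    # default preview: the first 5 rows
print(df.head(2))   # a specified number of rows
print(df.shape)     # (number of rows, number of columns)
```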
Data Preparation Process
- Data from the real world is often imperfect.
- Issues, such as errors, missing values, outliers, and imbalanced datasets, are addressed before analysis.
- Data preparation steps: collect data, find impurities, prepare data to address these impurities, and analyze data.
- Pandas and NumPy are used for data preparation.
Missing Data
- Missing data refers to one or more missing values in an observation (row) or variable (column).
- Missing values can be Missing Not At Random (MNAR), Missing At Random (MAR), or Missing Completely At Random (MCAR).
Missing Not At Random (MNAR)
- Missingness is related to unobserved data (the missing value) and not to observed data.
- For instance, a researcher focuses on years when tax fraud occurred and intentionally ignores the years when it did not.
Missing At Random (MAR)
- Missingness is related to observed data and not to unobserved data.
- That is, it depends on other variables in the dataset but not on the missing value itself.
- Example: a survey respondent skips the question "How many books do you read per month?" because of their education level, with no correlation to the actual number of books read.
Missing Completely At Random (MCAR)
- Missingness is unrelated to both observed and unobserved data.
- In this case, missing data is fully random/caused by external factors unrelated to the dataset.
- Understanding missing data types (MNAR, MAR, MCAR) is important for choosing the correct handling strategy, such as removing missing values or using advanced modeling techniques designed to account for the missing data.
- Example: a respondent skips the question "What is your favorite movie?" purely by accident.
Handling Missing Data
- To remove data, columns or rows containing many missing values can be deleted.
- Missing data can be imputed; the type matters for value choice: use the mean/median for quantitative variables and the mode for qualitative variables.
Detecting Missing Values
- Three counts will be covered:
- The total number of missing values.
- The number of missing values per variable (per column).
- The number of missing values per observation (per row).
- .isnull() identifies missing values. The first .sum() calculates the number of missing values per column; chaining a second .sum() calculates the overall total.
Detect Missing All Values
- To find the total number of missing values, use df.isnull().sum().sum().
Detect Missing Values per Variable
- Using only the first .sum(), one can obtain the number of missing values per variable (column); sum(axis=0) achieves the same result.
Detect Missing Values per Observation
- To count missing values per observation (row), use sum(axis=1); otherwise, the count returned is per variable (per column).
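The three counts can be sketched on a small hypothetical DataFrame:

```python
import numpy as np
import pandas as pd

# Hypothetical DataFrame with a few missing values (np.nan).
df = pd.DataFrame({
    "A": [1.0, np.nan, 3.0],
    "B": [np.nan, np.nan, 6.0],
})

per_column = df.isnull().sum()       # missing values per variable (column)
per_row = df.isnull().sum(axis=1)    # missing values per observation (row)
total = df.isnull().sum().sum()      # grand total of missing values
```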
Remove Missing Values
- Using df.dropna(inplace=True) will permanently apply the change to the DataFrame rather than returning a modified copy.
Delete Observations or Variables
- "drop" is used for deleting, and "na" stands for "not available," referring to missing values. df.dropna() removes all rows with missing values; passing axis=1 removes variables (columns) instead.
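A minimal sketch of the two dropna variants, on a hypothetical DataFrame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": [4.0, 5.0, 6.0]})

rows_dropped = df.dropna()         # drops rows containing missing values
cols_dropped = df.dropna(axis=1)   # drops columns containing missing values

# inplace=True modifies df itself instead of returning a new DataFrame.
df.dropna(inplace=True)
```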
Impute Missing Values
- In code like df["A"].fillna(df["A"].mean()), "fill" indicates imputation, "na" refers to the missing values, and "A" is the name of the variable (i.e., the column label). The median can be used instead of the mean, and the mode option is also available.
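A sketch of imputation for both variable types, using hypothetical columns A (quantitative) and color (qualitative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [10.0, np.nan, 20.0],        # quantitative variable
    "color": ["red", None, "red"],    # qualitative variable
})

# Quantitative: impute with the mean (df["A"].median() works the same way).
df["A"] = df["A"].fillna(df["A"].mean())

# Qualitative: impute with the mode (the most frequent value).
df["color"] = df["color"].fillna(df["color"].mode()[0])
```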
Abnormal Data Points
- Some data points are more valuable and contribute to analysis positively, but some introduce bias.
- Problematic data points should be examined carefully to avoid bias.
- These points are abnormally far from other data and represent extreme max/mins.
Types of Abnormal Data Points
- There are extreme values, outliers, and influential points:
- Extreme value: the X value is an abnormal distance away, while the Y value follows the trend; the observation differs significantly from the others.
- Outlier: the X value is a normal distance away, but the Y value doesn't follow the trend; an independent variable that explains the difference isn't in the model.
- Influential point: the X value is an abnormal distance away and the Y value doesn't follow the trend; this point biases the model.
What to do with Abnormal Data Points
- Understanding is essential.
- Obvious error: correct it, like a grade being 150/20.
- Omitted independent variable: include it in the model; location for real estate.
- Data point does not belong to the population: remove it; for example, a Ferrari in a study of ordinary car engine power.
- Population point that differs significantly: Monitor closely.
- As another option, try different models.
- Try models with the problematic point.
- Try models without it.
- Try models with the problematic point after it has been modified.
Detecting Abnormal Points
- Descriptive statistics help: both graphs and summary statistics.
- Graphs include histograms, box plots, and scatter plots.
- Summary statistics include the first (Q1) and third (Q3) quartiles and the interquartile range (IQR).
- Begin with graphs to get a general overview of the dataset.
- NumPy is needed for the computations.
Histograms, Boxplots, and Scatterplots
- These plots can be used to help detect abnormal points.
- Matplotlib is used to create them.
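A minimal Matplotlib sketch, using a hypothetical numeric sample with one suspicious extreme value (the Agg backend is selected only so the script runs without a display):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; in Colab this is not needed
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical sample: most values cluster together, one is far away.
values = np.array([12, 14, 15, 15, 16, 17, 18, 95])

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.hist(values, bins=10)     # histogram: shows the distribution shape
ax1.set_title("Histogram")
ax2.boxplot(values)           # box plot: abnormal points appear as dots
ax2.set_title("Box plot")
plt.close(fig)                # in a notebook, plt.show() would display it
```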
Steps to Limit
- There are typically three steps to computing the limits for abnormal points:
- Compute the quartiles (Q1 and Q3).
- Compute the interquartile range (IQR = Q3 - Q1).
- Compute and show the calculated limits.
Steps to Detect Abnormal points
- It's possible to show points outside the limits.
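The quartile, IQR, and limit steps can be sketched with NumPy; the conventional 1.5 × IQR fences are assumed here, since the notes do not state the multiplier:

```python
import numpy as np

# Hypothetical sample with one suspicious extreme value.
values = np.array([12, 14, 15, 15, 16, 17, 18, 95])

# Step 1: quartiles.
q1, q3 = np.percentile(values, [25, 75])

# Step 2: interquartile range.
iqr = q3 - q1

# Step 3: limits (conventional 1.5 * IQR fences).
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Points outside the limits, and a new dataset containing only the
# values within the limits.
outliers = values[(values < lower) | (values > upper)]
within = values[(values >= lower) & (values <= upper)]
```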
Actions to address problems
- A new dataset can be created containing only the values within the limits.
Modification Of Abnormal Values
- Data points should not be removed unless there is a compelling justification.
- When there is no reason to delete a data point, mitigate its impact by modifying it.
- If abnormal points are removed, report which points were removed.
- Imputing replacement values is another option.
- Handling abnormal data requires caution and careful consideration; Pandas and NumPy are the tools used.
Imbalanced Datasets
- Imbalanced datasets arise in classification problems.
- A dataset is used to build a model, and the model's performance is then evaluated on new data points.
- Features are explanatory variables (qualitative or quantitative).
- The explained variable (qualitative) is the “outcome”, “target” or “class”.
- A challenge with this dataset can be dealing with imbalanced datasets.
Further on Imbalanced Datasets
- In an imbalanced dataset, one class is more represented.
- That can lead to biased results.
- A slight imbalance is not problematic.
- For the purposes of the course, a ratio of up to 40:60 is acceptable.
Action Steps
- Determine whether the dataset is balanced.
Detect Imbalanced Classes
- Identify the features.
- Identify the target.
- Counting the observations in each class of the target helps identify imbalances.
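A sketch of checking the class distribution with value_counts, on a hypothetical fraud-style target:

```python
import pandas as pd

# Hypothetical classification dataset: one feature and a binary target.
df = pd.DataFrame({
    "amount": [10, 250, 13, 8, 500, 12, 11, 9, 14, 10],
    "fraud":  [0, 1, 0, 0, 1, 0, 0, 0, 0, 0],
})

counts = df["fraud"].value_counts()                 # cases per class
ratios = df["fraud"].value_counts(normalize=True)   # proportion per class
```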
Random Under-Sampling (RUS)
- RUS is a data analysis technique for imbalanced datasets that decreases the number of cases in the majority class to make the dataset balanced.
- Steps:
- Find the class distribution to identify the majority and minority classes.
- Randomly select cases from the majority class.
- Remove the selected instances.
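The RUS steps above can be sketched in plain pandas; the imbalanced-learn library offers the same idea via RandomUnderSampler, but a manual version makes each step explicit. The dataset here is hypothetical:

```python
import pandas as pd

# Hypothetical imbalanced dataset: 8 majority cases (y=0), 2 minority (y=1).
df = pd.DataFrame({
    "x": range(10),
    "y": [0] * 8 + [1] * 2,
})

# Step 1: class distribution identifies majority and minority classes.
minority_size = df["y"].value_counts().min()

# Steps 2-3: randomly keep only minority_size majority cases,
# i.e., remove the surplus instances.
majority_kept = df[df["y"] == 0].sample(n=minority_size, random_state=42)
minority = df[df["y"] == 1]
balanced = pd.concat([majority_kept, minority])
```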
Random Over-Sampling (ROS)
- ROS balances datasets by increasing the number of cases in the minority class by randomly duplicating existing cases.
- Steps:
- Find the class distribution to identify the minority class.
- Randomly select cases from the minority class and duplicate them until the classes are balanced.
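The ROS steps can be sketched the same way, duplicating minority cases by sampling with replacement; imbalanced-learn's RandomOverSampler().fit_resample(x, y), referenced in the quiz above, performs the equivalent operation. The dataset is again hypothetical:

```python
import pandas as pd

# Hypothetical imbalanced dataset: 8 majority cases (y=0), 2 minority (y=1).
df = pd.DataFrame({
    "x": range(10),
    "y": [0] * 8 + [1] * 2,
})

majority = df[df["y"] == 0]
minority = df[df["y"] == 1]

# Duplicate randomly chosen minority cases (sampling WITH replacement)
# until the minority class matches the majority class size.
oversampled = minority.sample(n=len(majority), replace=True, random_state=42)
balanced = pd.concat([majority, oversampled])
```

Note the drawback mentioned in the quiz: because the same minority instances appear multiple times, the model may overfit to them.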
Considerations for data sampling
- RUS is most suitable when the dataset is large and highly imbalanced, so the loss of some cases from the majority class won't affect the model's ability to learn.
- Reducing the dataset's size also speeds up model training.
Limitations of Under-Sampling
- Information loss: removing many cases from the majority class can negatively affect the model's performance.
- Underrepresentation: important variations within the majority class might be missed after sampling.
Homework/Practice Exercises
- Two practice homework sets are available.
- One uses the "weight-height" dataset and the other "melbourne_housing."