Data Preparation: Code Sheets and Libraries

Questions and Answers

When importing libraries in Python for data analysis, what benefit does using an abbreviation (e.g., as pd) provide?

  • It automatically updates the library to the latest version.
  • It reduces the memory footprint of the library.
  • It speeds up coding by allowing you to avoid typing the full library name each time. (correct)
  • It allows the code to run faster by skipping unnecessary functions.

In the context of data preparation, what does 'reading data' primarily involve?

  • Loading data from external sources into a Python environment.
  • Cleaning and transforming data to ensure it is in a usable format.
  • Visualizing data to identify patterns and anomalies.
  • Accessing data and loading it into memory for use within a Python program. (correct)

Which of the following is NOT a common issue encountered during data preparation?

  • Missing values
  • Imbalanced datasets
  • Perfectly balanced data (correct)
  • Outliers

A researcher is investigating income inequality and omits data from time periods when inequality was low, focusing only on periods of high inequality. What type of missingness does this represent?

Answer: Missing Not At Random (MNAR)

When handling missing data, if a data analyst chooses to impute missing values for a qualitative variable, which of the following methods is most appropriate?

Answer: Using the mode of the variable

In a dataset of customer purchases, a few customers have recorded purchases that are ten times higher than the average purchase amount. These are noted to follow the overall trend of the data. What are these data primarily classified as?

Answer: Extreme values

In a dataset analyzing the factors determining house prices, one house is priced significantly higher than similar houses in the same location due to an unrecorded historical significance. How should this be addressed?

Answer: Include it in the model and monitor it closely.

What initial step helps to identify if abnormal data points may be present?

Answer: Creating graphs like histograms, boxplots, and scatter plots.

Which plot would be most appropriate to visualize the distribution of a single numerical variable to check for outliers?

Answer: Histogram

What should you do if you remove abnormal data?

Answer: Report the removed points as missing data.

In the context of imbalanced datasets, what does the term 'class' refer to?

Answer: A unique category or outcome in the target variable.

Which of the following imbalance ratios would be considered acceptable?

Answer: 60:40

What is the primary goal of Random Under-Sampling (RUS) in dealing with imbalanced datasets?

Answer: To decrease the number of instances in the majority class.

When is Random Under-Sampling (RUS) most suitable for addressing imbalanced datasets?

Answer: When the dataset is very large and reducing its size improves the training process.

What is a key limitation of using Random Under-Sampling (RUS) to handle imbalanced datasets?

Answer: It may result in the loss of potentially valuable information from the majority class.

In Random Over-Sampling (ROS), what approach is used to balance the dataset?

Answer: Instances of the minority class are duplicated.

When is it most appropriate to use Random Over-Sampling (ROS)?

Answer: When under-sampling the majority class would lead to a significant loss of valuable information.

What is a potential drawback of using Random Over-Sampling (ROS) to address imbalanced datasets?

Answer: It increases the risk of overfitting due to the duplication of instances.

What is the purpose of the code df.isnull().sum()?

Answer: It counts the number of null values in each column.

What is the outcome of running the line of code df.dropna(axis=1)?

Answer: It removes all columns that contain missing values.

In Python, what function would you use to import a specific tool (e.g., RandomForestClassifier) from a library (e.g., sklearn.ensemble)?

Answer: `from sklearn.ensemble import RandomForestClassifier`

To identify the number of missing values per row in a Pandas DataFrame, which code snippet would be most suitable assuming the DataFrame is called df?

Answer: `df.isnull().sum(axis=1)`

How do you install the 'lasio' library when you are not able to import it regularly?

Answer: Using the command `!pip install lasio`

Which pandas function is appropriate for reading a CSV file?

Answer: `pd.read_csv()`

The interquartile range (IQR) is calculated using which of the following?

Answer: Q3 - Q1

In order to create a new dataset containing only the values within the limits, which of the following should you implement?

Answer: `df[(df.Height>lower_bound)&(df.Height<upper_bound)]`

What could be a concern when using the same instances multiple times?

Answer: Overfitting

A survey asks respondents to report their income, but higher-income individuals are less likely to respond. This missingness is most accurately classified as:

Answer: Missing Not at Random (MNAR)

A dataset has a variable where ages are recorded, but some entries show '150' years. Assuming this is likely a data entry error, the most appropriate initial action would be to:

Answer: Mark the '150' entries as missing values and impute appropriately.

After applying Random Under-Sampling, you notice that the performance of your fraud detection model decreased significantly. Which action should you take?

Answer: Switch to Random Over-Sampling or a more sophisticated sampling technique.

Which of the following explains why third-party cookies may need to be enabled during data import?

Answer: To prevent the occurrence of errors related to accessing upload widgets.

What is the correct syntax to create a new codesheet?

Answer: Open Google Colab and click "New notebook" in the pop-up window

During data exploration, you notice a patient's age is recorded as -2. The best approach to handle this anomaly is to:

Answer: Investigate whether this was an intentional means of flagging this record for a specific reason.

You create x and y variables and then specify: x_ros, y_ros = ros.fit_resample(x,y). What transformation are you trying to achieve?

Answer: Randomly oversample the target and the features

In anomaly detection, one potential strategy is to establish flagging thresholds using which statistical measure?

Answer: Interquartile range (IQR)

Consider a dataset for predicting hospital readmission, where less than 5% of patients are readmitted. Randomly under-sampling the majority class is chosen. What adverse outcome is most liable to affect the model?

Answer: Loss of important patterns within the majority class.

A credit card company uses a fraud detection system. By default, these systems flag a small proportion of transactions as fraud when they are in fact legitimate. After undersampling, what may occur?

Answer: An increased false positive rate.

Which python library is most known for its powerful and flexible tools for data manipulation and analysis?

Answer: Pandas

You are tasked with creating a function that calculates the upper and lower bounds for outlier detection using the IQR method. Assume Q1 and Q3 are already defined. Which code snippet is most efficient?

Answer: `lower_bound = Q1 - IQR * 1.5; upper_bound = Q3 + IQR * 1.5`

You are working on a dataset with 10,000 rows and after handling missing values, you realize a critical column has 9,999 rows with the same value and only 1 unique, valid entry. What action would be most justifiable?

Answer: Remove the column, indicating near-zero variance.

Your team is preparing a large dataset of patient records for machine learning analysis but discovers that lab values were measured using different scales. What do you propose to your colleagues so they avoid this "garbage in, garbage out" scenario?

Answer: Standardize measured lab variables with different scales.

Flashcards

Why import libraries first?

Import the necessary libraries first; otherwise their functionality would have to be coded manually.

How to import a library

Use 'import library_name as abbreviation' to shorten code.

Main Python libraries

Pandas: Data analysis and manipulation. NumPy: Numerical operations. Matplotlib: Plotting.

How to install a library

Use '!pip install library_name' to install libraries.

Data importing

Loading data from sources like CSV or Excel into your program.

Data reading

Accessing data in memory for use within your Python program.

How to read data

Use df = pd.read_csv('file.csv') or df = pd.read_excel('file.xlsx').

Challenge with raw data

Real-world data has missing values, errors, and outliers that need addressing.

Steps for data preparation

Identify impurities, prepare data, analyze cleaned data.

Missing data

Missing data in an observation or variable.

Missing Not at Random (MNAR)

Missingness depends on the unobserved (missing) value itself.

Missing At Random (MAR)

Missingness is related to other variables in the dataset.

Missing Completely At Random (MCAR)

Missingness is completely unrelated to observed or unobserved data.

Missing data can be informative?

Yes: the very absence of data may itself carry valuable information about the context.

How to handle missing data?

Choose methods based on missingness type.

Removing the variable

Delete a column with too many missing values.

Removing the observation

Delete a row with missing values.

How to impute missing data?

Use mean/median for quantitative, mode for qualitative.

Finding missing values total?

Use df.isnull().sum().sum() to get the total number of missing values.

Missing values per variable?

Use df.isnull().sum() to show per column.

Abnormal data points

Points abnormally far from other data.

Extreme values

Points far from the other data that also differ from the rest of the population.

What are outliers?

Points that do not follow the trend; they may indicate a missing independent variable.

Influential points

Points that bias the model.

Tools for detecting data abnormality?

Graphs, Summary Statistics, and the IQR Rule.

When to use under-sampling?

Highly imbalanced, very large datasets.

Problems with under-sampling

Loss of valuable data, and unrepresentative samples of data.

Random Over-sampling is good for:

Smaller datasets, where under-sampling would lose valuable information.

ROS problems

Overfitting: duplicated instances can be memorized by the model.

Imbalanced dataset

One class is disproportionately represented compared to others.

Study Notes

  • Chapter 1 focuses on data preparation.
  • The plan includes code sheet creation, library management, data importing, data preparation, and at-home practice.

Code Sheet

  • Code sheets can be new or existing.
  • Creating a new code sheet involves logging into a Google account, accessing Google Colab, and creating a new notebook.
  • Using Chrome is recommended to avoid problems.
  • Opening an existing code sheet also requires logging into a Google account and uploading a file to Google Colab.

Libraries

  • It's important to import all necessary libraries at the beginning.
  • Doing this avoids having to rewrite that functionality manually.
  • The syntax to import a library is import library_name as abbreviation_library_name.
  • Standard libraries are pandas, numpy, matplotlib.pyplot, and seaborn.
  • It is not required to import all libraries immediately; import them as needed.
  • Individual tools from a library can be imported using syntax like from sklearn.ensemble import RandomForestClassifier.
  • Libraries used are available in Google Colab, so no installation is required.
  • It's possible to install a library using !pip install library_name.
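
A minimal import cell following these conventions (the plotting and modeling imports are shown commented out in case those libraries are not installed in the environment):

```python
# Import the standard libraries under their usual abbreviations.
import pandas as pd   # data analysis and manipulation
import numpy as np    # numerical operations

# Plotting and modeling libraries follow the same pattern:
# import matplotlib.pyplot as plt
# import seaborn as sns
# from sklearn.ensemble import RandomForestClassifier

print(pd.__name__, np.__name__)
```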

Data Import

  • Once the code sheet and libraries are in place, the data can be imported.
  • A simple way to import data is to run from google.colab import files followed by files.upload().
  • Upload the file directly from the laptop.
  • If third-party cookies are disabled in the browser, an error message will appear.
  • It’s possible to fix this by going to browser settings and whitelisting https://[*.]googleusercontent.com:443.

Read Data

  • Data importing refers to loading data from external sources like CSV/Excel files into Python.
  • Reading data specifically means accessing and loading it into memory for use in Python.
  • Reading data is typically a part of the importing process.
  • Pandas provides powerful and flexible tools for data manipulation.
  • Datasets are named "df" for ease of referencing.
  • The reading function depends on the file type; CSV and Excel files are handled by different Pandas functions.
  • It’s possible to show all rows, specify a limited number of rows or to show the default number of rows (5).
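
A short sketch of reading and displaying data; the CSV is built in memory so the example is self-contained, but in practice you would pass a file path:

```python
import io
import pandas as pd

# Illustrative in-memory CSV; in practice use pd.read_csv('file.csv')
# or pd.read_excel('file.xlsx') with a real path.
csv_text = "Height,Weight\n170,65\n180,80\n165,60\n172,70\n168,64\n175,72\n"
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())     # default: first 5 rows
print(df.head(2))    # a limited number of rows
print(df)            # all rows
```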

Data Preparation Process

  • Data from the real world is often imperfect.
  • Issues, such as errors, missing values, outliers, and imbalanced datasets, are addressed before analysis.
  • Data preparation steps: collect data, find impurities, prepare data to address these impurities, and analyze data.
  • Pandas and NumPy are used for data preparation.

Missing Data

  • Missing data refers to one or more missing values in an observation (row) or variable (column).
  • Missing values can be Missing Not At Random (MNAR), Missing At Random (MAR), or Missing Completely At Random (MCAR).

Missing Not At Random (MNAR)

  • Missingness is related to unobserved data (the missing value) and not to observed data.
  • For instance, a researcher focuses on years when tax fraud occurred and intentionally ignores the years when it did not.

Missing At Random (MAR)

  • Missingness is related to observed data and not to unobserved data
  • That is, other variables in the dataset but not the missing value itself.
  • Example: A survey respondent skips the question "How many books do you read per month?" because of their education level (an observed variable), not because of the actual number of books they read.

Missing Completely At Random (MCAR)

  • Missingness is unrelated to both observed and unobserved data.
  • In this case, missing data is fully random/caused by external factors unrelated to the dataset.
  • Understanding missing data types (MNAR, MAR, MCAR) is important for choosing the correct strategies to handle it like removing missing values or using advanced modeling techniques designed to account for the missing data.
  • Example: a respondent accidentally skips the question "What is your favorite movie?" for reasons unrelated to the data.

Handling Missing Data

  • To remove data, columns or rows containing many missing values can be deleted.
  • Missing data can be imputed; the type matters for value choice: use the mean/median for quantitative variables and the mode for qualitative variables.

Detecting Missing Values

  • Three levels of detection are covered: the total number of missing values, the number per variable (per column), and the number per observation (per row).
  • .isnull() identifies missing values. Chaining .sum() once counts the missing values per column; chaining a second .sum() totals them over the whole DataFrame.

Detect Missing All Values

  • To find the total number of missing values, use df.isnull().sum().sum().

Detect Missing Values per Variable

  • Using only the first sum, df.isnull().sum(), gives the number of missing values per variable (column); sum(axis=0) is equivalent, since axis=0 is the default.

Detect Missing Values per Observation

  • To count missing values per observation, use sum(axis=1); without it, the count is per variable (per column).
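
The three detection calls, sketched on a small illustrative DataFrame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, np.nan, 3.0],
                   'B': [np.nan, np.nan, 6.0]})

print(df.isnull().sum())           # missing values per variable (column)
print(df.isnull().sum(axis=1))     # missing values per observation (row)
print(df.isnull().sum().sum())     # total number of missing values
```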

Remove Missing Values

  • Using df.dropna(inplace=True) applies the removal of missing values permanently to the DataFrame rather than returning a modified copy.

Delete Observations or Variables

  • "drop" means delete and "na" stands for "not available," referring to missing values. df.dropna() removes all rows containing missing values; df.dropna(axis=1) removes such columns instead.

Impute Missing Values

  • In df['A'].fillna(df['A'].mean()): "fill" indicates imputation, "na" refers to the missing values, and "A" is the name of the variable (the column label). The median can be used instead of the mean, and the mode is also available for qualitative variables.
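
A sketch of both removal and imputation on a toy DataFrame (column names are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, np.nan, 3.0],    # quantitative
                   'B': ['x', None, 'x']})     # qualitative

# Removal: dropna() deletes rows with missing values; axis=1 deletes columns.
rows_removed = df.dropna()
cols_removed = df.dropna(axis=1)

# Imputation: mean (or median) for quantitative, mode for qualitative.
df['A'] = df['A'].fillna(df['A'].mean())
df['B'] = df['B'].fillna(df['B'].mode()[0])

print(df)
```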

Abnormal Data Points

  • Some data points are more valuable and contribute to analysis positively, but some introduce bias.
  • Problematic data points should be examined carefully to avoid bias.
  • These points are abnormally far from other data and represent extreme max/mins.

Types of Abnormal Data Points

  • There are extreme values, outliers, and influential points:
    • Extreme value: the X value is an abnormal distance away, but the Y value follows the trend; the observation significantly differs from the others.
    • Outlier: the X value is a normal distance away, but the Y value doesn't follow the trend; an independent variable that explains the difference isn't in the model.
    • Influential point: the X value is an abnormal distance away and the Y value doesn't follow the trend; this point biases the model.

What to do with Abnormal Data Points

  • Understanding is essential.
  • Obvious error: correct it, e.g., a grade recorded as 150/20.
  • Omitted independent variable: include it in the model, e.g., location for real estate prices.
  • Data point does not belong to the population: remove it, e.g., a Ferrari in a study of ordinary car engine power.
  • Population point that differs significantly: keep it in the model and monitor it closely.
  • As another option, try different models.
    • Try models with the problematic point.
    • Try models without it.
    • Try models with the problematic point after modifying it.

Detecting Abnormal Points

  • Abnormal points are detected with graphs and summary statistics.
  • Useful graphs are histograms, box plots, and scatter plots.
  • Useful summary statistics are the first (Q1) and third (Q3) quartiles and the interquartile range (IQR).
  • Begin with graphs to get a general overview of the dataset.
  • NumPy is needed for the computations.

Histograms, Boxplots, and Scatterplots

  • These plots help reveal abnormal points.
  • Matplotlib is used to create them.
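
A minimal plotting sketch (the non-interactive Agg backend is used so it also runs outside notebooks; the column name and data values are illustrative):

```python
import matplotlib
matplotlib.use('Agg')          # non-interactive backend for scripts
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({'Height': [160, 165, 170, 172, 175, 178, 180, 250]})

fig, axes = plt.subplots(1, 2, figsize=(8, 3))
axes[0].hist(df.Height)        # distribution of a single numerical variable
axes[0].set_title('Histogram')
axes[1].boxplot(df.Height)     # points beyond the whiskers are suspect
axes[1].set_title('Boxplot')
fig.savefig('abnormal_points.png')
```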

Steps to Limit

  • Limiting the dataset to normal points typically involves:
    • Computing the quartiles (Q1 and Q3)
    • Computing the interquartile range (IQR = Q3 - Q1)
    • Showing the calculated limits (Q1 - 1.5 × IQR and Q3 + 1.5 × IQR)

Steps to Detect Abnormal points

  • It's possible to show points outside the limits.
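
The quartile, IQR, and limit computations, sketched end to end (data values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'Height': [160, 165, 170, 172, 175, 178, 180, 250]})

Q1 = df.Height.quantile(0.25)
Q3 = df.Height.quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Points outside the limits
outliers = df[(df.Height < lower_bound) | (df.Height > upper_bound)]
# New dataset containing only the values within the limits
df_clean = df[(df.Height > lower_bound) & (df.Height < upper_bound)]

print(outliers)
print(len(df_clean))
```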

Actions to address problems

  • Create a new dataset that contains only the values within the calculated limits.

Modification Of Abnormal Values

  • Data points should not be removed unless there is a compelling justification.
  • When there is no reason to delete a data point, mitigate its impact by modifying it.
  • If abnormal points are removed:
    • Report the removed points as missing data.
    • Impute values where appropriate.
  • Handling data requires caution and careful consideration; Pandas and NumPy are used throughout.

Imbalanced Datasets

  • Imbalanced datasets arise in classification problems.
  • A dataset is used to build a model, and the model's performance is evaluated on new data points.
  • Features are explanatory variables (qualitative or quantitative).
  • The explained variable (qualitative) is the “outcome”, “target” or “class”.
  • A challenge with this dataset can be dealing with imbalanced datasets.

Further on Imbalanced Datasets

  • In an imbalanced dataset, one class is more represented.
  • That can lead to biased results.
  • A slight imbalance is not problematic.
  • For the purposes of the course, a ratio of up to 60:40 is acceptable.

Action Steps

  • Determine if the data set is balanced

Detect Imbalanced Classes

  • Identify the features
  • Identify the target
  • This can help identify imbalances.

Random Under-Sampling (RUS)

  • RUS is a data analysis technique for imbalanced datasets that decreases the number of cases in the majority class to make the dataset balanced.
  • Steps:
    • Find the class distribution to identify the majority and minority classes.
    • Randomly select instances from the majority class.
    • Remove the selected instances until the classes are balanced.
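
A plain-pandas sketch of these steps (in practice the imbalanced-learn library's RandomUnderSampler automates this; column names and data below are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'feature': range(10),
                   'target': [0] * 8 + [1] * 2})   # imbalanced 8:2

# Step 1: class distribution
counts = df['target'].value_counts()
minority_n = counts.min()

# Steps 2-3: randomly keep only minority_n instances of each class
balanced = df.groupby('target').sample(n=minority_n, random_state=0)

print(balanced['target'].value_counts())
```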

Random Over-Sampling (ROS)

  • ROS balances datasets by increasing the number of cases in the minority class by randomly duplicating existing cases.
  • Steps:
    • Find the class distribution to identify the minority class.
    • Randomly duplicate instances of the minority class (sampling with replacement) until the dataset is balanced.
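
A plain-pandas sketch of ROS; the imbalanced-learn equivalent is `x_ros, y_ros = ros.fit_resample(x, y)` with a RandomOverSampler, and the names below are illustrative:

```python
import pandas as pd

df = pd.DataFrame({'feature': range(10),
                   'target': [0] * 8 + [1] * 2})   # imbalanced 8:2

majority_n = df['target'].value_counts().max()
minority = df[df['target'] == 1]

# Duplicate minority instances (sampling with replacement) until balanced
extra = minority.sample(n=majority_n - len(minority),
                        replace=True, random_state=0)
balanced = pd.concat([df, extra], ignore_index=True)

print(balanced['target'].value_counts())
```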

Considerations for data sampling

  • RUS suits datasets that are highly imbalanced and very large, where the loss of some majority-class cases won't affect the model.
  • Reducing the dataset size can also speed up training.

Imbalanced Dataset Results

  • Information loss: removing many cases from the majority class can negatively affect the model's performance.
  • Underrepresentation: important variations within the majority class might be missed after sampling.

Homework/Practice Exercises

  • Two practice homework sets are available.
  • One uses the "weight-height" dataset and the other "melbourne_housing."
