Pandas for Data Handling

Questions and Answers

Which of the following is NOT a typical step in preprocessing data for a machine learning (ML) pipeline?

  • Dimensionality reduction
  • Feature scaling
  • Model selection (correct)
  • Feature engineering

In the context of machine learning, why is it important to split data into training and test sets?

  • To evaluate the model's ability to generalize to unseen data. (correct)
  • To increase the overall size of the dataset.
  • To ensure the training data is diverse.
  • To reduce the computational complexity of the training process.

What is the primary purpose of 'cross-validation' in machine learning?

  • To increase the size of the training dataset.
  • To reduce the dimensionality of the dataset.
  • To estimate the generalization performance of a model. (correct)
  • To select the most relevant features.

What does the pandas library offer?

  • Tools for handling relational tables and time series (correct)

What are the two primary data structures offered by the pandas library?

  • Series and DataFrames (correct)

In pandas, if you have a DataFrame df, how would you access the column named 'temperature'?

  • df['temperature'] (correct)

What is a key advantage of using Pandas for data analysis?

  • Vectorized operations (correct)

If you want to find unique values in a pandas DataFrame column named 'color', which method would you use?

  • df['color'].unique() (correct)

Which pandas function would you use to group rows based on the values in one or more columns?

  • groupby() (correct)

What operation does the following pandas code perform? df[df['age'] > 30]

  • It filters the DataFrame to show rows where the 'age' column is greater than 30. (correct)

Which of the following pandas operations is used to get descriptive statistics of a DataFrame?

  • df.describe() (correct)

How can you count the number of occurrences of each unique value in a pandas Series?

  • series.value_counts() (correct)

In pandas, what is the purpose of the apply() function?

  • To apply a function along an axis of the DataFrame (correct)

What is the purpose of a pivot table in pandas?

  • To reshape and summarize data (correct)

How do you remove rows with missing values in a pandas DataFrame?

  • df.dropna() (correct)

In pandas, what is the result of running df.loc[0:5] on a DataFrame df?

  • It selects rows with index labels 0 through 5 (inclusive). (correct)

What is the correct way to read a CSV file into a pandas DataFrame?

  • df = pd.read_csv('file.csv') (correct)

If you want to change the data type of a column named 'amount' in a pandas DataFrame to integer, how would you do it?

  • df['amount'].astype(int) (correct)

What would be the correct code for computing the mean of the 'salary' column in a pandas DataFrame called employee_data?

  • employee_data['salary'].mean() (correct)

What is the purpose of the fillna() method in pandas?

  • To fill missing values with a specified value (correct)

How can you sort a pandas DataFrame by the values in a column named 'date' in ascending order?

  • df.sort_values(by='date', ascending=True) (correct)

If you need to select a subset of columns ('A', 'B', 'C') from a pandas DataFrame df, how do you achieve this?

  • df[['A', 'B', 'C']] (correct)

What is the difference between .iloc[] and .loc[] in pandas when accessing data in a DataFrame?

  • .iloc[] is for integer-based indexing, while .loc[] is for label-based indexing. (correct)

When using the groupby() function in pandas, what is the typical next step after grouping the data?

  • Applying an aggregation function (correct)

Which method is most suitable for merging two pandas DataFrames based on a common column?

  • df.merge() (correct)

Flashcards

ML Pipeline

A sequence of steps to build and deploy machine learning models.

Data Preprocessing

First stage of the ML pipeline which transforms raw data into a suitable format.

Train/Test Split

Splitting the data into two separate groups, one for training, one for testing.

Feature Engineering

Creating new columns from raw data to enhance model performance.


Learning Algorithm Selection

Choosing the right model and parameters to optimize learning.


Hyperparameter Optimization

Fine-tuning model parameters to improve performance.


Generalization Error

The model's error on unseen test data; it estimates how well the model generalizes.


Pandas Library

Tool for handling relational tables and time series analysis and more.


Pandas Series

One-dimensional labeled array in Pandas.


Pandas DataFrame

Two-dimensional labeled data structure with columns of potentially different types.


Vectorized Operations

Running calculations across multiple values simultaneously.


Find unique values

Finding unique values in dataframes.


Group rows

Splitting rows into groups based on the values of one or more columns.


Create conditional column

Adding a new column whose values depend on whether a specified condition is met.


Filtering DataFrames

Selecting subsets of data based on conditions.


Descriptive Statistics

Getting measures such as the mean, median etc.


Counting Values

Counting how frequently values occur in the set.


Searching a Column

A way to find entries in the column of a dataframe.


Dropping Rows/Columns

Eliminating rows or columns from a DataFrame.


Pivot Tables

Reshaping a table to summarize data, e.g. by averaging.


Selecting Pandas DataFrame Rows

Selecting rows based on conditions.


Study Notes

Data Handling with Pandas

  • Pandas is used for data handling
  • The lecture will cover data handling with pandas

Lecture Agenda

  • The lecture will cover the ML Pipeline
  • The lecture will cover Prerequisites
  • The lecture will cover what the pandas library offers
  • Resources will be reviewed in the lecture
  • Several Lecture Exercises
  • Compulsory Assignment 1 will be described

ML Pipeline

  • A diagram is displayed for Machine Learning Pipelines
  • Feature Extraction and Scaling are essential for Machine Learning Pipelines
  • Labels, Training Datasets, Learning Algorithms and Final Models are all parts of the process
  • New data can be predicted, and the predictions evaluated against labels
  • Model Selection, Cross-Validation, Performance Metrics, and Hyperparameter Optimization are all used

ML Pipeline: Preprocessing

  • Preprocessing data is a crucial step in every ML application
  • Raw data often needs processing to bring it into a suitable format
  • Many ML algorithms require scaling for good performance
  • Dimensionality reduction is used during the ML Pipeline

ML Pipeline: Preprocessing

  • Separation of data into training and test sets is required
  • Models should generalize well
  • Good preprocessing should give good performance on both the training and test sets
  • Feature engineering can create new features from raw data by transformations
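As a minimal sketch of these two steps (column names and values are made up for illustration), feature engineering and a train/test split might look like this in pandas:

```python
import pandas as pd

# Toy measurements (hypothetical values, for illustration only)
df = pd.DataFrame({
    "height_cm": [150, 160, 170, 180, 190, 200],
    "weight_kg": [50, 60, 70, 80, 90, 100],
})

# Feature engineering: derive a new feature from the raw columns
df["bmi"] = df["weight_kg"] / (df["height_cm"] / 100) ** 2

# Train/test split without extra libraries: sample two thirds of the
# rows for training and keep the remaining rows for testing
train = df.sample(frac=2 / 3, random_state=0)
test = df.drop(train.index)
```

In practice a dedicated splitter (e.g. from scikit-learn) is often used, but the idea is the same: the two sets must not overlap.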

ML Pipeline: Learning algorithm

  • Many algorithms are available for ML, optimization methods, etc
  • Each algorithm has its own strengths and weaknesses
  • The best model is found by comparing performance
  • One must choose their performance metrics (accuracy, AUC of ROC, etc.)
  • Cross-validation is used for generalization testing
  • Hyperparameter optimization is used when fine tuning models

ML Pipeline: Evaluation and prediction

  • Estimate generalization error using unseen test data
  • Track expected prediction performance for future data
  • All transformations applied to the training data are also applied to the test data, using the same parameters (whether set by the user or learned by the ML algorithm)
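The last point can be illustrated with standardization: the mean and standard deviation are estimated on the training data only and then reused unchanged on the test data (toy values, for illustration):

```python
import pandas as pd

# Toy train/test data (values invented for illustration)
train = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0]})
test = pd.DataFrame({"x": [2.0, 6.0]})

# Parameters are estimated on the training data only...
mu = train["x"].mean()
sigma = train["x"].std()

# ...and reused unchanged when transforming the test data,
# so the test set never influences the fitted parameters
train_scaled = (train["x"] - mu) / sigma
test_scaled = (test["x"] - mu) / sigma
```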

Prerequisites

  • Anaconda, Miniconda, or Python with a pip package manager should be installed
  • The required packages should also be installed
  • The environment YAML file is located on Canvas
  • Conda can create an environment from the environment.yml file, which provides the different dependencies to the project
  • Activate the environment with the command conda activate dat200_env
  • Install the package ipykernel, which provides the IPython integration for Jupyter
  • The command is python -m ipykernel install --user --name=dat200_env

Pandas Library

  • Pandas is a free and open-source software library for Python
  • Features fast and highly flexible structures for handling relational tables and time series
  • Pandas can serve as a tool for handling spreadsheet-like data in Python
  • The two main data structures are Series (a 1D array) and DataFrames (DF; a 2D array with named rows and columns)
  • Every row and/or column in a DF is a Series
  • The pandas library is built on top of NumPy
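A short illustration of the two structures (the values are made up):

```python
import pandas as pd

# A Series is a one-dimensional labeled array
s = pd.Series([4.0, 5.1, 6.3], index=["a", "b", "c"])

# A DataFrame is a two-dimensional table with named rows and columns
df = pd.DataFrame(
    {"sepal_length": [5.1, 4.9], "sepal_width": [3.5, 3.0]},
    index=["flower_1", "flower_2"],
)

# Every row and column of a DataFrame is itself a Series
col = df["sepal_length"]  # a column as a Series
row = df.loc["flower_1"]  # a row as a Series
```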

Pandas Resources

  • pandas website and documentation is a resource
  • The official pandas website lists community tutorials, including videos
  • RealPython is another resource
  • Pandas is a powerful tool with many commands and options
  • ChatGPT is a useful tool for recalling syntax
  • A precise question about what you want to do yields better answers

Common tasks with Pandas

  • Creating a DataFrame
  • Indexing the rows and columns
  • Pandas operations are vectorized
  • Examples showcase pandas capabilities
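These tasks might look as follows (a sketch with invented column names):

```python
import pandas as pd

# Create a DataFrame from a dictionary of columns
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# Indexing rows and columns
first_row = df.loc[0]       # row by index label
first_cell = df.iloc[0, 0]  # cell by integer position

# Vectorized: one expression operates on whole columns, no Python loop
df["total"] = df["a"] + df["b"]
```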

Lecture exercises 1

  • The Iris dataset should be loaded into a pandas DataFrame from the web
  • The link for Iris dataset is: https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data
  • Set the column names to sepal_length, sepal_width, petal_length, petal_width, and types
  • Set the row names to flower_1, flower_2, flower_3, ..., flower_150
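One possible sketch of this exercise: a three-row inline sample stands in for the UCI download so the snippet runs offline; in the exercise itself, pd.read_csv would be pointed at the URL above.

```python
import io
import pandas as pd

# In the exercise the data comes from the UCI URL given above, e.g.:
#   df = pd.read_csv(url, header=None)
# Here a three-row inline sample stands in for the download:
raw = io.StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
)
df = pd.read_csv(raw, header=None)

# Column names as required by the exercise
df.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "types"]
# Row names flower_1, flower_2, ..., flower_N
df.index = [f"flower_{i}" for i in range(1, len(df) + 1)]
```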

Common tasks with Pandas 2

  • Finding unique values in Pandas Dataframes
  • Grouping rows in Pandas
  • Creating a column based on a conditional in Pandas
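A sketch of these three tasks on a tiny stand-in for the iris data (values are illustrative):

```python
import pandas as pd

# Tiny stand-in for the iris data (values are illustrative)
df = pd.DataFrame({
    "types": ["setosa", "setosa", "virginica"],
    "sepal_width": [3.5, 2.9, 3.1],
})

unique_types = df["types"].unique()                      # unique column values
mean_widths = df.groupby("types")["sepal_width"].mean()  # per-group mean
df["wide"] = df["sepal_width"] >= 3                      # conditional column
```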

Lecture exercises 2

  • Find the unique values for the column types in a dataframe
  • Compute the column mean for each type
  • Create a new column named sepal width >= 3 that contains True or False depending on whether the column sepal_width is >= 3 (True) or < 3 (False)
  • Count how many times the sepal width is >= 3 (you can use the column sepal width >= 3 for that)

Common tasks with Pandas 3

  • Filtering Pandas Dataframes
  • Getting descriptive statistics for Pandas Dataframes
  • Counting values in a Pandas Dataframe
  • Searching a Pandas column for a value
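For example (hypothetical data):

```python
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({
    "age": [25, 35, 45, 35],
    "city": ["Oslo", "Bergen", "Oslo", "Oslo"],
})

over_30 = df[df["age"] > 30]                 # boolean-mask filtering
stats = df.describe()                        # count, mean, std, quartiles
counts = df["city"].value_counts()           # occurrences per unique value
has_bergen = (df["city"] == "Bergen").any()  # search a column for a value
```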

Lecture exercises 3

  • Count how many times each class occurs (Answer: 50 of each class)
  • Create three data subsets from the original dataframe, one for each kind of flower
  • Use conditional row selection based on the column types
  • View the last 10 rows of the columns sepal_length and types

Common tasks with Pandas 4

  • Dropping rows and columns in a Pandas DataFrame
  • Selecting Pandas DataFrame rows based on conditions
  • Sorting rows in Pandas DataFrames
  • Applying Operations Pandas DataFrames
  • Getting Pivot Tables in Pandas
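These operations could be sketched as follows (illustrative data and column names):

```python
import pandas as pd

# Illustrative data; column names follow the iris examples above
df = pd.DataFrame({
    "types": ["a", "a", "b", "b"],
    "petal_length": [1.4, 1.6, 4.0, 4.4],
})

no_first = df.drop(index=[0])              # drop a row by index label
numeric = df.drop(columns=["types"])       # drop a column
by_length = df.sort_values(by="petal_length", ascending=False)
doubled = numeric.apply(lambda col: col * 2)   # apply a function column-wise
pivot = df.pivot_table(values="petal_length", index="types", aggfunc="mean")
```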

Lecture exercises 4

  • View the rows where sepal_length > 5 and petal_width < 0.2
  • Make a new DataFrame containing only rows where petal_width is exactly 1.8
  • Get the descriptive statistics for the whole DataFrame and afterwards just for the column petal_length
  • Remove the rows named flower_55 and flower_77
  • Remove the column sepal_width >= 3
  • View all rows of sepal length where petal_width is exactly 1.8
  • Get the values of the DataFrame stored in a numpy array
  • Remove the column types and apply a function named computation to each cell in the DataFrame
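A sketch of a few of these steps on a stand-in frame; the function name computation comes from the exercise, but its body here is invented:

```python
import pandas as pd

# Stand-in frame; the exercise itself uses the full iris DataFrame
df = pd.DataFrame({
    "sepal_length": [5.1, 4.9],
    "petal_width": [0.1, 1.8],
    "types": ["Iris-setosa", "Iris-virginica"],
})

# Combined conditional row selection
subset = df[(df["sepal_length"] > 5) & (df["petal_width"] < 0.2)]

# The underlying values as a NumPy array
arr = df.values

# Drop the non-numeric column, then apply a function to each cell
def computation(x):
    # hypothetical body; the exercise only names the function
    return x * 2

numeric = df.drop(columns=["types"])
result = numeric.apply(lambda col: col.map(computation))
```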

Compulsory Assignment 1

  • Compulsory Assignment 1 is posted on Canvas
  • The tasks in CA1 are similar to the exercises in the lecture
  • Start the CAs early, or they will become overwhelming before the deadline
