Questions and Answers
Which of the following is NOT a typical step in preprocessing data for a machine learning (ML) pipeline?
- Dimensionality reduction
- Feature scaling
- Model selection (correct)
- Feature engineering
In the context of machine learning, why is it important to split data into training and test sets?
- To evaluate the model's ability to generalize to unseen data. (correct)
- To increase the overall size of the dataset.
- To ensure the training data is diverse.
- To reduce the computational complexity of the training process.
What is the primary purpose of 'cross-validation' in machine learning?
- To increase the size of the training dataset.
- To reduce the dimensionality of the dataset.
- To estimate the generalization performance of a model. (correct)
- To select the most relevant features.
What does the pandas library offer?
What are the two primary data structures offered by the pandas library?
In pandas, if you have a DataFrame df, how would you access the column named 'temperature'?
What is a key advantage of using Pandas for data analysis?
If you want to find unique values in a pandas DataFrame column named 'color', which method would you use?
Which pandas function would you use to group rows based on the values in one or more columns?
What operation does the following pandas code perform? df[df['age'] > 30]
Which of the following pandas operations is used to get descriptive statistics of a DataFrame?
How can you count the number of occurrences of each unique value in a pandas Series?
In pandas, what is the purpose of the apply() function?
What is the purpose of a pivot table in pandas?
How do you remove rows with missing values in a pandas DataFrame?
In pandas, what is the result of running df.loc[0:5] on a DataFrame df?
What is the correct way to read a CSV file into a pandas DataFrame?
If you want to change the data type of a column named 'amount' in a pandas DataFrame to integer, how would you do it?
What would be the correct code for computing the mean of the 'salary' column in a pandas DataFrame called employee_data?
What is the purpose of the fillna() method in pandas?
How can you sort a pandas DataFrame by the values in a column named 'date' in ascending order?
If you need to select a subset of columns ('A', 'B', 'C') from a pandas DataFrame df, how do you achieve this?
What is the difference between .iloc[] and .loc[] in pandas when accessing data in a DataFrame?
When using the groupby() function in pandas, what is the typical next step after grouping the data?
Which method is most suitable for merging two pandas DataFrames based on a common column?
Flashcards
ML Pipeline
A sequence of steps to build and deploy machine learning models.
Data Preprocessing
First stage of the ML pipeline which transforms raw data into a suitable format.
Train/Test Split
Splitting the data into two separate groups, one for training, one for testing.
Feature Engineering
Learning Algorithm Selection
Hyperparameter Optimization
Generalization Error
Pandas Library
Pandas Series
Pandas DataFrame
Vectorized Operations
Find unique values
Group rows
Create conditional column
Filtering DataFrames
Descriptive Statistics
Counting Values
Searching a Column
Dropping Rows/Columns
Pivot Tables
Selecting Pandas DataFrame Rows
Study Notes
Data Handling with Pandas
- Pandas is used for data handling
- The lecture will cover data handling with pandas
Lecture Agenda
- The lecture will cover the ML Pipeline
- The lecture will cover Prerequisites
- The lecture will cover what the pandas library offers
- Resources will be reviewed in the lecture
- Several Lecture Exercises
- Compulsory Assignment 1 will be described
ML Pipeline
- A diagram is displayed for Machine Learning Pipelines
- Feature Extraction and Scaling are essential for Machine Learning Pipelines
- Labels, Training Datasets, Learning Algorithms and Final Models are all parts of the process
- The final model predicts labels for new data, and an evaluation step checks its predictions
- Model Selection, Cross-Validation, Performance Metrics, and Hyperparameter Optimization are all used
ML Pipeline: Preprocessing
- Preprocessing data is a crucial step in every ML application
- Raw data often has to be processed into a suitable format
- Many ML algorithms require scaling for good performance
- Dimensionality reduction is used during the ML Pipeline
ML Pipeline: Preprocessing
- Separation of data into training and test sets is required
- Models should generalize well, that is, perform well on both the training set and the test set
- Feature engineering can create new features from raw data by transformations
ML Pipeline: Learning algorithm
- Many ML algorithms, optimization methods, etc. are available
- Each algorithm has its own strengths and weaknesses
- The best model is found by comparing performance
- A performance metric must be chosen (accuracy, area under the ROC curve, etc.)
- Cross-validation is used to estimate generalization performance
- Hyperparameter optimization is used when fine-tuning models
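The cross-validation and hyperparameter bullets above can be illustrated with a minimal sketch. It assumes made-up toy data and uses a polynomial fit as a stand-in for a learning algorithm; the function name kfold_mse and the candidate degrees are hypothetical, not taken from the lecture.

```python
import numpy as np

# Hypothetical toy data: y depends roughly linearly on x, plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=60)
y = 2.0 * x + 1.0 + rng.normal(0, 1.0, size=60)

def kfold_mse(x, y, degree, k=5, seed=0):
    """Average validation MSE of a polynomial fit over k folds."""
    indices = np.random.default_rng(seed).permutation(len(x))  # shuffle once
    folds = np.array_split(indices, k)                          # k index blocks
    errors = []
    for i in range(k):
        val_idx = folds[i]                                      # held-out fold
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        coeffs = np.polyfit(x[train_idx], y[train_idx], degree)  # fit on k-1 folds
        pred = np.polyval(coeffs, x[val_idx])                    # predict held-out fold
        errors.append(np.mean((pred - y[val_idx]) ** 2))
    return float(np.mean(errors))

# Treat the polynomial degree as a hyperparameter and compare candidates.
scores = {degree: kfold_mse(x, y, degree) for degree in (1, 2, 4)}
best_degree = min(scores, key=scores.get)
print(scores, best_degree)
```

Each candidate is scored only on folds it was not fitted on, so the comparison reflects generalization rather than fit to the training data.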
ML Pipeline: Evaluation and prediction
- Estimate generalization error using unseen test data
- Track expected prediction performance for future data
- All transformations applied to the training data are also applied to the test data, using the same parameters, whether set by the user or learned by the ML algorithm
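As a rough sketch of the last point, the example below splits a small table into training and test sets and standardizes both with parameters computed from the training set only. The data, the column names, and the 80/20 split are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical feature table; column names are illustrative only.
data = pd.DataFrame({
    "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0, 155.0, 165.0, 175.0, 185.0, 195.0],
    "weight_kg": [50.0, 58.0, 66.0, 80.0, 92.0, 53.0, 61.0, 72.0, 85.0, 99.0],
})

# Hold out 20% of the rows as a test set.
train = data.sample(frac=0.8, random_state=0)
test = data.drop(train.index)

# Learn the scaling parameters from the training data only ...
mean = train.mean()
std = train.std()

# ... and apply those same parameters to both sets.
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std
```

The test set never influences mean or std, which keeps the later error estimate honest.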
Prerequisites
- Anaconda, Miniconda, or Python with a pip package manager should be installed
- The required packages should also be installed
- The environment YAML file is located on Canvas
- Conda can create an environment from the environment.yml file, which lists the project's dependencies
- Start the environment with the command
conda activate dat200_env
- Install the package ipykernel, which provides the IPython integration for Jupyter
- Register the environment as a Jupyter kernel with the command
python -m ipykernel install --user --name=dat200_env
Pandas Library
- Pandas is a free and open-source software library for Python
- Features fast and highly flexible structures for handling relational tables and time series
- Pandas is a tool for handling spreadsheet-like data in Python
- The 2 main data structures are Series (1D array) and DataFrames (DF; a 2D array with named rows and columns)
- Every row and every column in a DF is a Series
- The pandas library is built on top of NumPy
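A small sketch of the two data structures follows. The weekday labels and the temperature and humidity values are invented for illustration.

```python
import numpy as np
import pandas as pd

# A Series: one-dimensional values with an index of labels.
s = pd.Series([21.5, 19.0, 23.1], index=["mon", "tue", "wed"], name="temperature")

# A DataFrame: a two-dimensional table with named rows and columns.
df = pd.DataFrame(
    {"temperature": [21.5, 19.0, 23.1], "humidity": [40, 55, 35]},
    index=["mon", "tue", "wed"],
)

column = df["temperature"]   # selecting a column gives a Series
row = df.loc["tue"]          # selecting a row gives a Series too
values = df.to_numpy()       # the table's values as a NumPy array
```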
Pandas Resources
- The pandas website and documentation are the primary resource
- Community tutorials, including videos, are listed on the official pandas website
- RealPython is another resource
- Pandas is a powerful tool with many commands and options
- ChatGPT can help with recalling syntax
- Asking a precise question about what you want to do gives more useful answers
Common tasks with Pandas
- Creating a DataFrame
- Indexing the rows and columns
- Pandas operations are vectorized: they act on whole columns at once, without explicit loops
- Examples showcase pandas capabilities
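The tasks listed above might look like the following sketch. The fruit table is a made-up example.

```python
import pandas as pd

# Create a DataFrame from a dict of columns, with row labels.
df = pd.DataFrame(
    {"price": [10.0, 20.0, 30.0], "quantity": [3, 1, 2]},
    index=["apple", "pear", "plum"],
)

# Indexing: .loc uses labels, .iloc uses integer positions.
by_label = df.loc["pear", "price"]
by_position = df.iloc[1, 0]
label_slice = df.loc["apple":"pear"]   # .loc slices include the end label
position_slice = df.iloc[0:1]          # .iloc slices exclude the end position

# Vectorized operation: multiply two whole columns without a loop.
df["total"] = df["price"] * df["quantity"]
```

Note that a label slice with .loc includes its end point, while a position slice with .iloc does not.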
Lecture exercises 1
- Load the Iris dataset from the web into a pandas DataFrame
- The link for the Iris dataset is:
https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data
- Set the column names to sepal_length, sepal_width, petal_length, petal_width, and types
- Set the row names to flower_1, flower_2, flower_3, ..., flower_150
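One possible approach to this exercise is sketched below. It assumes the file is comma-separated with no header row. The helper load_iris is a hypothetical name, and it is run here on three sample rows in the file's format so that the sketch works without network access.

```python
import io
import pandas as pd

IRIS_URL = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
COLUMNS = ["sepal_length", "sepal_width", "petal_length", "petal_width", "types"]

def load_iris(source):
    """Read the headerless Iris file and name its columns and rows."""
    df = pd.read_csv(source, header=None, names=COLUMNS)
    df.index = [f"flower_{i}" for i in range(1, len(df) + 1)]
    return df

# Three sample rows in the same format as the file (assumed layout).
sample = io.StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
    "6.3,3.3,6.0,2.5,Iris-virginica\n"
)
iris_sample = load_iris(sample)

# With network access, load_iris(IRIS_URL) would read the full dataset.
```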
Common tasks with Pandas 2
- Finding unique values in Pandas DataFrames
- Grouping rows in Pandas
- Creating a column based on a conditional in Pandas
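These three tasks could be done as in the sketch below. The color and weight columns are hypothetical example data.

```python
import pandas as pd

df = pd.DataFrame({
    "color": ["red", "blue", "red", "green", "blue", "red"],
    "weight": [1.0, 2.5, 3.0, 4.0, 3.5, 2.0],
})

# Unique values of a column, in order of first appearance.
unique_colors = df["color"].unique()

# Group rows by a column, then aggregate each group.
mean_by_color = df.groupby("color")["weight"].mean()

# New Boolean column based on a condition.
df["heavy"] = df["weight"] >= 3

# True counts as 1, so the sum is the number of rows meeting the condition.
n_heavy = df["heavy"].sum()
```

Grouping on its own only defines the groups. An aggregation such as mean(), sum(), or count() is what produces a result per group.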
Lecture exercises 2
- Find the unique values for the column types in a DataFrame
- Compute the column means for each type
- Create a new column named sepal_width >= 3 that contains True or False, depending on whether the column sepal_width is >= 3 (True) or < 3 (False)
- Count how many times the sepal width is >= 3 (you can use the column sepal_width >= 3 for that)
Common tasks with Pandas 3
- Filtering Pandas DataFrames
- Getting descriptive statistics for Pandas DataFrames
- Counting values in a Pandas DataFrame
- Searching a Pandas column for a value
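A short sketch of these four tasks, using an invented table of names, ages, and cities:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Ann", "Bob", "Cem", "Dee", "Eli"],
    "age": [25, 41, 33, 29, 52],
    "city": ["Oslo", "Bergen", "Oslo", "Oslo", "Bergen"],
})

# Filtering: keep only the rows where the Boolean mask is True.
over_30 = df[df["age"] > 30]

# Descriptive statistics (count, mean, std, min, quartiles, max) of numeric columns.
stats = df.describe()

# Count how often each unique value occurs in a column.
city_counts = df["city"].value_counts()

# Search a column for a value.
has_bob = (df["name"] == "Bob").any()
in_bergen = df[df["city"].isin(["Bergen"])]

# View the last rows of selected columns.
last_rows = df[["name", "age"]].tail(2)
```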
Lecture exercises 3
- Count how many times each class occurs (Answer: 50 of each class)
- Create three data subsets from the original DataFrame, one for each kind of flower
- Use conditional row selection based on the column types
- View the last 10 rows of the columns sepal_length and types
Common tasks with Pandas 4
- Dropping rows and columns in a Pandas DataFrame
- Selecting Pandas DataFrame rows based on conditions
- Sorting rows in Pandas DataFrames
- Applying operations to Pandas DataFrames
- Getting Pivot Tables in Pandas
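The tasks above can be sketched as follows. The sales table, its row labels, and the choice to fill the missing value with 0.0 are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "store": ["north", "north", "south", "south"],
        "product": ["tea", "coffee", "tea", "coffee"],
        "sales": [12.0, 30.0, 8.0, None],
    },
    index=["r1", "r2", "r3", "r4"],
)

# Dropping rows and columns.
no_missing = df.dropna()                     # rows with missing values removed
without_row = df.drop(index=["r1"])          # drop a row by label
without_col = df.drop(columns=["product"])   # drop a column by name

# Selecting rows on several conditions: combine masks with & and parentheses.
selected = df[(df["store"] == "north") & (df["sales"] > 20)]

# Sorting rows by a column; missing values are placed last by default.
sorted_df = df.sort_values("sales", ascending=True)

# Applying a function to every value of a column.
doubled = df["sales"].apply(lambda value: value * 2)

# Pivot table: one row per store, one column per product, summed sales.
filled = df.fillna({"sales": 0.0})           # replace the missing value first
pivot = filled.pivot_table(index="store", columns="product", values="sales", aggfunc="sum")
```

These methods return new objects and leave df unchanged unless the result is assigned back.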
Lecture exercises 4
- View the rows where sepal_length > 5 and petal_width < 0.2
- Make a new DataFrame containing only rows where petal_width is exactly 1.8
- Get the descriptive statistics for the whole DataFrame and afterwards just for the column petal_length
- Remove the rows named flower_55 and flower_77
- Remove the column sepal_width >= 3
- View all rows of sepal_length where petal_width is exactly 1.8
- Get the values of the DataFrame stored in a numpy array
- Remove the column types and apply a function named computation to each cell in the DataFrame
Compulsory Assignment 1
- Compulsory Assignment 1 is posted on Canvas
- The tasks in CA1 are similar to the exercises in the lecture
- Start the CAs early, or the work will become overwhelming before the deadline.