Questions and Answers
What is the purpose of loading the MLBAM data in the analysis?
- To visualize baseball statistics without any coding
- To collect data from different sports leagues
- To perform run expectancy analysis using game event data (correct)
- To create a new data set from scratch
Which programming language is primarily used for the analysis discussed?
- C++
- Python
- Java
- R (correct)
How many rows are available in the MLBAM 2018 data set?
- 62,000
- 200,000
- 185,771 (correct)
- 100,000
What is the first step when starting the analysis in the Jupyter notebook?
Which column is immediately dropped from the MLBAM data and why?
What data source does the book 'Analyzing Baseball Data with R' primarily use?
What is the purpose of setting the option to display a maximum of 100 columns?
How many columns are included in the MLBAM 2018 data set?
What does the variable start1B represent in the run expectancy analysis?
What is represented by the term NaN in the data set?
How does the startOuts variable function within the run expectancy framework?
In the context of this analysis, what does runsFuture represent?
Which variable would indicate if second base is occupied before the plate appearance?
What coding value is assigned to start1 if start1B is null?
What does the 'stand' variable represent in the dataset?
Which of the following variables indicates whether first base is occupied after the plate appearance?
What does 'start1' represent in this context?
How is the 'start_state' variable created?
What is the purpose of converting the variables to strings using 'astype string'?
What does 'end1' signify after the plate appearance?
What does the value of '0' represent for the variable 'end1'?
What is included in the 'end_state' variable?
Which statement is true regarding the occupancy variables 'start1', 'start2', and 'start3'?
What is the significance of the space added in the concatenation of the state variables?
What event is excluded from the run expectancy analysis due to its unrelatedness to batting performance?
Why is it important to ensure that outsInInning equals 3 in the analysis?
What condition is applied to subset the data in the analysis?
After preparing the data, how many rows are retained for further analysis?
What is the main goal of the changes made to the data set?
What type of event might occur in a game that would require exclusion from expectancy analysis?
What does the run expectancy analysis aim to evaluate in baseball?
What is the effect of excluding observations where start_state equals end_state?
Flashcards
Run Expectancy
A statistic that estimates the expected number of runs a team is likely to score in a given situation.
Event by Event Data
This refers to all events that occurred within a given Major League Baseball season, providing detailed information for each play.
Run Expectancy Variables
A set of variables that are important for analyzing the run expectancy.
Retrosheet Data
R (programming language)
Data Cleaning
Python
Jupyter Notebook
start1B
start2B
start3B
end1B
end2B
end3B
startOuts
endOuts
Start State in Baseball
Start1 variable
End State in Baseball
End1 Variable
Concatenation
Astype String
Outs in Baseball
Start State
End State
Runs on Play
Foul Air
Outs in Inning
Data Preparation
Evaluating Player Performance
Study Notes
Run Expectancy Analysis
- Run expectancy (RE) variables are created for analysis
- Analyzing Baseball Data with R, by Marchi and Albert (2013) is a useful resource for further reading
- The Jupyter notebook uses MLBAM (Major League Baseball Advanced Media) data
- The general process of analyzing baseball data is similar to the book's, whether it is done in R or Python
- Packages pandas (pd) and numpy (np) are loaded
- MLBAM 2018 data is read into a variable (MLBAM18)
- The unnecessary 'Unnamed: 0' column (a leftover row index) is dropped
- Maximum display columns set to 100, so that all of the data set's columns are shown when it is displayed
- Data contains event-by-event data for 2018 baseball season
- Data has 185,771 rows and 62 columns
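The loading steps above can be sketched as follows. This is a minimal illustration, not the notebook's own code: the actual file name is not given in these notes, so a tiny in-memory CSV with made-up rows stands in for the real 185,771-row file.

```python
import io

import pandas as pd

# Hypothetical three-row stand-in for the MLBAM 2018 CSV. The real file
# name and most of its 62 columns are not shown in these notes.
csv_text = """,batterName,eventType,start1B,startOuts
0,Batter A,Single,,0
1,Batter B,Strikeout,Runner X,0
2,Batter C,Walk,Runner X,1
"""

# In the notebook this would be pd.read_csv on the real file path instead.
MLBAM18 = pd.read_csv(io.StringIO(csv_text))

# The blank first header is a saved row index; pandas labels it 'Unnamed: 0'.
MLBAM18 = MLBAM18.drop(columns=["Unnamed: 0"])

# Show up to 100 columns so a wide frame is not truncated when displayed.
pd.set_option("display.max_columns", 100)

print(MLBAM18.shape)  # (3, 4) for this toy sample
```

On the real data, the same `shape` check is what would report the row and column counts.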
Run Expectancy Variables
- Variables to keep for analysis are identified (batterName, batterID, eventType, etc.)
- A new data frame 'RE18' is created from the MLBAM18 data by selecting the specified columns
- Variables 'start1B', 'start2B', 'start3B' represent base occupation before plate appearance
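A short sketch of the column selection, assuming a toy frame in place of MLBAM18. The `keep` list here contains only names mentioned in these notes; the full list used in the notebook is not reproduced, and the `venue` column is purely hypothetical.

```python
import pandas as pd

# Toy stand-in for MLBAM18 with one extra column that will not be kept.
MLBAM18 = pd.DataFrame({
    "batterName": ["Batter A", "Batter B"],
    "batterID": [1, 2],
    "eventType": ["Single", "Walk"],
    "start1B": [None, "Runner X"],
    "venue": ["Park 1", "Park 1"],  # hypothetical column, dropped below
})

# Columns to carry into the run expectancy frame (partial, illustrative list).
keep = ["batterName", "batterID", "eventType", "start1B"]

# .copy() makes RE18 independent of MLBAM18, so adding columns later
# does not trigger pandas' chained-assignment warnings.
RE18 = MLBAM18[keep].copy()
```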
Base State Variables
- These variables track base occupation prior to each plate appearance
- Null values ('NaN') indicate a base is not occupied
- Variables 'start1', 'start2', and 'start3' are indicators of whether first, second, and third base are occupied, respectively (0 when the corresponding 'start1B', 'start2B', or 'start3B' value is null, 1 otherwise)
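One way to derive the occupancy indicators from the runner columns is sketched below. It assumes the 0/1 coding (0 for a null runner, 1 otherwise) and uses made-up runner names; the notebook may express the same idea differently.

```python
import numpy as np
import pandas as pd

# Toy rows: NaN means no runner on that base before the plate appearance.
RE18 = pd.DataFrame({
    "start1B": ["Runner X", np.nan, "Runner Y"],
    "start2B": [np.nan, np.nan, "Runner Z"],
    "start3B": [np.nan, "Runner W", np.nan],
})

# For each base, write 0 where the runner column is null and 1 otherwise.
for base in ("1", "2", "3"):
    RE18["start" + base] = np.where(RE18["start" + base + "B"].isna(), 0, 1)
```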
Plate Appearance State
- The 'start_state' variable is created by concatenating 'start1', 'start2', 'start3', and the number of outs prior to the plate appearance
- The new variable gives a comprehensive state of the game before a given plate appearance.
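The concatenation can be sketched like this, assuming the indicators and 'startOuts' are stored as integers. Each piece is converted with `astype(str)` first, because `+` on integer columns would add the numbers rather than join them; the space separates the base code from the out count.

```python
import pandas as pd

# Toy indicator columns plus the number of outs before the plate appearance.
RE18 = pd.DataFrame({
    "start1": [1, 0, 1],
    "start2": [0, 0, 1],
    "start3": [0, 1, 0],
    "startOuts": [0, 2, 1],
})

# Join the three base indicators, a space, and the outs into one label.
RE18["start_state"] = (
    RE18["start1"].astype(str)
    + RE18["start2"].astype(str)
    + RE18["start3"].astype(str)
    + " "
    + RE18["startOuts"].astype(str)
)

print(RE18["start_state"].tolist())  # ['100 0', '001 2', '110 1']
```

Here '100 0' reads as a runner on first only with no outs.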
Plate Appearance State (End of Play)
- End-of-play state variables ('end1', 'end2', 'end3') are created
- A new variable 'end_state' is produced in the same way as 'start_state', by concatenating 'end1', 'end2', 'end3', and the number of outs after the plate appearance.
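Because the start and end states are built the same way, the logic can be wrapped in a small helper. The function name `add_state` is hypothetical and only illustrates the pattern; it assumes columns named `<prefix>1`, `<prefix>2`, `<prefix>3`, and `<prefix>Outs`.

```python
import pandas as pd


def add_state(df, prefix):
    """Add '<prefix>_state', e.g. 'end_state', built from base flags and outs."""
    bases = (
        df[prefix + "1"].astype(str)
        + df[prefix + "2"].astype(str)
        + df[prefix + "3"].astype(str)
    )
    df[prefix + "_state"] = bases + " " + df[prefix + "Outs"].astype(str)
    return df


# Toy end-of-play indicators and outs.
RE18 = pd.DataFrame({
    "end1": [1, 0],
    "end2": [0, 0],
    "end3": [0, 1],
    "endOuts": [1, 3],
})

RE18 = add_state(RE18, "end")
```

The same call with the prefix "start" would produce 'start_state' from the corresponding start columns.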
Data Filtering
- Data is filtered, removing:
  - Events where start_state equals end_state and runs on play equals zero (nothing changed on the play)
  - Unnecessary events such as dropped foul balls, which are unrelated to batting performance
  - Rows from innings where outsInInning isn't equal to 3 (incomplete innings)
- Resulting data set has 184,949 rows and 27 columns
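A minimal sketch of the filtering step, following the convention in Marchi and Albert: a row is kept if the state changed or a run scored, and only rows from complete three-out innings are retained. The column name `runsOnPlay` and the event labels are assumptions for illustration, and the notes do not say whether the notebook removes dropped fouls through this state comparison or by event type.

```python
import pandas as pd

# Toy rows covering each case described in the filtering rules.
RE18 = pd.DataFrame({
    "eventType": ["Single", "Dropped foul", "Strikeout", "Home Run", "Walk"],
    "start_state": ["000 0", "100 1", "100 1", "000 2", "000 0"],
    "end_state": ["100 0", "100 1", "100 2", "000 2", "100 0"],
    "runsOnPlay": [0, 0, 0, 1, 0],     # hypothetical column name
    "outsInInning": [3, 3, 3, 3, 2],
})

# Keep plays that changed the base/out state or scored at least one run.
changed = (RE18["start_state"] != RE18["end_state"]) | (RE18["runsOnPlay"] > 0)

# Keep only plays from innings that reached three outs.
complete = RE18["outsInInning"] == 3

RE18 = RE18[changed & complete]

print(RE18["eventType"].tolist())  # ['Single', 'Strikeout', 'Home Run']
```

The dropped foul is removed because nothing changed and no run scored; the walk is removed because its inning never reached three outs. The solo home run survives even though its start and end states match, since a run scored.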
Description
Explore run expectancy analysis in baseball using R and Python. This quiz focuses on the analysis of MLBAM data and the creation of run expectancy variables. Familiarity with data manipulation in pandas and numpy is beneficial for this exercise.