2025 Wharton High School Data Science Competition Workbook PDF
Wharton High School
2025
Summary
This workbook guides students through the 2025 Wharton High School Data Science Competition focused on basketball tournament predictions. Students will analyze data and create models to predict competition results. A detailed overview is provided of the competition’s structure, data sets, and required methods.
Full Transcript
Wharton High School Data Science Competition: Basketball Tournament Predictions
2025 Workbook for Phase 1

The purpose of this workbook is to allow you to develop your approach and work through Phase 1 as a team. However, the student team leader is responsible for submitting your team’s Phase 1 answers in Apply.

Competition Prompt and Data Delivery

Crunching the Numbers

You’re the analytics staff for a basketball team, the group behind the numbers that drive performance. It’s the offseason, and your task is clear: analyze the previous season to uncover trends, rethink strategies, and predict future success.

What if you had the power to review the past and rewrite it? What if you could reshuffle team rankings, simulate matchups that never happened, and test your predictions against "what-if" scenarios? That’s the challenge we’re putting to you: dive deep into the numbers from the 2022 NCAA Women's Division 1 College Basketball season and show us what decisions you’d make.

In this competition, teams of high school students will step into the role of basketball analysts, reviewing more than 5,300 games' worth of statistics to (1) rank teams within their regions and (2) predict which teams would win in hypothetical matchups.

We’re handing you a dataset of game-level and team-level stats, including various game summary stats like scores, field goals attempted (FGA), offensive rebounds (OREB), turnovers (TOV), and more. Using this data, your task is to reimagine how the season could have unfolded, creating your own rankings and predicting the outcomes of matchups that never actually happened.

The real 2022 NCAA tournament results exist, but don’t look at them! For this challenge, you’ll be asked to rethink the tournament seeds and evaluate entirely new matchups. The information from the actual tournament should not be used, as it will not be helpful.

How to Succeed

Victory in this competition isn’t just about getting the numbers right. It’s about showcasing your skills in three key areas:
1. Accuracy: How well do your rankings and predictions match up to the simulated outcomes?
2. Methodology: Can you explain your approach clearly and justify your choices with sound reasoning?
3. Communication: Do your findings tell a compelling story, both visually and in your presentation?

This is more than a numbers game: it’s a chance to turn data into strategy, strategy into predictions, and predictions into a bold new vision of college basketball. The question is: do you have what it takes to shape the finals tournament and predict the first round of matchups?

The competition will take place in three phases:

Phase 1: Main Competition
Analyze 5,300+ games' worth of statistics to rank teams within their regions and predict which teams would win in hypothetical matchups. Answers are submitted via the online platform SurveyMonkey Apply (Apply).

Phase 2: Semifinals (top 25 teams invited to participate)
Create a short slide deck that describes and visualizes the team’s methods and findings from Phase 1. Additionally, teams will explore the effects and impacts of home-court advantage. The slide deck will be submitted via Apply.

Phase 3: Finals (top 5 teams invited to participate)
Present the Phase 2 submission material to a panel of judges during a virtual meeting.

The bracket below summarizes the Main Competition tasks. For games in the East region (blue box), winning probabilities will be reported; in the other three regions (pink boxes), teams will be ranked within the region.

Phase 1: Main Competition

The competition heats up as your team dives into the action! Your mission: crunch the numbers, rank the tournament teams, and predict the winners for 10 thrilling game matchups. Your strategy and predictions will be submitted through the online platform Apply. This is where the groundwork is laid, and the sharpest minds will advance to the next stage.
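One possible starting point for the ranking task is a simple league table. The sketch below is illustrative only: the rows are invented toy data, and the field names (`game_id`, `team`, `score`) are hypothetical stand-ins, not the actual column names in the competition files. It assumes one row per team per game, pairs up the two rows of each game, and ranks teams by win percentage and then by average point margin.

```python
from collections import defaultdict

# Toy data: one row per team per game, so each game appears twice.
# Field names and values are invented for illustration only.
rows = [
    {"game_id": 1, "team": "A", "score": 70},
    {"game_id": 1, "team": "B", "score": 62},
    {"game_id": 2, "team": "B", "score": 55},
    {"game_id": 2, "team": "C", "score": 58},
    {"game_id": 3, "team": "A", "score": 66},
    {"game_id": 3, "team": "C", "score": 60},
]


def league_table(rows):
    """Return teams sorted by win percentage, then average point margin."""
    # Group the two rows that belong to the same game.
    games = defaultdict(list)
    for row in rows:
        games[row["game_id"]].append(row)

    stats = defaultdict(lambda: {"games": 0, "wins": 0, "margin": 0})
    for pair in games.values():
        if len(pair) != 2:
            continue  # skip games that do not have exactly two rows
        first, second = pair
        # Credit each team with the result from its own point of view.
        for me, opp in ((first, second), (second, first)):
            s = stats[me["team"]]
            s["games"] += 1
            s["wins"] += int(me["score"] > opp["score"])
            s["margin"] += me["score"] - opp["score"]

    table = [
        {
            "team": team,
            "win_pct": s["wins"] / s["games"],
            "avg_margin": s["margin"] / s["games"],
        }
        for team, s in stats.items()
    ]
    table.sort(key=lambda r: (r["win_pct"], r["avg_margin"]), reverse=True)
    return table


table = league_table(rows)
for rank, entry in enumerate(table, start=1):
    print(rank, entry["team"], f"{entry['win_pct']:.2f}", f"{entry['avg_margin']:+.1f}")
```

A table like this ignores opponent strength, so it is best treated as a baseline to refine rather than a finished ranking.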
Phase 1a: Ranking the Teams

Your analytics staff’s first challenge is to rank the top 16 teams in three of the four regions (West, North, and South) based on their regular season performance. But this isn’t just about win-loss records. Like the Coaches Polls, your rankings should reflect the overall strength and quality of the teams.

To help you get started, each team has already been assigned to one of four regions (East, West, North, South). Note: this setup differs from the actual NCAA tournament to make your work easier and more streamlined. Now, it’s time to make your picks: who are the top contenders?

Phase 1b: Predicting Winning Probabilities

Your team is busy and working overtime; it’s getting intense! You’ll predict the winning probabilities for 10 first-round games in the East Region, including a play-in game. This is your chance to flex your data-driven strategy skills! Here’s the breakdown:
○ One play-in game: Two teams face off for a chance to enter the tournament.
○ Two play-in possibilities: Predict how either winner of the play-in game would perform against the #1 seed in the next round.
○ First-round matchups: Seven games featuring top-ranked teams in the East Region.

But don’t be fooled by the seeding. The highest-seeded team isn’t always the strongest! Your predictions should focus on the stats and strategies that truly matter, not just the seed number.

When you submit your predictions, your submission for each game should be a numeric probability between 0 and 1. Think of it as assigning the higher-seeded team a “chance to win” based on your analysis.

This phase is your chance to think like a pro, balancing hard data with the intuition of a seasoned analyst. Who will advance, and who will fall? It’s up to you to call the shots!

Phase 1c: Tell Your Story - Summarize Your Methodology

Approach this as if your team is presenting its findings to the head coach and front office executives, breaking down how your insights can shape strategy and drive success. Your task is to craft a concise but comprehensive explanation of your approach: how you analyzed the raw data, identified key drivers of performance, and predicted game outcomes to inform critical decisions. Ensure your explanation is detailed enough that your methodology could be replicated by another analytics team, showcasing your work's rigor and transparency.

Your summary should cover these key points:
1. Process:
○ How did you clean or transform the raw data before analysis? Please describe in ~50 words.
○ Did you create any additional variables? Please describe in ~25 words.
2. Tools and Techniques:
○ What software tools did you use? [Select all that apply] And for what purpose? Please describe in ~50 words.
○ What statistical methods did you employ? Please describe in ~100 words.
3. Your Predictions:
○ How did you create team rankings? Please describe in ~50 words.
○ How did you determine game-winning probabilities? Please describe in ~50 words.
4. Your Insights:
○ How did you assess your model performance?
○ Did you use generative AI tools (ChatGPT, etc.)? If so, explain how.
○ Did you use any additional outside data sources? If so, list them and explain why they were relevant.

Files and Data Fields

Competition Data and Resources provided to participants:

Files are provided in two ways:
.csv files in a Box folder
○ Folder link: https://upenn.box.com/v/WHSDSC-2025-basketball
Google Sheets
○ Folder link: https://drive.google.com/drive/folders/1z2-knlH2vDLxa7o6MY_-lFKgta9r7Z0I?usp=sharing

Individual file links are below:

Primary dataset, one season of games: Each row holds one team's results for one game, along with descriptors of the game outcome. Each game has two teams and hence two rows in the dataset. Full season n=5300 games and n=11,600 rows.
○ games_2022 box scores of the NCAA regular season
○ DataDictionary for all fields in the game box scores file

Tournament teams to rank by region: The set of teams with winning records, that is, those that could be considered for inclusion in one of the 4 regions of the tournament.
○ Team Region Groups

Tournament games for predictions: Each row is one game to predict.
○ East Regional Games to predict

*The starting point for the NCAA Women’s basketball stats was a great R package, wehoop. Gilani, Saiem and Hutchinson, Geoffery, wehoop: Access Women’s Basketball Play by Play Data, CRAN: Contributed Packages, 2021. http://doi.org/10.32614/CRAN.package.wehoop

Educational Modules

We are happy to provide some resources for you to get started. This competition isn’t about how much you know about basketball or stats about your favorite star. However, we’ve created this video library so you can learn more about what terms mean and how they may relate to the competition. Our video library is a tool for you to reference throughout the competition. Click the Video Library to access a playlist of all videos.

Educational Modules:
○ Tournament structure
○ Overview of stats & data set
○ Possessions
○ Home Court Advantage
○ 4 Factors in Basketball
○ League Table: Overview
○ League Table: Google Sheets
○ League Table: R
○ League Table: Python
○ League Table: LLM (ChatGPT)
○ Probability: logistic model for probabilities
○ Confounding

These materials were developed for educational purposes associated with the Wharton High School Data Science Competition only. Please do not reuse, reproduce, or distribute outside of your school.
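The logistic model named in the module list turns a difference in team strength into a probability between 0 and 1, which is the format Phase 1b asks for. The following is a minimal sketch, assuming each team already has a numeric strength rating (for example, an average point margin). The ratings and the `scale` parameter are invented for illustration; in practice they would have to be estimated from the game data.

```python
import math


def win_probability(rating_a, rating_b, scale=10.0):
    """Logistic win probability for team A over team B.

    `scale` is a hypothetical tuning parameter, not a value from the
    workbook: larger values pull probabilities closer to 0.5.
    """
    diff = (rating_a - rating_b) / scale
    return 1.0 / (1.0 + math.exp(-diff))


# Invented ratings, for illustration only.
p_even = win_probability(5.0, 5.0)       # evenly matched teams
p_favorite = win_probability(8.0, -2.0)  # stronger team listed first
p_underdog = win_probability(-2.0, 8.0)  # same matchup, teams swapped

print(round(p_even, 3), round(p_favorite, 3), round(p_underdog, 3))
# prints 0.5 0.731 0.269
```

Swapping the two teams gives complementary probabilities, so each matchup only needs to be evaluated once, from the higher-seeded team's side.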