The Kaggle Book: Data Analysis and Machine Learning (2022) PDF
Document Details
2022
Konrad Banachewicz, Luca Massaron
Summary
This book, "The Kaggle Book," authored by Konrad Banachewicz and Luca Massaron, focuses on data analysis and machine learning techniques for competitive data science on the Kaggle platform. It explores how Kaggle works and provides key learnings from the authors' extensive experience on the site, with over 20 years of combined participation, including 330 competitions and over 2,000 forum posts. This guide is designed to help newcomers navigate Kaggle and experienced users improve their skills.
Full Transcript
The Kaggle Book
Data analysis and machine learning for competitive data science
Konrad Banachewicz
Luca Massaron
BIRMINGHAM—MUMBAI

Packt and this book are not officially connected with Kaggle. This book is an effort from the Kaggle community of experts to help more developers.

The Kaggle Book
Copyright © 2022 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Producer: Tushar Gupta
Acquisition Editor – Peer Reviews: Saby Dsilva
Project Editor: Parvathy Nair
Content Development Editor: Lucy Wan
Copy Editor: Safis Editing
Technical Editor: Karan Sonawane
Proofreader: Safis Editing
Indexer: Sejal Dsilva
Presentation Designer: Pranit Padwal

First published: April 2022
Production reference: 2220422

Published by Packt Publishing Ltd.
Livery Place, 35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-80181-747-9
www.packt.com

Foreword

I had a background in econometrics but became interested in machine learning techniques, initially as an alternative approach to solving forecasting problems. As I started discovering my interest, I found the field intimidating to enter: I didn't know the techniques or the terminology, and I didn't have the credentials that would allow me to break in.

It was always my dream that Kaggle would allow people like me the opportunity to break into this powerful new field. Perhaps the thing I'm proudest of is the extent to which Kaggle has made data science and machine learning more accessible. We've had many Kagglers go from newbies to top machine learners, being hired at places like NVIDIA, Google, and OpenAI, and starting companies like DataRobot.

Luca and Konrad's book helps make Kaggle even more accessible. It offers a guide both to how Kaggle works and to many of the key learnings that they have taken from their time on the site. Collectively, they've been members of Kaggle for over 20 years, entered 330 competitions, made over 2,000 posts to Kaggle forums, and shared over 100 notebooks and 50 datasets. They are both top-ranked users and well-respected members of the Kaggle community.

Those who complete this book should expect to be able to engage confidently on Kaggle – and engaging confidently on Kaggle has many rewards.

Firstly, it's a powerful way to stay on top of the most pragmatic developments in machine learning. Machine learning is moving very quickly. In 2019, over 300 peer-reviewed machine learning papers were published per day. This volume of publishing makes it impossible to stay on top of the literature. Kaggle ends up being a very valuable way to filter which developments matter on real-world problems – and Kaggle is useful for more than keeping up with the academic literature.
Many of the tools that have become standard in the industry have spread via Kaggle. For example, XGBoost in 2014 and Keras in 2015 both spread through the community before making their way into industry.

Secondly, Kaggle offers users a way to "learn by doing." I've heard active Kagglers talk about competing regularly as "weight training" for machine learning. The variety of use cases and problems they tackle on Kaggle makes them well prepared when they encounter similar problems in industry. And because of competition deadlines, Kaggle trains the muscle of iterating quickly. There's probably no better way to learn than to attempt a problem and then see how top performers tackled the same problem (it's typical for winners to share their approaches after the competition).

So, for those of you who are reading this book and are new to Kaggle, I hope it helps make Kaggle less intimidating. And for those who have been on Kaggle for a while and are looking to level up, I hope this book from two of Kaggle's strongest and most respected members helps you get more out of your time on the site.

Anthony Goldbloom
Kaggle Founder and CEO

Contributors

About the authors

Konrad Banachewicz holds a PhD in statistics from Vrije Universiteit Amsterdam. During his period in academia, he focused on problems of extreme dependency modeling in credit risk. In addition to his research activities, Konrad was a tutor and supervised master's students. Starting from classical statistics, he slowly moved toward data mining and machine learning (this was before the terms "data science" or "big data" became ubiquitous). In the decade after his PhD, Konrad worked in a variety of financial institutions on a wide array of quantitative data analysis problems. In the process, he became an expert on the entire lifetime of a data product cycle. He has visited different ends of the frequency spectrum in finance (from high-frequency trading to credit risk, and everything in between), predicted potato prices, and analyzed anomalies in the performance of large-scale industrial equipment. As a person who himself stood on the shoulders of giants, Konrad believes in sharing knowledge with others. In his spare time, he competes on Kaggle ("the home of data science").

I would like to thank my brother for being a fixed point in a chaotic world and continuing to provide inspiration and motivation. Dzięki, Braciszku ("Thanks, little brother").

Luca Massaron is a data scientist with more than a decade of experience in transforming data into smarter artifacts, solving real-world problems, and generating value for businesses and stakeholders. He is the author of bestselling books on AI, machine learning, and algorithms. Luca is also a Kaggle Grandmaster who reached no. 7 in the worldwide user rankings for his performance in data science competitions, and a Google Developer Expert (GDE) in machine learning.

My warmest thanks go to my family, Yukiko and Amelia, for their support and loving patience as I prepared this new book in a long series. My deepest thanks to Anthony Goldbloom for kindly writing the foreword for this book and to all the Kaggle Masters and Grandmasters who have so enthusiastically contributed to its making with their interviews, suggestions, and help. Finally, I would like to thank Tushar Gupta, Parvathy Nair, Lucy Wan, Karan Sonawane, and all of the Packt Publishing editorial and production staff for their support on this writing effort.

About the reviewers
Dr. Andrey Kostenko is a data science and machine learning professional with extensive experience across a variety of disciplines and industries, including hands-on coding in R and Python to build, train, and serve time series models for forecasting and other applications. He believes that lifelong learning and open-source software are both critical for innovation in advanced analytics and artificial intelligence.

Andrey recently assumed the role of Lead Data Scientist at Hydroinformatics Institute (H2i.sg), a specialized consultancy and solution services provider for all aspects of water management. Prior to joining H2i, Andrey had worked as Senior Data Scientist at IAG InsurTech Innovation Hub for over 3 years. Before moving to Singapore in 2018, he worked as Data Scientist at TrafficGuard.ai, an Australian AdTech start-up developing novel data-driven algorithms for mobile ad fraud detection. In 2013, Andrey received his doctorate degree in Mathematics and Statistics from Monash University, Australia. By then, he already had an MBA degree from the UK and his first university degree from Russia. In his spare time, Andrey is often found engaged in competitive data science projects, learning new tools across the R and Python ecosystems, exploring the latest trends in web development, solving chess puzzles, or reading about the history of science and mathematics.

Dr. Firat Gonen is the Head of Data Science and Analytics at Getir. Gonen leads the data science and data analysis teams delivering innovative and cutting-edge machine learning projects. Before Getir, Dr. Gonen managed Vodafone Turkey's AI teams. Prior to Vodafone Turkey, he was the Principal Data Scientist at Dogus Group (one of Turkey's largest conglomerates). Gonen holds extensive educational qualifications, including a PhD in Neuroscience and Neural Networks from the University of Houston, and is an expert in machine learning, deep learning, visual attention, decision-making, and genetic algorithms, with more than 12 years in the field. He has authored several peer-reviewed journal papers. He is also a Kaggle Triple Grandmaster with more than 10 international data competition medals, and was selected as the 2020 Z by HP Data Science Global Ambassador.

About the interviewees

We were fortunate enough to be able to collect interviews from 31 talented Kagglers across the Kaggle community, who we asked to reflect on their time on the platform. You will find their answers scattered across the book. They represent a broad range of perspectives, with many insightful responses that are as similar as they are different. We read each one of their contributions with great interest and hope the same is true for you, the reader. We give thanks to all of them and list them in alphabetical order below.

Abhishek Thakur, who is currently building AutoNLP at Hugging Face.
Alberto Danese, Head of Data Science at Nexi.
Andrada Olteanu, Data Scientist at Endava, Dev Expert at Weights and Biases, and Z by HP Global Data Science Ambassador.
Andrew Maranhão, Senior Data Scientist at Hospital Albert Einstein in São Paulo.
Andrey Lukyanenko, Machine Learning Engineer and TechLead at MTS Group.
Bojan Tunguz, Machine Learning Modeler at NVIDIA.
Chris Deotte, Senior Data Scientist and Researcher at NVIDIA.
Dan Becker, VP Product, Decision Intelligence at DataRobot.
Dmitry Larko, Chief Data Scientist at H2O.ai.
Firat Gonen, Head of Data Science and Analytics at Getir and Z by HP Global Data Science Ambassador.
Gabriel Preda, Principal Data Scientist at Endava.
Gilberto Titericz, Senior Data Scientist at NVIDIA.
Giuliano Janson, Senior Applied Scientist for ML and NLP at Zillow Group.
Jean-François Puget, Distinguished Engineer, RAPIDS at NVIDIA, and the manager of the NVIDIA Kaggle Grandmaster team.
Jeong-Yoon Lee, Senior Research Scientist in the Rankers and Search Algorithm Engineering team at Netflix Research.
Kazuki Onodera, Senior Deep Learning Data Scientist at NVIDIA and member of the NVIDIA KGMON team.
Laura Fink, Head of Data Science at Micromata.
Martin Henze, PhD Astrophysicist and Data Scientist at Edison Software.
Mikel Bober-Irizar, Machine Learning Scientist at ForecomAI and Computer Science student at the University of Cambridge.
Osamu Akiyama, Medical Doctor at Osaka University.
Parul Pandey, Data Scientist at H2O.ai.
Paweł Jankiewicz, Chief Data Scientist & AI Engineer as well as Co-founder of LogicAI.
Rob Mulla, Senior Data Scientist at Biocore LLC.
Rohan Rao, Senior Data Scientist at H2O.ai.
Ruchi Bhatia, Data Scientist at OpenMined, Z by HP Global Data Science Ambassador, and graduate student at Carnegie Mellon University.
Ryan Chesler, Data Scientist at H2O.ai.
Shotaro Ishihara, Data Scientist and Researcher at a Japanese news media company.
Sudalai Rajkumar, an AI/ML advisor for start-up companies.
Xavier Conort, Founder and CEO at Data Mapping and Engineering.
Yifan Xie, Co-founder of Arion Ltd, a data science consultancy firm.
Yirun Zhang, final-year PhD student at King's College London in applied machine learning.

Join our book's Discord space

Join the book's Discord workspace for a monthly Ask me Anything session with the authors: https://packt.link/KaggleDiscord

Table of Contents

Preface

Part I: Introduction to Competitions

Chapter 1: Introducing Kaggle and Other Data Science Competitions
- The rise of data science competition platforms
- The Kaggle competition platform
- A history of Kaggle
- Other competition platforms
- Introducing Kaggle
- Stages of a competition
- Types of competitions and examples
- Submission and leaderboard dynamics
- Explaining the Common Task Framework paradigm
- Understanding what can go wrong in a competition
- Computational resources
- Kaggle Notebooks
- Teaming and networking
- Performance tiers and rankings
- Criticism and opportunities
- Summary

Chapter 2: Organizing Data with Datasets
- Setting up a dataset
- Gathering the data
- Working with datasets
- Using Kaggle Datasets in Google Colab
- Legal caveats
- Summary

Chapter 3: Working and Learning with Kaggle Notebooks
- Setting up a Notebook
- Running your Notebook
- Saving Notebooks to GitHub
- Getting the most out of Notebooks
- Upgrading to Google Cloud Platform (GCP)
- One step beyond
- Kaggle Learn courses
- Summary

Chapter 4: Leveraging Discussion Forums
- How forums work
- Example discussion approaches
- Netiquette
- Summary

Part II: Sharpening Your Skills for Competitions

Chapter 5: Competition Tasks and Metrics
- Evaluation metrics and objective functions
- Basic types of tasks
- Regression
- Classification
- Ordinal
- The Meta Kaggle dataset
- Handling never-before-seen metrics
- Metrics for regression (standard and ordinal)
- Mean squared error (MSE) and R squared
- Root mean squared error (RMSE)
- Root mean squared log error (RMSLE)
- Mean absolute error (MAE)
- Metrics for classification (label prediction and probability)
- Accuracy
- Precision and recall
- The F1 score
- Log loss and ROC-AUC
- Matthews correlation coefficient (MCC)
- Metrics for multi-class classification
- Metrics for object detection problems
- Intersection over union (IoU)
- Dice
- Metrics for multi-label classification and recommendation problems
- MAP@{K}
- Optimizing evaluation metrics
- Custom metrics and custom objective functions
- Post-processing your predictions
- Predicted probability and its adjustment
- Summary

Chapter 6: Designing Good Validation
- Snooping on the leaderboard
- The importance of validation in competitions
- Bias and variance
- Trying different splitting strategies
- The basic train-test split
- Probabilistic evaluation methods
- k-fold cross-validation
- Subsampling
- The bootstrap
- Tuning your model validation system
- Using adversarial validation
- Example implementation
- Handling different distributions of training and test data
- Handling leakage
- Summary

Chapter 7: Modeling for Tabular Competitions
- The Tabular Playground Series
- Setting a random state for reproducibility
- The importance of EDA
- Dimensionality reduction with t-SNE and UMAP
- Reducing the size of your data
- Applying feature engineering
- Easily derived features
- Meta-features based on rows and columns
- Target encoding
- Using feature importance to evaluate your work
- Pseudo-labeling
- Denoising with autoencoders
- Neural networks for tabular competitions
- Summary

Chapter 8: Hyperparameter Optimization
- Basic optimization techniques
- Grid search
- Random search
- Halving search
- Key parameters and how to use them
- Linear models
- Support-vector machines
- Random forests and extremely randomized trees
- Gradient tree boosting
- LightGBM
- XGBoost
- CatBoost
- HistGradientBoosting
- Bayesian optimization
- Using Scikit-optimize
- Customizing a Bayesian optimization search
- Extending Bayesian optimization to neural architecture search
- Creating lighter and faster models with KerasTuner
- The TPE approach in Optuna
- Summary

Chapter 9: Ensembling with Blending and Stacking Solutions
- A brief introduction to ensemble algorithms
- Averaging models into an ensemble
- Majority voting
- Averaging of model predictions
- Weighted averages
- Averaging in your cross-validation strategy
- Correcting averaging for ROC-AUC evaluations
- Blending models using a meta-model
- Best practices for blending
- Stacking models together
- Stacking variations
- Creating complex stacking and blending solutions
- Summary

Chapter 10: Modeling for Computer Vision
- Augmentation strategies
- Keras built-in augmentations
- ImageDataGenerator approach
- Preprocessing layers
- albumentations
- Classification
- Object detection
- Semantic segmentation
- Summary

Chapter 11: Modeling for NLP
- Sentiment analysis
- Open domain Q&A
- Text augmentation strategies
- Basic techniques
- nlpaug
- Summary

Chapter 12: Simulation and Optimization Competitions
- Connect X
- Rock-paper-scissors
- Santa competition 2020
- The name of the game
- Summary

Part III: Leveraging Competitions for Your Career

Chapter 13: Creating Your Portfolio of Projects and Ideas
- Building your portfolio with Kaggle
- Leveraging Notebooks and discussions
- Leveraging Datasets
- Arranging your online presence beyond Kaggle
- Blogs and publications
- GitHub
- Monitoring competition updates and newsletters
- Summary

Chapter 14: Finding New Professional Opportunities
- Building connections with other competition data scientists
- Participating in Kaggle Days and other Kaggle meetups
- Getting spotted and other job opportunities
- The STAR approach
- Summary (and some parting words)

Other Books You May Enjoy
Index

Preface

Having competed on Kaggle for over ten years, both of us have experienced highs and lows over many competitions. We often found ourselves refocusing our efforts on different activities relating to Kaggle. Over time, we devoted ourselves not just to competitions but also to creating content and code based on the demands of the data science market and our own professional aspirations. At this point in our journey, we felt that our combined experience and still-burning passion for competitions could really help other participants who have just started, or who would like to get inspired, to get hold of the essential expertise they need, so they can start their own journey in data science competitions. We then decided to work on this book with a purpose:

- To offer, in a single place, the best tips for being competitive and approaching most of the problems you may find when participating on Kaggle and also other data science competitions.
- To offer enough suggestions to allow anyone to reach at least the Expert level in any Kaggle discipline: Competitions, Datasets, Notebooks, or Discussions.
- To provide tips on how to learn the most from Kaggle and leverage this experience for professional growth in data science.
- To gather in a single source the largest number of perspectives on the experience of participating in competitions, by interviewing Kaggle Masters and Grandmasters and listening to their stories.

In short, we have written a book that demonstrates how to participate in competitions successfully and make the most of all the opportunities that Kaggle offers. The book is also intended as a practical reference that saves you time and effort, through its selection of many competition tips and tricks that are hard to learn about and find on the internet or on Kaggle forums. Nevertheless, the book doesn't limit itself to providing practical help; it also aspires to help you figure out how to boost your career in data science by participating in competitions.

Please be aware: this book doesn't teach you data science from the basics. We don't explain in detail how linear regression or random forests or gradient boosting work, but rather how to use them in the best way and obtain the best results from them in a data problem. We expect solid foundations and at least a basic proficiency in data science topics and Python usage from our readers. If you are still a data science beginner, you need to supplement this book with other books on data science, machine learning, and deep learning, and train up with online courses, such as those offered by Kaggle itself or by MOOCs such as edX or Coursera.

If you want to start learning data science in a practical way, if you want to challenge yourself with tricky and intriguing data problems and simultaneously build a network of great fellow data scientists as passionate about their work in data as you are, this is indeed the book for you. Let's get started!
Who this book is for

At the time of completion of this book, there are 96,190 Kaggle novices (users who have just registered on the website) and 67,666 Kaggle contributors (users who have just filled in their profile) enlisted in Kaggle competitions. This book has been written for all of them and for anyone else wanting to break the ice and start taking part in competitions on Kaggle and learning from them.

What this book covers

Part 1: Introduction to Competitions

Chapter 1, Introducing Kaggle and Other Data Science Competitions, discusses how competitive programming evolved into data science competitions. It explains why the Kaggle platform is the most popular site for these competitions and provides you with an idea about how it works.

Chapter 2, Organizing Data with Datasets, introduces you to Kaggle Datasets, the standard method of data storage on the platform. We discuss setup, gathering data, and utilizing it in your work on Kaggle.

Chapter 3, Working and Learning with Kaggle Notebooks, discusses Kaggle Notebooks, the baseline coding environment. We talk about the basics of Notebook usage, how to leverage the GCP environment, and how to use Notebooks to build up your data science portfolio.

Chapter 4, Leveraging Discussion Forums, allows you to familiarize yourself with discussion forums, the primary manner of communication and idea exchange on Kaggle.

Part 2: Sharpening Your Skills for Competitions

Chapter 5, Competition Tasks and Metrics, details how evaluation metrics for certain kinds of problems strongly influence the way you can operate when building your model solution in a data science competition. The chapter also addresses the large variety of metrics available in Kaggle competitions.

Chapter 6, Designing Good Validation, will introduce you to the importance of validation in data competitions, discussing overfitting, shake-ups, leakage, adversarial validation, different kinds of validation strategies, and strategies for your final submissions.

Chapter 7, Modeling for Tabular Competitions, discusses tabular competitions, mostly focusing on the more recent reality of Kaggle, the Tabular Playground Series. Tabular problems are standard practice for the majority of data scientists, and there is a lot to learn from Kaggle.

Chapter 8, Hyperparameter Optimization, explores how to extend the cross-validation approach to find the best hyperparameters for your models – in other words, those that can generalize in the best way on the private leaderboard – under the pressure and scarcity of time and resources that you experience in Kaggle competitions.

Chapter 9, Ensembling with Blending and Stacking Solutions, explains ensembling techniques for multiple models such as averaging, blending, and stacking. We will provide you with some theory, some practice, and some code examples you can use as templates when building your own solutions on Kaggle.

Chapter 10, Modeling for Computer Vision, discusses problems related to computer vision, one of the most popular topics in AI in general, and on Kaggle specifically. We demonstrate full pipelines for building solutions to challenges in image classification, object detection, and image segmentation.

Chapter 11, Modeling for NLP, focuses on the frequently encountered types of Kaggle challenges related to natural language processing. We demonstrate how to build an end-to-end solution for popular problems like open domain question answering.
Chapter 12, Simulation and Optimization Competitions, provides an overview of simulation competitions, a new class of contests gaining popularity on Kaggle over the last few years.

Part 3: Leveraging Competitions for Your Career

Chapter 13, Creating Your Portfolio of Projects and Ideas, explores ways you can stand out by showcasing your work on Kaggle itself and other sites in an appropriate way.

Chapter 14, Finding New Professional Opportunities, concludes the overview of how Kaggle can positively affect your career by discussing the best ways to leverage all your Kaggle experience in order to find new professional opportunities.

To get the most out of this book

The Python code in this book has been designed to be run on a Kaggle Notebook, without any installation on a local computer. Therefore, don't worry about what machine you have available or what version of Python packages you should install. All you need is a computer with access to the internet and a free Kaggle account.

In fact, to run the code on a Kaggle Notebook (you will find instructions about the procedure in Chapter 3), you first need to open an account on Kaggle. If you don't have one yet, just go to www.kaggle.com and follow the instructions on the website.

We link out to many different resources throughout the book that we think you will find useful. When referred to a link, explore it: you will find code available on public Kaggle Notebooks that you can reuse, or further materials to illustrate concepts and ideas that we have discussed in the book.

Download the example code files

The code bundle for the book is hosted on GitHub at https://github.com/PacktPublishing/The-Kaggle-Book. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Download the color images

We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: https://static.packt-cdn.com/downloads/9781801817479_ColorImages.pdf.

Conventions used

There are a few text conventions used throughout this book.

CodeInText: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. For example: "The dataset will be downloaded to the Kaggle folder as a .zip archive – unpack it and you are good to go."

A block of code is set as follows:

from google.colab import drive
drive.mount('/content/gdrive')

Any command-line input or output is written as follows:

I genuinely have no idea what the output of this sequence of words will be - it will be interesting to find out what nlpaug can do with this!

Bold: Indicates a new term, an important word, or words that you see on the screen, for example, in menus or dialog boxes. For example: "The specific limits at the time of writing are 100 GB per private dataset and a 100 GB total quota."

Further notes, references, and links to useful places appear like this.

Tips and tricks appear like this.

Get in touch

Feedback from our readers is always welcome.

General feedback: Email [email protected], and mention the book's title in the subject of your message. If you have questions about any aspect of this book, please email us at questions@packtpub.com.

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us.
Please visit http://www.packtpub.com/submit-errata, select your book, click on the Errata Submission Form link, and enter the details.

Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit http://authors.packtpub.com.

Share your thoughts

Once you've read The Kaggle Book, we'd love to hear your thoughts! Please click here to go straight to the Amazon review page for this book and share your feedback.

Your review is important to us and the tech community and will help us make sure we're delivering excellent quality content.

Part I: Introduction to Competitions

Chapter 1: Introducing Kaggle and Other Data Science Competitions

Data science competitions have long been around and they have experienced growing success over time, starting from a niche community of passionate competitors, drawing more and more attention, and reaching a much larger audience of millions of data scientists. As longtime competitors on the most popular data science competition platform, Kaggle, we have witnessed and directly experienced all these changes through the years.

At the moment, if you look for information about Kaggle and other competition platforms, you can easily find a large number of meetups, discussion panels, podcasts, interviews, and even online courses explaining how to win in such competitions (usually telling you to use a variable mixture of grit, computational resources, and time invested). However, apart from the book that you are reading now, you won't find any structured guides about how to navigate so many data science competitions and how to get the most out of them – not just in terms of score or ranking, but also professional experience.

In this book, instead of just packaging up a few hints about how to win or score highly on Kaggle and other data science competitions, our intention is to present you with a guide on how to compete better on Kaggle and get back the maximum possible from your competition experiences, particularly from the perspective of your professional life. Also accompanying the contents of the book are interviews with Kaggle Masters and Grandmasters. We hope they will offer you some different perspectives and insights on specific aspects of competing on Kaggle, and inspire the way you will test yourself and learn while doing competitive data science.

By the end of this book, you'll have absorbed the knowledge we drew directly from our own experiences, resources, and learnings from competitions, and everything you need to pave a way for yourself to learn and grow, competition after competition.

As a starting point, in this chapter, we will explore how competitive programming evolved into data science competitions, why the Kaggle platform is the most popular site for such competitions, and how it works.
We will cover the following topics:

- The rise of data science competition platforms
- The Common Task Framework paradigm
- The Kaggle platform and some other alternatives
- How a Kaggle competition works: stages, competition types, submission and leaderboard dynamics, computational resources, networking, and more

The rise of data science competition platforms

Competitive programming has a long history, starting in the 1970s with the first iterations of the ICPC, the International Collegiate Programming Contest. In the original ICPC, small teams from universities and companies participated in a competition that required solving a series of problems using a computer program (at the beginning, participants coded in FORTRAN). In order to achieve a good final rank, teams had to display good skills in team working, problem solving, and programming.

The experience of participating in the heat of such a competition, and the opportunity to stand in a spotlight for recruiting companies, provided the students with ample motivation and made the competition popular for many years. Among ICPC finalists, a few have become renowned: there is Adam D'Angelo, the former CTO of Facebook and founder of Quora; Nikolai Durov, the co-founder of Telegram Messenger; and Matei Zaharia, the creator of Apache Spark. Together with many other professionals, they all share the same experience: having taken part in an ICPC.

After the ICPC, programming competitions flourished, especially after 2000 when remote participation became more feasible, allowing international competitions to run more easily and at a lower cost. The format is similar for most of these competitions: there is a series of problems and you have to code a solution to solve them. The winners are given a prize, but also make themselves known to recruiting companies or simply become famous.

Typically, problems in competitive programming range from combinatorics and number theory to graph theory, algorithmic game theory, computational geometry, string analysis, and data structures. Recently, problems relating to artificial intelligence have successfully emerged, in particular after the launch of the KDD Cup, a contest in knowledge discovery and data mining, held by the Association for Computing Machinery's (ACM's) Special Interest Group (SIG) during its annual conference (https://kdd.org/conferences).

The first KDD Cup, held in 1997, involved a problem about direct marketing for lift curve optimization, and it started a long series of competitions that continues today. You can find the archives containing datasets, instructions, and winners at https://www.kdd.org/kdd-cup. Here is the latest available at the time of writing: https://ogb.stanford.edu/kddcup2021/. KDD Cups proved quite effective in establishing best practices, with many published papers describing solutions, techniques, and competition dataset sharing, which have been useful to many practitioners for experimentation, education, and benchmarking.

The successful examples of both competitive programming events and the KDD Cup inspired companies (such as Netflix) and entrepreneurs (such as Anthony Goldbloom, the founder of Kaggle) to create the first data science competition platforms, where companies can host data science challenges that are hard to solve and might benefit from crowdsourcing. In fact, given that there is no golden approach that works for all the problems in data science, many problems require a time-consuming approach that can be summed up as try all that you can try; the short sketch below illustrates this principle in miniature.
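As an illustration (our own sketch, not taken from the book's code bundle), here two very different learners are tried on the same problem, and the cross-validated scores, rather than theory, decide between them. The dataset and models are arbitrary illustrative choices:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# A small toy classification problem standing in for competition data
X, y = load_breast_cancer(return_X_y=True)

# Try several algorithms and let the empirical scores decide
for model in (LogisticRegression(max_iter=5000),
              RandomForestClassifier(random_state=0)):
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{type(model).__name__}: mean CV accuracy = {scores.mean():.3f}")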
In fact, in the long run, no algorithm can beat all the others on all problems, as stated by the No Free Lunch theorem by David Wolpert and William Macready. The theorem tells you that a machine learning algorithm performs well if and only if its hypothesis space comprises the solution. Consequently, as you cannot know beforehand whether a machine learning algorithm can best tackle your problem, you have to try it, testing it directly on your problem, before being assured that you are doing the right thing. There are no theoretical shortcuts or other holy grails of machine learning – only empirical experimentation can tell you what works.

For more details, you can look up the No Free Lunch theorem for a theoretical explanation of this practical truth. Here is a complete article from Analytics India Magazine on the topic: https://analyticsindiamag.com/what-are-the-no-free-lunch-theorems-in-data-science/.

Crowdsourcing proves ideal in such conditions, where you need to test algorithms and data transformations extensively to find the best possible combinations but lack the manpower and computing power to do so. That's why, for instance, governments and companies resort to competitions in order to advance in certain fields:

- On the government side, we can quote DARPA and its many competitions surrounding self-driving cars, robotic operations, machine translation, speaker identification, fingerprint recognition, information retrieval, OCR, automatic target recognition, and many others.
- On the business side, we can quote a company such as Netflix, which entrusted the outcome of a competition to improve its algorithm for predicting user movie selection.

The Netflix competition was based on the idea of improving existing collaborative filtering. The purpose was simply to predict the potential rating a user would give a film, solely based on the ratings that they gave other films, without knowing specifically who the user was or what the films were. Since no user description or movie title or description was available (all being replaced with identity codes), the competition required entrants to develop smart ways to use the past ratings available. The grand prize of US $1,000,000 was to be awarded only if the solution could improve the existing Netflix algorithm, Cinematch, above a certain threshold.

The competition ran from 2006 to 2009 and saw victory for a team made up of the fusion of many previous competition teams: a team from Commendo Research & Consulting GmbH, Andreas Töscher and Michael Jahrer, also quite renowned in Kaggle competitions; two researchers from AT&T Labs; and two others from Yahoo!. In the end, winning the competition required so much computational power and the ensembling of different solutions that teams were forced to merge in order to keep pace. This situation was also reflected in the actual usage of the solution by Netflix, who preferred not to implement it, but simply took the most interesting insights from it in order to improve its existing Cinematch algorithm. You can read more about it in this Wired article: https://www.wired.com/2012/04/netflix-prize-costs/.

At the end of the Netflix competition, what mattered was not the solution per se, which was quickly superseded by the change in business focus of Netflix from DVDs to online movies.
The real benefit for both the participants, who gained a huge reputation in collaborative filtering, and the company, which could transfer its improved recommendation knowledge to its new business, was the insights that were gained from the competition.

The Kaggle competition platform

Companies other than Netflix have also benefited from data science competitions. The list is long, but we can quote a few examples where the company running the competition reported a clear benefit from it. For instance:

- The insurance company Allstate was able to improve its actuarial models, built by their own experts, thanks to a competition involving hundreds of data scientists (https://www.kaggle.com/c/ClaimPredictionChallenge).
- As another well-documented example, General Electric was able to improve by 40% on the industry-standard performance (measured by the root mean squared error metric) for predicting arrival times of airline flights, thanks to a similar competition (https://www.kaggle.com/c/flight).

The Kaggle competition platform has to this day held hundreds of competitions, and these two are just a couple of examples of companies that used them successfully. Let's take a step back from specific competitions for a moment and talk about the Kaggle company, which is the common thread through this book.

A history of Kaggle

Kaggle took its first steps in February 2010, thanks to Anthony Goldbloom, an Australian-trained economist with a degree in Economics and Econometrics. After working at Australia's Department of the Treasury and the Research department at the Reserve Bank of Australia, Goldbloom interned in London at The Economist, the international weekly newspaper on current affairs, international business, politics, and technology. At The Economist, he had occasion to write an article about big data, which inspired his idea to build a competition platform that could crowdsource the best analytical experts to solve interesting machine learning problems (https://www.smh.com.au/technology/from-bondi-to-the-big-bucks-the-28yearold-whos-making-data-science-a-sport-20111104-1myq1.html). Since the crowdsourcing dynamics played a relevant part in the business idea for this platform, he derived the name Kaggle, which recalls by rhyme the term gaggle, a flock of geese, the goose also being the symbol of the platform.

After moving to Silicon Valley in the USA, his Kaggle start-up received $11.25 million in Series A funding from a round led by Khosla Ventures and Index Ventures, two renowned venture capital firms. The first competitions were rolled out, the community grew, and some of the initial competitors came to be quite prominent, such as Jeremy Howard, the Australian data scientist and entrepreneur who, after winning a couple of competitions on Kaggle, became the President and Chief Scientist of the company.

Jeremy Howard left his position as President in December 2013 and established a new start-up, fast.ai (www.fast.ai), offering machine learning courses and a deep learning library for coders.

At the time, there were some other prominent Kagglers (the name indicating frequent participants of competitions held by Kaggle) such as Jeremy Achin and Thomas de Godoy. After reaching the top 20 global rankings on the platform, they promptly decided to retire and to found their own company, DataRobot.
Soon after, they started hiring their employees from among the best participants in the Kaggle competitions in order to instill the best machine learning knowledge and practices into the software they were developing. Today, DataRobot is one of the leading companies in developing AutoML solutions (software for automatic machine learning).

The Kaggle competitions claimed more and more attention from a growing audience. Even Geoffrey Hinton, the "godfather" of deep learning, participated in (and won) a Kaggle competition hosted by Merck in 2012 (https://www.kaggle.com/c/MerckActivity/overview/winners). Kaggle was also the platform where François Chollet launched his deep learning package Keras during the Otto Group Product Classification Challenge (https://www.kaggle.com/c/otto-group-product-classification-challenge/discussion/13632) and Tianqi Chen launched XGBoost, a speedier and more accurate version of gradient boosting machines, in the Higgs Boson Machine Learning Challenge (https://www.kaggle.com/c/higgs-boson/discussion/10335).

Besides Keras, François Chollet has also provided the most useful and insightful perspective on how to win a Kaggle competition in an answer of his on the Quora website: https://www.quora.com/Why-has-Keras-been-so-successful-lately-at-Kaggle-competitions. Fast iterations of multiple attempts, guided by empirical (more than theoretical) evidence, are actually all that you need. We don't think that there are many more secrets to winning a Kaggle competition than the ones he pointed out in his answer. Notably, François Chollet also hosted his own competition on Kaggle (https://www.kaggle.com/c/abstraction-and-reasoning-challenge/), which is widely recognized as being the first general AI competition in the world.

Competition after competition, the community revolving around Kaggle grew to touch one million in 2017, the same year that, during her keynote at Google Next, Fei-Fei Li, Chief Scientist at Google, announced that Google Alphabet was going to acquire Kaggle. Since then, Kaggle has been part of Google.

Today, the Kaggle community is still active and growing. In one of his tweets (https://twitter.com/antgoldbloom/status/1400119591246852096), Anthony Goldbloom reported that most of its users, other than participating in a competition, have downloaded public data (Kaggle has become an important data hub), created a public Notebook in Python or R, or learned something new in one of the courses offered.

Figure 1.1: A bar chart showing how users used Kaggle in 2020, 2019, and 2018

Through the years, Kaggle has offered many of its participants even more opportunities, such as:

- Creating their own company
- Launching machine learning software and packages
- Getting interviews in magazines (https://www.wired.com/story/solve-these-tough-data-problems-and-watch-job-offers-roll-in/)
- Writing machine learning books (https://twitter.com/antgoldbloom/status/745662719588589568)
- Finding their dream job

And, most importantly, learning more about the skills and technicalities involved in data science.

Other competition platforms

Though this book focuses on competitions on Kaggle, we cannot forget that many data competitions are held on private platforms or on other competition platforms. In truth, most of the information you will find in this book will also hold for other competitions, since they essentially all operate under similar principles and the benefits for the participants are more or less the same.
Although many other platforms are localized in specific countries or specialize only in certain kinds of competitions, for completeness we will briefly introduce some of them, at least those we have some experience and knowledge of:

- DrivenData (https://www.drivendata.org/competitions/) is a crowdsourcing competition platform devoted to social challenges (see https://www.drivendata.co/blog/intro-to-machine-learning-social-impact/). The company itself is a social enterprise whose aim is to bring data science solutions to organizations tackling the world's biggest challenges, thanks to data scientists building algorithms for social good. For instance, as you can read in this article, https://www.engadget.com/facebook-ai-hate-speech-covid-19-160037191.html, Facebook has chosen DrivenData for its competition on building models against hate speech and misinformation.
- Numerai (https://numer.ai/) is an AI-powered, crowdsourced hedge fund based in San Francisco. It hosts a weekly tournament in which you can submit your predictions on obfuscated hedge fund data and earn prizes in the company's cryptocurrency, Numeraire.
- CrowdANALYTIX (https://www.crowdanalytix.com/community) is a bit less active now, but this platform used to host quite a few challenging competitions a short while ago, as you can read in this blog post: https://towardsdatascience.com/how-i-won-top-five-in-a-deep-learning-competition-753c788cade1. The community blog is quite interesting for getting an idea of what challenges you can find on this platform: https://www.crowdanalytix.com/jq/communityBlog/listBlog.html.
- Signate (https://signate.jp/competitions) is a Japanese data science competition platform. It is quite rich in contests and it offers a ranking system similar to Kaggle's (https://signate.jp/users/rankings).
- Zindi (https://zindi.africa/competitions) is a data science competition platform from Africa. It hosts competitions focused on solving Africa's most pressing social, economic, and environmental problems.
- Alibaba Cloud (https://www.alibabacloud.com/campaign/tianchi-competitions) is a Chinese cloud computing and AI provider that has launched the Tianchi Academic competitions, partnering with academic conferences such as SIGKDD, IJCAI-PRICAI, and CVPR, and featuring challenges such as image-based 3D shape retrieval, 3D object reconstruction, and instance segmentation.
- Analytics Vidhya (https://datahack.analyticsvidhya.com/) is the largest Indian community for data science, offering a platform for data science hackathons.
- CodaLab (https://codalab.lri.fr/) is a French-based data science competition platform, created as a joint venture between Microsoft and Stanford University in 2013. It features a free cloud-based notebook called Worksheets (https://worksheets.codalab.org/) for knowledge sharing and reproducible modeling.

Other minor platforms are CrowdAI (https://www.crowdai.org/) from École Polytechnique Fédérale de Lausanne in Switzerland, InnoCentive (https://www.innocentive.com/), Grand-Challenge (https://grand-challenge.org/) for biomedical imaging, DataFountain (https://www.datafountain.cn/business?lang=en-US), OpenML (https://www.openml.org/), and the list could go on. You can always find a large list of ongoing major competitions at the Russian community Open Data Science (https://ods.ai/competitions) and even discover new competition platforms from time to time.
You can see an overview of running competitions on the mlcontests.com website, along with the current costs for renting GPUs. The website is often updated and it is an easy way to get a glance at what's going on with data science competitions across different platforms.

Kaggle is always the best platform where you can find the most interesting competitions and obtain the widest recognition for your competition efforts. However, picking up a challenge outside of it makes sense, and we recommend it as a strategy, when you find a competition matching your personal and professional interests. As you can see, there are quite a lot of alternatives and opportunities besides Kaggle, which means that if you consider more competition platforms alongside Kaggle, you can more easily find a competition that might interest you because of its specialization or data.

In addition, you can expect less competitive pressure during these challenges (and consequently a better ranking, or even winning something), since they are less known and advertised. Just expect less sharing among participants, since no other competition platform has reached the same richness of sharing and networking opportunities as Kaggle.

Introducing Kaggle

At this point, we need to delve more deeply into how Kaggle in particular works. In the following paragraphs, we will discuss the various aspects of the Kaggle platform and its competitions, and you'll get a flavor of what it means to be in a competition on Kaggle. Afterward, we'll come back to discuss many of these topics in much more detail, with more suggestions and strategies, in the remaining chapters of the book.

Stages of a competition

A competition on Kaggle is arranged into different steps. By having a look at each of them, you can get a better understanding of how a data science competition works and what to expect from it.

When a competition is launched, there are usually some posts on social media that announce it, for instance on the Kaggle Twitter profile, https://twitter.com/kaggle, and a new tab will appear in the Kaggle section about Active Competitions on the Competitions page (https://www.kaggle.com/competitions). If you click on a particular competition's tab, you'll be taken to its page. At a glance, you can check if the competition will have prizes (and if it awards points and medals, a secondary consequence of participating in a competition), how many teams are currently involved, and how much time is still left for you to work on a solution.

Figure 1.2: A competition's page on Kaggle

There, you can explore the Overview menu first, which provides information about:

- The topic of the competition
- Its evaluation metric (that your models will be evaluated against)
- The timeline of the competition
- The prizes
- The legal or competition requirements

Usually the timeline is a bit overlooked, but it should be one of the first things you check; it doesn't just tell you when the competition starts and ends, but also gives you the rule acceptance deadline, which is usually from seven days to two weeks before the competition closes. The rule acceptance deadline marks the last day you can join the competition (by accepting its rules). There is also the team merger deadline: you can arrange to combine your team with another competitor's at any point before that deadline, but after that it won't be possible.
The Rules menu is also quite often overlooked (with people just jumping to Data), but it is important to check it because it can tell you about the requirements of the competition. Among the key information you can get from the rules is:

- Your eligibility for a prize
- Whether you can use external data to improve your score
- How many submissions (tests of your solution) a day you get
- How many final solutions you can choose

Once you have accepted the rules, you can download any data from the Data menu or directly start working on Kaggle Notebooks (online, cloud-based notebooks) from the Code menu, reusing code that others have made available or creating your own code from scratch.

If you decide to download the data, also consider that you have a Kaggle API that can help you run downloads and submissions in an almost automated way (a short sketch of its use follows below). It is an important tool if you are running your models on your local computer or on your cloud instance. You can find more details about the API at https://www.kaggle.com/docs/api and you can get the code from GitHub at https://github.com/Kaggle/kaggle-api.
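To give a flavor of the API in code, here is a minimal sketch (our own illustration, not from the book's code bundle) using the Python client that ships with the kaggle package; the competition name is just an example and method names may change between package versions:

# Assumes the kaggle package is installed and credentials are stored
# in ~/.kaggle/kaggle.json (see https://www.kaggle.com/docs/api)
from kaggle.api.kaggle_api_extended import KaggleApi

api = KaggleApi()
api.authenticate()

# Download the data of a competition whose rules you have accepted
api.competition_download_files("titanic", path="data")

# Submit a CSV solution together with a short description message
api.competition_submit("submission.csv", "First attempt", "titanic")

The same two operations are also available as command-line calls (kaggle competitions download and kaggle competitions submit), which makes them easy to script on a cloud instance.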
Types of competitions and examples

Kaggle competitions are grouped into categories, and each category has different implications in terms of how to compete and what to expect. The type of data, the difficulty of the problem, the awarded prizes, and the competition dynamics are quite diverse within these categories, so it is important to understand beforehand what each one implies.

Here are the official categories that you can use to filter the different competitions:

- Featured
- Masters
- Annuals
- Research
- Recruitment
- Getting Started
- Playground
- Analytics
- Community

Featured are the most common type of competition, involving a business-related problem from a sponsor company and a prize for the top performers. The winners grant a non-exclusive license of their work to the sponsor company; they have to prepare a detailed report of their solution and sometimes even participate in meetings with the sponsor company.

There are examples of Featured competitions every time you visit Kaggle. At the moment, many of them are problems relating to the application of deep learning methods to unstructured data like text, images, videos, or sound. In the past, tabular data competitions were commonly seen, that is, competitions based on problems relating to structured data that can be found in a database. Using first random forests, then gradient boosting methods with clever feature engineering, tabular solutions derived from Kaggle could really improve on an existing solution. Nowadays, these competitions are run much less often, because a crowdsourced solution often won't be much better than what a good team of data scientists or even AutoML software can achieve. Given the spread of better software and good practices, the increase in result quality obtainable from competitions is indeed marginal. In the unstructured data world, however, a good deep learning solution can still make a big difference. For instance, pre-trained networks such as BERT brought about double-digit improvements over previous standards on many well-known NLP task benchmarks.

Masters are less common now, but they are private, invite-only competitions. Their purpose was to create competitions reserved for experts (generally competitors ranked as Masters or Grandmasters, based on Kaggle medal rankings).

Annuals are competitions that appear during a certain period of every year. Among the Annuals, we have the Santa Claus competitions (usually based on an algorithmic optimization problem) and the March Machine Learning Mania competition, run every year since 2014 during the US College Basketball Tournaments.

Research competitions imply a research or science purpose instead of a business one, sometimes serving the public good. That's why these competitions do not always offer prizes. In addition, these competitions sometimes require the winning participants to release their solution as open source.
Google has released a few Research competitions in the past, such as Google Landmark Recognition 2020 (https://www.kaggle.com/c/landmark-recognition-2020), where the goal was to label famous (and not-so-famous) landmarks in images.

Sponsors that want to test the ability of potential job candidates hold Recruitment competitions. These competitions are limited to teams of one and offer the best-placed competitors an interview with the sponsor as a prize. Competitors have to upload their CV at the end of the competition if they want to be considered for contact. Examples of Recruitment competitions have been:

- The Facebook Recruiting Competition (https://www.kaggle.com/c/FacebookRecruiting); Facebook has held a few of this kind
- The Yelp Recruiting Competition (https://www.kaggle.com/c/yelp-recruiting)

Getting Started competitions do not offer any prizes, but rather friendly and easy problems for beginners to get accustomed to Kaggle principles and dynamics. They are usually semi-permanent competitions whose leaderboards are refreshed from time to time. If you are looking for a tutorial in machine learning, these competitions are the right place to start, because you can find a highly collaborative environment and there are many Kaggle Notebooks available showing you how to process the data and create different types of machine learning models. Famous ongoing Getting Started competitions are:

- Digit Recognizer (https://www.kaggle.com/c/digit-recognizer)
- Titanic — Machine Learning from Disaster (https://www.kaggle.com/c/titanic)
- House Prices — Advanced Regression Techniques (https://www.kaggle.com/c/house-prices-advanced-regression-techniques)

Playground competitions are a little more difficult than the Getting Started ones, but they are also meant for competitors to learn and test their abilities without the pressure of a fully-fledged Featured competition (though in Playground competitions the heat of the competition may sometimes also turn quite high). The usual prizes for such competitions are just swag (an acronym for "Stuff We All Get," such as, for instance, a cup, a t-shirt, or socks branded by Kaggle; see https://www.kaggle.com/general/68961) or a bit of money.

One famous Playground competition is the original Dogs vs. Cats competition (https://www.kaggle.com/c/dogs-vs-cats), where the task is to create an algorithm to distinguish dogs from cats.

A mention should also be given to Analytics competitions, where the evaluation is qualitative and participants are required to provide ideas, drafts of solutions, PowerPoint slides, charts, and so on; and to Community (previously known as InClass) competitions, which are held by academic institutions as well as Kagglers. You can read about the launch of Community competitions at https://www.kaggle.com/product-feedback/294337, and you can get tips about running one of your own at https://www.kaggle.com/c/about/host and at https://www.kaggle.com/community-competitions-setup-guide.

Parul Pandey
https://www.kaggle.com/parulpandey

We spoke to Parul Pandey, Kaggle Notebooks Grandmaster, Datasets Master, and data scientist at H2O.ai, about her experience with Analytics competitions and more.

What's your favorite kind of competition and why? In terms of techniques and solving approaches, what is your specialty on Kaggle?

I really enjoy the Data Analytics competitions, which require you to analyze the data and provide a comprehensive analysis report at the end.
These include the Data Science for Good competitions (DS4G), sports analytics competitions (NFL, etc.), and the general survey challenges. Unlike the traditional competitions, these competitions don't have a leaderboard to track your performance compared to others; nor do you get any medals or points.

On the other hand, these competitions demand end-to-end solutions touching on multi-faceted aspects of data science like data cleaning, data mining, visualizations, and conveying insights. Such problems provide a way to mimic real-life scenarios and provide your insights and viewpoints. There may not be a single best answer to solve the problem, but it gives you a chance to deliberate and weigh up potential solutions, and imbibe them into your solution.

How do you approach a Kaggle competition? How different is this approach to what you do in your day-to-day work?

My first step is always to analyze the data as part of EDA (exploratory data analysis). It is something that I also follow as part of my work routine. Typically, I explore the data to look for potential red flags like inconsistencies in data, missing values, outliers, etc., which might pose problems later. The next step is to create a good and reliable cross-validation strategy. Then I read the discussion forums and look at some of the Notebooks shared by people. It generally acts as a good starting point, and then I can incorporate things into this workflow from my past experiences. It is also essential to track the model performance.

For an Analytics competition, however, I like to break down the problem into multiple steps. For instance, the first part could be related to understanding the problem, which may require a few days. After that, I like to explore the data, followed by creating a basic baseline solution. Then I continue enhancing this solution by adding a piece at a time. It might be akin to adding Lego bricks one part at a time to create that final masterpiece.

Tell us about a particularly challenging competition you entered, and what insights you used to tackle the task.

As I mentioned, I mostly like to compete in Analytics competitions, even though occasionally I also try my hand at the regular ones too. I'd like to point out a very intriguing Data Science for Good competition titled Environmental Insights Explorer (https://www.kaggle.com/c/ds4g-environmental-insights-explorer). The task was to use remote sensing techniques to understand environmental emissions instead of calculating emissions factors from current methodologies.

What really struck me was the use case. Our planet is grappling with climate change issues, and this competition touched on this very aspect. While researching for the competition, I was amazed to find the amount of progress being made in the field of satellite imagery, and it gave me a chance to understand and dive more deeply into the topic. It gave me a chance to understand how satellites like Landsat, MODIS, and Sentinel work, and how they make satellite data available. This was a great competition to learn about a field I knew very little about before the competition.

In your experience, what do inexperienced Kagglers often overlook? What do you know now that you wish you'd known when you first started?

I will cite some of the mistakes that I made in my initial years on Kaggle. Firstly, most newbies think of Kaggle as a competitions-only platform.
If you love competitions, there are plenty here, but Kaggle also has something for people with other specialties. You can write code and share it with others, indulge in healthy discussions, and network. Curate and share good datasets with the community. I initially only used Kaggle for downloading datasets, and it was only a couple of years ago that I actually became active. Now, when I look back, I couldn't have been more wrong. A lot of people get intimidated by competitions. You can first get comfortable with the platform and then slowly start participating in the competitions. Another important thing that I would like to mention is that many people work in isolation, lose motivation, and quit. Teaming up on Kaggle has many unseen advantages. It teaches you to work in a team, learn from the experiences, and work towards a common goal in a limited time frame.

Do you use other competition platforms? How do they compare to Kaggle?

While most of my current time is spent on Kaggle, in the past I have used Zindi, a data science competition platform focused on African use cases. It's a great place to access datasets focused on Africa. Kaggle is a versatile platform, but there is a shortage of problem statements from different parts of the world. Of late, we have seen some diversified problems too, like the recently held chaii competition — an NLP competition focusing on Indian languages. I believe similar competitions concentrating on different countries will be helpful for the research and the general data science community as well.

Cutting across this taxonomy of Kaggle competitions, you also have to consider that competitions may have different formats. The usual format is the so-called Simple format, where you provide a solution and it is evaluated as we previously described. More sophisticated is the two-stage competition, which splits the contest into two parts; the final dataset is released only after the first part has finished, and only to the participants of the first part. The two-stage competition format emerged in order to limit the chance of some competitors cheating and infringing the rules, since the evaluation is done on a completely untried test set that is available for a short time only. Contrary to the original Kaggle competition format, in this case competitors have a much shorter amount of time and far fewer submissions to figure out any useful patterns from the test set.

For the same reason, Code competitions have recently appeared, where all submissions are made from a Kaggle Notebook, and any direct upload of submissions is disabled.

There are no restrictions at all on Kagglers at any stage of their competition careers taking on any kind of competition. However, we have some suggestions for or against certain formats or types of competition, depending on your level of experience in data science and your computational resources:

- For complete beginners, the Getting Started or Playground competitions are good places to begin, since you can easily get more confident about how Kaggle works without facing high competitive pressure. That being said, many beginners have successfully started from Featured and Research competitions, because being under pressure helped them to learn faster.
Our suggestion is therefore to decide based on your learning style: some Kagglers need to learn by exploring and collaborating (and the Getting Started or Playground competitions are ideal for that), while others need the heat of a fast-paced competition to find their motivation.
- For Featured and Research competitions, take into account that these competitions are often about fringe applications of AI and machine learning; consequently, you often need a solid background in, or the willingness to study, all the relevant research in the competition's field of application.
- Finally, keep in mind that most competitions require access to computational resources that are often not available to most data scientists in the workplace. This can turn into growing expenses if you use a cloud platform outside of Kaggle's own. Code competitions and competitions with time or resource limitations might then be the ideal place to spend your efforts, since they strive to put all the participants on the same resource level.

Submission and leaderboard dynamics

The way Kaggle works seems simple: the test set is hidden from participants; you fit your model; if your model is the best at predicting on the test set, then you score highly and possibly win. Unfortunately, this description renders the inner workings of Kaggle competitions in an overly simplistic way. It doesn't take into account the dynamics arising from the direct and indirect interactions of competitors, or the nuances of the problem you are facing and of its training and test sets.

Explaining the Common Task Framework paradigm

A more comprehensive description of how Kaggle works is actually given by Professor David Donoho, professor of statistics at Stanford University (https://web.stanford.edu/dept/statistics/cgi-bin/donoho/), in his paper 50 Years of Data Science. It first appeared in the Journal of Computational and Graphical Statistics and was subsequently posted on the MIT Computer Science and Artificial Intelligence Laboratory website (see http://courses.csail.mit.edu/18.337/2015/docs/50YearsDataScience.pdf).

Professor Donoho does not refer to Kaggle specifically, but to all data science competition platforms. Quoting computational linguist Mark Liberman, he refers to data science competitions and platforms as being part of a Common Task Framework (CTF) paradigm that has been silently and steadily progressing data science in many fields during the last decades. He states that a CTF can work incredibly well at improving the solution of a problem in data science from an empirical point of view, quoting the Netflix competition and many DARPA competitions as successful examples. The CTF paradigm has contributed to reshaping the best-in-class solutions for problems in many fields.

A CTF is composed of ingredients and a secret sauce. The ingredients are simply:

1. A publicly available dataset and a related prediction task
2. A set of competitors who share the common task of producing the best prediction for the task
3. A system for scoring the predictions by the participants in a fair and objective way, without providing hints about the solution that are too specific (or limiting them, at least)

The system works best if the task is well defined and the data is of good quality. In the long run, the performance of solutions improves by small gains until it reaches an asymptote.
The process can be sped up by allowing a certain amount of sharing among participants (as happens on Kaggle by means of discussions, and by sharing Kaggle Notebooks and the extra data provided in the Datasets section). According to the CTF paradigm, competitive pressure in a competition suffices to produce ever-improving solutions. When the competitive pressure is paired with some degree of sharing among participants, the improvement happens at an even faster rate – hence why Kaggle has introduced many incentives for sharing.

This is because the secret sauce in the CTF paradigm is the competition itself, which, within the framework of a practical problem whose empirical performance has to be improved, always leads to the emergence of new benchmarks, new data and modeling solutions, and in general to an improved application of machine learning to the problem posed by the competition. A competition can therefore provide a new way to solve a prediction problem, new ways of feature engineering, and new algorithmic or modeling solutions. For instance, deep learning did not simply emerge from academic research; it first gained a great boost because of successful competitions that signaled its efficacy (we have already mentioned, for instance, the Merck competition, won by Geoffrey Hinton's team: https://www.kaggle.com/c/MerckActivity/overview/winners).

Coupled with the open source software movement, which allows everyone access to powerful analytical tools (such as Scikit-learn, TensorFlow, or PyTorch), the CTF paradigm brings about even better results because all competitors start on the same level. On the other hand, the reliance of a competition's solution on specialized or improved hardware can limit achievable results, because it can prevent competitors without access to such resources from properly participating and contributing, whether directly to the solution or indirectly by exercising competitive pressure on the other participants. Understandably, this is the reason why Kaggle started offering free cloud services to participants in its competitions, the Kaggle Notebooks we will introduce in the Computational resources section. They can flatten some differences in hardware-intensive competitions (as most deep learning ones are) and increase the overall competitive pressure.

Understanding what can go wrong in a competition

Given our previous description of the CTF paradigm, you may be tempted to imagine that all a competition needs is to be set up on a proper platform, and good results, such as positive involvement for participants and outstanding models for the sponsor company, will automatically follow. However, there are also things that can go wrong and instead lead to a disappointing result in a competition, both for the participants and the institution running it:

- Leakage from the data
- Probing from the leaderboard (the scoring system)
- Overfitting and consequent leaderboard shake-up
- Private sharing

You have leakage from data when part of the solution can be retraced in the data itself. For instance, certain variables could be posterior to the target variable, so they reveal something about it. This happens in fraud detection when you use variables that are updated after a fraud happens, or in sales forecasting when you process information relating to the effective distribution of a product (more distribution implies more requests for the product, hence more sales). A simple first sanity check against this kind of leakage is to scan for features with implausibly high standalone predictive power, as in the sketch below.
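The following sketch is our illustration of that sanity check, not code from a specific competition; it assumes a binary classification target and uses hypothetical file and column names, and the 0.9 cutoff is an arbitrary choice. A single raw feature that nearly perfectly separates the target is a strong hint of leakage:

```python
# Flag features whose standalone AUC against the target is "too good to be
# true" - a common symptom of leakage. File name, column names, and the
# 0.9 threshold are illustrative assumptions.
import pandas as pd
from sklearn.metrics import roc_auc_score

train = pd.read_csv("train.csv")   # hypothetical training file
target = train.pop("target")       # hypothetical binary target column

suspicious = {}
for col in train.select_dtypes("number").columns:
    auc = roc_auc_score(target, train[col].fillna(train[col].median()))
    auc = max(auc, 1.0 - auc)      # AUC is symmetric around 0.5
    if auc > 0.9:                  # near-perfect single features are suspect
        suspicious[col] = auc

for col, auc in sorted(suspicious.items(), key=lambda kv: -kv[1]):
    print(f"{col}: univariate AUC = {auc:.3f}")
```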
Another issue could be that the training and test examples are ordered in a predictable way, or that the values of the identifiers of the examples hint at the solution. Examples are, for instance, when the identifier is based on the ordering of the target, or when the identifier value is correlated with the flow of time and time affects the probability of the target. Such solution leakage, sometimes called golden features by competitors (because getting a hint of such nuances in the data can turn into gold prizes for the participants), invariably leads to a solution that is not reusable. This also implies a sub-optimal result for the sponsor, though they are at least able to learn something about the leaking features that can affect solutions to their problem.

Another problem is the possibility of probing a solution from the leaderboard. In this situation, you can take advantage of the evaluation metric shown to you and snoop the solution through repeated submission trials on the leaderboard. Again, in this case the solution is completely unusable in different circumstances. A clear example of this happened in the competition Don't Overfit II. The winning participant, Zachary Mayers, submitted every individual variable as a single submission, gaining information about the possible weight of each variable that allowed him to estimate the correct coefficients for his model (you can read Zach's detailed solution here: https://www.kaggle.com/c/dont-overfit-ii/discussion/91766). Generally, time series problems, or other problems where there are systematic shifts in the test data, may be seriously affected by probing, since it can help competitors to successfully define some kind of post-processing (like multiplying their predictions by a constant) that is most suitable for scoring highly on the specific test set.

Another form of leaderboard snooping (that is, getting a hint about the test set and overfitting to it) happens when participants rely more on the feedback from the public leaderboard than on their own tests. Sometimes this turns into a complete failure of the competition, causing a wild shake-up – a complete and unpredictable reshuffling of the positions on the final leaderboard. The winning solutions, in such a case, may turn out to be not so optimal for the problem, or even just dictated by chance. This has led to the diffusion of techniques that analyze the potential gap between the training set and the public test set. This kind of analysis, called adversarial testing, can provide insight into how much to rely on the leaderboard, and whether there are features that are so different between the training and test sets that it would be better to avoid them completely. For an example, you can have a look at this Notebook by Bojan Tunguz: https://www.kaggle.com/tunguz/adversarial-ieee (a minimal sketch of the idea also follows shortly below).

Another kind of defense against leaderboard overfitting is choosing safe strategies to avoid submitting solutions that are based too much on the leaderboard results. For instance, since (typically) two solutions can be chosen by each participant for the final evaluation, a good strategy is to submit the best-performing one based on the leaderboard, and the best-performing one based on your own cross-validation tests.
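To make the adversarial testing idea concrete, here is a minimal sketch (our own illustration, not the Tunguz Notebook itself, with hypothetical file and column names): label training rows 0 and test rows 1, then train a classifier to tell them apart. A cross-validated AUC near 0.5 means the two sets look alike; a high AUC points to drifting features that you may want to drop:

```python
# Adversarial validation: can a model distinguish train rows from test rows?
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

train = pd.read_csv("train.csv").drop(columns=["target"])  # hypothetical files
test = pd.read_csv("test.csv")

data = pd.concat([train, test], axis=0, ignore_index=True)
X = data.select_dtypes("number").fillna(-999)
y = [0] * len(train) + [1] * len(test)   # 0 = train row, 1 = test row

model = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0)
probs = cross_val_predict(model, X, y, cv=5, method="predict_proba")[:, 1]
print(f"Adversarial AUC: {roc_auc_score(y, probs):.3f}")  # ~0.5 is good news

# If the AUC is high, feature importances hint at which features drift
# between train and test.
model.fit(X, y)
ranked = sorted(zip(X.columns, model.feature_importances_),
                key=lambda kv: -kv[1])
print(ranked[:10])
```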
In order to avoid problems with leaderboard probing and overfitting, Kaggle has recently introduced different innovations based on Code competitions, where the evaluation is split into two distinct stages, as we previously discussed, with participants completely blind to the actual test data so that they are forced to rely more on their own local validation tests.

Finally, another possible distortion of a competition is due to private sharing (sharing ideas and solutions within a closed circle of participants) and other illicit moves such as playing through multiple accounts or playing in multiple teams and stealing ideas. All such actions create an asymmetry of information between participants that can be favorable to a few and detrimental to most. Again, the resulting solution may be affected, because sharing has been imperfect during the competition and fewer teams have been able to exercise full competitive pressure. Moreover, if these situations become evident to participants (for instance, see https://www.kaggle.com/c/ashrae-energy-prediction/discussion/122503), they can lead to distrust and less involvement in the competition or in subsequent competitions.

Computational resources

Some competitions pose limitations in order to ensure that solutions are feasible for production. For instance, the Bosch Production Line Performance competition (https://www.kaggle.com/c/bosch-production-line-performance) had strict limits on execution time, model file output, and memory for solutions.

Notebook-based (previously known as Kernel-Only) competitions, which require both training and inference to be executed in Kaggle Notebooks, do not pose a problem for the resources you have to use. This is because Kaggle provides you with all the resources you need (and this is also intended as a way to put all participants on the same starting line for a better competition result). Problems arise when you have competitions that limit the use of Notebooks to inference time only. In these cases, you can train your models on your own machine, and the only limit is then at test time, on the number and complexity of models you produce. Since most competitions at the moment require deep learning solutions, you have to be aware that you will need specialized hardware, such as GPUs, in order to achieve a competitive result.

Even in some of the now-rare tabular competitions, you'll soon realize that you need a strong machine with quite a number of processors and a lot of memory in order to easily apply feature engineering to data, run experiments, and build models quickly.

Standards change rapidly, so it is difficult to specify standard hardware that you should have in order to compete at least in the same league as other teams. We can get hints about the current standard by looking at what other competitors are using, either on their own machines or on the cloud. For instance, HP launched a program where it awarded an HP Z4 or Z8 to a few selected Kaggle participants in exchange for brand visibility. A Z8 machine, for example, has up to 72 cores, 3 TB of memory, 48 TB of storage (a considerable amount by solid-state storage standards), and usually a dual NVIDIA RTX setup as the GPU. We understand that this may be a bit out of reach for many; even renting a similar machine for a short time on a cloud instance such as Google's GCP or Amazon's AWS is out of the question, given the expense of even moderate usage.
The cloud costs for each competition naturally depend on the amount of data to process and on the number and type of models you build. Free credit giveaways in Kaggle competitions for both the GCP and AWS cloud platforms usually range from US $200 to US $500.

Our suggestion, as you start your journey to climb to the top rankings of Kaggle participants, is therefore to go with the machines provided free by Kaggle: Kaggle Notebooks (previously known as Kaggle Kernels).

Kaggle Notebooks

Kaggle Notebooks are versioned computational environments, based on Docker containers running on cloud machines, that allow you to write and execute both scripts and notebooks in the R and Python languages. Kaggle Notebooks:

- Are integrated into the Kaggle environment (you can make submissions from them and keep track of which submission refers to which Notebook)
- Come with most data science packages pre-installed
- Allow some customization (you can download files and install further packages)

The basic Kaggle Notebook is CPU-only, but you can have versions boosted by an NVIDIA Tesla P100 or a TPU v3-8. TPUs are hardware accelerators specialized for deep learning tasks.

Though bound by a usage number and time quota limit, Kaggle Notebooks give you access to the computational workhorse needed to build your baseline solutions in Kaggle competitions:

Notebook type | CPU cores | Memory | Notebooks that can run at a time | Weekly quota
CPU           | 4         | 16 GB  | 10                               | Unlimited
GPU           | 2         | 13 GB  | 2                                | 30 hours
TPU           | 4         | 16 GB  | 2                                | 30 hours

Besides the total runtime, CPU and GPU notebooks can run for a maximum of 12 hours per session before stopping (TPU notebooks for just 9 hours), meaning you won't get any results from the run apart from what you have saved to disk. You have a 20 GB disk allowance to store your models and results, plus an additional scratchpad disk that can exceed 20 GB for temporary usage while a script is running.

In certain cases, the GPU-enhanced machine provided by Kaggle Notebooks may not be enough. For instance, the recent Deepfake Detection Challenge (https://www.kaggle.com/c/deepfake-detection-challenge) required the processing of around 500 GB of videos. That is especially challenging because of the 30-hour weekly usage limit, and because you cannot have more than two machines with GPUs running at the same time. Even if you can double your machine time by changing your code to leverage TPUs instead of GPUs (you can find some guidance for easily achieving this at https://www.kaggle.com/docs/tpu, and a minimal sketch of the usual idiom just after this section), that may still not prove enough for fast experimentation in a data-heavy competition such as the Deepfake Detection Challenge. For this reason, in Chapter 3, Working and Learning with Kaggle Notebooks, we are going to provide you with tips for successfully coping with these limitations to produce decent results without having to buy a high-performance machine. We are also going to show you how to integrate Kaggle Notebooks with GCP or, alternatively, in Chapter 2, Organizing Data with Datasets, how to move all your work to another cloud-based solution, Google Colab.
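As a concrete illustration of switching code over to TPUs, here is a minimal sketch of the usual TensorFlow idiom for attaching to a TPU in a Kaggle Notebook, falling back to the default CPU/GPU strategy when no TPU is attached. The toy model is just a placeholder, and depending on your TensorFlow version the strategy may live under tf.distribute.experimental instead; see https://www.kaggle.com/docs/tpu for the official guidance this is based on:

```python
# Detect and initialize a TPU if one is attached to the notebook; otherwise
# fall back to the default (CPU/GPU) distribution strategy.
import tensorflow as tf

try:
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    strategy = tf.distribute.TPUStrategy(tpu)
    print("Running on TPU")
except ValueError:
    strategy = tf.distribute.get_strategy()
    print("No TPU found, using the default strategy")

# Anything built inside the strategy scope is replicated across the
# available accelerator cores.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")
```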
Teaming and networking

While computational power plays its part, only human expertise and ability can make the real difference in a Kaggle competition. For a competition to be handled successfully, it sometimes requires the collaborative efforts of a team of contestants. Apart from Recruitment competitions, where the sponsor may require individual participation for a better evaluation of candidates' abilities, there is typically no restriction against forming teams. Usually, teams can be made up of a maximum of five contestants.

Teaming has its own advantages because it can multiply the effort put into finding a better solution. A team can spend more time on the problem together, and different skills can be of great help; not all data scientists have the same skills or the same level of skill when it comes to different models and data manipulation.

However, teaming is not all positive. Coordinating different individuals and efforts toward a common goal may prove not so easy, and some suboptimal situations may arise. A common problem is when some of the participants are not involved or are simply idle, but no doubt the worst is when someone infringes the rules of the competition – to the detriment of everyone, since the whole team could be disqualified – or even spies on the team in order to give an advantage to another team, as we mentioned earlier.

In spite of any negatives, teaming in a Kaggle competition is a great opportunity to get to know other data scientists better, to collaborate for a purpose, and to achieve more, since Kaggle rules do reward teams over lone competitors. In fact, in a smaller team, each member receives a percentage of the total that is higher than an equal share would be.

Teaming up is not the only possibility for networking on Kaggle, though it is certainly the most profitable and interesting for participants. You can also network with others through discussions on the forums, or by sharing Datasets and Notebooks during competitions. All these opportunities on the platform can help you get to know other data scientists and become recognized in the community.

There are also many occasions to network with other Kagglers outside of the Kaggle platform itself. First of all, there are a few Slack channels that can be helpful. For instance, KaggleNoobs (https://www.kaggle.com/getting-started/20577) is a channel, opened in 2016, that features many discussions about Kaggle competitions. It has a supportive community that can help you if you have a specific problem with code or models.

There are quite a few other channels devoted to exchanging opinions about Kaggle competitions and data science-related topics. Some channels are organized on a regional or national basis, for instance the Japanese channel Kaggler-ja (http://kaggler-ja-wiki.herokuapp.com/) or the Russian community Open Data Science Network (https://ods.ai/), created in 2015, which later also opened to non-Russian-speaking participants. The Open Data Science Network doesn't simply offer a Slack channel, but also courses on how to win competitions, events, and reporting on active competitions taking place on all known data science platforms (see https://ods.ai/competitions).

Aside from Slack channels, quite a few local meetups themed around Kaggle in general, or around specific competitions, have sprung up, some just on a temporary basis, others in a more established form. A meetup focused on Kaggle competitions, usually built around a presentation from a competitor who wants to share their experience or suggestions, is the best way to meet other Kagglers in person, to exchange opinions, and to build alliances for participating in data science contests together.
In this regard, a mention should be given to Kaggle Days (https://kaggledays.com/), created by Maria Parysz and Paweł Jankiewicz. The Kaggle Days organization has arranged a few events in major locations around the world (https://kaggledays.com/about-us/) with the aim of bringing together a conference of Kaggle experts. It has also created a network of local meetups in different countries, which are still quite active (https://kaggledays.com/meetups/).

Paweł Jankiewicz
https://www.kaggle.com/paweljankiewicz

We had the opportunity to catch up with Paweł about his experiences with Kaggle. He is a Competitions Grandmaster and a co-founder of LogicAI.

What's your favorite kind of competition and why? In terms of techniques and solving approaches, what is your specialty on Kaggle?

Code competitions are my favorite type of competition, because working in a limited environment forces you to think about different kinds of budgets: time, CPU, and memory. Too many times in previous competitions I needed to utilize up to 3-4 strong virtual machines. I didn't like that in order to win I had to use such resources, because it makes for a very uneven competition.

How do you approach a Kaggle competition? How different is this approach to what you do in your day-to-day work?

I approach every competition a little bit differently. I always tend to build a framework for each competition that allows me to create as many experiments as possible. For example, in one competition where we needed to create a deep learning convolutional neural network, I created a way to configure neural networks by specifying them in the format C4-MP4-C