Hype, Sustainability, and the Price of the Bigger-is-Better Paradigm in AI

Gaël Varoquaux *1, Alexandra Sasha Luccioni *2, Meredith Whittaker *3 4

arXiv:2409.14160v1 [cs.CY] 21 Sep 2024

Abstract

With the growing attention and investment in recent AI approaches such as large language models, the narrative that the larger the AI system, the more valuable, powerful and interesting it is, is increasingly seen as common sense. But what is this assumption based on, and how are we measuring value, power, and performance? And what are the collateral consequences of this race to ever-increasing scale? Here, we scrutinize the current scaling trends and trade-offs across multiple axes and refute two common assumptions underlying the 'bigger-is-better' AI paradigm: 1) that improved performance is a product of increased scale, and 2) that all interesting problems addressed by AI require large-scale models. Rather, we argue that this approach is not only fragile scientifically, but comes with undesirable consequences. First, it is not sustainable, as its compute demands increase faster than model performance, leading to unreasonable economic requirements and a disproportionate environmental footprint. Second, it implies focusing on certain problems at the expense of others, leaving aside important applications, e.g. health, education, or the climate. Finally, it exacerbates a concentration of power, which centralizes decision-making in the hands of a few actors while threatening to disempower others in the context of shaping both AI research and its applications throughout society.

* Equal contribution. 1 Inria, Saclay, France; 2 Hugging Face, Montreal, Canada; 3 Signal, NYC, United States; 4 University of Western Australia, Perth, Australia. Correspondence to: Gaël Varoquaux. Copyright 2024 by the author(s).

1. Introduction: AI is overly focused on scale

Our field is increasingly dominated by research programs driven by a race for scale: bigger models, bigger datasets, more compute. Over the past decade, machine learning (ML) has been used to build systems used by millions, carrying out tasks such as automatic translation, news feed ranking, and targeted marketing. Pursuing scale has helped improve the performance of ML models on benchmarks for many tasks, with the paradigmatic example being the ability of large language models (LLMs) to encode "general knowledge" when pretrained on large amounts of text data. The perceived success of these approaches has further entrenched the assumption that bigger-is-better in AI. In this position paper, we examine the underpinnings of this assumption about scale and its collateral consequences. We argue not that scale isn't useful in some cases, but that there is too much focus on scale and not enough value placed on other research.

From narrative to norm   The famous AlexNet paper (Krizhevsky et al., 2012) has been key in shaping the current era of AI, including the assumption that increased scale is the key to improved performance. While building on and acknowledging decades of work, AlexNet created the recipe for the current bigger-is-better paradigm in AI, combining GPUs, big data (at least for the time), and large-scale neural-network-based approaches. The paper also demonstrated that GPUs¹ scale compute much better than CPUs, enabling graduate students to rig a hardware setup that produced a model which outcompeted those trained on tens of thousands of CPUs. The AlexNet authors viewed scale as a key source of their model's benchmark-beating performance: "all of our experiments suggest that our results can be improved simply by waiting for faster GPUs and bigger datasets to become available" (Krizhevsky et al., 2012, p.2). This point was further reinforced by Sutton in his "bitter lesson" article, which states that approaches based on more computation win in the long term, as the cost of compute decreases with time (2019).

¹ GPUs, or Graphical Processing Units, were initially developed for video games, enabling better graphics rendering given their ability to carry out more processes in parallel compared to CPUs, or Central Processing Units, which had been the dominant hardware architecture.
The consequence of this is both an explosion in investment in large-scale AI models and a concomitant spike in the size of notable (highly cited) models. Generative AI, whether for images or text, has taken this assumption to a new level, both within the AI research discipline and as a component of the popular 'bigger-is-better' narrative surrounding AI – this, of course, has corresponding training and inference compute requirements, which have ballooned in recent years (see Figure 1).

Figure 1. An explosion in model size – Top: number of parameters and bfloat16 memory footprint of notable models over time, against the amount of RAM available for $1000, which has been nearly flat; the increase in model size means it is more and more expensive to run them in terms of RAM. Bottom: training FLOP of notable models (language, vision, multimodal, speech, drawing, games, other), which is increasing faster than available compute and now surpasses the largest supercomputers. Data from Epoch (2023); specific details on the plots in Appendix A.

The bigger-is-better norm is also self-reinforcing, shaping the AI research field by informing what kinds of research are incentivized and which questions are asked (or remain unasked), as well as the relationship between industrial and academic actors. This is important because science, both in AI and in other disciplines, does not happen in a vacuum – it is built upon relationships and a shared pool of knowledge, in which prior work is studied, incorporated, and extended. Currently, a handful of benchmarks define how "SOTA" (state-of-the-art) is measured and understood, and in pursuit of the goal of improving performance on these benchmarks, scale has become the preferred tool for achieving progress and establishing new records. Reviewers ask for experiments at a large scale, both in the context of new models and in the context of measuring performance against existing models (e.g. fySy, 2024); scientific best practices call for running experiments many times, e.g. for hyperparameter selection (Bouthillier et al., 2021). These incentives and norms contribute to pushing computing budgets beyond what is accessible to most university labs – which in turn makes many labs increasingly dependent on close ties with industry in order to secure such access (Abdalla & Abdalla, 2021; Whittaker, 2021; Abdalla et al., 2023). Taken together, we see the "bigger-is-better" norm in AI creating conditions in which it is increasingly difficult for anyone outside of large industrial labs to develop, test, and deploy SOTA AI systems.

The bigger-is-better assumption is also prevalent beyond the AI research community. It is shaping how AI is used and the understandings and expectations about its capabilities. Popular news reporting assumes larger amounts of compute result in and equate to better results and commercial success (Wodecki, 2023; Law, 2024). Regulatory determinations and thresholds assume larger means more powerful and more dangerous, e.g. the US Executive Order on AI (Biden, 2023) and EU AI Act (Parliament, 2023) – this broader assumption shapes markets and policy-making.

Paper outline   We start by discussing why this reliance on scale is misguided – we first examine this assumption in terms of how, and whether, scale leads to improvement in goal setting, finding that scale is not always correlated with better performance in certain contexts, and that in fact the benefits of scale tend to saturate (section 2). Then we look at harmful consequences that result from the growth of large-scale AI. Firstly, large-scale development is environmentally unsustainable (section 3), and the drive for more and more data at any cost encourages surveillance and unethical and unauditable data practices (section 4). Further, due to the expense and scarcity of core hardware and talent required to produce large-scale AI, privileging bigger-is-better increasingly concentrates power over AI in the hands of a narrow set of players (section 5).
We conclude by outlining how the research community can reclaim the scientific discourse in the AI field, and move away from a singular focus on scale.

2. What problems does scale solve?

2.1. Scale is one of many factors that matter

A staggering increase in scale, and cost   The scale of notable (highly-cited) models has massively increased over the last decade, driven by a super-exponential growth in terms of number of parameters, and amount of data and compute used (Figure 1). This growth is much more rapid than the increased capacity of hardware to execute the necessary processing for model training and tuning. Indeed, while the size of large models, as measured by number of parameters, is currently doubling every 5 months (subsection A.3), the cost of 1 GB of memory has been nearly constant for a decade. This means that the resources required to participate in cutting-edge AI research have increased significantly. The compute used to train an AI model went from a single day on a gaming console (in 2013) to surpassing the largest supercomputers on Earth (in 2020).
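To make this mismatch concrete, here is a back-of-the-envelope sketch (ours, not the paper's analysis) using the 5-month doubling time quoted above:

```python
# Sketch of the growth mismatch described above, assuming the quoted
# 5-month doubling time for model size holds over a decade.
DOUBLING_TIME_MONTHS = 5
DECADE_MONTHS = 120

growth = 2 ** (DECADE_MONTHS / DOUBLING_TIME_MONTHS)
print(f"Model size growth over a decade: x{growth:,.0f}")  # ~x16.8 million

# With the price of 1 GB of RAM roughly constant over the same period,
# the dollar cost of merely holding such a model in memory grows by the
# same seven orders of magnitude.
```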
A common response to concerns about the exponential growth of compute requirements is a reference to Moore's law² – i.e. that computational power will also increase, and this will help ensure that compute remains accessible (see Sutton, 2019). However, this is not true in practice, since compute requirements for SOTA models are surpassing improvements in computational power. For instance, Thompson et al. (2022) carried out a meta-analysis of 1,527 research papers from different ML subdomains and extrapolated how much computational power was needed to improve upon common benchmarks like ImageNet and MS COCO. They estimate that an additional 567 times more compute would be needed to achieve a 5% error rate on ImageNet, stating that "fundamental rearchitecting is needed to lower the computational intensity so that the scaling of these problems becomes less onerous". However, many of the architectures they survey, such as Transformer-based models, including Vision Transformers, are still the ones in use in 2024. Finally, building large AI models entails costs beyond the computational infrastructure, for instance as larger models require more human labor (Appendix C).

² Moore's law is an observation that states that the number of transistors on a microchip doubles roughly every two years.

Diminishing returns of scale   The problem with the bigger-is-better approach is not simply that it is inaccessible. It also does not consistently produce the best models, and, after a given point, even shows diminishing returns. On many tasks, benchmark performance as a function of scale tends to saturate after a certain point (see Figure 2). In addition, there is as much variability in model performance between models within a similar size class as there is between models of different sizes (Beeching et al., 2023). Indeed, there are many factors beyond scale that are important to produce performant AI models. For instance, choosing the right model architecture for the data at hand is crucial, and Transformer-based models, widely perceived to be SOTA on most or even all ML benchmarks, are not always the most fitting solution.

Figure 2. Performance as a function of scale saturates across a variety of tasks. Plots of performance as a function of scale (time or memory footprint) on benchmark data from a) tabular learning (Grinsztajn et al., 2022), b) a medical image segmentation challenge (Ma & Wang, 2023), c) computer-vision object detection (Lin et al., 2014, COCO) and d) scene parsing (Zhou et al., 2017, ADE20K), e) text embedding (Muennighoff et al., 2022), and f) text understanding (Beeching et al., 2023). Details in Appendix B.

For instance, when working with tabular data of the kind commonly produced in enterprise environments, tree-based models produce better predictions and are much faster (and thus less expensive and more accessible) compared to neural network approaches. This is notably due to the fact that their inductive bias is adapted to the specificity of columnar data (Grinsztajn et al., 2022, and Figure 2a). Training or fine-tuning strategies can also play a very important role (Davidson et al., 2023) – for instance, in text embeddings, fine-tuning a model of the same size, and often even the same type of pre-trained model, in a different manner to produce useful similarities can markedly improve the resulting embeddings on domain-specific tasks (Muennighoff et al., 2022, and Figure 2c). We see this exemplified in the E5 model, which leverages both contrastive learning and curated data to achieve remarkable results (Wang et al., 2022). Considering text-based models, we have seen evidence that encoder-based models lead to representations that facilitate learning in relational data much better than decoder-based models, and do so in a much less compute-intensive way (Grinsztajn et al., 2023).
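To make the tabular-data point above concrete, the following minimal sketch (our illustration on synthetic data, not the benchmark of Grinsztajn et al.) pits a gradient-boosted tree ensemble against a neural network in scikit-learn; on typical tabular tasks the trees match or beat the network at a fraction of the training time:

```python
# Minimal comparison on a synthetic tabular task; exact numbers will
# vary, the point is that the cheaper model is a serious contender.
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=20_000, n_features=30,
                           n_informative=15, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for name, model in [("boosted trees", HistGradientBoostingClassifier()),
                    ("neural net", MLPClassifier(hidden_layer_sizes=(256, 256)))]:
    start = time.perf_counter()
    model.fit(X_tr, y_tr)
    print(f"{name}: accuracy={model.score(X_te, y_te):.3f}, "
          f"train time={time.perf_counter() - start:.1f}s")
```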
2.2. In many applications, utility does not require scale

Different tasks can also call for models of different sizes, as seen in Figure 2. Using these benchmarks as a reference for comparison, we see that, e.g., a 1 GB model performs well on medical image segmentation, even though the images themselves are relatively large. In computer vision, a 0.7 GB model can achieve good performance on a task such as object detection, even though scene parsing could require up to 3 GB of memory. For text processing, a natural language understanding task could be addressed with an LLM with hundreds of billions of parameters requiring more than 20 GB of memory (and multiple GPUs), even though a 1.3 GB model can also provide a good semantic embedding, which can be leveraged for solving the same task. Particularly for more focused tasks, smaller models often suffice: object detection (2c) versus scene parsing (2d), or semantics (2e) versus 'understanding' (2f).

Given this varied landscape, and the particularities in size and performance across tasks, where should we direct our research efforts? There is no single answer to this question, but it is instructive that a proposal made by Wagstaff over a decade ago called for ML with meaningful applications, each of which calls for approaches adapted to its constraints (2012), a philosophy that has become decidedly unpopular in recent years in the pursuit of 'general-purpose' AI models. Let us survey published applications of ML, to shed some light on the corresponding tradeoffs.

If we consider health applications, where we would like machine learning to positively contribute to society, AI models are often built in data-scarce contexts on which large models can more easily overfit (Varoquaux & Cheplygina, 2022). For example, to predict epilepsy, Eriksson et al. (2023) find no prediction-performance difference between logistic regression, boosted trees, or deep learning, even as these approaches differ dramatically in terms of resources required. The largest source of medical data is probably electronic health records – Rajkomar et al. (2018) reported performance of deep learning, though their appendices show that logistic regression achieves the same performance. For education, one promise of machine learning is that it could produce customized pedagogy, shaped to the student based on which topics or skills they have learned, and which they are striving to understand. For this purpose, a random embedding of the student's learning sequence outperforms more resource-intensive optimized deep or Bayesian models (Ding & Larson, 2019). For robotics, Fu et al. (2024) present an impressively useful system that can autonomously complete many daily-life tasks such as serving dishes, whose key is co-training by a human. The learning systems used are rather modest, such as a ResNet18 architecture (He et al., 2016), dating from 2016, with 11 million parameters, which only takes up 22 MB of memory at bfloat16 precision – dwarfed by many of today's SOTA systems, whose size is often measured in GB (Figure 2).
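The memory figures quoted in this subsection follow from simple arithmetic: at bfloat16 precision, each parameter occupies 2 bytes. A small sketch (ours; the 10-billion-parameter entry is a hypothetical round number, not a specific model):

```python
# Weights-only memory footprint at bfloat16 (2 bytes per parameter);
# activations and optimizer state would come on top of this.
def bfloat16_footprint_mb(n_parameters: int) -> float:
    return n_parameters * 2 / 1e6  # bytes -> megabytes

print(bfloat16_footprint_mb(11_000_000))      # ResNet18: ~22 MB
print(bfloat16_footprint_mb(10_000_000_000))  # hypothetical 10B LLM: ~20,000 MB (20 GB)
```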
If we inspect applications such as those mentioned above, we can see that for applied ML research what often matters most – even if we focus on simple purpose-specific metrics – is to address well-defined and application-specific goals while respecting the compute constraints at hand. And doing so often benefits from smaller, purpose-specific approaches as opposed to larger, more generic models. The benefit of purpose-specific models is also apparent in business applications, which are, by definition, specific to a given business model, and have well-defined goals (efficiency, profit, etc.) (Bernardi et al., 2019). Industry surveys indeed show that the most frequently used ML models are linear models and tree-based ones (Kumar, 2022).

Happily for the AI research community, a focus on smaller models also unearths myriad unresolved scientific questions, which remain in need of scrutiny and innovative thinking independent of scale. For example, for many applications, bottlenecks lie in decision-making, calibration, and uncertainty quantification, not in more data or compute as such (Van Calster et al., 2019). Similarly, causal machine learning and distributional shift are very active areas of research that are important irrespective of scale (Kaddour et al., 2022). And interpretability is still an unrealized goal for most neural-network-based approaches, even as the ability to understand and audit such systems remains centrally important in the context of meaningful real-life applications, especially in high-stakes scenarios like health (Murdoch et al., 2019; Ghassemi et al., 2021).
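As an example of one such scale-independent question, calibration asks whether a model's predicted probabilities match observed frequencies, which matters for decision-making regardless of model size. A minimal sketch on synthetic data (our illustration, assuming scikit-learn):

```python
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5_000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
proba = LogisticRegression().fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

# Bin the predictions and compare mean predicted probability with the
# observed fraction of positives; close values mean good calibration.
frac_pos, mean_pred = calibration_curve(y_te, proba, n_bins=10)
for p, f in zip(mean_pred, frac_pos):
    print(f"predicted {p:.2f} -> observed {f:.2f}")
```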
2.3. Unrepresentative benchmarks show outsize benefits from scale

Progress in AI is measured by an ever-evolving set of benchmarks which are meant to enable comparing the performance of various models in a standardized way. In fact, one of the main reasons ML practitioners are drawn to scale is that large-scale models have, over the past decade or more, beaten smaller models, as measured by these benchmarks. And beating the benchmark is currently how SOTA is defined (and how best paper awards are allotted, tenure given, funding secured, etc.). This persists even as this benchmark-centric evaluation often comes with ill-advised practices (Flach, 2019), both in terms of reproducibility of results (Pineau et al., 2021; Belz et al., 2021; Marie et al., 2021) as well as the metrics used for measuring performance (Post, 2018; Wu et al., 2016). In fact, one problem that the ML community currently faces is that many benchmarks were created to assess the performance of models in the context of the academic research field, not as a stand-in for measures of contextualized performance – especially given the diversity of downstream contexts in which AI models can be used – nor to support broad claims made by marketers or companies given rising industrial interest and investment in AI.

Furthermore, with the advent of generative models including LLMs, the ML community has faced new evaluation challenges given the open-endedness of model outputs. It is no longer possible to simply evaluate prediction on labels in the test set when there is no canonical 'correct answer'. This has resulted in a diversification of model evaluation practices, with the development of many different evaluation benchmarks intended to evaluate different aspects of generative model performance (see Koch et al., 2021; Chang et al., 2023). This proliferation of non-standard approaches further confuses the landscape of assessment. Furthermore, examining training datasets often reveals evidence of data contamination (Dodge et al., 2021), including the presence of benchmark datasets inside training corpora (Deng et al., 2023). This renders evaluations against these benchmarks questionable. This suggests that we should be very cautious regarding the validity of current evaluations of industrial models, given that the datasets used to train them are often not accessible to carry out the kind of public scrutiny necessary to identify such contamination or to understand the suitability of a given evaluation approach to the model at hand.
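A crude version of such a contamination check can be sketched in a few lines (ours, not the method of the cited works): flag benchmark examples whose word n-grams already occur in the training corpus.

```python
# Toy n-gram overlap check between a training corpus and a benchmark
# example; real contamination audits are far more involved.
def ngrams(text: str, n: int = 8) -> set:
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def looks_contaminated(example: str, corpus_ngrams: set, n: int = 8) -> bool:
    return not ngrams(example, n).isdisjoint(corpus_ngrams)

corpus = "the quick brown fox jumps over the lazy dog " * 3  # toy corpus
corpus_ngrams = ngrams(corpus)
print(looks_contaminated("quick brown fox jumps over the lazy dog again",
                         corpus_ngrams))  # True: the example leaked
```

Of course, as noted above, such a check is only possible when the training data is accessible in the first place.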
These issues have led researchers to propose various approaches for more rigorous model evaluation, including libraries (Gao et al., 2023; Von Werra et al., 2022) and open leaderboards (Beeching et al., 2023). However, the crux of the matter remains that ML benchmarks and evaluations remain unrepresentative proxies of downstream model performance, and are not a substitute for more holistic assessments. Importantly, such benchmarks cannot speak to whether a given model is suited to a given purpose and context. In addition, the "cost functions" that are used for many of these comparisons are not absolute and result in, e.g., a fixation on specific metrics such as accuracy or precision, while ignoring others, such as robustness or efficiency, which remain crucial in real-world applications (Rogers & Luccioni, 2024). Finally, some ad-hoc benchmarks claim to measure ill-defined performance properties that are in fact impossible to verify (e.g. claiming that certain LLMs exhibit artificial general intelligence (Bubeck et al., 2023)), which impacts the way the field as a whole tracks and perceives progress towards an illusory and unattainable goal.

3. Consequence 1: An unsustainable trajectory

As we have shown in the sections above, focusing all of our research efforts on the pursuit of bigger models narrows the field, and scale is not an efficient solution to every problem. There are many interesting problems in AI that do not require scale, and many ways of measuring progress that do not emphasize performance alone. The singular pursuit of scale also comes with worrying consequences beyond the ML research community, which we examine below.

Efficiency can have rebound effects   As we have seen from historical data, the compute required to create and deploy SOTA AI models grows faster than the cost of compute decreases. And yet, if we succeed at making these technologies generally useful, deployment at massive scale would be the logical next step, making these SOTA models accessible to increasing quantities of users. We may hope that efficiency improvements will come to the rescue, but the recent history of AI suggests the opposite. It is a well-known phenomenon in economics that when the efficiency of a general-use technology increases, the falling costs lead to increased demand, resulting in an overall increase in resource usage – this is referred to as "Jevons Paradox" (Jevons, 1865). Indeed, the AlexNet paper, which helped launch the current bigger-is-better era of AI research, was itself introducing a profound efficiency improvement, showing that a few GPUs could compete with tens of thousands of CPUs. One could assume that this would make AI training "democratically accessible", but in fact the opposite happened, with subsequent generations of AI models using even larger quantities of GPUs and training for longer – with the training of the most recent generations of LLMs being measured in millions of GPU hours. This counter-intuitive dynamic has been documented numerous times in energy, agriculture, transportation, and computing (York & McGee, 2016), where observed trends have led to truisms such as Wirth's law, which asserts that software demands outpace hardware improvements (Wirth, 1995).

3.1. The costs of scale-first AI don't add up

In fact, in many organizations, compute costs are already a major roadblock to developing and deploying ML models (Paleyes et al., 2022). This is true even at large, tech-savvy companies such as Booking.com, who have found that gains in model performance cease to translate into gains in value to the company at a certain point, at which other factors such as efficiency or robustness become more important (Bernardi et al., 2019). Increase in model scale aggravates this problem – even after the costs of training are accounted for, deploying large models can lead to sizeable costs. In fact, the price of a single inference (i.e. model query) has been growing in the last two decades despite hardware efficiency improvements (Figure 3). This problem is made even worse as growing applications of such models in user-facing applications such as smartphones, smart devices connected to the Internet of Things (IoT) and Web applications require increased quantities of inferences by ML models.

Figure 3. The cost of a single inference is growing faster than compute is improving (inference FLOP for one query vs. FLOPS available in a $100 GPU, over time).

Figure 4. A single inference uses more energy for models with broad purposes (energy per query, roughly 1 mWh to 10 Wh, across tasks from text classification to image generation). Data from Luccioni et al. (2024).

The prohibitive cost of model deployment at scale threatens the business model of AI. For instance, the compute required to run ChatGPT as a publicly accessible interface allegedly costs OpenAI $700,000 per day (The Economic Times, 2023). According to Alphabet's chairman, a search using Bard, which is powered by Google's recent PaLM 2 model (Anil et al., 2023), is 10 times more expensive than a pre-Bard Google search (Dastin & Nellis, 2023). In a similar vein, Bornstein et al. (2023) report that in generative AI gross margins are "more often as low as 50-60%, driven largely by the cost of model inference". Beyond generative AI, self-driving cars provide another cautionary example.
But here again the economics are challenging – 'autonomous' driving has turned out to rely heavily on the support of human workers and expensive sensors (Higgins, 2022). As of early 2024, the field of companies endeavoring to commercialize fully autonomous vehicles has shrunk considerably. Cruise suspended its efforts, leaving only Waymo remaining (Shepardson & Klayman, 2023).

3.2. Worrying environmental impacts

The emphasis on scale in AI also comes with consequences for the planet, since training and deploying AI models requires raw materials for manufacturing computing hardware, energy to power infrastructure, and water to cool ever-growing datacenters. The energy consumption and carbon footprint of AI models has grown, with models such as LLMs emitting up to 550 tonnes of CO2 during their training process (Strubell et al., 2019; Luccioni & Hernandez-Garcia, 2023; Luccioni et al., 2022). But the most serious sustainability concerns relate to inference, given the speed at which AI models are being deployed in user-facing applications. It is hard to gather meaningful data regarding both the energy cost and carbon emissions of AI inference, because this varies widely depending on the deployment choices made (e.g. type of GPU used, batch size, precision, etc.). However, high-level estimates from Meta attribute approximately one-third of their AI-related carbon footprint to model inference (Wu et al., 2021), whereas Google attributes 60% of its AI-related energy use to inference (Patterson et al., 2022). Comparing estimates of the energy required for inference for LLMs (Luccioni et al., 2024) to ChatGPT's 10 million users per day (Oremus, 2023) reveals that within a few weeks of commercial use, the energy use of inference overtakes that of training.
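The order of magnitude behind this claim can be checked with round numbers. The following sketch uses assumed values (the per-query energy and the one-off training energy are illustrative, not figures from the paper):

```python
# Illustrative arithmetic: days until cumulative inference energy
# exceeds a one-off training cost, under assumed round numbers.
TRAIN_ENERGY_KWH = 500_000     # assumed one-off training energy (0.5 GWh)
WH_PER_QUERY = 3.0             # assumed energy per generative query
QUERIES_PER_DAY = 10_000_000   # one query per reported daily user

daily_inference_kwh = QUERIES_PER_DAY * WH_PER_QUERY / 1000
days_to_overtake = TRAIN_ENERGY_KWH / daily_inference_kwh
print(f"{daily_inference_kwh:,.0f} kWh/day; inference overtakes training "
      f"after ~{days_to_overtake:.0f} days")  # ~17 days with these numbers
```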
Given the many companies and investors working to make AI more ubiquitous, these numbers could go up by many orders of magnitude. In fact, we are already seeing chatbots applied in consumer electronics such as smart speakers and TVs, as well as appliances such as refrigerators and ovens. The increased use of AI in user-facing products is putting tech companies' climate targets at risk, with both Microsoft and Google announcing in 2024 that they would miss the sustainability targets they had set in previous years due to the energy demands of AI (Rathi & Bass, 2024; Metz, 2024).

Comparing different model architectures also shows that "general-purpose" (multi-task or zero-shot) models have higher energy costs than specialized models that were trained or fine-tuned for a specific task (Luccioni et al., 2024, and Figure 4). This means that the current trend of using generative AI approaches for tasks such as Web search, which were previously carried out using "legacy" (information retrieval and embedding-based) systems, stands to increase the environmental costs of our day-to-day computer use significantly. Taking a step back, according to recent figures, global data centre electricity consumption represents 1-1.3% of global electricity demand and contributes 1% of energy-related greenhouse gas emissions (Hintemann & Hinterholzer, 2022; Copenhagen Centre on Energy Efficiency, 2020). It is hard to estimate what portion of this number is attributable to AI. However, a recent report from the International Energy Agency estimates that electricity consumption from data centres and AI is set to double in the next 2 years, with data centres' total electricity consumption surpassing that of Japan (1,000 TWh) in 2026 (IEA, 2024).

4. Consequence 2: More data, more problems

As ML datasets grow in size (Figure 5), they bring a slew of issues ranging from documentation debt to a lack of auditability to biases – we discuss these below.

Figure 5. A sharp increase in the amount of data used for training (number of training samples of notable models over time). Details in Appendix A.

Data size in tension with quality   In recent years, pretraining has become the dominant approach in both computer vision (Szegedy et al., 2015; Redmon et al., 2016) and natural language processing (Devlin et al., 2018; Liu et al., 2019). This approach requires access to large datasets, which have grown in size, from millions of images for datasets such as ImageNet (Deng et al., 2009) to gigabytes of textual documents for C4 (Raffel et al., 2020) and billions of image-text pairs in LAION-5B (Schuhmann et al., 2022). The premise of pretraining is that more data will improve model performance and ensure maximal coverage in model predictions, with the assumption that the more data used for pretraining, the more representative the model's coverage will be of the world at large. However, numerous studies have shown that neither image nor text datasets are broadly representative, reflecting instead a select set of communities, regions and populations (Dodge et al., 2021; Rogers, 2021; Raji et al., 2021; Luccioni & Rolnick, 2023). In fact, recent research on the LAION datasets showed that as these datasets grow in size, the issues that plague them also multiply, with larger datasets containing disproportionately more problematic content than smaller datasets (Birhane et al., 2023). Thiel (2023) reported that the LAION datasets included child sexual abuse material, prompting the datasets to be taken down from several hosting platforms. But with dozens of image generation models trained on LAION variants, some already deployed in user-facing settings, it is hard to assess and mitigate the negative effects of this grim reality (Sambasivan et al., 2021).

While a growing wave of research has proposed a more 'data-centric' approach to data collection and curation (Zha et al., 2023; Mitchell et al., 2022), attempts to actually document the contents of ML datasets have been hampered by their sheer size, which requires compute and storage beyond the grasp of most members of the ML research community (Luccioni & Viviano, 2021), before the challenges of classifying and understanding the contents of a given dataset are even addressed. As the field works with ever larger datasets, we are also incurring more and more 'documentation debt', wherein training datasets are too large to document both during creation and post-hoc (Bender et al., 2021). In a nutshell, this means that we do not truly know what goes into the models that we rely on to answer our questions and generate our images, apart from some high-level statistics. And this, in turn, hampers efforts to audit, evaluate, and understand these models.
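One partial mitigation is that some open corpora can at least be inspected without being stored locally. A hedged sketch using the Hugging Face datasets library in streaming mode (the dataset identifier is only an example):

```python
# Stream a web-scale corpus instead of downloading it, sampling a few
# documents for inspection; needs the `datasets` package and network access.
from datasets import load_dataset

stream = load_dataset("allenai/c4", "en", split="train", streaming=True)
for i, example in enumerate(stream):
    print(example["text"][:80])  # peek at the start of each document
    if i == 4:
        break
```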
However, et al., 2023; Mitchell et al., 2022), attempts to actually doc- the United States and many other jurisdiction have yet to ument the contents of ML datasets have been hampered pass federal privacy laws, offering their citizens little to no by their sheer size, which requires compute and storage privacy protection against predatory data gathering practices beyond the grasp of most members of the ML research com- and surveillance, which the rush to scale AI incentivises. munity (Luccioni & Viviano, 2021), before the challenges of classifying and understanding the contents of a given 5. Consequence 3: Narrow field, few players dataset are even addressed. As the field works with ever larger datasets, we are also incurring more and more ‘doc- 5.1. Scale shouldn’t be a requirement for science umentation debt’ wherein training datasets are too large to The bigger-is-better paradigm shapes the AI research field, document both during creation and post-hoc (Bender et al., yet we are in a moment where expensive and scarce infras- 2021). In a nutshell, this means that we do not truly know tructure is viewed as necessary to conduct cutting-edge AI what goes in to the models that we rely on to answer our research and where companies are increasingly focused on questions and generate our images, apart from some high- providing resources to large-scale actors, at the expense of level statistics. And this, in turn, hampers efforts to audit, academics and hobbyists (Moss, 2018). Recent research evaluate, and understand these models. argues that “a compute divide has coincided with a reduced representation of academic-only research teams in com- pute intensive research topics, especially foundation mod- 1015 Training data size # samples els.” (Besiroglu et al., 2024, p.1) We also see evidence of this in the way in which the AI field is both growing–as measured by number of PhDs in AI, and shrinking–as mea- sured by graduates remaining in academia. Stanford HAI 109 (2023) reports that “the proportion of new computer science PhD graduates from U.S. universities who specialized in AI jumped to 19.1% in 2021, from 14.9% in 2020 and 10.2% in 103 2010.”. This dynamic sees academia increasingly marginal- ized in the AI space and reliant on corporate resources to 2000 2010 2020 participate in large-scale research and development of the kind that is likely to be published and recognized. And it Figure 5. A sharp increase in amount of data used for training arguably disincentives work that could challenge the bigger- Details in Appendix A. is-better paradigm and the actors benefiting from it. 7 Hype, Sustainability, and the Price of the Bigger-is-Better Paradigm in AI 5.2. Scale comes with a Concentration of power providing OpenAI access to Azure compute, in exchange for exclusive license to integrate OpenAI’s GPT models, and a The expense and scarcity of the ingredients needed to build promise of $1 trillion in profit delivered to Microsoft prior and operate increasingly large models benefits the actors to any revenue being directed to OpenAI’s social mission. in AI that have access to compute and distribution markets. This works to concentrate power and influence over AI, which, in turn, provides incentives for those with sufficient Shaping practices and applications AI dominance is one resources to perpetuate the bigger-is-better AI paradigm in factor that has helped vault companies into shapers of global service of maintaining their market advantage. economic forecasts and markets. 
The S&P 500, used by US investors as an index of market performance, is significantly shaped by the fates of large US tech companies (Rennison & Murray, 2023). This indicates that the interests of large AI companies, and particularly their emphasis on large-scale AI, have implications well beyond the tech industry, since a downturn in the AI market would impact global financial markets far beyond it. This creates market incentives, divorced from scientific imperatives, for perpetuating the bigger-is-better paradigm on which these companies are betting.

The expense of creating large-scale AI models also makes the need to commercialize these models more pressing – even if inference can be costly (Vanian & Leswing, 2023). Here, established cloud and platform companies also have an advantage in the form of access to large and existing markets. Cloud infrastructure offerings are a readily available means to commercialize AI models, allowing startups and enterprises to rent access to, e.g., generative AI APIs as part of cloud contracts. The ecosystem of 'GPT wrapper' companies licensing access to the model via Microsoft's Azure cloud is an obvious example. McKinsey (2022), looking at the 2022 landscape of AI, noted that "commercialization is likely tied to hosting. Demand for proprietary APIs (e.g. from OpenAI) is growing rapidly." Notable here is that the business model for commercializing generative AI in particular is still being worked out (Bornstein et al., 2023).

Consequences for innovation   The 1990s were marked by the explosion of personal computing, and the race to commercialize networked computation; this innovation reshaped the economy and the labor market (Caballero, 2010). From there, the actors involved became concentrated into a small number of firms, becoming enablers and gatekeepers of personal-computing software. The resulting market capture created a rent, at the expense of the rest of the economy, including IT consumers (Aghion et al., 2023).
This, in turn, led to a fall in innovation and growth because of a lack of incentives for leaders and a lack of market access for competitors (Aghion et al., 2023); desktop software has been moving slowly.

Today we face a similar scenario. Even the most well-resourced AI startups – like OpenAI, Anthropic, or Mistral – currently need compute resources so large that they can only be leased from a handful of dominant companies. And the primary pathway to a more widespread commercialization of AI models also lies through these companies. For example, Microsoft licensed their GPT API to 'wrapper' startup Jasper before they launched ChatGPT – Jasper provided writing help, offering services very similar to ChatGPT (Pardes, 2022). This, and OpenAI's launch of their GPT app store, inflamed concerns that Microsoft and OpenAI would leverage their informational advantage and privileged position to shape the AI market and to compete with smaller players dependent on their platform. This echoes practices that Amazon Marketplace has been criticised for, in which they used data collected about buyers and sellers to inform product development that directly competed with marketplace vendors (Mattioli, 2020).

Scale comes with consequences to society   The growth and profit imperative of corporations dictates finding sources of revenue that can cover the costs of AI systems of ever-growing size. This often puts them at odds with responsible and ethical AI use, which puts the emphasis on more consensual and incremental progress. We see this dynamic at play in OpenAI's shifting stance and policies as they have become more closely tied to Microsoft, a choice they made admittedly to access scarce computational resources. This is illustrated by the 2024 change to their acceptable use policy, which significantly softened the proscription against using their products for "military and warfare"; this change unlocks new revenue streams (Stone & Bergen, 2024), and connects OpenAI's large-scale AI to Microsoft's existing US military partnerships (Xiang, 2023). The public did not have a say in this determination, nor, we assume, did the global stakeholders whose interests may not be safeguarded by the militaries that OpenAI chooses to provide services to. And yet, the concentrated private industry power over AI creates a small, and financially incentivized, segment of AI decision makers. We should consider how such concentrated power with agency over centralized AI could shape society under more authoritarian conditions.

There are many other examples of financial incentives shaping the use of automated decision systems in ways that are not socially beneficial. Ross & Herman (2023) showed United Health used a diagnostic algorithm to deny patients care, an application now at the center of a class action lawsuit filed by patients and their families. The current bigger-is-better AI paradigm is also exacerbating geopolitical concentration and tensions. Beyond the power of large corporations, government imperatives are also in play, particularly given that the large AI industry players are primarily located in the US. These tensions are apparent in the efforts on the part of the US government to limit the supply of computing resources such as GPUs to China in the global race for AI dominance (Alper et al., 2023). While a discussion of whether such limits are warranted is outside the scope of this paper, this dynamic highlights the role of AI as a source of geopolitical power, one that, given the current concentration of AI resources, threatens to undermine technological sovereignty globally.

A broader landscape of risks   Deploying AIs in society may come with a variety of risks. Some risks are already visible in existing systems, such as problems of fairness and replicated biases which lead to underdiagnosis in historically under-served populations (Seyyed-Kalantari et al., 2021). Other risks are more speculative given today's systems, but are to be considered in future scenarios where AIs become much more powerful (Critch & Russell, 2023). A concentration of power fostered by very large-scale AI systems that can be operated and controlled by only very few actors makes these multiple risks more problematic, as it challenges checks-and-balances mechanisms.

6. Ways Forward: small can also be beautiful

Machine learning research is central to defining technical possibilities and building the narrative in AI. There is no magic bullet that unlocks research on AI systems both small-scale and powerful, but shared goals and preferences do shape where research efforts go.
We believe that the research community can and must act to pursue scientific questions beyond ever-larger models and datasets, and needs to foster scientific discussion engaging the trade-offs that come with scale. As such, we call on the research community to adopt the following norms:

Assigning value to research on smaller systems   Through defining our research agendas (which papers and talks are accepted, which questions are asked), the research community can shape how AI progresses and is appreciated. We can deepen our understanding of performance by diversifying our benchmarks, investigating the limitations of these benchmarks, and building bridges to other communities to pursue tasks and problems that are relevant in different contexts. All work on large AI systems should also be compared to simpler baselines, to open the conversation on trade-offs between scale and other factors. And we should value research on open questions such as uncertainty quantification or causality, even when it is conducted on smaller models, as it can bring valuable insights for the field as a whole.

Talking openly about size and cost   A scientific study in machine learning should report compute cost, energy usage and memory footprint for training and inference. Measuring these requires additional work, but it is small compared to the work of designing, training, and evaluating models, and can be done with available open-source tools like TensorBoard and Code Carbon. We should account for efficiency as well as performance when comparing models, for instance reporting metrics such as samples/kWh or throughput.
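As a minimal sketch of this reporting practice (ours, not a protocol from the paper), the Code Carbon tool mentioned above can wrap a training run; train_model here is a hypothetical stand-in for a real training loop:

```python
from codecarbon import EmissionsTracker

def train_model():  # hypothetical stand-in for an actual training loop
    return sum(i * i for i in range(10**7))

tracker = EmissionsTracker(project_name="my-experiment")
tracker.start()
train_model()
emissions_kg = tracker.stop()  # estimated kg of CO2-equivalent
print(f"training emissions: {emissions_kg:.6f} kg CO2-eq")

# codecarbon also writes an emissions.csv log that includes energy use in
# kWh, from which the suggested efficiency metric is one division away:
#   samples_per_kwh = n_samples / energy_consumed_kwh
```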
Holding reasonable expectations   Experiments at scale are costly, and we cannot expect everyone in the ML community to be "GPU rich". When assessing the merits of a scientific study, we should refrain from requiring additional experiments more costly than those already performed. And, as ML researchers, we should keep in mind that 1) not every problem is solved by scale, and 2) solving a problem that is only present at small scales is valuable, as it decreases costs.

Figure 7. Pareto optimality and different contributions. The state-of-the-art should be considered as a Pareto frontier in the trade-off space between task performance and computing resources. Visible contributions are often "resource-intensive progress", increasing task performance by increasing computing resources. We must also celebrate resource-efficiency progress that decreases computing resources for the same task performance.

Quantitatively framing the changes that we call for can help anchor them in the empirical practices of the AI community. For this, it is useful to consider, as in Figure 7, the trade-off between task performance (e.g. performance on an ML benchmark) and the computing resources used. In this space, the state of the art appears as a Pareto-optimal frontier. Successes that typically make the headlines are "resource intensive": they move the field in the upper-right direction, where using more computing resources is associated with better task performance. We must also celebrate progress in resource efficiency, moving the Pareto frontier: less resource for the same performance. We believe that this framing and the discussions it entails are useful to build the discussion, and we advocate their more systematic use by AI practitioners.
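The Pareto framing of Figure 7 is easy to operationalize. A small sketch (our illustration with hypothetical models): given (compute, performance) pairs, keep every model not dominated by another.

```python
# A model is on the Pareto frontier if no other model achieves at least
# its performance with at most its compute cost.
def pareto_frontier(models):
    """models: list of (name, compute_cost, task_performance) tuples."""
    frontier = []
    for name, cost, perf in models:
        dominated = any(c <= cost and p >= perf and (c, p) != (cost, perf)
                        for _, c, p in models)
        if not dominated:
            frontier.append(name)
    return frontier

# Hypothetical entries: C matches B's performance at half the cost, so it
# is "resource-efficiency progress" and displaces B from the frontier.
models = [("A", 1.0, 0.80), ("B", 10.0, 0.90), ("C", 5.0, 0.90)]
print(pareto_frontier(models))  # ['A', 'C']
```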
Conclusion   As members of the AI community, we find that, in recent years, AI research has acquired an unhealthy taste for scale. This comes with dire consequences – economic inequalities and environmental (un)sustainability, datasets that erode privacy and emphasize corrosive social elements, a narrowing of the field, and a structural exclusion of small actors such as most academic labs and many startups. This fixation on scale has emerged via norms that shape how the scientific community acts. We believe that scientific understanding and meaningful social benefits of AI will come from de-emphasizing scale as a blanket solution for all problems, instead focusing on models that can be run on widely-available hardware, at moderate costs. This will enable more actors to shape how AI systems are created and used, providing more immediate value in applications ranging from health to business, as well as enabling a more democratic practice of AI.

References

Abdalla, M. and Abdalla, M. The grey hoodie project: Big tobacco, big tech, and the threat on academic integrity. In Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society, AIES '21. ACM, July 2021. doi: 10.1145/3461702.3462563. URL http://dx.doi.org/10.1145/3461702.3462563.

Abdalla, M., Wahle, J. P., Lima Ruas, T., Névéol, A., Ducel, F., Mohammad, S., and Fort, K. The elephant in the room: Analyzing the presence of big tech in natural language processing research. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, 2023. doi: 10.18653/v1/2023.acl-long.734. URL http://dx.doi.org/10.18653/v1/2023.acl-long.734.

Aghion, P., Bergeaud, A., Boppart, T., Klenow, P. J., and Li, H. A theory of falling growth and rising rents. The Review of Economic Studies, 90(6):2675–2702, 2023. ISSN 0034-6527. doi: 10.1093/restud/rdad016. URL https://doi.org/10.1093/restud/rdad016.

Alper, A., Freifeld, K., and Nellis, S. Biden cuts China off from more Nvidia chips, expands curbs to other countries. https://www.reuters.com/technology/biden-cut-china-off-more-nvidia-chips-expand-curbs-more-countries-2023-10-17/, 2023.

Anil, R., Dai, A. M., Firat, O., Johnson, M., Lepikhin, D., Passos, A., Shakeri, S., Taropa, E., Bailey, P., Chen, Z., et al. PaLM 2 technical report, 2023.

Barr, A. The tech world is being divided into 'GPU rich' and 'GPU poor.' Here are the companies in each group. Business Insider, 2023. URL https://www.businessinsider.com/gpu-rich-vs-gpu-poor-tech-companies-in-each-group-2023-8.

Beeching, E., Fourrier, C., Habib, N., Han, S., Lambert, N., Rajani, N., Sanseviero, O., Tunstall, L., and Wolf, T. Open LLM leaderboard. https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard, 2023.

Belz, A., Agarwal, S., Shimorina, A., and Reiter, E. A systematic review of reproducibility research in natural language processing. arXiv preprint arXiv:2103.07929, 2021.

Bender, E. M., Gebru, T., McMillan-Major, A., and Shmitchell, S. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, FAccT '21, pp. 610–623, New York, NY, USA, 2021. Association for Computing Machinery. ISBN 9781450383097. doi: 10.1145/3442188.3445922. URL https://doi.org/10.1145/3442188.3445922.

Bernardi, L., Mavridis, T., and Estevez, P. 150 successful machine learning models: 6 lessons learned at Booking.com. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1743–1751, 2019.

Besiroglu, T., Bergerson, S. A., Michael, A., Heim, L., Luo, X., and Thompson, N. The compute divide in machine learning: A threat to academic contribution and scrutiny?, 2024.

Biden, J. R. Executive order on the safe, secure, and trustworthy development and use of artificial intelligence. 2023.

Birhane, A., Prabhu, V., Han, S., Boddeti, V. N., and Luccioni, A. S. Into the LAION's den: Investigating hate in multimodal datasets. arXiv preprint arXiv:2311.03449, 2023.

Bornstein, M., Appenzeller, G., and Casado, M. Who owns the generative AI platform? https://a16z.com/who-owns-the-generative-ai-platform/, 2023.

Bouthillier, X., Delaunay, P., Bronzi, M., Trofimov, A., Nichyporuk, B., Szeto, J., Mohammadi Sepahvand, N., Raff, E., Madan, K., Voleti, V., et al. Accounting for variance in machine learning benchmarks. Proceedings of Machine Learning and Systems, 3:747–769, 2021.

Bubeck, S., Chandrasekaran, V., Eldan, R., Gehrke, J., Horvitz, E., Kamar, E., Lee, P., Lee, Y. T., Li, Y., Lundberg, S., et al. Sparks of artificial general intelligence: Early experiments with GPT-4. arXiv preprint arXiv:2303.12712, 2023.

Caballero, R. J. Creative destruction.
In Economic Growth, pp. 24–29. Springer, 2010.

Chang, Y., Wang, X., Wang, J., Wu, Y., Zhu, K., Chen, H., Yang, L., Yi, X., Wang, C., Wang, Y., et al. A survey on evaluation of large language models. arXiv preprint arXiv:2307.03109, 2023.

Copenhagen Centre on Energy Efficiency. Greenhouse gas emissions in the ICT sector: Trends and methodologies. https://c2e2.unepdtu.org/wp-content/uploads/sites/3/2020/03/greenhouse-gas-emissions-in-the-ict-sector.pdf, 2020.

Critch, A. and Russell, S. TASRA: A taxonomy and analysis of societal-scale risks from AI. arXiv preprint arXiv:2306.06924, 2023.

Dastin, J. and Nellis, S. For tech giants, AI like Bing and Bard poses billion-dollar search problem. https://www.reuters.com/technology/tech-giants-ai-like-bing-bard-poses-billion-dollar-search-problem-2023-02-22/, 2023.

Davidson, T., Denain, J.-S., Villalobos, P., and Bas, G. AI capabilities can be significantly improved without expensive retraining. arXiv preprint arXiv:2312.07413, 2023.

Deng, C., Zhao, Y., Tang, X., Gerstein, M., and Cohan, A. Investigating data contamination in modern benchmarks for large language models. arXiv preprint arXiv:2311.09783, 2023.
Dodge, J., Sap, M., Marasović, A., Agnew, W., Ilharco, Grinsztajn, L., Oyallon, E., Kim, M. J., and Varoquaux, G. G., Groeneveld, D., Mitchell, M., and Gardner, M. Vectorizing string entries for data processing on tables: Documenting large webtext corpora: A case study when are larger language models better? arXiv preprint on the colossal clean crawled corpus. arXiv preprint arXiv:2312.09634, 2023. arXiv:2104.08758, 2021. Grynbaum, M. and Mac, R. The Times Sues OpenAI Epoch. Parameter, compute and data trends in ma- and Microsoft Over A.I. Use of Copyrighted Work. chine learning. https://epochai.org/data/ https://www.nytimes.com/2023/12/27/ pcd, 2023. business/media/new-york-times-open- ai-microsoft-lawsuit.html, 2023. Eriksson, M. H., Ripart, M., Piper, R. J., Moeller, F., Das, K. B., Eltze, C., Cooray, G., Booth, J., Whitaker, K. J., Hammond, G. Big tech outspends venture cap- Chari, A., et al. Predicting seizure outcome after epilepsy ital firms in ai investment frenzy. https: surgery: do we need more complex models, larger sam- //www.ft.com/content/c6b47d24-b435- ples, or better data? Epilepsia, 64(8):2014–2026, 2023. 4f41-b197-2d826cce9532, 2024. Federal Trade Commission. Ftc launches inquiry Hao, K. and Seetharaman, D. Cleaning up ChatGPT takes into generative ai investments and partnerships. heavy toll on human workers. The Wall Street Journal, https://www.ftc.gov/news-events/news/ 2023. press-releases/2024/01/ftc-launches- inquiry-generative-ai-investments- Hays, K. Zuck’s GPU flex will cost Meta partnerships, 2024. as much as $18 billion by the end of 2024. https://ca.style.yahoo.com/zucks-gpu- Flach, P. Performance evaluation in machine learning: the flex-cost-meta-171750011.html, 2024. good, the bad, the ugly, and the way forward. In Pro- ceedings of the AAAI conference on artificial intelligence, He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learn- volume 33, pp. 9808–9814, 2019. ing for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, Fu, Z., Zhao, T. Z., and Finn, C. Mobile aloha: Learning pp. 770–778, 2016. bimanual mobile manipulation with low-cost whole-body teleoperation. arXiv preprint arXiv:2401.02117, 2024. Hempel, J. Fei-fei li’s quest to make ai better for humanity. https://www.wired.com/story/fei-fei- fySy, R. Review of calibrated on average, but not within li-artificial-intelligence-humanity/, each slice: Few-shot calibration for all slices of a distri- 2018. bution. https://openreview.net/forum?id= T11rD8k578, 2024. “I wonder if the problem of this Higgins, T. Slow self-driving car progress tests investors’ paper would vanish as the data and model become larger. patience. https://www.wsj.com/articles/ Could you please demonstrate it?”. investors-are-losing-patience- with-slow-pace-of-driverless-cars- Gao, L., Tow, J., Abbasi, B., Biderman, S., Black, S., DiPofi, 11669576382, 2022. A., Foster, C., Golding, L., Hsu, J., Le Noac’h, A., Li, H., McDonell, K., Muennighoff, N., Ociepa, C., Phang, Hintemann, R. and Hinterholzer, S. Cloud computing drives J., Reynolds, L., Schoelkopf, H., Skowron, A., Sutawika, the growth of the data center industry and its energy L., Tang, E., Thite, A., Wang, B., Wang, K., and Zou, consumption. Data centers 2022. ResearchGate, 2022. 12 Hype, Sustainability, and the Price of the Bigger-is-Better Paradigm in AI IEA. Electricity 2024. https://www.iea.org/ Luccioni, A. S., Viguier, S., and Ligozat, A.-L. Estimat- reports/electricity-2024, 2024. 
ing the carbon footprint of BLOOM, a 176b parameter language model. arXiv preprint arXiv:2211.02001, 2022. Jevons, W. S. The Coal Question. MacMillan, London, 1865. Luccioni, S., Jernite, Y., and Strubell, E. Power hungry processing: Watts driving the cost of ai deployment? In Kaddour, J., Lynch, A., Liu, Q., Kusner, M. J., and The 2024 ACM Conference on Fairness, Accountability, Silva, R. Causal machine learning: A survey and open and Transparency, pp. 85–99, 2024. problems, 2022. URL https://arxiv.org/abs/ 2206.15475. Ma, J. and Wang, B. MICCAI FLARE23: Fast, Low- resource, and Accurate oRgan and Pan-cancer sEgmen- Koch, B., Denton, E., Hanna, A., and Foster, J. G. Re- tation in Abdomen CT. https://codalab.lisn. duced, reused and recycled: The life of a dataset in ma- upsaclay.fr/competitions/12239, 2023. chine learning research. arXiv preprint arXiv:2112.01716, 2021. Marie, B., Fujita, A., and Rubino, R. Scientific credibility of machine translation research: A meta-evaluation of Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet 769 papers. arXiv preprint arXiv:2106.15195, 2021. classification with deep convolutional neural networks. Advances in neural information processing systems, 25, Mattioli, D. Amazon scooped up data from its own 2012. sellers to launch competing products. https://www. wsj.com/articles/amazon-scooped-up- Kumar, D. Kaggle survey 2022 data analysis. https: data-from-its-own-sellers-to-launch- //www.kaggle.com/code/dhirajkumar612/ competing-products-11587650015, 2020. kaggle-survey-2022-data-analysis, 2022. McKinsey. The state of ai in 2022—and a half Law, M. Meta ramps up ai efforts, building massive compute decade in review. https://www.mckinsey. capacity. https://technologymagazine.com/ com/capabilities/quantumblack/our- articles/meta-ramping-up-ai-efforts- insights/the-state-of-ai-in-2022-and- expanding-gpu-capacity, 2024. a-half-decade-in-review, 2022. Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Metz, C. The secret ingredient of ChatGPT is human Ramanan, D., Dollár, P., and Zitnick, C. L. Microsoft advice. https://www.nytimes.com/2023/ COCO: Common objects in context. In Computer Vision– 09/25/technology/chatgpt-rlhf-human- ECCV 2014: 13th European Conference, Zurich, Switzer- tutors.html, 2023. land, September 6-12, 2014, Proceedings, Part V 13, pp. 740–755. Springer, 2014. Metz, R. Google’s emissions shot up 48% over five years due to ai. Bloomberg, 2024. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov, V. Mitchell, M., Luccioni, A. S., Lambert, N., Gerchick, M., Roberta: A robustly optimized bert pretraining approach. McMillan-Major, A., Ozoani, E., Rajani, N., Thrush, T., arXiv preprint arXiv:1907.11692, 2019. Jernite, Y., and Kiela, D. Measuring data. arXiv preprint arXiv:2212.05129, 2022. Luccioni, A. and Viviano, J. What’s in the box? an analysis of undesirable content in the common crawl corpus. In Moss, S. Nvidia updates geforce eula to pro- Proceedings of the 59th Annual Meeting of the Associa- hibit data center use, 2018. URL https: tion for Computational Linguistics and the 11th Interna- //www.datacenterdynamics.com/en/ tional Joint Conference on Natural Language Processing news/nvidia-updates-geforce-eula- (Volume 2: Short Papers), pp. 182–189, 2021. to-prohibit-data-center-use/. Luccioni, A. S. and Hernandez-Garcia, A. Counting carbon: Muennighoff, N., Tazi, N., Magne, L., and Reimers, N. 
A survey of factors influencing the emissions of machine Mteb: Massive text embedding benchmark. arXiv learning. arXiv preprint arXiv:2302.08476, 2023. preprint arXiv:2210.07316, 2022. Luccioni, A. S. and Rolnick, D. Bugs in the data: How Ima- Murdoch, W. J., Singh, C., Kumbier, K., Abbasi-Asl, R., geNet misrepresents biodiversity. In Proceedings of the and Yu, B. Definitions, methods, and applications in inter- AAAI Conference on Artificial Intelligence, volume 37, pretable machine learning. Proceedings of the National pp. 14382–14390, 2023. Academy of Sciences, 116(44):22071–22080, 2019. 13 Hype, Sustainability, and the Price of the Bigger-is-Better Paradigm in AI Oremus, W. Ai chatbots lose money every time you use Rathi, A. and Bass, D. Microsoft’s ai push imperils climate them. that is a problem. Washington Post, 2023. goal as carbon emissions jump 30%. Bloomberg, 2024. Paleyes, A., Urma, R.-G., and Lawrence, N. D. Challenges Redmon, J., Divvala, S., Girshick, R., and Farhadi, A. You in deploying machine learning: a survey of case studies. only look once: Unified, real-time object detection, 2016. ACM Computing Surveys, 55(6):1–29, 2022. Rennison, J. and Murray, E. How big tech camouflaged wall Pardes, A. The best little unicorn in texas: Jasper was street’s crisis. International New York Times, pp. NA–NA, winning the ai race then chatgpt blew up the whole 2023. game. https://www.theinformation.com/ articles/the-best-little-unicorn-in- Rogers, A. Changing the world by changing the data. arXiv texas-jasper-was-winning-the-ai-race- preprint arXiv:2105.13947, 2021. then-chatgpt-blew-up-the-whole-game, Rogers, A. and Luccioni, S. Position: Key claims in 2022. LLM research have a long tail of footnotes. In Forty- Parliament, E. Eu ai act: first regulation on artificial intelli- first International Conference on Machine Learning, gence. Accessed June, 25:2023, 2023. 2024. URL https://openreview.net/forum? id=M2cwkGleRL. Patterson, D., Gonzalez, J., Hölzle, U., Le, Q., Liang, C., Munguia, L.-M., Rothchild, D., So, D., Texier, M., Ross, C. and Herman, B. Unitedhealth pushed employees and Dean, J. The carbon footprint of machine learning to follow an algorithm to cut off medicare patients’ rehab training will plateau, then shrink, 2022. URL https: care. https://www.statnews.com/2023/11/ //arxiv.org/abs/2204.05149. 14/unitedhealth-algorithm-medicare- advantage-investigation, 2023. Pineau, J., Vincent-Lamarre, P., Sinha, K., Larivière, V., Beygelzimer, A., d’Alché Buc, F., Fox, E., and Sambasivan, N., Kapania, S., Highfill, H., Akrong, D., Par- Larochelle, H. Improving reproducibility in machine itosh, P., and Aroyo, L. M. “everyone wants to do the learning research (a report from the NeurIPS 2019 repro- model work, not the data work”: Data cascades in high- ducibility program). The Journal of Machine Learning stakes ai. In proceedings of the 2021 CHI Conference on Research, 22(1):7459–7478, 2021. Human Factors in Computing Systems, pp. 1–15, 2021. Plummer, B. A., Wang, L., Cervantes, C. M., Caicedo, J. C., Schrader, A. A Class Action Lawsuit Brought by Hockenmaier, J., and Lazebnik, S. Flickr30k entities: Artists Against A.I. Companies Adds New Plain- Collecting region-to-phrase correspondences for richer tiffs. https://news.artnet.com/art- image-to-sentence models. In Proceedings of the IEEE world/lawyers-for-artists-suing-ai- international conference on computer vision, pp. 2641– companies-file-amended-complaint- 2649, 2015. after-judge-dismisses-some-claims- Post, M. 
A call for clarity in reporting BLEU scores. 2403523, 2023. In Proceedings of the Third Conference on Machine Schuhmann, C., Beaumont, R., Vencu, R., Gordon, C., Translation: Research Papers, pp. 186–191, Belgium, Wightman, R., Cherti, M., Coombes, T., Katta, A., Mullis, Brussels, October 2018. Association for Computational C., Wortsman, M., et al. Laion-5B: An open large-scale Linguistics. URL https://www.aclweb.org/ dataset for training next generation image-text models. anthology/W18-6319. Advances in Neural Information Processing Systems, 35: Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., 25278–25294, 2022. Matena, M.