Working Paper: Beyond AI Exposure, Cost-Effective AI Vision Tasks PDF

Working Paper

Beyond AI Exposure: Which Tasks are Cost-Effective to Automate with Computer Vision?

Maja S. Svanberg, Massachusetts Institute of Technology
Wensu Li, Massachusetts Institute of Technology
Martin Fleming, The Productivity Institute
Brian C. Goehring, IBM’s Institute for Business Value
Neil C. Thompson*, Massachusetts Institute of Technology

Abstract. The faster AI automation spreads through the economy, the more profound its potential impacts, both positive (improved productivity) and negative (worker displacement). The previous literature on “AI Exposure” cannot predict this pace of automation since it attempts to measure an overall potential for AI to affect an area, not the technical feasibility and economic attractiveness of building such systems. In this article, we present a new type of AI task automation model that is end-to-end, estimating: the level of technical performance needed to do a task, the characteristics of an AI system capable of that performance, and the economic choice of whether to build and deploy such a system. The result is a first estimate of which tasks are technically feasible and economically attractive to automate - and which are not. We focus on computer vision, where cost modeling is more developed. We find that at today’s costs U.S. businesses would choose not to automate most vision tasks that have “AI Exposure,” and that only 23% of worker wages being paid for vision tasks would be attractive to automate. This slower roll-out of AI can be accelerated if costs fall rapidly or if it is deployed via AI-as-a-service platforms that have greater scale than individual firms, both of which we quantify. Overall, our findings suggest that AI job displacement will be substantial, but also gradual, and therefore there is room for policy and retraining to mitigate unemployment impacts.
Funding: MIT-IBM Watson AI Lab provided funding for the project.

Key words: artificial intelligence, automation, AI, computer vision

Acknowledgements: We are grateful to Ling Yue Poon and Haoran Lyu for research assistance and to Nicholas Borge, Ben Tang, Anna Pastwa, Subhro Das, Eni Mustafaraj, and Christophe Combemale for their contributions. An early draft of this paper appeared as Svanberg (2023).

*Corresponding author.

Electronic copy available at: https://ssrn.com/abstract=4700751

“Machines will steal our jobs” is a sentiment frequently expressed during times of rapid technological change. Such anxiety has re-emerged with the creation of large language models (e.g. ChatGPT, Bard, GPT-4) that show considerable skill in tasks where previously only human beings showed proficiency. A recent study found that about 50% of tasks could be at least partially automated with large language models (Eloundou et al. 2023). If task automation of that extent were to happen rapidly, it would represent an enormous disruption to the labor force. Conversely, if that amount of automation were to happen slowly then labor might be able to adapt as it did during other economic transformations (e.g. moving from agriculture to manufacturing). So, making good policy and business decisions depends on understanding how rapidly AI task automation will happen.

While there is already evidence that AI is changing labor demand (Fleming et al. 2019, Acemoglu et al. 2022), most anxieties about AI flow from predictions about “AI Exposure” that classify tasks or abilities by their potential for automation, as measured by various proxies (Arntz et al. 2017, Brynjolfsson et al. 2018, Felten et al. 2018, Webb 2019, Felten et al. 2021, Tolan et al. 2021, Meindl et al. 2021, Zarifhonarvar 2023, Felten et al. 2023).
Importantly, nearly all these predictions are vague about the timeline and extent of automation because they do not directly consider the technical feasibility or economic viability of AI systems, but instead use measures of similarity between tasks and AI capabilities to indicate exposure. The only exception in the literature known to us is a McKinsey report (Ellingrud et al. 2023) that estimates AI adoption of between 4% and 55%. With such imprecise predictions, it is unclear what conclusions should follow. AI exposure models also conflate predictions about full task automation, which is more likely to displace workers, with partial automation, which could augment their productivity. Separating these effects is enormously important for understanding the economic and policy implications of automation (Acemoglu and Restrepo 2018). In this paper, we address three important shortcomings of AI exposure models to construct a more economically-grounded estimate of task automation. First, we survey workers familiar with end-use tasks to understand what performance would be required of an automated system. Second, we model the cost of building AI systems capable of reaching that level of performance. This cost estimate is essential to understanding the deployment of AI, since technically-exacting systems can be enormously expensive. And third, we model the decision about whether AI adoption is economically-attractive. The result is the first end-to-end AI automation model. A simple hypothetical example makes clear why these considerations are so important. Consider a small bakery evaluating whether to automate with computer vision. One task that bakers do is to visually check their ingredients to ensure they are of sufficient quality (e.g. unspoiled). This task could theoretically be replaced with a computer vision system by adding a camera and training the system to detect food that has gone bad. 
Even if this visual inspection task could be separated from other parts of the production process, would it be cost effective to do so? Bureau of Labor Statistics O*NET data imply that checking food quality comprises roughly 6% of the duties of a baker. A small bakery with five bakers making typical salaries ($48,000 each per year) thus has potential labor savings from automating this task of $14,000 per year. This amount is far less than the cost of developing, deploying and maintaining a computer vision system and so we would conclude that it is not economical to substitute human labor with an AI system at this bakery.

The conclusion from this example, that human workers are more economically-attractive for firms (particularly those without scale), turns out to be widespread. We find that only 23% of worker compensation “exposed” to AI computer vision would be cost-effective for firms to automate because of the large upfront costs of AI systems. The economics of AI can be made more attractive, either through decreases in the cost of deployments or by increasing the scale at which deployments are made, for example by rolling-out AI-as-a-service platforms (Borge 2022), which we also explore. Overall, our model shows that the job loss from AI computer vision, even just within the set of vision tasks, will be smaller than the existing job churn seen in the market, suggesting that labor replacement will be more gradual than abrupt.

The rest of the paper is structured as follows: Section 1 introduces our framework to estimate which tasks are economically attractive to automate, section 2 presents the results, section 3 discusses how labor-replacing AI could proliferate, section 4 discusses the relevance of computer vision automation for other parts of AI, and section 5 concludes.

1. Method

1.1. Overview

We develop a task-based approach to our analysis (Autor et al. 2003) that focuses on two key questions: (i) Exposure: might it be possible to build an AI model to automate this task, and (ii) Economically-attractive: would it be more attractive to use an AI system for this task than to have human workers continue to do it. To assess exposure, we follow the literature (e.g., Brynjolfsson et al. (2018)) in evaluating task descriptions for whether it might be feasible for an AI system to perform them.

Our main contribution is the second part of the analysis: assessing the economic attractiveness of automation. For reasons discussed later, the economic attractiveness of human labor and AI systems is largely driven by the relative costs of each. Modeling the human cost is straightforward labor accounting. Modeling the cost of AI systems is more complicated, so we draw on the computer science literature on training and doing inference with deep learning,¹ as well as 35 case studies that we performed to gather additional data.

Central to our comparison of human and AI costs is the concept of minimum viable scale, which occurs when the AI deployment’s fixed costs are sufficiently amortized that the average cost of using the computer vision system is the same as the cost of human labor of equivalent capability (Borge 2022), as shown in figure 1. AI automation is cost-effective only when the deployment scale is larger than the minimum viable scale.

¹ The technique that has been the dominant source of AI progress since 2012.

Figure 1 The minimum viable scale for AI deployment. (Plot of the average cost to perform a task against the scale of deployment, for computer vision and for human labor; human labor has the advantage below the minimum viable scale and the machine has the advantage above it.)
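The break-even logic behind the minimum viable scale can be illustrated with a minimal sketch. It assumes a deliberately simplified cost structure (one fixed cost plus a constant marginal cost per worker-equivalent of task work), which is cruder than the cost model developed in Section 1.3. The function names and all dollar figures are hypothetical and are not taken from the paper.

```python
def average_cost_cv(fixed_cost: float, marginal_cost: float, scale: float) -> float:
    """Average system cost per unit of work when the fixed cost is spread over `scale` units."""
    return fixed_cost / scale + marginal_cost


def minimum_viable_scale(fixed_cost: float, marginal_cost: float, human_cost: float) -> float:
    """Scale at which the average system cost equals the human cost per unit of work.

    Returns infinity when the system's marginal cost alone is at least the human cost,
    because no amount of amortization can then make the system cheaper.
    """
    if marginal_cost >= human_cost:
        return float("inf")
    return fixed_cost / (human_cost - marginal_cost)


# Hypothetical inputs, chosen only for illustration (not from the paper).
FIXED = 500_000.0    # up-front system cost, in dollars
MARGINAL = 1_000.0   # system cost per worker-equivalent of task work
HUMAN = 6_000.0      # human labor cost per worker-equivalent of task work

mvs = minimum_viable_scale(FIXED, MARGINAL, HUMAN)
print(f"minimum viable scale: {mvs:.0f} worker-equivalents")
for scale in (50, 200):
    cheaper = average_cost_cv(FIXED, MARGINAL, scale) < HUMAN
    print(f"scale {scale}: automation cheaper than human labor? {cheaper}")
```

Under these assumed numbers, deployments below the break-even scale leave human labor cheaper on average, which is the situation of the small bakery in the introductory example.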
Concretely, we say that computer vision has an economic advantage $E^M_{i,j}$ (where $M$ is for machine) for deployment unit $i$ over human labor for a given task $j$ when computer vision can be used for the task (i.e., it has technology exposure, $T_j$) and when the cost of the computer vision system, $C^M_{i,j}$, is less than the cost of human labor, $C^H_{i,j}$:

$$E^M_{i,j} = T_j \wedge \left(C^M_{i,j} < C^H_{i,j}\right)$$

To get the fraction $F$ of compensation that is at risk of being automated according to our model, we aggregate the cost of human labor for the economically feasible tasks over the total sum of compensation for all tasks $J$:

$$F_{I,J} = \frac{\sum_{i \in I} \sum_{j \in J} E^M_{i,j} \times C^H_{i,j}}{\sum_{i \in I} \sum_{j \in J} C^H_{i,j}}$$

Initially we consider the scope of deployment at the firm level; in other words, $i$ denotes different firms. Later, we expand our analysis to broader scopes of deployment, namely, changing $i$ to represent industry groups (4-digit NAICS), subsectors (3-digit NAICS), sectors (2-digit NAICS), or the entire U.S. economy. In all cases, we focus on U.S. non-farm businesses to align our analysis with the data on firm sizes from the Statistics of U.S. Businesses (U.S. Census Bureau (2023)).

1.2. Exposure to Computer Vision ($T_j$)

Many tasks currently performed by workers could be carried out by a sufficiently sophisticated computer vision system, e.g., checking products for quality at the end of a factory assembly line or scanning medical imagery for anomalies. Other tasks have little use for vision technologies, e.g., negotiating the salary of subordinates.

| j | O*NET task | Direct Work Activity (DWA) | Vision Task? |
|---|---|---|---|
| 1 | Operate diagnostic equipment, such as radiographic or ultrasound equipment, and interpret the resulting images. | Analyze test data or images to inform diagnosis or treatment. | Yes |
| 2 | Operate diagnostic equipment, such as radiographic or ultrasound equipment, and interpret the resulting images. | Operate diagnostic imaging equipment. | No |
| 3 | Examine trays to ensure that they contain required items. | – | Yes |

Table 1 Computer Vision Exposure: Examples of O*NET Task - Direct Work Activity pairs, and their classification as vision tasks.

To assess whether a task $j$ has technology exposure, $T_j$, we need to identify which tasks in the economy are vision tasks and which are not. While prior AI exposure analyses inspired our approach to this paper, we cannot use their data directly. Felten et al. (2018, 2023) base their method on linking AI progress to abilities, which does not translate into our task-based model of replacement. Webb (2019) uses a task-based model but does not allow for an easy distinction between computer vision and other AI domains, and Eloundou et al. (2023) only cover language tasks. Brynjolfsson et al. (2018) did produce task-level data with an indicator for whether they could be performed using image data. However, when filtering their results for highly scoring image-based tasks, the output contained many tasks for which we, upon manual inspection, could not see an obvious computer vision use case, e.g., “Analyze market conditions or trends” or “Dispose of biomedical waste in accordance with standards.” Therefore, we create our own data on computer vision exposure.

We take a manual approach to identifying $T_j$. Like Webb (2019), Eloundou et al. (2023), and Brynjolfsson et al. (2018), we rely on the O*NET Database 27.1 (U.S. Department of Labor 2023b). O*NET contains standardized characteristics of work and workers in the United States. By relying on this database, we assume that those tasks are an appropriate unit of replacement and that the initial technology deployment does not otherwise fundamentally change the task structure.
The data contains descriptions of the nature of 1,016 occupations with 19,265 unique associated tasks, which are in turn mapped to 2,087 different direct work activities (DWAs) through a many-to-many relationship. Although the word “task” is a category in the O*NET schema, we find that the descriptions of the O*NET-tasks are too broad and include too many different capabilities. Therefore, we define a task for our purposes as the combination of an O*NET-Task and DWA, as shown in Table 1. This is a more detailed categorization than previous analyses that use just DWAs (Brynjolfsson et al. 2018), or just tasks (Webb 2019). O*NET-Tasks that do not have any associated DWAs are treated as one task.²

² To align the more-granular O*NET taxonomy with wage data that uses the Standard Occupational Classification (SOC), we truncate O*NET-SOC codes (see Appendix A.1).

The large number of O*NET-Tasks makes manual identification of vision tasks challenging, but because of the lack of prior art on automatically identifying vision tasks, it was still our preferred approach. To classify the combinations of almost 20,000 tasks and 2,000 DWAs, we first identify 190 DWAs that indicate that they could be replaced by computer vision in some way. These included DWAs such as “Assess skin or hair conditions,” “Examine patients to assess general physical condition,” “Inspect items for damage or defects,” and “Monitor facilities or operational systems.” Filtering on these 190 DWAs yields 1,922 possible O*NET-Task-DWA combinations, which we also review manually to identify a total of 420 vision tasks, 414 of which exist in U.S. non-farm businesses. Additional details on task selection can be found in Appendix A.2.

1.3. Economic Attractiveness of Computer Vision ($E^M_{i,j}$)

To assess the economic attractiveness of using computer vision systems, it is important to consider both the benefits and cost of their deployment, as compared to the human workers currently doing those tasks. For the analysis that follows, our base case considers the building of AI systems with capabilities equivalent to the human workers doing the task - that is, we are modeling what Brynjolfsson called “Turing Trap” automation (Brynjolfsson 2022). By definition, this approach equalizes the benefits provided by the human and the AI system, and thus the key determinant of economic attractiveness becomes the costs of each.

In reality, we are only matching on some headline capabilities, so there are likely other remaining advantages and disadvantages to using computer vision in place of human workers. For example, a computer vision system might scale more easily if a factory added an additional shift. We assume that, in the short to medium term, these effects are second order. Since our results are robust to even significant changes in the benefits of the AI systems, these secondary effects would have to be large to meaningfully change our conclusions.³

Implicitly, our modeling is making an important assumption about the type of automation that will happen first. In particular, we consider systems that have the same capabilities as the human workers that they are replacing. But why shouldn’t adoption first occur with systems that are less capable than human workers, or those that are more capable?

We do not consider less capable systems because we focus on replacing the human doing a task. Since one of our thresholds for human capabilities is the point at which human workers would be fired (e.g. because they misdiagnosed too many x-rays), we judge that less capable systems would do too poor a job to replace human workers doing this task.
Less capable systems might still be able to augment the human doing the work, which we consider in other work.

³ Over the longer term, there could be adjustment strategies that are much more important, such as was seen with the adoption of electricity (David 1990). We do not attempt to model such changes.

For more capable systems, the question for our analysis isn’t whether systems will be created that have better capabilities than human workers. This is already happening, for example in reading CAT scans (Agarwal et al. 2023). But, insomuch as these systems are economically-attractive to build but so are systems with capabilities equal to human workers (as is true in this case), then our modeling approach will correctly identify the extent and timing of automation. The challenge to our approach would occur if building a more capable system became economically-attractive before the equal-capabilities system. We argue that this is unlikely to be a common occurrence because improving the capability of AI systems results in an enormously rapid increase in the cost of these systems, as shown by Thompson et al. (2020) and as is consistent with foundational computer science work in this area (Kaplan et al. 2020, Henighan et al. 2020, Mikami et al. 2022).

Since less capable systems are unlikely to be able to substitute for human workers, and more capable ones are likely to become economically-attractive only later, the modeling that will best predict the automation of human labor is the computer vision system with equivalent capabilities. And, because such a system provides similar benefits to the human doing that task (by definition), one can compare the economic attractiveness of these systems by comparing their costs.

1.3.1. The Cost of Computer Vision Systems ($C^M_{i,j}$)

To estimate the cost of a computer vision system, we rely on prior work by Thompson et al. (2021, 2022, 2024) to break down and calculate individual cost components. In general, the cost for firm $i$ to fine-tune and deploy a computer vision system to perform a task $j$ can be divided into three categories: fixed costs, performance-dependent costs, and scale-dependent costs. Figure 2 shows an overview of the different components.

Figure 2 Cost drivers of an AI computer vision system

Fixed costs, or engineering costs,⁴ include implementation costs, $C^{eng,imp}$, and maintenance costs, $C^{eng,m}$. Performance-dependent costs are the costs that vary based on the system requirements, and include the cost of data, $C^{data}_j$, and the compute cost per training round, $C^{train}_j$. Finally, scale-dependent costs depend on the amount of work that the system needs to perform, i.e., the running costs of the system, $C^{run}_{i,j}$.

⁴ Although Thompson et al. (2022) also include infrastructure cost as a fixed cost, we ignore this since we assume the use of cloud computing.

To estimate the total cost of replacing human labor for a given task, we calculate the net present value of the cost of a system of a given lifespan. In addition to the initial round of fine-tuning, changes in the real world can lead to a decline in accuracy due to data drift, as explained by Moreno-Torres et al. (2012). To address this accuracy drop, the network must be retrained at regular intervals, $K$ times per year.
Denoting the discount rate as $d$, the yearly rate of decrease in computing costs as $m$, and the system lifespan as $L$, the total cost of building, maintaining, and running a computer vision system, $C^M_{i,j}$, to perform task $j$ in firm (or NAICS code) $i$ is as follows:

$$C^M_{i,j} = C^{eng,imp} + C^{data}_j + C^{train}_j + \sum_{t=0}^{L-1} \left( \frac{C^{eng,m} + (C^{data,retrain}_j \times K)}{(1+d)^t} + \frac{C^{run}_{i,j} + (C^{retrain}_j \times K)}{((1+d) \times (1+m))^t} \right)$$

Throughout the analysis, we assume a flat real discount rate of 5% applied across the economy, i.e., $d = 0.05$, corresponding to a conservative expected real stock market return based on historical values (Sullivan 2023). We assume that computing costs will decrease by 22% on an annual basis, i.e., $m = 0.22$ (Hobbhahn and Besiroglu 2022), and that the amount of system fine-tuning during retraining will be comparable to that during training (although our results are robust to significantly different assumptions). Furthermore, we assume that the system is going to be operational for $L = 5$ years, i.e., have a five-year lifespan, based on the custom software depreciation rate published by the U.S. Bureau of Economic Analysis (2003, p.31).

Performance-Dependent Costs

The cost of a computer vision system depends on the level of performance required. We use the regression proposed by Thompson et al. (2024), which models the cost of developing a deep learning system in terms of the required accuracy, the complexity of the task (as measured by entropy (Shannon 1948)), and the quantity and cost of the data needed. Importantly, the cost of such systems grows as a power law with the accuracy required, consistent with the broader deep learning scaling law literature (Thompson et al. 2020, Prato et al. 2021, Mikami et al. 2021, Zhai et al. 2022). Hence, we assume that, for each task $j$ of type $b_j$ (either classification or semantic segmentation), with entropy $e_j$, there exists a minimum level of accuracy $a_j$ that a human worker must achieve to be deemed fit for the job.
If $f(a_j, e_j, b_j)$ is the number of datapoints required to achieve accuracy $a_j$ at entropy $e_j$ for a system of type $b_j$ (0 if classification and 1 if segmentation) according to Thompson et al. (2024), we model the total cost of data, $C^{data}_j$, as follows:

$$C^{data}_j = f(a_j, e_j, b_j) \times p^{data}_j$$

where $p^{data}_j$ is the cost per datapoint and

$$f(a_j, e_j, b_j) = 10^{\frac{\log_{10}(1 - a_j) + 0.81 - 0.61 \times \frac{\log_2(1/e_j)}{9.94} - 0.62 \times b_j}{-0.12}}$$

Unlike previous work that asks AI experts about applicability on tasks that they are unfamiliar with, we choose to gather information on the tasks from domain experts and then use their answers to calculate the AI applicability. In particular, we use an online survey to collect data on the performance needed to complete each task. By using a survey, we get those familiar with doing the task to provide the accuracy required and the cost that would be involved in gathering additional data points.

Respondents are recruited from the online crowd-sourcing platform Prolific, which directs them to a survey based on Qualtrics. Respondents are guided to choose the job that they are familiar with and answer questions about all the vision tasks involved in the selected job. They can skip tasks that they are unfamiliar with. We drop answers that fail the attention checks or where the respondents are unsure. We aim to collect at least 5 valid answers for each vision task, and we use the mean of the responses as our measure. In practice, we collected an average of 9 responses per task, with 80% of the tasks having 5+ valid responses. For 33 tasks for which we were unable to find any respondents familiar with them, we use the mean value of the other tasks as the inferred value.
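The data-requirement function and the data cost defined above can be sketched as follows. This is a minimal illustration, assuming the reading of the formula in which the entropy term $\log_2(1/e_j)/9.94$ sits inside the numerator and the whole numerator is divided by $-0.12$. The function names, the example inputs, and the price per datapoint are hypothetical, and the entropy argument is passed in whatever units the formula's $e_j$ uses.

```python
import math


def datapoints_required(accuracy: float, entropy: float, is_segmentation: int) -> float:
    """Number of datapoints f(a_j, e_j, b_j) implied by the regression formula.

    accuracy:        required accuracy a_j, strictly between 0 and 1
    entropy:         task entropy term e_j, strictly positive
    is_segmentation: b_j, 0 for classification and 1 for semantic segmentation
    """
    if not 0.0 < accuracy < 1.0:
        raise ValueError("accuracy must lie strictly between 0 and 1")
    if entropy <= 0.0:
        raise ValueError("entropy must be positive")
    numerator = (
        math.log10(1.0 - accuracy)
        + 0.81
        - 0.61 * math.log2(1.0 / entropy) / 9.94
        - 0.62 * is_segmentation
    )
    return 10.0 ** (numerator / -0.12)


def data_cost(accuracy: float, entropy: float, is_segmentation: int,
              price_per_datapoint: float) -> float:
    """Total data cost: datapoints required times the cost per datapoint."""
    return datapoints_required(accuracy, entropy, is_segmentation) * price_per_datapoint


# Hypothetical inputs for illustration only (not survey values from the paper).
for acc in (0.90, 0.95, 0.99):
    n = datapoints_required(acc, entropy=0.5, is_segmentation=0)
    print(f"accuracy {acc:.2f}: about {n:,.0f} datapoints")
```

Because the accuracy term enters through $\log_{10}(1 - a_j)$ divided by $-0.12$, each tenfold reduction in the permitted error rate multiplies the implied data requirement by many orders of magnitude, which is the power-law behavior described in the text.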
We also attempted to gather information on the complexity (entropy) of applications from domain experts, but across multiple pilot surveys we were unable to find a way to reliably get respondents to answer. Instead, we manually assess each task description to estimate the entropies. Since this data lacks the domain expert backing, we specifically test robustness to higher or lower entropy values. The details of the survey collection and entropy data are outlined in Appendix A.4.

To calculate the cost of compute, $C^{train}_j$, we use the number of datapoints implied by $f(a_j, e_j, b_j)$ and the following equation:

$$C^{train}_j = \frac{f(a_j, e_j, b_j) \times 2 \times \#\,\text{Model Connections} \times 3 \times \#\,\text{Epochs}}{\text{GPU FLOPs/h}} \times \frac{p^{GPUh}}{U}$$

Here, the numerator of the left factor is the number of floating-point operations (FLOPs) required to train the model, based on research by Sevilla et al. (2022). The denominator is the number of FLOPs a given graphics processing unit (GPU) can perform in 1 hour at peak utilization. The right factor is the price per GPU hour, $p^{GPUh}$, over the utilization, $U$, of that GPU. We assume that computation is done on the cloud and that an hour of time on a 4 FP-32 TFLOPS GPU costs $p^{GPUh} = \$0.340$, based on AWS pricing.⁵ We assume a GPU utilization of $U = 85\%$, which is consistent with the utilization when training large computer vision models, such as ResNet50 (Yeung et al. 2020). Finally, we assume that 50 epochs are used and that the foundation model used has the same parameter size as VGG-19, i.e., $1.44 \times 10^8$ parameters (Simonyan and Zisserman 2014), the largest architecture in the “sane list of the most-commonly used model architectures” provided by Thompson et al. (2024).

⁵ eia2.xlarge pricing in U.S. East region on AWS https://aws.amazon.com/machine-learning/elastic-inference/pricing/, Accessed: 2023-04-09
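The training-compute equation can be checked with a short sketch that plugs in the constants stated in the text (a 4 FP-32 TFLOPS GPU at $0.34 per hour, 85% utilization, 50 epochs, and $1.44 \times 10^8$ model connections). The function names and the example of one million datapoints are hypothetical; in the model the datapoint count would come from $f(a_j, e_j, b_j)$.

```python
GPU_FLOPS_PER_SECOND = 4e12   # 4 FP-32 TFLOPS, from the text
GPU_PRICE_PER_HOUR = 0.34     # p^GPUh in dollars, from the text
GPU_UTILIZATION = 0.85        # U, from the text
MODEL_CONNECTIONS = 1.44e8    # VGG-19 parameter count, from the text
EPOCHS = 50                   # from the text

GPU_FLOPS_PER_HOUR = GPU_FLOPS_PER_SECOND * 3600


def training_cost(datapoints: float) -> float:
    """Compute cost in dollars of one training round over `datapoints` examples."""
    # The factor 2 x connections x 3 x epochs is the FLOP count per datapoint
    # used in the equation above.
    train_flops = datapoints * 2 * MODEL_CONNECTIONS * 3 * EPOCHS
    gpu_hours_at_peak = train_flops / GPU_FLOPS_PER_HOUR
    return gpu_hours_at_peak * GPU_PRICE_PER_HOUR / GPU_UTILIZATION


def inferences_per_gpu_hour() -> float:
    """Forward passes per GPU hour, counting 2 FLOPs per connection per inference."""
    return GPU_FLOPS_PER_HOUR / (2 * MODEL_CONNECTIONS)


# Hypothetical datapoint count, for illustration only.
print(f"training cost for 1e6 datapoints: ${training_cost(1e6):.2f}")
print(f"inferences per GPU hour: {inferences_per_gpu_hour():,.0f}")
```

With these constants the compute bill for a single training round is small relative to the engineering costs discussed later, so in this sketch compute only becomes material when the required number of datapoints is very large.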
Scale-Dependent Costs

There is a marginal cost of running the model, i.e., making inferences. While running costs of large AI models are frequently cited as very large, we are only interested in knowing the running costs at the minimum viable scale to calculate the economic advantage. To determine this, we consider running costs that are proportional to the amount of human labor being displaced. Machines have an advantage over human labor in terms of speed (Chui et al. 2016, Combemale et al. 2022). For instance, a 4 FP-32 TFLOPS GPU hour is enough to make approximately 50,000,000 inferences using VGG-19 (Simonyan and Zisserman 2014).⁶ No human being could match this pace of almost 14,000 inferences per second. We therefore assume that fewer GPU hours than human hours are needed, and thus that making this calculation with the human hours gives an upper bound on the inference costs. We could make this more precise by estimating a relative factor between the two, but this would not meaningfully change our answer.

The yearly running costs, $C^{run}_{i,j}$, for a computer vision system to perform task $j$ within firm $i$ are therefore modeled as follows:

$$C^{run}_{i,j} = \frac{p^{GPUh}}{U} \times 40 \times 50 \times v_j \times n_{i,j}$$

Here, 40 is the number of hours worked per week, 50 is the number of weeks worked per year, $v_j$ is the fraction of that task in the employees’ duties, and $n_{i,j}$ is the number of employees in the firm that perform that task. As with our training costs, we use a GPU hourly rate of $p^{GPUh} = \$0.34$ and assume a GPU utilization of $U = 85\%$. Our method for finding $v_j$ and $n_{i,j}$ based on publicly provided data is outlined in Section 1.3.2.

Fixed Costs

The engineering project for a computer vision system involves two phases: implementation ($C^{eng,imp}$) and maintenance ($C^{eng,m}$). We assume that the implementation and maintenance costs are the same for all tasks, reflecting the complexity of the engineering process rather than the complexity of individual tasks.
To estimate these costs, we referred to the case study presented by Thompson et al. (2021), which describes a deep learning time series prediction project. The study reports an upfront implementation cost of $C^{eng,imp} = \$1,765,000$ for a 6-month project and a yearly maintenance cost of $C^{eng,m} = \$242,840$. A breakdown of these costs is shown in Appendix A.3. Importantly, these are the full cost of developing and deploying a production-ready system.

⁶ $(4 \times 10^{12} \times 3600)/(2 \times 1.44 \times 10^8)$, where the numerator is FLOPs per hour and the denominator is FLOPs per inference.

Alternative: Bare-Bones Setup

The costs outlined above assume that there is significant engineering work required to develop a computer vision system to replace the task at hand, but this is not always the case. There are instances where costs can be reduced or eliminated completely. For example, a foundation model might already be fit for the task or sufficiently close to it that fine-tuning can be done with available data and hardware. Therefore, in addition to the setup above, we explore the possibility that the only cost of the system is that of a small engineering team, with an implementation cost of $C^{eng,imp,bb} = \$165,000$ and yearly maintenance cost of $C^{eng,m,bb} = \$122,840$ (see Appendix A.3).

Using the same assumption of discount rate, $d = 0.05$, and system lifespan, $L = 5$, as elsewhere in this paper, the total cost, $C^M_{i,j}$, of implementing this bare-bones computer vision system for a task $j$ can hence be written as

$$C^M_{i,j} = C^{eng,imp,bb} + \sum_{t=0}^{L-1} \frac{C^{eng,m,bb}}{(1+d)^t}$$

1.3.2. Cost of Human Labor to Firm ($C^H_{i,j}$)

Compared to the cost of computer vision, human labor does not exhibit the same economies of scale.
To a large extent, the cost of human labor is the same as the marginal cost of compensation per worker, and we model it as such. For a given firm $i$ and task $j$, where $j$ can be accomplished by a computer vision system with a lifespan of $L$, we define the present value of labor cost to the firm as follows:

$$C^H_{i,j} = \sum_{t=0}^{L-1} \frac{w_{i,j} \times r \times v_j \times n_{i,j}}{(1+d)^t}$$

Here, $w_{i,j}$ is the mean wage of the occupation that performs task $j$ within firm $i$, $r$ is the wage to total compensation ratio, $v_j$ is the fraction of an occupation’s duties that makes up $j$, $n_{i,j}$ is the number of workers that perform $j$ within firm $i$ (in later sections we consider deployment scales larger than the firm), and $d$ is the discount rate. As in previous sections, we use a discount rate of $d = 0.05$ and assume that the lifespan of the system is $L = 5$. A diagram of the relationship between these factors can be seen in Figure 3.

To estimate wage costs $w_{o,i}$, we used the 2022 Occupational Employment and Wage Statistics (OEWS) data tables created by the U.S. Bureau of Labor Statistics (2022b), imputing missing employment and wage numbers in the more granular North American Industry Classification System codes (NAICS) (Murphy 1998). We narrow down the data to U.S. non-farm businesses by excluding NAICS codes not covered by the 2020 Statistics of U.S. Businesses produced by U.S. Census Bureau (2023).⁷ To convert employee wages to employer costs, we used the wage-to-compensation ratio for civilian workers published by the U.S. Bureau of Labor Statistics (2022a, p.4), i.e., $r = 1.449$.⁸

To assign a fraction of an employee’s duties, $v_j$, and thereby implicitly also a fraction of the employee’s wages, to a given task and calculate labor cost, we weight each task by its score on the O*NET-Task-Importance scale, following the examples of Brynjolfsson et al. (2018) and Webb (2019).
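With both sides of the comparison defined, the cost equations and the economic-advantage test from Section 1.1 can be sketched together. The constants $d$, $m$, $L$, $r$, the GPU price and utilization, and the engineering costs are the values stated in the text; the function names, the two example deployment units, and the zeroed data and training costs are hypothetical simplifications for illustration. The small-firm inputs loosely follow the bakery example from the introduction.

```python
DISCOUNT_RATE = 0.05          # d, from the text
COMPUTE_DECLINE = 0.22        # m, from the text
LIFESPAN_YEARS = 5            # L, from the text
WAGE_TO_COMPENSATION = 1.449  # r, from the text
GPU_PRICE_PER_HOUR = 0.34     # p^GPUh, from the text
GPU_UTILIZATION = 0.85        # U, from the text


def discounted_sum(yearly_amount, growth=1.0 + DISCOUNT_RATE, years=LIFESPAN_YEARS):
    """Sum of yearly_amount / growth**t for t = 0 .. years - 1."""
    return sum(yearly_amount / growth ** t for t in range(years))


def running_cost(task_share, workers):
    """Yearly inference cost, bounded above by the human hours displaced."""
    return GPU_PRICE_PER_HOUR / GPU_UTILIZATION * 40 * 50 * task_share * workers


def full_system_cost(data, train, run, data_retrain, retrain, retrains_per_year,
                     implementation=1_765_000.0, maintenance=242_840.0):
    """Net present cost of the full system, following the total-cost equation."""
    labor_like = maintenance + data_retrain * retrains_per_year
    compute_like = run + retrain * retrains_per_year
    compute_growth = (1.0 + DISCOUNT_RATE) * (1.0 + COMPUTE_DECLINE)
    return (implementation + data + train
            + discounted_sum(labor_like)
            + discounted_sum(compute_like, growth=compute_growth))


def bare_bones_system_cost(implementation=165_000.0, maintenance=122_840.0):
    """Net present cost of the bare-bones setup."""
    return implementation + discounted_sum(maintenance)


def human_labor_cost(mean_wage, task_share, workers):
    """Present value of compensation paid for the task over the system lifespan."""
    return discounted_sum(mean_wage * WAGE_TO_COMPENSATION * task_share * workers)


def attractive_fraction(units):
    """Share of task compensation with E = T and (C^M < C^H).

    units: iterable of (exposed, system_cost, human_cost) tuples.
    """
    units = list(units)
    total = sum(human for _, _, human in units)
    attractive = sum(human for exposed, system, human in units
                     if exposed and system < human)
    return attractive / total


# Hypothetical deployment units: same wage and task share, different head counts.
system = bare_bones_system_cost()
small = human_labor_cost(48_000.0, 0.06, 5)
large = human_labor_cost(48_000.0, 0.06, 1_000)
print(f"bare-bones system cost:        ${system:,.0f}")
print(f"human cost, 5 workers:         ${small:,.0f}")
print(f"human cost, 1,000 workers:     ${large:,.0f}")
print(f"attractive share of compensation: "
      f"{attractive_fraction([(True, system, small), (True, system, large)]):.3f}")
```

Even under the cheaper bare-bones assumption, the five-worker unit in this sketch does not clear the cost comparison while the thousand-worker unit does, which illustrates why the number of workers performing a task drives the firm-level results.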
[7] NAICS codes excluded are Rail Transportation (482); Postal Service (491); Pension, Health, Welfare, and Other Insurance Funds (5251); Trusts, Estates, and Agency Accounts (525920); Offices of Notaries (541120); Private Households (NAICS 8111); and Public Administration (99), as well as public schools.
[8] 1/0.69

Figure 3. Cost of human labor performing a task.

To calculate the number of employees of a given occupation per firm, we say that \(n_{i,j} = size_{i} \times occ_{i,j}\), where \(size_{i}\) is the size of the firm and \(occ_{i,j}\) is the fraction of that firm’s employment base that is of a given occupation. Since \(occ_{i,j}\) is not directly visible in aggregate data, we estimate it using the method outlined in Appendix A.2.2.

2. Firm-level Results

We first report the results of applying our framework to AI adoption at the firm level, the natural decision-making point for market economies. Later, we consider the results if we allow for consolidation of economic activity via large shifts in market shares or the creation of AI-as-a-service platforms.

2.1. Key Findings

There is a dramatic difference between the vision tasks that are exposed to AI and those that firms would find economically attractive to automate. While 36% of jobs in U.S. non-farm businesses have at least one task that is exposed to computer vision, only 8% (23% of them) have at least one task that is economically attractive for their firm to automate (Figure 4a). Since vision tasks make up only a small fraction (2%-30%) of any occupation, the more relevant metric is the share of compensation. Figure 4b aggregates the compensation per task, and presents the cost comparison as a percentage of U.S. labor compensation instead of percentage of jobs. We find that vision tasks comprise 1.6% of U.S. non-farm compensation, of which only 0.4% of compensation (again 23% of the total) is attractive to automate with AI.

Figure 4. Comparison of AI exposure and firm-level economic attractiveness for computer vision. (a) Jobs in U.S. businesses: jobs with at least some computer vision exposure versus jobs with at least one economically attractive vision task. (b) Worker compensation in U.S. businesses: computer vision exposure versus economically attractive computer vision.

These results are driven by the cost of AI system deployments. Figure 5 shows the share of vision task compensation that firms could profitably replace by an AI computer vision system of a given cost. Even if a system only costs $1,000, there are tasks that are not economically attractive to replace (tasks in occupations with low wages, many different tasks per occupation, working in small firms). By contrast, for other tasks, there are sufficient labor cost savings to justify investments in systems costing more than $100 million (tasks with high wages, few tasks per occupation, firms with many workers doing the task). We have annotated the graph with the median estimated system cost to illustrate the impact of system costs on automation.[9] As this graph shows, system costs are enormously important to economic attractiveness. This graph also shows an important result for the future dissemination of AI systems: exponential decreases in cost are needed for linear increases in the share of tasks attractive to automate. In Appendix B, we show our key findings by sector.

2.2. Sensitivity Analysis

Because our model of AI automation is end-to-end, it necessarily includes a range of technical and economic assumptions.
To ensure that our results are robust to reasonable deviations in these parameters, we consider three types of sensitivity tests: (i) changes to cost parameters, (ii) changes to the benefits provided by switching to AI systems, and (iii) a “bare-bones” scenario that tests what happens when many costs are lower.

2.2.1. Sensitivity to cost assumptions. Table 2 provides an overview of the cost parameters that we test sensitivity for. In each case, we compare low and high cost cases to our baseline results. We model the low and high cases for the needed accuracy as polynomials to ensure that the range of possible values remains between 0 and 1.

Figure 6a shows the results of varying our assumptions according to Table 2, one variable at a time. Many of the changes have a relatively small impact on the results. Changes to the required accuracy, data costs, or engineering costs are more consequential, although they only increase the share of automation from 23% to 33% of compensation, at most. In addition to testing robustness to these specific costs, we also test for robustness in extrapolating costs to high levels of accuracy. This check is important because the estimates from Thompson et al. (2024) are only observed over a limited range of accuracies, and we must then extrapolate to higher levels. If we instead extrapolate using curves estimated by others, we do not see any substantial changes to our results.

[9] Note: the share of compensation attractive to automate at the median cost is not the same as the percentage of compensation in Figure 4b, because the latter comes from calculating benefits and costs across the distribution of firms.

Figure 5. Relationship between the cost of AI computer vision systems and the share of human vision task compensation that would be attractive for firms to automate. Horizontal axis: cost of system ($1K to $100M, log scale). Vertical axis: share of computer vision exposed compensation in U.S. businesses for which the cost of human labor exceeds the system cost; the median estimated system cost is annotated.

Symbol | Parameter | Low Cost | Base | High Cost
\(C^{eng}\) | Engineering costs | \(0.2 \times C^{eng}\) | \(C^{eng}\) | \(2 \times C^{eng}\)
\(p^{data}_{j}\) | Data costs | \(0.5 \times p^{data}_{j}\) | \(p^{data}_{j}\) | \(2 \times p^{data}_{j}\)
\(p^{GPUh}\) | Cloud pricing | $0.1 | $0.34 | $1
\(L\) | System lifespan | 10 years | 5 years | 2 years
\(K\) | Retraining cadence | Never | 1 year | 2 months
\(a_{j}\) | Accuracy | \(a_{j}^{2}\) | \(a_{j}\) | \(\sqrt{a_{j}}\)
\(e_{j}\) | Entropy | † | \(e_{j}\) | ‡

Table 2. Base case parameter values and sensitivity analysis. †, ‡: see Figure 14 in Appendix A.4.

2.2.2. Sensitivity to benefit assumptions. As discussed in Section 1.3, our model assumes that building AI computer vision systems with the same task accuracy capabilities as human workers means that they will provide similar benefits. However, this assumption could easily miss other dimensions of performance that would increase or decrease the value of such systems. For example, it would likely be easier to add a third shift at a factory by using a computer vision system for more hours per day than it would be to hire an additional worker. On the other hand, if the volume of work being done is variable, human workers might be better equipped to do other tasks during slow times.

Figure 6. Sensitivity of automation results to different cost and benefit assumptions. (a) Costs: low-cost and high-cost cases for each model parameter in Table 2 relative to the base assumptions. (b) Benefits: task value multipliers from ×0.125 to ×8 relative to the base case (1×). Horizontal axes: share of computer vision exposed compensation in U.S. businesses.

To explore the sensitivity of automation to the value created by an AI system, we consider how the adoption decision would change if, for example, it generates 2× the value of a human worker. To estimate this empirically, we start from a typical economic assumption: that workers are paid their marginal product (Clark 1908). Figure 6b tests how much the economics of AI adoption change if the AI system delivers some multiple of that human marginal product (as measured by the person’s wage). We find that an AI system that doubles the benefit of the human worker would increase the share of compensation that is attractive to automate from 23% to 30%. Our analysis here also echoes the finding from Figure 5, showing that exponential changes in benefits are needed for linear changes in automation share.

Importantly, this analysis should only be thought of as applying to short-to-medium-term adoption decisions. Over the longer term, firms can adapt their production more fundamentally. Estimating the scale of benefits that could be derived from such deeper structural changes to production is beyond the scope of this article.

2.2.3. Bare-bones implementation. Thus far, we have considered sensitivity to univariate changes in our parameters. In this analysis, we consider the bare-bones development setup described in Section 1.3.1. This assumes free data, free compute, and only minimal engineering effort. Even with
those extremely aggressive assumptions, the amount of economically attractive firm-level automation only increases to 49% (0.79% of compensation), as shown in Figure 7. This reflects the extremely fragmented distribution of tasks in the economy, which can make even moderate costs of development prohibitive. This result will be important for our later generalization to AI-as-a-service platforms because it likely also means that automating many of these tasks would require an extensive sales and coordination effort, which would slow adoption.

Figure 7. Impact of “bare-bones” cost assumptions on which vision tasks are economically attractive to automate. Bars compare the best estimate and the bare-bones scenario for computer vision exposure and economically feasible computer vision, as a share of compensation in U.S. businesses.

2.3. Predictive Power for Historical Labor Outcomes

Over the past decade, many studies have investigated the relationship between technology exposure and the susceptibility of various occupations to labor replacement (see Table 3). The success of these measures in predicting historical labor outcomes was explored by Frank et al. (2023), who measured unemployment risk by calculating the probability of receiving unemployment benefits, using data from each state’s unemployment benefits office.

To assess the power of our results against these other measures, we recreate Frank et al.’s analysis. Since their data is at the 2-digit SOC code level, we aggregate our measure to (i) the share of tasks in that area that are computer vision, using our AI exposure variable, and (ii) a composite score for how close tasks are to having an economic advantage. For each task, we calculate the cost difference between AI computer vision completing the task and a human completing it at the same level of proficiency. We then aggregate to the 2-digit SOC level. In Table 4, we re-state the findings from Frank et al.
(2023) in models 1-8, and then compare them to the predictiveness of our computer vision measures for the risk of unemployment across different occupations from 2010-2020 (model 9). As these results show, our method is as good as or better than any of the other measures at explaining unemployment risk, explaining 10.9% of variance as compared to less than 3% for most others, and 8.9% and 10.7% for the three Webb measures and Arntz et al., respectively. Our high predictive power relative to these other analyses is particularly noteworthy since we only consider computer vision, whereas they consider many more types of automation. This suggests that extending our approach to other forms of automation could be even more explanatory.

Study | Variable | Description
Acemoglu and Autor (2010) | Computer Usage | Assess occupations on computer usage.
Acemoglu and Autor (2010) | Routine Cognitive | Assess occupations on routineness and cognitive.
Acemoglu and Autor (2010) | Routine Manual | Assess occupations on manual requirements.
O*NET Education | O*NET % college | The fraction of workers in an occupation holding a bachelor’s degree.
Frey and Osborne (2017) | auto | Probability of computerization that combines a subset of occupation skills with subjective assessments of occupation automation levels.
Arntz et al. (2016) | auto2 | Job automatibility risk in OECD countries based on an occupation task-based approach.
O*NET Degree of Automation | O*NET Deg. Auto. | Level of automation integrated into the tasks and responsibilities of a particular job or occupation.
Brynjolfsson et al. (2018) | SML | Suitability for machine learning at the task level within various job categories.
Felten et al. (2018) | AI2 | Links AI advancements to occupational abilities and aggregates them at the occupation level.
Webb (2019) | % AI Exposure | Compared technology patents with occupation tasks to measure the exposure of occupations to AI.
Webb (2019) | % Robot Exposure | Compared technology patents with occupation tasks to measure the exposure of occupations to robots.
Webb (2019) | % Software Exposure | Compared technology patents with occupation tasks to measure the exposure of occupations to software.
Our study | % Computer Vision Exposure | The percentage of compensation in U.S. businesses that are vision tasks.
Our study | Economic Attractiveness | A composite score for how close vision tasks are to having an economic advantage.

Table 3. Studies estimating AI exposure by occupation.

Dependent variable: log10 Unemployment Risk by Occupation, Month, & State

Model | Variable | Coefficient
Model 1 | Computer Usage | 0.000
Model 1 | Routine Cognitive | -0.096∗∗∗
Model 1 | Routine Manual | 0.137∗∗∗
Model 2 | O*NET % college | -0.134∗∗∗
Model 3 | auto | 0.024∗∗∗
Model 4 | auto2 | 0.327∗∗∗
Model 5 | O*NET Deg. Auto. | 0.152∗∗∗
Model 6 | SML | 0.082∗∗∗
Model 7 | AI2 | -0.109∗∗∗
Model 8 | % AI Exposure | 0.332∗∗∗
Model 8 | % Robot Exposure | 0.412∗∗∗
Model 8 | % Software Exposure | -0.294∗∗∗
Model 9 | % Computer Vision Exposure | 0.134∗∗∗
Model 9 | Economic Attractiveness | 0.166∗∗∗

Statistic | Model 1 | Model 2 | Model 3 | Model 4 | Model 5 | Model 6 | Model 7 | Model 8 | Model 9
R2 | 0.028 | 0.018 | 0.001 | 0.107 | 0.023 | 0.007 | 0.012 | 0.089 | 0.109
adj. R2 | 0.028 | 0.018 | 0.001 | 0.107 | 0.023 | 0.007 | 0.012 | 0.089 | 0.109

pval < 0.1∗, pval < 0.01∗∗, pval < 0.001∗∗∗

Table 4. Extension of the regression analysis in Frank et al. (2023) between technology exposure and unemployment risk in occupations. Bolded results (Model 9) are our analysis.

3. Paths to AI Proliferation

Section 2 reveals an important limitation for the proliferation of labor-replacing AI: with today’s technology, many of these systems are unattractive for firms to adopt. The economic attractiveness of AI could be substantially increased in two important ways.
The first is deployment scale: finding ways for AI systems to automate more labor per system. The second is development costs: inventing less expensive ways to build AI systems. Here, we explore how these changes would affect the pace of AI deployment. In particular, to test the impact of scale, we estimate the impact of firms getting larger and of AI-as-a-service being used for more tasks. We also explore the hypothetical pace of computer vision proliferation based on different rates of cost decreases.

3.1. Human Labor is Replaced at Larger Scales

In addition to firms custom-building their own systems, there is a possibility of achieving the minimum viable scale by aggregating human labor across firms. While this could hypothetically be done by one firm winning market share from its competitors because of increased efficiency, vision tasks are a small part of firm costs, so an advantage in vision is unlikely to generate large differences in competitive advantage at the firm level in many areas. We believe that aggregating demand for AI solutions is more likely to happen through AI-as-a-service business models – where, for example, one firm develops the AI system for a task and others that also need the system outsource to it. Real-world examples of this include a diamond classification tool built by NavTech (Thompson 2021a) and a self-driving platform collaboration by NVIDIA (Thompson 2021b).

We define economic advantage for AI-as-a-service as when the aggregate compensation paid to workers performing tasks in a given NAICS code is strictly larger than the cost of developing and running a computer vision system. We use the same definition of \(C^{H}_{i,j}\) as above, with the only difference being that \(i\) is either the overall U.S. private non-farm economy, a sector, a subsector, or an industry group. We obtain \(n_{i,j}\) directly using the imputed OEWS data described earlier.
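The difference between the firm-level decision and the AI-as-a-service definition above can be illustrated with a short Python sketch. It assumes that each firm's present-value labor cost for the task has already been computed and that a single system can serve every firm in the group at the same total cost; the function names and the numbers are hypothetical, and coordination costs are ignored.

```python
from typing import Sequence


def firm_level_attractive_share(firm_labor_costs: Sequence[float],
                                system_cost: float) -> float:
    """Share of task compensation in firms whose own labor cost
    strictly exceeds the cost of building one system per firm."""
    total = sum(firm_labor_costs)
    if total == 0:
        return 0.0
    attractive = sum(c for c in firm_labor_costs if c > system_cost)
    return attractive / total


def service_level_attractive(firm_labor_costs: Sequence[float],
                             system_cost: float) -> bool:
    """AI-as-a-service criterion: aggregate compensation across the
    group is strictly larger than the cost of a single shared system."""
    return sum(firm_labor_costs) > system_cost


# Hypothetical industry group: three firms, none large enough on its own.
costs = [200_000.0, 300_000.0, 400_000.0]
system = 723_000.0  # illustrative system cost
print(firm_level_attractive_share(costs, system))
print(service_level_attractive(costs, system))
```

In this illustrative case no individual firm clears the threshold, while the pooled labor cost does, which is the aggregation effect that the AI-as-a-service scenario relies on.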
We find that the median employee works in a firm where close to none of the vision tasks are cost-effective to automate.[10] Even a firm with 5,000 employees, i.e., larger than 99.9% of firms in the United States, could cost-effectively automate less than one tenth of its existing vision labor at the current cost structure. This finding helps explain results from McElheran et al. (2023), who find that fewer than 6% of firms use AI-related technologies but that these are disproportionately large firms, representing 18% of employment.

At the extreme end of the firm size spectrum, even a hypothetical firm as big as Walmart lacks the scale to make automating 15% of its vision tasks attractive.[11] As shown in our sensitivity analysis, large differences in value creation would be needed to substantially change our results. As such, we expect only minor changes in computer vision attractiveness because of changes in firm size distribution or occupational concentration. Indeed, even a perfectly concentrated economy with exactly one occupation per firm would only have a tenfold multiplier on task value compared to our base assumption.

[10] The median employee works in a firm with between 500 and 749 employees. We gathered this statistic and the others mentioned in the paragraph from the SUSB Data Tables released by the U.S. Census Bureau (2023).
[11] 1,600,000 U.S. employees according to https://corporate.walmart.com/about, accessed 2023-05-08.

Most firms are, and likely will remain, too small to cost-effectively develop computer vision to replace their existing workers. But, if labor costs for a given task can be aggregated across multiple firms, the economics of automation become much more attractive.
If systems could be deployed at the national level – a single system doing all instances of that task across the entire economy – then AI already has an economic advantage for 88% of vision task compensation. Most of the scale needed is already present at the industry group level (i.e., the NAICS 4-digit level), rather than the national level, as shown in Figure 8. These results suggest that business models that offer AI-as-a-service will likely be an important driver of AI automation, since they can provide a scale that makes many more tasks attractive to automate.

But, while AI-as-a-service has the potential for much greater automation, there are important technical and economic reasons to doubt whether this industry-level automation can be achieved in the short-to-medium run. The technical challenge is whether systems designed for particular tasks generalize to the industry level. For example, building on the tasks shown in Table 1, O*NET groups the interpretation of radiographic and ultrasound results. But a system for interpreting x-rays for broken bones may not generalize to interpreting ultrasounds for cancer. And while AI systems have indeed increasingly shown an ability to generalize across tasks (Tu et al. 2023), these have also been accompanied by rapidly rising costs (Cottier 2023).

The economic challenge of deploying AI-as-a-service is the cost of coordination: getting many disparate firms onto single platforms is expensive. Whether that coordination comes in the form of salespeople pitching clients, or advertising to get clients to opt themselves in, we would expect only partial adoption of platforms. There could be many reasons for this. For example, Hannan and Freeman (1984) describe how inertia, i.e., resistance to change, is a powerful force within companies, and Walsh (1991) illustrates how worker resistance plays a role in avoiding the automation of existing tasks.
It is thus perhaps not surprising that closing rates are as low as 20% in enterprise IT sales, according to HubSpot Sales blogger Fuchs (2022). Combining these insights, we find it unlikely that any third-party vendor could capture more than a fraction of the total market. For all these reasons, we expect it will be hard and time-consuming to capture the scale that AI-as-a-service could offer, even though we expect many start-ups and venture capitalists to actively pursue these opportunities.

Access to data for fine-tuning is another obstacle to the proliferation of AI-as-a-service at scale. The reason we need fine-tuning is to be able to incorporate knowledge about objects and situations that the system needs to be able to handle gracefully, as discussed further in Section 4.2.1. This data can most easily be collected within firms where these tasks are carried out. But those firms might have important reasons not to release this data outside the company. This provides a barrier to the creation of AI-as-a-service offerings by third parties. Some industry actors have come together to establish data-sharing agreements where a third party could not otherwise collect the required data, such as the NVIDIA Drive collaboration described by Thompson (2021b). Governments and regulatory bodies could also accelerate or hinder platform offerings through rules on data sharing.

Figure 8. Fraction of vision task compensation economically attractive to automate if single systems are deployed at this scope. Scopes shown: individual businesses, and AI-as-a-service at the level of industry groups (NAICS 4d), subsectors (NAICS 3d), sectors (NAICS 2d), and the U.S. economy. Vertical axis: share of computer vision exposed compensation in U.S. businesses.
To quantify the effects of AI-as-a-service adoption, we perform a simulation exercise that models diffusion as increases in the effective deployment scale that firms get when they make the automation decision. For example, because of coordination, sales, etc., a firm might have market size \(X\) in year 0, but that could grow by a factor \(g\) to \(Xg\) in year 1, \(Xg^{2}\) in year 2, etc., until the entire market is covered. Firms adopt if and when this additional scale is sufficient to justify their costs. Because there is ambiguity in the adoption order of firms, we assume an anti-trust ordering, such that large firms must start by selling to smaller firms, rather than to their largest competitors.

Figure 9 shows the results of these simulations based on various growth rates of these platforms. As these results show, very rapid platformization across all vision tasks (+20% per year) would result in significant automation within the next decade, whereas a more gradual platformization (+5% per year) would require decades to get to the full platform automation potential.

Figure 9. Simulation results: Computer vision automation if AI-as-a-platform offerings allow deployment sizes to grow. Lines show 5%, 10%, and 20% compound annual growth rates. Horizontal axis: year in which tasks become economically feasible (2022-2042). Vertical axis: share of 2022 computer vision exposed compensation in U.S. businesses.

Given the challenges in developing AI-as-a-service for every vision task, including the significant industry restructuring needed to outsource so many tasks, we expect that much of the proliferation of computer vision will come not through platformization but through AI systems becoming less expensive through technological change.

3.2. AI Deployment Becomes Cheaper Through Technological Change

As long as there is a need to customize AI to specific applications (e.g., through fine-tuning), the costs required will affect how it proliferates. Because computer vision, as it stands today, only has an economic advantage in 23% of vision tasks at the firm level and barriers to AI-as-a-service deployments exist, there will most likely need to be a sharp reduction in cost for computer vision to replace human labor. Figure 10 simulates what will happen to the amount of economic advantage computer vision will have in vision tasks over time, if we keep other aspects of the model constant but have annual system cost decreases ranging from 10% to 50%. Even with a 50% annual cost decrease, it will take until 2026 before half of the vision tasks have a machine economic advantage, and by 2042 there will still exist tasks that are exposed to computer vision but where human labor has the advantage. At a 10% annual system cost decrease, computer vision market penetration will still be less than half of exposed task compensation by 2042.

We strongly agree with the proposition that computer vision costs will drop over time, albeit not as predictably as some might suggest. Ford (2015) argues that this will happen rapidly because of Moore’s law. More directly relevant, Thompson et al. (2020) and Erdil and Besiroglu (2023) measure the annual decrease in the cost of computing on GPUs.
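The mechanics of the cost-decrease scenario for a single task can be sketched as follows. This is a simplified Python illustration, not the simulation behind Figure 10: it holds the labor cost fixed, applies a constant annual percentage decrease to the system cost, and reports the first year in which the labor cost strictly exceeds the system cost. The function name and the dollar amounts are hypothetical.

```python
from typing import Optional


def years_until_attractive(system_cost: float, labor_cost: float,
                           annual_decrease: float,
                           max_years: int = 100) -> Optional[int]:
    """First year t (t = 0 is today) at which
    labor_cost > system_cost * (1 - annual_decrease) ** t,
    or None if that never happens within max_years."""
    if not 0.0 < annual_decrease < 1.0:
        raise ValueError("annual_decrease must be strictly between 0 and 1")
    cost = system_cost
    for year in range(max_years + 1):
        if labor_cost > cost:
            return year
        cost *= 1.0 - annual_decrease  # compound the yearly cost decline
    return None


# Hypothetical task: system costs $1M today, labor is worth $400K.
for rate in (0.10, 0.20, 0.50):
    print(rate, years_until_attractive(1_000_000, 400_000, rate))
```

Because the cost falls geometrically, the waiting time grows with the logarithm of the ratio between system cost and labor cost, which is one way to see why exponential cost decreases translate into only gradual gains in the share of tasks that are attractive to automate.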
We use the more recent estimate of a 22% annual cost decrease from Hobbhahn and Besiroglu (2022). As described in Section 1, the costs of data and engineering must also be accounted for. These are likely to decrease but not as predictably. Data might become cheaper with increasing digitization, e.g., if the data needed is already collected and labelled for other purposes. Improved developer tools and the spread of machine learning engineering expertise might reduce staffing costs for the engineering team. Foundation models might improve, reducing the need for fine-tuning. There might be a paradigm shift in the way that we fine-tune models that will drastically reduce costs.[12] Importantly, the overall pace of improvement will likely be set by the bottleneck of all of these – meaning that the cost components that improve rapidly become less and less important to the pace of improvement.

Figure 10. Simulation results: Computer vision automation if AI system costs drop. Lines show 10%, 20%, and 50% annual system cost decreases. Horizontal axis: year in which tasks become economically feasible (2022-2042). Vertical axis: share of 2022 computer vision exposed compensation in U.S. businesses.

4. Discussion

4.1. Impacts on Worker Displacement

One of the most important AI policy discussions is how much worker displacement will occur, and therefore how much retraining, social support, or other intervention might be required. Acemoglu et al. (2022) show correlations between hiring plans in AI and elsewhere. They document a rapid takeoff in AI hiring attempts starting in 2010 and significantly accelerating around 2015-16.
They find that AI-exposed establishments reduce their non-AI and overall hiring at the establishment level.[13] Labor effects are uneven, with Grennan and Michaely (2020) showing that security analysts with high exposure to AI are more likely to leave the profession and that departing analysts leave for non-research jobs that require management and social skills.

According to data from the U.S. Census Bureau (2021), on average 11% of jobs in private sector establishments were destroyed annually between 2017 and 2019.[14] However, with substantial job creation, there was still a net gain of on average 1.6% over the period. Initially, we should expect a significant shock to the labor market, as 23% of vision task compensation has only recently become attractive to automate, and thus we expect automation attempts to be scaling up.

In subsequent years, once this initial wave of automation occurs, the incremental automation falls to well below this existing job destruction rate. Figure 11 shows that if there is a 50% annual computer vision cost decrease, and if we assume that all vision tasks for which machines gain economic advantage at the firm level do get automated the same year, the percentage of vision task compensation that is lost every year will be 6-8% in the peak years.

Figure 11. Simulation: Yearly human task destruction as a share of total vision task compensation. Lines show 10%, 20%, and 50% annual system cost decreases for 2023-2042, with the U.S. private sector annual job destruction rate for 2017-2019 shown for reference.

Similarly, Figure 12 shows that task automation from the platformization of AI will only peak above the overall job destruction rate briefly, if at all.

Figure 12. Simulation: Effect of AI-as-a-service platform growth on job destruction. Lines show 5%, 10%, and 20% compound annual growth rates for 2023-2042, with the U.S. private sector annual job destruction rate for 2017-2019 shown for reference.

Our results suggest that we should expect the effects of AI automation to be smaller than the existing job automation/destruction effects already seen in the economy. Whether adding AI automation to these existing effects will substantially increase overall job destruction is unclear. We would expect at least some increase, but we also find it likely that a substantial fraction of the AI task automation will happen in areas where traditional automation is occurring. Hence the two types will substitute for each other, at least in part, and the net effect will be less than the sum of each.

[12] One example of this would be training methods such as Andrew Ng’s Landing AI’s Visual Prompting tools (Dey 2023), but their performance in industrial applications is so far unknown to us.
[13] But, interestingly, not at the occupation or industry level.
[14] We purposefully excluded data from the pandemic since it caused extraordinary churn in the labor market.

4.2. Foundation Models and Automation outside Computer Vision

In this section, we discuss how our findings on computer vision fit into the larger landscape of AI, including foundation models and generative AI.
In particular, we consider how our modeling of computer vision automation can inform automation using other AI techniques, such as language modeling (e.g., ChatGPT). While there are important differences, we believe the economics of AI as described in this paper will be broadly applicable.

4.2.1. Foundation Models. Bommasani et al. (2021) define foundation models as deep learning models that are “trained on broad data at scale and are adaptable to a wide range of downstream tasks”. The existence of foundation models in no way undermines the method of this paper; in fact, the underlying economic model for predicting the cost of computer vision used in this paper presupposes that a foundation model exists to fine-tune (Thompson et al. 2024). Our cost estimates, although sometimes substantial, result from specializing pre-existing models to suit a specific task.

Foundation models could impact our results if, as they improve, tasks that workers are doing can increasingly be replaced by the foundation model without fine-tuning. This would reduce the costs of such implementations, making them more economically attractive. A smaller, but still relevant, effect could occur if better foundation models just reduce the amount of fine-tuning data needed. Hence, improvements in foundation models have a role in reducing costs as described in Section 3.2.

We find it unlikely that foundation models will be able to entirely displace specialized models for two reasons: data availability and slowing progress in foundation models. Because many human vision tasks are not monitored by cameras, and because the data from those tracked by cameras might be sensitive or proprietary, it is likely that data on many tasks will not be shared with the creators of foundation models. This lack of available data will limit the ability of computer vision systems to generalize. For example, consider one of the case studies done in preparation for this paper.
In that case, an industrial parts manufacturer wanted to use a computer vision system to identify which of its proprietary parts needed to be shipped, based on pictures of broken parts sent in by customers. It seems unlikely that foundation model providers would have enough data to correctly label each of that firm’s products (never mind those of all firms).

A second restriction on foundation models’ impact is that progress in improving them may slow. This slowing would arise because the cost of training foundation models grows extremely rapidly, with Thompson et al. (2020) finding that costs for vision models would escalate rapidly into billions of dollars to improve vision models by even small increments. We suspect that these escalations in cost will slow the progress of foundation models, and indeed there is already evidence of such slowing (Lohn 2023).

4.2.2. Generative Language Models

We believe that our economic model of AI adoption will still apply to the generative language setting, although there are a few important differences. First, language, to a much greater extent than vision, seems to generalize across contexts. One potential reason for this could be that the amount of language data available for the training of foundation models is more comprehensive than the image data available; another is that language models are much better than vision models at taking advantage of unlabeled data (LeCun and Misra 2021). However, much of this shared utility from language will still run into challenges because of specialized knowledge, so fine-tuning (e.g., to know about specific product information) will still be needed.
An important piece of future work will be to quantify the fraction of language tasks that do not require fine-tuning and thus can easily be automated.

Another important difference between vision and language automation is the cost of data. Firms often have substantial stores of text data, and text is often easier to gather than photos or video. For example, text from customer support chats, email exchanges, and internal knowledge hubs may make language model fine-tuning cheaper than that of image models.

4.3. Limitations

4.3.1. Automation versus Augmentation

In our paper, we consider the automation of tasks. This provides only a partial view of AI adoption, since AI can also be deployed to augment human labor or to make entirely new products. One survey reports that 83% of executives believe AI will augment human labor rather than automate it (IBM Institute for Business Value 2023, p.8). Bessen et al. (2018) found that only 50% of AI startups help customers reduce labor costs, whereas 98% build products to enhance capabilities. While augmentation is not addressed in this article, we are addressing it elsewhere. Our article similarly does not consider new tasks that are created as a consequence of the roll-out of AI, which could easily include tasks for AI itself but could also include other complementary tasks done by human workers.

4.3.2. Cost estimates

For our cost estimates, we rely heavily on the work by Thompson et al. (2024), which predicts the cost of developing a computer vision system ahead of time. However, as it stands, using their model for our purposes has two important limitations. The first is that the input data for their model does not contain any datapoints with accuracies higher than 95%. In other words, for around 40% of vision tasks, where higher accuracy is needed, we are extrapolating using their cost function. Since these extrapolations are power laws, they are sensitive to differences in the estimated coefficient.
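The sensitivity of a power-law extrapolation can be illustrated with a minimal sketch. This is not the cost model of Thompson et al. (2024): it assumes a hypothetical power law in which cost scales with the error rate raised to a negative exponent, anchored at the 95% accuracy boundary of the observed data. The function name, the anchor cost of 1.0, and the exponent values are all illustrative assumptions.

```python
def extrapolated_cost(target_accuracy, exponent, anchor_accuracy=0.95, anchor_cost=1.0):
    """Hypothetical power law (an assumption, not the paper's cost function):
    cost = anchor_cost * (anchor_error / target_error) ** exponent
    """
    if not 0.0 < target_accuracy < 1.0:
        raise ValueError("target_accuracy must be strictly between 0 and 1")
    anchor_error = 1.0 - anchor_accuracy
    target_error = 1.0 - target_accuracy
    return anchor_cost * (anchor_error / target_error) ** exponent


# Illustrative exponents only; they are not estimates from the paper.
for exponent in (2.0, 2.5, 3.0):
    for accuracy in (0.95, 0.99, 0.999):
        cost = extrapolated_cost(accuracy, exponent)
        print(f"exponent={exponent:.1f} accuracy={accuracy:.3f} relative cost={cost:,.0f}")
```

Under these assumed values, moving the exponent from 2 to 3 multiplies the extrapolated cost at 99% accuracy by about five while leaving the cost at the 95% anchor unchanged. The further a task lies beyond the observed accuracy range, the more the estimated coefficient matters.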
To address this, we consider several alternate scaling laws for these extrapolations (see Appendix B.1). These extrapolations can have more substantial effects on our estimates, underscoring the importance of future research on scaling laws that can provide more precise estimates.

The second limitation is that their analysis uses a limited number of models as sources for transfer learning. An implementation could potentially save resources by starting with a larger foundation model that starts with higher accuracy. One might imagine that a model that was pre-trained closer to the target domain might also be better, although they surprisingly find a limited impact of data distance. We are currently doing work to further test their assumptions.

4.3.3. Survey

We use an online survey to identify respondents with expertise in the selected vision tasks and collect information about the required accuracy and the cost per data point for each task. Although this enables us to account for the variation in different occupations and job tasks, this approach suffers from the limitation that respondents usually lack the knowledge of AI needed to give precisely the information required for our analysis. For example, accuracy is a rigorously defined concept for a vision classification task, which could be difficult for workers from regular occupations to understand. Respondents may only have a vague impression of how often mistakes are acceptable when performing the task. When reporting the error rate, they might mistakenly include errors from other tasks carried out simultaneously with the vision tasks we defined. To address these issues, we design survey questions in a way that is easy for respondents to understand and then infer the information we need for our research.
This is necessarily a compromise relative to the ideal case in which we could find experts who know both AI and the work details of each of the 400+ tasks.

4.3.4. Task data and equivalence T
