Chapter 8. Serverless Processing Systems

Summary

This chapter explores serverless computing, an approach to handling fluctuating application workloads. It emphasizes the benefits of pay-as-you-go billing, rapid scalability, and cost optimization offered by cloud platforms. The chapter also outlines different cloud architectures, including a comparison of serverless platforms with traditional infrastructure as a service (IaaS).

Full Transcript

**Chapter 8. Serverless Processing Systems**
============================================

Scalable systems experience widely varying patterns of usage. For some applications, load may be high during business hours and low or nonexistent during nonbusiness hours. Other applications, for example, an online concert ticket sales system, might have low background traffic 99% of the time. But when tickets for a major series of shows are released, the demand can spike by 10,000 times the average load for a number of hours before dropping back down to normal levels. Elastic load balancing, as described in [Chapter 5](https://learning.oreilly.com/library/view/foundations-of-scalable/9781098106058/ch05.html#application_services), is one approach for handling these spikes. Another is serverless computing, which I'll examine in this chapter.

**The Attractions of Serverless**
=================================

The transition of major organizational IT systems from on-premises to public cloud platform deployments seems inexorable. Organizations from startups to government agencies to multinationals see clouds as digital transformation platforms and a foundational technology to improve business continuity. Two of the great attractions of cloud platforms are their pay-as-you-go billing and ability to rapidly scale up (and down) virtual resources to meet fluctuating workloads and data volumes.

This ability to scale, of course, doesn't come for free. Your applications need to be architected to leverage the [scalable services](https://oreil.ly/lbMBp) provided by cloud platforms. And of course, as I discussed in [Chapter 1](https://learning.oreilly.com/library/view/foundations-of-scalable/9781098106058/ch01.html#introduction_to_scalable_systems), cost and scale are indelibly connected. The more resources a system utilizes for extended periods, the larger your cloud bills will be at the end of the month.

Monthly cloud bills can be big. Really big. Even worse, unexpectedly big! Cases of "sticker shock" for significant cloud overspend are rife---in one survey, 69% of respondents regularly [overspent their cloud budget by more than 25%](https://oreil.ly/har0N). In one well-known case, [$500K was spent on an Azure task before it was noticed](https://oreil.ly/flEER). Reasons attributed to overspending are many, including lack of deployment of autoscaling solutions, poor long-term capacity planning, and inadequate exploitation of cloud architectures leading to bloated system footprints.

On a cloud platform, architects are confronted with a myriad of architectural decisions. These decisions are both broad, in terms of the overall architectural patterns or styles the system adopts---for example, microservices, N-tier, event driven---and narrow, specific to individual components and the cloud services that the system is built upon. In this sense, architecturally significant decisions pervade all aspects of the system design and deployment on the cloud. And the collective consequences of these decisions are highly apparent when you receive your monthly cloud bill.

Traditionally, cloud applications have been deployed on an infrastructure as a service (IaaS) platform utilizing virtual machines (VMs). In this case, you pay for the resources you deploy regardless of how highly utilized they are. If load increases, elastic applications can spin up new virtual machines to increase capacity, typically using the cloud-provided load balancing service.
Your costs are essentially proportional to the type of VMs you choose, the duration they are deployed for, and the amount of data the application stores and transmits.

Major cloud providers offer an alternative to explicitly provisioning virtual processing resources. Known as *serverless* platforms, they do not require any compute resources to be statically provisioned. Using technologies such as AWS Lambda or Google App Engine (GAE), the application code is loaded and executed on demand, when requests arrive. If there are no active requests, there are essentially no resources in use and no charges to meet.

Serverless platforms also manage autoscaling (up and down) for you. As simultaneous requests arrive, additional processing capacity is created to handle requests and, ideally, provide consistently low response times. When request loads drop, additional processing capacity is decommissioned, and no charges are incurred.

Every serverless platform varies in the details of its implementation. For example, a limited number of mainstream programming languages and application server frameworks are typically supported. Platforms provide multiple configuration settings that can be used to balance performance, scalability, and costs. In general, costs are proportional to the following factors:

- The number of requests the application processes
- The processing resources (memory and CPU) allocated to handle requests
- The duration for which those resources are in use

However, the exact parameters used vary considerably across vendors. Every platform is proprietary and different in subtle ways. The devil lurks, as usual, in the details. So, let's explore some of those devilish details specifically for the GAE and AWS Lambda platforms.

**Google App Engine**
=====================

Google App Engine (GAE) was the first offering from Google as part of what is now known as the Google Cloud Platform (GCP). It has been in general release since 2011 and enables developers to upload and execute HTTP-based application services on Google's managed cloud infrastructure.

The Basics
----------

GAE supports developing applications in Go, Java, Python, Node.js, PHP, .NET, and Ruby. To build an application on GAE, developers can utilize common HTTP-based application frameworks that are built with the GAE runtime libraries provided by Google. For example, in Python, applications can utilize Flask, Django, and web2py, and in Java the primary supported platform is servlets built on the Jetty JEE web container.

Application execution is managed dynamically by GAE, which launches compute resources to match request demand levels. Applications generally access a managed persistent storage platform such as Google's [Firestore](https://oreil.ly/XWhwm) or [Google Cloud SQL](https://oreil.ly/7boAD), or interact with a messaging service like Google's [Cloud Pub/Sub](https://oreil.ly/T5Zn7).

GAE comes in two flavors, known as the standard environment and the flexible environment. The basic difference is that the standard environment is more closely managed by GAE, with development restrictions in terms of language versions supported. This tight management makes it possible to scale services rapidly in response to increased loads. In contrast, the flexible environment is essentially a tailored version of Google Compute Engine (GCE), which runs applications in [Docker containers](https://www.docker.com/) on VMs. As its name suggests, it gives more options in terms of development capabilities that can be used, but is not as suitable for rapid scaling. In the rest of this chapter, I'll focus on the highly scalable standard environment.
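
To make this concrete, here is a minimal sketch of the sort of Go HTTP service that GAE can load and execute on demand. The handler logic and messages are purely illustrative; the one GAE-specific detail is that the platform tells the service which port to listen on via the PORT environment variable.

```go
// main.go: a minimal HTTP service of the kind deployable to the GAE standard environment.
package main

import (
	"fmt"
	"log"
	"net/http"
	"os"
)

// handle responds to requests routed to this instance by GAE.
func handle(w http.ResponseWriter, r *http.Request) {
	fmt.Fprintln(w, "hello from a GAE standard instance")
}

func main() {
	http.HandleFunc("/", handle)

	// GAE injects the port to listen on via the PORT environment variable.
	port := os.Getenv("PORT")
	if port == "" {
		port = "8080" // sensible default for local testing
	}
	log.Fatal(http.ListenAndServe(":"+port, nil))
}
```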

GAE Standard Environment
------------------------

In the standard environment, developers upload their application code to a GAE project that is associated with a base project URL. This code must define HTTP endpoints that can be invoked by clients making requests to the URL. When a request is received, GAE will route it to a processing instance to execute the application code. These are known as resident instances for the application and are the major component of the cost incurred for utilizing GAE.

Each project configuration can specify a collection of parameters that control when GAE loads a new instance or invokes a resident instance. The two simplest settings control the minimum and maximum number of instances that GAE will have resident at any instant. The minimum can be zero, which is perfect for applications that have long periods of inactivity, as this incurs no costs. When a request arrives and there are no resident instances, GAE dynamically loads an application instance and invokes the processing for the endpoint. Multiple simultaneous requests can be sent to the same instance, up to some configured limit (more on this when I discuss autoscaling later in this chapter). GAE will then load additional instances on demand until the specified maximum instance value is reached. By setting the maximum, an application can put a lid on costs, albeit with the potential for increased latencies if load continues to grow.

As mentioned previously, standard environment applications can be built in Go, Java, Python, Node.js, PHP, and Ruby. As GAE itself is responsible for loading the runtime environment for an application, it restricts the [supported versions](https://oreil.ly/HEoR0) to a small number per programming language. The language used also affects the time to load a new instance on GAE. For example, a lightweight runtime environment such as Go will start on a new instance in less than a second. In comparison, a more bulky JVM takes on the order of 1--3 seconds on average. This load time is also influenced by the number of external libraries that the application incorporates. Hence, while there is variability across languages, loading new instances is relatively fast. Much faster than booting a virtual machine, anyway.

This makes the standard environment extremely well suited for applications that experience rapid spikes in load. GAE is able to quickly add new resident instances as request volumes increase. Requests are dynamically routed to instances based on load, and hence assume a purely stateless application model to support effective load distribution. Subsequently, instances are released with little delay once the load drops, again reducing costs. GAE's standard environment is an extremely powerful platform for scalable applications, and one I'll explore in more detail in the case study later in this chapter.

Autoscaling
-----------

Autoscaling is an option that you specify in an app.yaml file that is passed to GAE when you upload your server code. An autoscaled application is managed by GAE according to a collection of default parameter values, which you can override in your app.yaml. The basic scheme is shown in [Figure 8-1](https://learning.oreilly.com/library/view/foundations-of-scalable/9781098106058/ch08.html#gae_autoscaling).

###### **Figure 8-1. GAE autoscaling**

GAE basically manages the number of deployed processing instances for an application based on incoming traffic load. If there are no incoming requests, then GAE will not schedule any instances.
When a request arrives, GAE deploys an instance to process the request. Deploying an instance can take anywhere from a few hundred milliseconds to a few seconds, [depending on the programming language you are using](https://oreil.ly/VLKTO). This means latency can be high for initial requests if there are no resident instances. To mitigate these instance loading latency effects, you can specify a minimum number of instances to keep available for processing requests. This, of course, costs money.

As the request load grows, the GAE scheduler will dynamically load more instances to handle requests. Three parameters control precisely how scaling operates, namely:

- max\_concurrent\_requests, the number of concurrent requests an instance can accept before the scheduler spawns a new instance
- target\_throughput\_utilization, the fraction of max\_concurrent\_requests that must be in progress before the scheduler tries to start a new instance
- target\_cpu\_utilization, the CPU utilization threshold for an instance above which the scheduler tries to start a new instance

Got that? As is hopefully apparent, these three settings interact with each other, making configuration somewhat complex. By default, an instance will handle 10 × 0.6 = 6 concurrent requests before a new instance is created. And if these 6 (or fewer) requests cause the CPU utilization for an instance to go over 60%, the scheduler will also try to create a new instance.

But wait, there's more! You can also specify values to control when GAE adds new instances based on the time requests spend in the request pending queue (see [Figure 8-1](https://learning.oreilly.com/library/view/foundations-of-scalable/9781098106058/ch08.html#gae_autoscaling)) waiting to be dispatched to an instance for processing. The max-pending-latency parameter specifies the maximum amount of time that GAE should allow a request to wait in the pending queue before starting additional instances to handle requests and reduce latency. The default value is 30 ms. The lower the value, the quicker an application will scale. And the more it will probably cost you.[**1**](https://learning.oreilly.com/library/view/foundations-of-scalable/9781098106058/ch08.html#ch01fn24)

These autoscaling parameter settings give us the ability to fine-tune a service's behavior to balance performance and cost. How modifying these parameters will affect an application's behavior is, of course, dependent on the precise functionality of the service. The fact that there are subtle interplays between these parameters makes this tuning exercise somewhat complicated, however. I'll return to this topic in the case study section later in this chapter, and explain a simple, platform-agnostic approach you can take to service tuning.

**Case Study: Balancing Throughput and Costs**
==============================================

Getting the required performance and scalability at the lowest cost from a serverless platform almost always requires tweaking of the runtime parameter settings. When your application is potentially processing many millions of requests per day, even a 10% cost reduction can result in significant monetary savings. Certainly enough to make your boss and clients happy.

All serverless platforms vary in the parameter settings you can tune. Some are relatively straightforward, such as AWS Lambda, in which choosing the amount of memory for a function is the dominant tuning parameter. The other extreme is perhaps Azure Functions, which has multiple parameter settings and deployment limits that differ based on which of three hosting plans is selected.[**9**](https://learning.oreilly.com/library/view/foundations-of-scalable/9781098106058/ch08.html#ch01fn32) GAE sits between these two, with a handful of parameters that govern autoscaling behavior. I'll use this as an example of how to approach application tuning.
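
To show where these knobs live before we start tuning them, here is a sketch of an app.yaml using the autoscaling settings discussed above. The runtime version (go121) and the max_instances cap are illustrative assumptions; the remaining values are the defaults quoted earlier.

```yaml
# app.yaml: sketch of GAE standard environment autoscaling settings.
runtime: go121                    # illustrative runtime choice

automatic_scaling:
  min_instances: 0                # scale to zero when there is no traffic
  max_instances: 20               # illustrative cap on resident instances to bound costs
  target_cpu_utilization: 0.6     # default
  target_throughput_utilization: 0.6   # default
  max_concurrent_requests: 10     # default
  max_pending_latency: 30ms       # start new instances if requests wait longer than this
```

Uploading this file alongside the service code (for example, with gcloud app deploy) is what hands these settings to the GAE scheduler.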

Choosing Parameter Values
-------------------------

There are three main parameters that govern how GAE autoscales an application, as I explained earlier in this chapter. [Table 8-1](https://learning.oreilly.com/library/view/foundations-of-scalable/9781098106058/ch08.html#gae_auto_scaling_parameters) lists these parameters along with their possible value ranges.

| **Parameter name**              | **Minimum** | **Maximum** | **Default** |
|---------------------------------|-------------|-------------|-------------|
| target\_throughput\_utilization | 0.5         | 0.95        | 0.6         |
| target\_cpu\_utilization        | 0.5         | 0.95        | 0.6         |
| max\_concurrent\_requests       | 1           | 80          | 10          |

Table 8-1. GAE autoscaling parameters

Given these ranges, the question for a software architect is, simply, how do you choose the parameter values that provide the required performance and scalability at the lowest cost? Probably the hardest part is figuring out where to start. Even with three parameters, there is a large combination of possible settings that, potentially, interact with each other. How do you know that you have parameter settings that are serving both your users and your budgets as close to optimal as possible? There's some good [general advice available](https://oreil.ly/W2pJl), but you are still left with the problem of choosing parameter values for your application.

For just the three parameters listed in [Table 8-1](https://learning.oreilly.com/library/view/foundations-of-scalable/9781098106058/ch08.html#gae_auto_scaling_parameters), there are approximately 170K different configurations. You can't test all of them. If you put your engineering hat on, and just consider values in increments of 0.05 for throughput and CPU utilization, and increments of 10 for maximum concurrent requests, you still end up with around 648 possible configurations. That is totally impractical to explore, especially as we really don't know a priori how sensitive our service behavior is going to be to any parameter value setting. So, what can you do?

One way to approach tuning a system is to undertake a [parameter study](https://oreil.ly/l6his). Also known as a parametric study, the approach comprises three basic steps:

- Select the parameters to study and the discrete values of each that you want to evaluate
- Run tests for each combination of parameter values
- Analyze the results to understand how the parameter settings affect the behavior you are interested in

To illustrate this approach, I'll lead you through an example based on the three parameters in [Table 8-1](https://learning.oreilly.com/library/view/foundations-of-scalable/9781098106058/ch08.html#gae_auto_scaling_parameters). The aim is to find the parameter settings that give ideally the highest throughput at the lowest cost. The application under test was a GAE Go service that performs reads and writes to a Google Firestore database. The application logic was straightforward, basically performing three steps:

-
-
-

The ratio of write to read requests was 80% to 20%, thus defining a write-heavy workload. I also used a load tester that generated an uninterrupted stream of requests from 512 concurrent client threads at peak load, with short warm-up and cooldown phases of 128 client threads.

GAE Autoscaling Parameter Study Design
--------------------------------------

For a well-defined parameter study, you need to:

- Choose the parameters you are going to vary
- Choose the set of values for each parameter that you will test

For the example Go application with simple business logic and database access, intuition seems to point to the default GAE CPU utilization and concurrent request settings being on the low side.
Therefore, I chose these two parameters to vary, with the following values:

- target\_cpu\_utilization: 0.6, 0.7, 0.8
- max\_concurrent\_requests: 10, 35, 60, 80

This defines 12 different application configurations, as shown by the entries in [Table 8-2](https://learning.oreilly.com/library/view/foundations-of-scalable/9781098106058/ch08.html#parameter_study_selected_values).

| cpu\_utilization | max\_concurrent\_requests |    |    |    |
|------------------|----|----|----|----|
| **0.6**          | 10 | 35 | 60 | 80 |
| **0.7**          | 10 | 35 | 60 | 80 |
| **0.8**          | 10 | 35 | 60 | 80 |

Table 8-2. Parameter study selected values

The next step is to run load tests on each of the 12 configurations. This was straightforward and took a few hours over two days. Your load-testing tool will capture various test statistics. In this example, you are most interested in the overall average throughput obtained and the cost of executing each test. The latter should be straightforward to obtain from the serverless monitoring tools available. Now, I'll move on to the really interesting part---the results.

Results
-------

[Table 8-3](https://learning.oreilly.com/library/view/foundations-of-scalable/9781098106058/ch08.html#mean_throughput_for_each_test_configura) shows the mean throughput for each test configuration. The highest throughput of 6,178 requests per second is provided by the {CPU80, max10} configuration. This value is 1.7% higher than that provided by the default settings {CPU60, max10}, and around 9% higher than the lowest throughput of 5,605 requests per second. So the results show a roughly 10% variation from lowest to highest throughput. Same code. Same request load. Different configuration parameters.

| **Throughput (requests/second)** | **max10** | **max35** | **max60** | **max80** |
|----------------------------------|-----------|-----------|-----------|-----------|
| CPU60                            | 6,006     | 6,067     | 5,860     | 5,636     |
| CPU70                            | 6,064     | 6,121     | 5,993     | 5,793     |
| CPU80                            | 6,178     | 5,988     | 5,989     | 5,605     |

Table 8-3. Mean throughput for each test configuration

Now I'll factor in cost. In [Table 8-4](https://learning.oreilly.com/library/view/foundations-of-scalable/9781098106058/ch08.html#mean_cost_for_each_test_configuration_n), I've normalized the cost for each test run by the cost of the default GAE configuration {CPU60, max10}. So, for example, the cost of the {CPU70, max10} configuration was 18% higher than the default, and the cost of the {CPU80, max80} configuration was 45% lower than the default.

| **Normalized instance hours** | **max10** | **max35** | **max60** | **max80** |
|-------------------------------|-----------|-----------|-----------|-----------|
| CPU60                         | 100%      | 72%       | 63%       | 63%       |
| CPU70                         | 118%      | 82%       | 63%       | 55%       |
| CPU80                         | 100%      | 72%       | 82%       | 55%       |

Table 8-4. Mean cost for each test configuration normalized to default configuration cost

There are several rather interesting observations we can make from these results:

- Throughput varies by only around 10% across the 12 configurations, whereas cost ranges from 45% below to 18% above the default configuration.
- The highest-throughput configuration, {CPU80, max10}, costs the same as the default configuration.
- Increasing max\_concurrent\_requests generally reduces cost substantially, as each instance handles more simultaneous requests and fewer instances are needed, at the price of a modest drop in throughput.
- The cheapest configurations, {CPU70, max80} and {CPU80, max80}, cost 45% less than the default while delivering throughput within roughly 4% and 7% of it, respectively.

Armed with this information, you can choose the configuration settings that best balance your costs and performance needs. With multiple, dependent configuration parameters, you are unlikely to find the "best" setting through intuition and expertise. There are too many intertwined factors at play for that to happen.

Parameter studies let you quickly and rigorously explore a range of parameter settings. With two or three parameters and three or four values for each, you can explore the parameter space quickly and cheaply. This enables you to see the effects of the combinations of values and make educated decisions on how to deploy your application.
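
As a final illustration, one simple way to combine Tables 8-3 and 8-4 when making this choice is to rank each configuration by the throughput it delivers per unit of normalized cost. The short Go sketch below does exactly that for the measurements above; the ranking metric is just one possible way to weigh throughput against cost.

```go
// rank.go: rank the parameter study configurations from Tables 8-3 and 8-4
// by throughput per unit of normalized cost.
package main

import (
	"fmt"
	"sort"
)

type config struct {
	name       string
	throughput float64 // mean requests/second (Table 8-3)
	cost       float64 // instance hours normalized to the default config (Table 8-4)
}

func main() {
	configs := []config{
		{"CPU60/max10", 6006, 1.00}, {"CPU60/max35", 6067, 0.72},
		{"CPU60/max60", 5860, 0.63}, {"CPU60/max80", 5636, 0.63},
		{"CPU70/max10", 6064, 1.18}, {"CPU70/max35", 6121, 0.82},
		{"CPU70/max60", 5993, 0.63}, {"CPU70/max80", 5793, 0.55},
		{"CPU80/max10", 6178, 1.00}, {"CPU80/max35", 5988, 0.72},
		{"CPU80/max60", 5989, 0.82}, {"CPU80/max80", 5605, 0.55},
	}

	// Sort configurations by throughput delivered per unit of normalized cost, best first.
	sort.Slice(configs, func(i, j int) bool {
		return configs[i].throughput/configs[i].cost > configs[j].throughput/configs[j].cost
	})

	for _, c := range configs {
		fmt.Printf("%-12s %6.0f req/s at %4.0f%% of default cost -> %6.0f req/s per cost unit\n",
			c.name, c.throughput, c.cost*100, c.throughput/c.cost)
	}
}
```

On the data above, this ranking favors the {CPU70, max80} and {CPU80, max80} configurations, which is consistent with the observations listed earlier.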
