Chapter 10: Warehouse-scale Computers

Overview

Concepts

"Anyone can build a fast CPU. The trick is to build a fast system."
–Seymour Cray, considered the father of the supercomputer.

The warehouse-scale computer (WSC) is the foundation of the Internet services that billions of people use every day around the globe. A WSC acts as one giant machine. It costs hundreds of millions of dollars for the building, the electrical and cooling infrastructure, the servers, and the networking equipment; together, the building and the network house and connect around 50,000 to 100,000 servers. The very rapid growth of commercial cloud computing has made WSCs accessible to anyone with a credit card. Nowadays, the target is providing information technology for the world, instead of high-performance computing (HPC) only for scientists and engineers.

Main Goals

The main goals of a WSC are cost-performance, energy efficiency, and dependability.

For cost-performance, the work done per dollar is critical: because the scale of a WSC is so large, reducing even small per-unit costs can save millions of dollars.

For energy efficiency, it has to be taken into account that much of the energy delivered to the devices is turned into heat, so the concern involves both power delivery and cooling. The work done per joule is critical.

Dependability is achieved mainly via redundancy. A WSC needs 99.99% availability, the so-called "four nines", i.e., downtime of less than one hour per year (a short sketch after Fig. 10.1 works out this downtime budget). Software redundancy plays an important role here, along with hardware redundancy. Regarding network I/O, data have to be kept consistent between multiple WSCs, as well as served to the public.

A WSC also has to handle both interactive and batch-processing workloads. Interactive workloads are, for example, searches and social networking. Massively parallel batch programs, e.g., computing lots of metadata useful to such services, are the other class.

Main Requirements

In terms of requirements, a WSC involves ample parallelism, operational costs, location, and the trade-offs inherent to its large scale.

Regarding ample parallelism, data-level parallelism is crucial, e.g., in web crawlers. There are also Internet service applications, also known as software as a service (SaaS). Another form of parallelism considered here is the so-called "easy" parallelism, namely request-level parallelism (RLP): many independent requests proceed in parallel with little need for communication and synchronization. Some searches, for example, use just a read-only index.

Operational costs play an important role in this scenario. Energy, power distribution, and cooling represent more than 30% of the costs over a 10-year period.

The location of a WSC is relevant. Inexpensive electricity, proximity to Internet backbone optical fibers, and nearby human resources to work with help decide whether to build a WSC in a given place.

Trade-offs of scale are inherent to WSCs. Purchasing really big quantities at a time brings smaller unit costs, but the large scale also means lower dependability, i.e., higher failure rates.

Nodes, Racks and Switches

Fig. 10.1 illustrates the general concept of a WSC: each rack of servers has an Ethernet switch on top of it, and many racks connected through an array switch compose the WSC.

Figure 10.1: Switch hierarchy in a WSC.
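To make the availability goal under Main Goals concrete, here is a minimal Python sketch, not part of the original notes, that converts an availability target into a yearly downtime budget; the 99.99% figure comes from the text above, while the function name and structure are purely illustrative.

    # Illustrative only; assumes a 365-day year.
    def yearly_downtime_minutes(availability: float) -> float:
        """Allowed downtime per year, in minutes, for a given availability."""
        minutes_per_year = 365 * 24 * 60
        return (1.0 - availability) * minutes_per_year

    if __name__ == "__main__":
        # "Four nines": about 52.6 minutes per year, i.e., less than one hour.
        print(f"{yearly_downtime_minutes(0.9999):.1f} minutes of downtime per year")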
Examples

Workload

MapReduce is considered a popular framework for batch processing in WSCs; Hadoop is an open-source alternative to it. The map part applies a programmer-supplied function to each input, running on hundreds of computers to produce an intermediate result of key-value pairs. The reduce part then collects the output of those distributed tasks and collapses it by using another programmer-defined function. Fig. 10.2 gives a picture of how heavily MapReduce has been used at Google over the years.

Figure 10.2: MapReduce usage at Google from 2004 to 2016. PB stands for petabytes.

Next, a code fragment for MapReduce's words-and-documents indexing example is shown in Listing 10.1.

Listing 10.1: MapReduce's words-and-documents indexing code snippet.

    map(String key, String value):
        // key: document name
        // value: document contents
        for each word w in value:
            EmitIntermediate(w, "1");   // produce a list of all words

    reduce(String key, Iterator values):
        // key: a word
        // values: a list of counts
        int result = 0;
        for each v in values:
            result += ParseInt(v);      // get integer from key-value pair
        Emit(AsString(result));

Memory Hierarchy

In this memory-hierarchy example, each computing node contains:

- 16 GiB of DRAM;
- 128 GiB of Flash;
- 2,048 GiB of disk; and
- a 1 Gbit/s Ethernet port.

A rack holds 80 nodes (×80), and an array holds 30 racks (×30). Fig. 10.1 illustrates such an arrangement.

Table 10.1 shows the latency numbers for this memory hierarchy from the local node, passing through the rack, and finally reaching the array.

Table 10.1: Latency in µs. The networking software and switch overhead increase the latency within the rack, and the array switch hardware/software increases it further. Numbers are given for DRAM, Flash, and disk.

              Local node      Rack      Array
    DRAM            0.1        300        500
    Flash         100          400        600
    Disk       10,000       11,000     12,000

Table 10.2 gives the corresponding impact on bandwidth. DRAM and disk numbers are available only for the local node.

Table 10.2: Bandwidth in MiB/s. The 1 Gbit/s Ethernet switch limits the remote bandwidth within the rack, and the bandwidth of the array switch limits it further. Complete numbers were computed only for the Flash memory.

              Local node      Rack      Array
    DRAM         20,000         -          -
    Flash         1,000       100         10
    Disk            200         -          -

The bottom line, when comparing the numbers from the local node up to the array, is that the network overhead considerably increases latency (see the DRAM numbers, for instance), and the network collapses the differences in bandwidth.

Networking Hierarchy

Some WSCs need more than one array, and in that case there is one more level in the networking hierarchy, as shown in Fig. 10.3.

Figure 10.3: Regular "Layer 3" routers connecting the arrays together and also to the Internet. The core router operates in the Internet backbone.

Power Utilization Effectiveness - PUE

The power utilization effectiveness (PUE) is a metric to evaluate the efficiency of a WSC, as given in Eq. (10.1):

    PUE = total facility power consumption / IT equipment power consumption    (10.1)

where PUE must be ≥ 1. The bigger the PUE, the less efficient the WSC: a large PUE figure means that much of the power is being drained by places other than the information technology (IT) equipment, i.e., it is overhead.
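To make Eq. (10.1) concrete, here is a minimal Python sketch, not from the original notes; the 1.69 median PUE and the 0.55 cooling share discussed in the next paragraph come from the text, while the remaining split of the overhead is made up purely for illustration.

    def pue(total_facility_power_w: float, it_equipment_power_w: float) -> float:
        """Power utilization effectiveness: total facility power over IT power."""
        if it_equipment_power_w <= 0:
            raise ValueError("IT equipment power must be positive")
        return total_facility_power_w / it_equipment_power_w

    # Hypothetical facility, normalized to 1.0 MW of IT load:
    # 0.55 MW for cooling plus 0.14 MW for power distribution and other overhead.
    print(pue(1_690_000, 1_000_000))   # 1.69, matching the median reported for Fig. 10.4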
When computing the PUE, the power for air conditioning (AC) and other uses, such as power distribution, is normalized to the power for the IT equipment, so the IT equipment power is 1.0 by definition. The AC power varies from about 0.30 to 1.40 times the IT equipment power, and the power for other equipment varies from about 0.05 to 0.60 of the IT power. Considering the values in Fig. 10.4, the median PUE is 1.69, and cooling the infrastructure uses more than half as much power as the servers, i.e., on average 0.55 of the 1.69 goes to cooling. Fig. 10.5 shows the average PUE of 15 Google centers.

Figure 10.4: Power utilization efficiency of 19 data centers as of 2006, ordered from the most to the least efficient.

Figure 10.5: Average PUE of 15 Google WSCs, from 2008 to 2017. The spiking line is the quarterly average PUE, and the straighter line is the trailing 12-month average PUE; the averages were 1.11 and 1.12, respectively.

A Server from a Google WSC

Fig. 10.6 illustrates an example of a Google server.

Figure 10.6: Intel Haswell 2.3 GHz CPUs with 2 sockets × 18 cores × 2 threads, giving 72 "virtual cores" per machine; 2.5 MiB of last-level cache per core; 256 GB of DDR3-1600 DRAM; 2 × 8 TB SATA disk drives, or 1 TB SSD; and a 10 Gbit/s Ethernet link.

Why Use Regular Machines

Table 10.3 gives a brief comparison between an Itanium2 machine and a regular HP computer. The numbers show that the regular machine pays off.

Table 10.3: A comparison between two machines, one Itanium2 and one regular. The TPC-C benchmark is measured in transactions per minute (tpmC), as defined at www.tpc.org.

                                 HP Integrity Superdome (Itanium2)        HP ProLiant ML350 G5
    Processor                    64 sockets; 128 cores, dual-threaded;    1 socket, quad-core; 2.66 GHz
                                 1.6 GHz Itanium2; 12 MB                  X5355 CPU; 8 MB last-level
                                 last-level cache                         cache
    Memory                       2,048 GB                                 24 GB
    Disk storage                 320,974 GB with 7,056 drives             3,961 GB with 105 drives
    TPC-C price/performance      $2.93/tpmC                               $0.73/tpmC
    Price/performance            $1.28/tpm                                $0.10/tpm
    (server HW only)

Intel officially announced the end of life and product discontinuance of the Itanium CPU family on January 30, 2019.

Seeks and Scans

To get a feel for seek and scan numbers, assume a 1 TB database with 100-byte records and the task of updating 1% of them. In the first scenario, with random access, assume each update takes approximately 30 ms to seek, read, and write. That yields 10^8 updates and around 35 days to complete the task. In the second scenario, rewriting all records sequentially at an assumed 100 MiB/s throughput, the job takes around 5.6 hours instead of days. The lesson is to avoid random seeks whenever possible (a short sketch of this calculation appears after Table 10.4).

Numbers to Think About

The numbers in Table 10.4 give an interesting perspective of time. They are called the "Numbers Everyone Should Know".

Table 10.4: Approximate numbers, in nanoseconds. CA stands for California.

    L1 cache reference (double-word)                    0.5
    Branch mispredict                                     5
    L2 cache reference (double-word)                      7
    Main memory reference (double-word)                 100
    Send 2 KiB over a 1 Gbit/s network               20,000
    Read 1 MiB sequentially from memory             250,000
    Round trip within the same data center          500,000
    Disk seek                                    10,000,000
    Read 1 MiB sequentially from disk            20,000,000
    Send a packet CA → Netherlands → CA         150,000,000

The 20,000 ns figure for the network row follows from (2 KiB × 8)/1 Gbit/s + 112.5 ns × (2 KiB/64), i.e., roughly 16,384 ns of wire time plus about 3,600 ns of per-64-byte overhead.
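The following Python sketch, not part of the original notes, reproduces the seeks-and-scans estimate from the section above; the database size, record size, 30 ms per random update, and roughly 100 MB/s sequential throughput are the assumptions stated in the text (decimal units are used so the round numbers come out, and the sequential case is treated as reading and rewriting the full database).

    # Assumptions taken from the text; decimal units keep the arithmetic round.
    DB_SIZE_BYTES = 10**12         # 1 TB database
    RECORD_SIZE_BYTES = 100        # 100-byte records
    UPDATE_FRACTION = 0.01         # update 1% of the records
    SEEK_READ_WRITE_S = 0.030      # ~30 ms per random update
    SEQ_THROUGHPUT_BPS = 100e6     # ~100 MB/s sequential read/write

    records = DB_SIZE_BYTES // RECORD_SIZE_BYTES        # 10**10 records
    updates = int(records * UPDATE_FRACTION)            # 10**8 updates

    random_days = updates * SEEK_READ_WRITE_S / 86_400                 # ~35 days
    sequential_hours = 2 * DB_SIZE_BYTES / SEQ_THROUGHPUT_BPS / 3_600  # ~5.6 hours

    print(f"random access: ~{random_days:.0f} days, full rewrite: ~{sequential_hours:.1f} hours")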
Cloud Computing

"If computers of the kind I have advocated become the computers of the future, then computing may someday be organized as a public utility just as the telephone system is a public utility... The computer utility could become the basis of a new and important industry."
–John McCarthy, MIT centennial celebration, 1961.

Back in 1961, McCarthy was thinking about timesharing. To meet the demand of an ever-growing number of users, the big Internet players, such as Amazon, Google, and Microsoft, build very large warehouse-scale computers from commodity components, and McCarthy's prediction has eventually come true.
