
Educative - System Design Interview Prep


Elementary Design Problems

Domain Name System (DNS)
Using the phone book analogy to understand the domain name system (DNS)

What is the DNS?
The domain name system (DNS) is the Internet's naming service that maps human-friendly domain names to machine-readable IP addresses. The service of the DNS is transparent to users. When a user enters a domain name in the browser, the browser has to translate the domain name to an IP address by asking the DNS infrastructure. Once the desired IP address is obtained, the user's request is forwarded to the destination web server.

Name servers: It's important to understand that the DNS isn't a single server. It's a complete infrastructure with numerous servers. DNS servers that respond to user queries are called name servers.

Resource records: The DNS database stores domain-name-to-IP-address mappings in the form of resource records (RRs). The RR is the smallest unit of information that users request from the name servers. There are different types of RRs. The three important pieces of information in an RR are type, name, and value, and the name and value change depending upon the type of the RR. (Table: Common Types of Resource Records)

Caching: The DNS uses caching at different layers to reduce request latency for the user. Caching plays an important role in reducing the burden on the DNS infrastructure because it has to cater to the queries of the entire Internet.

Hierarchy: DNS name servers are organized in a hierarchy. The hierarchical structure allows the DNS to remain highly scalable despite its increasing size and query load. In the next section, we'll look at how a tree-like structure is used to manage the entire DNS database.

DNS hierarchy
As stated before, the DNS isn't a single server that accepts requests and responds to user queries. It's a complete infrastructure with name servers at different hierarchies. There are four main types of servers in the DNS hierarchy:

DNS resolver: Resolvers initiate the querying sequence and forward requests to the other DNS name servers. Typically, DNS resolvers lie within the premises of the user's network. However, DNS resolvers can also serve users' DNS queries directly from their cache, as we will see shortly. These servers can also be called local or default servers.

Root-level name servers: These servers receive requests from local servers. Root name servers maintain name servers based on top-level domain names, such as .com, .edu, .us, and so on. For instance, when a user requests the IP address of educative.io, root-level name servers will return a list of top-level domain (TLD) servers that hold the IP addresses of the .io domain.

Top-level domain (TLD) name servers: These servers hold the IP addresses of authoritative name servers. The querying party will get a list of IP addresses that belong to the authoritative servers of the organization.

Authoritative name servers: These are the organization's DNS name servers that provide the IP addresses of the web or application servers.

How are DNS names processed? For example, will educative.io be processed from left to right or right to left? Unlike UNIX files, which are processed from left to right, DNS names are processed from right to left. In the case of educative.io, the resolvers will first resolve the .io part, then educative, and so on. Visually, however, the DNS hierarchy can be viewed as a tree.
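To make the right-to-left processing concrete, here is a small sketch that walks a toy name-server hierarchy for educative.io, resolving the .io label first and then the educative label. The hierarchy dictionary, server names, and the address in it are invented for illustration; they are not real DNS data.

```python
# Minimal sketch: resolving a domain name right to left through a toy DNS hierarchy.
# The servers and the final address below are made up (192.0.2.10 is a documentation address).

TOY_HIERARCHY = {
    ".": {"io": "tld-ns.example"},            # the root knows the .io TLD name server
    "io": {"educative": "auth-ns.example"},   # the .io TLD knows educative.io's authoritative server
    "educative.io": {"A": "192.0.2.10"},      # the authoritative server holds the A record
}

def resolve_right_to_left(domain: str) -> str:
    labels = domain.split(".")                     # ["educative", "io"]
    tld = labels[-1]                               # the rightmost label is processed first
    tld_server = TOY_HIERARCHY["."][tld]           # step 1: ask the root for the TLD name server
    auth_server = TOY_HIERARCHY[tld][labels[-2]]   # step 2: ask the TLD server for the authoritative server
    ip = TOY_HIERARCHY[domain]["A"]                # step 3: ask the authoritative server for the A record
    print(f"root -> {tld_server} -> {auth_server} -> {ip}")
    return ip

resolve_right_to_left("educative.io")
```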
Iterative versus recursive query resolution There are two ways to perform a DNS query: Iterative: The local server requests the root, TLD, and the authoritative servers for the IP address. Recursive: The end user requests the local server. The local server further requests the root DNS name servers. The root name servers forward the requests to other name servers. Note: An iterative query is typically preferred to reduce query load on the DNS infrastructure. These days, we’ll find many third-party public DNS resolvers offered by Google, Cloudflare, OpenDNS, and many more. The interesting fact is that these public DNS servers may provide quicker responses than the local ISP DNS facilities. Caching Caching refers to the temporary storage of frequently requested resource records. A record is a data unit within the DNS database that shows a name-to-value binding. Caching reduces response time to the user and decreases network traffic. When we use caching at different hierarchies, it can reduce a lot of querying burden on the DNS infrastructure. Caching can be implemented in the browser, operating systems, local name server within the user’s network, or the ISP’s DNS resolvers. Note: Even if there is no cache available to resolve a user’s query and it’s imperative to visit the DNS infrastructure, caching can still be beneficial. The local server or ISP DNS resolver can cache the IP addresses of TLD servers or authoritative servers and avoid requesting the root-level server. DNS as a distributed system Although the DNS hierarchy facilitates the distributed Internet that we know today, it’s a distributed system itself. The distributed nature of DNS has the following advantages: It avoids becoming a single point of failure (SPOF). It achieves low query latency so users can get responses from nearby servers. It gets a higher degree of flexibility during maintenance and updates or upgrades. For example, if one DNS server is down or overburdened, another DNS server can respond to user queries. There are 13 logical root name servers (named letter “A” through “M”) with many instances spread throughout the globe. These servers are managed by 12 different organizations. Highly scalable Due to its hierarchical nature, the DNS is a highly scalable system. Roughly 1,000 replicated instances of 13 root-level servers are strategically spread throughout the world to handle user queries. The working labor is divided among TLD and root servers to handle a query and the authoritative servers that are managed by the organizations themselves to make the entire system work. As shown in the DNS hierarchy tree above, different services handle different portions of the tree that enable the scalability and manageability of the system. To maintain high availability, should the time-to-live (TTL) value be large or small? To maintain high availability, the TTL value should be small. This is because if any server or cluster fails, the organization can update the resource records right away. Users will experience nonavailability only for the time the TTL isn’t expired. However, if the TTL is large, the organization will update its resource records, whereas users will keep pinging the outdated server that would have crashed long ago. Companies that aim for high availability maintain a TTL value as low as 120 seconds. Therefore, even in case of a failure, the maximum downtime is a few minutes. 
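As a rough sketch of how TTL-bound caching behaves, the snippet below implements a tiny resolver cache in which each record expires after its TTL. The record and the 120-second TTL are hypothetical values chosen to mirror the discussion above.

```python
import time

# Minimal sketch of a TTL-bound DNS cache; the record and TTL values are hypothetical.
class TTLCache:
    def __init__(self):
        self._store = {}  # name -> (ip, expires_at)

    def put(self, name, ip, ttl_seconds):
        self._store[name] = (ip, time.time() + ttl_seconds)

    def get(self, name):
        entry = self._store.get(name)
        if entry is None:
            return None                  # cache miss: the query goes to the DNS infrastructure
        ip, expires_at = entry
        if time.time() > expires_at:
            del self._store[name]        # expired: the stale mapping is dropped
            return None
        return ip

cache = TTLCache()
cache.put("educative.io", "192.0.2.10", ttl_seconds=120)   # a low TTL favors availability
print(cache.get("educative.io"))   # served from cache until the 120-second TTL expires
```

A small TTL bounds how long clients keep using an outdated address after a failure, at the cost of more queries reaching the DNS infrastructure.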
Reliable There are three main reasons that the DNS is a reliable system: Caching: The caching is done in the browser, the operating system, and the local name server, and the ISP DNS resolvers also maintain a rich cache of frequently visited services. Even if some DNS servers are temporarily down, cached records can be served to make the DNS a reliable system. Server replication: The DNS has replicated copies of each logical server spread systematically across the globe to entertain user requests at low latency. The redundant servers improve the reliability of the overall system. Protocol: Although many clients use the DNS over unreliable user datagram protocol (UDP), UDP has its advantages. UDP is much faster and therefore improves DNS performance. Furthermore, Internet service reliability has improved since its inception, so UDP is usually favored over TCP. A DNS resolver can resend the UDP request if it didn’t get a reply to a previous one. This request-response needs just one round trip, which provides a shorter delay as compared to TCP, which needs a three-way handshake before data exchange. What happens if a network is congested? Should the DNS continue using UDP? Typically, the DNS uses UDP. However, the DNS can use TCP when its message size exceeds the original packet size of 512 bytes. This is because large-size packets are more prone to be damaged in congested networks. The DNS always uses TCP for zone transfers. Some clients prefer the DNS over TCP to employ transport layer security for privacy reasons. Consistent DNS uses various protocols to update and transfer information among replicated servers in a hierarchy. The DNS compromises on strong consistency to achieve high performance because data is read frequently from DNS databases as compared to writing. However, the DNS provides eventual consistency and lazily updates records on replicated servers. Typically, it can take from a few seconds up to three days to update records on the DNS servers across the Internet. The time it takes to propagate information among different DNS clusters depends on the DNS infrastructure, the size of the update, and the part of the DNS tree that is being updated. Consistency can suffer because of caching too. Since authoritative servers are located within the organization, it may be possible that certain resource records are updated on the authoritative servers in case of server failures at the organization. Therefore, cached records at the default/local and ISP servers may be outdated. To mitigate this issue, each cached record comes with an expiration time called time to live (TTL). If we need the DNS to tell us which IP to reach a website or service, how will we know the DNS resolver’s IP address? (It seems like a chicken and egg problem.) End users’ operating systems have configuration files (/etc/resolv.conf in Linux) with the DNS resolvers’ IP addresses, which in turn obtain all information for them. (Often, DHCP provides the default DNS resolver IP address along with other configurations.) The end systems request DNS resolvers for any DNS queries. DNS resolvers have special software installed to resolve queries through the DNS infrastructure. The root server’s IP addresses are within the special software. Typically, the Berkeley Internet Name Domain (BIND) software is used on DNS resolvers. The InterNIC maintains the updated list of 13 root servers. 
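To see where a host actually finds its resolver, the short sketch below reads the nameserver entries from /etc/resolv.conf on a Linux machine, the configuration file mentioned above. It is a minimal illustration and assumes the standard resolv.conf format.

```python
# Minimal sketch: discover the configured DNS resolvers on a Linux host
# by reading /etc/resolv.conf (often populated by DHCP).
def configured_resolvers(path="/etc/resolv.conf"):
    resolvers = []
    try:
        with open(path) as f:
            for line in f:
                parts = line.split()
                if len(parts) >= 2 and parts[0] == "nameserver":
                    resolvers.append(parts[1])
    except FileNotFoundError:
        pass  # non-Linux hosts configure resolvers elsewhere
    return resolvers

print(configured_resolvers())   # e.g., the DHCP-provided resolver IPs
```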
Therefore, we break the chicken and egg problem by seeding each resolver with a prior knowledge of root DNS servers (whose IPs rarely change). Load Balancers The job of the load balancer is to fairly divide all client requests among the pool of available servers. Load balancers perform this job to avoid overloading or crashing servers. The load balancing layer is the first point of contact within a data center after the firewall. A load balancer may not be required if a service entertains a few hundred or even a few thousand requests per second. However, for increasing client requests, load balancers provide the following capabilities: Scalability: By adding servers, the capacity of the application/service can be increased seamlessly. Load balancers make such upscaling or downscaling transparent to the end users. Availability: Even if some servers go down or suffer a fault, the system still remains available. One of the jobs of load balancers is to hide the faults and failures of the servers. Performance: Load balancers can forward requests to servers with a lesser load so the user can get a quicker response time. This not only improves performance but also improves resource utilization. Placing load balancers Generally, LBs sit between clients and servers. Requests go through to servers and back to clients via the load balancing layer. However, that isn’t the only point where LBs are used. Let’s consider the three well-known groups of servers: the web, the application, and the database servers. To divide the traffic load among the available servers, LBs can be used between the server instances of these three services in the following way: Place LBs between end users of the application and web servers/application gateway. Place LBs between the web servers and application servers that run the business/application logic. Place LBs between the application servers and database servers. In reality, load balancers can be potentially used between any two services with multiple instances within the design of a system. Services offered by load balancers LBs not only enable services to be scalable, available, and highly performant, but they offer some key services like the following: Health checking: LBs use the heartbeat protocol to monitor the health and, therefore, the reliability of end servers. Another advantage of health checking is the improved user experience. TLS termination: LBs reduce the burden on end servers by handling TLS termination with the client. Predictive analytics: LBs can predict traffic patterns through analytics performed over traffic passing through them or using statistics of traffic obtained over time. Reduced human intervention: Because of LB automation, the system administration efforts that are required in handling failures are reduced. Service discovery: An advantage of LBs is that the client requests are forwarded to appropriate hosting servers by inquiring about the service registry. Security: LBs may also improve security by mitigating attacks like denial-of-service (DoS) at different layers of the OSI model (layers 3, 4, and 7). As a whole, load balancers provide flexibility, reliability, redundancy, and efficiency to the overall design of the system. What if load balancers fail? Are they not a single point of failure (SPOF)? Load balancers are usually deployed in pairs as a means of disaster recovery. If one load balancer fails, and there’s nothing to failover to, the overall service will go down. 
Generally, to maintain high availability, enterprises use clusters of load balancers that use heartbeat communication to check the health of load balancers at all times. On failure of primary LB, the backup can take over. But, if the entire cluster fails, manual rerouting can also be performed in case of emergencies. In any system, load balancing is necessary both on a global and local level. Let’s delve into the purpose of these two types of load balancing: Global server load balancing (GSLB): GSLB involves the distribution of traffic load across multiple geographical regions. Local load balancing: This refers to load balancing achieved within a data center. This type of load balancing focuses on improving efficiency and better resource utilization of the hosting servers in a datacenter. Global server load balancing GSLB ensures that the globally arriving traffic load is intelligently forwarded to a datacenter. For example, power or network failure in a datacenter requires that all the traffic be rerouted to another datacenter. GSLB takes forwarding decisions based on the users’ geographic locations, the number of hosting servers in different locations, the health of datacenters, and so on. GSLB offers automatic zonal failover. A GSLB service can be installed on premises or obtained through Load Balancing as a Service (LBaaS). The illustration below shows that GSLB can forward requests to three different datacenters. Each local load balancing layer within a datacenter will maintain a control plane connection with the GSLB providing information about the health of the LBs and the server farm. GSLB uses this information to drive traffic decisions and forward traffic load based on each region’s configuration and monitoring information. Load balancing in the DNS We understand that the DNS can respond with multiple IP addresses for a DNS query. The nslookup command-line tool can help us understand how DNS performs load balancing. DNS uses a simple technique of reordering the list of IP addresses in response to each DNS query. Therefore, different users get a reordered IP address list. It results in users visiting different servers to entertain their requests. In this way, the DNS distributes the load of requests to different datacenters. This is performing GSLB. In particular, the DNS uses the round-robin technique to perform load balancing, as shown below: As shown above, the round-robin technique in the DNS forwards clients to datacenters in a strict circular order. However, round-robin has the following limitations: Different ISPs have a different number of users. An ISP serving many customers will provide the same cached IP to its clients, resulting in uneven load distribution on end servers. Because the round-robin load balancing algorithm doesn’t consider any end server crashes, it keeps on distributing the IP addresses of the crashed servers until the TTL of the cached entries expires. In that case, the availability of the service can take a hit due to DNS-level load balancing. Despite its limitations, round-robin is still widely used by many DNS service providers. Furthermore, the DNS uses short TTL for cached entries to do effective load balancing among different datacenters. The need for local load balancers The DNS plays a vital role in balancing the load, but it suffers from the following limitations: The small size of the DNS packet (512 bytes) isn’t enough to include all possible IP addresses of the servers. There’s limited control over the client’s behavior. 
Clients may arbitrarily select from the received set of IP addresses. Some of the received IP addresses may belong to busy data centers. Clients can’t determine the closest address to establish a connection. Although DNS geolocation and solutions like anycasting can be implemented, they aren’t trivial solutions. In the case of failures, recovery can be slow through the DNS because of the caching mechanism, especially when TTL values are longer. What is local load balancing? Local load balancers reside within a datacenter. They behave like a reverse proxy and make their best effort to divide incoming requests among the pool of available servers. Incoming clients’ requests seamlessly connect to the LB that uses a virtual IP address (VIP). Algorithms of load balancers LBs distribute client requests according to an algorithm. Some well-known algorithms are given below: Round-robin scheduling: In this algorithm, each request is forwarded to a server in the pool in a repeating sequential manner. Weighted round robin: If some servers have a higher capability of serving clients’ requests, then it’s preferred to use a weighted round-robin algorithm. In a weighted round-robin algorithm, each node is assigned a weight. LBs forward client requests according to the weight of the node. The higher the weight, the higher the number of assignments. Least connections: In certain cases, even if all the servers have the same capacity to serve clients, uneven load on certain servers is still a possibility. For example, some clients may have a request that requires longer to serve. Alternatively, some clients may have subsequent requests on the same connection. In that case, we can use algorithms like least connections where newer arriving requests are assigned to servers with fewer existing connections. LBs keep a state of the number and mapping of existing connections in such a scenario. Least response time: In performance-sensitive services, algorithms such as the least response time are required. This algorithm ensures that the server with the least response time is requested to serve the clients. IP hash: Some applications provide a different level of service to users based on their IP addresses. In that case, hashing the IP address is performed to assign user requests to servers. URL hash: It may be possible that some services within the application are provided by specific servers only. In that case, a client requesting service from a URL is assigned to a certain cluster or set of servers. The URL hashing algorithm is used in those scenarios. There are also other algorithms, like randomized or weighted least connections algorithms. Static vs. dynamic algorithms Algorithms can be static or dynamic, depending on the machine’s state. Static algorithms don’t consider the changing state of the servers. Therefore, task assignment is carried out based on existing knowledge about the server’s configuration. Naturally, these algorithms aren’t complex, and they get implemented in a single router or commodity machine where all the requests arrive. Dynamic algorithms are algorithms that consider the current or recent state of the servers. Dynamic algorithms maintain state by communicating with the server, which adds a communication overhead. State maintenance makes the design of the algorithm much more complicated. Dynamic algorithms require different load balancing servers to communicate with each other to exchange information. 
Therefore, dynamic algorithms can be modular because no single entity will do the decision-making. Although this adds complexity to dynamic algorithms, it results in improved forwarding decisions. Finally, dynamic algorithms monitor the health of the servers and forward requests to active servers only. Note: In practice, dynamic algorithms provide far better results because they maintain a state of serving hosts and are therefore worth the effort and complexity. Stateful vs. stateless LBs While static and dynamic algorithms are required to consider the health of the hosting servers, a state is maintained to hold session information of different clients with hosting servers. If the session information isn’t kept at a lower layer (distributed cache or database), load balancers are used to keep the session information. Below, we describe two ways of handling session maintenance through LBs: Stateful Stateless Stateful load balancing As the name indicates, stateful load balancing involves maintaining a state of the sessions established between clients and hosting servers. The stateful LB incorporates state information in its algorithm to perform load balancing. Essentially, the stateful LBs retain a data structure that maps incoming clients to hosting servers. Stateful LBs increase complexity and limit scalability because the session information of all the clients is maintained across all load balancers. This means that LBs share their state information with each other to make forwarding decisions. Stateless load balancing Stateless load balancing maintains no state and is therefore faster and lightweight. Stateless LBs use consistent hashing to make forwarding decisions. However, if infrastructure changes (for example, a new application server joining), stateless LBs may not be as resilient as stateful LBs because consistent hashing alone isn’t enough to route a request to the correct application server. Therefore, a local state may still be required along with consistent hashing. Therefore, a state maintained across different load balancers is considered stateful load balancing. Whereas a state maintained within a load balancer for internal use is assumed to be stateless load balancing. For a website serving static content like text and images, a stateless load balancer is likely a good choice because there’s no need to track user sessions. This approach offers simplicity, improved performance, and easier scalability. Types of load balancers Depending on the requirements, load balancing can be performed at the network/transport and application layer of the open systems interconnection (OSI) layers. Layer 4 LBs: Layer 4 refers to the load balancing performed on the basis of transport protocols like TCP and UDP. These types of LBs maintain connection/session with the clients and ensure that the same (TCP/UDP) communication ends up being forwarded to the same back-end server. Even though TLS termination is performed at layer 7 LBs, some layer 4 LBs also support it. Layer 7 LBs: Layer 7 LBs are based on the data of application layer protocols. It’s possible to make application-aware forwarding decisions based on HTTP headers, URLs, cookies, and other application-specific data—for example, the user ID. Apart from performing TLS termination, these LBs can take responsibilities like rate limiting users, HTTP routing, and header rewriting. Note: Layer 7 LBs are smart in terms of inspection. However, layer 4 LBs are faster in terms of processing. 
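To tie the algorithms listed earlier (round robin, weighted round robin, least connections, and so on) to something concrete, here is a small sketch of two of them. The server names, weights, and connection counts are invented for illustration and are not part of any real load balancer's API.

```python
import itertools

# Illustrative sketch of two load-balancing policies; servers and weights are made up.
SERVERS = {"app-1": 3, "app-2": 1}   # weight 3:1, so app-1 receives roughly 75% of requests

def weighted_round_robin(servers):
    """Yield servers in proportion to their weights, in a repeating sequence."""
    expanded = [name for name, weight in servers.items() for _ in range(weight)]
    return itertools.cycle(expanded)

def least_connections(active_connections):
    """Pick the server currently holding the fewest open connections."""
    return min(active_connections, key=active_connections.get)

rr = weighted_round_robin(SERVERS)
print([next(rr) for _ in range(8)])                   # ['app-1', 'app-1', 'app-1', 'app-2', ...]
print(least_connections({"app-1": 12, "app-2": 4}))   # 'app-2'
```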
Conclusion In conclusion, load balancers efficiently distribute client requests among servers, ensuring scalability, availability, and improved performance. They offer services like health checking, TLS termination, predictive analytics, and security enhancements. Global and local load balancing techniques handle traffic distribution. Load balancers use algorithms and can be static or dynamic, stateful or stateless, and operate at Layer 4 or Layer 7. They provide flexibility, reliability, and efficiency to system design. Databases in Distributed Systems File storage The elementary and most convenient method to store data for an application is to use a simple file. However, using this approach has limitations, such as no concurrent management, limited access rights, and scalability and search challenges, as illustrated below. The limitations of simple file storage can be addressed using databases. Let's explore about the database in the following sections: Database A database is an organized collection of data that can be easily managed and accessed. Databases are created to make it easier to store, retrieve, modify, and delete data in connection with different data-processing procedures. Primarily, databases are divided into the following two categories: Relational databases are also called SQL databases because the primary language used to interact with these databases is SQL (Structured Query Language). The SQL operations include insertion, deletion, and retrieval of data. Non-relational databases are also called NoSQL (Not only SQL) because SQL is not the only primary language for interacting with such databases. They differ in terms of their intended use case, the type of information they hold, and the storage method they employ. Relational databases, like phone books that record contact numbers and addresses, are organized and have predetermined schemas. Non-relational databases, like file directories that store anything from a person’s constant information to shopping preferences, are unstructured, scattered, and feature a dynamic schema. Following are some of the reasons why the database is important: Managing large data: A large amount of data can be easily handled with a database, which wouldn’t be possible using other tools. Retrieving accurate data (data consistency): Due to different constraints in databases, we can retrieve accurate data whenever we want. Easy updation: It is quite easy to update data in databases using data manipulation language (DML). Security: Databases ensure the security of the data. A database only allows authorized users to access data. Data integrity: Databases ensure data integrity by using different constraints for data. Availability: Databases can be replicated on different servers, which can be concurrently updated. These replicas ensure availability. Scalability: Databases are divided into multiple partitions to manage the load on a single node. This increases scalability. Relational databases Relational databases store data in structured schemas, organizing it into tables with unique keys for each row. Data entities are represented as instances (tuples) and attributes, with instances stored in rows and attributes in columns. The tables within a database can be linked using primary and foreign keys, allowing connections between tuples in different tables. Relational databases provide atomicity, consistency, isolation, and durability (ACID) properties to maintain the integrity of the database. ACID is like a big hammer by design. 
This means that it’s generic enough for all problems. If some specific application only needs to deal with a few anomalies, there’s a window of opportunity to use a custom solution for higher performance, though there is added complexity. Why relational databases? One of the greatest powers of a relational database is its abstractions of ACID transactions and related programming semantics. This makes it very convenient for the end programmer to use a relational database. Let’s explore some important features of relational databases. Flexibility: In the context of SQL, data definition language (DDL) allows us to modify the database, including tables, columns, renaming the tables, and other changes. DDL even allows us to modify schema while other queries are happening and the database server is running. Reduced redundancy: Relational databases reduce redundancy through normalization techniques. In normalization, data is broken down into multiple tables and establishing relationships between them, minimizing duplicate storage and improving data integrity. Concurrency: Concurrency is vital in enterprise databases for coordinating simultaneous data access by multiple users. Relational databases handle concurrency through transactional access, ensuring data consistency and providing error-handling capabilities. Integration: Aggregating data from multiple sources in enterprise applications is often achieved through a shared database. This allows applications to access each other's data while concurrency control ensures smooth multi-application access. Backup and disaster recovery: Relational databases guarantee that the state of data is consistent at any time. The export and import operations make backup and restoration easier. Most cloud-based relational databases perform continuous mirroring to avoid data loss and make restoration easier and quicker. Why non-relational (NoSQL) databases? A NoSQL database is designed for a variety of data models to access and manage data. There are various types of NoSQL databases, which we’ll explain in the next section. These databases are used in applications that require a large volume of semi-structured and unstructured data, low latency, and flexible data models. This can be achieved by relaxing some of the data consistency restrictions of other databases. Here are some characteristics of the NoSQL database: Simple design: Unlike relational databases, NoSQL doesn’t require dealing with the impedance mismatch—for example, storing all the employees’ data in one document instead of multiple tables that require join operations. This strategy makes it simple and easier to write less code, debug, and maintain. Horizontal scaling: NoSQL is primarily preferred due to its ability to run databases on a large cluster. This solves the problem when the number of concurrent users increases. NoSQL makes it easier to scale out since the data related to a specific employee is stored in one document instead of multiple tables over nodes. NoSQL databases often spread data across multiple nodes and balance data and queries across nodes automatically. In case of a node failure, it can be transparently replaced without any application disruption. Availability: To enhance the availability of data, node replacement can be performed without application downtime. Most of the non-relational databases’ variants support data replication to ensure high availability and disaster recovery. 
Support for unstructured and semi-structured data: Many NoSQL databases work with data that doesn’t have schema at the time of database configuration or data writes. For example, document databases are structureless. They allow documents (JSON, XML, BSON, and so on) to have different fields. For example, one JSON document can have fewer fields than the others. Cost: Licenses for many RDBMSs are pretty expensive, while many NoSQL databases are open-source and freely available. Similarly, some RDBMSs rely on costly proprietary hardware and storage systems, while NoSQL databases usually use clusters of cheap commodity servers. NoSQL databases are divided into various categories based on the nature of the operations and features, including document store, columnar database, key-value store, and graph database. We’ll discuss each of them and their use cases from the system design perspective in the following sections. Types of NoSQL databases Key-value databases: These databases use key-value methods like hash tables to store data in key-value pairs. The key serves as a unique or primary key, and the values can be anything ranging from simple scalar values to complex objects. These databases allow easy partitioning and horizontal scaling of the data. Some popular key-value databases include Amazon DynamoDB, Redis, and Memcached DB. Document databases: Document databases are designed to store and retrieve documents in formats like XML, JSON, BSON, and so on. These documents are composed of a hierarchical tree data structure that can include maps, collections, and scalar values. Documents in this type of database may have varying structures and data. MongoDB and Google Cloud Firestore are examples of document databases. Graph database: These databases use the graph data structure to store data, where nodes represent entities, and edges show relationships between entities. The organization of nodes based on relationships leads to interesting patterns between the nodes. This database allows us to store the data once and then interpret it differently based on relationships. Popular graph databases include Neo4J, OrientDB, and InfiniteGraph. Graph data is kept in store files for persistent storage. Each of the files contains data for a specific part of the graph, such as nodes, links, properties, and so on. Columnar database: These databases store data in columns instead of rows. They enable quick and efficient access to all entries in the database column. Popular columnar databases include Cassandra, HBase, Hypertable, and Amazon SimpleDB. Data replication Data is an asset for an organization because it drives the whole business. Data provides critical business insights into what’s important and what needs to be changed. Organizations also need to securely save and serve their clients’ data on demand. Timely access to the required data under varying conditions (increasing reads and writes, disks and node failures, network and power outages, and so on) is required to successfully run an online business. We need the following characteristics from our data store: Availability under faults (failure of some disk, nodes, and network and power outages). Scalability (with increasing reads, writes, and other operations). Performance (low latency and high throughput for the clients). It’s challenging, or even impossible, to achieve the characteristics above on a single node. 
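Returning to the document-model point above, namely that documents in the same collection may have different fields, here is a tiny sketch of schema-free storage using plain JSON documents keyed by ID. The employee records and field names are made up for the example.

```python
import json

# Illustrative sketch: documents in the same "collection" need not share a schema.
employees = [
    {"id": 1, "name": "Amal", "skills": ["python", "sql"]},
    {"id": 2, "name": "Bea", "address": {"city": "Austin"}},   # different fields, no schema change needed
]

collection = {doc["id"]: json.dumps(doc) for doc in employees}   # key-value storage of JSON documents
print(json.loads(collection[2])["address"]["city"])              # 'Austin'
```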
Replication Replication refers to keeping multiple copies of the data at various nodes (preferably geographically distributed) to achieve availability, scalability, and performance. We assume that a single node is enough to hold all our data. We won’t use this assumption while discussing data partitioning in multiple nodes. Often, the concepts of replication and partitioning go together. However, with many benefits, like availability, replication comes with its complexities. Replication is relatively simple if the replicated data doesn’t require frequent changes. The main problem in replication arises when we have to maintain changes in the replicated data over time. Synchronous vs. asynchronous replication There are two ways to disseminate changes to the replica nodes: Synchronous replication Asynchronous replication In synchronous replication, the primary node waits for acknowledgments from secondary nodes about updating the data. After receiving acknowledgment from all secondary nodes, the primary node reports success to the client. In asynchronous replication, the primary node doesn’t wait for acknowledgment from the secondary nodes and reports success to the client after updating itself. The concepts are discussed in the following illustration: The advantage of synchronous replication is that all the secondary nodes are completely up to date with the primary node. However, there’s a disadvantage to this approach. If one of the secondary nodes doesn’t acknowledge due to failure or fault in the network, the primary node would be unable to acknowledge the client until it receives the successful acknowledgment from the crashed node. This causes high latency in the response from the primary node to the client. On the other hand, the advantage of asynchronous replication is that the primary node can continue its work even if all the secondary nodes are down. However, if the primary node fails, the writes that weren’t copied to the secondary nodes will be lost. Partitioning To divide the load among multiple nodes, we need to partition the data by a phenomenon known as partitioning or sharding. In this approach, we split a large dataset into smaller chunks of data stored at different nodes on our network. The partitioning must be balanced so that each partition receives about the same amount of data. If partitioning is unbalanced, the majority of queries will fall into a few partitions. Partitions that are heavily loaded will create a system bottleneck. The efficacy of partitioning will be harmed because a significant portion of data retrieval queries will be sent to the nodes that carry the highly congested partitions. Such partitions are known as hotspots. Generally, we use the following two ways to shard the data: Vertical partitioning Horizontal partitioning Vertical partitioning We can put different tables in various database instances, which might be running on a different physical server. We might break a table into multiple tables so that some columns are in one table, while the rest are in the other. We should be careful if there are joins between multiple tables. We may like to keep such tables together on one shard. Vertical partitioning is often used to increase data retrieval speed from a table consisting of columns with very wide text or a binary large object (BLOB). In this case, the column with the large text or a BLOB is split into a different table. 
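As a minimal sketch of the column split described above, the snippet below uses SQLite to move a wide Picture BLOB out of a hypothetical Employee table into a separate EmployeePicture table keyed by the same EmployeeID.

```python
import sqlite3

# Sketch only: vertically partition a hypothetical Employee table by moving the
# wide Picture BLOB into its own table, keyed by the same EmployeeID.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Employee        (EmployeeID INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE EmployeePicture (EmployeeID INTEGER PRIMARY KEY, Picture BLOB,
                                  FOREIGN KEY (EmployeeID) REFERENCES Employee(EmployeeID));
""")
conn.execute("INSERT INTO Employee VALUES (?, ?)", (1, "Amal"))
conn.execute("INSERT INTO EmployeePicture VALUES (?, ?)", (1, b"\x89PNG..."))

# Reconstructing the original row is a join on the shared key.
row = conn.execute("""
    SELECT e.EmployeeID, e.Name, p.Picture
    FROM Employee e JOIN EmployeePicture p ON e.EmployeeID = p.EmployeeID
""").fetchone()
print(row[:2])   # (1, 'Amal'); the BLOB is only read when it is actually required
```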
Vertical partitioning has its intricacies and is more amenable to manual partitioning, where stakeholders carefully decide how to partition data. In comparison, horizontal partitioning is suitable for automation, even under dynamic conditions. As shown in the following figure, the Employee table is divided into two tables: a reduced Employee table and an EmployeePicture table. The EmployeePicture table has just two columns, EmployeeID and Picture, separated from the original table. Moreover, the primary key EmployeeID of the Employee table is added to both partitioned tables. This makes reads and writes easier, and the original table can be reconstructed efficiently.

Horizontal partitioning
At times, some tables in the database become too big and affect read/write latency. Horizontal partitioning, or sharding, divides a table into multiple tables by splitting data row-wise. Each partition of the original table distributed over database servers is called a shard. There are two primary strategies for horizontal partitioning:
Key-range based partitioning
Hash-based partitioning

Key-range based partitioning
In key-range based partitioning, each partition is assigned a continuous range of keys. In the following figure, horizontal partitioning on the Invoice table is performed using key-range based sharding with Customer_Id as the partition key. The two different colored tables represent the partitions. Sometimes, a database consists of multiple tables bound by foreign key relationships. In such a case, horizontal partitioning is performed using the same partition key on all tables in the relation. Tables (or subtables) that belong to the same partition key are distributed to one database shard.

Hash-based partitioning
Hash-based partitioning applies a hash function to a chosen attribute, and the resulting hash value determines the partition. The main concept is to hash the key and then take the result modulo the number of partitions. Once we've found an appropriate hash function for keys, we may give each partition a range of hashes (rather than a range of keys). Any key whose hash falls inside that range is kept in that partition. Apart from key-range and hash-based sharding, various other techniques are used for horizontal sharding, including consistent hashing, list-based, and round-robin sharding.

Why do you need databases? Why can't you just use files? Databases are crucial for several reasons: they efficiently manage large volumes of data, ensure data consistency through constraints, allow easy data updates, provide robust security measures to protect data, ensure data integrity, offer high availability through replication, support scalability through partitioning, enable efficient data retrieval, and facilitate backup and recovery to safeguard against data loss. These capabilities make databases superior to simple files for storing and managing data in most system design contexts.

Conclusion
In conclusion, databases offer a solution to the limitations of simple file storage, providing organized and accessible data management. Relational databases excel in structured schemas and data integrity, while NoSQL databases offer simplicity, scalability, and support for diverse data models. Replication and partitioning techniques enhance availability and performance.
Organizations can achieve reliable and efficient data management solutions by leveraging the strengths of different database types and implementing effective strategies. Design a Key-Value Store Introduction to key-value stores Key-value stores are distributed hash tables (DHTs). A key is generated by the hash function and should be unique. In a key-value store, a key binds to a specific value and doesn't assume anything about the structure of the value. A value can be a blob, image, server name, or anything the user wants to store against a unique key. Let’s list the requirements of designing a key-value store to overcome the problems of traditional databases. Functional requirements The functional requirements are as follows: Configurable service: Some applications might have a tendency to trade strong consistency for higher availability. We need to provide a configurable service so that different applications could use a range of consistency models. We need tight control over the trade-offs between availability, consistency, cost-effectiveness, and performance. Ability to always write: The applications should always have the ability to write into the key-value storage. Hardware heterogeneity: The system shouldn’t have distinguished nodes. Each node should be functionally able to do any task. Non-functional requirements The non-functional requirements are as follows: Scalable: Key-value stores should run on tens of thousands of servers distributed across the globe. Incremental scalability is highly desirable, and the system should be able to handle a large number of users. Available: We need to provide continuous service, so availability is very important. This property is configurable. Fault tolerance: The key-value store should operate uninterrupted despite failures in servers or their components. Now that you understand the functional and non-functional requirements of a key-value store, what do you think are the key differences between key-value stores and traditional databases? In what scenarios are key-value stores particularly advantageous? Key-value stores and traditional relational databases differ primarily in their structure and approach to data management. Key-value stores are designed for simplicity and speed, focusing on retrieving values directly through keys without the need for a complex query language. They often prioritize availability and scalability, sometimes at the expense of strict consistency, and are suited for unstructured data. These stores are particularly advantageous in scenarios requiring rapid data access, handling large volumes of data, or when the data structure is flexible or not fully known in advance. This makes them ideal for high-performance applications and services where response time and scalability are critical. API design Key-value stores, like ordinary hash tables, provide two primary functions, which are get and put. The get function: The API call to get a value is get(key). The key is to retrieve a specific value associated with the key. The put function: The API call to insert a value is put(key, value). Let’s start with adding scalability in the Key-values Store, which is one of the core design requirements, in the following section: Scalability We store key-value data in storage nodes. With a change in demand, we might need to add or remove storage nodes. It means we need to partition data over the nodes in the system to distribute the load across all nodes. 
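One way to see why key placement matters is the naive approach of hashing a key modulo the number of nodes: it spreads keys evenly, but adding or removing a node changes the placement of most keys. The sketch below, with made-up node names, illustrates this and motivates the consistent hashing approach discussed next.

```python
import hashlib

# Sketch: naive hash-mod-N placement remaps most keys when the node count changes.
# Node names and keys are illustrative.
def place(key: str, nodes: list) -> str:
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return nodes[digest % len(nodes)]

keys = [f"user:{i}" for i in range(10_000)]
before = {k: place(k, ["node-a", "node-b", "node-c"]) for k in keys}
after  = {k: place(k, ["node-a", "node-b", "node-c", "node-d"]) for k in keys}

moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved / len(keys):.0%} of keys moved after adding one node")   # roughly 75%
```

Consistent hashing limits this reshuffling to roughly the keys owned by the node that joins or leaves the ring.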
To achieve scalability, we'll use consistent hashing due to its ability to provide a balanced load distribution. Use virtual nodes We’ll use virtual nodes to ensure a more evenly distributed load across the nodes. Instead of applying a single hash function, we’ll apply multiple hash functions onto the same key. Let’s take an example. Suppose we have three hash functions. For each node, we calculate three hashes and place them into the ring. For the request, we use only one hash function. Wherever the request lands onto the ring, it's processed by the next node found while moving in the clockwise direction. Each server has three positions, so the load of requests is more uniform. Moreover, if a node has more hardware capacity than the others, we can add more virtual nodes by using additional hash functions. This way, it'll have more positions in the ring and serve more requests. We’ve made the proposed design of key-value storage scalable. The next task is to make our system highly available. Data replication We have various methods to replicate the storage. It can be either a primary-secondary relationship or a peer-to-peer relationship. When network partitions and node failures occur during an update, an object's version history might become fragmented. As a result, it requires a reconciliation effort on the part of the system. It’s necessary to build a way that explicitly accepts the potential of several copies of the same data to avoid losing any updates. So to handle inconsistency, we need to maintain causality between the events. To maintain causality effectively, we will use vector clocks. A vector clock is a list of (node, counter) pairs. There’s a single vector clock for every version of an object. If two objects have different vector clocks, we can tell whether they’re causally related (more on this in a bit). Unless one of the two changes is reconciled, the two are deemed at odds. Usage of 𝑟 and 𝑤 One of our functional requirements is that the system should be configurable. We want to control the trade-offs between availability, consistency, cost-effectiveness, and performance. Let’s make our service configurable by having the ability to control these trade-offs. We can use a consistency protocol similar to those used in quorum systems. Handle temporary failures What are the limitations of using hinted handoff? A minimal churn in system membership and transient node failures are ideal for hinted handoff. However, hinted replicas may become unavailable before being restored to the originating replica node in certain circumstances. Handle permanent failures In the event of permanent failures of nodes, we should keep our replicas synchronized to make our system more durable. We need to speed up the detection of inconsistencies between replicas and reduce the quantity of transferred data. We’ll use Merkle trees for that. In a Merkle tree, the values of individual keys are hashed and used as the tree leaves. There are hashes of their children in the parent nodes higher up the tree. Each branch of the Merkle tree can be verified independently without downloading the complete tree or the entire dataset. Anti-entropy with Merkle trees Each node keeps a distinct Merkle tree for the range of keys it hosts for each virtual node. The nodes can determine if the keys in a given range are correct. The root of the Merkle tree corresponding to the common key ranges is exchanged between two nodes. We’ll make the following comparison: 1. Compare the hashes of the root node of Merkle trees. 2. 
Do not proceed if they’re the same. 3. Traverse left and right children using recursion. The nodes identify whether or not they have any differences and perform the necessary synchronization. The disadvantage of using Merkle trees is when a node joins or departs the system, the tree’s hashes are recalculated because multiple key ranges are affected. We want our nodes to detect the failure of other nodes in the ring, so let’s see how we can add it to our proposed design. Promote membership in the ring to detect failures The nodes can be offline for short periods, but they may also indefinitely go offline. We shouldn’t rebalance partition assignments or fix unreachable replicas when a single node goes down because it’s rarely a permanent departure. Therefore, the addition and removal of nodes from the ring should be done carefully. Planned commissioning and decommissioning of nodes results in membership changes. These changes form history. They’re recorded persistently on the storage for each node and reconciled among the ring members using a gossip protocol. A gossip-based protocol also maintains an eventually consistent view of membership. When two nodes randomly choose one another as their peer, both nodes can efficiently synchronize their persisted membership histories. Conclusion A key-value store provides flexibility and allows us to scale the applications that have unstructured data. Key-value stores can power real-time recommendations and advertising because the stores can swiftly access and present fresh recommendations. Design a Content Delivery Network (CDN) What is a CDN? A CDN is a group of geographically distributed proxy servers. A proxy server is an intermediate server between a client and the origin server. The proxy servers are placed on the network edge. Because the network edge is close to the end users, the placement of proxy servers helps quickly deliver the content to the end users by reducing latency and saving bandwidth. A CDN has added intelligence on top of being a simple proxy server. Note: The network edge is the zone where a device or local network interfaces with the Internet. Following are some of the advantages of a CDN: Improves content delivery speed by leveraging a distributed network of servers. Reduces server load by offloading static content delivery to the CDN. Increases scalability and availability through load distribution and failover mechanisms. Global reach and geo-targeting for faster and location-specific content delivery. Functional requirements The following functional requirements will be a part of our design: Retrieve: Depending upon the type of CDN model, a CDN should be able to retrieve content from the origin servers. We’ll cover CDN models in the coming sections. Request: Content delivery from the proxy server is made upon the user’s request. CDN proxy servers should be able to respond to each user’s request in this regard. Deliver: In the case of the push model, the origin servers should be able to send the content to the CDN proxy servers. Search: The CDN should be able to execute a search against a user query for cached or otherwise stored content within the CDN infrastructure. Update: In most cases, content comes from the origin server, but if we run script in CDN, the CDN should be able to update the content within peer CDN proxy servers in a Point of Presence (PoP). Delete: Depending upon the type of content (static or dynamic), it should be possible to delete cached entries from the CDN servers after a certain period. 
Note: A Point of Presence (PoP) is a physical place that allows two or more networks or devices to communicate with each other. Typically, each CDN PoP has a large number of cache servers.

Non-functional requirements
Performance: Minimizing latency is one of the core missions of a CDN. The proposed design should have the minimum possible latency.
Availability: CDNs are expected to be available at all times because of their effectiveness. Availability includes protection against attacks like DDoS.
Scalability: An increasing number of users will request content from CDNs. Our proposed CDN design should be able to scale horizontally as the requirements increase.
Reliability and security: Our CDN design should ensure no single point of failure. Apart from failures, the designed CDN must reliably handle massive traffic loads. Furthermore, CDNs should protect hosted content from various attacks.

CDN components
The following components comprise a CDN:
Clients: End users use various clients, like browsers, smartphones, and other devices, to request content from the CDN.
Routing system: The routing system directs clients to the nearest CDN facility. To do that effectively, this component receives input from various systems to understand where content is placed, how many requests are made for particular content, the load a particular set of servers is handling, and the URI (Uniform Resource Identifier) namespace of various contents. In a later section, we'll discuss different routing mechanisms to forward users to the nearest CDN facility.
Scrubber servers: Scrubber servers are used to separate good traffic from malicious traffic and protect against well-known attacks, like DDoS. Scrubber servers are generally used only when an attack is detected. In that case, the traffic is scrubbed or cleaned and then routed to the target destination.
Proxy servers: The proxy or edge proxy servers serve the content from RAM to the users. Proxy servers store hot data in RAM, though they can store cold data on SSDs or hard drives as well. These servers also provide accounting information and receive content from the distribution system.
Distribution system: The distribution system is responsible for distributing content to all the edge proxy servers in different CDN facilities. This system uses the Internet and intelligent broadcast-like approaches to distribute content across the active edge proxy servers.
Origin servers: The CDN infrastructure facilitates users with data received from the origin servers. The origin servers serve any data unavailable at the CDN to clients. Origin servers use appropriate stores to keep content and other mapping metadata. However, we won't discuss the internal architecture of the origin infrastructure here.
Management system: The management systems are important in CDNs from a business and managerial aspect, where resource usage and statistics are constantly observed. This component measures important metrics, like latency, downtime, packet loss, server load, and so on. For third-party CDNs, accounting information can also be used for billing purposes.

Workflow
The workflow for the abstract design is given below:
1. The origin servers provide the URI namespace delegation of all objects cached in the CDN to the request routing system.
2. The origin server publishes the content to the distribution system responsible for data distribution across the active edge proxy servers.
3.
The distribution system distributes the content among the proxy servers and provides feedback to the request routing system. This feedback is helpful in optimizing the selection of the nearest proxy server for a requesting client. This feedback contains information about what content is cached on which proxy server to route traffic to relevant proxy servers. 4. The client requests the routing system for a suitable proxy server from the request routing system. 5. The request routing system returns the IP address of an appropriate proxy server. 6. The client request routes through the scrubber servers for security reasons. 7. The scrubber server forwards good traffic to the edge proxy server. 8. The edge proxy server serves the client request and periodically forwards accounting information to the management system. The management system updates the origin servers and sends feedback to the routing system about the statistics and detail of the content. However, the request is routed to the origin servers if the content isn’t available in the proxy servers. It’s also possible to have a hierarchy of proxy servers if the content isn’t found in the edge proxy servers. For such cases, the request gets forwarded to the parent proxy servers. Push CDN Content gets sent automatically to the CDN proxy servers from the origin server in the push CDN model. The content delivery to the CDN proxy servers is the content provider’s responsibility. Push CDN is appropriate for static content delivery, where the origin server decides which content to deliver to users using the CDN. The content is pushed to proxy servers in various locations according to the content’s popularity. If the content is rapidly changing, the push model might struggle to keep up and will do redundant content pushes. Pull CDN A CDN pulls the unavailable data from origin servers when requested by a user. The proxy servers keep the files for a specified amount of time and then remove them from the cache if they’re no longer requested to balance capacity and cost. When users request web content in the pull CDN model, the CDN itself is responsible for pulling the requested content from the origin server and serving it to the users. Therefore, this type of CDN is more suited for serving dynamic content. As stated, the push CDN is mostly used for serving static content. Since static content is served to a wide range of users for longer than dynamic content, the push CDN scheme maintains more replicas than the pull CDN, improving availability. On the other hand, the pull CDN is favored for frequently changing content and a high traffic load. Low storage consumption is one of the main benefits of the pull CDN. Note: Most content providers use both pull and push CDN caching approaches to get the benefits of both. Dynamic content caching optimization Since dynamic content often changes, it’s a good idea to cache it optimally. This section deals with the optimization of frequently changing content. Certain dynamic content creation requires the execution of scripts that can be executed at proxy servers instead of running on the origin server. Dynamic data can be generated using various parameters, which can be beneficial if executed at the proxy servers. For example, we can generate dynamic content based on user location, time of day at a location, third-party APIs specific to a location (for instance, weather API), and so on. So, it’s optimal to run the scripts at proxy servers instead of the origin servers. 
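As a rough sketch of the pull model described above, the proxy below serves content from its cache when it can and pulls from the origin on a miss or after the retention period expires. The fetch_from_origin function and the retention value are stand-ins, not a real CDN API.

```python
import time

# Sketch of pull-CDN behavior at one proxy; fetch_from_origin and the retention period are placeholders.
CACHE = {}                      # path -> (content, expires_at)
RETENTION_SECONDS = 300

def fetch_from_origin(path):
    return f"<content of {path} from origin>"   # stand-in for an HTTP request to the origin server

def serve(path):
    entry = CACHE.get(path)
    if entry and entry[1] > time.time():
        return entry[0]                          # cache hit: served from the edge
    content = fetch_from_origin(path)            # cache miss or expired: pull from the origin
    CACHE[path] = (content, time.time() + RETENTION_SECONDS)
    return content

print(serve("/static/logo.png"))   # the first request pulls from the origin
print(serve("/static/logo.png"))   # the second request is served from the proxy's cache
```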
To reduce communication between the origin server and the proxy servers, as well as storage requirements at the proxy servers, it's useful to employ compression techniques. For example, Cloudflare uses Railgun to compress dynamic content.

Multi-tier CDN architecture
The content provider sends the content to a large number of clients through a CDN. The task of distributing data to all the CDN proxy servers simultaneously is challenging and burdens the origin server significantly. CDNs follow a tree-like structure to ease the data distribution process for the origin server. The edge proxy servers have some peer servers that belong to the same hierarchy. This set of servers receives data from the parent nodes in the tree, which eventually receive data from the origin servers. The data is copied from the origin server to the proxy servers by following different paths in the tree.

The tree structure for data distribution allows us to scale our system for increasing users by adding more server nodes to the tree. It also reduces the burden on the origin server for data distribution. A CDN typically has one or two tiers of proxy servers (caches). The following illustration shows the two tiers of proxy servers:

Find the nearest proxy server to fetch the data
It's vital for the user to fetch data from the nearest proxy server because the CDN aims to reduce user-perceived latency by bringing the data close to the user. However, the question remains: how do users worldwide get routed to the nearest proxy server? The goal of this section is to answer that question.

Important factors that affect the proximity of the proxy server
There are two important factors that are relevant to finding the nearest proxy server to the user:
Network distance between the user and the proxy server is crucial. This is a function of the following two things:
○ The first is the length of the network path.
○ The second is the capacity (bandwidth) limits along the network path.
The proxy server reachable over the shortest network path with the highest capacity (bandwidth) is the nearest proxy server to the user in question. This path helps the user download content more quickly.
Request load refers to the load a proxy server handles at any point in time. If a set of proxy servers is overloaded, the request routing system should forward the request to a location with a lower load. This action balances the proxy server load and consequently reduces the response latency.

Let's look at the techniques that can be used to route users to the nearest proxy server.

DNS redirection
In a typical DNS resolution, we use the DNS system to get an IP address for a human-readable name. However, the DNS can also return another URI (instead of an IP) to the client. Such a mechanism is called DNS redirect. Content providers can use DNS redirect to send a client to a specific CDN. As an example, if the client tries to resolve a name that has the word "video" in it, the authoritative DNS server provides another URL (for example, cdn.xyz.com). The client does another DNS resolution, and the CDN's authoritative DNS provides the IP address of an appropriate CDN proxy server to fetch the required content. Depending on the location of the user, the response of the DNS can be different. The slides below show how DNS redirection works:

There are two steps in the DNS redirection approach:
1. It maps the clients to the appropriate network location.
2. It distributes the load over the proxy servers in that location to balance the load among them (see the DNS and load balancer building blocks for more details on this).

DNS redirection takes both of these important factors—network distance and request load—into consideration, which reduces the latency toward a proxy server.

Anycast
Anycast is a routing methodology in which all the edge servers located in multiple locations share the same single IP address. It employs the Border Gateway Protocol (BGP) to route clients based on the Internet's natural network flow. A CDN provider can use the anycast mechanism so that clients are directed to the nearest proxy servers for content.

Note: BGP is a network-level protocol used by Internet edge routers to share routing and reachability information so that every node on the network, even if independent, is aware of the status of its closest network neighbors.

Client multiplexing
Client multiplexing involves sending a client a list of candidate servers. The client then chooses one server from the list to send the request to. This approach is inefficient because the client lacks the overall information needed to choose the most suitable server for its request. This may result in sending requests to an already loaded server and experiencing higher access latency.

HTTP redirection
HTTP redirection is the simplest of all approaches. With this scheme, the client requests content from the origin server. The origin server responds with an HTTP redirect that points the client to a URL from which the content can be fetched.

Apart from finding the nearest server to fetch data from, another important consideration is ensuring content consistency in a CDN. Following are some approaches through which we can ensure content consistency in a CDN:

Mechanisms to Ensure Content Consistency in a CDN

Ensure non-functional requirements
We evaluate the CDN's design with respect to the non-functional requirements it meets:
Performance: We greatly minimized latency using proxy servers, including those in CDNs and ISPs/IXPs. Then, we served content from nonvolatile storage systems and paired that with a request routing system directing users to the nearest proxy server.
Availability: Our CDN can handle large traffic volumes with distributed cached content that ensures availability, redundancy through operational proxy servers and replicated data, and load balancing to direct user requests to active proxy servers.
Scalability: The design of our CDN facilitates scalability by bringing content closer to users, reducing bandwidth requirements, and enabling horizontal scalability through additional edge proxy servers, while also addressing limitations with a layered architecture to expand storage capacity.
Reliability and security: A CDN prevents single points of failure through careful maintenance and additional hardware and software, distributes traffic loads evenly across edge proxy servers, uses scrubber servers to prevent DDoS attacks and securely host content, monitors server health with the heartbeat protocol, and can be customized for real-time applications to ensure secure content delivery.

Design a Unique ID Generator

Requirements
Unique IDs are important for identifying events and objects within a distributed system. However, designing a unique ID generator within a distributed system is challenging.

Using UUID
Pros
The pros of using the UUID approach are as follows:
It's a simple approach.
It's available.
It's scalable.
The system is scalable because each server generates IDs independently. It doesn't require synchronization between servers.

Cons
Using 128-bit numbers as primary keys makes primary key indexing slower, which results in slow inserts. A workaround might be interpreting an ID as a hex string instead of a number. However, non-numeric identifiers might not be suitable for many use cases. The ID isn't 64 bits in size. Moreover, there's a chance of duplication. Although this chance is minimal, we can't claim UUIDs to be deterministically unique. Additionally, UUIDs given to clients over time might not be monotonically increasing. The following table summarizes the requirements we have fulfilled using UUID:

Using a database
Let's try mimicking the auto-increment feature of a database. Consider a central database that provides a current ID and then increments the value by one. We can use the current ID as a unique identifier for our events.

What can be a potential problem of using a central database?
This design has a considerable problem: a single point of failure. Reliance on one database can severely affect a system. The entire system will stop working if the central database goes down.

To cater to the problem of a single point of failure, we modify the conventional auto-increment feature that increments by one. Instead of incrementing by one, let's rely on a value m, where m equals the number of database servers we have. Each server generates an ID, and the next ID on that server is obtained by adding m to the previous value. This method is scalable and prevents the duplication of IDs. The following image provides a visualization of how a unique ID is generated using a database:

Pros
This approach is scalable. We can add more servers, and the value of m will be updated accordingly.

Cons
Though this method is somewhat scalable, it's difficult to scale for multiple data centers. The task of adding and removing a server can result in duplicate IDs. For example, suppose m=3, and server A generates the unique IDs 1, 4, and 7. Server B generates the IDs 2, 5, and 8, while server C generates the IDs 3, 6, and 9. Server B faces downtime due to some failure. Now, the value m is updated to 2. Server A generates 9 as its next unique ID, but this ID has already been generated by server C. Therefore, the IDs aren't unique anymore.

The table below highlights the limitations of our solution. A unique ID generation system shouldn't be a single point of failure. It should be scalable and available.

Using a range handler
Let's try to overcome the problems identified in the previous methods. We can use ranges in a central server. Suppose we divide the identifier space of one to two billion into ranges such as 1 to 1,000,000; 1,000,001 to 2,000,000; and so on. In such a case, a central microservice can provide a range to a server upon request. Any server can claim a range when it needs one for the first time or when it runs out of its current range. Suppose a server has a range, and it keeps the start of the range in a local variable. Whenever a request for an ID is made, it provides the local variable's value to the requestor and increments the value by one. Let's say server 1 claims the number range 300,001 to 400,000. After this range claim, the user ID 300,001 is assigned to the first request.
The server then returns 300,002 to the next user, incrementing its current position within the range. This continues until the server issues user ID 400,000. The application server then queries the central server for the next available range and repeats this process. This resolves the problem of duplicate user IDs. Each application server can respond to requests concurrently. We can add a load balancer over the set of servers to mitigate the request load.

We use a microservice called the range handler that records all the taken and available ranges. The status of each range determines whether it's available or not. The state—that is, which server has what range assigned to it—can be saved on replicated storage.

Imagine a scenario where you're planning to bring a new data center online to accommodate increasing user requests. How would you modify your range allocation strategy to accommodate this change?
To accommodate the new data center and manage increasing user requests efficiently, you would modify your range allocation strategy by implementing geographic sharding. This approach involves allocating ranges based on the data center location, ensuring that servers within the same geographic area access the same range handler. This method helps in optimizing response times and managing load more effectively across different locations.

Pros
This system is scalable, available, and yields user IDs with no duplicates. Moreover, we can maintain the identifier in 64 bits, which is numeric.

Cons
We lose a significant part of a range when a server dies, and it can only claim a new range once it's live again. We can overcome this shortcoming by allocating shorter ranges to the servers, although ranges should be large enough to serve identifiers for a while.

We developed a solution that provides us with a unique ID, which we can assign to various events and even use as a primary key. But what if we add a requirement that the ID be time-sortable too?

Using UNIX time stamps
UNIX time stamps are granular to the millisecond and can be used to distinguish different events. We have an ID-generating server that can generate one ID in a single millisecond. Any request to generate a unique ID is routed to that server, which returns an identifier derived from the current timestamp. The ability to generate an ID each millisecond allows us to generate a thousand identifiers per second. This means we can get 24 (hours) * 60 (min/hour) * 60 (sec/min) * 1,000 (IDs/sec) = 86,400,000 IDs in a day. That's less than one billion per day.

Our system works well at generating IDs, but it poses a crucial problem. The ID-generating server is a single point of failure (SPOF), and we need to handle it. To fix the SPOF, we can add more servers. Each server generates a unique ID every millisecond. We attach the server ID to the UNIX timestamp to make the overall identifier unique across the system. Then, we add a load balancer to distribute the traffic more efficiently. The design of a unique ID generator using UNIX time stamps is given below:

Pros
This approach is simple, scalable, and easy to implement. It also enables multiple servers to handle concurrent requests.

Cons
If two events occur concurrently on the same server within the same millisecond, they get the same timestamp, and the same ID can be assigned to them. This way, the IDs are no longer unique.

Twitter Snowflake
The explanation of the bits division is as follows: The following slides show the conversion of the timestamp to UTC.
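Since the bit-division table and slides aren't reproduced here, the sketch below assumes the commonly cited Snowflake layout—1 unused sign bit, a 41-bit millisecond timestamp measured from a custom epoch, a 10-bit machine ID, and a 12-bit per-millisecond sequence—to show how the pieces combine into a 64-bit, time-sortable identifier. The epoch constant and class name are illustrative.

```python
import time
import threading

CUSTOM_EPOCH_MS = 1_288_834_974_657  # any fixed past instant works as the custom epoch

class SnowflakeGenerator:
    """Sketch of a Snowflake-style generator: 41-bit time, 10-bit machine ID, 12-bit sequence."""

    def __init__(self, machine_id: int):
        assert 0 <= machine_id < 1024  # must fit in 10 bits
        self.machine_id = machine_id
        self.sequence = 0
        self.last_ms = -1
        self.lock = threading.Lock()

    def next_id(self) -> int:
        with self.lock:
            now_ms = int(time.time() * 1000)
            if now_ms == self.last_ms:
                # Same millisecond: bump the sequence; spin if all 4,096 IDs are used up.
                self.sequence = (self.sequence + 1) & 0xFFF
                if self.sequence == 0:
                    while now_ms <= self.last_ms:
                        now_ms = int(time.time() * 1000)
            else:
                self.sequence = 0
            self.last_ms = now_ms
            # Real implementations also guard against the clock moving backward.
            timestamp = now_ms - CUSTOM_EPOCH_MS
            return (timestamp << 22) | (self.machine_id << 12) | self.sequence

gen = SnowflakeGenerator(machine_id=7)
print(gen.next_id())
```

Because the timestamp occupies the most significant bits, IDs generated later compare greater, which is what gives this scheme its time-sortable property.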
Pros
Twitter Snowflake uses the timestamp as the first component, so the IDs are time-sortable. The ID generator is highly available as well.

Cons
IDs generated in a dead period are a problem. The dead period is when no request to generate an ID is made to the server. These IDs will be wasted since they take up identifier space. The possible unique range will deplete earlier than expected and create gaps in our global set of user IDs.

The following table gives an overview of the requirements fulfilled using the different design approaches:

Summary
We want to avoid duplicate identifiers. Consider what will happen if duplicate payments or purchase orders are generated.
UUIDs provide probabilistic guarantees of non-collision. Getting deterministic non-collision guarantees might require consensus among different distributed entities, or writes to and reads from a replicated store.
As key length becomes large, it often causes slower tuple updates in a database. Therefore, identifiers should be big enough but not too big.
Often, it's desirable that no one can guess the next ID. Otherwise, undesirable data leaks can happen, and the organization's competitors may learn how many orders were processed in a day by simply looking at order IDs. Adding a few random bits to the identifier makes it hard to guess, although this comes at a performance cost.
We can use simple counters to generate unique IDs if we don't want to relate IDs to time. Fetching timestamps is slower than using simple counters, though counters require persistently storing the generated IDs. The counter needs to be stored in the database. Storage comes with its own issues: multiple concurrent writes can overwhelm the database, and the database becomes a single point of failure.
Physical clocks are unreliable. For such clocks, the error can be 17 seconds per day. If we measure time using these clocks on a server, the time drifts away. So, we cannot avoid the error when using physical clocks.

Design a Client-Side Monitoring Service

Monitoring: metrics and alerting
A good monitoring system needs to clearly define what to measure and in what units (metrics). The monitoring system also needs to define threshold values for all metrics and have the ability to inform appropriate stakeholders (alerts) when values are out of acceptable ranges. Knowing the state of our infrastructure and systems ensures service stability. The support team can respond to issues more quickly and confidently if they have access to information on the health and performance of the deployments. Monitoring systems that collect measurements, show data, and send warnings when something appears wrong are helpful for the support team.

What are the conventional approaches to handling failures in IT infrastructure?
The two conventional approaches to handling IT infrastructure failures are the reactive and proactive approaches. In the reactive approach, corrective action is taken after the failure occurs. Even if DevOps quickly finds the cause of the error and promptly handles the failure, the system still experiences downtime, which is generally undesirable for continuously running applications. In the proactive approach, corrective actions are taken before the failure occurs.
Therefore, it prevents downtime and the associated losses. The proactive approach works by predicting system failures so that corrective action can be taken to avoid them. This approach offers better reliability by preventing downtime.

In modern services, completely avoiding problems is impossible. Something is always failing inside huge datacenters and network deployments. The goal is to find impending problems early on and design systems in such a way that service faults are invisible to the end users.

Metrics
Metrics objectively define what we should measure and what units will be appropriate. Metric values provide an insight into the system at any point in time. For example, a web server's ability to handle a certain amount of traffic per second, or its membership in a web server pool, is high-level data correlated with the component's specific purpose or activity. Another example is measuring network performance in terms of throughput (megabits per second) and latency (round-trip time).

We need to collect the values of metrics with a minimal performance penalty. We may measure this penalty using user-perceived latency or the amount of computational resources consumed. Values that track how many physical resources our operating system uses can be a good starting point. If we have a monitoring system in place, we don't have to do much additional work to get data regarding processor load, CPU statistics like cache hits and misses, RAM usage by the OS and processes, page faults, disk space, disk read and write latencies, swap space usage, and so on. Metrics provided by many web servers, database servers, and other software help us determine whether everything is running smoothly or not.

Populate the metrics
The metrics should be logically centralized for global monitoring and alerting purposes. Fetching metrics is crucial to the monitoring system. Metrics can either be pushed or pulled into a monitoring system, depending on the user's preference. Here, we confront a fundamental design challenge: do we use push or pull? Should the server proactively send the metrics' values out, or should it only expose an endpoint and wait reactively for an inquiry?

In the pull strategy, each monitored server merely needs to keep the metrics in memory and expose them at an endpoint. The endpoint allows the monitoring application to fetch the metrics itself. Servers can't overload the monitoring system by sending too much data or sending it too frequently, because the monitoring system pulls data on its own schedule.

In other situations, pushing may be beneficial, such as when a firewall prevents the monitoring system from accessing the server directly. The monitoring system has the ability to adjust a global configuration for the data to be collected and the interval at which servers and switches should push the data.

Push and pull terminologies might be confusing. We'll consider the push or pull strategy from the monitoring system's perspective whenever we discuss it. Either the system will pull the metric values from the applications, or the metrics will be pushed to the monitoring system. To avoid confusion, we'll stick to the monitoring system's viewpoint.
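As a rough illustration of the pull strategy, a monitored server can keep a few counters in memory and expose them at an HTTP endpoint, while the monitoring system scrapes them on its own schedule. The metric names, port number, and scrape interval below are made-up values for the sketch, not part of any real monitoring product.

```python
import json
import threading
import time
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# In-memory metrics kept by the monitored server (hypothetical values).
METRICS = {"requests_total": 0, "errors_total": 0}

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/metrics":
            body = json.dumps(METRICS).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):  # keep the demo output quiet
        pass

def run_monitored_server():
    # The monitored server only exposes its metrics; it never pushes them.
    HTTPServer(("localhost", 9100), MetricsHandler).serve_forever()

def pull_loop(interval_s: float = 1.0, rounds: int = 3):
    """The monitoring system pulls /metrics on its own schedule."""
    for _ in range(rounds):
        with urllib.request.urlopen("http://localhost:9100/metrics") as resp:
            print("scraped:", json.loads(resp.read()))
        time.sleep(interval_s)

threading.Thread(target=run_monitored_server, daemon=True).start()
time.sleep(0.2)                    # give the server a moment to start
METRICS["requests_total"] += 5     # simulate some traffic being counted
pull_loop()
```

The key property of the pull model is visible here: the scrape frequency is decided entirely by the puller, so a chatty server cannot overwhelm the monitoring system.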
Logging is the act of keeping records of events in a software system. How does it help in monitoring?
In logging, the application servers log information into files. The information can be CPU usage, application-related events, and other relevant properties necessary to backtrace or debug a problem when one is encountered. We can populate our metrics based on the values recorded in the logs. Logs and metrics both help in monitoring a service, but logs alone aren't sufficient because processing log information takes time, and we need to act swiftly for early detection of issues. So, logging is one of the inputs of metrics: it's just a mechanism to collect information, and the monitoring system can use it to gather the necessary information. Logging can also help to temporarily keep the data on a server to absorb momentary data spikes or to decouple data generation from the monitoring system.

Persist the data
It is crucial to determine the appropriate storage method for the monitored servers' metrics. A centralized in-memory metrics repository may be all that's needed. However, for a large data center with millions of things to monitor, there will be an enormous amount of data to store, and a time-series database can help in this regard.

Application metrics
We may need to add code or APIs to expose the metrics we care about for other components, notably our own applications. We embed logging or monitoring code in our applications, called code instrumentation, to collect information of interest.

Looking at metrics as a whole can shed light on how our systems are performing and how healthy they are. Monitoring systems employ these inputs to generate a comprehensive view of our environment, automate responses to changes (like commissioning more EC2 instances if the applications' traffic increases), and warn humans when necessary. Metrics are system measurements that allow us to analyze historical trends, correlations, and changes in performance, consumption, or error rates.

Alerting
Alerting is the part of a monitoring system that responds to changes in metric values and takes action. There are two components to an alert definition: a metrics-based condition or threshold, and an action to take when the values fall outside of the permitted range.

Client-side errors
Clients often access the service via an HTTP request in a distributed system. We can monitor our web and application servers' logs if a request fails to process. If multiple requests fail, we can observe a spike in internal errors (error 500). Those errors whose root cause is on the client's side are hard to respond to because the service has little to no insight into the client's system. We might look for a dip in the load compared to averages, but such a graph is usually hard to interpret. It can have false positives and false negatives due to factors such as unexpected variable load or a small portion of the client population being affected.

Many factors can cause failures that result in clients being unable to reach the server. These include the following:
Failure in DNS name resolution.
Any failure in routing from the client to the service provider.
Any failures with third-party infrastructure, such as middleboxes and CDNs.

Design of a client-side monitoring system
A service has no visibility into errors that don't occur within its own infrastructure.

Initial design
We'll act as clients and perform reachability and health checks to ensure that the clients' requests reach the server.
We'll need various vantage points across the globe. We can run a service—let's call it a prober—that periodically sends requests to the service to check availability. This way, we can monitor the reachability of our service from many different places.

Issues with probers
We can have the following issues with probers:
Incomplete coverage: We might not have good coverage across all autonomous systems. There are 100,000 unique autonomous systems on the Internet as of March 2021. It's not cost-effective or even possible to put that many probers across the globe. Country- or ISP-specific regulations and the need for periodic maintenance are additional hurdles to implementing such a scheme.
Lack of user imitation: Such probers might not imitate typical user behavior, so they can't explain how a typical user experiences the service.

Improve the design
Instead of running probers at vantage points, we can embed the probers into the actual client application. We'll have the following two components:
Agent: This is a prober embedded in the client application that sends appropriate reports about any failures of the service.
Collector: This is a report collector independent of the primary service. It's kept independent to avoid situations where client agents have to report an error to the very service that failed.

We summarize error reports from collectors and look for spikes in the errors graph to spot client-side issues. These collectors are a hierarchy of big data processing systems. We can place them near the client network, and over time, we can accumulate these statistics from all such localized sites. We'll use online stream processing systems to make such a system near real time. If we're mainly looking for summary statistics, our system can tolerate the loss of some error reports. "Some" here is relative to the overall user population—for example, 1% of service users. If we don't want to lose any reports, we'll need to design the system with more care, which will be more expensive.

Now, we'll address the following concerns:
Can a user activate and deactivate client-side reports?
How do client-side agents reach collectors under faulty conditions?
How will we protect user privacy?

Activate and deactivate reports
We'll use a custom HTTP header to send the appropriate information to the collectors. Though a client accesses the service via a browser, the browser needs to know about this feature to appropriately fill in the header information in HTTP requests. For organizations that make browsers and provide services (for example, Chromium-based browsers), such features can be incorporated and standardized over time. Another solution can be to use a client-side application that the service controls, which can easily include such headers over HTTP. The client fills in the request header if it has already consented to that. The service can then reply with appropriate values for the policy and collection endpoints.

Reach collectors under faulty conditions
The collectors need to be in a different failure domain from the web service endpoint that we're trying to monitor. The client side can try various collectors in different failure domains until one works. We can see a similar pattern in the following examples. At times, we refer to such a phenomenon as being outside the blast radius of a fault. If we want to monitor the reachability of an IP, we host the collector on a different IP. If we monitor the availability of a domain, we host the collector on a different domain.
And if we want to detect that an autonomous system route isn't hijacked, we host the collector in a different autonomous system. However, for last-mile errors, there isn't much we can do as a service provider. We might accumulate such events on the client side and report them once connectivity is restored. A service can influence the remaining component failures.

Protect user privacy
The human user who uses the client-side software should be in full control and should know precisely what data is collected and sent with each request. The user should also be able to activate or deactivate the feature any time they wish. If we use our own client-side application (and not a browser), we have a lot of flexibility in what diagnostics can be included in the report. For a browser-based client, we can avoid the following information:
We can avoid including traceroute hops that map the path from the client to the service. Users can be sensitive about their geographic location, and this might be akin to collecting location information.
We can avoid including which DNS resolver is being used. Again, details of the DNS can leak some information about the location.
We can avoid including round-trip time (RTT) and packet loss information.

Note: As a guiding rule, we should try to collect as little information as possible, and it must only be used for the specific purpose a user gave consent for.

Ideally, for a web-based client, we should only collect the information that would be logged in the web server's log when a request is successful. We shouldn't use any active probing except to test the service's standard functionality and report such probes' results. So, traceroute, RTT, and packet loss information is excluded. Due to encryption, intermediaries (like ISPs or middleboxes) can't change, add, or remove the error reporting mechanism. Similarly, the designated collectors are the only place where such data can go.

Conclusion
In a distributed system, detecting and responding to errors on the client side is difficult. So, monitoring such events is necessary to provide a good user experience. We can handle errors using an independent agent that sends reports about any failures to a collector. Such collectors should be independent of the primary service in terms of infrastructure and deployment.

Design a Server-Side Monitoring Service
It's challenging to know what's happening at the hardware or application level when our infrastructure is distributed across multiple locations and includes many servers. Components can fail, response latencies can overshoot, hardware can become overloaded or unreachable, and containers can run out of resources, among other issues. Multiple services run in such an infrastructure, and anything can go awry. When one of the services goes down, it can cause other services to crash, and as a result, the application becomes unavailable to users. If we don't find out what went wrong early, it could take us a lot of time and effort to debug the system manually.

Monitoring helps in analyzing such complex infrastructure, where something is constantly failing. Monitoring distributed systems entails gathering, interpreting, and displaying data about the interactions between processes that are running at the same time. It assists in debugging, testing, performance evaluation, and having a bird's-eye view over multiple services. Server-side errors are usually visible to monitoring services because they occur on the servers. Such errors are reported as 5xx HTTP response codes.
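To give a flavor of how such server-side errors surface as a metric, here is a small sketch that computes the 5xx error rate from access-log lines and flags a spike. The log format, sample lines, and alert threshold are illustrative assumptions, not any particular web server's format.

```python
from collections import Counter

# Illustrative access-log lines: "<method> <path> <status_code>"
LOG_LINES = [
    "GET /index.html 200",
    "GET /api/items 500",
    "POST /api/orders 503",
    "GET /static/app.js 200",
    "GET /api/items 500",
]

def error_rate_5xx(lines) -> float:
    """Fraction of requests that returned a 5xx status code."""
    first_digits = Counter(line.split()[-1][0] for line in lines)
    total = sum(first_digits.values())
    return first_digits["5"] / total if total else 0.0

ALERT_THRESHOLD = 0.05  # hypothetical: alert if more than 5% of requests fail server-side
rate = error_rate_5xx(LOG_LINES)
print(f"5xx error rate: {rate:.0%}")
if rate > ALERT_THRESHOLD:
    print("ALERT: server-side error spike detected")
```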
In a distributed system, why is a dedicated monitoring solution necessary instead of simply relying on individual server logs?
In a distributed system, a dedicated monitoring solution is necessary for several reasons:
1. Centralized visibility across the entire system: It allows for an aggregated view of the system's health and performance, making it easier to oversee and manage.
2. Detecting correlations and patterns that individual logs would miss: By analyzing data from across the system, it can identify issues that are not apparent when looking at single components.
3. Proactive alerts and faster troubleshooting: It enables the system to alert on anomalies and helps in quickly pinpointing the root cause of issues, leading to faster resolution times.

Requirements
Let's sum up what we want our monitoring system to do for us:
Server monitoring: This includes monitoring critical local processes on a server for crashes and detecting any anomalies in CPU, memory, disk, or network usage by server processes. Additionally, overall server health is monitored, encompassing metrics like CPU, memory, disk, network bandwidth, and average load.
Hardware monitoring: This involves monitoring hardware component faults on a server, such as memory failures, a failing or slowing disk, and so on.
Datacenter infrastructure monitoring: This involves monitoring all network switches, load balancers, and other specialized hardware within the datacenter. Additionally, monitoring power consumption at the server, rack, and datacenter levels ensures optimal resource allocation and energy efficiency. Furthermore, it is important to monitor the overall service health, which can span multiple datacenters, such as monitoring the performance of a CDN (Content Delivery Network).
Network monitoring: This involves monitoring routing information and DNS for external clients to ensure efficient communication and connectivity. Additionally, monitoring the latency of network links and paths within and across data centers helps identify any bottlenecks or issues affecting network performance. It is also important to monitor the network status at peering points, which ensures smooth connectivity with external networks.

We want automated monitoring that identifies an anomaly in the system and informs the alert manager or shows the progress on a dashboard. Cloud service providers publish the health status of their services:
AWS: https://health.aws.amazon.com/health/status
Azure: https://status.azure.com/en-us/status
Google: https://status.cloud.google.com/

High-level design
The high-level components of our monitoring service are the following:
Storage: A time-series database stores metrics data, such as the current CPU usage or the number of exceptions in an application.
Data collector service: This fetches the relevant data from each service and saves it in the storage.
Querying service: This is an API that can query the time-series database and return the relevant information.

Storage
We'll use time-series databases to save the data locally on the server where our monitoring service is running. Then, we'll integrate it with a separate storage node. We'll use blob storage to store our metrics. We need to store metrics and know which action to perform when a metric reaches a particular value. For example, if CPU usage exceeds 90%, we generate an alert to the end user so the alert receiver can take the necessary steps, such as allocating more resources to scale.
For this purpose, we need another storage area that will contain the rules and actions. Let's call it a rules database. Upon any violation of the rules, we can take appropriate action (a small sketch of such a rule check appears at the end of this section). Here, we have identified two more components in our design—that is, a rules and action database and a storage node (a blob store).

Data collector
We need the monitoring system to keep us updated about our servers across several data centers. We can stay updated if the information about our processes reaches us, which is possible through logging. We'll choose a pull strategy and then extract the relevant metrics from the application logs. As discussed in our logging design, we used a distributed messaging queue. The message in the queue has the service name, an ID, and a short description of the log. This helps us identify the metric and its information for a specific service. Exposing the relevant metrics to the data collector is necessary for monitoring any service, so that our data collector can get the metrics from the service and store them in the time-series database.

A real-world example of a monitoring system based on the pull-based approach is DigitalOcean. It monitors millions of machines that are dispersed globally.

What are some drawbacks of using a push-based approach?
A push-based monitoring tool collects metric data from the applications and servers and sends it to the monitoring system.
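Returning to the rules database mentioned above, the sketch below stores threshold rules and checks incoming metric values against them, which is roughly what the alerting path over the rules and action database would do. The rule fields, metric names, and actions are illustrative assumptions.

```python
# Illustrative rules, as they might be stored in the rules database.
RULES = [
    {"metric": "cpu_usage", "op": ">", "threshold": 0.90, "action": "page_oncall"},
    {"metric": "disk_free_gb", "op": "<", "threshold": 10, "action": "email_team"},
]

OPS = {">": lambda value, limit: value > limit,
       "<": lambda value, limit: value < limit}

def evaluate(metric_values: dict) -> list:
    """Return the (metric, action) pairs triggered by the current metric values."""
    triggered = []
    for rule in RULES:
        value = metric_values.get(rule["metric"])
        if value is not None and OPS[rule["op"]](value, rule["threshold"]):
            triggered.append((rule["metric"], rule["action"]))
    return triggered

# Example: CPU has crossed the 90% threshold, so the on-call action fires.
print(evaluate({"cpu_usage": 0.93, "disk_free_gb": 42}))
```

Keeping rules in their own store lets operators add or tune thresholds without touching the collector or the time-series database.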
