Questions and Answers
What is essential to start with for giant-scale services?
A professional data center and layer-7 switches
Which availability metric should be focused on as much as MTBF?
Data replication is sufficient for preserving uptime under faults.
False
What does intelligent admission control help implement?
Use DQ analysis on all _____ to ensure reliability.
What should be developed to minimize downtime during upgrades?
Eric A. Brewer founded the Federal Search Foundation in 2001.
Who is the Chief Scientist of Inktomi?
What is one of Eric A. Brewer's research interests?
What is one load-management approach that uses custom nodes for session management?
Which approach includes clients in the load-management process?
Round-robin DNS assigns different servers to different clients to achieve simple load balancing.
What is the defined formula for yield?
What does MTTR stand for?
What is the preferred focus for giant-scale systems regarding availability?
What happens to the effective size of a partitioned persistent store during a node failure?
A perfect system would have 100 percent yield and 100 percent harvest.
Which metric focuses on the fraction of completed queries?
Which system aims to maintain 100 percent harvest under a fault?
What are giant web services?
The focus of the article is on wide-area issues such as network partitioning.
Which of the following is NOT a component of the basic model for giant-scale services?
What advantage does centralizing infrastructure services offer?
What is the primary role of the load manager in giant-scale services?
Giant-scale services should maintain _____ availability to meet user expectations.
What is a significant challenge mentioned in relation to giant-scale services?
How many people are predicted to have Internet access within the next ten years?
In which type of traffic do read-only queries outnumber updates?
Clusters in giant-scale services are used for independent faults.
What load redirection method does the Inktomi search engine use?
What is the implication of losing two of five nodes in a replica group?
Replication on disk is cheap, but accessing the replicated data requires __________ points.
Graceful degradation mechanisms are critical for delivering high availability.
What can cause traffic to exceed average levels in online ticket sales?
What basic constraints must be taken into account for graceful degradation?
Which method guarantees that stock trade requests will be executed within 60 seconds?
What is the role of dynamic database reduction?
Natural disasters affect only one replica at a time.
What is one approach to perform online evolution?
During the 'big flip', we __________ switch all traffic to the upgraded nodes.
Study Notes
Giant-Scale Services Overview
- Web portals and ISPs such as AOL and Yahoo have grown more than tenfold in five years.
- Infrastructure services, which include instant messaging and various remote access applications, are an essential focus.
Key Requirements of Giant-Scale Services
- Need for high availability, particularly for major platforms like eBay and CNN.
- Services must always be available, requiring robust infrastructure to handle growth and evolution.
Basic Model for Giant-Scale Services
- Services rely on a load manager to balance traffic among servers, enhancing availability.
- Clients access services over the internet, utilizing a best-effort IP network.
- The load manager serves as an intermediary between the service's external name and the servers' IP addresses, keeping the external name reachable when individual servers fail.
Advantages of the Basic Model
- Access Anywhere: Facilitates user access from various locations and devices, including set-top boxes and smart devices.
- Cost Efficiency: Centralized infrastructure allows for better resource utilization compared to standalone devices.
- Groupware Support: Centralizes data for collaboration tools, improving functionality for applications like teleconferencing and group management.
- Efficient Upgrades: Services can be upgraded, or new services offered, without the physical distribution that traditional applications and devices require.
Clusters in Giant-Scale Services
- Clusters consist of multiple commodity servers functioning together to meet high scalability requirements.
- Example deployments:
  - AOL Web cache: over 1,000 nodes, processing 10 billion queries/day.
  - Inktomi search engine: over 1,000 nodes, more than 80 million queries/day.
- Nodes generally have a three-year depreciation timeline, providing scalability as service needs grow.
Load Management Advances
- Modern load management utilizes layer-4 and layer-7 switches to monitor server health and distribute traffic effectively.
- Methods include:
  - Round-robin DNS for basic load balancing.
  - Session management via service-specific front-end nodes.
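As a rough illustration of the difference between plain rotation and health-aware distribution, the sketch below rotates through servers and skips any that have been marked down. The class and method names are hypothetical, and real layer-4 or layer-7 switches do this in the network rather than in application code.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hand out servers in rotation, skipping any marked as down."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.down = set()
        self._ring = cycle(self.servers)

    def mark_down(self, server):
        self.down.add(server)

    def mark_up(self, server):
        self.down.discard(server)

    def next_server(self):
        # Try each server at most once per call so the loop always ends.
        for _ in range(len(self.servers)):
            candidate = next(self._ring)
            if candidate not in self.down:
                return candidate
        raise RuntimeError("no healthy servers available")
```

Without the `down` set this reduces to plain round-robin, which keeps handing out a failed server's address; tracking health is what lets the load manager mask faults.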
Challenges and Considerations
- Downtime prevention is critical, requiring automatic detection and isolation of non-functioning nodes.
- Multiple load management strategies ensure service continuity and resilience during failures.
Persistent Data Store
- Data storage across servers uses replicated or partitioned approaches to maintain data availability and integrity.
- Includes options for network-attached storage systems to enhance overall system performance.
Implications for Design and Evolution
- Focus on scalability, availability, graceful degradation, and ease of upgrading is crucial for meeting user expectations.
- Equipment and operational costs are often weighed against the service bandwidth demands and service quality.
System Complexity and Design
- A simple Web farm utilizes round-robin DNS for load management with a persistent data store achieved by replicating all content to all nodes.
- Web farms typically experience no coherence traffic and may not require a backplane, although a secondary LAN for manual updates is common.
- In contrast, a search engine cluster supports external programs (e.g., Web servers) rather than end users directly; these programs connect via layer-4 switches that balance load and mask faults.
- Persistent data in search engine clusters is partitioned across servers, which increases aggregate capacity but means some data is unavailable while a server is down.
High Availability Metrics
- High availability is critical for large-scale systems; uptime, typically expressed in “nines,” is the fraction of time a system is operational.
- Mean Time Between Failures (MTBF) and Mean Time To Repair (MTTR) are key components impacting uptime, with uptime calculable as:
  uptime = (MTBF – MTTR) / MTBF
- Focus on improving MTTR is encouraged, as repair time is easier to control and to measure than MTBF is to increase.
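A small worked example of the uptime formula, using made-up MTBF and MTTR figures in hours, suggests why MTTR deserves as much attention as MTBF: halving repair time buys the same uptime as doubling the time between failures.

```python
def uptime(mtbf_hours, mttr_hours):
    """uptime = (MTBF - MTTR) / MTBF"""
    return (mtbf_hours - mttr_hours) / mtbf_hours


# Hypothetical figures, chosen only for illustration.
baseline = uptime(1000, 1)        # one failure per 1,000 h, 1 h to repair
doubled_mtbf = uptime(2000, 1)    # failures half as frequent
halved_mttr = uptime(1000, 0.5)   # repairs twice as fast

print(baseline, doubled_mtbf, halved_mttr)
```

Both improvements lift uptime from three nines (0.999) to 0.9995.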
Availability Measurement Terms
- Yield: Fraction of completed queries, calculated as
  yield = queries completed / queries offered
- Harvest: Fraction of available data relative to the complete dataset, defined as
  harvest = data available / complete data
- A perfect system would achieve 100% yield and 100% harvest, ensuring each query reflects the entire database.
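The two definitions above can be sketched in a few lines. The example also illustrates one reason yield can be more informative than uptime: an outage of the same length costs far more queries at peak than off-peak. The hourly traffic profile is hypothetical.

```python
def yield_fraction(completed, offered):
    """yield = queries completed / queries offered"""
    return completed / offered


def harvest_fraction(available, complete):
    """harvest = data available / complete data"""
    return available / complete


def yield_with_outage(hourly_offered, outage_hour):
    """Yield over the whole period if every query in one hour is lost."""
    offered = sum(hourly_offered)
    lost = hourly_offered[outage_hour]
    return yield_fraction(offered - lost, offered)


# Hypothetical profile: three quiet hours and one peak hour.
hourly_offered = [100, 100, 100, 700]
off_peak = yield_with_outage(hourly_offered, 0)  # 0.9
peak = yield_with_outage(hourly_offered, 3)      # 0.3
```

In both cases the system was up for three of four hours, so uptime is identical, yet the yield differs by a factor of three.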
DQ Principle
- The DQ Principle states that the product of data per query (D) and queries per second (Q) tends toward a constant for a given system under normal conditions.
- Design should accommodate capacity constraints tied to physical limitations like total I/O bandwidth, particularly at high utilization common in large-scale systems.
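Treating the DQ value as a fixed budget gives a quick way to reason about trade-offs. The sketch below assumes a hypothetical capacity in arbitrary "data units per second"; only the ratio matters.

```python
def max_queries_per_second(dq_capacity, data_per_query):
    """With D * Q held constant, Q is the capacity divided by D."""
    return dq_capacity / data_per_query


DQ_CAPACITY = 1_000_000  # hypothetical figure for illustration

full_data = max_queries_per_second(DQ_CAPACITY, 1000)  # 1000.0 queries/s
half_data = max_queries_per_second(DQ_CAPACITY, 500)   # 2000.0 queries/s
```

Halving the data touched per query doubles the sustainable query rate, which is the lever that graceful-degradation schemes pull.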
Replication vs. Partitioning
- Replication increases availability by maintaining complete data but reduces yield when failures occur; partitioning maintains yield better under fault conditions.
- Both methods start with the same DQ value and lose the same DQ capacity under a fault; the difference is where the loss shows up: replication sees a yield reduction, while partitioned systems preserve yield but experience reduced harvest.
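The contrast can be made concrete with a minimal model. It assumes the system was running at full utilization before the fault, that nodes are identical, and that at least one node survives; the function name and the returned dictionary are illustrative only.

```python
def after_fault(strategy, total_nodes, failed_nodes):
    """Return harvest, yield, and remaining DQ as fractions of normal."""
    if not 0 <= failed_nodes < total_nodes:
        raise ValueError("need at least one surviving node")
    surviving = (total_nodes - failed_nodes) / total_nodes
    if strategy == "replicated":
        # Every survivor holds all the data, but fewer queries fit.
        return {"harvest": 1.0, "yield": surviving, "dq": surviving}
    if strategy == "partitioned":
        # Every query is still answered, but from less of the data.
        return {"harvest": surviving, "yield": 1.0, "dq": surviving}
    raise ValueError(f"unknown strategy: {strategy}")
```

For a two-node system losing one node, the replicated variant gives 100% harvest and 50% yield, the partitioned variant gives 50% harvest and 100% yield, and both keep half their DQ.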
Capacity and Overload Considerations
- When a system experiences failures, redirected loads can drastically increase stress on remaining nodes, complicating server management.
- Replication is often deemed inefficient under high utilization without adequate excess capacity to handle failures, emphasizing the need for effective load-balancing solutions.
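The extra stress on surviving nodes follows from simple arithmetic: if k of n identical nodes fail and their load is spread evenly over the rest, each survivor picks up k/(n-k) extra load, for an overload factor of n/(n-k). The sketch below uses exact fractions and assumes evenly balanced load.

```python
from fractions import Fraction


def redirected_load(n, k):
    """Extra load per surviving node, as a fraction of its normal load."""
    return Fraction(k, n - k)


def overload_factor(n, k):
    """Total load per surviving node relative to normal."""
    return Fraction(n, n - k)


print(redirected_load(5, 2), overload_factor(5, 2))  # prints "2/3 5/3"
```

Losing two of five nodes therefore pushes each remaining node to 5/3 of its normal load, which it can absorb only if it was running well below capacity.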
Graceful Degradation
- Mechanisms for graceful degradation become crucial during excess load conditions to maintain service availability.
- The DQ principle offers methods for graceful degradation, including limiting query capacity or reducing data to improve overall performance.
- Strategies may involve admissions controls to lessen query load or dynamic database reductions to lower the amount of data processed, thereby increasing the effective operational capacity.
Giant-Scale Services Overview
- Emphasis on graceful degradation to manage system availability during failures and saturation.
- Partitioned systems can replicate key data to enhance reliability, allowing a backup node to take over if the primary fails.
Cost-Based Admission Control (AC)
- Inktomi employs dynamic AC based on estimated query costs, balancing data and query metrics.
- Reducing data (D) during high demand allows for increased query capacity (Q), optimizing service provision.
- Simplistic policies may reduce D too aggressively, risking performance.
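One way to picture cost-based admission control is a greedy policy that admits the cheapest queries first until a DQ budget for the interval is exhausted. This is a hypothetical policy written for illustration, not a description of Inktomi's actual mechanism; the function name, the `(query_id, estimated_dq_cost)` tuples, and the budget are all assumptions.

```python
def admit_by_cost(queries, dq_budget):
    """Admit queries in ascending order of estimated DQ cost.

    queries: list of (query_id, estimated_dq_cost) tuples.
    Returns (admitted_ids, denied_ids).
    """
    admitted, denied = [], []
    remaining = dq_budget
    for query_id, cost in sorted(queries, key=lambda q: q[1]):
        if cost <= remaining:
            admitted.append(query_id)
            remaining -= cost
        else:
            denied.append(query_id)
    return admitted, denied


queries = [("big", 80), ("a", 20), ("b", 30), ("c", 40)]
admitted, denied = admit_by_cost(queries, dq_budget=100)
# admitted == ["a", "b", "c"], denied == ["big"]
```

Admitting in arrival order with the same budget would serve only "big" and "a"; denying the one expensive query makes room for three inexpensive ones, raising Q and yield.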
Priority-Based Admission Control
- Datek prioritizes stock trade queries, ensuring execution within a strict time frame, enhancing user experience.
- Low-value queries may be denied to preserve resources for higher-priority requests.
Data Freshness and Saturation
- Financial services may allow for less frequent updates on stock quotes during system saturation, trading off data accuracy for performance.
- Cached data may not reflect the current state, impacting user experience and system yield.
Disaster Tolerance Strategies
- Effective disaster recovery involves managing replica groups and implementing graceful degradation to mitigate failover impacts.
- Diversifying locations for replicas minimizes the risk from localized disasters.
Online Evolution and Migration
- Software and system upgrades are crucial for maintaining giant-scale services, with a focus on minimizing downtime and preserving quality.
- Upgrades can be executed through fast reboot, rolling upgrades, or "big flip," each impacting system availability differently.
Upgrade Approaches
- Fast Reboot: Quick system restart; dependent on staging areas to minimize service disruption.
- Rolling Upgrade: Sequential node upgrades; maintains service continuity with minimal capacity loss.
- Big Flip: Upgrades the cluster one half at a time, atomically switching all traffic to the upgraded half; complex but effective for substantial changes, allowing for controlled failovers.
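The three approaches differ in how many nodes are down at once and for how long, but a simple model suggests they cost the same total DQ. The sketch assumes n identical nodes (n even for the big flip), an upgrade time u per node, and a hypothetical per-node DQ figure.

```python
def upgrade_profile(approach, n, u):
    """Return (nodes_down_at_once, duration) for an n-node cluster."""
    if approach == "fast_reboot":
        return n, u          # everything down briefly
    if approach == "rolling":
        return 1, n * u      # one node at a time, n rounds
    if approach == "big_flip":
        return n / 2, 2 * u  # half the cluster at a time, two rounds
    raise ValueError(f"unknown approach: {approach}")


def dq_loss(approach, n, u, dq_per_node):
    """Total DQ lost = nodes down * time down * DQ per node."""
    nodes_down, duration = upgrade_profile(approach, n, u)
    return nodes_down * duration * dq_per_node


losses = {a: dq_loss(a, n=10, u=5, dq_per_node=100)
          for a in ("fast_reboot", "rolling", "big_flip")}
```

All three entries come out to 5,000 DQ units here, so the choice between them is about how the loss is spread over time and whether old and new versions can coexist, not about the total amount lost.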
Key Lessons for Scalable Systems
- Establish a professional infrastructure with appropriate metrics for availability, focusing on both uptime and user experience.
- Monitor and measure performance with tools like DQ analysis to inform system operations and upgrades.
Automation and Control
- Maximize automation in upgrades to minimize disruptions, integrating smart clients for enhanced disaster recovery.
- Anticipate and plan for fault management through intelligent resource allocation and analysis.
Conclusion
- Understanding and managing availability metrics is critical in designing resilient giant-scale services.
- Continuously evolving systems require a balance between minimal changes and effective upgrades to maintain high performance.
Description
Test your knowledge on the lessons learned from giant-scale web services. This quiz explores the new tools and methods required to handle issues in scalable internet services. Perfect for those interested in web technology and infrastructure.