The Tail at Scale: Concepts, Techniques and Impact
Causes of High Tail Latency
Even a single server can suffer from high tail latency due to a variety of factors:
Shared Resources: Multiple applications or even different parts of the same application compete for a limited supply of CPU cores, memory, and network bandwidth, leading to unpredictable delays for some requests.
Background Activities: Periodic background tasks like data reconstruction in distributed file systems or log compactions can cause sudden, temporary spikes in resource consumption, slowing down user-facing requests.
Queueing: Queueing at various layers of the system, including in servers and network switches, amplifies this variability. The dynamic nature of traffic means queue lengths are constantly changing, leading to inconsistent wait times for requests.
Hardware Trends: Modern hardware also contributes to variability. Power limits and energy management features can temporarily throttle CPU performance to save energy or prevent overheating.
Garbage Collection: The internal garbage collection process of an SSD can cause I/O operations to pause, introducing latency spikes that are outside the control of the application.
The Amplification Effect
A key insight of the paper is that tail latency is amplified by scale. This means that even if a single server is highly reliable and only has a rare, momentary performance slowdown, this rare event becomes a common occurrence for a user request that depends on many such servers. For a system that needs to contact 100 different servers in parallel, the probability of the entire request succeeding quickly is the product of the individual success probabilities of all 100 servers. As the number of servers grows, the chance of at least one of them being slow skyrockets.
The paper uses a simple example: if a user request must collect responses from 100 servers in parallel and each of these 100 servers has a 1% chance of being slow, the probability that the overall request is slow is roughly 63%. This turns an individually rare tail event into a significant factor affecting a large percentage of user requests.
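The arithmetic behind this example can be checked directly. A minimal sketch (the 100-server fan-out and 1% slow probability are the paper's example figures; independence of slowdowns across servers is assumed):

```python
def p_request_slow(n_servers: int, p_server_slow: float) -> float:
    """Probability that at least one of n parallel calls is slow,
    assuming slowdowns are independent across servers."""
    return 1.0 - (1.0 - p_server_slow) ** n_servers

# Each server is slow only 1% of the time, yet a 100-way fan-out
# request hits at least one slow server ~63% of the time.
print(round(p_request_slow(100, 0.01), 2))  # → 0.63
```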
Reducing Response-Time Variability
To combat response-time variability, the paper proposes a number of strategies:
Differentiating Service Classes: Distinguishing between latency-sensitive user-facing tasks and non-critical background jobs to ensure that user requests get priority access to resources.
Reducing Head-of-Line Blocking: Head-of-Line (HOL) blocking is a performance issue in computer networks and other systems where a long-running or stalled request in a queue blocks all subsequent requests in that queue from being processed, even if they are ready to go. In such a situation it is useful for the system to break long-running requests into a sequence of smaller requests to allow interleaving of the execution of other short-running requests.
For example, Google’s Web search system uses time-slicing to prevent a small number of very computationally expensive queries from adding substantial latency to a large number of concurrent cheaper queries.
Another example is the QUIC transport protocol, which eliminates transport-layer head-of-line blocking through stream multiplexing: many independent, bidirectional streams share a single connection. When a packet is lost, only the streams whose data it carried must wait for the retransmission; all other streams continue to make progress. In TCP, by contrast, a single packet loss stalls delivery of all subsequent data on the connection.
Managing Background Activities: Background tasks like garbage collection or log compaction can spike resource usage and increase tail latency. To mitigate this, the authors suggest the following strategies.
Background tasks can be throttled, broken into smaller operations, or scheduled to run during times of low overall load. This prevents them from overwhelming system resources.
For large systems, it can be beneficial to synchronize background activity across many machines. This forces a brief, simultaneous burst of activity that impacts only a few requests at a time, preventing a constant, low-level increase in tail latency that would otherwise affect all requests.
The paper also notes that caching does not directly address tail latency. However, later research, such as "RobinHood: Tail Latency Aware Caching" suggests that caching systems can be redesigned to address tail latency by dynamically reallocating resources from cache-rich components (backends which don’t affect request tail latency) to the cache-poor components (backends which affect request tail latency).
Tail-Tolerant Techniques
It is advantageous to develop tail-tolerant techniques because it's difficult to eliminate the sources of latency variability entirely. Even if such perfect behavior could be achieved in isolated environments, systems with shared computational resources exhibit performance fluctuations beyond the control of application developers. It is often more practical and cost-effective to build a system that can work around these issues. The paper classifies these techniques into short-term and long-term adaptations.
Within-Request Short-Term Adaptations (Client-side)
These are strategies performed within a single request to mitigate latency.
Hedged Requests: Hedged requests address tail latency by sending a duplicate request to a different server if the first one doesn't respond within a specific time. The client waits for the fastest response and then cancels the others. A more efficient approach is to wait until the first request has been outstanding for longer than the 95th percentile expected latency before sending a second request. This method significantly reduces tail latency while only adding a small amount (approx 5%) of extra load to the system. This is used in systems like gRPC, where you can configure a hedgingDelay.
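The core loop can be sketched with asyncio (the replica names, the simulated latencies, and the P95_DELAY value are illustrative assumptions, not gRPC's actual API):

```python
import asyncio
import random

P95_DELAY = 0.05  # assumed 95th-percentile latency, in seconds

async def call_replica(name: str) -> str:
    # Stand-in for an RPC; latency is simulated with a random sleep.
    await asyncio.sleep(random.uniform(0.01, 0.2))
    return f"reply from {name}"

async def hedged_request() -> str:
    primary = asyncio.create_task(call_replica("replica-1"))
    # Only hedge once the primary has been outstanding longer than p95.
    done, _ = await asyncio.wait({primary}, timeout=P95_DELAY)
    tasks = {primary}
    if not done:
        tasks.add(asyncio.create_task(call_replica("replica-2")))
    # Take whichever reply arrives first and cancel the slower duplicate.
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()
    return done.pop().result()

print(asyncio.run(hedged_request()))
```

Because the hedge fires only for the slowest ~5% of primaries, the duplicate load stays small while the worst-case wait drops sharply.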
Tied Requests: Delaying the hedged copy until the 95th-percentile latency has passed limits the extra load, but it also limits the benefit to the slowest few percent of requests. A more aggressive technique called tied requests addresses this. The client sends duplicate requests to two servers, each tagged with the identity of the other ("tied"). When one request starts executing, its server sends a cancellation message to the other server, which typically aborts the duplicate while it is still queued. This reduces queueing delay without significant duplicated work. This is used in systems like Bigtable.
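A toy in-process simulation of the cancellation idea (in a real system the cancellation is a network message between servers; here a shared set stands in for it, and the server names, queue delays, and request tag are made up):

```python
import asyncio

async def server(name, queue_delay, cancelled, tag, results):
    await asyncio.sleep(queue_delay)  # time spent waiting in this server's queue
    if tag in cancelled:              # the tied peer already started executing
        return                        # abort the duplicate while still queued
    cancelled.add(tag)                # "cancellation message" to the peer
    await asyncio.sleep(0.01)         # actual work
    results.append(name)

async def tied_request():
    cancelled, results = set(), []
    # The same request is enqueued on two servers with different queue depths.
    await asyncio.gather(
        server("fast-server", 0.00, cancelled, "req-42", results),
        server("slow-server", 0.05, cancelled, "req-42", results),
    )
    return results

print(asyncio.run(tied_request()))  # → ['fast-server']
```

Only the server that dequeues the request first does the work; the duplicate costs one queue slot and one cancellation, not a full execution.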
Probing a remote queue before submitting a request is an alternative to hedged requests. This method is less effective because the load can change between probing and submitting the request. Additionally, it is difficult to accurately estimate the service time due to system and hardware variability, and it can create hot spots if many clients choose the same seemingly least-loaded server.
Cross-Request Long-Term Adaptations (Server-side)
These are system-level strategies that adapt over time.
Micro-partitions: Micro-partitions are a technique to combat load imbalance and latency variability in distributed systems. Instead of assigning one large partition to each machine, the system creates many smaller partitions. These are then dynamically assigned and load-balanced across machines. This approach allows for smoother and faster load shedding, as partitions can be moved in smaller increments. It also improves fault recovery, as many machines can take over a small portion of the failed machine's work, speeding up the process. The BigTable distributed-storage system stores data in tablets, with each machine managing between 20 and 1,000 tablets at a time.
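A sketch of the idea (the machine names, partition count, and load metric are illustrative; real systems use much richer placement policies):

```python
from collections import defaultdict

def assign_partitions(n_partitions, machines):
    """Spread many small partitions round-robin across machines."""
    assignment = defaultdict(list)
    for p in range(n_partitions):
        assignment[machines[p % len(machines)]].append(p)
    return dict(assignment)

def rebalance(assignment, load):
    """Move one partition from the most- to the least-loaded machine.
    Because partitions are small, each move is a cheap, fine-grained step."""
    hot = max(assignment, key=lambda m: load[m])
    cold = min(assignment, key=lambda m: load[m])
    if hot != cold and assignment[hot]:
        assignment[cold].append(assignment[hot].pop())

a = assign_partitions(100, ["m1", "m2", "m3"])
rebalance(a, {"m1": 0.9, "m2": 0.2, "m3": 0.4})
print(len(a["m1"]), len(a["m2"]))  # → 33 34
```

With ~30 partitions per machine instead of one, shedding load or absorbing a failed machine's work happens in small increments spread across many peers.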
Selective Replication: The system observes which data is being accessed most frequently and automatically creates additional replicas of "hot" data. This spreads the load and reduces contention, preventing a single server from becoming a bottleneck. Google’s main Web search system uses this approach, making additional copies of popular and important documents in multiple micro-partitions.
Latency-Induced Probation: The system monitors the performance of individual servers. If a server consistently exhibits high tail latency, the system stops sending it new requests, giving it time to recover and clear out any internal issues. However, the system continues to issue shadow requests to these excluded servers, collecting statistics on their latency so they can be reincorporated into the service when the problem abates.
"Good Enough" Response: Some applications can provide a partial response to the user if a full response is not available. For example, a search engine might display a portion of the results while waiting for a few slow ones to finish, then update the page.
Canary Requests: A small number of requests are sent to a new or updated version of a service to test its performance and ensure it doesn't introduce any new latency issues before it is fully rolled out.
Another technique for reducing tail latency is erasure coding. This technique is not mentioned in the paper “The Tail at Scale”. Erasure coding is a data protection technique where data is broken into fragments and encoded with extra "code" fragments. These fragments are stored across different locations (like servers or disks). The original data can be fully reconstructed even if some fragments are lost or temporarily unavailable, using a subset of the remaining data and code fragments.
Erasure coding can help reduce read tail latency in distributed storage systems. When fetching data, the system can issue requests for the necessary fragments in parallel. If some servers are slow to respond, the system doesn't need to wait for them. It can use the fragments it receives first from the faster servers to reconstruct the original data, thus avoiding delays caused by the slowest components and improving tail performance. This is often achieved through "degraded reads" or "reconstructed reads".
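The mechanism can be illustrated with the simplest possible code: a single XOR parity fragment (RAID-5 style). Production systems use Reed-Solomon codes with multiple parity fragments, but the reconstruction idea is the same; the data and fragment count below are made up:

```python
from functools import reduce

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def encode(data, k):
    """Split data into k equal fragments plus one XOR parity fragment."""
    size = len(data) // k
    frags = [data[i * size:(i + 1) * size] for i in range(k)]
    frags.append(reduce(xor, frags))  # parity = XOR of all data fragments
    return frags

def decode(frags, k):
    """Rebuild the data from any k of the k+1 fragments. A slow or missing
    server's fragment (None) is reconstructed by XOR-ing the rest."""
    missing = [i for i, f in enumerate(frags) if f is None]
    if missing:
        frags[missing[0]] = reduce(xor, [f for f in frags if f is not None])
    return b"".join(frags[:k])

frags = encode(b"tail-at-scale-ok", k=4)  # 4 data fragments + 1 parity
frags[2] = None                           # pretend one server is slow
print(decode(frags, k=4))                 # → b'tail-at-scale-ok'
```

The read completes as soon as any 4 of the 5 fragment requests return, so the slowest server never sits on the critical path.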
Hardware Trends and Their Effects
This paper notes that hardware trends are making system latency more variable due to aggressive power optimizations and fabrication challenges. This growing heterogeneity means software techniques for tolerating variability are becoming more critical. Fortunately, some trends, like higher network bandwidth and lower per-message overheads from technologies like RDMA, are making these software solutions more effective. These improvements reduce the cost of redundant requests and allow for more fine-grained communication, helping to prevent head-of-line blocking.
The Paper's Enduring Influence
The concepts mentioned in the paper are now standard in the industry. The influence of "The Tail at Scale" extends to various other systems and academic research. Two notable examples are the papers "The Tail at Scale: How to Predict It?" and "The Tail at AWS Scale."
"The Tail at Scale: How to Predict It?": The paper "The Tail at Scale: How to Predict It?" is a research paper that builds upon the original "The Tail at Scale" paper. Its main contribution is a prediction model for tail latency. It finds that tail latency in highly-loaded, scale-out systems can be accurately predicted using only the mean and variance of individual task response times. This allows engineers to explicitly link system-level latency goals with the performance of individual components, which facilitates more efficient resource provisioning for large-scale systems. The paper's model helps move from reactive mitigation to proactive, predictive resource management.
"The Tail at AWS Scale": The research paper "The Tail at Amazon Web Services Scale" details how AWS addressed tail latency, the unpredictable delays in its network, which are caused by the limitations of traditional TCP at massive scale. To combat this, AWS developed the Scalable Reliable Datagram (SRD) protocol. This proprietary protocol, implemented in custom Nitro networking cards, is designed to be "tail-tolerant." SRD intelligently uses multiple network paths simultaneously, "spraying" packets across available routes to bypass congestion. It also features ultra-fast retransmission and a relaxed packet-ordering policy to prevent head-of-line blocking. By using this hardware-software co-design to quickly react to network issues, SRD has been able to reduce high-percentile latency by up to 90% for services like Elastic Block Storage (EBS).
References and Further Reading:
Dean, J., & Barroso, L. A. (2013). The tail at scale.
Langley, A., Riddoch, A., Wilk, A., Vicente, A., Krasic, C., Zhang, D., ... & Shi, Z. (2017). The QUIC Transport Protocol: Design and Internet-Scale Deployment.
Berger, D. S., Berg, B., Zhu, T., Sen, S., & Harchol-Balter, M. (2018). RobinHood: Tail Latency Aware Caching-Dynamic Reallocation from Cache-Rich to Cache-Poor.
Nguyen, M., Li, Z., Duan, F., Che, H., Lei, Y., & Jiang, H. (2016). The Tail at Scale: How to Predict It?
Shalev, L., Ayoub, H., Bshara, N., Fatael, Y., Golan, O., Ilany, O., ... & Saidi, A. (2024). The Tail at AWS Scale.
Marc Brooker’s blog “Erasure Coding versus Tail Latency”.