Posts Tagged ‘latency’

  1. Make a predictable whole out of unpredictable parts

    August 8, 2015 by xudifsd

    Recently I have read many papers and articles about latency, especially tail latency, so this post serves as my notes on that reading.

    Latency is a very important metric for a system because it strongly affects user experience. In a distributed system, tail latency matters much more than in a single-node system. This follows from a simple observation:

    if 1% of requests to each component exhibit high latency, and a client request has to touch 100 components in a distributed system, then about 63% of client requests will exhibit high latency. That’s an awful user experience. This is a fundamental property of scaling systems: you need to worry not just about latency, but tail latency. High performance equals high tolerances. At scale you can’t ignore tail latency.
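
    A quick back-of-the-envelope check of that figure (a tiny Python sketch of my own, not from the article):

        # Probability that a request touching 100 components hits at least one
        # slow component, when each component is slow for 1% of its requests.
        p_component_slow = 0.01
        n_components = 100
        p_request_slow = 1 - (1 - p_component_slow) ** n_components
        print(f"{p_request_slow:.1%}")  # ~63.4%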

    From this article we learn that latency can arise from many sources, from hardware to software. It also links to an interesting discussion of TCP vs. SPDY.

    The authors of another paper discussed their efforts to tune web server software; they found that head-of-line blocking in kernel queues has a great impact on latency. This kind of blocking sometimes helps throughput for disk writes, but it can severely degrade latency. Likewise, a blocking invocation inside an event loop causes head-of-line blocking and drives tail latency up; moving blocking invocations out of the event loop solves this kind of problem.
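
    A minimal sketch of that fix using Python’s asyncio (my own illustration, not the setup from the paper): instead of calling the blocking operation directly in the loop, hand it to a worker thread so requests queued on the loop are not stuck behind it.

        import asyncio
        import time

        def blocking_call():
            # stand-in for a blocking invocation: a synchronous disk write,
            # a blocking RPC, etc.
            time.sleep(0.5)
            return "done"

        async def handler():
            # Calling blocking_call() directly here would stall the event loop
            # for 500ms, and every request queued behind it would inherit that
            # latency (head-of-line blocking). Running it in the default
            # thread pool keeps the loop free.
            loop = asyncio.get_running_loop()
            return await loop.run_in_executor(None, blocking_call)

        asyncio.run(handler())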

    Another article from LinkedIn discussed how they solved high tail latency by installing an additional network card dedicated to serving specific network requests.

    The two examples above show how difficult it is to identify the source of latency. As always, optimization requires a lot of insight into how the system works.

    Another interesting paper proposed a runtime coordination system that coordinates GC across nodes in a distributed system written in a garbage-collected language. GC usually causes hiccups, especially major collections. The paper shows that minor GC has little impact on batch workloads like Spark, but affects real-time applications significantly; GC is a key contributor to stragglers in many interactive systems. With the coordination system deployed, PageRank computation on Spark completed 15% faster, and the 99.9th percentile update latency on Cassandra improved from 3.3ms to 1.6ms, with the worst case going from 83ms to 19ms, which is quite impressive.

    High tail latency is not only caused by GC, head-of-line blocking, and the like; it is prevalent. This paper shows that the last 1% of tail latency is unpredictable no matter which configuration, programming language, or OS is used. Once that 1% is excluded from the statistics, the server’s behavior is predictable. Consequently, many systems could focus on providing QoS guarantees for statistical measures such as the 99th latency percentile.

    OK, enough talk about background; let’s read some really awesome stuff.

    Jeff Dean gave a talk about tail latency in 2013; here are the slides and a summary article. Instead of trying to identify the source of latency and eliminate it, he tries to live with it. The talk gives some general techniques to cut the tail, such as hedged requests, tied requests, micro-partitions, selective replication, and latency-induced probation. These techniques are general enough to apply to many scenarios.
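
    To make the idea concrete, here is a rough Python sketch of a hedged request (my own illustration, not code from the talk): send the request to one replica, and if it hasn’t answered within a small delay (the talk suggests something like the 95th-percentile expected latency, so the extra load stays low), send the same request to a second replica and take whichever answer arrives first.

        import concurrent.futures
        import random
        import time

        def query_replica(replica_id):
            # hypothetical replica call: usually fast, occasionally very slow
            time.sleep(random.choice([0.01, 0.01, 0.01, 1.0]))
            return f"result from replica {replica_id}"

        def hedged_request(replicas, hedge_after=0.05):
            pool = concurrent.futures.ThreadPoolExecutor(max_workers=2)
            try:
                futures = [pool.submit(query_replica, replicas[0])]
                done, _ = concurrent.futures.wait(futures, timeout=hedge_after)
                if not done:
                    # first replica is slow: hedge with a second request
                    futures.append(pool.submit(query_replica, replicas[1]))
                    done, _ = concurrent.futures.wait(
                        futures,
                        return_when=concurrent.futures.FIRST_COMPLETED)
                return done.pop().result()
            finally:
                pool.shutdown(wait=False)  # don't wait for the straggler

        print(hedged_request(["A", "B"]))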

    The talk compared fault tolerance with variability tolerance. To tolerate variability, one has to:

    make a predictable whole out of unpredictable parts

    and this is where the title of this post came from. It’s very similar to:

    make a reliable whole out of unreliable parts

    which is the key idea of distributed systems. And since latency is becoming more and more important as a metric, predictable latency should become more and more important too. I think this is the direction the next generation of distributed systems should be heading.


  2. Cope with high latency caused by fork(2)

    July 8, 2015 by xudifsd

    In normal applications we don’t care much about the latency of non-I/O system calls, because they’re usually very fast; this is especially true for fork(2), since fork(2) on Linux is much more lightweight than on most other OSs.

    But we ran into exactly this problem. The application in question is the master node server of a distributed cluster; we fork to make a checkpoint of our internal data structures, and the checkpoint speeds up the recovery phase if the server goes down and comes back up. Because the cluster is very large, the application on the master node consumes a lot of memory: in our production cluster we have machines with 200+GB of memory, and the application consumes almost 200GB of it. This makes fork(2) exhibit high latency, almost 10s in our observation, which has also been documented by redis.
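
    For context, the pattern looks roughly like this (a simplified Python sketch of the same idea redis uses for its background save; our server is not actually written in Python):

        import json
        import os

        state = {"jobs": {}, "nodes": {}}    # in-memory data structures to snapshot

        def checkpoint(path="checkpoint.json"):
            # fork(2) gives the child a copy-on-write snapshot of the parent's
            # memory; the child serializes it at its own pace while the parent
            # keeps serving requests. The parent's cost is the fork itself,
            # which is dominated by copying the page table.
            pid = os.fork()
            if pid == 0:                     # child: write snapshot and exit
                with open(path, "w") as f:
                    json.dump(state, f)
                os._exit(0)
            return pid                       # parent: reap later with os.waitpid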

    Linux uses copy-on-write in fork(2), but it still needs to copy the page table. 200GB of memory consists of 200GB/4KB == 50M pages, and since one page can hold 4KB/8B == 512 PTEs, the PTEs alone occupy 50M/512 == 100K pages, which is about 400MB of memory. And the kernel is not just copying those 400MB; it also needs to do some bookkeeping.
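
    Checking that arithmetic quickly:

        GB = 1024 ** 3
        page_size = 4096                           # bytes per page
        pte_size = 8                               # bytes per page-table entry
        pages = 200 * GB // page_size              # ~52.4M pages of data
        pte_pages = pages * pte_size // page_size  # ~102K pages of PTEs
        print(pte_pages * page_size / GB)          # ~0.4GB copied on fork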

    Although Linux introduced lazy copying of the page table in 2.6.14, this laziness only works for pages backed by a file, which does not apply to our case.

    fork(2) also pauses all other threads for the duration of the call. This latency is unacceptable for some threads, especially those that are sensitive to time, for example the threads responsible for sending and receiving heartbeats.

    If we don’t send or respond to heartbeats for a long time, other nodes will think we’re down, and if this server is the primary master, the quorum may run an election to choose a new primary. To make it worse, the number of ‘running’ servers may drop below the quorum, which makes the service completely unresponsive during that time.

    Luckily, we only checkpoint on a secondary master, not on the primary master, so we don’t need to deal with the election problem. But since we usually have only 3 masters, we need to make sure there are always 2 masters responding; otherwise the service will not respond.

    Previously we worked around this by making the heartbeat-miss tolerance longer than the latency caused by these time-consuming operations, but that also makes failing-node detection much slower, and we still need to make sure the 2 secondary masters are not doing fork(2) at the same time. Currently we notify the heartbeat thread that a long pause is coming before we actually call fork(2), notify it again after the call, and in that thread we pretend we received the heartbeat just after the pause. But these are all workarounds, not real solutions.
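
    A rough sketch of that workaround in Python (hypothetical names; the real server works differently in detail): the checkpoint path raises a flag before fork(2) and clears it afterwards, and the heartbeat thread treats the whole flagged window as if a heartbeat had arrived right at its end.

        import os
        import threading
        import time

        pause_expected = threading.Event()
        HEARTBEAT_TIMEOUT = 5.0

        def heartbeat_monitor():
            last_seen = time.monotonic()
            while True:
                # real code also updates last_seen whenever a heartbeat
                # actually arrives from the peer
                if pause_expected.is_set():
                    # A long pause (checkpoint fork) was announced: don't count
                    # the silence as missed heartbeats; pretend one arrived as
                    # soon as the pause ends.
                    last_seen = time.monotonic()
                elif time.monotonic() - last_seen > HEARTBEAT_TIMEOUT:
                    print("peer considered down")  # would normally trigger failover
                time.sleep(0.1)

        def checkpoint_with_notice():
            pause_expected.set()      # warn the heartbeat thread before the pause
            pid = os.fork()           # every thread in the parent stalls here
            if pid == 0:
                os._exit(0)           # child would write the checkpoint here
            pause_expected.clear()    # pause over; resume normal accounting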

    IMHO, this fragility to latency is a design fault. I’ve tested that forking with 30GB of memory takes about 1s, so if your application targets that scale, you should take fork(2) latency seriously. In our case, I think we could set up a separate process on the same machine that is only responsible for reading the log output by the master and generating checkpoints periodically, but this may consume twice the memory; we can treat it as trading space for time. Or we could checkpoint only one data structure at a time, which would not consume as much memory but would make the data structures more complicated.

    Well, I’m more and more convinced by

    it’s important to estimate performance of a system design without actually having to build it.

    To make this happen, a programmer should be experienced and familiar with the latency of common operations. I still have a long way to go.