
Posts Tagged ‘distributed system’

  1. Lease Activating

    August 30, 2016 by xudifsd

    Leases are a very useful and pervasive technique in distributed systems: a lease can be used to grant authority to other nodes in the system. For example, in master election, follower nodes use leases to promise the current master that they will not elect another master until the lease expires. But the problem is: how can you make sure the current master does not believe it still holds the lease after the grantors think otherwise?

    This is not an easy question in practice, because you have to cope with clock skew and an asynchronous network. You cannot grant a lease in terms of absolute time, and you have to assume that messages may be delayed arbitrarily.

    Normally, we make a conservative guess about the network delay, and to be absolutely safe the guess has to be very conservative. For example, we might build a system whose leases are valid for 60 seconds, have the master assume each grant message was delayed by, say, 30 seconds, and have the master refresh its lease every 10 seconds. This looks fine when the lease duration is long enough, but it breaks down if the lease expires quickly. The benefit of a shorter lease is higher availability, because a long-lived lease keeps the system from making progress for a longer time if the master crashes.
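
    To make the arithmetic concrete, here is a toy illustration of that conservative scheme in Python. The numbers are just the ones from the paragraph above, and holder_safe_expiry is a made-up helper, not code from any real system.

        # Toy illustration of the conservative scheme (numbers from the text above).
        LEASE_DURATION = 60.0   # grantor considers the lease valid for 60 s
        ASSUMED_DELAY = 30.0    # holder assumes the grant message may be 30 s old
        REFRESH_PERIOD = 10.0   # holder refreshes the lease every 10 s

        def holder_safe_expiry(receive_time):
            # The holder cannot know how long the grant spent in flight, so it
            # assumes the worst case and trusts the lease only for
            # LEASE_DURATION - ASSUMED_DELAY after receiving it.
            return receive_time + LEASE_DURATION - ASSUMED_DELAY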

    I didn’t have a very good solution to this until I read a paper. The paper proposes a novel way to perform reads with high throughput and low latency in a Paxos system without sacrificing consistency, which is especially useful in wide-area scenarios. Apart from its main topic, the paper also describes a new way to grant and refresh leases without depending on external clock synchronization.

    [figure: lease granting protocol]

    As the picture above shows, the protocol uses a guard to bound the promise duration: if the grantor does not receive promise_ACK within t_guard, the lease expires at T3 + t_guard + t_lease. If the holder does not receive the promise within t_guard, the lease is never activated at all. Receiving promise_ACK can only shorten the lease duration. When renewing an active lease, there is no need to send the guard anymore; the most recent promise_ACK plays the role of the guard.
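
    Below is a minimal sketch of the grantor-side rule as stated above. Since I only have the text and not the paper’s figure, I’m assuming T3 is the grantor’s local monotonic time when it sends the promise and that all intervals are measured on the grantor’s own clock; the class and method names are made up.

        import time

        class GrantorLease:
            def __init__(self, t_guard, t_lease):
                self.t_guard = t_guard
                self.t_lease = t_lease
                self.t3 = time.monotonic()   # local time the promise was sent
                # Worst case: assume the holder saw the promise just before
                # t_guard ran out, so the lease may live until then + t_lease.
                self.expiry = self.t3 + t_guard + t_lease

            def on_promise_ack(self):
                now = time.monotonic()
                if now <= self.t3 + self.t_guard:
                    # The ACK arrived in time; it can only shorten the lease,
                    # never extend it.
                    self.expiry = min(self.expiry, now + self.t_lease)

            def expired(self):
                return time.monotonic() >= self.expiry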

    With this protocol, we can use very short leases because we no longer have to guess the network delay at all. In the evaluation section of the paper, the authors set the lease duration to 2 seconds and let the grantor renew the current lease every 500 ms. In this case, if the holder crashes or becomes unavailable, the lease will not block the system for more than 2 seconds.


  2. Chubby & ZooKeeper: different consistency levels

    June 12, 2016 by xudifsd

    Many people know ZooKeeper, a widely used open source project. But few people know Chubby, a service used only internally at Google, whose details the public can learn only from the paper Google published. ZooKeeper provides almost the same functionality as Chubby, but there are a few subtle differences between them, which I will discuss in this post. So the next time you read a Google paper saying Chubby is used as underlying infrastructure, do not treat it as a ZooKeeper equivalent.

    ZooKeeper is a distributed process coordinator, as the O’Reilly book puts it. Chubby is a distributed lock service that provides strong consistency. Although both expose a file-system-like API from the user’s perspective, they provide different levels of consistency, and their descriptions already give a clue: coordinator is a much weaker word than lock service.

    Like most similar systems, Chubby sees few write operations, so its network traffic is dominated by reads and by heartbeats between clients and the server. Because Chubby provides coarse-grained locking, caching is necessary, and since Chubby is intended to provide strong consistency, it must ensure that clients never see stale data. It achieves this by caching data in the client library: whenever someone wants to change a piece of data, Chubby ensures that all caches are invalidated before the change takes effect. From the paper, Chubby has a rather heavy client library, and that library is an essential part of the system; it provides not only connection and session management but also the cache management just described. It is this cache management that makes the huge difference between ZooKeeper and Chubby: many users assume ZooKeeper provides the same strong consistency, only to find out later that they are wrong.
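
    The write path this implies looks roughly like the sketch below. This is my own paraphrase of the paper’s description, not real Chubby code; caches is a hypothetical list of per-client cache handles, and store is a dict standing in for the server state.

        def chubby_style_write(path, new_value, caches, store):
            # Invalidate every client cache before the change takes effect.
            for cache in caches:
                cache.invalidate(path)
            for cache in caches:
                # Block until the client acknowledged the invalidation or its
                # cache lease expired, so no one can still serve the old data.
                cache.wait_invalidated(path)
            store[path] = new_value   # only now does the write become visible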

    Unlike Chubby, ZooKeeper does not provide such strong consistency; in fact, the native ZooKeeper client library does not cache at all. What difference can a cache make, you ask? For example: if two ZooKeeper clients want to agree on a single volatile value, they write to the same path, and the content of that node represents the value. If client A wants to know the latest value and client B wants to change it, A can either poll the node regularly or use the watch mechanism provided by the native library; let’s assume A uses a watch since it is more efficient. When B changes the value, ZooKeeper responds to B’s call before A’s watch notification is delivered. Once B gets the response, it may assume every other client watching that value has been notified, but that is not the case: A may not receive the notification promptly because of a slow network, so the two clients can see different values at the same time. This is usually not a problem if users do not require strong consistency, but if they do, they end up inventing buggy ad-hoc workarounds. The author of the ZooKeeper book argues that managing a cache in the library would have made ZooKeeper’s design more complex, and could stall ZooKeeper operations while waiting for a client to acknowledge a cache-invalidation request. It is true that strong consistency comes at a cost, but people find a system much easier to use when strong consistency is guaranteed.
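
    To make the race concrete, here is a small illustration using the kazoo Python client (chosen purely for illustration; the discussion above is about the native libraries, and the /agreed/value path is made up).

        from kazoo.client import KazooClient

        # Client A: watches the node and keeps the last value it saw.
        a = KazooClient(hosts="127.0.0.1:2181")
        a.start()
        a.ensure_path("/agreed")
        if not a.exists("/agreed/value"):
            a.create("/agreed/value", b"v1")

        last_seen = {}

        def on_change(event):
            # This fires only after ZooKeeper has already acknowledged B's
            # write; until it runs, A keeps acting on the old value.
            data, _ = a.get("/agreed/value", watch=on_change)  # re-arm the watch
            last_seen["value"] = data

        data, _ = a.get("/agreed/value", watch=on_change)
        last_seen["value"] = data

        # Client B: updates the value and immediately gets a success response.
        b = KazooClient(hosts="127.0.0.1:2181")
        b.start()
        b.set("/agreed/value", b"v2")
        # B is done here, yet last_seen["value"] may still be b"v1" until A's
        # notification arrives: the two clients briefly disagree.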

    To remedy this, Netflix created the Curator library, which later moved to the Apache foundation. It provides commonly used recipes along with cache management, and this additional layer on top of ZooKeeper lets it offer the stronger consistency some users need. So whenever you want to use ZooKeeper, use Curator instead of the native library unless you know what you are doing.


  3. Use percentiles in performance stats

    May 8, 2016 by xudifsd

    Here is a fun story of mine: last year, when I was preparing for Google Summer of Code with the Finagle project, I encountered a class I didn’t understand. It is not long, but from the source and comments alone I could not figure out how it was used or why it was useful. Since it’s not a very important class, I skipped it and tried to read other parts of the project. Unfortunately, Finagle was not accepted as an organization in GSoC 2015 (what a shame), so I stopped reading.

    This year, I took a full-time job working on a distributed system; it is my favorite field in computer programming, and I have been working hard to do my best. Recently, I read the post <Notes on Distributed Systems for Young Bloods>, which lists many things new recruits to distributed systems should pay attention to. It has a lot of valuable advice; in particular, it emphasizes the importance of metrics: to better understand the behaviour of a system, we, as system engineers, should expose metrics, and we should expose not only averages but also percentiles. This reminded me that the mysterious class I met in Finagle had something to do with metrics and stats, so I read it again and found that I can now fully understand the class and its usage.

    From <The Tail at Scale>, I understood that it is critical for a system to keep tail latency low, otherwise the overall tail latency makes the system intolerably slow. But I had failed to see the implication that we should expose latency data, especially tail latency data, so that people can tune the system. No wonder Finagle needs that class.

    While searching for a better algorithm to implement percentile metrics, I found another interesting paper, <Effective Computation of Biased Quantiles over Data Streams>, which proposes a way to compute quantiles over streams with finer error guarantees at higher ranks. It is very efficient, with a space bound of O(log n), but it still does not fit the typical needs of performance stats: a server runs for a long time, so the number of data points is unbounded, and the algorithm would eventually consume too much memory. In this respect Finagle’s BucketedHistogram performs well: we specify the error we can tolerate through the bucket limits, and if we want more precision at higher ranks we simply add more slots to the higher-rank limits. It is very easy to understand and, what’s more, it uses constant space, although the constant is a little large.
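
    Here is a small Python sketch of the idea behind a bucketed histogram. It is not Finagle’s actual BucketedHistogram, and the example limits are made up; the point is that each sample only bumps a counter, so space stays constant, and a percentile query is off by at most one bucket width.

        import bisect

        class SimpleBucketedHistogram:
            def __init__(self, limits):
                self.limits = sorted(limits)           # bucket upper bounds
                self.counts = [0] * (len(limits) + 1)  # +1 overflow bucket
                self.total = 0

            def add(self, value):
                i = bisect.bisect_left(self.limits, value)
                self.counts[i] += 1                    # constant space: counters only
                self.total += 1

            def percentile(self, p):
                # Return the upper bound of the bucket holding the p-quantile;
                # use denser limits near the tail for more precision there.
                if self.total == 0:
                    return 0
                rank = p * self.total
                seen = 0
                for i, count in enumerate(self.counts):
                    seen += count
                    if seen >= rank:
                        return self.limits[i] if i < len(self.limits) else float("inf")
                return float("inf")

        # e.g. latencies in ms, with extra buckets at the tail:
        # h = SimpleBucketedHistogram([1, 2, 5, 10, 50, 100, 200, 500, 750, 1000])
        # h.add(7); h.percentile(0.99)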


  4. Make a predictable whole out of unpredictable parts

    August 8, 2015 by xudifsd

    Recently, I have read many papers and articles about latency, especially tail latency, so this post serves as notes on my reading.

    Latency is a very important metric for a system because it strongly affects user experience. In a distributed system, tail latency matters much more than in a single-node system. This is based on a simple observation:

    if you have components where 1% of requests exhibit high latency, and a request from a client has to touch 100 components in a distributed system, then 63% of client requests will exhibit high latency. That’s awful user experience. This is a fundamental property of scaling systems: you need to worry not just about latency, but tail latency. High performance equals high tolerances. At scale you can’t ignore tail latency.
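
    The 63% figure is just the complement of all 100 independent calls landing in the fast 99%:

        >>> round(1 - 0.99 ** 100, 3)   # chance at least one of 100 calls is slow
        0.634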

    From this article, we learn that latency can arise from many sources, from hardware to software. It also links to an interesting discussion of TCP vs. SPDY.

    The authors of another paper describe their effort to tune web server software; they found that head-of-line blocking in kernel queues has a large impact on latency. This kind of blocking sometimes helps disk-write throughput, but it can severely degrade latency. Likewise, a blocking invocation inside an event loop causes head-of-line blocking and raises tail latency; moving blocking calls out of the event loop solves this kind of problem, as the sketch below illustrates.
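
    As a toy illustration of that last point, here is the pattern in Python’s asyncio. The paper is not about asyncio; this just shows the general “move blocking work off the event loop” idea, and slow_blocking_call is a stand-in for any blocking operation.

        import asyncio
        import time

        def slow_blocking_call():
            time.sleep(0.5)      # stands in for blocking work (disk, DNS, ...)
            return "done"

        async def handler_blocking():
            # Running this inside the event loop stalls every other request
            # queued behind it for the full 0.5 s: head-of-line blocking.
            return slow_blocking_call()

        async def handler_offloaded():
            # Hand the blocking work to a worker thread; the loop keeps serving
            # other requests meanwhile, so one slow call no longer inflates the tail.
            loop = asyncio.get_running_loop()
            return await loop.run_in_executor(None, slow_blocking_call)

        asyncio.run(handler_offloaded())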

    Another article, from LinkedIn, discusses how they solved high tail latency by installing an additional network card to serve a specific kind of network request.

    The two examples above show how difficult it is to identify the source of latency. As always, optimization requires a lot of insight into how the system works.

    Another interesting paper proposes a runtime system that coordinates GC across a distributed system written in a garbage-collected language. GC usually causes hiccups, especially major collections. The paper shows that minor GC has little impact on batch workloads like Spark but affects real-time applications significantly; GC is a key contributor to stragglers in many interactive systems. With the coordination system deployed, a PageRank computation on Spark completed 15% faster, and the 99.9th percentile update latency on Cassandra improved from 3.3 ms to 1.6 ms, with the worst case dropping from 83 ms to 19 ms, which is quite impressive.

    High tail latency is not caused only by GC, head-of-line blocking, and the like; it is prevalent. Another paper shows that the top 1% of latencies is unpredictable no matter which configuration, programming language, or OS is used, while after removing that 1% from the statistics the server’s behavior becomes predictable. Given that, many systems could focus on providing QoS guarantees for statistical measures such as the 99th latency percentile.

    OK, enough background talk; let’s read some really awesome stuff.

    Jeff Dean gave a talk about tail latency in 2013; here are the slides and a summary article. Instead of trying to identify the source of latency and eliminate it, he tries to live with it. The talk gives some general techniques to cut the tail: hedged requests, tied requests, micro-partitions, selective replication, and latency-induced probation. These techniques are general enough to apply to many scenarios; a sketch of the simplest one, hedged requests, follows.
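
    This is a minimal sketch of a hedged request, under my own assumptions: send(replica) is a hypothetical blocking RPC, hedge_after would be something like the observed 95th-percentile latency, and nothing here comes from the talk’s actual implementation.

        import concurrent.futures as cf

        def hedged_request(send, replicas, hedge_after):
            # Fire the request at the first replica; if it has not answered
            # within hedge_after seconds, fire the same request at a second
            # replica and return whichever answer comes back first.
            pool = cf.ThreadPoolExecutor(max_workers=2)
            try:
                first = pool.submit(send, replicas[0])
                try:
                    return first.result(timeout=hedge_after)
                except cf.TimeoutError:
                    backup = pool.submit(send, replicas[1])
                    done, _ = cf.wait({first, backup}, return_when=cf.FIRST_COMPLETED)
                    return done.pop().result()
            finally:
                # Do not wait for the losing request; cancelling it cleanly is
                # what "tied requests" are about.
                pool.shutdown(wait=False)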

    The talk compares fault tolerance with variability tolerance. To tolerate variability, one has to:

    make a predictable whole out of unpredictable parts

    and this is where the title of this post came from. It’s very similar to:

    make a reliable whole out of unreliable parts

    which is the key point of distributed systems. And since latency is an increasingly important metric, predictable latency should become increasingly important too. I think this is the direction the next generation of distributed systems should be heading.


  5. Cope with high latency caused by fork(2)

    July 8, 2015 by xudifsd

    In normal applications, we don’t care much about the latency of non-I/O system calls, because they are usually very fast; this is especially true for fork(2), since fork(2) on Linux is much more lightweight than on most other OSs.

    But we hit exactly this problem. The application in question is the master node of a distributed cluster: we fork to checkpoint our internal data structures, and the checkpoint speeds up recovery if the server goes down and comes back up. Because the cluster is very large, the master node application consumes a lot of memory; in our production cluster we have machines with 200+ GB of memory, and the application uses almost 200 GB of it. This makes fork(2) exhibit high latency, almost 10 seconds in our observations, which has also been documented by Redis.

    Linux uses copy-on-write in fork(2), but it still needs to copy the page tables. 200 GB of memory consists of 200G/4K == 50M pages, and since one page holds 4K/8 == 512 PTEs, the PTEs alone occupy 50M/512 == 100K pages, which is 400 MB of memory. And the kernel is not just copying those 400 MB; it also has to do some bookkeeping.
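
    The same arithmetic, spelled out (treating 200G as 200 × 2^30 bytes and assuming 4 KB pages with 8-byte PTEs):

        >>> mem = 200 * 2**30          # ~200 GB of anonymous memory
        >>> pages = mem // 4096        # 4 KB pages            -> 52,428,800 (~50M)
        >>> ptes_per_page = 4096 // 8  # 8-byte PTEs per page  -> 512
        >>> pages // ptes_per_page * 4096 // 2**20   # page-table memory, in MB
        400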

    Although Linux introduced lazy copying of page tables in 2.6.14, this laziness only applies to file-backed pages, which does not help in our case.

    fork(2) also pauses the other threads for the duration of the call. This latency is unacceptable for some of them, especially time-sensitive threads such as those responsible for sending and receiving heartbeats.

    If we do not send or respond to heartbeats for a long time, other nodes will think we are down, and if this server is the primary master, the quorum may run another election to elect a new primary. To make it worse, the number of ‘running’ servers may fall below a quorum, which leaves the service completely unresponsive during that time.

    Luckily, we only checkpoint on a secondary master rather than on the primary master, so we don’t need to deal with the election problem. But since we usually have only 3 masters, we need to make sure that at least 2 of them are responding at any time, otherwise the service stops responding.

    Previously we worked around this by making the heartbeat-miss tolerance longer than the latency of these time-consuming operations, but that also makes detecting failed nodes much slower, and we have to make sure the two secondary masters never call fork(2) at the same time. Currently, we notify the heartbeat thread that a long pause is coming before we actually call fork(2), notify it again after the call, and have that thread pretend it received a heartbeat just after the pause. These are all workarounds rather than a real solution; a stripped-down sketch of the current one follows.
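
    This sketch only illustrates the shape of the workaround described above; names like HeartbeatMonitor and dump_state are made up, and our real code is not Python.

        import os
        import threading
        import time

        class HeartbeatMonitor:
            def __init__(self):
                self.lock = threading.Lock()
                self.last_heartbeat = time.monotonic()
                self.paused = False

            def on_heartbeat(self):
                with self.lock:
                    self.last_heartbeat = time.monotonic()

            def pause_begin(self):          # called right before fork(2)
                with self.lock:
                    self.paused = True

            def pause_end(self):            # called right after fork(2) returns
                with self.lock:
                    self.paused = False
                    # Pretend a heartbeat arrived just after the pause, so the
                    # long fork does not count against the peer.
                    self.last_heartbeat = time.monotonic()

            def peer_suspect(self, timeout):
                with self.lock:
                    return (not self.paused
                            and time.monotonic() - self.last_heartbeat > timeout)

        def checkpoint(monitor, dump_state):
            monitor.pause_begin()           # warn the heartbeat thread first
            pid = os.fork()                 # the slow part on a huge heap
            if pid == 0:
                dump_state()                # child: write the checkpoint
                os._exit(0)
            monitor.pause_end()             # parent: resume normal accounting
            os.waitpid(pid, 0)              # reap the child when it finishes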

    IMHO, this fragility to latency is a design fault. I have tested that forking a process using 30 GB of memory takes about 1 second, so if your application targets that scale, you should take fork(2) latency seriously. In our case, I think we could set up a separate process on the same machine whose only job is to read the log the master writes and to generate checkpoints periodically, although this may consume twice the memory; we can treat that as trading space for time. Or we could checkpoint only one data structure at a time, which would not consume as much memory but would make the data structures more complicated.

    Well, I’m more and more convinced by

    it’s important to estimate performance of a system design without actually having to build it.

    To make this happen, a programmer needs to be experienced and familiar with the latency of common operations. I still have a long way to go.