RSS Feed

‘Programming’ Category

  1. Lease Activating

    August 30, 2016 by xudifsd

    A lease is a very useful and pervasive technique in distributed systems: it can be used to grant authority to other nodes. For example, in master election, follower nodes use a lease to promise the current master that they will not elect another master until the lease expires. But the problem is: how can you make sure the current master will not think it still holds the lease after the lease grantors think otherwise?

    It is not an easy question in practice, because you have to cope with clock skew and asynchronous networks. You cannot grant a lease on absolute time, and you have to assume messages may suffer arbitrary delays.

    Normally, we can make a conservative guess about network delay, and to be absolutely safe we have to be very conservative. So we might build a system whose lease is valid for 60 seconds, where the master has to assume the grant message was delayed by, say, 30 seconds, and refreshes its lease every 10 seconds. This looks fine if the lease duration is long enough, but it won't work if the lease expires quickly. The benefit of a shorter lease is higher availability: a long-lived lease prevents the system from making progress for a longer time if the master crashes.
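
    The conservative scheme can be sketched in a few lines. This is my own illustration, not code from any real system; the numbers come from the paragraph above, and the function names are mine:

```python
LEASE = 60.0       # seconds the grantor considers the lease valid
MAX_DELAY = 30.0   # worst-case message delay the holder assumes
REFRESH = 10.0     # how often the holder refreshes the lease

def grantor_view(sent_at):
    # The grantor measures the lease from the moment it sent the grant.
    return sent_at + LEASE

def holder_view(received_at):
    # The holder cannot know when the grant was really sent, so it
    # pessimistically assumes the message spent MAX_DELAY in flight.
    return received_at + LEASE - MAX_DELAY

# However long the grant actually took (up to MAX_DELAY), the holder
# always believes its lease ends no later than the grantor does.
for actual_delay in (0.0, 5.0, 30.0):
    assert holder_view(100.0 + actual_delay) <= grantor_view(100.0)
```

    The price of safety is visible directly: the holder throws away MAX_DELAY seconds of every lease, which is why the guess must stay small relative to the lease duration.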

    I didn't have a very good solution to this until I read a paper. It proposes a novel way to perform reads with high throughput and low latency in a Paxos system without sacrificing consistency, which is especially useful in wide-area scenarios. Apart from its main topic, the paper also describes a new way to grant and refresh leases without depending on external clock synchronization.

    [Figure: lease granting and renewal with a guard]

    As the picture above shows, a guard is used to bound the promise duration: if the grantor does not receive a promise_ACK within t_guard, the lease expires at T3 + t_guard + t_lease. If the holder does not receive the promise within t_guard, the lease is never activated at all. Receiving a promise_ACK can only shorten the lease duration. When renewing an active lease, there is no need to send the guard anymore; the most recent promise_ACK plays the role of the guard.

    With this protocol we can use very short leases, because we make no guesses at all. In the evaluation section of the paper, the authors set the lease duration to 2 seconds and let the grantor renew the current lease every 500 ms. In this case, if the holder crashes or becomes unavailable, the lease won't block the system for more than 2 seconds.
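
    The rules above can be sketched as bookkeeping on purely local clocks. This is my own reconstruction from the description, not the paper's code; the constants, names, and exact message pattern are my assumptions:

```python
T_GUARD = 1.0  # arbitrary guard window for the sketch
T_LEASE = 2.0  # the paper's evaluation used a 2-second lease

def grantor_expiry(promise_sent_at, ack_received_at=None):
    """Grantor side, measured on the grantor's own clock.

    Without a promise_ACK, assume the promise may have been delivered
    as late as possible within the guard window, so the promise must
    be honored until promise_sent_at + T_GUARD + T_LEASE.  Receiving
    a promise_ACK can only shorten that bound, never extend it.
    """
    worst_case = promise_sent_at + T_GUARD + T_LEASE
    if ack_received_at is None:
        return worst_case
    return min(worst_case, ack_received_at + T_LEASE)

def holder_expiry(guard_sent_at, promise_received_at):
    """Holder side, measured on the holder's own clock.

    If the promise does not arrive within T_GUARD of the guard, the
    lease never activates.  Otherwise the holder measures T_LEASE from
    the guard, a moment it knows precedes the grantor-side events for
    this lease, so its view of expiry stays behind the grantor's.
    """
    if promise_received_at - guard_sent_at > T_GUARD:
        return None  # lease not activated
    return guard_sent_at + T_LEASE
```

    On renewal, the most recent promise_ACK would take the guard's place as the holder-side anchor. Note that no absolute timestamp ever crosses the network: each side only subtracts its own readings.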


  2. chubby & zookeeper: different consistency levels

    June 12, 2016 by xudifsd

    Many people know zookeeper, a widely used open source project. But few people know chubby, a service used only internally at Google, whose details the public knows only through the paper Google published. Zookeeper provides almost the same functionality as chubby, but there are a few subtle differences between them, which I will discuss in this post. So next time you read a Google paper saying they used chubby as underlying infrastructure, do not treat it as a zookeeper equivalent.

    Zookeeper is a "distributed process coordinator", as O'Reilly puts it. Chubby is a distributed lock service that provides strong consistency. Although both expose a file-system-like API to the user, they provide different levels of consistency, and you can get a clue from those descriptions: "coordinator" is a much weaker word than "lock service".

    Like most such systems, chubby sees few write operations, so its network traffic is dominated by reads and by heartbeats between clients and servers. Because chubby provides coarse-grained locking, caching is necessary; and since chubby is intended to provide strong consistency, it must ensure that clients never see stale data. It achieves both by caching data in the client library: if anyone wants to change a piece of data, chubby ensures that all caches are invalidated before the change takes effect. From the paper, chubby has a rather heavy client library that is an essential part of the system, providing not only connection and session management but also the cache management just described. And it is this cache management that makes the huge difference between zookeeper and chubby: many users assume zookeeper provides the same strong consistency, only to find out later that they are wrong.

    Unlike chubby, zookeeper does not provide such strong consistency; in fact, the native zookeeper client library has no cache at all. What difference can a cache make, you ask? For example: if two zookeeper clients want to agree on a single volatile value, they write to the same path, and the content of that node signifies the value. If client A wants to know the latest value and client B wants to change it, A can either poll the node regularly or use the watch mechanism provided by the native library; let's assume A uses a watch, since that is more efficient. When B changes the value, zookeeper responds to B's change call before notifying client A. Once B gets its response, it might assume that every client watching the value has been notified, but that is not the case: A may not receive the notification promptly due to a slow network, so it is possible for the two clients to see different values at the same time. This is usually not a problem if users do not require strong consistency, but if they do, they end up inventing buggy ad-hoc workarounds. The author of the zookeeper book argues that managing caches in the library would have made zookeeper's design more complex, and could cause zookeeper operations to stall while waiting for a client to acknowledge a cache-invalidation request. It is true that strong consistency comes at a cost, but people find a system easier to use when strong consistency is guaranteed.
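
    The race can be made concrete with a toy in-memory model. This is entirely my own sketch, not zookeeper's API: the server acknowledges the write before watch notifications are delivered, so a watching client can still read a stale value after the writer already has its acknowledgment.

```python
class ToyServer:
    """Minimal model of watch-style notification: a write is
    acknowledged first, and watcher callbacks are merely queued,
    standing in for a slow network."""
    def __init__(self):
        self.value = None
        self.watchers = []
        self.pending = []          # undelivered notifications

    def watch(self, callback):
        self.watchers.append(callback)

    def write(self, value):
        self.value = value
        # Queue notifications instead of delivering them: the writer
        # gets its ack (this return) before any watcher hears anything.
        self.pending.extend((w, value) for w in self.watchers)
        return 'ok'

    def deliver_notifications(self):
        for callback, value in self.pending:
            callback(value)
        self.pending.clear()


class CachingClient:
    """A client that caches the value and relies on watch callbacks."""
    def __init__(self, server):
        self.cache = server.value
        server.watch(self.on_change)

    def on_change(self, value):
        self.cache = value

    def read(self):
        return self.cache


server = ToyServer()
server.write('v1')
a = CachingClient(server)
assert server.write('v2') == 'ok'   # B's write is acknowledged...
assert a.read() == 'v1'             # ...but A still sees the old value
server.deliver_notifications()
assert a.read() == 'v2'
```

    In the chubby model, by contrast, the write would not return 'ok' until every cached copy had been invalidated, so the middle pair of assertions could never both hold.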

    To remedy this, Netflix created the curator library, which later moved to the Apache foundation. It provides commonly used functionality together with cache management; this additional layer allows zookeeper to offer the stronger consistency some users need. So whenever you want to use zookeeper, use the curator library instead of the native one, unless you know what you are doing.


  3. Sliding Window in Substring Problems

    September 28, 2015 by xudifsd

    Many substring problems can be solved with a sliding window; the approach fits problems that ask for the start and end positions of a substring. Note: the "subsequences" meant here are contiguous, like substrings, not general subsequences.

    Take leetcode's Substring with Concatenation of All Words as an example. The simplest approach is to pick a start position and check whether a valid concatenation begins there. The code is as follows:

    #!/usr/bin/env python

    class Solution:
        def findSubstring(self, s, words):
            if len(words) == 0 or len(s) == 0:
                return []
            dic = {}
            count = 0
            for w in words:
                dic[w] = dic.get(w, 0) + 1
                count += 1
            lw = len(words[0])       # all words have the same length
            lws = lw * len(words)    # length of a full concatenation
            result = []
            for i in range(len(s) - lws + 1):
                if self.possible(s, lw, i, dic, count):
                    result.append(i)
            return result

        def possible(self, s, lw, i, dic, count):
            # Check whether a full concatenation starts at index i.
            dic2 = dic.copy()
            while count > 0:
                cur = s[i:i + lw]
                if dic2.get(cur, 0) > 0:
                    dic2[cur] -= 1
                    count -= 1
                    i += lw
                else:
                    return False
            return True
    

    But the time complexity is high, O(n²), and there is a lot of redundant computation. For example, with the input

    "aabbccaadd", ["aa", "bb", "cc", "dd"]

    it would process "bbccaa" twice. The best approach is a sliding window: maintain a start and an end position, and exit the inner loop once the start catches up with the end; otherwise, advance the end position while the window is feasible, and advance the start position when it is not.

    A sliding window reduces the complexity to O(n). In this problem, since all words in the dictionary have the same length, we can move in steps of that length, but we need an extra outer loop to cover windows starting at different offsets. The code is as follows:

    #!/usr/bin/env python

    class Solution(object):
        def findSubstring(self, s, words):
            if len(s) == 0 or len(words) == 0:
                return []
            m = {}
            count = 0
            for w in words:
                m[w] = m.get(w, 0) + 1
                count += 1
            lw = len(words[0])
            result = []
            for delta in range(lw):      # one pass per starting offset
                i = delta                # window start
                while i < len(s) - lw + 1:
                    key = s[i:i + lw]
                    if m.get(key, 0) > 0:
                        m[key] -= 1      # consume the word at the start
                        count -= 1
                        j = i + lw       # window end
                        while i < j:
                            if count == 0:
                                result.append(i)
                            if j < len(s) - lw + 1:
                                key = s[j:j + lw]
                                if m.get(key, 0) > 0:
                                    m[key] -= 1    # grow the window
                                    count -= 1
                                    j += lw
                                    continue
                            m[s[i:i + lw]] += 1    # shrink: give back
                            count += 1
                            i += lw
                    else:
                        i += lw
            return result
    

    Here the outer for loop handles the different starting offsets, and the loop inside it implements the sliding window. The complexity of this implementation is O(lw * n); treating lw as a constant, it is O(n).

    The sliding window applies to many substring-style problems, such as: Minimum Size Subarray Sum, Minimum Window Substring, and Longest Substring Without Repeating Characters.
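
    The same pattern in its simplest form, on Minimum Size Subarray Sum. This sketch is my own, assuming the usual problem statement: the length of the shortest contiguous subarray with sum >= target, or 0 if none exists.

```python
def min_subarray_len(target, nums):
    # Sliding window: grow the end while the sum is too small,
    # shrink the start while the sum is large enough.
    best = float('inf')
    window_sum = 0
    start = 0
    for end, value in enumerate(nums):
        window_sum += value
        while window_sum >= target:
            best = min(best, end - start + 1)
            window_sum -= nums[start]
            start += 1
    return 0 if best == float('inf') else best
```

    Each index enters and leaves the window at most once, which is where the O(n) bound comes from.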


  4. Cope with high latency caused by fork(2)

    July 8, 2015 by xudifsd

    In normal applications, we don't care much about the latency of non-IO system calls, because they are usually very fast. This is all the more true for fork(2), since fork(2) on Linux is much more lightweight than on most other OSes.

    But we hit exactly this problem. The application in question is the master node of a distributed cluster; we fork to checkpoint our internal data structures, and the checkpoint speeds up the recovery phase when the server goes down and comes back up. Because the cluster is very large, the application on the master node consumes a lot of memory: in our production cluster the machine has 200+ GB of memory, and the application consumes almost 200 GB of it. This makes fork(2) exhibit high latency, almost 10 seconds as we observed; the same issue has been documented by redis.
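
    The checkpoint-via-fork pattern itself is simple; here is a minimal sketch (my own, not our production code, with json standing in for the real serialization):

```python
import json
import os

def checkpoint(state, path):
    """Fork a child that writes a copy-on-write snapshot of `state`.

    The parent returns immediately and keeps serving; the child sees a
    frozen snapshot of memory as of the fork and can serialize it at
    leisure.  The only cost left on the serving path is fork(2) itself,
    which is exactly what becomes expensive at 200 GB.
    """
    pid = os.fork()
    if pid == 0:  # child: memory is a COW snapshot taken at fork time
        with open(path, "w") as f:
            json.dump(state, f)
        os._exit(0)
    return pid    # parent: caller may waitpid() on this later
```

    This is the same trick redis uses for RDB snapshots; the appeal is that no locking of the live data structures is needed.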

    Linux implements fork(2) with copy-on-write, but it still needs to copy the page table. 200 GB of memory consists of 200 GB / 4 KB == 50 M pages, and since one page holds 4 KB / 8 B == 512 PTEs, the PTEs alone occupy 50 M / 512 == 100 K pages, which is 400 MB of memory. And the kernel is not just copying those 400 MB; it also has to do some bookkeeping.
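
    The arithmetic, spelled out (assuming 4 KB pages and 8-byte PTEs, as in the text):

```python
MEM = 200 * 2**30   # 200 GB of mapped memory
PAGE = 4 * 2**10    # 4 KB pages
PTE = 8             # bytes per page-table entry

data_pages = MEM // PAGE                  # pages to map: ~50 M
ptes_per_page = PAGE // PTE               # 512 PTEs fit in one page
pt_pages = data_pages // ptes_per_page    # page-table pages: ~100 K
pt_bytes = pt_pages * PAGE                # 400 MB copied at fork
```

    With huge pages the data-page count (and thus the page-table size) would shrink by orders of magnitude, which is one reason huge pages are sometimes suggested for workloads like this.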

    Although Linux introduced lazy copying of page tables in 2.6.14, this laziness only applies to pages backed by a file, which does not help in our case.

    fork(2) also pauses all other threads for the duration of the call. This latency is unacceptable for some of them, especially time-sensitive ones, such as the threads responsible for sending and receiving heartbeats.

    If we do not send or respond to heartbeats for a long time, other nodes will think we are down, and if this server is the primary master, the quorum may run an election for a new primary. To make things worse, the number of 'running' servers may drop below quorum, which leaves the service completely unresponsive during that time.

    Luckily, we only checkpoint on a secondary master, never the primary, so we don't need to deal with the election problem. But since we usually have only 3 masters, we need to make sure there are always 2 masters responding; otherwise the service stops responding.

    Previously we worked around this by making the tolerated heartbeat-miss interval longer than the latency of these time-consuming operations, but that also makes failed-node detection much slower. We also have to make sure the 2 secondary masters are not calling fork(2) at the same time. Currently, we notify the heartbeat thread that a long pause is coming before we actually call fork(2), notify it again after the call, and have that thread pretend it received a heartbeat right after the pause. But these are all workarounds, not real solutions.

    IMHO, this fragility to latency is a design fault. I have tested that forking with 30 GB of memory takes 1 second, so if your application targets that scale, you should take fork(2) latency seriously. In our case, I think we could set up a separate process that only reads the log written by the master on the same machine and generates checkpoints periodically, though this may consume twice the memory; we can treat that as trading space for time. Or we could checkpoint one data structure at a time, which would not consume as much memory but would make the data structures more complicated.

    Well, I'm more and more convinced that

    it’s important to estimate performance of a system design without actually having to build it.

    To make this happen, a programmer needs to be more experienced and more familiar with the latency of ordinary operations. I still have a long way to go.


  5. Optimize a clojure macro

    March 14, 2015 by xudifsd

    Note: if you're a macro guru, you can ignore this post; it just records my path toward mastering clojure macros.

    Many times I find myself using format to generate a string from a string template, with the filled-in values always derived from local variables, so I wondered whether a macro could generate this kind of code for me.

    So I wrote the first version:

    (ns before-opt
      (:require [clojure.string :as str]))
    
    ; matches {a} {a-b} {a_b} but not {:a} {a/b}
    (defonce template-var-re #"\{((?:\w|\-)+)\}")
    
    (defmacro gen-from-tpl
      "gen-from-tpl should be invoked with \"exception {ex} while {method} {url}\",
      and we'll generate new string based on var value in current context env.
      For example (let [foo 1 bar 2] (gen-from-tpl \"{foo}:{bar}\")) => \"1:2\""
      [tpl-str]
      {:pre [(string? tpl-str)]}
      (let [matches (re-seq template-var-re tpl-str)
            str-to-replace (mapv (fn [[orign key]]
                                   [orign (symbol key)])
                                 matches)]
        `(reduce (fn [acc# [orign# val#]]
                   (str/replace acc# orign# (str val#)))
                 ~tpl-str
                 ~str-to-replace)))
    

    with this macro I can use

    (gen-from-tpl "{scheme}://{subdomain}.{host}/foo/bar/{token}")

    instead of

    (format "%s://%s.%s/foo/bar/%s" scheme subdomain host token)

    Yes, this doesn't eliminate much typing, but I don't need to check that the number of "%s" matches the number of arguments, and the template looks prettier. Anyway, I happily used it in a few places, but after some days I realized the code generated from it is not optimal: it produces a reduce call with a string replacement in the reducer fn, and that is a runtime cost, not a compile-time cost, which means it calls reduce to build the wanted string from the template at runtime!

    I want the generated code to be more efficient. To do this, I have to move some computation from runtime to compile time and generate something like

    (str scheme "://" subdomain "." host "/foo/bar/" token)

    this will be much better.

    To achieve this, I need a utility function that partitions a string by a regular expression, returning matched and unmatched segments in order, instead of only the matches like re-seq does. After some searching I found it doesn't exist, so I had to implement it on my own, and finally finished the optimized version:

    (ns after-opt)
    
    ; matches {a} {a-b} {a_b} but not {:a} {a/b}
    (defonce template-var-re #"\{((?:\w|\-)+)\}")
    
    (defn- re-partition
      "partition string by regex"
      [re string]
      {:pre [(string? string)]}
      (letfn [(construct-not-matched [start end]
                {:match? false
                 :result (subs string start end)})
              (construct-matched [start end]
                (let [matched-str (subs string start end)]
                  {:match? true
                   :result (re-matches re matched-str)}))]
        (let [matcher (re-matcher re string)
              str-count (count string)]
          (loop [last-index 0
                 result []]
            (if (.find matcher)
              (let [start (.start matcher)
                    end (.end matcher)
                    cur-result (construct-matched start end)]
                (if (= start last-index)
                  (recur end (conj result cur-result))
                  (recur end (conj result
                                   (construct-not-matched last-index start)
                                   cur-result))))
              (if (= last-index str-count)
                result
                (conj result (construct-not-matched last-index str-count))))))))
    
    (defmacro gen-from-tpl
      "gen-from-tpl should be invoked with something like
      \"exception {ex} while {method} {url}\"
      and we'll generate new string based on var value in current context env.
      For example: (let [foo 1 bar 2] (gen-from-tpl \"{foo}:{bar}\")) => \"1:2\""
      [tpl-str]
      {:pre [(string? tpl-str)]}
      (let [partitions (re-partition template-var-re tpl-str)
            string-and-symbol (map (fn [{:keys [match? result]}]
                                     (if match?
                                       (-> result second symbol)
                                       result))
                                   partitions)]
        `(str ~@string-and-symbol)))
    

    re-partition is indeed a little ugly, but it deals with Java objects and has to handle some corner cases; this is the best I can do.

    The problem with this macro is that the argument must be a string literal; a variable holding a string won't work. We could get around this by probing the argument's type, generating the optimized code for string literals and code that runs re-partition at runtime for anything else, but since I don't need that ability, I didn't do it. Maybe you can have a try.

    The best thing about macros: you can use Lisp functions at both runtime and compile time. I really appreciate Lisp's homoiconicity now.