Posts Tagged ‘programming’

  1. Use percentile in performance stats

    May 8, 2016 by xudifsd

    Here is a fun story of mine: last year, while preparing for Google Summer of Code with the finagle project, I ran into a class I didn’t understand. It is not long, but from the source and comments alone I could not figure out its usage or why it was useful. Since it was not a very important class, I skipped it and read other parts of the project. Unfortunately, finagle was not accepted as an organization in GSoC 2015 (what a shame), so I stopped my reading.

    This year I took a full-time job working on a distributed system, my favorite field in computer programming, and I have been working hard to do my best. Recently I read the post <Notes on Distributed Systems for Young Bloods>, which lists many things newcomers to distributed systems should pay attention to. It has a lot of valuable advice; in particular, it emphasizes the importance of metrics: to understand the behaviour of the system, we, as system engineers, should expose metrics, and when doing so we should expose not only averages but also percentiles. This reminded me that the mysterious class I had met in finagle had something to do with metrics and stats, so I read it again, and found that I can now fully understand the class and its usage.
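
    A toy example of why averages alone mislead (the numbers are made up):

    latencies = [10] * 99 + [2000]  # one very slow request in a hundred
    avg = sum(latencies) / float(len(latencies))         # 29.9  -- looks healthy
    p99 = sorted(latencies)[int(0.99 * len(latencies))]  # 2000  -- the real story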

    From <The Tail at Scale> I understood that it is critical for a system to keep tail latency low, since otherwise the overall tail latency makes the system intolerably slow. But I had failed to see the implication: we should expose latency data, especially tail latency data, so that people can tune the system. Well, no wonder finagle needs that class.

    While searching for a better algorithm to implement percentile metrics, I found another interesting paper, <Effective Computation of Biased Quantiles over Data Streams>, which proposes a way to calculate quantiles over streams with finer error guarantees at higher ranks. It is very efficient, with an O(log n) space bound, but it still does not fit the usual needs of performance stats: a server runs for a long time, the stream of metric data is unbounded, and the algorithm would consume more and more memory the longer the server keeps running. Here BucketedHistogram in finagle performs great: we specify the error we can tolerate through the bucket limits, and if we want more precise data at higher ranks we simply add more slots to the higher-rank limits. It is very easy to understand and, what’s more, it uses constant space, although the constant is a little large.
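
    To make that concrete, here is a minimal sketch of the idea in Python, not finagle’s actual code: bucket upper bounds (“limits”) are fixed up front, recording a sample is one counter increment, and a percentile is read by walking the buckets, so memory stays constant no matter how long the server runs.

    import bisect

    class BucketedHistogram:
        def __init__(self, limits):
            self.limits = sorted(limits)           # bucket upper bounds
            self.counts = [0] * (len(limits) + 1)  # +1 overflow bucket
            self.total = 0

        def add(self, value):
            # first bucket whose upper bound can hold the value
            self.counts[bisect.bisect_left(self.limits, value)] += 1
            self.total += 1

        def percentile(self, p):
            # upper bound of the bucket containing the p-quantile, 0 < p <= 1
            needed = p * self.total
            seen = 0
            for i, c in enumerate(self.counts):
                seen += c
                if seen >= needed:
                    return self.limits[i] if i < len(self.limits) else float('inf')

    # denser limits at the high end give finer error guarantees for the tail
    h = BucketedHistogram([1, 2, 5, 10, 20, 50, 100, 200, 500, 900, 990, 1000])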


  2. Sliding window for substring problems

    September 28, 2015 by xudifsd

    Many substring problems can be solved with a sliding window; the approach fits whenever you need to find the start and end positions of a substring. Note that the problems discussed here concern contiguous substrings, not subsequences.

    Take LeetCode’s Substring with Concatenation of All Words as an example. The simplest approach is to pick a start position and check whether a valid window begins there, as in the code below:

    #!/usr/bin/env python

    class Solution:
        def findSubstring(self, s, words):
            if len(words) == 0 or len(s) == 0:
                return []
            dic = {}   # word -> how many times it must appear
            count = 0  # total number of words to match
            for w in words:
                dic[w] = dic.get(w, 0) + 1
                count += 1
            lw = len(words[0])     # all words have the same length
            lws = lw * len(words)  # length of a full window
            result = []
            for i in xrange(len(s) - lws + 1):
                if self.possible(s, lw, i, dic, count):
                    result.append(i)
            return result

        def possible(self, s, lw, i, dic, count):
            # check whether a valid window starts at index i
            dic2 = dic.copy()
            while count > 0:
                cur = s[i:i+lw]
                if dic2.get(cur, 0) > 0:
                    dic2[cur] -= 1
                    count -= 1
                    i += lw
                else:
                    return False
            return True
    

    But the time complexity is high, O(n²), and there is a lot of redundant computation. For example, with the input

    “aabbccaadd”, [“aa”, “bb”, “cc”, “dd”]

    it processes “bbccaa” twice. The better approach is a sliding window: maintain a start and an end position, exiting the loop once they meet; otherwise, extend the end position while the window is still feasible, and advance the start position when it is not.

    A sliding window brings the complexity down to O(n). In this problem all the words in the dictionary have the same length, so the window can move in steps of that length, but an extra outer loop is needed to cover windows starting at each possible offset. The code:

    #!/usr/bin/env python

    class Solution(object):
        def findSubstring(self, s, words):
            if len(s) == 0 or len(words) == 0:
                return []
            m = {}     # word -> count still available outside the window
            count = 0  # words still needed to complete a window
            for w in words:
                m[w] = m.get(w, 0) + 1
                count += 1
            lw = len(words[0])
            result = []
            for delta in xrange(lw):  # one pass per word alignment
                i = delta             # window start
                while i < len(s) - lw + 1:
                    key = s[i:i + lw]
                    if m.get(key, 0) > 0:
                        # the word fits: consume it, then grow the end j
                        m[key] -= 1
                        count -= 1
                        j = i + lw
                        while i < j:
                            if count == 0:  # window contains every word
                                result.append(i)
                            if j < len(s) - lw + 1:
                                key = s[j:j + lw]
                                if m.get(key, 0) > 0:
                                    m[key] -= 1
                                    count -= 1
                                    j += lw
                                    continue
                            # cannot grow the end: release the word at the
                            # start and slide the window forward
                            m[s[i:i + lw]] += 1
                            count += 1
                            i += lw
                    else:
                        i += lw
            return result
    

    Here, the outer for loop handles the different starting offsets, and the loop inside it implements the sliding window. The complexity of this implementation is O(lw * n), which is O(n) if lw is treated as a constant.
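
    A quick check with the input from above (Python 2, matching the xrange in the code):

    print Solution().findSubstring("aabbccaadd", ["aa", "bb", "cc", "dd"])
    # => [2], the start of "bbccaadd"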

    Sliding windows work for many substring problems, e.g.: Minimum Size Subarray Sum, Minimum Window Substring, Longest Substring Without Repeating Characters.


  3. Cope with high latency caused by fork(2)

    July 8, 2015 by xudifsd

    In normal applications we do not care much about the latency of non-IO system calls, because they are usually very fast. This is especially true of fork(2), since fork(2) on Linux is much more lightweight than on most other OSs.

    But we ran into exactly this problem. The application in question is the master node server of a distributed cluster; we fork to checkpoint our internal data structures, and the checkpoint speeds up the recovery phase when the server goes down and comes back up. Because the cluster is very large, the application on the master node consumes a lot of memory: our production machines have 200+G of memory, and the application consumed almost 200G of it. This makes fork(2) exhibit high latency, almost 10s in our observation, which has also been documented by redis.

    Linux uses copy-on-write in fork(2), but it still has to copy the page table. 200G of memory consists of 200G/4K == 50M pages, and since one page holds 4K/8 == 512 PTEs, the PTEs alone occupy 50M/512 == 100K pages, i.e. 400M of memory. And the kernel is not just copying those 400M; it also has to do some bookkeeping.
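
    The arithmetic is easy to check in a few lines of Python (assuming 4K pages and 8-byte PTEs, as on x86-64):

    GiB = 1 << 30
    page_size = 4 << 10                  # 4K pages
    pte_size = 8                         # bytes per page table entry

    pages = 200 * GiB // page_size       # 52428800, ~50M pages
    pte_bytes = pages * pte_size         # 419430400, ~400M for PTEs alone
    pte_pages = pte_bytes // page_size   # 102400, ~100K pages of PTEs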

    Although Linux introduced lazy page-table copying in 2.6.14, the laziness only works for pages backed by a file, which does not apply in our case.

    fork(2) also pauses all other threads for the duration of the call. This latency is unacceptable for some threads, especially the time-sensitive ones, for example the threads responsible for sending and receiving heartbeats.

    If we neither send nor answer heartbeats for a long time, other nodes will think we are down, and if this server is the primary master, the quorum may run an election for a new primary. To make it worse, the number of ‘running’ servers may drop below quorum, which leaves the service completely unresponsive during that time.

    Luckily, we only checkpoint on a secondary master, never on the primary, so we do not need to deal with the election problem. But since we usually have only 3 masters, we must make sure that 2 masters are always responding; otherwise the service stops responding.

    Previously we worked around this by making the heartbeat-miss tolerance longer than the latency of these time-consuming operations, but that also makes detecting a failing node much slower, and we still need to make sure the two secondary masters do not fork(2) at the same time. Currently, we notify the heartbeat thread that a long pause is coming just before we call fork(2), notify it again after the call, and in that thread we pretend the heartbeat arrived right after the pause. But these are all workarounds, not real solutions.
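
    A sketch of the current workaround, in Python for brevity (the names and the Event-based signalling are illustrative, not our actual code):

    import os, time, threading

    long_pause = threading.Event()       # set while fork(2) may stall us
    last_heartbeat = [time.time()]

    def heartbeat_loop(timeout=5.0):
        while True:
            if long_pause.is_set():
                # pretend the heartbeat arrived just after the pause,
                # instead of declaring the peer dead
                last_heartbeat[0] = time.time()
            elif time.time() - last_heartbeat[0] > timeout:
                report_node_failure()    # hypothetical failure handler
            time.sleep(0.5)

    def checkpoint():
        long_pause.set()                 # warn before the pause...
        pid = os.fork()                  # ...which can take ~10s at 200G
        if pid == 0:
            dump_state()                 # hypothetical: child writes checkpoint
            os._exit(0)
        long_pause.clear()               # ...and signal again after the call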

    IMHO, this fragility to latency is a design fault. I have tested that forking with 30G of memory takes about 1s, so if your application targets that scale, you should take fork(2) latency seriously. In our case, we could set up a separate process on the same machine that only reads the log the master writes out and generates checkpoints periodically, though that may consume twice the memory; we can treat it as trading space for time. Or we could checkpoint only one data structure at a time, which would not consume as much memory but would make the data structures more complicated.

    Well, I’m more and more convinced by

    it’s important to estimate performance of a system design without actually having to build it.

    To make that possible, a programmer has to be experienced and familiar with the latency of ordinary operations. I still have a long way to go.


  4. Optimize a clojure macro

    March 14, 2015 by xudifsd

    Note: if you are a macro guru you can skip this post; it just records a step on my path to mastering Clojure macros.

    I often find myself using format to generate a string from a template, where the filled-in values are always derived from local variables, so I wondered whether a macro could generate this kind of code for me.

    So I wrote the first version:

    (ns before-opt
      (:require [clojure.string :as str]))
    
    ; matches {a} {a-b} {a_b} but not {:a} {a/b}
    (defonce template-var-re #"\{((?:\w|\-)+)\}")
    
    (defmacro gen-from-tpl
      "gen-from-tpl should invoked with \"exception {ex} while {method} {url}\",
      and we'll generate new string based on var value in current context env.
      For example (let [foo 1 bar 2] (gen-from-tpl \"{foo}:{bar}\")) => \"1:2\""
      [tpl-str]
      {:pre [(string? tpl-str)]}
      (let [matches (re-seq template-var-re tpl-str)
            str-to-replace (mapv (fn [[orign key]]
                                   [orign (symbol key)])
                                 matches)]
        `(reduce (fn [acc# [orign# val#]]
                   (str/replace acc# orign# (str val#)))
                 ~tpl-str
                 ~str-to-replace)))
    

    with this macro I can use

    (gen-from-tpl "{scheme}://{subdomain}.{host}/foo/bar/{token}")

    instead of

    (format "%s://%s.%s/foo/bar/%s" scheme subdomain host token)

    Yes, this doesn’t eliminate much typing, but I no longer need to check that the number of “%s” matches the number of args, and the template looks prettier. Anyway, I happily used it in some places, but after a few days I realized the code it generates is not optimal: it expands into a reduce call with a string replace in the reducer fn. That is a runtime cost, not a compile-time cost, meaning the wanted string is rebuilt from the template by reduce on every call!
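
    You can see this by macroexpanding a call; the expansion looks roughly like the following (gensym names elided):

    user=> (macroexpand '(gen-from-tpl "{foo}:{bar}"))
    (clojure.core/reduce
      (clojure.core/fn [acc [orign val]]
        (clojure.string/replace acc orign (clojure.core/str val)))
      "{foo}:{bar}"
      [["{foo}" foo] ["{bar}" bar]])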

    I wanted the generated code to be more efficient. To do that, I had to move some computation from runtime to compile time and generate something like

    (str scheme "://" subdomain "." host "/foo/bar/" token)

    This would be much better.

    To achieve this, I needed a utility function that partitions a string by a regular expression, returning matched and unmatched pieces in order, instead of only the matches like re-seq does. After some searching I found it does not exist, so I implemented it myself and finished the optimized version:

    (ns after-opt)
    
    ; matches {a} {a-b} {a_b} but not {:a} {a/b}
    (defonce template-var-re #"\{((?:\w|\-)+)\}")
    
    (defn- re-partition
      "partition string by regex"
      [re string]
      {:pre [(string? string)]}
      (letfn [(construct-not-matched [start end]
                {:match? false
                 :result (subs string start end)})
              (construct-matched [start end]
                (let [matched-str (subs string start end)]
                  {:match? true
                   :result (re-matches re matched-str)}))]
        (let [matcher (re-matcher re string)
              str-count (count string)]
          (loop [last-index 0
                 result []]
            (if (.find matcher)
              (let [start (.start matcher)
                    end (.end matcher)
                    cur-result (construct-matched start end)]
                (if (= start last-index)
                  (recur end (conj result cur-result))
                  (recur end (conj result
                                   (construct-not-matched last-index start)
                                   cur-result))))
              (if (= last-index str-count)
                result
                (conj result (construct-not-matched last-index str-count))))))))
    
    (defmacro gen-from-tpl
      "gen-from-tpl should invoked with something like
      \"exception {ex} while {method} {url}\"
      and we'll generate new string based on var value in current context env.
      For example: (let [foo 1 bar 2] (gen-from-tpl \"{foo}:{bar}\")) => \"1:2\""
      [tpl-str]
      {:pre [(string? tpl-str)]}
      (let [partitions (re-partition template-var-re tpl-str)
            string-and-symbol (map (fn [{:keys [match? result]}]
                                     (if match?
                                       (-> result second symbol)
                                       result))
                                   partitions)]
        `(str ~@string-and-symbol)))
    

    re-partition is admittedly a little ugly, but it works with a Java Matcher object and has to handle some corner cases; this is the best I could do.

    The limitation of this macro is that the argument must be a string literal; a variable holding a string will not work. We could get around this by probing the argument’s type, generating the optimal code for a string literal and code that does the re-partition at runtime for anything else; since I do not need that ability, I did not do it, but you may want to have a try.

    The best thing about macros: you can use Lisp functions at both runtime and compile time. I really appreciate the homoiconicity now.


  5. Simple cache using binding & memoize in Clojure

    January 4, 2015 by xudifsd

    In this post I will talk about a simple way to do caching in a Clojure application.

    I am building an API server in Clojure. Many permission checks must be done before serving a request, so I abstracted them into ACL checkers. Our server is built on the ring SPEC, and because I wanted the ACL checkers to be independent and reusable, I made each checker a ring handler that passes the request on to the actual handler only if the ACL check succeeds. This is very sweet because I can combine several ACL checkers just like ring middleware, and it also makes the permission requirements very obvious in the router.

    The problem with this, however, is that the ACL checkers do many redundant database reads. For example, suppose we have the following two checkers (sketched below with illustrative names):
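
    ;; a minimal reconstruction: require-login, require-admin and
    ;; model/get-user are illustrative names, not the original code
    (defn require-login [handler]
      (fn [request]
        (if (model/get-user (:session request))         ; database read
          (handler request)
          {:status 401})))

    (defn require-admin [handler]
      (fn [request]
        (let [user (model/get-user (:session request))] ; same read again
          (if (:admin? user)
            (handler request)
            {:status 403}))))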

    Many APIs need both checkers at once, and clearly both need to fetch the user from the database, and what they fetch is identical, because the same request is passed to both! We have many cases like this. The most obvious way to reduce database reads is to add a cache server in front of the database with a write-back or write-through policy, but that requires changing the architecture and many models’ code. I just wanted a simple cache that works for this case and does not require global changes.

    Then I realized I could combine binding & memoize to construct a simple cache. The basic idea is not to invoke the model functions directly: before the checks, I bind dynamic vars to memoized versions of the model functions, and the checkers invoke the dynamic vars instead of the model functions (sketched below with illustrative names):
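
    ;; a minimal sketch: *get-user* and wrap-request-cache are illustrative names
    (def ^:dynamic *get-user* model/get-user)

    (defn wrap-request-cache [handler]
      (fn [request]
        ;; a fresh memoized cache per request: duplicated reads within
        ;; one request hit the database only once
        (binding [*get-user* (memoize model/get-user)]
          (handler request))))

    ;; checkers now call (*get-user* ...) instead of model/get-user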

    The advantage of this is that I did not need to refactor much code: just bind the dynamic var and replace the actual function name with it. The optimization was quite successful: I did not spend much time refactoring the code structure, I did not complicate the code, and the profiling results show 1~2 fewer database reads on every request.

    I think we can use this technique whenever the following conditions are met:

    • the data only needs to be read, never written
    • there are many duplicated reads
    • the cache does not need to be long-lived

    So, after this optimization, I also optimized our email notification system with the same technique: the notification system is scheduled at a fixed rate, and it also reads the database many times with many duplicated reads.