RSS Feed

Posts Tagged ‘optimization’

  1. Optimize a clojure macro

    March 14, 2015 by xudifsd

    Note, if you’re a macro guru, you can ignore this post, this is just used to record my learning path to master clojure macro.

    Many times, I find myself using format to generate string from string template, and values filled in are always derived from local variables, so I’m wondering if I can use macro to generate this kind of code for me.

    So I wrote the first version:

     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    (ns before-opt
      (:require [clojure.string :as str]))
    
    ; matches {a} {a-b} {a_b} but not {:a} {a/b}
    (defonce template-var-re #"\{((?:\w|\-)+)\}")
    
    (defmacro gen-from-tpl
      "gen-from-tpl should invoked with \"exception {ex} while {method} {url}\",
      and we'll generate new string based on var value in current context env.
      For example (let [foo 1 bar 2] (gen-from-tpl \"{foo}:{bar}\")) => \"1:2\""
      [tpl-str]
      {:pre [(string? tpl-str)]}
      (let [matches (re-seq template-var-re tpl-str)
            str-to-replace (mapv (fn [[orign key]]
                                   [orign (symbol key)])
                                 matches)]
        `(reduce (fn [acc# [orign# val#]]
                   (str/replace acc# orign# (str val#)))
                 ~tpl-str
                 ~str-to-replace)))
    

    with this macro I can use

    (gen-from-tpl "{scheme}://{subdomain}.{host}/foo/bar/{token}")

    instead of

    (format "%s://%s.%s/foo/bar/%s" scheme subdomain host token)

    Yes, this doesn’t eliminate much typing, but I don’t need to check if the number of “%s” matches the number of args, and template looks more pretty. Anyway I happily used it in some places, but after some days, I realized the code generated from it is not optimal: it generate a reduce call with string replace in reducer fn, and this is runtime cost, not compile time cost, which means it will call reduce to generate wanted string from template at runtime!

    I want generated code to be more efficient, to do this, I have to move some computation from runtime to compile time, and generate something like

    (str scheme "://" subdomain "." host "/foo/bar/" token)

    this will be much better.

    To achieve this, I need one util function that partition string with regular expression, it will return matched string and unmatched string sequentially, instead of just return matched like re-seq do. After some searches, I found it doesn’t exist, so I have to implement it on my own, and finally finished optimized version:

     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    28
    29
    30
    31
    32
    33
    34
    35
    36
    37
    38
    39
    40
    41
    42
    43
    44
    45
    46
    47
    (ns after-opt)
    
    ; matches {a} {a-b} {a_b} but not {:a} {a/b}
    (defonce template-var-re #"\{((?:\w|\-)+)\}")
    
    (defn- re-partition
      "partition string by regex"
      [re string]
      {:pre [(string? string)]}
      (letfn [(construct-not-matched [start end]
                {:match? false
                 :result (subs string start end)})
              (construct-matched [start end]
                (let [matched-str (subs string start end)]
                  {:match? true
                   :result (re-matches re matched-str)}))]
        (let [matcher (re-matcher re string)
              str-count (count string)]
          (loop [last-index 0
                 result []]
            (if (.find matcher)
              (let [start (.start matcher)
                    end (.end matcher)
                    cur-result (construct-matched start end)]
                (if (= start last-index)
                  (recur end (conj result cur-result))
                  (recur end (conj result
                                   (construct-not-matched last-index start)
                                   cur-result))))
              (if (= last-index str-count)
                result
                (conj result (construct-not-matched last-index str-count))))))))
    
    (defmacro gen-from-tpl
      "gen-from-tpl should invoked with something like
      \"exception {ex} while {method} {url}\"
      and we'll generate new string based on var value in current context env.
      For example: (let [foo 1 bar 2] (gen-from-tpl \"{foo}:{bar}\")) => \"1:2\""
      [tpl-str]
      {:pre [(string? tpl-str)]}
      (let [partitions (re-partition template-var-re tpl-str)
            string-and-symbol (map (fn [{:keys [match? result]}]
                                     (if match?
                                       (-> result second symbol)
                                       result))
                                   partitions)]
        `(str ~@string-and-symbol)))
    

    re-partition is indeed a little ugly, but it involves java object, and have to handle some corner cases, this is the best I can do.

    The problem with this macro is the argument must be string literal, variable that have string value won’t works. We can get around this by probing its argument type, and generate optimal code on string or code that doing re-partition at runtime on other types, but since I don’t need that ability, I didn’t do that, maybe you can have a try.

    Best thing about macro: you can use Lisp function at both runtime and compile-time. I’m very appreciate its homoiconicity now.


  2. Simple cache using binding & memoize in Clojure

    January 4, 2015 by xudifsd

    In this post I will talk about a simple way to do cache in Clojure application.

    I am building an API server in Clojure, a lot of permission check should be done before serving the request, so I abstract it to some kind of ACL checker. Our server is built with ring SPEC, and because I want the ACL checker could be independent and reusable, I just abstract ACL checker as a ring handler, and only passes request to actual handler if it passed ACL check, this is very sweet because I can combine several ACL checker together just like ring middleware, this also makes permission requirements very obvious in router.

    The problem with this, however, is ACL checker will do many redundant database reads. For example, we have following ACL checker:

    and many APIs need these two checkers at the same time, but clearly these two checkers all need to get user from database, and what they get is identical, because the same request passed to them! And we have many cases like this, the most obvious way to reduce database read is add extra cache server before database, and deploy write-back or write-through policy to serve the write, but this requires changing the architecture and many models’ code. I just want a simple cache that just work for this case, and do not require global changes.

    Then I realized I can combine binding & memoize to construct a simple cache. The basic idea is I will not invoke model function directly, I just binding many dynamic vars with memoized model function before the check, and invoke dynamic var in checker instead of model function:

    The advantage of this is I don’t need to refactor much code, just binding the dynamic name and replace actually function name with it. This optimization is quite sucessful: I didn’t spend much time to refactor code structure, and also didn’t complicated code, but the profile result shows I reduced 1 ~ 2 database read in every request.

    I think we can use this technique whenever following condition meet:

    • only needs to read
    • many duplicated reads
    • cache is not long-lived

    So, after this optimization, I also optimized our email notification system with same technique, because our notification system scheduled at fixed rate, and it also needs to read database many times with many duplicated reads.


  3. 使用swap进行内存优化

    May 30, 2014 by xudifsd

    上次优化后,我们的程序又一次被内存问题所烦。上次使用惰性解决了内存消耗的问题,但是由于上次我们只是需要使用AST一次,使用完后就可以将其删除,比较简单。但是现在问题更加复杂了一些。

    由于我们做的是符号执行,程序需要像JVM一样在运行中保存应用的所有类。但又由于我们没有对内存中的类的表示也就是AST做任何优化,所以非常耗内存。因此也可以说这次的优化就是在给之前的代码擦屁股。只是因为这次擦得漂亮,而且有一定的通用性,值得写点东西记录下。

    和上次一样,这次最耗内存的结构还是那个AST,但是又由于我们需要保存所有的AST,所以之前惰性的优化有些用,但是也只是让我们死得更慢些而已。优化AST不可取就是因为工作量会很大,而且之前的AST经过了很严密的测试,不会有问题,如果优化后AST错了也会很难debug。于是我们想到了直接把不常使用的AST像操作系统那样swap out到磁盘上,在需要时再swap in即可。而且最棒的是我们之前使用Map来存所有的AST,所以这次优化只需要写一个实现了Map接口的类即可,可以做到只需要修改现有的很少代码。

    由于实现了Map接口,而且功能是swap一部分内容出内存,因此取名为SwapMap。分离出来的SwapMap的代码见。需要注意的是这个类对使用有很多限制:首先是这个类只有在key很小,value很大时才会起作用,因为它会一直保存key在内存中,而只对value优化。其次是我们虽然将value扫出,但是却没法像操作系统那样将其所使用的内存也收回,只有等待JVM把内存GC,所以使用这个类的其他代码也必须小心地处理引用问题,如果一直有对value的引用,那么即使value被扫出也没法释放内存。当然还有个要求就是value必须实现Serializable接口。

    当前这个类只实现了get和put方法,其他方法应该能比较简单地实现,只是由于我们不需要而已。另外需要强调的是这个类的作者不是我,而是我们实验室的YKG

    这是我们优化前的运行图,内存使用随着AST的生成一直在增长,最后到1.5G时基本挂掉。

    优化前

    这是我们优化后的图,基本上能做到只需要很少的内存即可,因为执行有很强的局部性,所以运行时只需要执行的AST在内存中,其他的AST都可以直接扫出。所以曲线会增长一点然后等AST使用完后在被扫出时下降,这样的内存使用比优化前不知道少了多少倍。

    优化后

    可以看到效果很好嘛(其实就是因为之前太懒造成AST结构太大才能显现这么好的优化效果)。

    这样的优化和之前的优化一样,都是对整个结构的优化,效果会非常好,但是这样的机会并不多,对于更加细致的优化也更加需要能力,而且需要十分清楚程序的内部热点。这里推荐下VisualVM,可以很清楚地显示每个对象占用的内存,不过比较坑爹的就是它不会将对象的成员变量也算入该对象所占的内存,除非变量是非引用类型。


  4. 记一次内存优化

    January 3, 2014 by xudifsd

    我们现在在做一个安卓应用的分析,需要读取apk文件,再通过工具分析成一个抽象语法树(以下简称为A树),之后再做一层和编译器相似的翻译,将A树翻译成我们的抽象语法树(以下简称为B树)。和java类似,apk中每个类都生成一个文件,而我们处理的单位也是这样的文件。为了模块化我们把分析和翻译分成两个独立步骤,所以分析函数和翻译函数的产出是一个装着各自结果的链表。

    现在为了验证树的正确性,我们写了个B树的输出,再打包回去运行。所以我们总共有三个分开的步骤:分析、翻译和输出。

    对于一些小型的apk这样做没什么问题,但前几天测一些大应用时会因为OOM挂掉,后来把jvm虚拟机堆大小设成2G才不会挂,拿工具看了下运行时的内存消耗峰值快达到1.5G,如下图

    优化前

    本来耗点内存也就算了,懒得调,不过之后我们准备批量测试apk,而跑测试的机子才2G内存,稍微跑大点的apk就挂可受不了。 之前也试了各种各样的优化方法,比如在翻译过程时破坏性地访问A树所在链表,翻译出一个B树就将相应的A树删除,或在链表没用后马上赋值为null。但是这种优化并没有带来很多效果,内存消耗还是很大,后来有人说应该将这两个独立步骤合并,因为我们现在的目的只是输出B树,所以可以写一个独立的函数,一次一个地分析出A树并马上翻译成B树再输出,这样内存中同时存在的树最多就两个文件产生的树,而不是之前的一个项目所有文件产生的树,内存消耗就会大大减少。

    但是总觉得这样的做法并不优雅:需要维护两个内容基本相同的模块,而且目的变了后之前代码完全没法复用。最好的方法就是在不增加冗余代码并且不修改大体框架的情况下实现这种优化。

    昨天晚上又看了看整个框架,发现这种流程不就是lisp里面的map么:先以解析函数和文件的链表作为参数传给map,将翻译函数传给map再过一遍之前的输出,之后把输出函数传给map再过一遍前面的输出。

    想到map我就想到了之前被clojure的惰性map坑过一次:将函数传给map,之后一直等它输出却没有任何反应,调了很久才发现原来是惰性的问题。突然发现惰性求值不就是我们需要的么:可以将解析函数处理成惰性的,把翻译函数也处理成惰性的,在输出阶段再打破这种惰性就行了,因为只有到了输出阶段才真正地需要树的结构。这样一来我们之前的架构也可以在基本不修改的情况下使用这样的优化,更妙的是这样的架构之后也能复用,只有在真正需要树的情况下再打破这种惰性即可,这样不管以后的目的如何都能保证在内存中不会完整保留整个项目的树结构,除非真正用到整个树,即使访问了整个树也可以根据需求释放内存,像输出的这种情况,输出后就将对应的树删除即可。

    优化点想到后实现了下,删除了100多行,增加了60多行就实现了,而且效果惊人,内存使用的峰值从1.5G下降到500M。如下图

    优化后

    实现之前还想自己写个接口做这种惰性求值,后来发现Callable和Runnable接口就能直接拿来干这事,打破惰性时还能直接扔给ExecutorService多线程跑。

    这是迄今为止我做优化最成功的一次了,不仅效果惊人而且还挺优雅。以前做优化都什么效果,有的优化后甚至还差了。看来以后还是得真正了解到优化点并且有比较优雅的解决方案时再着手优化,否则不仅效果差而且最后丑到自己都不想看了。