RSS Feed

Posts Tagged ‘clojure’

  1. Optimize a clojure macro

    March 14, 2015 by xudifsd

    Note, if you’re a macro guru, you can ignore this post, this is just used to record my learning path to master clojure macro.

    Many times, I find myself using format to generate string from string template, and values filled in are always derived from local variables, so I’m wondering if I can use macro to generate this kind of code for me.

    So I wrote the first version:

     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    (ns before-opt
      (:require [clojure.string :as str]))
    
    ; matches {a} {a-b} {a_b} but not {:a} {a/b}
    (defonce template-var-re #"\{((?:\w|\-)+)\}")
    
    (defmacro gen-from-tpl
      "gen-from-tpl should invoked with \"exception {ex} while {method} {url}\",
      and we'll generate new string based on var value in current context env.
      For example (let [foo 1 bar 2] (gen-from-tpl \"{foo}:{bar}\")) => \"1:2\""
      [tpl-str]
      {:pre [(string? tpl-str)]}
      (let [matches (re-seq template-var-re tpl-str)
            str-to-replace (mapv (fn [[orign key]]
                                   [orign (symbol key)])
                                 matches)]
        `(reduce (fn [acc# [orign# val#]]
                   (str/replace acc# orign# (str val#)))
                 ~tpl-str
                 ~str-to-replace)))
    

    with this macro I can use

    (gen-from-tpl "{scheme}://{subdomain}.{host}/foo/bar/{token}")

    instead of

    (format "%s://%s.%s/foo/bar/%s" scheme subdomain host token)

    Yes, this doesn’t eliminate much typing, but I don’t need to check if the number of “%s” matches the number of args, and template looks more pretty. Anyway I happily used it in some places, but after some days, I realized the code generated from it is not optimal: it generate a reduce call with string replace in reducer fn, and this is runtime cost, not compile time cost, which means it will call reduce to generate wanted string from template at runtime!

    I want generated code to be more efficient, to do this, I have to move some computation from runtime to compile time, and generate something like

    (str scheme "://" subdomain "." host "/foo/bar/" token)

    this will be much better.

    To achieve this, I need one util function that partition string with regular expression, it will return matched string and unmatched string sequentially, instead of just return matched like re-seq do. After some searches, I found it doesn’t exist, so I have to implement it on my own, and finally finished optimized version:

     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    28
    29
    30
    31
    32
    33
    34
    35
    36
    37
    38
    39
    40
    41
    42
    43
    44
    45
    46
    47
    (ns after-opt)
    
    ; matches {a} {a-b} {a_b} but not {:a} {a/b}
    (defonce template-var-re #"\{((?:\w|\-)+)\}")
    
    (defn- re-partition
      "partition string by regex"
      [re string]
      {:pre [(string? string)]}
      (letfn [(construct-not-matched [start end]
                {:match? false
                 :result (subs string start end)})
              (construct-matched [start end]
                (let [matched-str (subs string start end)]
                  {:match? true
                   :result (re-matches re matched-str)}))]
        (let [matcher (re-matcher re string)
              str-count (count string)]
          (loop [last-index 0
                 result []]
            (if (.find matcher)
              (let [start (.start matcher)
                    end (.end matcher)
                    cur-result (construct-matched start end)]
                (if (= start last-index)
                  (recur end (conj result cur-result))
                  (recur end (conj result
                                   (construct-not-matched last-index start)
                                   cur-result))))
              (if (= last-index str-count)
                result
                (conj result (construct-not-matched last-index str-count))))))))
    
    (defmacro gen-from-tpl
      "gen-from-tpl should invoked with something like
      \"exception {ex} while {method} {url}\"
      and we'll generate new string based on var value in current context env.
      For example: (let [foo 1 bar 2] (gen-from-tpl \"{foo}:{bar}\")) => \"1:2\""
      [tpl-str]
      {:pre [(string? tpl-str)]}
      (let [partitions (re-partition template-var-re tpl-str)
            string-and-symbol (map (fn [{:keys [match? result]}]
                                     (if match?
                                       (-> result second symbol)
                                       result))
                                   partitions)]
        `(str ~@string-and-symbol)))
    

    re-partition is indeed a little ugly, but it involves java object, and have to handle some corner cases, this is the best I can do.

    The problem with this macro is the argument must be string literal, variable that have string value won’t works. We can get around this by probing its argument type, and generate optimal code on string or code that doing re-partition at runtime on other types, but since I don’t need that ability, I didn’t do that, maybe you can have a try.

    Best thing about macro: you can use Lisp function at both runtime and compile-time. I’m very appreciate its homoiconicity now.


  2. Simple cache using binding & memoize in Clojure

    January 4, 2015 by xudifsd

    In this post I will talk about a simple way to do cache in Clojure application.

    I am building an API server in Clojure, a lot of permission check should be done before serving the request, so I abstract it to some kind of ACL checker. Our server is built with ring SPEC, and because I want the ACL checker could be independent and reusable, I just abstract ACL checker as a ring handler, and only passes request to actual handler if it passed ACL check, this is very sweet because I can combine several ACL checker together just like ring middleware, this also makes permission requirements very obvious in router.

    The problem with this, however, is ACL checker will do many redundant database reads. For example, we have following ACL checker:

    and many APIs need these two checkers at the same time, but clearly these two checkers all need to get user from database, and what they get is identical, because the same request passed to them! And we have many cases like this, the most obvious way to reduce database read is add extra cache server before database, and deploy write-back or write-through policy to serve the write, but this requires changing the architecture and many models’ code. I just want a simple cache that just work for this case, and do not require global changes.

    Then I realized I can combine binding & memoize to construct a simple cache. The basic idea is I will not invoke model function directly, I just binding many dynamic vars with memoized model function before the check, and invoke dynamic var in checker instead of model function:

    The advantage of this is I don’t need to refactor much code, just binding the dynamic name and replace actually function name with it. This optimization is quite sucessful: I didn’t spend much time to refactor code structure, and also didn’t complicated code, but the profile result shows I reduced 1 ~ 2 database read in every request.

    I think we can use this technique whenever following condition meet:

    • only needs to read
    • many duplicated reads
    • cache is not long-lived

    So, after this optimization, I also optimized our email notification system with same technique, because our notification system scheduled at fixed rate, and it also needs to read database many times with many duplicated reads.


  3. 2014总结

    December 30, 2014 by xudifsd

    2014年我的关键字应该是“遇见”和“Clojure”吧。

    经历

    今年对我影响最大的一件事情就是Google Summer of Code。大二就想参加了,但是由于种种原因直到今年才真正参加,从去年十二月就开始在往届项目的idea page上去找感兴趣的项目,二三月联系mentor、提交patch,四月看类型系统的论文,五月开始,八月结束。整个过程耗费了大量的精力和时间,但是收获也是巨大的。

    之前预期的收获就是Google发的钱以及简历上可以多写点东西,但是后来发现远远不止于此,不知道其他的开源组织怎么样,但是我觉得Clojure这个组织还挺有意思:它会筹集一些钱来赞助参加了GSoC的学生去参加Clojure/conj或Clojure/west,而且社区内部似乎很多人也愿意帮助一些项目的进行,比如我做的Typed Clojure项目本身就是诞生于GSoC项目,维护者也就是我的mentor就是全职在做这个项目,靠的就是从社区筹集的资金。现在Clojure社区经常讨论的一个话题就是diversity,说现在的Clojure开发者80%都是白人男性,希望能吸引到其他的人多参与,Clojure/conj就提供这样的grant。社区本身也挺有活力,在刚开始GSoC的时候,就会有Clojure Gazette这样的媒体邮件采访做GSoC的学生,并问一些项目的问题这样。

    不过我觉得最主要的一个收获是开阔了眼界,知道了一个非公司资助的开源项目是怎么运行的,遇见了mentor,知道了他是怎么去做事情。最近才知道他现在在美国读phd,仍会继续做Typed Clojure项目,而且还计划资助几个学生兼职开发,命名为diversity scholarship,主要想资助非白人学生,增加Clojure社区的diversity,原因就是在GSoC中与不同国家的学生交互的经历很有意思(其中还提到了我,很开心,笑)。

    除了GSoC,我还通过其他的渠道遇见了很多人。

    八月的时候,国内参加了GSoC的学生还在北京聚了一次,遇见了更多有意思的人,有传说中做CERN项目的牛人、扬言只写了14行代码就完成了GSoC的大神、写gcc go前端的萌妹子。之后还参加了十一月份的hackathon,约了两个一直想面基的豆瓣友邻参加,虽然没有得奖,但是整个过程还是很有意思,认识了很多上交的学生,还认识了超厉害的前端妹子

    另外还遇见并喜欢上了一个女生,虽然最后无疾而终,但是还是很感激这段感情,让我成长了很多。最后觉得都是时辰的错啊。

    总结

    之前看一个关于Google面试tips的文章,说最好的状态就是“know a little a lot, know a lot a little”,现在觉得这话很有道理:做到了前一点才能显示自己的独特,做到了后一点才能为自己创造更多的机会。一两年前我还一直只想着找一个内核相关的工作,学Lisp完全就是副业,但是谁能想到今年我写的绝大部分代码都是Clojure?我发现,懂得各个领域的不同知识可以让你把很多知识串起来连成网络,这样很方便理解与记忆,而且这样在任何时刻你都可以不花太多时间就能把know a little的知识扩充成know some,甚至是know a lot,然后就能投入使用。这似乎和投资里的“不要把鸡蛋放在一个篮子里”的道理一样吧。

    另外我也发现了多接触些人、多看一些书的必要性,这个社会不管是资本驱动的还是技术驱动的最终还是由人构成的,了解不同的人的经验和理念会给自己带来很多启示和帮助,很可能下一个你遇见的人或书就会彻底改变你。就我而言,如果当年没读过《黑客与画家》,我也不会去学Lisp,也不会接触到Clojure,恐怕今年也就没法参加GSoC并且找到工作了吧,笑。

    希望新的一年能出一次国,最好是能参加Clojure/west,然后就是多点一点分布式系统的技能树,一直想点,但是一直没机会实施,最后就是要更多地遇见一些人,探索生命中更多的可能性。

    明年我会尽量用英文写技术相关的博客,因为发现读我博客的国人基本上都懂英文,而且用英文写博客可以扩大受众,何乐而不为呢。


  4. Macro

    December 20, 2014 by xudifsd

    A few month ago, I wrote a Clojure macro, looks like this:

    It means if current env is test, then generate normal function, otherwise generate async function and log the error if catched. We need this because when we call some functions, we don’t want to wait them to return, but we still want to catch their exception and log it. But when in test env, we want them to throw exception to make error more obvious. We determine env by environment variable, and default to test env if not provided.

    After I finished this macro, and kicked it online, everything seems fine. But somehow I recall this macro while I was doing something else, and found it has a bug that is difficult to detect. The bug stems from the way I treat it as normal function instead of macro.

    Because macro expansion happens at compile time instead of runtime, this macro needs env variable at compile time, otherwise it will generate normal function. But normally we won’t set env variable at compile time (lein jar), and only set env variable at runtime, because normally env variables only take effect at runtime. So we didn’t get async function online at all, they’re all normal functions. Solution for this is simple, just set env variable before compile.

    This kind of problem is hard to find, because no matter what function we get, the behavior is identical, just a little bit slower.

    It seems mastering macro is not that simple. But macro is less mysterious to me now, when I was reading , macro is something I can’t understand, because the book tells all the goodness about macro without telling any principle about it, this just makes it mysterious, but after you understand all the principle, it’s just a tool to generate code for you, the goodness is you can use normal function in macro just like in other function, but actually macro is not as flexible as function sometimes: they can’t being passed as value.

    Now, I’m quite understand the rule for macro: “don’t write macro if you can”.


  5. 逻辑编程及应用

    October 13, 2014 by xudifsd

    最近看了下逻辑编程,据说逻辑编程最牛的是prolog,但是没空再学一门语言了,而且看到clojure中刚好有core.logic的这个库,而且可以拿这个库写《The Reasoned Schemer》中的例子,所以就直接用这个库来学了。

    准备工作

    先把

    [org.clojure/core.logic "0.8.8"]
    

    放到你的project.clj的依赖中,然后在repl中运行

    user=> (require '[clojure.core.logic :as logic])
    

    使用这个库的代码一般像

    user=> (logic/run 1 [x] (logic/== x 100))
    (100)
    

    其中1表明要的结果数量,因为约束条件不一定只有一种可能;x表示需要对哪个变量进行求解;而(logic/== x 100)就是所谓的goal了,整个逻辑编程的推动力就是满足goal。所以上面的代码读作“告诉我x的一个值让其满足x等于100”。因此表达式的值就是(100),是一个包含值100的链表,这就是x应该取的值,因为我们限制结果数量为1,所以结果链表的长度不会大于1,当然,如果没有满足条件的值长度就为0。

    因为logic/run的body内部必须是goal而非普通表达式,一般的表达式都不能用在里面。所以这个库另外提供了一些针对链表操作的primitive,像firsto,resto和conso这样和clojure内置的first,rest和cons对应,加了一个o以示区别,而且与内置函数相比都多了一个参数——用于接受类似于内置函数的返回值。

    比如:

    user=> (logic/run 1 [x] (logic/conso 1 [2 3 4] x))
    ((1 2 3 4))
    

    意思是求让x满足等于(cons 1 [2 3 4])的值,所以结果也合乎想象,这种调用方式乍看起来感觉很像C语言中的传入一个指针,函数往这个地址中写数据来影响外部环境而不用写全局变量。但是其实不是这样理解,这种形式主要是为了让你可以指定一个结果,让库帮忙推导可以得出这种结果的初始状态,像:

    user=> (logic/run 1 [x] (logic/conso 1 x '(1 2 3)))
    ((2 3))
    

    这里就求出了x必须是'(2 3)才能满足(= (cons 1 x) '(1 2 3))

    《The Reasoned Schemer》前面几章讲的都是对链表的操作,完全没涉及数学计算,但是第7、8章就描述了如何使用链表来表示数字的计算,方式类似于电子的门电路,然后组成全加器,最后组成高阶的+o*o这些内置函数。不知道真实计算会用什么方法,可能会内置数学的一些操作吧,因为用链表来表示实在是太低效了,这样的用法只是适合理解一些原理。

    逻辑编程最基本的内容差不多就是这些了,抽象出来了一个goal,让编程者可以只需要编写规则,让程序自动求解,会大量用于Automated reasoning

    内部原理

    我对逻辑编程感兴趣的原因之一就是没法理解它是怎么实现的,甚至之前完全不知道逻辑编程是怎么样的一种形式,幸好《The Reasoned Schemer》不仅仅讲了怎么用,还讲了内部的实现原理。

    其实逻辑编程最基本的操作就是==操作了,用它来让两个值相等,然后求出满足这些等式的具体变量值,比如(== x y)(== x 100)以及(== '(x y) '(1 2))等等,而这个操作最终依赖于unify函数,这也是我在做Typed Clojure项目之前查类型系统资料时完全没法理解的unification,也就是合一,但是书中给的代码就很好理解了。这里吐槽一下wiki,有些原理查wiki完全看不懂,它倾向于把一个道理讲得很细,但是在这众多的细节中很容易抓不到重点,其实wiki里面那么多废话完全就可以用几行代码来理解。

    goal在内部是一个函数,接受一个substitution,返回false或一个新的substitution,false表示合一失败,而返回substitution表示满足该goal,并且这个新的substitution也包含了满足该goal的新条件。goal一般不需要用户自己编写,一般是使用==让系统生成。而substitution可以理解成一个符号到值的映射,当然也可以是符号到符号的映射。比如给goal (== x y)一个空的substitution,就会产生一个{x -> y}这样的substitution,而如果这个substitution再传递给(== x 10)这样的goal,那么就会产生{x -> 10, y -> 10}这样的映射。

    它的内部原理其实就是相当于将需要求解的goal转化成了一个树,然后让程序自动遍历这个树而已。

    比如

    其中每个conde条目都是在创建一个树的分支,conde在书中有详细描述,这里就不详细介绍了,简单来说它就相当于clojure里的cond。因此上面的代码实际上相当于遍历这样的树:

    logic_tree

    表达式的结果为([5 5 _0] [7 _0 7] [10 10 _0] [_0 _0 _0]),其中_0表示可以为任意值,这样的每一个结果正好对应从根到叶子节点所积累的条件,所以有4个叶子节点就有4个结果,虽然我们请求5个结果。需要注意的是叶子节点不一定会满足条件,如果不满足则积累的substitution会被丢弃,就不会生成结果,而且一旦生成的结果满足了我们的需求量就不会继续计算,这也有点像lazy evaluation。

    所以逻辑编程的意义就在于不需要写树的遍历,让程序自动遍历罢了。不过这也是相当有用的一个抽象了,完全开创了一个新的编程模型。

    实际用途

    我知道的其中一个用途,也是我想要了解逻辑编程的原因,就是因为它可以用于类型系统的类型推导,比如haskell里面的

    Prelude> map id [1, 2, 3]
    [1,2,3]
    

    这样的函数调用,其中map的类型为map :: (a -> b) -> [a] -> [b]id的类型为id :: a -> a,而[1, 2, 3]的类型为[Num],编译器需要怎么推断map id [1, 2, 3]的类型呢?这里困难的地方就在于mapid都是多态函数,所谓多态函数就是函数类型并不是确定的,需要用类型变量来确定真实的类型,因为map并不关心第一个参数的参数和返回值类型。如果在静态类型语言中不能用类型变量会非常痛苦,就像java那样还需要为int,double写各种版本的函数。但是这样的类型变量就会让类型检查很不方便,因为不能直接比较了,而如果让程序员在每个多态函数的调用部分都加上类型注释又显得太不智能,所以必须要编译器自己去做这样的类型推导。整个推导的过程就需要用到逻辑编程里的合一,也就是unification。

    这里详细解释一下整个推导过程。先解释一下函数的类型标记,map(a -> b) -> [a] -> [b],这样的标记的意思表示map函数接受两个参数,第一个参数是参数类型为类型变量a,返回类型为类型变量b的函数,map的第二个参数是类型为类型变量a的链表,而map的返回值的类型为类型变量b。所以,如果类型变量a为Int,而类型变量b为String,那么map的第一个参数就必须是接受Int,返回String的函数,第二个参数必须是Int的链表,返回值是String的链表。

    id这个函数接受什么返回什么,就是clojure中的identity,所以类型为a -> a,为了之后的叙述方便把它改为c -> c

    所以在检查表达式map id [1, 2, 3]的类型时需要先推断出所有的类型变量所代表的真正类型,先推断第一个参数,可以很容易得到{a -> c, b -> c}这样的映射,之后再检查第二个参数,同时考虑已经得到的信息就会得到{a -> Num, b -> Num, c -> Num}这样的映射,过程中没有类型变量的冲突,所以整个检查成功,而且可以得知整个表达式的返回值是Num的链表。

    考虑一个类型错误的表达式map odd ["a", "b", "c"],其中odd函数的类型为Num -> Bool,不是多态函数,这里推导出第一个参数时会得出{a -> Num, b -> Bool}这样的映射,推导第二个参数时会得出{a -> String},这和前面推导出的结果相冲突,所以合一失败,也就导致了类型检查失败。