Here is a fun story of mine: last year, when I was preparing Google Summer of Code for finagle project, I encountered a class I didn’t understand. It is not too long, but from only source and comments I can not imagine its usage and why it is useful. Since it’s not a very important class, I skipped it, and tried to read other parts of the project. Unfortunately, finagle failed to be accepted as an organization in GSoC 2015(what a shame), so I stopped my reading.
This year, I accepted a full-time job working on a distributed system, this is my favorite field in computer programming, I worked very hard to do my best. Recently, I read a post <Notes on Distributed Systems for Young Bloods>, it mentioned a lot of things that new recruits in distributed system should paying attention to. It has a lot of valuable advice, in particular, it emphasized the importance of metrics: to get a better understanding of the behaviour of the system, we, as system engineers, should expose metrics, and when doing so, we should not only expose averages, but also percentiles. This, however, reminds me that the mysterious class I met in finagle has something to do with metrics and stats, so I read that class again, and find out I can fully understand that class and its usage now.
From <The Tail at Scale>, I understood that it is critical for the system to keep tail latency low, otherwise the overall tail latency would make the system intolerable slow. But I failed to see the implication that we should expose latency data, especially tail latency data to let people tune the system. Well, no wonder why finagle needs that class.
In searching for a better algorithm to implement percentile metrics, I found another interesting paper <Effective Computation of Biased Quantiles over Data Streams>, this paper proposed a way to calculate quantiles over streams with finer error guarantees at higher ranks. It’s very efficient since its space bound is O(log n), but this algorithm still doesn’t fit in normal needs for performance stats, since the server is long time running, the number of metrics data is unbounded, this algorithm would consume too much memory if the server kept running for a long time. From this point BucketedHistogram in finagle perform great, we can specify error we can tolerate with limits, if we want more precise data in higher rank we can add more slots in higher rank limits, and it’s very easy to understand, what’s more, it uses constant space although this constant is a little large.