Posts Tagged ‘zookeeper’

  1. chubby & zookeeper: different consistency levels

    June 12, 2016 by xudifsd

    Many people know zookeeper, a widely used open source project. But few people know chubby, a service used only internally at Google; the public can learn its details only through the paper Google published. Zookeeper provides almost the same functionality as chubby, but there are a few subtle differences between them, and I will discuss those differences in this post. So the next time you read a Google paper saying they used chubby as a piece of underlying infrastructure, do not treat it as a zookeeper equivalent.

    Zookeeper is a distributed process coordinator, as O’Reilly puts it. Chubby is a distributed lock service that provides strong consistency. Although both of them expose a file-system-like API from the user’s perspective, they provide different levels of consistency, and you can get a clue from their descriptions: coordinator is a much weaker word than lock service.

    Like most other systems, chubby sees few write operations, so its network traffic is dominated by reads and by heartbeats between clients and servers. Because chubby provides coarse-grained locking, caching is necessary; and since chubby is intended to provide strong consistency, it must ensure that a client never sees stale data. It achieves this by caching data in the client library: if anyone wants to change the data, chubby ensures that all caches are invalidated before the change takes effect. According to the paper, chubby has a rather heavy client library, and that library is an essential part of the system, providing not only connection and session management but also the cache management just described. It is this cache management that makes the huge difference between zookeeper and chubby. Many users assume zookeeper provides the same strong consistency, only to find out later that they are wrong.
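    The invalidate-before-write idea can be shown with a toy sketch. This is not Chubby’s real protocol (which involves acknowledgements and cache leases); it is a minimal simulation in which the server flushes every client cache before a change becomes visible, so no client can serve a stale value:

    ```python
    # Toy sketch of "invalidate all client caches before the write takes
    # effect". Class and key names are invented for illustration.

    class Client:
        def __init__(self, server):
            self.server = server
            self.cache = {}          # local cache, kept consistent by the server

        def read(self, key):
            if key not in self.cache:               # miss: fetch and cache
                self.cache[key] = self.server.data.get(key)
            return self.cache[key]

        def invalidate(self, key):                  # called by the server
            self.cache.pop(key, None)

    class Server:
        def __init__(self):
            self.data = {}
            self.clients = []

        def write(self, key, value):
            # Invalidate every client's cached entry *before* the change
            # becomes visible; real Chubby waits for acknowledgements
            # (or cache-lease expiry) at this point.
            for c in self.clients:
                c.invalidate(key)
            self.data[key] = value

    server = Server()
    a, b = Client(server), Client(server)
    server.clients += [a, b]

    server.write("master", "host1")
    assert a.read("master") == "host1"   # both caches now hold host1
    assert b.read("master") == "host1"
    server.write("master", "host2")      # caches invalidated first
    assert a.read("master") == "host2"   # next read re-fetches; never stale
    assert b.read("master") == "host2"
    ```

    The cost of this design is visible even in the sketch: a write cannot complete until every cache has been dealt with, which is exactly the stall the zookeeper authors wanted to avoid.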

    Unlike chubby, zookeeper does not provide such strong consistency; in fact, the native zookeeper client library has no cache at all. What difference can a cache make, you ask? For example: if two zookeeper clients want to agree on a single volatile value, they write to the same path, and the content of that file signifies the value. If client A wants to know the latest value and client B wants to change it, client A can either poll the file regularly or use the watch mechanism provided by the native zookeeper library; let’s assume A uses a watch, since that is more efficient. If B changes the value, zookeeper responds to B’s change call before notifying client A. Once B gets the response, it may assume that every client watching that value has been notified, but that is not the case: A may not receive the notification in time due to a slow network, so it is possible for the two clients to see different values at the same moment. This is usually not a problem if users do not require strong consistency, but if they do, they have to invent some buggy ad-hoc workaround. The author of the zookeeper book argued that managing a cache in the library would have made zookeeper’s design more complex, and could cause zookeeper operations to stall while waiting for a client to acknowledge a cache-invalidation request. It is true that strong consistency comes at a cost, but people would find the system easier to use if strong consistency were guaranteed.
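    The race can be made concrete with a toy timeline. This is not the real wire protocol; it is a simulation in which the notification to A sits in a "slow network" queue while B’s write has already completed:

    ```python
    # Toy simulation of the watch race: zookeeper answers B's write before
    # A's watch notification is delivered, so there is a window in which
    # the two clients observe different values. Names are invented.

    class FakeZk:
        def __init__(self):
            self.value = "v1"
            self.pending = []            # notifications held up by the network

        def set_data(self, value):       # client B's write
            self.value = value
            self.pending.append(value)   # notification queued, NOT delivered
            return "ok"                  # B receives its response here

        def deliver(self, watcher):      # the network finally delivers
            for v in self.pending:
                watcher.on_change(v)
            self.pending.clear()

    class WatcherA:
        def __init__(self):
            self.seen = "v1"
        def on_change(self, v):
            self.seen = v

    zk, a = FakeZk(), WatcherA()
    resp = zk.set_data("v2")             # B's write completes...
    assert resp == "ok"
    assert a.seen == "v1"                # ...while A still sees the old value
    assert zk.value == "v2"              # the two clients disagree right now
    zk.deliver(a)                        # only later does A catch up
    assert a.seen == "v2"
    ```

    Between `set_data` returning and `deliver` running, B is entitled to believe the new value is in force while A still acts on the old one; that window is exactly what chubby’s cache invalidation eliminates.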

    To remedy this, Netflix created the curator library, which later moved to the Apache foundation. This library provides commonly used functionality as well as cache management. This additional layer on top of zookeeper lets it provide the stronger consistency some users need. So whenever you use zookeeper, use the curator library instead of the native one unless you know what you are doing.
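    Conceptually, what a Curator-style node cache adds is a managed layer that re-reads the node whenever a change notification arrives, so user code always consults the cache instead of calling the server and re-arming watches by hand. Curator itself is a Java library; the sketch below is a Python analogy with invented names, not its real API:

    ```python
    # Sketch of a node-cache layer over a tiny fake zookeeper: the cache
    # subscribes to change notifications and refreshes itself, so callers
    # never deal with watches directly. All names are invented.

    class TinyZk:
        def __init__(self):
            self.store, self.watchers = {}, {}
        def read(self, path):
            return self.store.get(path)
        def watch(self, path, cb):
            self.watchers.setdefault(path, []).append(cb)
        def write(self, path, value):
            self.store[path] = value
            for cb in self.watchers.get(path, []):   # notify subscribers
                cb()

    class NodeCache:
        def __init__(self, zk, path):
            self.zk, self.path = zk, path
            self.data = zk.read(path)            # initial fetch
            zk.watch(path, self._refresh)        # cache layer owns the watch

        def _refresh(self):                      # re-read on every change
            self.data = self.zk.read(self.path)

        def get(self):
            return self.data

    zk = TinyZk()
    zk.write("/config", "small")
    cache = NodeCache(zk, "/config")
    assert cache.get() == "small"
    zk.write("/config", "large")
    assert cache.get() == "large"    # refreshed without caller involvement
    ```

    Note that even with such a cache the refresh is still asynchronous in real deployments; the layer removes the error-prone watch bookkeeping rather than turning zookeeper into chubby.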


  2. chubby (zookeeper) and paxos

    August 22, 2013 by xudifsd

    The problem they solve

    First, why do these two things exist at all? Suppose you run a bank with three accounts: A, B, and C. Two people want to make transfers: one wants to transfer from A to B, the other from A to C. But A’s balance is only enough for one transfer, so either the first transfer must fail or the second must.

    How do we implement this? Very simply: deploy a single server that receives all requests; the first request it receives succeeds and the second fails. But this creates a single point of failure: if that server goes down, the bank’s entire business stops. So we can run two (or more) servers and designate one as the master, with the others taking over as master after it dies. With only two servers the remaining one simply becomes the master, but with more servers, how do we choose? Worse, with two servers the other one may not actually be dead: it may still be receiving client requests and merely be unable to reach its peer. Then both servers believe they are the master (a network partition), and both transfers may succeed.
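    The single-server solution can be sketched in a few lines: one process receives all requests and serializes them, so of two conflicting transfers exactly one succeeds. Account names and amounts are invented for illustration:

    ```python
    # Minimal sketch of the single-server bank: requests are applied in
    # arrival order, so the conflicting second transfer is rejected.

    class Bank:
        def __init__(self, balances):
            self.balances = dict(balances)

        def transfer(self, src, dst, amount):
            if self.balances[src] < amount:
                return False                 # insufficient funds: reject
            self.balances[src] -= amount
            self.balances[dst] += amount
            return True

    bank = Bank({"A": 100, "B": 0, "C": 0})
    first  = bank.transfer("A", "B", 100)    # arrives first: succeeds
    second = bank.transfer("A", "C", 100)    # arrives second: fails
    assert (first, second) == (True, False)
    assert bank.balances == {"A": 0, "B": 100, "C": 0}
    ```

    Everything that follows is about recovering this simple serial behavior once there is more than one server.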

    The core of the problem is keeping the state of multiple servers consistent, and consistent state requires the servers to reach agreement. The problem above can be solved with either of two agreements:

    • who the master server is
    • whose transfer should succeed

    If the system can reach either one of these two agreements, the problem is solved:

    If we can reach the first agreement, every transfer is decided by the master server, and the other servers simply follow. To handle the network-partition problem described above, we just need a master lease: a group of servers elects a master and promises not to elect another master within a certain time window under any circumstances. If a network partition occurs, the whole service may become unavailable, but once the master lease expires the cluster can elect another master and resume service, while the original master, having failed to win the new election, becomes a follower.
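    The lease rule above can be sketched as follows. Time is simulated, and the lease length is an arbitrary choice; the point is only that no new master can be elected while an unexpired lease exists, so two masters never coexist:

    ```python
    # Toy sketch of a master lease: elections are refused until the
    # current lease expires, even if the current master seems dead.

    LEASE = 10  # lease length in simulated seconds (arbitrary)

    class Cluster:
        def __init__(self):
            self.master = None
            self.lease_expires = 0

        def elect(self, candidate, now):
            # Refuse to elect anyone while an unexpired lease exists.
            if now < self.lease_expires:
                return False
            self.master = candidate
            self.lease_expires = now + LEASE
            return True

    cluster = Cluster()
    assert cluster.elect("server-1", now=0)      # server-1 becomes master
    assert not cluster.elect("server-2", now=5)  # partition at t=5: no takeover
    assert cluster.master == "server-1"          # service may stall, but safely
    assert cluster.elect("server-2", now=12)     # lease expired: failover ok
    assert cluster.master == "server-2"
    ```

    The trade-off is visible in the middle step: during the lease window the service may be unavailable, but it can never be split-brained.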

    Or we can reach the other agreement, “whose transfer should succeed”, which does not require a master to make the decision: assign each transfer a sequence number that is monotonically increasing and unique per transfer. That sequence number is what all the servers must agree on. Once agreement is reached, each server applies the operations in sequence-number order, and all servers arrive at the same state.
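    This second approach can be sketched as a replicated log: once the sequence numbers are agreed, each replica applies the entries in that order with the same deterministic rule and independently reaches the same state, with no master involved. The log entries are invented for illustration:

    ```python
    # Toy sketch of applying an agreed log: replicas may *receive* entries
    # in any order, but they apply them sorted by sequence number.

    def apply_log(log):
        """Apply agreed (seqno, src, dst, amount) entries in seqno order."""
        balances = {"A": 100, "B": 0, "C": 0}
        for _, src, dst, amount in sorted(log):   # sort by sequence number
            if balances[src] >= amount:           # same rule on every replica
                balances[src] -= amount
                balances[dst] += amount
        return balances

    # The agreed-upon log for the two conflicting transfers.
    log = [(1, "A", "B", 100), (2, "A", "C", 100)]
    replica1 = apply_log(log)
    replica2 = apply_log(list(reversed(log)))     # delivery order differs
    assert replica1 == replica2 == {"A": 0, "B": 100, "C": 0}
    ```

    Agreeing on those sequence numbers is precisely the job of a consensus protocol like paxos; once the log is agreed, state machine replication is mechanical.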

    To summarize:

    • To survive single-node failures, we need multiple servers.
    • To keep the state of multiple servers consistent, we can use either of the two approaches above.

    The two approaches correspond to chubby and paxos, respectively; the corresponding open source projects are zookeeper and libpaxos.

    chubby (and likewise zookeeper) provides a file-system-like interface that lets users elect their own master in a way that resembles file locking. To stay highly available, chubby itself must be deployed on at least 3 servers (so the whole service keeps working as long as fewer than half of them fail), and internally those servers use the paxos algorithm to reach agreement (zookeeper uses Zab, an algorithm similar to paxos; see “Hadoop: The Definitive Guide” for details).
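    The lock-file election interface can be sketched like this: each candidate tries to create the same lock file, whoever succeeds is the master, and everyone else merely observes who holds it. The path and class names are invented, not chubby’s real API:

    ```python
    # Toy sketch of lock-file master election: create-if-absent is atomic
    # on the (chubby-like) server, so exactly one candidate wins.

    class LockService:
        def __init__(self):
            self.files = {}

        def try_lock(self, path, owner):
            if path in self.files:       # someone already holds the lock
                return False
            self.files[path] = owner     # atomic create: this owner wins
            return True

        def holder(self, path):
            return self.files.get(path)

    svc = LockService()
    assert svc.try_lock("/ls/cell/master-lock", "server-1")      # wins
    assert not svc.try_lock("/ls/cell/master-lock", "server-2")  # loses
    assert svc.holder("/ls/cell/master-lock") == "server-1"
    ```

    This is why a lock-based interface is so approachable: the consensus machinery is hidden behind one atomic create call.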

    At first glance libpaxos seems more convenient: it solves the same problem, and it does not require extra servers to run a chubby service; the service itself just uses the library. But the chubby paper explains why Google implemented chubby internally rather than a library:

    1. Most developers do not think about high availability when they start building a service, so initially everything runs on a single server. Only when the service gains users do they take the problem seriously. With chubby, consistency problems can be solved by adding a few simple statements while keeping the original program architecture.
    2. Sometimes electing a master is not enough; you also want zookeeper to hold some information, such as the service’s configuration and the master’s address. But zookeeper is not designed as a general-purpose store, so do not write too much data into it.
    3. A lock-based interface is easier for developers to accept. Not every developer understands distributed consensus protocols, but most have used locks.
    4. A distributed consensus protocol needs several replicas to be highly available, whereas with chubby even a single client can use it.

    So a system like chubby is more convenient: it not only solves the consistency problem but also provides more useful features, such as configuration management. Moreover, any company big enough to need zookeeper has many components that need something like it for high availability, and zookeeper was designed to minimize client interaction with it so that it does not become a bottleneck; therefore many services can share a single zookeeper cluster.

    The differences between zookeeper and redis

    People sometimes feel that zookeeper is a lot like redis, but they are entirely different things.

    First, their configuration is completely different. With redis you must manually configure which node is the master and which are slaves, and the slaves mainly just speed up reads; if you want one of the slaves to become master after the redis master dies, you have to build a lot yourself (redis sentinel apparently addresses this, but I have not looked into it). In zookeeper this is a non-issue: the configuration does not even designate a master; the servers elect one themselves. (Note that zookeeper electing its own internal master and other services using zookeeper to elect their master are two different things; zookeeper elects an internal master mainly to avoid livelock and improve performance, see the paxos paper.)

    The second difference is the address a client gives when connecting to zookeeper: a comma-separated list of host:port pairs like “zk1:2181,zk2:2181,zk3:2181”. The client automatically connects to the zookeeper master and automatically reconnects after the zookeeper master changes, which is also completely unlike redis.

    The third difference is that zookeeper offers many interfaces for synchronization between distributed machines. For example, through one interface a client can be represented by a file, and once zookeeper loses its connection to that client, zookeeper deletes the file automatically. Zookeeper also provides an asynchronous API that lets clients watch for changes to the files under a node. Combined, these two APIs let zookeeper notify the remaining clients that one of them has died, so the clients do not have to keep polling zookeeper for this information. To implement the same thing with redis, the servers would have to poll each other, which wastes a lot of traffic.
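    The ephemeral-file-plus-watch pattern can be sketched as follows. This is a simulation, not zookeeper’s real API: each client is represented by a file tied to its session; when the session dies the file disappears and watchers are notified, with no polling:

    ```python
    # Toy sketch of ephemeral nodes + watches for failure detection.
    # Paths and session names are invented for illustration.

    class TinyZk:
        def __init__(self):
            self.nodes = {}        # path -> owning session
            self.watchers = []     # callbacks fired on deletion

        def create_ephemeral(self, path, session):
            self.nodes[path] = session

        def watch_deletions(self, callback):
            self.watchers.append(callback)

        def session_expired(self, session):
            # Delete every ephemeral node of the dead session, then notify.
            dead = [p for p, s in self.nodes.items() if s == session]
            for p in dead:
                del self.nodes[p]
                for cb in self.watchers:
                    cb(p)

    zk = TinyZk()
    zk.create_ephemeral("/workers/w1", session="s1")
    zk.create_ephemeral("/workers/w2", session="s2")

    gone = []
    zk.watch_deletions(gone.append)    # push notification, no polling
    zk.session_expired("s1")           # w1's session dies
    assert gone == ["/workers/w1"]     # the survivors are told who died
    assert list(zk.nodes) == ["/workers/w2"]
    ```

    This is the building block behind group membership and leader failover recipes: liveness is tied to the session, and interest is expressed once via a watch instead of a polling loop.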