August 26, 2010

While working on Kafka, a distributed pub/sub system (more on that later) at LinkedIn, I need to use Zookeeper (ZK) to implement the load-balancing logic. I’d like to share my experience of using Zookeeper. First of all, for those of you who don’t know, Zookeeper is an Apache project that implements a consensus service based on a variant of Paxos (it’s similar to Google’s Chubby). ZK has a very simple, file system like API. One can create a path, set the value of a path, read the value of a path, delete a path, and list the children of a path. ZK does a couple of more interesting things: (a) one can register a watcher on a path and get notified when the children of a path or the value of a path is changed, (b) a path can be created as ephemeral, which means that if the client that created the path is gone, the path is automatically removed by the ZK server. However, don’t let the simple API fool you. One needs to understand a lot more than those APIs in order to use them properly. For me, this translates to weeks asking the ZK mailing list (which is pretty responsive) and our local ZK experts.

To get started, it’s important to understand the state transitions and the associated watcher events inside a ZK client. A ZK client can be in one of the 3 states, disconnected, connected, and closed. When a client is created, it’s in the disconnected state. Once a connection is established, the client is moved to the connected state. If the client loses its connection to a server, it switches back to the disconnected state. If it can’t connect to any server within some time limit, it’s eventually transitioned to the closed state. For each state transition, a state changing event (disconnected, syncconnected and expired) is sent to the client’s watcher. As you will see, those events are critical to the client. Finally, if one performs an operation on ZK when the client is in the disconnected state, a ConnectionLossException (CLE) is thrown back to the caller. More detailed information can be found at the ZK site. A lot of the subtleties when using ZK are to deal with those state changing events.

The first tricky issue is related to CLE. The problem is that when a CLE happens, the requested operation may or may not have taken place on ZK. If the connection was lost before the request reached the server, the operation didn’t take place. On the other hand, it can happen that the request did reach the server and got executed there. However, before the server can send a response back, the connection was lost. If the request is a read or an update, one can just keep retrying until the operation succeeds. It becomes a problem if the request is a create. If you simply retry, you may get a NodeExistsException and it’s not clear whether it’s you or someone else have created the path. What one can do is to set the value of the path to a client specific value during creation. If a NodeExistsException is thrown, read the value back to check who actually created it. One can’t use this approach for sequential paths (a ZK feature that creates a path with a generated sequential id) though. If you retry, a different path will be created. You also can’t check who created the path, since if you get a CLE, you don’t know the name of the path that gets created. For this reason, I think that sequential paths have very limited benefit since it’s very hard to use them correctly.

The second tricky issue is to distinguish between a disconnect and an expired event. The former happens when the ZK client can’t connect to the server. This is because either (1) the ZK server is down, or (2) the ZK server is up, but the ZK client is partitioned from the server or it is in a long GC pause and can’t send the heartbeat in time. In case (1), when the ZK server comes back, the client watcher will get a syncconnected event and everything is back to normal. Surprisingly, in this case, all the ephemeral paths and the watchers are still kept at the server and you don’t have to recreate them. In case (2), when the client finally reconnects to the server, it will get back an expired event. This implies that the server thinks the client is dead and has taken the liberty to delete all the ephemeral paths and watchers created by that client. It’s the responsibility of the client to start a new ZK session and to recreate the ephemeral paths and the watchers.

To deal with the above issues, one has to write additional code that keeps track of the ZK client state, starts a new session when the old one expires, and handles the CLE appropriately. For my application, I find the ZKClient package quite useful. ZKClient is a wrapper of the original ZK client. It maintains the current state of the ZK client, hides the CLE from the caller by retrying the request when the state is transitioned to connected again, and reconnects when necessary. ZKClient has an Apache license and has been used in Katta for quite some time. Even with the help of ZKClient, I still have to handle things like who actually created a path when a NodeExistsException occurs and re-registering after a session expires.

Finally, how do you test your ZK application, especially the various failure scenarios? One can use utilities like “ifconfig down/up” to simulate network partitioning. Todd Lipcon’s Gremlins seems very useful too.