Finding bugs

I’ve sometimes searched for a week or more trying to find a bug. But nothing like this:

After more than a month of tireless research and testing, we have finally gotten to the bottom of our ZooKeeper mystery. Corruption during AES encryption in Xen v4.1 or v3.4 paravirtual guests running a Linux 3.0+ kernel, combined with the lack of TCP checksum validation in IPSec Transport mode, leads to the admission of corrupted TCP data on a ZooKeeper node, resulting in an unhandled exception from which ZooKeeper is unable to recover.

Jeez. Talk about a needle in a haystack…

Another guy at work says we will probably be using ZooKeeper in the project we are working on. I’m glad these guys found it before we ran into it.

The problem occurred when the system received a single corrupted network packet.
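To make that chain of events a little more concrete, here is a minimal sketch of how a payload can pass an HMAC check and still arrive corrupted. It is a toy model, not the actual IPsec or Xen code path: `toy_encrypt` is a hypothetical XOR stand-in for AES, the `buggy` flag is my own way of mimicking the corruption, and I am assuming the corruption happens on the sending side before the integrity tag is computed. The checksum is a simplified RFC 1071 style sum over the payload only, whereas real TCP also covers the header and a pseudo-header.

```python
import hashlib
import hmac


def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement checksum in the style of RFC 1071 (payload only)."""
    if len(data) % 2:
        data += b"\x00"  # pad to a whole number of 16-bit words
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def toy_encrypt(data: bytes, key: bytes, buggy: bool = False) -> bytes:
    """Hypothetical XOR stand-in for AES. NOT real encryption.

    With buggy=True, one output bit is flipped to mimic corrupted encryption.
    XOR is its own inverse, so the same function also decrypts.
    """
    keystream = hashlib.sha256(key).digest()
    out = bytearray(b ^ keystream[i % len(keystream)] for i, b in enumerate(data))
    if buggy and out:
        out[0] ^= 0x01
    return bytes(out)


def send(payload: bytes, enc_key: bytes, mac_key: bytes, buggy: bool = False):
    """Append a checksum, encrypt, then compute the HMAC over the ciphertext."""
    segment = payload + internet_checksum(payload).to_bytes(2, "big")
    ciphertext = toy_encrypt(segment, enc_key, buggy)
    tag = hmac.new(mac_key, ciphertext, hashlib.sha256).digest()
    return ciphertext, tag


def receive(ciphertext: bytes, tag: bytes, enc_key: bytes, mac_key: bytes):
    """Return (hmac_ok, checksum_ok, payload) for a received packet."""
    expected = hmac.new(mac_key, ciphertext, hashlib.sha256).digest()
    hmac_ok = hmac.compare_digest(tag, expected)
    segment = toy_encrypt(ciphertext, enc_key)  # decrypt
    payload, stored = segment[:-2], int.from_bytes(segment[-2:], "big")
    checksum_ok = internet_checksum(payload) == stored
    return hmac_ok, checksum_ok, payload


if __name__ == "__main__":
    ct, tag = send(b"zookeeper txn", b"enc-key", b"mac-key", buggy=True)
    print(receive(ct, tag, b"enc-key", b"mac-key"))
```

In this model the HMAC check succeeds even with `buggy=True`, because the tag was computed over the already-corrupted ciphertext. Only the inner checksum notices that the decrypted payload no longer matches, which is why skipping that check lets the bad data through.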

4 thoughts on “Finding bugs”

  1. What were they thinking when they skipped the TCP checksum validation? HMAC (IPSec authentication) finds corruption on the wire, but it doesn’t protect you from bad encryption or decryption.

  2. Do you have pointers to more info on that Xen AES encryption bug? We’ve encountered something that smells like that, but were never able to find the time to dig in deep enough to pin down exactly where the fault lay. If not, no big deal, but I thought it wouldn’t hurt to ask. 🙂 Thanks!
