I’ve sometimes searched for a week or more trying to find a bug. But nothing like this:
After more than a month of tireless research and testing, we have finally gotten to the bottom of our ZooKeeper mystery. Corruption during AES encryption in Xen v4.1 or v3.4 paravirtual guests running a Linux 3.0+ kernel, combined with the lack of TCP checksum validation in IPSec Transport mode, leads to the admission of corrupted TCP data on a ZooKeeper node, resulting in an unhandled exception from which ZooKeeper is unable to recover. Jeez. Talk about a needle in a haystack…
Another guy at work says we will probably be using ZooKeeper in the project we are working on. I’m glad these guys found it before we ran into it.
The whole failure was triggered when a ZooKeeper node received a single corrupted network packet.
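To get a feel for why skipping checksum validation matters, here is a minimal sketch of the 16-bit ones' complement checksum described in RFC 1071, the same style of checksum TCP carries. This is only an illustration under my own assumptions: the names `internet_checksum` and `is_intact` and the sample payload are made up, the real TCP checksum also covers a pseudo-header that I leave out, and none of this is taken from ZooKeeper or the kernel.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement checksum in the style of RFC 1071."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        # combine two bytes into one big-endian 16-bit word
        total += (data[i] << 8) | data[i + 1]
    # fold any carries back into the low 16 bits
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def is_intact(payload: bytes, expected: int) -> bool:
    """Hypothetical receiver-side check: recompute and compare."""
    return internet_checksum(payload) == expected


# Hypothetical payload, not a real ZooKeeper message.
payload = b"hypothetical ZooKeeper message"
checksum = internet_checksum(payload)

# Simulate corruption in transit by flipping a single bit.
damaged = bytearray(payload)
damaged[5] ^= 0x01
corrupted = bytes(damaged)

print(is_intact(payload, checksum))    # True
print(is_intact(corrupted, checksum))  # False
```

A receiver that runs a check like `is_intact` would drop the damaged packet. If that check never happens, the corrupted bytes go straight up to the application, which is what the quoted write-up describes.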