I help design and implement solutions to improve the security of various things against intelligent, determined adversaries. I think I’m pretty good at it. But I don’t think I would have envisioned and prevented the dust bunny that took down a network supporting thousands of people, for hours.
I was in a meeting this afternoon where we were figuring out how to handle all the different possible failure situations in a communication protocol. As we progressed I was getting more and more concerned. The designers were explaining how things would work and I would come up with all these different situations they hadn’t considered. Things like (not exactly, but close enough to get the point across): if your encryption keys are being updated every ten minutes, what happens when your main unit goes down and you have to bring the backup control center 100 miles away online? How does the backup know what the current keys are? They hadn’t thought of that. There were lots and lots of examples like that: things they hadn’t thought of that were valid concerns. They were very good at finding solutions to the “hand grenades” I was throwing at them, but it bothered me that I was the only one coming up with the complications.
I may be better than the average person at thinking of all the exceptions to a general rule (my wife sometimes gets angry with me when I do this in “normal conversation”), but I’m far from perfect. What about all the exceptions I hadn’t thought of? If two or more people with different perspectives were “lobbing hand grenades” at the proposed solution, I would feel a lot better about its robustness. I didn’t say anything about it. I just stewed on it: “Who else can we get to take a look at this for vulnerabilities? Should I hire an outside consultant to review our work? We really need to make sure we have thought of nearly everything…”
I was right in the middle of those thoughts when one of the guys told a story about something that happened at the lab a year or two ago. I burst out laughing and kept laughing even though they kept insisting it wasn’t funny. Of course it wasn’t funny to them: they were there until the wee hours of the morning bringing the network back up, with thousands of people needing them to succeed.
All I could think about was that I knew that no matter how many people were brought in or who those people were, they wouldn’t have envisioned a killer dust bunny.
If you have a critical resource, like an engine on an airplane or a computer system that your entire company requires to function, you go to extraordinary efforts to make sure it doesn’t fail, or that it fails in a graceful manner. When the power fails on a system with a UPS, the UPS can give the computer a few minutes’ warning before the batteries go dead. The computer then gets to shut down gracefully. If one computer system and/or UPS fails, the second computer system and its independent UPS can continue without skipping a beat until the primary can be fixed. But as reliability engineer Ted Yellman from Boeing (and Teltone, where I met him) told me many years ago, “The question usually isn’t how reliable or how many redundant systems you have, it’s how independent they are.”
In this case someone was routing some cables through the false ceiling over the computer room for the network at the lab. Some dust came down (technically not a dust bunny, but it makes a better story if it is a dust bunny) and the fast-moving air in the computer room pulled the dust into the smoke detector. The smoke detector set off the fire control mechanism, which “knew” that you don’t want the electricity on when you turn on the sprinklers. And since the designers of the fire control system knew the computers were on a UPS, not just the normal power mains, it shut down the UPS as well. That brought down all the computers, main and backup, in a fraction of a second, without giving them a chance to shut down gracefully. Imagine planting your face in the middle of your plate of spaghetti during dinner instead of going to your room and getting in bed to fall asleep. And so it was with a room full of racks filled with computers: splat!
It took them something like 170 man-hours to bring the system back up. Some of the computers hadn’t been turned off in a year or more, and some hard drives and other hardware failed on startup. Other systems had corrupted file systems that were discovered after they booted. The startup procedure had been written before new equipment and software had been installed. It was a nightmare: they had to diagnose and repair a complex system under time pressure, with multiple simultaneous and unknown failures.
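Ted’s point about independence can be put in rough numbers. The Python sketch below is only an illustration: the function name and both probabilities are made up for the example, not measurements from the lab, and the model assumes a single shared event (such as a fire control cutoff) that takes out the main and backup systems together.

```python
def p_both_fail(p_unit: float, p_common: float) -> float:
    """Probability that a main/backup pair is down at the same time.

    p_unit   -- assumed chance that one unit fails on its own in some period
    p_common -- assumed chance of a shared event that kills both units at once
    """
    # Either the shared event happens, or it does not and both units
    # happen to fail independently in the same period.
    return p_common + (1 - p_common) * p_unit ** 2


# Made-up illustrative numbers, not data from the incident.
P_UNIT = 0.01     # each computer-plus-UPS fails 1% of the time on its own
P_COMMON = 0.001  # the shared cutoff fires 0.1% of the time

independent = p_both_fail(P_UNIT, 0.0)
coupled = p_both_fail(P_UNIT, P_COMMON)

print(f"truly independent pair:    {independent:.6f}")
print(f"pair with a shared cutoff: {coupled:.6f}")
print(f"ratio: {coupled / independent:.1f}x")
```

With these assumed numbers the shared cutoff dominates: the pair is roughly eleven times more likely to go down together than the redundancy alone would suggest, even though the shared event is ten times rarer than a single unit failing.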
So I’m thinking: what hope do we have of guarding against determined, intelligent adversaries when something as undetermined and unintelligent as a dust bunny can take us out? And I’m reminded of the joke about computer programmers versus carpenters.
If carpenters built houses the way programmers wrote software, the first woodpecker that came along would destroy civilization.