docwhat's blog

Smoking craters are good

Ever wondered how to make bulletproof software? What tricks do the guys who build bank machines use to be sure their software doesn’t ever crash?

Look no further!

History

Tandem Computers used to make NonStop fault-tolerant hardware and software. Their premier system was a million-dollar mainframe. It could have up to 16 CPUs and 16 IO cabinets, each with 60 IO cards. Everything was in pairs. Each CPU was actually two CPUs; if one disagreed with the other, they shut down. All processes were doubled. All IO was doubled: two SCSI cards, going to two different (mirrored) SCSI drives in two different cabinets, each with two power supplies.

Jimmy, the CEO and founder, used to give demos where he would take a bank manager, or some other high muckety-muck, and show him one of Tandem’s live data centers. He’d point at the screens and say, “You see those messages? That’s a bunch of transactions happening on these systems right here. We’re using them for our day-to-day work, right here.”

“Let me show you how reliable these systems are…” He’d then pull a Colt .45 from his jacket and blow a hole in the side of one of the IO cabinets. The boom would usually cause the unsuspecting bank manager to leap for the nearest exit.

Jimmy would continue, “…and as you can see, while the system noticed that little hole, it isn’t actually stopping. Everything is re-routing around the damage I just did.”

Every mistake should leave a crater

So, how did Tandem make software and hardware that ended up in most of the newspapers and banks in the world? They used a simple secret: if something goes wrong, it should leave a huge smoking crater.

Basically, it works like this. Let’s say Dan the Developer makes a mistake: in the right situation, something divides by zero. If the software tries to recover from the divide by zero without understanding why it happened, it can actually make the problem worse. On top of that, nobody may ever notice the problem.

However, let’s say that instead of hiding the problem, the divide by zero causes the whole software package to stop. The user will notice this. They will complain. The developer will be notified. It will get fixed.

This is one of the easiest ways to make sure your software is bulletproof. Make it all go boom! when something unexpected happens.
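
To make Dan's divide by zero concrete, here is a minimal sketch in Python. The language is picked purely for illustration, and the function names `average_hidden` and `average_crater` are made up for this example; nothing here is Tandem's actual code. The first version "recovers" without understanding the problem, the second leaves a crater.

```python
def average_hidden(values):
    """Hides the mistake: an empty list quietly becomes an average of 0.0."""
    try:
        return sum(values) / len(values)
    except ZeroDivisionError:
        # "Recovering" without knowing why the list was empty.
        # The result looks plausible, so nobody notices the bug.
        return 0.0


def average_crater(values):
    """Leaves a crater: an empty list raises ZeroDivisionError."""
    # No try/except here. The unexpected case stops the caller
    # with a traceback that points at this exact line.
    return sum(values) / len(values)
```

With an empty list, `average_hidden([])` hands back `0.0` and the program carries on with bad data, while `average_crater([])` raises `ZeroDivisionError` and stops with a traceback. That second one is the complaint that gets the bug fixed.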

Comments

Richard Remington

This methodology is great, IF the developer is still around to fix his code after a spectacular meltdown. Sadly, one of the more recent places I worked had a developer who wrote code by the Tandem method and only partly finished it. Some things blew up spectacularly, some not at all, and either way there was no one with the experience, tribal knowledge, or proper documentation to fix anything when it did. :(

Richard Remington

I have another thought. As part of the ‘boom!’ process, I assume you feel that it should be simple to find the bug? Are there ways to ensure that happens? By the way, I love your It’s All Text software and use it daily.

docwhat

Not necessarily “The” developer, but “a” developer.

The “is the code maintainable” issue is separate, but just as important.

docwhat

Yes. Good point; it should blow up in such a way that the developer can troubleshoot it. One of the reasons I hate threading.

Glad you like IAT. :-)


The personal blog of Christian Höltje.