Computer code that I’m writing usually doesn’t keep me up at night. After all, it’s only bits & bytes and if something doesn’t work properly you can always fix it; that’s the beauty of software.
But it isn’t always like that. Plenty of computer software has the capability of causing material damage, bodily harm or even death. Which is why the software industry has always been very quick to disavow any kind of warranty for the product it creates and - strangely enough - society seems to accept this. It’s all the more worrisome because as software is ‘eating the world’ more and more software is inserting itself into all kinds of processes where safety of life, limb and property are on the line.
Because most of the software I write is of a pretty mundane type and has only one customer (me) this is of no great concern to me. Worst case I have to re-run some program after fixing it, or maybe I’ll end up with a sub-optimal result that’s good enough for me. But there are two pieces of software that I wrote that kept me awake at night worrying about what could go wrong and whether there was any way in which I could anticipate the errors before they manifested themselves.
The first piece of software that had the remarkable property of being able to interfere with my sleep schedule was when you looked at it from the outside trivially simple: I had built a complex CAD/CAM system for metalworking, used to drive lathes and mills retrofitted with stepper motors or servos to help transition metal working shops from being all manual to CNC. I have written about this project before in ‘the start-up from hell’, which if you haven’t read it yet will give you some more background. The software was pretty revolutionary, super easy to use and in general received quite well. The electronics that it interfaced to were for the most part fairly simple, the most annoying bit in the whole affair was that we were using a glorified game computer (an Atari ST) to drive the whole thing.
The ST was, for its time, a remarkable little piece of technology. It came out of the box with a 32 bit processor (Motorola 68K), a relatively large amount of RAM (up to 1M, which for the time was unheard of at that price point), and all kinds of ports coming out of the back and the side of the machine. Besides the usual suspects, Centronics and Serial ports, the ST also came with two MIDI ports, two joystick ports, a hard drive connector and some other places to stick external peripherals into.
As the project progressed all these ports were occupied one-by-one until there was no socket left unused to feed information to the CPU or to take control signals back out. The first design iteration we only had two stepper motors to drive, which was trivially done with the Centronics port. A toolchanger then occupied some more bits on the Centronics port creating the need for some off-board latches. These were then immediately used to drive a bunch of relays as well to give us the ability to switch coolant pumps, warning lights and other stuff on and off. Analogue IO went to one of the MIDI ports and an encoder to determine the position of the spindle went to the remaining i/o port. Before long everything was occupied. And then, with the software stable and all hardware I/O available already in use we determined the need to have a software component to the E-Stop circuitry.
This is not some kind of overzealous form of paranoia, some of the servos we’d drive from these boards were as large as buckets and would be more than happy to crush you or rip your arm off, one of the machines we built drove a lathe with a 5 meter (15’) chuck to cut harbour crane wheels. Fuck-ups are not at all appreciated with gear like that.
So, this was bad. There literally wasn’t a single IO port left that we could have used for this in a reliable way without having to take out a piece of functionality that was already in use at various customers. And yet, the E-Stop circuitry was a hard requirement. For one, the local equivalent of OSHA would not sign off on the machine without it (and rightly so). For another, without it we would be relying far too much on our end-users keeping their wits about them, which is a really bad idea when they are dealing with the aftermath of an event that quite possibly involved shutting down a chunk of dangerous machinery to avoid an accident. After all, the ‘E’ in E-stop stands for ‘Emergency’, and once that switch gets pushed you should not assume anything at all about the state of the machine.
So, this was a bit of a brain teaser: how to reliably restart the machine after an E-stop condition is detected. The E-stop mechanism itself was super simple: mushroom switches wired in series were placed at strategic points on the machine and the equipment case, pressing any switch would latch that switch in the ‘off’ position (E-stop engaged) and to release the switch you had to rotate it. Breaking the circuit caused a relay to fall off which cut the power to all hardware driving motors, pumps and so on. That way you could very quickly disable the machine but you had to make a very conscious decision to release the E-stop condition.
The hard part was that once that situation had come and gone, if you did not have each and every output port in a defined state, releasing the E-stop switch would most likely lead to an instantaneous replay of the previous emergency or potentially an even bigger problem! Better yet, if the power had been cut to the computer itself the boot process was not guaranteed to put sane values onto all output ports, causing the machine to malfunction the instant you tried to power it up. Keeping the state of the hardware and the software in sync was a must. Once the CAM software was up and running and had gone through its port initialization routine you could enable the relay again, but there was no output port available to do this. And even if there had been, using a single line to drive the relay would likely cause it to be activated briefly during a reboot, something you really do not want on a relay wired to ‘hold’ itself in the on position.
After many sleepless nights I hit on the following solution: analyzing the output of the 8 bit Centronics port after many power up cycles, it was pretty clear that even though the outputs were terribly noisy they were also quite regular. This held across all the STs that I could get my hands on (quite a few of them, we had a whole bunch of systems ready to be shipped). That suggested that instead of using single lines, a pattern could be made that would not occur at all while the machine was ‘off’, and that was impossible to generate when the machine was ‘on’ (to avoid accidentally triggering the sequence during an output phase). The pattern was clocked out via the normal Centronics sequence (load byte, trigger the ‘strobe’ line, wait for a bit, next byte), and a decoder plus a bunch of hardwired diodes feeding 8 bit input comparators took care of the remainder. That way the relay would never switch into the ‘on’ state by accident during a reboot or cold boot, no single wire output state could drive the relay and no regular operation could accidentally trigger the relay. All conditions satisfied.
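To make the idea a bit more concrete, here is a minimal software model of such a sequence decoder. This is only a sketch: the real thing was hardware (diodes and comparators), and the byte values in `UNLOCK_SEQUENCE`, the class name `RelayDecoder` and its methods are all made up for illustration, since the actual pattern is not given here.

```python
# Hypothetical unlock pattern; the real byte values are not known.
UNLOCK_SEQUENCE = (0xA5, 0x3C, 0x5A, 0xC3)


class RelayDecoder:
    """Software model of a decoder that enables a relay only after an
    exact multi-byte pattern has been strobed out on an 8 bit port."""

    def __init__(self, sequence=UNLOCK_SEQUENCE):
        self.sequence = tuple(sequence)
        self.position = 0      # how many pattern bytes have matched so far
        self.enabled = False   # models the self-holding relay

    def strobe(self, byte):
        """Call once per strobe pulse with the byte on the data lines."""
        if self.enabled:
            return  # relay holds itself until the E-stop breaks the circuit
        if byte == self.sequence[self.position]:
            self.position += 1
            if self.position == len(self.sequence):
                self.enabled = True
        else:
            # A wrong byte throws away all progress; it may still count
            # as the start of a fresh attempt.
            self.position = 1 if byte == self.sequence[0] else 0

    def estop(self):
        """Breaking the E-stop loop drops the relay and clears progress."""
        self.enabled = False
        self.position = 0


decoder = RelayDecoder()
for noise in (0x00, 0xFF, 0xA5, 0x3C, 0x00):   # boot-time garbage
    decoder.strobe(noise)
print(decoder.enabled)                          # still off
for byte in UNLOCK_SEQUENCE:                    # deliberate unlock
    decoder.strobe(byte)
print(decoder.enabled)                          # now on
```

The point of the design is that no single output state, and no partial or interrupted sequence, can ever pull the relay in; only the complete pattern in the right order does.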
Some of those machines sold in the 80’s are still alive today and are still running production, quite a few of them were sold and the successor lines are still being sold today (though with completely redesigned hardware and software).
The second piece of software that kept me awake at night was a fuel estimation program for a small cargo airline operating out of Schiphol Airport in the Netherlands.
Writing software for aviation is a completely different affair than writing software for almost every other purpose. The degree of attention to correct operation, documentation, testing, resilience against operator error and so on is eye-opening if you have never done it before. I landed this job through a friend of mine who thought it was right up my alley: a really ancient system running a compiled BASIC binary on a PC needed to be re-written because the source code could not be produced, and even if it could be produced the hardware was being phased out and the development environment the software was based on no longer existed (the supplier of that particular dialect of BASIC had gone out of business). The software was getting further and further behind, the database of airports that it relied on was seriously outdated (which is a safety issue) and so on.
So this was to be a feature-for-feature and bit identical output clone of the existing system, but with auditable source code, up-to-date data and on a more modern platform. The first hurdle was the compiled binary. A friend of mine (who at the time worked at Digicash) and the person that landed me the job brought some substantial help here: they decompiled the binary in record time, giving us a listing of the BASIC source code. This was a great help because it at least gave good insight into what made the original program tick, but it also showed how big the job really was: 10K+ lines of spaghetti BASIC interspersed with 1000’s of lines of ‘DATA’ statements without a single line of documentation to go with it to indicate what was what. The only thing that helped was that we could figure out where the various input screens were and what fields drove which parts of the computation.
To make it perfectly clear: if this software malfunctioned, a 747 with an unknown load of cargo and between 5 and 7 crew members would take an unscheduled dip into some ocean somewhere. So failure was really not an option, and given the state of the available data it seemed that failure was very much a likely outcome.
Weeks went by, painstakingly documenting each and every input into the program, which variables held what values (after the decompilation process all variable names were two letters long without any relation to their meaning), allowed ranges (sometimes in combination with other fields), what calculations were done on them and which part of the output resulted. Complicating matters was that the BASIC runtime contained its own slightly funny floating point library which could make it very hard to get identical output from two pieces of code that should have done just that. For each and every such deviation there had to be a chunk of documentation explaining exactly what caused the deviation and what the range of such deviations would be and how this could affect the final estimate.
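To give an idea of the kind of deviation that needed documenting, here is a small sketch that runs the same computation at two precisions and reports how far apart the results end up. IEEE single precision is used purely as a stand-in; the actual floating point format of that BASIC runtime is not described here, and the helper names (`to_f32`, `sum_f32`, `sum_f64`, `deviation`) are invented for the example.

```python
import struct


def to_f32(x):
    """Round a Python float (IEEE double) to the nearest IEEE single."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def sum_f64(values):
    """Reference: accumulate in double precision."""
    total = 0.0
    for v in values:
        total += v
    return total


def sum_f32(values):
    """Stand-in for the 'funny' runtime: round to single after every step."""
    total = 0.0
    for v in values:
        total = to_f32(total + to_f32(v))
    return total


def deviation(values):
    """Return (absolute, relative) difference between the two results."""
    ref = sum_f64(values)
    alt = sum_f32(values)
    abs_dev = abs(alt - ref)
    rel_dev = abs_dev / abs(ref) if ref else 0.0
    return abs_dev, rel_dev


# Example: the same 1000 additions, done at two precisions.
abs_dev, rel_dev = deviation([0.1] * 1000)
print(f"absolute deviation: {abs_dev:.3e}, relative: {rel_dev:.3e}")
```

Two implementations that are both ‘correct’ on paper will disagree in the last digits, and for each such disagreement you want to be able to say where it comes from and how large it can get.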
I learned a lot about take-off-weights, cruising altitude, trade winds, alternate airports, the different types of engines 747’s can be fitted with and how all these factors (and 100’s more) affected the fuel consumption. I have never before or after written a piece of software with so many inputs to produce only one number as the output: how much fuel to take on to fulfill all operational, legal and safety requirements, and to be able to prove that this is the case. In the end it all worked out, the software went live next to the old software for a while, consistently did a better job and eventually someone pulled the plug on the old system. No cargo 747’s flown by that company ever ended up in unscheduled mid-ocean or short-of-the-runway stops due to lack of fuel (or any other cause).
What I also learned from this job, besides all the interesting details about airplanes and flying, is that you can’t take anything for granted when it comes to computing critical values. You need to know all the details of the underlying runtime environment, floating point computations, the hardware it runs on and so on. Then and only then can you sleep soundly knowing that you have ruled out all the elements that could throw a wrench in your careful computations.
I liked both of these jobs quite a bit, even though in the first the environment was (literally) murderous I learned a tremendous amount and I knew going out of those jobs I had ‘levelled up’ as a programmer. Even so, working on software when there are lives on the line is something that really opened my eyes to how incredibly irresponsible the software industry in general is when it comes to the work product. The ease with which we as an industry disavow responsibility, how casually we throw beta grade software out there that interacts directly with for instance vehicles and how little of such software is auditable has me worried.
This time it is not my software that keeps me up at night, but yours. So if you are working on such software, take your time to get it right, make sure that you have thought through all the failure modes of what you put out there and pretend that the lives of your spouse, children or other people you care about depend on it; one day they just might.
And if my words don’t convince you then please read up on the Therac-25 debacle, maybe that will do the job.