
Bad software can kill you, and at this rate it probably will.

When I started out in my career I worked on video games. Everyone worked very hard to make sure that the game we were building was the best it could be, and quality was a big part of that. That said, I remember hearing and making the joke that we weren’t making heart/lung machines as a way of drawing the line on a particular bug or feature: the line between quality that is valuable for a video game and something that doesn’t matter that much, because we were in fact making something with pretty low stakes in the grand scheme of things. A pixelated death does not have quite the same impact as a real one. We tested to make sure the game worked fine on leap day, but we might not have fixed a minor bug that only happened once every 4 years.

We expect our technology to work well in a wide range of very complex scenarios. Good software components work together to support this, but good software is remarkably difficult to build, support, maintain and pay for. Good software requires substantial investment from top to bottom, and the bigger your product gets the deeper you need to make that high quality foundation. Unfortunately, it is not as simple as going slow and spending money. Good software in complex use cases requires actually solving the problems of that complexity step by step, chunk by chunk. Most of the time spent doing this looks ineffective to an outside observer.

You might be surprised to find that not everyone actually wants to make good software. Large numbers of problems have been solved already. Many of these solutions can be reused, and the good software already made can give you a substantial head start. You can deliver substantial, successful and profitable products quite quickly by connecting these pieces together into a new solution. If your customer demand isn’t that complex, it will probably work well. Unfortunately, most of the time the results of efforts like this are quite complicated. Complicated software solutions are usually cheaper and faster to build than ones that solve complex problems, but the complication prevents you from evolving them quickly, responding to problems, or eventually building on top of them to become something more than you are. A complicated software product will almost always be more expensive in the long run.

For many teams it’s better to build quickly, avoiding the complex problems, and then deal with them later when you have money coming in. Letting your customers pay for the difficult, expensive work is a tried and true business plan. But there are many companies that should know better. If you plan to be in the business for the long haul, or want to expand your product quickly, the investment in high quality software will pay off 10x. Politically this can be difficult sometimes. If you can’t explain this distinction to management, management will always make the wrong choice.

I remember one notable example. A team was making great progress on a high quality product, something that would set the company up with a long-term solution, but its value wasn’t understandable to a non-technical management team. The company ended up spending several multiples of what it would have cost to finish the internal product, and ended up with a rigid solution they couldn’t control instead of the solution they needed. In the end the people who made that “safer” decision all ended up losing their jobs and their autonomy, but it’s doubtful that any real lessons were learned.

The company culture has a big influence on this. The expectations on a company making video games should be different from those on companies making elevator control software, autonomous vehicles, or, say, aircraft control systems. Gregory Travis, writing in Spectrum about the 737 MAX, nails it:

It is astounding that no one who wrote the MCAS software for the 737 Max seems even to have raised the possibility of using multiple inputs, including the opposite angle-of-attack sensor, in the computer’s determination of an impending stall. As a lifetime member of the software development fraternity, I don’t know what toxic combination of inexperience, hubris, or lack of cultural understanding led to this mistake.
But I do know that it’s indicative of a much deeper problem. The people who wrote the code for the original MCAS system were obviously terribly far out of their league and did not know it. How can they implement a software fix, much less give us any comfort that the rest of the flight management software is reliable?

I completely recognize this behavior, and it’s actually incredibly common. Someone lost sight of the fact that they were working on software that needed to be good all the way through. They were working on the proverbial heart/lung machine. They built a solution that worked along the simple everyday path, but because the real, more complex problem of using everything we know about the aircraft state to make the right decision was unsolved, it fails in a tragic and catastrophic way exactly when it is needed most.

In the 737 Max, only one of the flight management computers is active at a time—either the pilot’s computer or the copilot’s computer. And the active computer takes inputs only from the sensors on its own side of the aircraft.

This is completely indicative of an unsolved problem, an opportunity to write good software that was skipped. The flight management computer software was not built to reconcile conflicting or inconsistent inputs. Two machines that generate differing results without a system to reconcile them aren’t redundancy, they’re a confusion generator. “Trust the machine that isn’t trying to kill you” isn’t a solution.
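To make the idea of reconciling inputs concrete, here is a minimal sketch of what a cross-check between redundant sensors might look like. It is purely illustrative and is not how any real flight software works: the `reconcile` function, the `Reconciled` type, and the 5 degree threshold are all hypothetical names and values I made up for the example. The point is only that disagreement becomes an explicit, visible state rather than something each computer silently ignores.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

# Hypothetical threshold, chosen only for illustration.
MAX_DISAGREEMENT_DEG = 5.0


@dataclass
class Reconciled:
    value: Optional[float]  # agreed reading in degrees, or None if untrusted
    trusted: bool           # False means "do not act on this automatically"


def reconcile(readings: Sequence[float],
              max_disagreement: float = MAX_DISAGREEMENT_DEG) -> Reconciled:
    """Combine redundant sensor readings into one value, or refuse to."""
    if len(readings) < 2:
        # A single source cannot be cross-checked, so never trust it alone.
        return Reconciled(value=None, trusted=False)
    if max(readings) - min(readings) > max_disagreement:
        # The sources conflict: report the conflict instead of picking a side.
        return Reconciled(value=None, trusted=False)
    # The sources agree within tolerance, so use their average.
    return Reconciled(value=sum(readings) / len(readings), trusted=True)


print(reconcile([4.8, 5.2]))   # readings agree, result is trusted
print(reconcile([4.8, 74.5]))  # readings conflict, result is not trusted
```

Even a toy like this forces the real question into the open: what should the system do when `trusted` is false? Answering that is the hard, complex part, and it is exactly the part that gets skipped.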

Was the 737 Max software project started with a mission statement like “Update the latest 737 software to perform in the new aircraft”? I’ll bet it was something like that, something that frames the problem as narrow in scope and commercial cost. Good software demands that engineers dig through complex problems and solve them, an activity usually incompatible with “just a quick update” projects.

Software teams that have our lives in their hands need to be constantly aware of where they are and aren’t solving these problems versus working around them. There is room for both in a commercial setting, and sometimes a quick hack is exactly what’s called for. But if you look at your software product and you see more avoidance than solutions, you are probably not working on the kind of good software you would want your life to depend on. It has to work every day, even on leap day. It has to work when it’s loud, when it’s dirty, when there is a spider nearby. This kind of reliability in complex scenarios is not a solved problem; you can’t get it from a box or an open source repo. You can only build it, one step at a time, as fundamentally good software. Companies will need to make the investment, and teams will need to find better ways to explain complexity and dig deep. If software is going to eat the world, we had better make sure it is good software. The alternative is not the kind of robot apocalypse we see in the movies, but it will still kill us just the same.