Concepts for the Current US Mess

by Malte Skarupke

An unforeseen disaster is never the consequence of a single factor, but rather is like a whirlwind, a point of cyclonic depression in the consciousness of the world, towards which a whole multiplicity of converging causalities have conspired

– Carlo Emilio Gadda (in That Awful Mess on the Via Merulana)

It’s hard for me to write a focused blog post at the moment because there just seem to be too many active problems. I could have written a focused blog post about a programming topic, but that feels tone-deaf. So instead this will be a scatter-shot blog post about ways of thinking that could help us out of this mess. Also, since I usually write about programming, I will try to feed the lessons back to programming.

For context (if you’re reading this in the future or from another country) the US has had a really bad year. We nearly started a war with Iran, we impeached our president but couldn’t get him out of office, and then we completely failed in our response to the global pandemic. After initially doing nothing and hoping it would just go away, the US decided to react in the most costly way possible, causing mass unemployment while still proving mostly impotent in fighting the virus. Now, after that huge sunk cost, we have mostly given up on fighting the coronavirus, just in time for a new problem to arise: massive protests all over the country, some of which even turned into riots. The immediate cause is that the police killed another unarmed black man because he was briefly resisting them. But of course it’s pent-up anger from years of police brutality. And of course it couldn’t have come at a worse time, with mass unemployment and a pandemic still raging through the country.

All of this didn’t have to be, so here are some helpful tools of thought:

Defense in Depth

This is a military term, but it really should be used more widely. The idea of defense in depth is that you don’t rely on a single line of defense. Instead you try to find many ways of attacking the problem, and you do as many of them as you can with your budget.

The classic example is fire prevention: Defense in depth means that in big cities you don’t allow construction of wooden buildings. And you have a professional fire department. And you maintain fire hydrants all over the city. And you mandate that people have smoke detectors. And fire extinguishers. And fire escapes or stairs with fire-proof doors. And a sprinkler system. And emergency exits with signs pointing to those exits. And you mandate regular fire drills.

There are two benefits of defense in depth: 1. You’re more effective than any individual measure can be. 2. You’re reducing pressure on the individual measures. If you list all the fire prevention measures (I’m sure I’ve forgotten many) you might think that we are spending a crazy amount of money on fire prevention, but you’re forgetting that each of them can be kinda cheap because it can rely on the others. The reason why your smoke detector is crummy is that it doesn’t have to be perfect because there are enough other layers of defense to pick up the slack.

This is obviously relevant to the coronavirus because the US did not do defense in depth. Before the lockdown we seemed to have two lines of defense: A bit of testing with contact tracing, and the last line of defense, the hospitals. If you don’t do defense in depth, you will be forced into much more drastic and costly measures, like a lockdown, to contain the virus.

What does defense in depth look like for the coronavirus? The country that probably did it best is Taiwan, because they were probably at the highest risk with their strong connection to China. They shut down flights to China early. They increased face mask production to almost ten million face masks per day (for a population of 23 million). They disallowed exports of face masks. They had free hand sanitizer all over the place. They did extensive contact tracing. They enforced quarantine for infected people and those found by contact tracing, and gave money to people in quarantine (to incentivize reporting your symptoms). They punished people who didn’t report their symptoms. Companies would test the temperature of their employees three times per day. Companies would work half from home, half in the office. The two halves would alternate, and the employees were also not supposed to be in contact outside of work (so if somebody got sick, they could at most spread the virus to half the company). They did a big information campaign and fought Chinese disinformation. They were very active in the development of tests and research on the virus.

Each of these measures is much cheaper than what the US did. You bet companies would rather buy one of those infrared thermometers and check everyone’s temperature three times per day than have to shut down or work entirely from home. This approach was also much more effective than the US approach. Yes, you heard that right: It was both cheaper and more effective. But the US doesn’t understand defense in depth. They thought face masks don’t help because they don’t offer 100% protection. They don’t have to offer 100% protection if they’re just one layer of defense. (For reference, if the US had followed Taiwan’s lead and produced 100 million face masks per day for 100 days, that would have cost roughly 0.05% of the stimulus package, and no, I didn’t mean 5%.) Americans would probably also say that checking temperature doesn’t make sense because many people show no symptoms and can still spread the virus. But checking temperature doesn’t have to detect all cases of the virus. If you find that somebody has a high temperature, they can get a test. If that test comes back positive you can test everyone they came in contact with. So you can find the asymptomatic cases and order them to quarantine, too (and then you reward them for going to quarantine).

This is also obviously relevant to programming. I’ve had a lot of frustrating conversations, trying to convince game developers to do more unit testing. Before I knew the concept of “defense in depth” I didn’t know how to respond to the various counter-points. Like “tests can be wrong too” or “tests will miss lots of bugs” or “tests only find the easy bugs, which I spend little time on anyway. Why would I want to spend a lot of time on writing tests if I could spend a little time on fixing those easy bugs?”

But the answer is obviously defense in depth: Tests don’t have to be perfect because they are just one line of defense among many. You don’t have to demand that they’re perfect. You don’t even have to write tests for everything. If you do defense in depth, the pressure on your tests is reduced.

A good programmer will have plenty of defenses against bugs: A debugger, asserts, logging, a static type system, abstractions, static analysis, sanitizers, refactoring tools, fuzzing, automated tests, manual tests, QA, code review, coding standards, source control, release management, continuous delivery etc.

You don’t just rely on a single line of defense so the burden on tests is much lower. They don’t have to find all bugs and they can even occasionally be wrong. The only thing they have to do is save you time and money because they make it cheaper to find bugs than the more costly measures you’d have to do later. (It was incredible how the people who didn’t write tests would spend the end of the project in panicked bug-fixing mode, often working overtime. How do you not realize that this is crazy expensive? Plus all the time that was wasted on other people encountering those bugs during development…)
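The arithmetic behind this can be sketched in a few lines of Python. The catch rates below are hypothetical numbers chosen only for illustration, and the sketch assumes that the layers miss bugs independently of each other, which real layers only approximately do:

```python
from math import prod


def escape_probability(catch_rates):
    """Chance that a bug slips past every layer, assuming independent layers."""
    return prod(1.0 - rate for rate in catch_rates)


# Hypothetical catch rates, not measurements. Every layer is mediocre alone.
layers = {
    "static types": 0.30,
    "unit tests": 0.40,
    "code review": 0.30,
    "QA": 0.50,
}

best_single_layer = max(layers.values())
escaped = escape_probability(layers.values())

print(f"best single layer lets through: {1.0 - best_single_layer:.1%}")
print(f"all layers together let through: {escaped:.1%}")
```

Under these assumptions the best single layer still lets half of the bugs through, while the four imperfect layers together let through about 15%. None of the layers had to be good for that to happen.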

I think defense in depth can clarify a lot of conversations about these concepts. I still see similar discussions about static typing. People give the same reasons for why static types suck as they give for why smoke detectors suck. Smoke detectors don’t prevent fire, and they have lots of false positives. Static types don’t prevent bugs and they have lots of false positives. But that’s missing the point. Smoke detectors will prevent some fires and static types will prevent some bugs, and that’s all they have to do because you should deploy them as part of a defense in depth.

Systems Thinking

If A causes B, is it possible that B also causes A?

– Donella H. Meadows – Thinking in Systems

The recent protests against police violence, the violent response of the police to those protests, and the looting and the response and all those things are incredibly frustrating. And it makes me want to talk about systems thinking.

I sometimes like to read conservative opinion pieces to understand the other side, so I was reading this piece by Andrew C. McCarthy. He starts it off by saying that “The ‘Institutional Racism’ canard bears no resemblance to reality – not in police forces, and not in America.” which sounds like great news. There is no institutional racism. His first data point is that white Americans are more often victims of police violence than African Americans. But then he is forced to admit that if you take into account that there are more white Americans than African Americans, the situation reverses. Luckily he has a way out of that:

While African Americans are involved in two times more police shootings than their percentage of the population would seem to warrant, they commit 53 percent of murders and 60 percent of robberies — well over four times their percentage of the population. The political establishment would have you assume this statistical disparity is caused by institutional racism that myopically beams police attention onto black men. But we know the statistics accurately reflect reality because crimes get reported by victims — a large percentage of whom are black (also outstripping their share of the overall population).

So the reason why African Americans are more likely to be shot by police is that they’re violent criminals. That certainly makes it easy.

Systems Thinking would teach you that you can’t just pick an arbitrary point like that to stop your reasoning. You’re not done until you’ve looked at the whole system. And lots of that thinking is in terms of stocks and feedback loops. Like you could treat the “education” or “wealth” of a population as a stock. Maybe that stock has an influence on the crime rate of the population. And maybe you should look into why that stock is lower for African Americans than it should be. If you draw all the connections, do they form a loop? If yes, then maybe you just found a feedback loop from government policies to the depletion of those stocks.

But let’s say there is no institutional racism. Then there would still be plenty of reasons for the protests. I walked in one because I was so upset with the police response to the protests and riots. There were too many videos of police attacking peaceful protesters. To be clear, I still think 95% of police are fine people looking out for everyone’s best interests, but the sheer number of videos has convinced me that 5% of cops are problematic, and that number is far too high. Now Hanlon’s razor tells us to “Never attribute to malice that which is adequately explained by stupidity”, so let’s give them the charitable interpretation and conclude that they’re incompetent and don’t know how to act responsibly with weapons. If that were the case, people should certainly be protesting, because why did we put these people in a position where they have weapons?

But it’s not just the response to the protests; the response to the riots has been totally inadequate as well. Here is an opinion piece from the same conservative website that explains the frustration of a former police officer, talking about looting last weekend:

For example, on Saturday afternoon and evening, as officers struggled to contain looting in the Fairfax district, I monitored radio traffic from the scene in which an officer in a circling helicopter asked for more personnel to supplement the cops on skirmish lines and those chasing looters. No more officers were available, he was told. At that very moment, about 200 officers were waiting for instructions in a staging area miles away. They remained in that staging area for four hours before being dispatched to the trouble zone, by which time the looting had all but ended.

But it hadn’t ended completely, and I spoke to officers who had the maddening experience of waiting for orders at a command post while watching live news programs on their cell phones. A television-news helicopter was filming looters as they ransacked a computer store about a mile away, and the officers, who were among at least 200 at the command post at the time, could look up in the sky and see the helicopter hovering over the scene as it broadcast the images. The spectacle continued for 45 minutes as carload after carload of looters arrived and carried off computers and other merchandise, presumably until there was nothing left to steal. “I don’t know why anybody in the C.P. wasn’t watching the same thing I was,” one of them told me. “My partner and I could have walked there and handled it ourselves, but they didn’t send us. They didn’t send anybody.”

Again let’s take the charitable interpretation that they’re just incompetent. But at least for me that’s enough reason to be protesting. How can they sit around doing nothing as looting is going on? And then you arm these incompetent people with tear-gas and pepper spray, which they shoot at peaceful protesters as if they didn’t know the consequences of escalation… So even the most charitable interpretation I can come up with would justify protesting, and any less charitable interpretation would especially justify protesting.

Beyond specifics like this, it seems to me like the whole discussion has fallen into one of the systems thinking traps: Drift to Low Performance. Which the book “Thinking in Systems” defines like this:

The Trap: Drift to Low Performance

Allowing performance standards to be influenced by past performance, especially if there is a negative bias in perceiving past performance, sets up a reinforcing feedback loop of eroding goals that sets a system drifting toward low performance.
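The loop in that definition can be sketched as a toy simulation. This is my own minimal model, not one taken from the book: the update rules, the `adjust` rate and the `bias` value are all assumptions made for illustration.

```python
def simulate(steps, bias, adjust=0.5, start=100.0):
    """Toy eroding-goals loop; returns the performance after each step."""
    standard = performance = start
    history = []
    for _ in range(steps):
        # Past performance is perceived as worse than it really was.
        perceived = performance - bias
        # The standard erodes toward that perceived past performance.
        standard += adjust * (perceived - standard)
        # Actual performance follows whatever the standard asks for.
        performance += adjust * (standard - performance)
        history.append(performance)
    return history


drifting = simulate(steps=20, bias=5.0)  # pessimistic perception
anchored = simulate(steps=20, bias=0.0)  # accurate perception

print(f"with negative bias, after 20 steps: {drifting[-1]:.1f}")
print(f"without bias, after 20 steps: {anchored[-1]:.1f}")
```

In this sketch nobody ever decides to get worse. The standard only follows a slightly pessimistic memory of last year, and performance follows the standard, yet the result is a steady decline. An absolute standard, or a comparison to the best performers instead of to your own past, breaks the loop.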

That first opinion piece I linked points out that the number of fatal shootings by the police has been steadily at around 1000 per year, with the most recent number being 1004 people killed by police in the US in 2019. To get a feeling for whether that’s good or bad, let’s compare it to my home country, Germany. Wikipedia claims that 11 people were killed by security forces in Germany in 2018. Now you might say “but more people live in the US than in Germany” and that’s right, but even if we adjust for that, the police kill more than twenty times as many people in the US as in Germany. That is another very strong sign of incompetence in the US police.
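The adjustment for population can be checked with a short calculation. The two death counts are the ones quoted above; the population figures are rough values that I am assuming here (about 328 million for the US and about 83 million for Germany), so treat the result as approximate:

```python
US_KILLED_2019 = 1004          # from the text
GERMANY_KILLED_2018 = 11       # from the text
US_POPULATION = 328e6          # assumption: rough 2019 figure
GERMANY_POPULATION = 83e6      # assumption: rough 2018 figure

us_per_million = US_KILLED_2019 / US_POPULATION * 1e6
germany_per_million = GERMANY_KILLED_2018 / GERMANY_POPULATION * 1e6

ratio = us_per_million / germany_per_million
# Reduction the US would need to reach Germany's per-capita rate.
required_reduction = 1.0 - 1.0 / ratio

print(f"US: {us_per_million:.2f} per million")
print(f"Germany: {germany_per_million:.2f} per million")
print(f"ratio: {ratio:.1f}x, required reduction: {required_reduction:.1%}")
```

With these inputs the ratio comes out at roughly 23, which matches “more than twenty times”, and the reduction needed to reach Germany’s rate is a bit over 95%.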

Now I want to be fair and balanced here: I think the protesters also have no idea of systems thinking. Most of the measures that I have heard proposed, nonsense like “defunding the police”, would be undone by feedback loops within a few years. You defund the police now, then various feedback loops kick in and ten years later the funding is back to what it was. According to “Thinking in Systems” the most impactful things you can do to fix system problems, in order from most impact to least, are

  1. Paradigms
  2. Goals
  3. Self-organization
  4. Incentives, punishments and constraints
  5. Information flows
  6. Reinforcing feedback loops
  7. Balancing feedback loops
  8. Delays
  9. Stock-and-flow structures
  10. Buffers – The sizes of stabilizing stocks relative to their flows
  11. Numbers – Constants and Parameters such as subsidies, taxes, standards

The two things that have the least impact on a system, numbers and buffers, those are the things that “defunding police” tries to attack. They think they’re attacking point 1 by changing the paradigm for policing, but the actual defunding would change the points all the way at the bottom of the list.

So let’s go through the top five items and see how you would fix police brutality in the US:

Paradigms: Change the picture of what the police are supposed to be. Don’t glorify violence and don’t use military equipment. This is a big difference between the US and other countries, but it’s also vague and I won’t pretend it’s easy to do, so let’s look at the other points.

Goals: Set a goal of reducing police violence. Say 8 years to get to the level of Germany, meaning a 95% reduction of shootings. You’d do a 50% reduction after two years, another 50% reduction after four years (a 75% reduction in total), then another 50% reduction after six years (87.5% reduction in total), and then cover the remaining range in the last two years.
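The schedule for that goal is simple compounding, sketched below. The only inputs are the numbers from the paragraph above; the variable names are mine.

```python
FINAL_TARGET = 0.95  # total reduction after eight years, from the text

remaining = 1.0
for year in (2, 4, 6):
    remaining *= 0.5  # halve what is left every two years
    print(f"after year {year}: {1.0 - remaining:.1%} total reduction")

# Cut still needed in years seven and eight to get from 87.5% to 95%.
final_step_cut = 1.0 - (1.0 - FINAL_TARGET) / remaining
print(f"last two years: a further {final_step_cut:.0%} cut of what remains")
```

The output shows that “covering the remaining range” is not a small step: going from an 87.5% to a 95% total reduction means cutting what is left by another 60%, slightly more than in each of the earlier two-year periods.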

Self-Organization: Once you set the goal, allow the police to organize to fulfill the goal. A big part of the current problem is that the police are not part of the conversation. They need to be involved and they also need to come up with solutions. This also has to happen at all layers. Right now you hear too many stories of officers being intimidated or excluded for stepping out of line. Meaning if one good cop speaks out about a “bad apple”, the good cop is likely to be punished. You can’t get self-organization with that.

Incentives, Punishments, Constraints: Get rid of qualified immunity. Make sure that every use of force has to be justified. Every bullet shot has to be written up. Police should be held to a higher standard than the average citizen, not a lower standard. They’re supposed to be the adults in the room, the ones who calm things down, not the ones who escalate. Make it illegal to assault peaceful protesters with batons or tear-gas. If it’s already illegal, start enforcing the law. It wouldn’t be a problem that 5% of cops are problematic if that 5% was held accountable.

Information Flows: Make the data on police brutality transparent. Who are the “bad apples”? What is being done to help them become better? Can we analyze the situations in which they acted badly? Can we give them more information to improve their own behavior?

There is an order to these in that they get easier the further you go down the list. And of course the two worst points, Numbers and Buffers, are easiest of all to attack. Maybe that’s why people go for them. But even in this list of the top five items there are things we can do that don’t seem terribly difficult. I don’t know how to change the paradigm, that one is difficult, but I do know things you could do further down.

Of course you can’t ignore the other items. If you try to reduce the violence and don’t take into account feedback loops, crime might go up, which would immediately force you to undo your measures. But if you set a goal of a 95% reduction of fatal shootings by the police over eight years (with the first 50% reduction in the first two years), you bet there would suddenly be a lot of clarity about what needs to be done. Like you’d have to get a lot of guns off the street. The point is that if you have a good goal, it’s easier to see what you need to do about the other items in the list. Of course it actually has to be everyone’s goal and there can’t be competing goals. If the person at the top just claims that that’s the goal, but it isn’t actually the goal of the majority of people involved in the system, then you’re not really setting a goal.

This should also clarify why some other things don’t help: A more conservative-leaning friend suggested that police should be paid more so that we get better people working for the police. That’s probably a good idea if we apply Hanlon’s Razor to the above problems and generously assume that the problem is stupidity, not malice, but how much do we expect to improve things? Better pay is an incentive, so we’re addressing a point pretty high up on the list, so that sounds good. But what are we incentivizing? We are incentivizing more people to apply for police jobs. That might be good because then we can be more selective in who we hire, but how does that relate to police violence? It relates through the size of stocks and buffers, which are pretty low points on the list. (You might say that it will impact feedback loops, but it only does that by acting on a stock, not by changing the loop itself.) Let’s say we start paying police better and as of next year we get much better people applying to the police. Then they make their way through the ranks, and five years later we see a 2% reduction in police shootings. Remember that our goal was a 95% reduction if we want to get to the level of Germany. So… maybe it’s a good idea as part of a larger pool of measures, following a defense in depth approach, but I would probably be more interested in those other measures. I think in terms of bang-for-the-buck impact it won’t compare that well. (And this analysis is actually being generous. In reality if you start hiring different kinds of people, balancing feedback loops start kicking in as existing officers will want you to hire more people who are like them. The pay incentive doesn’t do anything about that.)

The other problem with that kind of approach is that it assumes that the problem is widespread. If you assume that most cops are bad, you need to replace a lot of them, so you need to change hiring. But if you assume that 5% are bad, then you need more targeted interventions. I actually think that even among the police that we now see behaving badly, most would behave much better if they acted within a system that has different paradigms, goals, incentives, punishments, constraints and information flows.

But the current situation is where you find yourself if you allow a drift to low performance. You suddenly find yourself at a point where you’re twenty times worse than comparable countries and you need a 95% reduction to get back to normal levels. All the while you get people writing opinion pieces about how the situation isn’t that bad and look, the numbers have been pretty stable, only a slight increase maybe.

Capability Traps

Speaking of a drift to low performance, the US healthcare system and much of the US government are stuck badly in a capability trap. Sorry, we’re back to talking about the coronavirus, but I also promise that this one is directly applicable to programmers.

The name comes from the classic paper “Nobody Ever Gets Credit for Fixing Problems that Never Happened.” The title refers to the fact that people don’t get credit when they act early on a problem and ensure that it doesn’t escalate. But when the problem gets bad and they need heroic efforts to save it, they get lots of credit and might even get rewarded. And if that is your reward structure, you’re rewarding people for letting problems escalate.

“Capability” in this context refers to lots of things that allow you to do good work. Good tools, good processes, practiced interactions with colleagues, internal knowledge, just in general the ability to make things happen. Whether that means good tools that allow you to be productive all day or the knowledge of who to talk to when a problem arises and what steps are necessary to deal with it. The paper makes the claim that this “capability” is a stock that deteriorates over time, but that you can also invest in.

And they found that a surprisingly large number of organizations don’t invest in their capabilities, even when there are seemingly obvious ways to save money. Looking for the reasons for this lack of investment, they found that once your capability erodes, you’re constantly behind: Your unaddressed problems pile up, you’re trying to catch up and you certainly have no time to try to improve your processes. Every once in a while a manager tries to improve things but the improvements take a while, so in the meantime people fall even further behind because you have to get your normal work done while also trying to make the improvements happen. So the improvement gets abandoned because you have to get stuff done right now and now you’re even further behind.

At the other end of the spectrum, if you have a business that runs well, a shock might happen and you might have to cut costs by 10%. Surprisingly you find that output isn’t actually affected by the cost cutting and you manage to do the same amount of work with fewer resources. Which is great and the responsible manager will probably get rewarded. What’s invisible to you is that your capabilities might be eroding as you had to fire the only person who knows how to fix X, so the next time that X breaks it takes longer to repair. Or your employees do less preventative maintenance in order to get their work done on time, but without preventative maintenance you’ll be getting more problems a few years down the line.

So if you cut corners you get better results in the short term and worse results in the long term. If you invest, you get worse results in the short term and better results in the long term. So the thing that’s best in the short term is exactly the opposite of what’s best in the long term. And once your capabilities have eroded too far, you can never stomach the kind of investments that would bring long term improvements because you’re already constantly stressed and barely able to keep up. At that point any shock will send you into deep trouble, even if you could have easily stomached that shock if you had just kept your capabilities up.
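To make that trade-off concrete, here is a toy model. It is not the model from the paper; the update rule and every parameter (`erosion`, `gain`, the time fractions) are assumptions I picked so that the effect is easy to see. Capability is a stock that erodes a little every period and is rebuilt by the share of time invested in improvement, and output is capability times the share of time left for normal work.

```python
def simulate(invest_fraction, periods=40, capability=1.0,
             erosion=0.05, gain=0.5):
    """Return the output of each period for a fixed share of time invested."""
    outputs = []
    for _ in range(periods):
        # Time spent on improvement is not available for normal work.
        outputs.append(capability * (1.0 - invest_fraction))
        # The stock erodes on its own and is rebuilt only by investment.
        capability += gain * invest_fraction - erosion * capability
    return outputs


cut_corners = simulate(invest_fraction=0.0)  # all time goes to output
hold_steady = simulate(invest_fraction=0.1)  # just enough upkeep
invest = simulate(invest_fraction=0.2)       # extra time for improvement

for name, outputs in [("cut corners", cut_corners),
                      ("hold steady", hold_steady),
                      ("invest", invest)]:
    print(f"{name:12s} first: {outputs[0]:.2f}  "
          f"last: {outputs[-1]:.2f}  total: {sum(outputs):.1f}")
```

In the first period cutting corners wins and investing loses, which is all that a manager looking at this quarter will see. By the last period the order has reversed, and the total over all periods favors investing by a wide margin. The exact numbers mean nothing, only the reversal does.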

If this sounds like the New York response to the coronavirus, you’d be right. After years of cuts to the health care system, capability was already eroded when the virus hit. Then the government had a catastrophic initial response where they failed to do contact tracing and refused to do cheap measures that would prevent lots of damage. A good amount of this was due to poor internal communication and distrust. Also capabilities had been eroded at the federal level so there weren’t enough tests. When New York finally started testing people in March, the numbers went from 100 cases to 10,000 cases in two weeks. The virus can’t multiply by 100 that quickly; the only possible explanation was that it was already widespread because of a failed initial response, and the tests were just catching up.
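To see how fast that would have been, you can compute the doubling time that a jump from 100 to 10,000 cases in two weeks implies. This is a small sketch that uses only the numbers from the paragraph above and assumes steady exponential growth over the 14 days:

```python
from math import log2

CASES_START = 100     # from the text
CASES_END = 10_000    # from the text
DAYS = 14             # "two weeks"

growth_factor = CASES_END / CASES_START
doublings = log2(growth_factor)
implied_doubling_time = DAYS / doublings

print(f"{doublings:.1f} doublings in {DAYS} days")
print(f"implied doubling time: {implied_doubling_time:.1f} days")
```

That works out to cases doubling roughly every two days for two weeks straight. If you think the actual spread was slower than that, the remainder of the jump has to come from testing finding cases that already existed.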

So obviously Cuomo launched a large, heroic effort to lock down the city (after initially resisting for far too long) and to organize all the hospitals and to get supplies and PPE (so many eroded capabilities) and he became hugely popular. This is always how these go: The more you mess up, the more popular you become when you later save the day. Nobody notices that the measures you had to do to save the day were much more expensive than if somebody more capable had been in charge and had addressed the problem early on. (For reference, if we see the same number of deaths in the whole country that we saw in New York City, we’d have 600k deaths with just 25% of people getting the virus. Obviously the numbers would be much higher if 60% or more get infected.)

This is also relevant to programmers. I used to work in game development and everyone in game development is jealous of Nintendo and Blizzard (and some other companies, but I’ll stick with these two). The reason is that they seem to have found a method to consistently produce high quality games. Every single Mario and Zelda game is very good. Every single Blizzard game is very good. How do they do it? Rob Pardo addressed the question of “polish” in this talk. He says that other game developers seem to think that Blizzard just spends some more time and money at the end of the project to make the game extra good before it comes out. Rob Pardo says that they do spend more time at the end, but that’s not the reason for the high quality. You see, at many companies the game is kinda shit during development. Lots of bugs, half-implemented features that may or may not go somewhere some day, and few hints of the fun that will one day be had. And then at the end of the project you allocate some time for “polish” where you fix as many bugs as you can and make the graphics real nice and bring all the features together. Rob Pardo says that this is not how it happens at Blizzard. At Blizzard the polish happens all throughout production. They always have a fun game. Sure, it’s unfinished, but even in the unfinished state it’s fun and polished. It’s never a broken unfun mess.

When you tell most game developers about how Blizzard works, they give you the answers of people who are stuck in a capability trap. There isn’t time to polish during development. You can barely get the features done that you’re working on because you’re busy fixing features that you submitted six months ago because new bugs keep on breaking them. It’s like the Red Queen’s race where you have to run as fast as you can just to keep up.

The tricky part is that these developers are right. If they started doing the Blizzard approach, they would just fall hopelessly behind. Because if you are already behind and you try to improve things, things get worse before they get better. It almost seems like there is no way out.

So what is the way out? How do we get the healthcare system out of this mess? The good news is that a shock can actually spur change, and boy are we seeing a shock right now. So now we just have to make sure we’re making the right changes. For that you need to realize that things can be better. Just like game developers look at Blizzard and Nintendo and conclude that they could never be like them, the US looks at the good healthcare systems of other countries and thinks that it can never be like them. That thinking has to stop. You have to have more confidence in your ability to improve things. Unfortunately you also have to know and communicate that while you’re making changes, things will be worse before they get better. You bet if the US had a model of Medicare for all, things would initially be worse before they get better. To give some encouragement I’ll quote one part of the paper where Du Pont had similar problems: They paid more than other companies for maintenance while they got worse maintenance outcomes and often had to spend extra for panicked repairs. Just like the US system pays more for healthcare and gets worse outcomes:

In 1991, an in-house benchmarking study documented a gap between Du Pont’s maintenance record and those of the best performing companies in the chemicals industry. The benchmarking study revealed an apparent paradox: Du Pont spent more on maintenance than industry leaders but got less for it. Du Pont had the highest number of maintenance employees per dollar of plant value, yet its mechanics worked more overtime. Spare parts inventories were excessive, yet they relied heavily on costly expedited procurement of critical components. Overall, Du Pont spent 10-30% more on maintenance per dollar of plant value than the industry leaders, while overall plant uptime was some 10-15% lower.

[…]

To see how the capability trap arose in the chemicals industry, imagine the effects of cost cuts on maintenance, such as those beginning with the oil crisis of 1973 and subsequent recession. In chemical plants, when critical equipment breaks down, it must be fixed. Hence maintenance managers required to reduce costs must cut preventive maintenance, training, and investments in equipment upgrades. The drop in planned maintenance eventually causes breakdowns to increase, forcing management to reassign more mechanics from planned maintenance to repair work. Breakdowns then rise even more. As uptime falls, operators find it harder to meet demand and become less willing to take equipment down for scheduled maintenance, leading to more breakdowns and still lower uptime. More breakdowns simultaneously constrain revenue and increase costs (due to overtime, expedited parts procurement, the non-routine and often hazardous nature of outages, collateral damage, and so forth). More subtly, lower uptime erodes a plant’s ability to meet its delivery commitments. As it develops a reputation for poor delivery reliability, business volume and margins fall further. The plant slowly slides into the capability trap, with high breakdowns, low uptime, and high costs.

[…]

Policy analysis showed that escaping the capability trap necessarily meant performance would deteriorate before it could improve: While continuing to repair breakdowns, the organization has to invest additional resources in planned maintenance, training and part quality, raising costs. Most importantly, increasing planned maintenance reduces uptime in the short run because operable equipment must be taken off-line for the planned maintenance to be done. Only later, as the Reinvestment loop begins to work in the virtuous direction, does the breakdown rate drop. Fewer unplanned breakdowns give mechanics more time for planned maintenance. As maintenance expenses drop the savings can be reinvested in training, parts quality, reliability engineering, planning and scheduling systems, and other activities that further reduce breakdowns. For example, upgrading to a more durable pump seal improves reliability, allowing maintenance intervals to be lengthened and inventories of replacement seals to be cut. Higher uptime also yields more revenue and provides additional resources for still more improvement. All the positive feedbacks that once acted as vicious cycles dragging reliability down become virtuous cycles, progressively and cumulatively boosting uptime and cutting costs.

Now the challenge facing the team was implementation. They knew nothing could happen without the willing participation of thousands of people, from the lowest-grade hourly mechanic to regional vice presidents. They also realized that their views had changed because they had participated in the modeling process. Somehow they had to facilitate a similar learning process throughout the plants.

[…]

At plants that implemented the program by the end of 1993, the mean time between failure for pumps (the focus of the program) rose by an average of 12% each time cumulative operating experience doubled. Direct maintenance costs fell an average of 20%. In 23 comparable plants not implementing the program the learning rate averaged just 5% and costs were up an average of 7%. Washington Works boosted production capability 20%, improved customer service 90%, and cut delivery lead time by 50%, all with minimal capital investment and a drop in maintenance costs. For the company as a whole, conservative estimates exceed $350 million/year in avoided maintenance costs alone.

However, success creates its own challenges. One issue related to the persistence of the cost-saving mentality. A member of the modeling team commented, “As soon as you get the problems down, people will be taken away from the effort and the problems will go back up.” In fact, mandated corporate cost-cutting programs did cause significant downsizing throughout the entire company, weakening the reinvestment feedback and limiting their ability to expand the program.

This is the full dynamics of the capability trap in one example: Things deteriorate because your capabilities are eroding and you end up with higher costs while getting worse results. If you recognize the problem you can make huge improvements and make back way more money than you invested. Unfortunately at the end, management cut the whole thing short because they wanted to save even more costs, thus killing the goose that laid the golden eggs. One of the people involved in the process moved on to BP, leading to this quote:

A BP team reduced butane flare-off to zero, saving $1.5 million/year and reducing pollution. The effort took two weeks and cost $5000, a return on investment of 30,000%/year. Members of the team had known about the problem and how to solve it for eight years. They already had all the engineering know-how they needed, and most of the equipment and materials were already on site. What had stopped them from solving the problem long ago? The only barrier was the mental model that there were no resources or time for improvement, that these problems were outside their control, and that they could never make a difference.
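Since I promised to feed things back to programming, here is a quick check of the numbers in that quote. The two dollar figures come straight from the quote. The variable names and the payback calculation are my own additions, and the payback figure assumes the savings accrue evenly over the year.

```python
# Figures taken from the BP quote above.
annual_savings = 1_500_000  # dollars saved per year
one_time_cost = 5_000       # dollars spent on the fix

# Return on investment per year, as a percentage of the money spent.
roi_percent = annual_savings / one_time_cost * 100

# Assumption: savings accrue evenly over the year, so this is only a
# rough estimate of how long the fix takes to pay for itself.
payback_days = one_time_cost / (annual_savings / 365)

print(f"ROI: {roi_percent:,.0f}% per year")
print(f"Payback time: {payback_days:.1f} days")
```

The fix pays for itself in a bit over a day, and it still went undone for eight years.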

I bet you there are tons of examples like this in the healthcare system: Huge costs that everybody knows how to fix, but where everybody also assumes that they have no power to fix them. And they’re probably right, because without support from somebody higher up you are probably just forced to do the best work you can in the system that you work in.

This also explains why cuts to healthcare spending don’t help. All you’re doing is eroding capability and then things will be even more expensive a few years down the line. The better way to save money is to improve capability and get rid of all the wasteful work that’s going on.
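To make the "worse before better" shape concrete, here is a toy simulation. It is only a sketch of the dynamics described in the quotes, not the model from the paper: the function `simulate`, the variable names, and every number in it are made up by me for illustration. Equipment condition decays on its own, planned maintenance builds it back up, and breakdowns (which always have to be repaired) grow as condition gets worse.

```python
def simulate(planned_effort, periods=40, condition=0.2):
    """Toy capability-trap model. All constants are invented for illustration.

    planned_effort: fraction of capacity spent on planned maintenance (0..1)
    condition:      equipment health, 0 = falling apart, 1 = perfect
    Returns a list of (cost, uptime) tuples, one per period.
    """
    history = []
    for _ in range(periods):
        # Worse condition means more breakdowns, and those must be repaired.
        breakdowns = 10.0 * (1.0 - condition)
        # Repairs are paid per breakdown, planned maintenance per unit of effort.
        cost = 10.0 * breakdowns + 50.0 * planned_effort
        # Both breakdowns and planned maintenance take equipment offline.
        uptime = 1.0 - 0.05 * breakdowns - 0.2 * planned_effort
        history.append((cost, uptime))
        # Planned maintenance improves condition, neglect lets it decay.
        condition += 0.5 * planned_effort * (1.0 - condition) - 0.1 * condition
    return history


# Both runs start from the same run-down state (condition = 0.2).
status_quo = simulate(planned_effort=0.05)  # keep cutting planned maintenance
reinvest = simulate(planned_effort=0.40)    # invest in planned maintenance

for period in (0, 1, 2, 5, 39):
    old_cost, old_uptime = status_quo[period]
    new_cost, new_uptime = reinvest[period]
    print(f"period {period:2d}: status quo cost {old_cost:6.1f} uptime {old_uptime:.2f}"
          f" | reinvest cost {new_cost:6.1f} uptime {new_uptime:.2f}")
```

With these made-up numbers the reinvestment run costs more and has lower uptime at first, and only after a few periods does it overtake the status quo on both measures. If you judge the program by its first period, you cancel it.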

When you find yourself in this situation, the only thing you can do is compare with other people who don’t seem to have the same problem. Du Pont compared against other chemical companies. Game developers compared against Blizzard. For the health care system and the police violence problem, you can compare against other countries. The most important thing is to build the confidence that you can make changes. That you can reduce police shootings by 95% to get to the level of Germany, and that you can reduce healthcare spending to get to the level of other countries. Then you have to build support because things will get worse before they get better, and you need the support from the higher-ups to survive that time. And then you need continued support so that you don’t get the rug pulled out from under you as soon as you see the slightest improvement. But as long as you keep your eye on an obviously good comparison point, you can eventually get there.

Outlier Visibility

Back to police violence. I think part of the problem of the Internet is that it makes outlier opinions very visible. If you read the National Review right now, you will find opinion pieces that get angry at the left for defending riots. If you’re on the left, you are likely to be really puzzled by this. Who is defending riots? Most people on the left are mad at the rioters because they are undermining the cause. But sure enough, if you go searching, you will find plenty of people on the left defending the riots. And obviously the opinion writers on the right will focus on those people.

The left of course makes the same mistake: They write about how the right defends police violence. People on the right may be surprised by this because of course they’re against police violence. But sure enough, if you go searching, you will find plenty of people on the right defending police violence. And obviously the opinion writers on the left will focus on those people.

In this sense the Internet has made it much easier for the left and the right to talk past each other. The outliers are likely to be amplified and to be just as visible as mainstream opinions. In fact on the opposite side, the outliers are probably more visible. Meaning if you’re on the left, the outliers on the right are probably more visible to you than the mainstream opinion on the right, and vice versa.

This was less likely to happen before the Internet because you had to get through a few filters before becoming visible to a wide audience. The Internet reversed the logic of those filters: Before the Internet it was easier to be widely visible if you made a reasonable point, now it’s easier to be widely visible if you make a fringe point. (Of course there are plenty of counterexamples from before the Internet, but now it’s happening every single day, and even complete nobodies can be widely visible if the opposite side finds them outrageous enough.)

I don’t know what to do about that other than the usual advice to not just read one side of the story. But I also find that just being aware of this phenomenon helps. When some random person says something outrageous on Facebook and that gets widely shared, it helps to keep in mind that they are likely this visible because the opinion is an outlier, not one that’s held by a lot of people.

Conclusion

It may be a bit of a mess of a blog post, but maybe a point came together, illustrated by the initial quote: Problems of this scale always have multiple sources and because of that you also have to attack them from multiple angles. I have seen too many discussions that don’t take that into account. Especially about the coronavirus. I still hear people defending Trump and blaming China. Even if China is to blame for the initial spread, you can’t just end the discussion there and conclude that none of this is Trump’s fault. You have to look at the whole multiplicity of causes and all the actions and ask “OK given that China was the initial problem, how did we act?” Did we take the right actions? Did we do a good job implementing the measures we decided to take? Were the measures we took worth it? How much benefit did we get for the cost? Could we have done more? Could we have achieved the same results for a lower cost by doing different things? Are there any outside reference points we can use for what a good performance looks like here?

Unfortunately if you’re stuck in a capability trap, you don’t do any of that kind of thinking. You see that the whole thing is a mess and given the situation you did the best you could. Never mind that you only really started acting once the situation escalated. You don’t have time to realize that and look at the bigger picture because you are busy putting out fires. And then once the immediate problem is over you don’t have time to invest in your capabilities to ensure that this won’t happen again. Because the time and money you invested in trying to fix the last problem made you fall further behind, and there are already new problems on the horizon.

It’s always really hard to convince people that this kind of thinking is wrong, because from their standpoint there were no alternatives. It’s especially frustrating because it felt like you did a lot of things, heroically saving the day with all your work, and now you’re criticized for that. But the point of this blog post is to allow you to see that you can get out of situations like that, and to give you some ways of thinking about the problem that allow you to get the whole system to a better point. The goal isn’t to heroically put out the fires and deal with the crisis. The goal is to get the system to a point where, when encountering an unanticipated problem, you’re easily able to deal with it early, before it has time to interact with other causalities.