Were These Companies Let Down by Bad Engineers?

The most expensive software disasters in history all raise the same question: how do companies this large make mistakes this basic?

The answer surprises most people: in the majority of cases, the engineers were not the problem. They had written their part of the code correctly. The failures came from how those parts connected to each other, how teams touched ageing systems, and which decisions were made without enough deliberation.

In this article we examine ten real disasters — one from Turkey, the rest from the global stage. Each ends with a single lesson. By the time you finish reading, you will see large-scale collapses differently: not as technical catastrophes, but as the predictable result of small decisions gone wrong.

10 Software Disaster Map
Global map of the 10 most expensive software failures

1. Knight Capital: $460 Million in 45 Minutes

What Happened?

In 2012, US trading firm Knight Capital deployed a software update across eight servers. Seven received it correctly. One was missed.

That overlooked server still held a dormant piece of old code that had not been touched in years. The update inadvertently reactivated it. For 45 minutes, the system placed orders no one had requested. By the time the team realised what was happening, roughly $460 million had been lost. Knight needed an emergency rescue within days and was acquired by a rival less than a year later.

Technical Term

Technical Debt — The accumulation of old, unmaintained code left inside a system over time. Think of it like stuffing rubbish under a bed — it does not bother you at first, but eventually it becomes a problem you cannot ignore.
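The mechanics behind the Knight failure can be sketched in a few lines. This is a hypothetical illustration, not Knight's actual code: the function names, the flag, and the order logic are all invented. The point is how a repurposed flag can revive dormant logic on a server the update never reached.

```python
# Hypothetical sketch, not Knight Capital's real system: how reusing an
# old flag can wake dormant code on a server that missed an update.

def legacy_order_loop(order):
    # Dormant logic from years ago: fires one child order per share,
    # with no check on whether the parent order is already filled.
    return [order["symbol"]] * order["qty"]

def new_routing_logic(order):
    # The behaviour the update was supposed to enable everywhere.
    return [f"{order['symbol']} x{order['qty']}"]  # one batched order

def handle_order(order, server_was_updated):
    # The deployment reused an old flag. On updated servers the flag
    # selected the new code; on the one missed server it revived the
    # old loop instead.
    if server_was_updated:
        return new_routing_logic(order)
    return legacy_order_loop(order)

order = {"symbol": "XYZ", "qty": 4}
print(handle_order(order, server_was_updated=True))   # one batched order
print(handle_order(order, server_was_updated=False))  # four runaway orders
```

The danger is that both paths compile, both run, and nothing looks wrong until the old path starts trading.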

Lesson Learned

Every piece of dormant code in a live system is a potential time bomb. Cleaning up unused code is not a waste of time — it is how future disasters are prevented.

Knight Capital 45 Minute Crash
Knight Capital's $460 Million Loss in 45 Minutes

2. Akbank: 43 Hours Without Banking in Turkey (2021)

What Happened?

In July 2021, Akbank applied a software update to its IBM mainframe, the central system that all of its banking operations depend on. The update triggered an unexpected conflict, which cascaded into a complete service outage.

Branches closed. ATMs stopped responding. Mobile banking froze. Card payments failed. For 43 hours, tens of millions of customers were locked out of basic banking services. Akbank's market value dropped by approximately 20 billion Turkish lira during those days.

Technical Term

Mainframe — A large, central computer system that major institutions have built their operations on over decades. Think of it as a load-bearing wall that the entire building leans against — vital, but increasingly difficult to touch safely as years pass.
Disaster Recovery Plan — A pre-written answer to the question: "What would we do if this system failed today?" Akbank had one; the scale of the outage turned out to be far beyond what the plan had accounted for.

Lesson Learned

Touching large legacy systems is one of the highest-risk actions in software operations. Any such intervention should be thoroughly tested, carried out gradually wherever possible, and accompanied by a clear rollback plan.

3. CrowdStrike: The Update That Stopped the World (2024)

What Happened?

In July 2024, cybersecurity company CrowdStrike pushed a software update to its endpoint protection tool. The update contained a flawed logic check — it attempted to read data that did not exist and crashed as a result.

The problem: this update was delivered simultaneously to millions of Windows systems worldwide. Airports, hospitals, banks, broadcasters. All of them displayed the blue screen of death at the same time. The estimated global economic cost exceeded five billion dollars.

Technical Term

Staged Rollout — Releasing an update to a small group of users first, observing the results, and only then expanding to everyone. Similar to how a pharmaceutical company runs trials before launching a drug to the public. CrowdStrike skipped this step entirely.
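A staged rollout can be as simple as deterministic bucketing: hash each machine into a stable bucket and only ship the update to buckets below the current percentage. The sketch below is illustrative (it is not CrowdStrike's pipeline; the host names and percentages are invented), but it shows the core mechanism.

```python
# A minimal sketch of staged-rollout gating using stable hash buckets
# (illustrative only, not CrowdStrike's actual delivery pipeline).
import hashlib

def in_rollout(machine_id: str, rollout_percent: int) -> bool:
    # Hash the machine id into a stable bucket from 0-99; the update
    # reaches a machine only if its bucket is under the current percentage.
    digest = hashlib.sha256(machine_id.encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < rollout_percent

fleet = [f"host-{i}" for i in range(1000)]
canary = [h for h in fleet if in_rollout(h, 1)]   # ~1% canary wave
wave_2 = [h for h in fleet if in_rollout(h, 25)]  # expand after observation
print(len(canary), len(wave_2))
```

Because the bucket is derived from a hash rather than chosen at random each time, a machine that received the canary wave is guaranteed to stay in every later wave, so observations from the canary remain valid as the rollout expands.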

Lesson Learned

The size of an update does not determine its risk. Releasing anything to everyone simultaneously turns even a minor defect into a global outage.

Touching Old Systems
Uncontrolled Interventions in Large, Old Systems

4. Mars Climate Orbiter: Lost Because of Units (1999)

What Happened?

NASA's $125 million Mars Climate Orbiter was destroyed as it attempted to enter orbit around the planet. It came in far lower than planned and broke apart in the atmosphere.

The cause was disarmingly simple: one team at NASA had used metric units (kilograms, metres) in their calculations while the team at contractor Lockheed Martin had used imperial units. No one had asked which system the other was using.

Technical Term

Interface Contract — A written agreement between two teams specifying exactly what format data will be sent in and what rules each side must follow. Think of it as two ambulance crews from different countries agreeing to use the same blood type system before working in the same hospital.
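One way to enforce such a contract in code is to make the unit part of the type rather than a comment. The sketch below is illustrative (the values and class are invented, not the actual trajectory software), but it shows the idea: data crossing a team boundary must be converted explicitly, in one place.

```python
# A minimal sketch of making units part of the interface contract
# (illustrative values, not the actual MCO trajectory software).
from dataclasses import dataclass

LBF_S_TO_N_S = 4.448222  # pound-force seconds to newton seconds

@dataclass(frozen=True)
class Impulse:
    newton_seconds: float  # the unit lives in the field name, not a comment

    @classmethod
    def from_pound_force_seconds(cls, value: float) -> "Impulse":
        # The only legal way in: imperial data is converted at the boundary.
        return cls(value * LBF_S_TO_N_S)

# The contractor side converts explicitly; downstream code only ever
# sees newton seconds and cannot misread the unit.
reading = Impulse.from_pound_force_seconds(100.0)
print(round(reading.newton_seconds, 1))  # 444.8
```

Had both teams been forced through a boundary like this, a mismatch would have surfaced at the first integration test rather than at Mars.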

Lesson Learned

This was not an engineering failure. It was an organisational one. Two teams worked independently and never aligned on the most basic shared rule. In complex systems, integration points must be defined before anything else.

5. Ariane 5: Copy-Paste Worth $370 Million (1996)

What Happened?

The European Space Agency's Ariane 5 rocket exploded 37 seconds after launch. A $370 million loss.

The reason: a software module had been copied directly from its predecessor, Ariane 4. That module was calibrated for Ariane 4's velocity range. Ariane 5 was significantly faster. The speed value exceeded what the module's data type could hold, and the system crashed.

Technical Term

Integer Overflow — What happens when a number grows beyond the capacity of the container holding it. Imagine a kitchen scale that only reads up to 5 kg — put 6 kg on it and the display shows something meaningless. The same thing happened here with the velocity figure.
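The same class of failure is easy to reproduce. The sketch below uses Python's `ctypes` to mimic a 16-bit signed slot (the real Ariane code was Ada, and the velocity figures here are illustrative): a value that fits Ariane 4's range is fine, while a larger one silently wraps to nonsense.

```python
# A minimal sketch of 16-bit signed overflow, the class of failure behind
# Ariane 5 (values illustrative; the real software was written in Ada).
import ctypes

def to_int16(value: float) -> int:
    # Mimics storing a number in a 16-bit signed slot: anything outside
    # -32768..32767 wraps around instead of raising an error.
    return ctypes.c_int16(int(value)).value

print(to_int16(30_000))  # 30000  - fits within the old range
print(to_int16(40_000))  # -25536 - a faster profile wraps to nonsense
```

The wrapped value is not obviously broken; it is just a different number, which is exactly why the fault surfaced as a crash in flight rather than at compile time.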

Lesson Learned

Code that works perfectly in one context can be dangerous in another. Copying software between projects without re-testing it in the new environment is a risk that often goes unacknowledged until it is too late.

6. TSB Bank: The "Big Bang" Migration Crisis (2018)

What Happened?

British bank TSB decided to migrate from its legacy system to a new platform. Engineers recommended a phased approach. Senior management chose to go live with everything at once, overnight.

The migration failed. Around 1.9 million customers were locked out of their accounts for days. Some saw incorrect balances. The total cost of the crisis to the bank reached approximately £330 million.

Technical Term

"Big Bang" Migration — Moving everything from one system to another in a single, simultaneous switch. Like rewiring an entire house's electrical system in one night — either everything works in the morning, or you are in the dark. In software, this gamble rarely pays off.

Lesson Learned

When technical teams recommend gradual migration and management overrides them in favour of speed, it is the end users who absorb the risk. Organisational pressure is not a valid substitute for technical readiness.

Decision Ripple Effect
How One Small Decision Turns Into a Disaster Years Later

7. Equifax: $1.4 Billion Worth of a Missed Update (2017)

What Happened?

In 2017, Equifax — one of the largest credit reporting agencies in the United States — disclosed that the personal and financial data of approximately 147 million people had been stolen.

Attackers had exploited a known vulnerability in an open-source library the company used. A patch had been available for two months. Equifax had not applied it. Executives resigned. The total cost to the company exceeded $1.4 billion.

Technical Term

Software Dependency — A ready-made code library written by someone else that your application relies on to function. Like canned goods from a supermarket — you did not make them, but you are serving them. If the best-before date has passed, you could be unknowingly putting people at risk.
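Auditing dependencies against known-patched versions can be automated with very little code. The sketch below is a hypothetical inventory check (the package names and version numbers are invented, and real tools handle more complex version schemes), but it captures the discipline Equifax lacked: a routine comparison of what is installed against what is known to be safe.

```python
# A minimal sketch of auditing dependencies against first-patched versions
# (package names and versions here are hypothetical).

def parse(version: str) -> tuple:
    # Turn "2.3.5" into (2, 3, 5) so versions compare numerically.
    return tuple(int(part) for part in version.split("."))

def is_vulnerable(installed: str, first_patched: str) -> bool:
    return parse(installed) < parse(first_patched)

inventory = {"struts-like-lib": "2.3.5", "logger-lib": "2.17.1"}
patched_in = {"struts-like-lib": "2.3.32", "logger-lib": "2.17.1"}

for name, version in inventory.items():
    if is_vulnerable(version, patched_in[name]):
        print(f"UPDATE NEEDED: {name} {version} < {patched_in[name]}")
```

Run on a schedule and wired to an alert, a check like this turns "we did not know a patch existed" into a claim no one can make.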

Lesson Learned

Third-party code carries the same responsibility as code you write yourself. Unpatched dependencies are not technical debt — they are open security vulnerabilities waiting to be exploited.

8. GitLab: Five Backup Systems, None of Them Working (2017)

What Happened?

A GitLab system administrator accidentally ran a deletion command on the wrong server, wiping a large portion of the production database. The team turned to their backups. They discovered that none of their five backup mechanisms were functioning properly — one pointed to the wrong server, one had never been configured, one contained data that was far too old to be useful.

GitLab made the unusual decision to share everything publicly in real time as it happened. Some data was recovered. Some was permanently lost.

Technical Term

Disaster Recovery Test — Running through a "what would we do if this broke today" scenario in a real environment before any actual emergency occurs. Like a fire drill — a fire extinguisher that has never been tested is just an object on the wall.
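The drill can be automated. The sketch below is a simplified stand-in (a real restore would replay a database dump, and the file names are invented), but it shows the principle GitLab's five mechanisms all failed: actually restore the backup somewhere and verify the result matches the source.

```python
# A minimal sketch of an automated backup test: restore into a scratch
# location and compare checksums, rather than trusting that a backup exists.
import hashlib
import os
import shutil
import tempfile

def checksum(path: str) -> str:
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def backup_is_restorable(source: str, backup: str) -> bool:
    with tempfile.TemporaryDirectory() as scratch:
        restored = os.path.join(scratch, "restored")
        shutil.copy(backup, restored)  # stand-in for a real restore step
        return checksum(restored) == checksum(source)

# Simulate a source file and a backup of it.
with tempfile.TemporaryDirectory() as d:
    src = os.path.join(d, "db.sql")
    bak = os.path.join(d, "db.bak")
    with open(src, "w") as f:
        f.write("INSERT INTO accounts ...")
    shutil.copy(src, bak)
    print(backup_is_restorable(src, bak))  # True
```

A check like this, run nightly, would have caught a misconfigured or empty backup long before anyone needed it.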

Lesson Learned

An untested backup is effectively no backup at all. The existence of a backup system means nothing unless its functionality has been verified.

9. Toyota: The Cost of Spaghetti Code (2009–2014)

What Happened?

Toyota recalled tens of millions of vehicles following complaints of unintended acceleration. Legal investigations examined the engine control software.

Independent experts found that the code violated fundamental principles of engineering safety: over 10,000 global variables (values accessible from anywhere in the system) and a structure so tangled that tracing any individual fault was nearly impossible. Toyota settled the resulting litigation for approximately $1.2 billion.

Technical Term

Global Variable — A value in a software system that any part of the code can read or change. Like a shared whiteboard in an office where everyone can write and erase — eventually, you cannot tell who changed what or why, and contradictions become inevitable.
Spaghetti Code — Code so tangled and interwoven that it is impossible to tell where any individual thread begins or ends. The name comes from a plate of pasta — pull one strand and you inevitably drag everything else with it.
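The contrast is easy to show in miniature. The sketch below is purely illustrative (it is not Toyota's code, and the throttle logic is invented): with a global variable, the final value depends on which function happened to run last; with an encapsulated owner, every change goes through one traceable, bounded choke point.

```python
# An illustrative sketch (not Toyota's code) of why shared global state
# is hard to reason about, contrasted with encapsulated state.

throttle = 0.0  # global: any function below may change it, silently

def cruise_control():
    global throttle
    throttle += 0.2

def traction_control():
    global throttle
    throttle -= 0.5  # which change wins depends entirely on call order

class Throttle:
    """Encapsulated alternative: one owner, every change goes through here."""
    def __init__(self):
        self._value = 0.0

    def adjust(self, delta: float, source: str) -> float:
        # A single choke point where changes can be logged and bounded.
        self._value = max(0.0, min(1.0, self._value + delta))
        return self._value

t = Throttle()
print(t.adjust(+0.2, "cruise_control"))    # 0.2
print(t.adjust(-0.5, "traction_control"))  # 0.0 (clamped, and traceable)
```

With 10,000 globals, every function is effectively a `cruise_control` and a `traction_control` at once, which is why tracing any single fault became nearly impossible.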

Lesson Learned

Architectural simplicity is not a design preference — it is a safety requirement. In systems where human lives are at stake, "we will clean it up later" is not an acceptable position.

10. Mt. Gox: What Happens When Security Comes Last (2014)

What Happened?

In 2014, Mt. Gox — then the largest Bitcoin exchange in the world — announced that approximately 850,000 Bitcoin (worth approximately $460 million at the time) had gone missing.

Attackers had exploited a vulnerability in how the platform recorded transactions. As the platform had grown rapidly, its security architecture had never been revisited. Security had been treated as something to address after the product was finished, not as a foundational design decision.

Technical Term

Security by Design — Building security into a system from the very first line of code, rather than layering it on afterwards. Like designing a door to be lockable from the start, rather than installing a padlock on a door that was never built with one in mind.

Lesson Learned

Security added after the fact is always catching up. The longer it is deferred, the more it costs — and the more there is to protect.

What Did These Failures Have in Common?

Only two of the ten cases involved a purely technical defect — Ariane 5 and Mt. Gox. In the other eight, the root cause fell into one of the following categories:

  • Old, unmaintained code left active in production systems
  • Absent or incomplete communication between teams
  • Organisational pressure overriding technical caution
  • Backup and recovery systems that had never been tested
  • Updates released to everyone simultaneously without staged validation

Software Failure Categories
Core Error Points of Billion-Dollar Disasters

None of these failures were primarily about knowledge. They were about decisions — decisions made under pressure, with incomplete information, or simply deferred until the cost of deferral became unavoidable.

The most expensive software disasters in history usually begin with a ten-line decision. And those decisions are almost always buried inside what seems, at the time, like a reasonable course of action.

How is AI changing this picture? If you are curious about how the industry is shifting from "writing code" to "reviewing it", you can also read We Don't Write Code Anymore - We Review It.