The Resilient Entrepreneur, Edition #113



Hi there

I hope you had a great week!

Here are the topics in today's edition:

  • 24 Hours to Fix a Disaster: A True SaaS Recovery Story
  • Stop Testing the Safe Parts of Your Code: A Real-Life Story

Please reach out with comments, questions, or suggestions for articles!

Talk soon,
Tom


TACTICS FOR RESILIENT ENTREPRENEURS

24 Hours to Fix a Disaster: A True SaaS Recovery Story

When a data migration went wrong and wiped production data on a Thursday, our team had one goal: Full SaaS recovery before the weekend.

At Yonder, the B2B SaaS company I co-founded, we ship software releases regularly, just like any software company. And like any serious B2B SaaS company, we run weekly backup and disaster-recovery tests.

A few weeks ago, we planned a release for Thursday at 4 pm. No big deal: a few bugfixes and some under-the-hood preparation for upcoming features. Testing took a little longer than usual because of a data migration required for those features. Nothing looked suspicious, and QA signed off on the release.

Just two hours after the release went out, it became clear that the data migration had deleted some production data through an erroneous operation that testing had not caught.

Then the whole disaster-recovery chain kicked in. Here is what happened, and what we learned from it.

H+2: Houston, We Have A Problem

Just two hours after the release, I received a chat message from our lead developer: we had a serious issue, and I was needed on a call urgently. I was in another meeting, but since I’m rarely pulled into a call on short notice, I left and joined.

On that call, it was still unclear what had caused the data loss. Nevertheless, we agreed to start immediate disaster recovery from the backups taken the previous night. We also prioritized the sequence in which customer tenants would be restored.

I left the call and went back to the meeting I was in. I didn’t pay as much attention as I should have because I was thinking about the root cause and possible fixes for the data loss.

H+4: Reducing the Problem

My evening meeting was over, and I was at the networking reception that followed. My phone vibrated: the first customer tenants had been restored from backup and were up and running again.

Even though I love networking, my thoughts were still with the data loss. Finding the root cause was one thing, but I was starting to worry that the team would pull an all-nighter restoring each customer tenant from backup and start making mistakes as they got tired. And since it was Thursday night and I didn’t know the root cause yet, I played out my own worst-case scenario: late on Friday afternoon, we would still not have fixed the problem, a tired team’s mistakes would have amplified it, and our customers would be left without a working system over the weekend.

So I focused my mind on reducing the problem rather than amplifying it. The data loss affected only one particular file type, and not all our customers use that file type, so our Chief Customer Officer and I could narrow the scope from all customer tenants to roughly half of them. I texted our lead developer that I had some good news in this bad situation, and that I would call him as soon as I was on the train home from the reception.

H+6: The Cherry-Pick Alternative

I try to avoid delicate business calls on a train, but this time, I made an exception. Time was more important than confidentiality, plus the train was empty anyway at 10 pm.

I called our lead developer and told him about the “good news” that only half of our customers were affected. And now it was his turn with good news: first, the root cause had been clearly identified. And second, instead of restoring customer tenants from backup, the team had already prepared a way to cherry-pick the deleted data from the backup without requiring a full restore, saving us a lot of time. The only thing still open was a test of this cherry-pick approach on a large customer tenant to estimate how long the fix would take.
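Our exact storage and tooling are beside the point here, but the cherry-pick idea generalizes well. A minimal sketch in Python, under the assumption that tenant files live at stable paths on disk (all names below are hypothetical, not our actual code): restore from the backup snapshot only what the migration deleted, and leave every surviving file untouched.

```python
from pathlib import Path
import shutil

def cherry_pick_restore(backup_dir: Path, live_dir: Path) -> list[str]:
    """Copy back only files that exist in the backup but are missing
    from the live store; everything that survived stays untouched."""
    restored = []
    for backup_file in sorted(backup_dir.rglob("*")):
        if not backup_file.is_file():
            continue
        relative = backup_file.relative_to(backup_dir)
        live_file = live_dir / relative
        if not live_file.exists():  # deleted by the faulty migration
            live_file.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(backup_file, live_file)
            restored.append(str(relative))
    return restored
```

Compared with a full restore, this touches only the deleted files, which is what made the cherry-pick approach so much faster than restoring whole tenants.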

I shared my worry that an all-nighter would lead to tired people making mistakes, so we agreed to call it a night, continue testing the cherry-pick solution the next morning, and make the decisions about next steps at 10 am.

H+14: An Early Start

When I start working around 6 am, I normally don’t see any “available” badges from our dev team in Microsoft Teams. On this Friday morning, however, I wasn’t the first one at work. The cherry-pick tests were already in full swing.

H+18: Decision Meeting

The whole dev team and I met for a decision meeting, where we discussed the status, risks, and timing of the cherry-pick solution. We all agreed that the solution was safe to proceed with, so I informed our Chief Customer Officer that we had a way forward. Within minutes, he provided the priority list for restoring service to our customers in the right order, and work started immediately once we left the call.

H+24: End of Crisis

Exactly 24 hours after the release that went wrong, full service was restored to all customer tenants. And just like in a good cybercrime movie or the news ticker of your favorite news outlet, everyone involved could follow the operation in real time in a Microsoft Teams chat.

Success Factors

For such an operation to succeed, many gears have to mesh together. Here are the key success factors we identified:

  1. The instant availability of the entire team to work late and start again early the next morning. This was not just the dev team; it was also the customer team communicating with customers, setting priorities, and checking customer data after restoration. Of course, we have an on-call team on stand-by for emergencies 24/7, but this operation needed far more resources than we keep on stand-by.
  2. From the start, the entire team was thinking in options rather than focusing on one single course of action. That proved decisive in speeding up the back-to-normal timeline.
  3. Whilst I was busy coordinating the technical activities with the dev team, our Chief Customer Officer communicated proactively with our customers to keep them up to date about system limitations, restoration options, and timelines. Despite the inconvenience caused, our customers appreciated the proactive communication and the frequent status updates. And because the status updates were kept away from the dev team, they could focus on the technical solution instead of being disturbed by constant requests for status updates.
  4. Last but not least, despite all the hectic activity, we managed to reduce the problem instead of amplifying it. Looking back at how frantic things were even after we had reduced the problem, I don’t want to imagine the havoc amplifying it would have caused.

Conclusion

Bad things do happen, no matter how skilled your team is, or how seriously you take QA. But when the shit hits the fan, you can only succeed when everybody contributes.

Once the dust had settled, we debriefed on what had happened and why, and we identified some important lessons on how to further improve our QA.

Ironically, this all happened when I was at an event on aircraft accident investigation. The parallels between aircraft accident investigation and software accident investigation are striking — but that’s a story for another day (see below).


STRATEGIES FOR RESILIENT ENTREPRENEURS

Stop Testing the Safe Parts of Your Code: A Real-Life Story

End-to-end tests miss the fatal errors. You need the annoying testers who ask the dumb questions to find the real risks in your code.

At Yonder, the B2B SaaS company I co-founded, we recently suffered partial data loss after a release that went wrong due to a faulty data migration for a new feature.

If you’re interested in how we managed disaster recovery as a team, please head over to the article above.

If you’re interested in the QA side of this incident, read on.

Incidents, Serious Incidents, and Accidents

Ironically, I was attending an event on aircraft accident investigation when disaster struck. And just like with aircraft accidents, disaster struck because one little but important thing was overlooked.

In aviation, there is empirical evidence that 100 incidents lead to 1 serious incident, and 100 serious incidents lead to 1 accident.

There are therefore two ways aircraft accident investigation can work: reactively, by analyzing the underlying incidents and serious incidents once an accident has occurred; or proactively, by collecting information on as many incidents and serious incidents as possible to prevent future accidents.

Bombers in World War II

In World War II, of every 100 Allied airmen serving on bombers, 45 were killed, 6 were seriously wounded, 8 became prisoners of war, and only 41 escaped physically unharmed. Of those who were flying at the beginning of the war, only 10 percent survived.

Despite those dire statistics, Allied command insisted that bombing was critical to winning the war. They wanted to add armor to their bombers to reduce losses. But because you can’t armor a bomber like a tank without making it too heavy to fly, they had to find out where additional armor would reduce losses the most.

Look at the header image of this article for a moment.

As the planes returned from their missions, analysts counted the bullet holes on various parts of each aircraft. The returning planes showed similar concentrations of damage in three areas: the fuselage, the outer wings, and the tail.

The obvious but wrong answer was to add armor to these heavily damaged areas.

Why was this obvious answer wrong? Because they had only looked at airplanes that had returned. Armor was needed on the sections that, on average, had few bullet holes, such as the cockpit or the engines. Planes with bullet holes in those parts never made it back.

And Now, How Does This Relate to Software QA?

The same survivorship bias that fooled the Allied bomber command applies to software QA.

In QA, looking at the bullet holes in the fuselage, the outer wings, and the tail is what happens when you run your automated end-to-end tests. You bullet-proof your software for clearly defined routines that you expect your users to perform regularly. And every time one of those routines fails, you ship a bugfix in the next release.

There is nothing wrong with that. At Yonder, our bug rate has dropped dramatically since we introduced systematic end-to-end tests.

However, what happened during our recent data-loss episode was not covered by end-to-end tests. For a new feature, we needed a data migration that modified the metadata of every PDF file in our system. The migration modified the metadata correctly, but it corrupted the actual PDF files in the process.
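In hindsight, a cheap structural guard inside the migration run itself would have surfaced the corruption immediately. A sketch, assuming files on disk (the function names are illustrative, not our real migration code): after migrating each file, check that it still looks like a PDF, and abort on the first failure instead of ploughing on.

```python
from pathlib import Path

def looks_like_pdf(path: Path) -> bool:
    """Cheap structural check: a well-formed PDF starts with a %PDF-
    header and carries an %%EOF marker near the end of the file."""
    data = path.read_bytes()
    return data.startswith(b"%PDF-") and b"%%EOF" in data[-1024:]

def migrate_all(files, migrate_one) -> int:
    """Apply migrate_one to each file, aborting at the first file
    that no longer looks like a valid PDF afterwards."""
    count = 0
    for path in files:
        migrate_one(path)
        if not looks_like_pdf(path):
            raise RuntimeError(f"migration corrupted {path}")
        count += 1
    return count
```

The check is deliberately shallow; it will not catch every form of corruption, but it turns silent damage into a loud failure at the exact file where it starts.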

Now you might ask, how could something as serious as that go unnoticed?

Well, it didn’t go unnoticed. When the data migration was tested, our team noticed that some PDF files were corrupted after running it. But they incorrectly attributed the corruption to the test files used, rather than to the migration itself.
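One cheap habit would have prevented that wrong attribution: fingerprint the test fixtures before the run, so the question "were the inputs already corrupt?" has a provable answer. A sketch (names are illustrative, not our actual test harness):

```python
import hashlib
from pathlib import Path

def fingerprint(files) -> dict[str, str]:
    """Map each file to the SHA-256 of its bytes. Taken before a
    migration test, this record proves later whether an input was
    already corrupt or was damaged by the run."""
    return {str(p): hashlib.sha256(p.read_bytes()).hexdigest()
            for p in files}
```

Take the snapshot before the migration test; if a file looks corrupt afterwards, re-hash the pristine fixture copies. Unchanged hashes put the blame squarely on the migration, not on the test files.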

Conclusion

Software QA is not just about looking at the obvious cases. Sure, you need to look at the obvious cases, and that’s why automated end-to-end tests make sense.

But you also need the inconvenient people on your team who treat your software like the dumbest user on the planet and ask tedious questions about apparently irrelevant things. That’s inconvenient, but it helps you avoid your next data-loss disaster.


About Me

I’m a tech entrepreneur, active reserve officer, and father of three — writing about entrepreneurship, leadership, and crisis management from hard-won experience. No AI, no fluff, no promos. Just plain-text insights for people building and leading under pressure.

When I’m not solving problems, I find clarity in the mountains around Zermatt.

If this was useful, here’s how to get more:

📌 All my articles, no paywall — read everything in one place. Visit the blog.

📌 Buy me a coffee — it keeps the writing going. Thank you.

