Software resilience testing is more critical than ever

When internet services platform Cloudflare suffered an outage in November, it took a big chunk of the online world down with it.

Major platforms like ChatGPT, X, and Canva became unreachable. So did digital services offered by countless banks, retailers, and many other businesses. During the six-hour meltdown, as many as 2.4 billion users could have felt the impact.

Software outages like this have always been and always will be part of online life. But today our systems are more interconnected than ever, so a single failure can ripple outward. AI only amplifies that risk.

Yet too many companies still lack protection against such disasters. In an era when outages are inevitable, they’re effectively operating without a safety net.

The fundamental missing ingredient is something simple but easily overlooked: resilience testing.

In a nutshell, resilience testing is all about pressure-testing your software before issues happen. It aims to ensure that systems keep working, or quickly bounce back, when things go wrong.

Think of resilience testing as a small safety step to prevent big problems. The annual median cost of a high-impact IT outage is about $76 million. Businesses can also suffer reputational damage, lose customers, and get hit with regulatory penalties. Cloudflare is only one recent example. In the past year alone, AWS, Microsoft 365, and Starlink all went down, to name just a few.

So why aren’t more businesses stress-testing their software for inevitable failure? Here’s why, and what companies can do about it.

MOST COMPANIES DON’T BOTHER WITH RESILIENCE TESTING

As high as the stakes are, businesses have reasons to avoid software resilience testing. The process is technical, and it can get messy.

Modern resilience testing, also called chaos engineering, was put in the spotlight 15 years ago by Netflix software developers. Realizing that the only way to test for resilience is to simulate problems “in the wild” or in production, they created a suite of tools that replicated network crashes, cloud services meltdowns, and other real-world failures.

Netflix might have been able to roll with the punches, but few other companies have the expertise or the stomach to compromise their systems like this. It’s the equivalent of starting a controlled fire to ensure you have the resources to put it out.

Resilience testing requires the technical acumen to know which failures to simulate and which responses to take. Putting these drills into action also entails risk, like triggering your home’s fire sprinkler system, which could ruin the furniture. Most importantly, developers need to know what to do when tests reveal weaknesses.

Because the threshold for resilience testing is so high, it isn’t integrated into most companies’ software development processes. There’s rarely a dedicated team, and often no one except maybe the CTO is clearly in charge. As a result, resilience testing becomes a bottleneck, so companies don’t bother with it.

A BETTER WAY FORWARD: HELP FROM AI

The good news: It no longer has to be this way. For companies that want to adopt resilience testing, new platforms and tools—powered by AI—are making the process safer and easier.

Specialized resilience testing agents now enable companies to automate and optimize testing, without needing dedicated experts or teams.

First, the AI agent identifies likely edge cases—unusual or unexpected scenarios that could compromise reliability. It examines system behavior in production, how services interact, and where similar systems have previously failed.

For example, the agent might highlight a scenario where a service slows, rather than fails outright. Another edge case: A code deployment updates only half the company’s servers, leading to inconsistent user experiences.
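To make those two edge cases concrete, here is a minimal Python sketch of what such tests might check. Everything in it is hypothetical and for illustration only: the function names, the delay values, and the idea of comparing version strings per server are assumptions, not a description of any particular tool.

```python
import concurrent.futures
import time


def flaky_dependency(delay_seconds: float) -> str:
    """Hypothetical stand-in for a downstream service; the delay is the injected fault."""
    time.sleep(delay_seconds)
    return "fresh response"


def call_with_budget(delay_seconds: float, budget_seconds: float) -> str:
    """Call the dependency, but degrade gracefully once the time budget is spent."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(flaky_dependency, delay_seconds)
    try:
        return future.result(timeout=budget_seconds)
    except concurrent.futures.TimeoutError:
        # The service is slow rather than down, so return a degraded answer
        # instead of making the user wait.
        return "degraded response"
    finally:
        # Do not block on the slow worker thread; let it finish in the background.
        pool.shutdown(wait=False)


def has_mixed_versions(versions_by_server: dict) -> bool:
    """Return True when a rollout has left servers on different versions."""
    return len(set(versions_by_server.values())) > 1
```

A test built on this sketch would inject a delay longer than the budget and confirm the caller degrades instead of hanging, then flag any fleet where `has_mixed_versions` returns true after a deployment.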

The agent then generates and prioritizes the test cases most likely to reveal resilience issues, explaining why each one matters. It can also set up and run those tests.

After problems are identified, the AI agent suggests targeted fixes, making the software more resilient. With the heavy lifting completed, developers can review and apply those insights.

WHY RESILIENCE TESTING NEEDS TO SHIFT LEFT

Having the right tools is one thing, but effective resilience testing requires more than just software.

Creating a culture of resilience is part of the solution. Software teams need to include testing in their routine. Ultimately, the only way to strengthen yourself against failures is to practice for them. If you never run those drills, you never know how bad things can get until it’s too late.

Developers should also remember that resilience testing isn’t just about full-scale, five-alarm outages. It’s also about small, partial failures that create a poor user experience for customers, without necessarily taking the whole system down.

Let’s say a platform like Cloudflare has an issue affecting a major bank’s consumer app, leaving millions unable to check their balances. Resilience testing should anticipate this problem and confirm that a viable workaround is in place.
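One possible workaround for the bank scenario is to show the last known balance, clearly marked as stale, when the upstream platform is unreachable. The Python sketch below illustrates that idea under stated assumptions: `BalanceService`, `FakeUpstream`, and the `stale` flag are hypothetical names invented for this example, and a real banking app would have stricter rules about what it may display.

```python
import time
from typing import Callable, Dict, Tuple


class FakeUpstream:
    """Hypothetical upstream that a test can switch off to simulate an outage."""

    def __init__(self) -> None:
        self.down = False

    def fetch(self, account_id: str) -> float:
        if self.down:
            raise ConnectionError("upstream unreachable")
        return 125.50


class BalanceService:
    """Serves live balances, falling back to the last known value during an outage."""

    def __init__(self, fetch: Callable[[str], float]) -> None:
        self._fetch = fetch
        # account_id -> (amount, time the amount was fetched)
        self._cache: Dict[str, Tuple[float, float]] = {}

    def get_balance(self, account_id: str) -> dict:
        try:
            amount = self._fetch(account_id)
        except ConnectionError:
            if account_id in self._cache:
                cached_amount, _fetched_at = self._cache[account_id]
                return {"amount": cached_amount, "stale": True}
            raise  # Nothing cached, so there is no safe fallback to offer.
        self._cache[account_id] = (amount, time.time())
        return {"amount": amount, "stale": False}
```

A resilience test would fetch a balance once, set `down = True` on the fake upstream, and check that the second call still returns an amount with `stale` set to true.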

But the best way to encourage a culture of resilience is to “shift left”: moving resilience testing into the preproduction phase of software development, before code ever goes live.

Shifting left helps teams catch weaknesses long before customers feel them. That’s crucial with today’s complex, interconnected software systems, where seemingly minor issues can rapidly spiral into major outages. Rather than scramble to diagnose problems during live incidents, developers can uncover and fix them in a safe environment.
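In practice, shifting left can be as simple as adding fault-injection cases to the unit tests that already run before every release. The following Python sketch uses the standard `unittest` and `unittest.mock` modules to simulate a dependency timing out. The function `load_recommendations`, the `client.fetch` method, and the default item list are hypothetical, chosen only to show the pattern.

```python
import unittest
from unittest import mock

# Hypothetical safe default shown when personalization is unavailable.
DEFAULT_ITEMS = ["popular-1", "popular-2"]


def load_recommendations(client, user_id: str) -> list:
    """Return personalized items, or safe defaults if the dependency times out."""
    try:
        return client.fetch(user_id)
    except TimeoutError:
        return DEFAULT_ITEMS


class RecommendationResilienceTest(unittest.TestCase):
    def test_timeout_falls_back_to_defaults(self):
        client = mock.Mock()
        # Inject the fault: every call to the dependency times out.
        client.fetch.side_effect = TimeoutError("injected fault")
        self.assertEqual(load_recommendations(client, "user-42"), DEFAULT_ITEMS)

    def test_healthy_dependency_is_used(self):
        client = mock.Mock()
        client.fetch.return_value = ["personal-1"]
        self.assertEqual(load_recommendations(client, "user-42"), ["personal-1"])
```

Saved in a test module, these cases can run with `python -m unittest` alongside the rest of the suite, so a missing fallback shows up as a failed build rather than a live incident.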

Shifting left can save money and stress, too. Fixing resilience issues in production is costly and disruptive, often pulling team members away from other vital tasks. By taking a proactive approach, developers and business leaders can be more confident in the product they deliver to customers.

Ultimately, resilience testing isn’t rocket science. Companies that run fire drills for their software and embrace a culture of resilience testing will find themselves in a stronger position when the next disruption strikes. And in an increasingly interconnected world, where AI tools and features depend on more underlying services than ever, it’s safe to say that might be sooner rather than later.

Jyoti Bansal is CEO of Harness.

source https://www.fastcompany.com/91467320/software-resilience-testing-is-more-critical-than-ever