How One Failing Service Can Crash Your Entire System And How to Prevent It

I am a developer with learnings in many different languages, frameworks and technologies.

In today's article, we will discuss why a failing service can crash your entire system and what we can do to prevent it. First, we need to define what a failing service is.

What is considered a failing service?

A service does not need to completely crash to be considered a failing service. In distributed systems, a service is considered unhealthy/failing when it:

  • responds very slowly

  • frequently times out

  • returns repeated errors

  • becomes overloaded under traffic

  • behaves inconsistently

In simple words, if a service cannot respond reliably within an acceptable time for its given business requirements, it is treated as a failing service.


Cascading failures and how failing services cause them

A cascading failure happens when a problem in one service starts spreading to other dependent services, eventually impacting the stability of the entire system.

In distributed systems, services usually depend on each other. So when one service becomes slow or unhealthy, the services waiting on it also start slowing down or become unhealthy. Over time, this creates a chain reaction where failure propagates across multiple parts of the system. Sometimes it can even crash the entire system.

Consider the following scenario: Your Order Service needs to call the Payment Service before confirming an order. Normally, the Payment Service responds within 200ms, so everything works smoothly. But suddenly, the Payment Service becomes very slow and starts taking 20–30 seconds to respond. Now what happens? The Order Service is still waiting for responses. Meanwhile, more incoming requests continue arriving. Gradually:

  • threads remain occupied

  • request queues start growing

  • database connections remain busy

  • memory usage increases
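A rough way to see why these resources fill up is Little's Law: the average number of requests in flight equals the arrival rate multiplied by the average time each request takes. The sketch below applies it to the scenario above. The 200ms latency and the roughly 25 second latency come from the scenario; the arrival rate of 100 requests per second and the pool of 200 worker threads are made-up numbers chosen only for illustration.

```python
def in_flight_requests(arrival_rate_per_sec: float, latency_sec: float) -> float:
    """Little's Law: average concurrency = arrival rate * average latency."""
    return arrival_rate_per_sec * latency_sec


ARRIVAL_RATE = 100    # hypothetical: requests per second reaching the Order Service
WORKER_THREADS = 200  # hypothetical: size of the Order Service thread pool

healthy = in_flight_requests(ARRIVAL_RATE, 0.2)    # Payment Service answers in 200ms
degraded = in_flight_requests(ARRIVAL_RATE, 25.0)  # Payment Service takes about 25s

print(f"healthy:  {healthy:.0f} requests in flight")
print(f"degraded: {degraded:.0f} requests in flight")
print(f"pool exhausted: {degraded > WORKER_THREADS}")
```

Under these assumed numbers, the same traffic that needed about 20 threads now needs about 2,500, which is far more than the pool can hold, so new requests start queuing.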

Soon, the Order Service itself starts becoming slow. Now imagine multiple services depending on the Order Service. Those services also begin waiting and slowing down. At this point, failure is no longer limited to just the Payment Service. The slowdown starts spreading across the system. This is a chain reaction.

When a failure or slowdown in one service gradually causes failures in other dependent services, it is called a cascading failure.


Preventing Cascading Failures:

Now the obvious question is: If one unhealthy service can slow down the entire system, how do we prevent that?

One possible approach is retries. But retries alone are not enough. If the service is already slow or failing and you keep retrying, even with exponential backoff, every retry is one more request that is likely to fail or time out anyway.

So retries seem like a good approach, but as long as the service stays unhealthy, requests keep piling up and put additional load on both the caller and the called service.
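To make the extra load concrete, here is a minimal sketch of retries with exponential backoff against a service that always times out. The helper `call_with_retries`, the limit of 4 attempts, and the 0.1 second base delay are all assumptions made for this example, not values from any particular library.

```python
import time
from typing import Callable, List


def call_with_retries(call: Callable[[], str], max_attempts: int = 4,
                      base_delay: float = 0.1,
                      sleep: Callable[[float], None] = time.sleep) -> str:
    """Retry `call` with exponential backoff; re-raise the last error if all attempts fail."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: give up
            sleep(base_delay * (2 ** attempt))  # wait 0.1s, 0.2s, 0.4s, ...
    raise AssertionError("unreachable")


attempts: List[int] = []  # one entry per request that reaches the unhealthy service
delays: List[float] = []  # backoff waits, recorded instead of actually sleeping


def failing_payment_call() -> str:
    attempts.append(1)
    raise TimeoutError("payment service timed out")


for _ in range(100):  # 100 incoming orders
    try:
        call_with_retries(failing_payment_call, sleep=delays.append)
    except TimeoutError:
        pass

print(len(attempts))  # 400
```

With these settings, 100 orders turn into 400 requests to a service that was already struggling, and each order holds its caller for the whole backoff sequence before finally failing.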

So what if we could completely avoid sending requests to an unhealthy service once we already know it is struggling?

The best way to do that is to have a mechanism that tells us to temporarily stop sending requests to a service when certain conditions are met.

Now here, we have two important questions:

  1. How do we know a service we are calling is unhealthy?

  2. What should those conditions be?

We can tell that the service we are calling is unhealthy from the following signals:

  • Timeouts: When a request does not receive a response within an acceptable amount of time.

  • Repeated Failures: When requests continuously fail during a particular time window.

  • High Latency: Requests are getting processed, but very slowly. What is considered slow depends on business requirements.

  • Consecutive Failures: Continuous 5xx server-error responses received one after another.
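As a rough illustration, the signals above can be reduced to a per-call classification. In this sketch the `Outcome` names, the 2 second timeout, and the 1 second "slow" limit are assumptions; as noted above, the real limits depend on business requirements.

```python
from enum import Enum
from typing import Optional


class Outcome(Enum):
    SUCCESS = "success"
    SLOW = "slow"        # answered, but slower than we accept
    ERROR = "error"      # 5xx server error
    TIMEOUT = "timeout"  # no answer within the allowed time


def classify(status_code: Optional[int], elapsed_sec: float,
             timeout_sec: float = 2.0, slow_sec: float = 1.0) -> Outcome:
    """Classify one call. `status_code` is None when no response arrived at all."""
    if status_code is None or elapsed_sec >= timeout_sec:
        return Outcome.TIMEOUT
    if 500 <= status_code <= 599:
        return Outcome.ERROR
    if elapsed_sec >= slow_sec:
        return Outcome.SLOW
    return Outcome.SUCCESS


print(classify(200, 0.2))   # Outcome.SUCCESS
print(classify(200, 1.5))   # Outcome.SLOW
print(classify(503, 0.3))   # Outcome.ERROR
print(classify(None, 2.0))  # Outcome.TIMEOUT
```

Repeated and consecutive failures are then a matter of counting these outcomes over time.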

And what should those conditions be? They are built on the four signals above:

  • Thresholds: Predefined limits which determine when the system should stop sending requests to a service.

  • Failure Percentage: How many requests failed out of the total requests sent during a particular time window.

  • Timeout Count: How many requests timed out during a particular time window.

  • Consecutive Failures: How many continuous failures occurred without a successful response in between.
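The conditions above can be sketched as a small tracker. To keep it short, this version assumes a count-based window (the last N calls) rather than a true time window, and the class name `CallWindow` and every threshold value are made up for illustration.

```python
from collections import deque


class CallWindow:
    """Tracks the outcomes of the most recent `size` calls (count-based window)."""

    def __init__(self, size: int = 10):
        self.outcomes = deque(maxlen=size)  # each entry: "ok", "error" or "timeout"
        self.consecutive_failures = 0

    def record(self, outcome: str) -> None:
        self.outcomes.append(outcome)
        if outcome == "ok":
            self.consecutive_failures = 0  # a success breaks the streak
        else:
            self.consecutive_failures += 1

    def failure_percentage(self) -> float:
        if not self.outcomes:
            return 0.0
        failed = sum(1 for o in self.outcomes if o != "ok")
        return 100.0 * failed / len(self.outcomes)

    def timeout_count(self) -> int:
        return sum(1 for o in self.outcomes if o == "timeout")

    def should_stop(self, max_failure_pct: float = 50.0,
                    max_timeouts: int = 3, max_consecutive: int = 5) -> bool:
        """True when any hypothetical threshold is reached."""
        return (self.failure_percentage() >= max_failure_pct
                or self.timeout_count() >= max_timeouts
                or self.consecutive_failures >= max_consecutive)


window = CallWindow(size=10)
for outcome in ["ok", "ok", "error", "timeout", "ok", "timeout", "error", "error"]:
    window.record(outcome)

print(window.failure_percentage())   # 62.5
print(window.timeout_count())        # 2
print(window.consecutive_failures)   # 3
print(window.should_stop())          # True
```

Here only the failure percentage crosses its limit, which is enough to decide that requests should stop.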

A mechanism that continuously monitors these conditions and temporarily stops requests when thresholds are exceeded is called a Circuit Breaker.


The Circuit Breaker:

The Circuit Breaker Pattern is a resilience mechanism used in distributed systems to prevent cascading failures.

Instead of continuously sending requests to a service that is already unhealthy, the circuit breaker temporarily blocks requests once certain failure conditions are met.

This helps the system:

  • fail fast

  • reduce unnecessary load

  • protect resources

  • prevent failures from spreading across other services

In simple words, a circuit breaker acts like a protection layer between services.

The Circuit Breaker has 3 States:

  1. Closed State: Everything normal, requests allowed.

  2. Open State: Threshold exceeded, stop requests immediately.

  3. Half-Open State: Allow limited test requests to check recovery.

Behind the scenes, you attach a circuit breaker to a service call, and it continuously monitors the requests and responses passing through it. Initially, it stays in the Closed State and lets requests through normally. If certain thresholds are exceeded, it moves to the Open State, where further requests to the unhealthy service are blocked to reduce unnecessary load and prevent cascading failures.

After a cooldown period, the circuit breaker moves to the Half-Open State, where it lets a few test requests through to check whether the unhealthy service has recovered. If those test requests succeed, it moves back to the Closed State; if they fail, it returns to the Open State.
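The state transitions can be sketched in a few lines. This is a deliberately simplified illustration, not a production implementation: it trips only on consecutive failures, allows a single test request in the Half-Open State, and is not thread-safe. The class name, threshold, and cooldown value are assumptions.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without contacting the service."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, cooldown_sec=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_sec = cooldown_sec
        self.clock = clock
        self.state = "closed"
        self.failures = 0      # consecutive failures seen so far
        self.opened_at = 0.0

    def call(self, fn):
        if self.state == "open":
            if self.clock() - self.opened_at < self.cooldown_sec:
                raise CircuitOpenError("circuit is open, failing fast")
            self.state = "half_open"  # cooldown elapsed: let one test request through
        try:
            result = fn()
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        self.failures = 0
        self.state = "closed"

    def _on_failure(self):
        self.failures += 1
        if self.state == "half_open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()


now = [0.0]  # fake clock so the example does not have to wait 30 seconds
breaker = CircuitBreaker(failure_threshold=3, cooldown_sec=30.0, clock=lambda: now[0])


def broken():
    raise TimeoutError("payment service timed out")


def healthy():
    return "paid"


for _ in range(3):
    try:
        breaker.call(broken)
    except TimeoutError:
        pass
print(breaker.state)  # open

try:
    breaker.call(healthy)
except CircuitOpenError:
    print("rejected without calling the service")

now[0] = 31.0                 # cooldown has passed
print(breaker.call(healthy))  # paid
print(breaker.state)          # closed
```

Note that while the circuit is open, the caller gets an immediate error instead of waiting on a slow service, which is what frees up its threads and connections.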

Another important point is that choosing the right thresholds is critical.

  • If thresholds are too sensitive, healthy services may get blocked unnecessarily

  • If thresholds are too relaxed, failures may continue spreading before the circuit breaker reacts

Because of this, threshold values usually depend on:

  • business requirements

  • traffic patterns

  • acceptable latency

  • system capacity


When Should We Use Circuit Breakers?

Circuit breakers are especially useful when:

  • services depend heavily on external APIs

  • requests can time out frequently

  • failures can spread across multiple services

  • systems are distributed or microservice-based

  • resource exhaustion is a concern

They are commonly used in:

  • payment systems

  • notification systems

  • cloud services

  • microservices architectures

  • third-party API integrations


Final Thought:

In distributed systems, failures are not exceptional events.

They are expected.

The real challenge is not:

“How do we completely prevent failures?”

The real challenge is:

“How do we stop one failure from bringing down the entire system?”

That is exactly what the Circuit Breaker Pattern is designed to do.

Note: In real-world systems, most companies use existing circuit breaker libraries/frameworks instead of implementing one from scratch. Building a production-ready circuit breaker involves handling things like concurrency, rolling metrics, state management, thresholds, recovery logic, and race conditions, which is beyond the scope of this article.