
Thinking Distributed Systems 03 - Fault Tolerance (40)
Published on: 2026-01-27 Tags: elixir, Side Project, Testing, Think Distributed Systems

3 Failure tolerance (40)

We will take some time now to talk about the idea of fault tolerance, which refers to the guarantee that a distributed system functions in a well-defined manner even when a failure occurs.


3.1 In theory (41)

A failure is an unwanted but possible state transition of a system. On a failure, the system transitions from a good state to a bad state. Failure tolerance is the ability of a system to behave in a well-defined manner while it is in a bad state.


A formal approach to defining a failure is based on the three states that a system can be in:

_legal state_ - good state.
_illegal state_ - bad state.
_intolerable state_ - everything is lost.

We can also look at the two types of transitions that matter in this case: normal and failure.


Just to add some more vocab here, we can talk about all the state-to-state transitions:

Legal state (normal transition) => Legal state
Legal state (failure transition) => Illegal state
Illegal state (_repeated_ normal transition) => Legal state
Illegal state (failure transition) => Illegal state

NOTE Very funny note: we don’t include intolerable states as we don’t tolerate them…


3.2 Types of failure tolerance (42)

Types of failure tolerance:

          Safe       Not safe
Live      Masking    Nonmasking
Not live  Fail-safe  :(

3.2.1 Masking failure tolerance (42)

Masking failure tolerance guarantees both safety and liveness, making it the most desirable form of failure tolerance. However, it can be very costly.


3.2.2 Nonmasking failure tolerance (42)

If a system guarantees liveness but not safety, it provides nonmasking failure tolerance. The idea is that it will make progress but may make mistakes: messages will still be sent, but they might arrive out of order.


3.2.3 Fail-safe failure tolerance (43)

If a system guarantees safety but not liveness, we have a fail-safe failure-tolerant system. It will not make mistakes, but it will not make any progress either: think of it as a system that stops sending messages rather than risk sending a wrong one.


3.2.4 None of the above (43)

If a system can guarantee neither safety nor liveness, it is not fault tolerant at all.


Let’s take a look at some code and use some vocab to explain the different transitions:

process P
  var x ∈ {0, 1, 2, 3} init 1
  transitions
    // normal transition
    ❶ x = 1 → x := 2
    ❷ x = 2 → x := 1
    // normal transition (repair)
    ❸ x = 0 → x := 1
    // failure transition
    ❹ x ≠ 0 → x := 0
  end
end

As the creators of the system, we specify its correctness:

safety - x = 1 or x = 2
liveness - there will be a point where x is 1 or 2


The system is safe and live as long as x = 1 or x = 2 (masking). At x = 0 it is live but not safe: safety is violated, but the repair transition ❸ eventually restores a legal state (nonmasking). Note that x = 3 is intolerable.
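Since this blog lives in Elixir land, here is a minimal sketch of process P as an Elixir module (the module and function names are mine, not from the book):

defmodule ProcessP do
  # Toy model of process P: x ∈ {0, 1, 2, 3}, initial state 1.

  # Normal transitions ❶ and ❷: toggle between the legal states 1 and 2.
  def step(1), do: 2
  def step(2), do: 1
  # Normal transition ❸ (repair): take the illegal state 0 back to the legal state 1.
  def step(0), do: 1

  # Failure transition ❹: from any non-zero state, crash into the illegal state 0.
  def fail(x) when x != 0, do: 0

  # Safety predicate from our correctness specification: x must be 1 or 2.
  def safe?(x), do: x in [1, 2]
end

A failure followed by a normal (repair) transition restores legality, which is exactly the nonmasking behaviour described above:

1 |> ProcessP.fail() |> ProcessP.safe?()                      # => false
1 |> ProcessP.fail() |> ProcessP.step() |> ProcessP.safe?()   # => true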


3.3 In practice (44)

Now that we have the foundations, we can outline failure-handling strategies.


3.3.1 System model (44)

When a process is started, two scenarios are possible:


Total application - with no failures, the process executes every step and transition. Correct -> correct.


Partial application - with failures, the process can only execute some of its steps. Correct -> incorrect.


So we can say that a process is a sequence of steps in which partial execution is undesirable. In the event of a failure, we want the process to end up in one of two ways:


correct and desirable - total application (forward recovery)


correct but less desirable - no application (backward recovery)


NOTE The classic credit card example: the card must be charged only once the full event has taken place. Depending on which step of the credit_card_charge application failed, we will need to either go back or retry; see the sketch below.
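As a sketch of the two recovery styles in Elixir (the payment functions are stubs standing in for a real gateway, not a real API):

defmodule CreditCardCharge do
  # Hypothetical two-step charge flow: place a hold, then capture it.
  def place_hold(_order), do: {:ok, :hold_ref}
  def capture(:hold_ref), do: {:ok, :receipt}
  def release_hold(:hold_ref), do: :ok

  def run(order) do
    {:ok, hold} = place_hold(order)

    case capture(hold) do
      {:ok, receipt} ->
        {:ok, receipt}

      # Backward recovery: undo the partial execution by releasing the hold.
      {:error, :insufficient_funds} = err ->
        release_hold(hold)
        err

      # Forward recovery: a transient platform failure, push on by retrying.
      {:error, :timeout} ->
        capture(hold)
    end
  end
end

The stubs here always succeed, so only the happy path fires; the point is the shape of the two recovery branches.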


3.3.2 Failure handling (46)

Failure handling has two main steps: failure detection and failure mitigation.


3.3.3 Failure classification (47)

We first need to classify the failure. The book uses two orthogonal dimensions: where the failure occurs (spatial) and when it occurs (temporal).


Spatial Dimension

This is the idea of where a failure occurs. To think about this, we need a model of the system that is split into layers.


In general, a higher-level component will make calls to a lower level, but only very rarely will a lower level call a higher-level component. The top layer is the application layer, and the levels below it form the platform layer. There might even be a third layer: infrastructure.


Application-Level and Platform-Level Failures

Always classify a failure by the lowest level at which the system could deal with it. Something like InsufficientFunds would be an application-level failure, but CouldNotConnect would be a platform-level failure.
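A tiny sketch of this classification rule, with hypothetical error atoms:

defmodule FailureClass do
  # Only the application knows what to do about missing funds.
  def classify(:insufficient_funds), do: :application_level
  # A connection failure can be handled below the application, e.g. by a retrying client.
  def classify(:could_not_connect), do: :platform_level
end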


Temporal Dimension


The temporal dimension is about when a failure occurs and whether it will recur. Failures can be transient, intermittent, or permanent. One way to think about it is whether an automatic fix will deal with the issue: an automatic retry can help with transient or intermittent failures, but a permanent failure needs a manual repair. The categories can also be thought of in terms of the probability of the failure happening again, summarized after the three definitions below.


Transient Failures (Failures that Come and Go)


The probability of the failure happening again is the same as the probability of it happening in the first place.
These failures are autorepairing.

Intermittent Failures (Failures that Linger)


The probability that a failure will occur again after the first is larger than the probability of the single event.
These failures are autorepairing.

Permanent Failures (Failures that are Here to Stay)

The probability that a second failure will occur is 100%.
These failures are not autorepairing.
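To make the probability framing concrete (notation mine, not the book's), write p for the base probability of the failure and condition on it having just happened:

P(fail again | just failed) = p    (transient)
P(fail again | just failed) > p    (intermittent)
P(fail again | just failed) = 1    (permanent)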

3.3.4 Failure detection (50)

We need a witness to even know that a failure has occurred. Depending on the level of the failure, you could simply look at the return value of the component that produced the error, or, one level up, the exit code of the application. But all of this would be manual, so a listener helps automate it. All failure detectors work by exchanging messages at set intervals between the observer and the observed component:


pull - observer sends a request to the observed and expects a response.


push - observed sends a heartbeat to the observer.


For these detectors a timeout is considered a failure, but that might not always be the case. Consider these scenarios:


synchronous - you can always assume that a timeout is in fact a failure.


asynchronous - the failure might just be a delayed message from the other component.


We say that failure detection is complete if it identifies every actual failure, and accurate if every identified failure is an actual failure.
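Here is a minimal pull-style detector sketch in Elixir using plain processes (the names and the timeout are mine). Note how it is complete but not accurate in an asynchronous setting: a reply that is merely slow gets reported as a failure.

defmodule PullDetector do
  @timeout_ms 500

  # The observed component: replies to pings for as long as it is alive.
  def observed do
    receive do
      {:ping, from, ref} ->
        send(from, {:pong, ref})
        observed()
    end
  end

  # The observer: one round of pull-based detection.
  def check(pid) do
    # Tag the ping with a unique ref so a stale reply from an earlier
    # round cannot be mistaken for this round's answer.
    ref = make_ref()
    send(pid, {:ping, self(), ref})

    receive do
      {:pong, ^ref} -> :alive
    after
      @timeout_ms -> :suspected_failure
    end
  end
end

pid = spawn(&PullDetector.observed/0)
PullDetector.check(pid)      # => :alive
Process.exit(pid, :kill)
PullDetector.check(pid)      # => :suspected_failure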


3.3.5 Failure mitigation (52)

Now we get to the second component of failure handling: failure mitigation.


Backward Recovery


This takes an illegal state and transitions the system back to a previous legal (initial) state. This is a common application-level mitigation strategy.


Forward Recovery


We push past the current illegal state to a legal final state. This involves some amount of repair: if we just try to push past, we only compound the issues. This is a common platform-level mitigation strategy.


3.3.6 Putting everything together (52)

Let’s put it all together.


Application-Level Failure Mitigation


This is where we try to go backward to fix the problem. Thinking about InsufficientFunds, we would like to go back to before we tried to charge the card, so we can move money around or stop the process from happening.


Platform-Level Failure Mitigation


This is the layer that might have several strategies for dealing with issues. Keep in mind that you might be facing timeouts, network issues, a full queue, etc. Let's look at a few of the strategies; a retry sketch follows the list.


With a platform-level, transient, autorepairing failure: issue a retry.


With a platform-level, intermittent, autorepairing failure: schedule multiple retries.


With a platform-level, permanent, manual-repair failure: suspend, repair, then resume and retry.
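A sketch of what those retry strategies might look like in Elixir, assuming the operation returns {:ok, result} or {:error, reason} (the helper and its defaults are mine):

defmodule Retry do
  # Retry with exponential backoff. A transient failure usually clears on
  # the first retry; an intermittent one may need several; a permanent one
  # exhausts the attempts and is surfaced for manual repair.
  def with_backoff(fun, attempts \\ 3, delay_ms \\ 100) do
    case fun.() do
      {:ok, _} = ok ->
        ok

      {:error, _} = err when attempts <= 1 ->
        # Out of attempts: treat as permanent and hand off to a human.
        err

      {:error, _} ->
        Process.sleep(delay_ms)
        with_backoff(fun, attempts - 1, delay_ms * 2)
    end
  end
end

Retry.with_backoff(fn -> {:error, :timeout} end)   # => {:error, :timeout} after 3 attempts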


Be Vigilant


As with everything we have covered so far, nothing will be simple to diagnose, so stay ready to dig through logs and add more robust fault tolerance where it is missing.