Home Posts Post Search Tag Search

Thinking Distributed Systems 08 - Replication
Published on: 2026-02-13 Tags: Blog, Side Project, Think Distributed Systems, Models, Systems, Asynchronous, synchronous, Messaging, Processing

8 Replication (106)

While thinking about the ACID that keeps distributed systems alive we are now going to talk about Durability. One way of making sure that a system is durable is to make it have replication. This can be data replication as well as logical replication (places to run the code). You might want to be sure that a component has a back-up to be sure that if one fails you have an other node ready to take on the task.

8.1 Redundancy (107)

Redundancy refers to the duplication and coordination of subsystems so that an increase in the duplication factor results in increased reliability and/or scalability. Let’s look at 2 types of redundancy.

¡ Static redundancy —The set of components and the set of interactions between the
  components do not change during the lifetime of a system. This type of redundancy
  is present predominantly in hardware systems.
¡ Dynamic redundancy —The set of components and the set of interactions between
  the components change during the lifetime of a system. This type of redundancy
  is present predominantly in software systems.

Keep in mind that although you can have redundancy in a system it is not a coherent systems without come way of composition. You need to be able to make the nodes work in tandem. One of the easiest ways of doing this is with a round-robin distribution.


The example here is a “Hello World” logic gate. Instead of having 1 logic gate we will implement 3 logic gates and 3 Majority Gates. The Majority Gate is there to pick the logic gates output that is most agreed upon. The express the idea that you will pick the set of output that at least 2 of the three agree on is written as:

    R = (a ∧ b) ∨ (b ∧ c) ∨ (a ∧ c)

8.2 Thinking about replication and consistency (109)

What is “one thing?” Think about a book that has multiple editions. If you were to have copies of more than one addition and someone asked; how many of that book do you have? Would each copy increase the number of books that you have?


Now think about a distributed system we could be talking about a database entry or we could be talking about an update to code. We need every node to be up-to-date.

8.3 Replication (110)

We need to have replication to avoid the reliability limits of a single object. Core to this principle is the idea of replication transparency, which refers to the system’s ability to hide the existence of multiple objects, thus the system can expect each object to be seen as one.

8.4 The mechanics of replication (111)

Logic gates can be thought of as state-less components as there output only relies on their inputs. But for this chapter we will focus on stateful components as that shows us some of the most important aspects of redundancy. Thinking about a state for a counter. We need to know if it is above 10, if we have increment and reset and we only check at certain points we might miss the increment above 10 if there was a reset.

8.4.1 System model (111)

We can now switch to a system model that is a “link between nodes” as this will help us understand the issues with replication. We need to make sure that we can have nodes communicate between each other but a network partition will form if there is: Crash-Stop failure, Omission failure, Crash-Recovery failure, or communication over an unreliable network.

8.4.2 Replication lag (112)

Replication lag refers to the inherent delay in propagating changes across replicas. The changes must be applied one after the other till all are set to the right state. There are also 2 types of replication lag: inherent replication lag is the nature of a distributed system, or imposed replication lag which is the result of network partitions and component failures.

8.4.3 Synchronous vs. asynchronous replication (113)

synchronous replication is when we make the changes to the system have been made one node at a time while wait for each response.


asynchronous replication is when we only wait for the initial node to send a response then the update is happening in the background. We might just have the first node be responsible for dealing with the responses.

8.4.4 State-based vs. log-based replication (114)

There are two ways of replicating change: In this example we are trying to update a counter from 0 -> 2 -> 5


state this will propagate the changes to every node so that we are all on the same place. If we are talking about a counter we will send out the state of 2 and then 5 to the other nodes.


operations this will propagate changes to every node but it will be the operations. If we are talking about the counter will send out the operations increment(2) and increment(3)

8.4.5 Single-leader, multileader, and leaderless systems (114)

Let’s talk about ways to orchestrate the system. Most strategies can be thought of by the number of leaders in the system.


single-leader systems, one dedicated leader node to coordinate operations.


multileader systems, more than one node to deal with operations, this is very helpful for dealing with scale but without some way of splitting the work there can be conflicts.


leaderless no leader and all nodes can accept operations. Conflicts arise from concurrent operations, and the state not being kept up-to-date.