1 Thinking in distributed systems: Models, mindsets, and mechanics (1)

A distributed system is a collection of collaborating, concurrent components that communicate with one another by sending and receiving messages over a network:
• The behavior of a distributed system emerges from the behavior of its components and their interactions.
• The complexity of a distributed systems emerges from the complexity of its components and the intricacy of their interactions

1.1 Software engineering and mental models (2)

There might not be a ready set of ways to talk about the coding world. Even diagrams seem to be built as we need them and not a standard way to show what we mean. Every time that you look at a system you might find that the previous thought model might not hold true anymore. This is where the book will come in, it should help you develop ways of looking and showing models.

1.1.1 Mental models: The foundation of reasoning (3)

Mental models will be very important as we progress though the book. So let’s take a minute to discuss what a mental modal even is.
• A model is a representation of a target system.
• A mental model is the internal representation of the target system, and the basis of comprehension and communication.

Looking at the image on the page the model is the way in which an engineer might show how it works, but the mental model is on the right and shows the way in which you might communicate that model.

A tangible way to think of mental models and systems is a set of facts:
• System - A system can be understood as a set of facts that constitutes the ground truth.
• Mental Model - a mental model can be understood as set of facts that constitutes our perceived truth.

A good mental model is a mental model that is correct and complete:
• Correctness - In a correct mental model, every fact of the model is a fact of the system.
• Completeness - In a complete mental model, every relevant fact of the system id a fact of the model.

1.1.2 Correct mental models (4)

Looking back at the image 1.1 we would need to have everything that is in the model be correct an incorrect ohms would not be a correct model.

1.1.3 Complete mental models (4)

There might be much more that could be put within the model, such as length of wire or location, but for completeness we only require the relevant data.

1.2 Mental model of software systems (4)

Software will behaves as if it were physical constructs but there will never be made of anything physical, that is fine. Different models are valid and we will need to use many different types to full understand the model of what we are trying to convey.

1.3 Different types of models (5)

As just stated there are many ways to describe a system. Try not to get upset that the way in which something is described is not always consistent. IT might be trying to show a different element of the system.

1.3.1 Different models describing the same aspects (5)

A distributed system is a set of concurrent components that communicate with one another by sending and receiving message over a network. In this way we can thing of it at least 2 ways:
• the network as only a bus, with components having a buffer to track in-flight messages. (figure 1.4)
• the network as a communication bus with a buffer to track in-flight messages. (figure 1.3)

1.3.2 Different models describing different aspects of a system (6)

You can see from both models they are describing mostly the same thing, both might be valid but are not describing the same thing.

When you try to read a Blog or a post try to understand what they are trying to impart and what they are leaving out. This is very relevant within the computing world as you can always go deeper all the way down the the 1’s and 0’s. This will get you lost in the weeds.

1.4 Thinking about distributed systems (8)

When talking about a distributed system we need to understand a few basics:
• Each component has exclusive access to its own local state, which other components cannot access.
• The Network has access to it’s own local state including messages that are in flight.

With this said it might be a good way to think of this as as state machine, each state will have the system transition from onw state to the next. With the system taking discrete steps, taken by the component of the network:
• External steps - Receiving messages or sending messages
• Internal steps - Performing local computations or accessing the local state.

We will then think about the system as exactly one component or the network will complete exactly one step.

components are often referred to as actors, agents, processes, or nodes. steps are referred to as action or events.

1.4.1 Correctness (9)

correctness can be defined in terms or safety and liveness:
• safety - Something bad will never happen.
• liveness - Something good will eventually happen.

Think of it this way if you are taking care of every possible state and it eventually lead to the right outcome it is correct. No two participants will arrive at conflicting results.

Any given participant will be able to update to three possible states: working, committed, aborted

So we can say that if it is safe P1 nor P2 will commit while the other aborts. while the liveliness will ensure that eventually P1 and P2 will commit or abort. Together we can say that we will always have them reach the same decision, commit or abort.

1.4.2 Scalability and reliability (12)

It’s hard to always think about safety and liveness in terms of Scalability and Reliability. It might be best to think about these in terms of responsiveness:
• Scalability - responsiveness in the presence of load.
• Reliability - responsiveness in the presence of failure.

1.4.3 Responsiveness (12)

You can describe Responsiveness by four related concepts:
• Service-Level indicator - Quantities observation about the behavior of a system.
• Service-Level objective - A true or false statement about a service-level indicator that determines whether it meets an objective.
• Error rate - observations that do not meet the objective over the total of all observations (for a time period).
• Error budget - Acceptable ratio (to us)

Therefore the responsiveness of a system is teh ability to keep the error rate below a defined value. Also keep in mind that correctness will be an emergent property they will not trace back to a single component, you must look at the whole to see. (Feels very similar to the idea about what you are trying to covey vs trying to describe the entire system.)

1.5 Two big ideas (13)

Global vs Local viewpoints. (Services and Microservices)

1.5.1 Systems of systems (13)

Components can be thought of as atomic - indivisible entities. However you might find that components are subsystems. A holon is an entity that is both whole in itself but as part of a larger whole. So a holon can be atomic or a collection of holons (holarchy).

Thinking about a multitenant (single instance that serves many tenants) we can view it in many ways:
• One atomic entity (database cluster)
• higher-order entity with 3 interacting database nodes.
• two atomic entities: 2 databases one per tenant
• each database as a higher-order entity composed of interacting database nodes.

Each version here is describing the same thing but the viewpoint might help us look deeper into one aspect of the system.

1.5.2 Global view vs. local view (15)

If we were to look at the entire system with an all knowing eye we would be withing the global view. However a component does not have that luxury as it can only know it’s own state (local view).

Also keep in mind that even if you wanted to have a more wholistic view of the system within a local view you would need to communicate with other components and that would require messaging. That must be done withing the system so that would again require a global view.

1.6 Distributed Systems Incorporated (16)

One might want to look at a distributed system as an office building that has defined offices with pneumatic tubes for messaging. Although you might want to message a specific office you still need to go through the main bus to route your message. Here is a way to model it in your head:
• components as rooms - Each room is a concurrent component.
• network as pneumatic tubes - Ways in which we will be able to send a messages (internal)
• external interface as a mailbox - This is the message hub that will be able to route all the messages (external).

With this being a more general version of the office model we might need to add in more components:
crash semantics - Breaks or vacations will result in a different model.
• short absences - Short breaks are a transient failure.
• extended absences - Vacations are a intermittent failure.
• permanent departures - Leaving are a permanent failure.

Message-delivery semantics - Lost, duplicate, or rearranged messages.
• lost messages - Not delivered.
• duplicate messages - Delivered more than once.
• reordered messages - Delivered out of order.

With these out of the way you can start to talk about how you would deal with a situation that might come up. In an office how would you deal with knowing that a message was not delivered? What about how to deal with a message being sent twice?

1.7 Navigating complexity (18)

There is so much that might be firing in your head right now about how to deal with situations and how you might model them.

1.7.1 Simple yet complex (18)

Although you might be able to describe what a single node is doing at any given time you might not be able to see the entire system with such clarity.

1.7.2 Emergent behavior (19)

emergent behavior is something that the entire system will need to keep track of.

1.7.3 Changing perspective (19)

Do we have a black-box (just an outward view of the system) or a white-box (inward view of the system). You don’t always want to model both at the same time. Also keep in mind that you can have a white-box view of one of the nodes within a white-box. Or a black-box view of a node within a given white-box component.

1.7.4 Think globally; act locally (20)

We need to give each component the authority to act as a lead while also having the system be the orchestrator of the entire system.

1.8 Thinking above the code (21)

There will be few an fare between code examples for this book. We could talk about race conditions as they are frequently talked about within distributed systems. But this will make it feel that is is a problem for a single entry or node. We might be able to do better for the scope of this book and topic

A process is a sequence of atomic actions. Then process P is a sequence of atomic actions a, b, c and process Q is a sequence of atomic x, y, z. So if we wanted to talk about the events that could take place for these 2 processes we might have something like

        a b c x y z  
        a x b c y z  
  
(P | Q) ...  
  
        x y a z b c  
        x y z a b c

This can then be used to not only think about this example but any combination of events throughout the system that might not always lead to the same results.

So the concurrent composition of P and Q (P|Q) is correct if for every possible interleaving the result is equivalent to one of the following:
• (PQ) - Sequential composition of P and Q (P before Q)
• (QP) - Sequential compositions of Q and P (Q before P)