Home Posts Post Search Tag Search

Thinking Distributed Systems 06 - Distributed Transastions
Published on: 2026-02-10 Tags: Blog, Side Project, Think Distributed Systems, Asynchronous, synchronous, Messaging, Processing

6 Distributed transactions (79)

Distributed transactions span changes across multiple systems. The participants are often referred to as resource managers (RMs). The term database systems can be used interchangeably with resources managers, so I might use one or the other going forward.

6.1 Atomic commitment: From a single RM to multiple RMs (79)

Since we want to start to think in the terms of a distributed systems we will start to have entries belong to different RMs within the entire system the first code we saw, the one were we debit one account and then increase an other.

BEGIN
  UPDATE accounts SET balance = balance - $1 WHERE id = $2;
  UPDATE accounts SET balance = balance - $1 WHERE id = $3;
COMMIT

This could happen one two RMs so let’s look at how that might work

RM1:
BEGIN
  UPDATE accounts SET balance = balance - $1 WHERE id = $2;
COMMIT

RM2:
BEGIN
  UPDATE accounts SET balance = balance - $1 WHERE id = $3;
COMMIT

We need to ensure that both fail or abort or both commit. It needs to be atomic (commit at most once, or abort and nothing happens).

6.1.1 Transaction on a single RM (81)

We can assume that on a single RM transactions will be atomic, (this doesn’t feel like it can always be the case but we can assume) as it will always abort of commit.

6.1.2 Transaction on multiple RMs (81)

We can start to have some faith in the system if we can have some safety and liveness guarantees:

¡ The _safety_ guarantee asserts that no two participants in a transaction arrive at a
conflicting decision:
S: ∄ rm₁, rm₂: state[rm₁] = commit & state[rm₂] = abort

¡ The _liveness_ guarantee asserts that every participant will eventually arrive at a
decision:
L: ∀ rm: <>[] state[rm] = commit | <>[] state[rm] = abort

6.1.3 Blocking and nonblocking (82)

There are two types of atomic commit protocols. Blocking protocols guarantee safety but not liveness in the presence of participant failure. Non-blocking protocols guarantee both safety and liveness in the presence of a single participant failure.

6.2 The essence of distributed transactions (82)

Within the entire systems there are global transactions (distributed transactions), that consist of two or more local transactions. Looking at just the local transactions we can think of them in fours states as they are working:

¡ _Working_ —The transaction is executing operations. From here, it may decide to
transition to aborted 
¡ _Prepared_ —The transaction is not executing operations but has not transitioned
to committed or aborted yet. From here, the transaction may transition to committed 
¡ _Committed or aborted_ —The transaction has reached its final, irreversible state by
being committed or aborted.

6.3 Two-Phase Commit protocol (83)

Okay so let’s talk about ways that we can work withing these guidelines to get to a system that will ensure some safety and liveliness for distributed RMs. THe Two-Phase Commit (2PC) is the protocol that we will go over.

6.3.1 In the absence of failure (83)

The 2PC uses exactly one Transaction coordinator (TC) and 2 or more RMs, and then executes in two phases: Prepare and Commit.


The client initiates a transaction, but instead of asking the RMs to do the work it asked the TC to commit or abort the transactions.


This uses a voting system to commit or abort the process. The TC will need to get all the votes for commit (to commit) or a single abort (to abort). The entire process will be done in 2 phases that I will go into next.

Phase 1: Prepare
1 The TC persistently records a ⟨Prepare, Timeout⟩ entry to its log and sends a
  ⟨Prepare⟩ request to all participating resource managers.
2 If RMi decides to commit, it persistently records a ⟨VotetoCommit ⟩ entry to its
  log and sends ⟨VotetoCommit ⟩ to the TC.
  or
  If RMi decides to abort, it persistently records an ⟨Abort⟩ entry to its log, sends
  ⟨Abort⟩ to the TC, and aborts the transaction t1.

phase 2: commit
1 If the TC receives a ⟨VotetoCommit ⟩ response from all managers RMis, it 
  persistently records a ⟨Commit⟩ entry to its log and sends a ⟨Commit⟩ request to all
  participating RMs.
2 If the TC receives an ⟨Abort⟩ response from at least one participating RMi or a
  timeout, it persistently records an ⟨Abort⟩ entry to its log and sends an ⟨Abort⟩
  request to all participating RMs.
3 Each participating RM receives the ⟨Commit⟩ or ⟨Abort⟩ request, persistently
  records a ⟨Commit⟩ or ⟨Abort⟩ entry to its log, and commits or aborts its local
  transaction ti.

This should guarantee safety and liveliness.

6.3.2 In the presence of failure (85)

Here is where the rubber hits the road. Let’s talk about the ways in which we can have some failures.


RM Failure


When a RM fails to record a <vote-to-commit> or an <abort>, upon recovery it will record an <abort> and then send an <abort> to the TC.


When a RM fails after recording a <Vote-to-commit>, upon recovery it will need to ask the TC whether to commit or abort.


When a RM fails after recording a <commit> to its log, on recovery it must perform a REDO.


When a RM fails after recording <abort> to its log, on recovery it must perform an UNDO.


TC Failure


There are many cases in which this 2PC system will guarantee safety but not liveliness. In any of the cases that we send out all but 1 of the the <commit> all the RMs will try and commit but there will be nodes that will not be in a commit. The TC upon recovery will need to send out messages to <abort> to the other nodes, or try and ask for an update on the other nodes. The book does not go over this case and how it is fixed.

6.3.3 Improvement (86)

An other improvement might be to allow the nodes to speak to other RMs, (Which I think breaks the entire system). They say that this will help it maintain safety but I don’t see it.