Design & Architecture

Distributed Systems & Consistency

Advanced

As soon as your system spans more than one process — services, a database and a cache, a queue, a third party — you are building a distributed system, and the easy assumptions of single-process code stop holding. Networks fail and lag, messages duplicate and reorder, and two places can disagree about the truth. Design for this reality instead of being surprised by it.

Distributed systems have well-known, unavoidable trade-offs. You cannot have perfect consistency, availability, and partition-tolerance all at once (CAP). Most real systems accept eventual consistency: different parts agree over time rather than being identical at every moment. The common mistakes are assuming a call always succeeds instantly, that everyone sees the same data at the same moment, or that you can update two systems atomically without a plan.

This pulls together topics from across the handbook — idempotency and transactions (Data Integrity), messaging (Asynchronous Messaging), failure handling (Designing for Failure), and caching — into the mindset you need when state lives in more than one place. For money and AML decisions, deciding where you need strong consistency versus where eventual is fine is a critical design call.

Assume the network and accept the trade-offs

DoAssume every remote call can be slow, fail, time out, or be retried. Design timeouts, retries with backoff, and fallbacks for this (see Designing for Failure, Third-Party Integrations).
DoDecide on purpose where you need strong consistency (money, balances, screening decisions — keep these within one transactional boundary) and where eventual consistency is acceptable (read models, search, analytics).
DoMake operations idempotent so retries and duplicate messages do not double-apply. The network means you will sometimes see both (see Data Integrity & Transactions, Asynchronous Messaging).
DoKeep related data that must change together within one service or transaction. Across services, use sagas or outbox with compensation instead of pretending you have a distributed transaction.
NeverUpdate two systems (for example, the database and a published event or another service) as separate steps for a money- or state-critical change without a pattern (outbox or saga) to keep them consistent.

Dual write, assume success

db.Save(payment);             // succeeds
bus.Publish(paymentEvent);    // network blip -> never sent
// downstream never learns; systems now permanently disagree

Two separate writes with no consistency plan. A failure between them leaves the database and the rest of the system disagreeing forever. Across processes, you cannot assume both happen.

Outbox with idempotent consumers

using var tx = db.Begin();
db.Save(payment); db.Save(outboxEvent);   // one atomic transaction
tx.Commit();
// a relay publishes the outbox event; consumers are idempotent

The state change and the intent to publish commit together. Delivery happens reliably afterwards, and duplicates are safe. The systems converge correctly.

Design for partial truth

DoExpect stale reads where you have chosen eventual consistency. Make the UX and logic tolerate it (for example, "processing" states) instead of assuming instant global agreement.
DoHandle concurrency and ordering explicitly. Use optimistic concurrency for races, and do not assume messages or events arrive in order (see Concurrency).
ConsiderReconciliation for critical state (for example, payments): periodically compare with the source of truth to catch and fix divergence (see Third-Party Integrations).
ConsiderCorrelation ids and distributed tracing, because debugging spans many services (see Observability & Logging Hygiene).
AvoidDistributed monoliths: many services so chatty and tightly coupled they must all be deployed together. You get the cost of distribution with none of the independence (see Domain Modelling & Boundaries).

Self-review checklist

AskDoes this assume a remote call is instant and always succeeds?
AskDoes this operation need strong consistency, or is eventual fine — and have I chosen on purpose?
AskIf a message/call is duplicated or reordered, is the result still correct?
AskAm I updating two systems with no plan to keep them consistent?

Why it matters: Most real outages and data-integrity bugs in modern systems come from distributed-systems realities that single-process thinking ignores: dual writes that diverge, retries that double-charge, stale reads treated as truth. Designing for the network, choosing consistency on purpose, and making operations idempotent is what keeps a multi-part, money-handling system correct under the failures that will happen.