Distributed systems, the hard parts
Originally this was a Twitter thread with 2k likes on my now-deleted account. It was pretty condensed, and so are the paragraphs in this article.
Distributed systems are interesting because you don't have some of the guarantees that you normally have, and are used to, in a single process. An anti-pattern is to assume you still have them. Here are a few such assumptions I've run into in my experience.
RPCs
RPCs have higher overhead than function calls. If you expect many items, put multiple items into a single request, but don't put all of them into one giant batch: RPC frameworks and services impose request-size limits, and you want consistent latency. Make RPCs in parallel, but don't forget to limit the number of concurrent outgoing requests, e.g. to avoid running out of file descriptors.
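A minimal sketch of both ideas in Python, assuming a hypothetical async `send_batch` coroutine that performs one RPC; the batch-size and concurrency limits are made-up numbers you'd tune to your service:

```python
import asyncio

MAX_BATCH_SIZE = 25     # made-up limit; use whatever your RPC/service actually allows
MAX_CONCURRENCY = 8     # made-up limit; tune to your connection / file-descriptor budget

def make_batches(items, size=MAX_BATCH_SIZE):
    """Split items into batches no larger than the RPC's limit."""
    return [items[i:i + size] for i in range(0, len(items), size)]

async def send_all(items, send_batch):
    """Send batches in parallel, but never more than MAX_CONCURRENCY at once.
    `send_batch` is a hypothetical coroutine that performs one RPC."""
    semaphore = asyncio.Semaphore(MAX_CONCURRENCY)

    async def send_limited(batch):
        async with semaphore:
            return await send_batch(batch)

    return await asyncio.gather(*(send_limited(b) for b in make_batches(items)))
```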
Unlike a function call, when you make an RPC the handler may receive your request multiple times, so implement idempotency. And when you respond to the client, it doesn't necessarily receive your response; it might retry, so idempotency helps again. Same with queues: a message may be delivered multiple times, and the message handler has to be idempotent. Exactly-once semantics are difficult and rare in practice; usually you get at-least-once.
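One common way to get idempotency is a deduplication table keyed by a client-supplied request id. A rough sketch, with sqlite3 standing in for your database and `payload.upper()` standing in for the real work:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE processed (request_id TEXT PRIMARY KEY, result TEXT)")

def handle(request_id: str, payload: str) -> str:
    """Idempotent handler: the PRIMARY KEY on request_id guarantees the work is
    recorded at most once, so a duplicate delivery returns the stored result."""
    try:
        result = payload.upper()  # stand-in for the real work / side effect
        db.execute("INSERT INTO processed VALUES (?, ?)", (request_id, result))
        db.commit()
        return result
    except sqlite3.IntegrityError:
        # duplicate delivery: the request was already handled, return the old result
        return db.execute("SELECT result FROM processed WHERE request_id = ?",
                          (request_id,)).fetchone()[0]

assert handle("req-1", "hello") == handle("req-1", "hello")  # the retry is a no-op
```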
When calling a stateful dependency, such as a distributed database, different items in a batch might be processed by different nodes. Minimize the number of nodes involved per batch by sorting items by the sharding key before forming batches.
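For example, a sketch of forming batches after sorting by a sharding key (here a made-up `customer_id`), so items that live on the same shard land in the same batch:

```python
def batches_by_shard(items, key, max_batch_size=25):
    """Sort by the sharding key before chunking, so items on the same shard
    end up in the same batch and each batch touches fewer nodes."""
    ordered = sorted(items, key=key)
    return [ordered[i:i + max_batch_size]
            for i in range(0, len(ordered), max_batch_size)]

# assumption: customer_id is the sharding key of the downstream store
batches = batches_by_shard(
    [{"customer_id": "c2"}, {"customer_id": "c1"}, {"customer_id": "c2"}],
    key=lambda item: item["customer_id"],
)
```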
Race conditions
The order in which requests are sent isn't the same as the order in which they are received. Assuming otherwise might lead to races. Enqueued messages may be delivered out of order. Consider putting data-that-might-change into a database, and putting only an immutable reference to that data into an SQS message. Then, when processing the (possibly stale) message, the handler fetches the latest state.
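A toy sketch of that pattern, with an in-process dict and queue standing in for the database and SQS (the order id and status fields are made up):

```python
import json, queue

database = {"order-42": {"status": "PLACED"}}  # the mutable source of truth
sqs = queue.Queue()                            # stand-in for SQS

def enqueue_order_event(order_id: str) -> None:
    # put only an immutable reference into the message, never the mutable state
    sqs.put(json.dumps({"order_id": order_id}))

def handle_message(body: str) -> None:
    order_id = json.loads(body)["order_id"]
    order = database[order_id]  # fetch the *latest* state at processing time
    print(f"processing {order_id}, current status: {order['status']}")

enqueue_order_event("order-42")
database["order-42"]["status"] = "SHIPPED"  # the state changes while the message waits
handle_message(sqs.get())                   # the handler still sees the latest state
```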
A piece of state you read from another machine/service/database isn't necessarily the same when you write it back a subsecond later. This also leads to race conditions. When writing it back, atomically verify that the state is still the same, via ConditionChecks / transactions.
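One standard way to do this on DynamoDB is optimistic locking with a version attribute. A sketch assuming a hypothetical `orders` table whose items carry made-up `status` and `version` attributes:

```python
import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("orders")  # hypothetical table

def update_status(order_id: str, new_status: str, expected_version: int) -> bool:
    """Write back only if nobody modified the item since we read it."""
    try:
        table.update_item(
            Key={"order_id": order_id},
            UpdateExpression="SET #s = :s, #v = #v + :one",
            ConditionExpression="#v = :expected",  # atomic compare-and-set
            ExpressionAttributeNames={"#s": "status", "#v": "version"},
            ExpressionAttributeValues={":s": new_status,
                                       ":expected": expected_version,
                                       ":one": 1},
        )
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # lost the race: re-read the item and retry
        raise
```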
Clocks on different machines may show different times. Don't rely on timestamp comparison if you must get it 100% correct. This is another source of race conditions.
Most race conditions can be addressed by a synchronization mechanism, and in most distributed systems it is your database that supports atomic condition checks / transactions. It is the arbiter of who wins.
Atomic mutations
Unlike state in a single process, in a distributed system you can rarely mutate two systems atomically; e.g. marking a purchase order as placed in the database and sending a confirmation email is not atomic. You might successfully mutate the first system, but fail on the second one.
If your storage system supports Change Data Capture (CDC), like DynamoDB, only place the purchase order, and in the background listen to committed changes and send the email from there. This assumes that sending the email is idempotent.
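With DynamoDB that background listener is typically a Lambda function subscribed to the table's stream. A rough sketch, assuming the stream includes new images; the attribute names are made up and the email helper is a stand-in:

```python
def send_confirmation_email(order_id: str, email: str) -> None:
    print(f"sending confirmation for {order_id} to {email}")  # stand-in for the real call

def lambda_handler(event, context):
    for record in event["Records"]:
        if record["eventName"] != "INSERT":
            continue                                 # only react to newly placed orders
        new_image = record["dynamodb"]["NewImage"]   # attributes in DynamoDB's typed form
        send_confirmation_email(new_image["order_id"]["S"], new_image["email"]["S"])
```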
Without CDC, update the order row and atomically insert another row representing an intention to send the email. In the background, poll the "intention rows" and send the email. Usually this is done in conjunction with a queuing system, so typically the intention is to enqueue a message.
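This is the transactional outbox pattern. A minimal sketch with sqlite3 standing in for your database; the poller's print is a stand-in for enqueueing a message or sending the email:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE orders (order_id TEXT PRIMARY KEY, status TEXT);
    CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT,
                         order_id TEXT, sent INTEGER DEFAULT 0);
""")

def place_order(order_id: str) -> None:
    """Write the order and the 'send email' intention in ONE local transaction."""
    with db:  # commits both inserts atomically, or neither
        db.execute("INSERT INTO orders VALUES (?, 'PLACED')", (order_id,))
        db.execute("INSERT INTO outbox (order_id) VALUES (?)", (order_id,))

def drain_outbox() -> None:
    """Background poller: pick up unsent intentions and act on them (idempotently)."""
    rows = db.execute("SELECT id, order_id FROM outbox WHERE sent = 0").fetchall()
    for row_id, order_id in rows:
        print(f"enqueue/send confirmation for {order_id}")  # stand-in for SQS / email
        db.execute("UPDATE outbox SET sent = 1 WHERE id = ?", (row_id,))
    db.commit()

place_order("order-42")
drain_outbox()
```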
Availability
A single process can rarely overwhelm itself. In a distributed system, one service can take another down by sending too many requests. A service must protect itself by throttling incoming requests, and you sometimes should protect others by throttling your outgoing requests; see adaptive client-side throttling by @MarcJBrooker.
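One common formulation of adaptive client-side throttling (also described in the Google SRE book) tracks how many requests the client sent versus how many the backend accepted, and rejects new calls locally with probability max(0, (requests - K*accepts) / (requests + 1)). A rough sketch; K is a made-up parameter and the bookkeeping should really be a sliding window:

```python
import random

class AdaptiveThrottle:
    """Client-side throttle: start rejecting locally when the backend
    accepts only a small fraction of our requests."""

    def __init__(self, k: float = 2.0):
        self.k = k         # higher K = more lenient throttling
        self.requests = 0  # attempts made (a real implementation uses a sliding window)
        self.accepts = 0   # attempts the backend actually accepted

    def allow(self) -> bool:
        reject_probability = max(
            0.0, (self.requests - self.k * self.accepts) / (self.requests + 1))
        return random.random() >= reject_probability

    def record(self, accepted: bool) -> None:
        self.requests += 1
        if accepted:
            self.accepts += 1
```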
Unlike functions in the same process, different services have different availability SLOs. In the code path of your API, don't call another API whose availability promise is lower than the one you made for yours. Same with latency. Try to move such calls out, into background jobs.
With a single process, if it crashes there isn't much you can do, so why bother. In a distributed system, your service should survive the failure of a single process/machine, and should try to fail gracefully if a dependency service is hard-down.
When one process fails, the system should somehow notice (failed health checks, expired leases, explicit acknowledgements of completed work) and redistribute the work among the survivors; see the lease sketch below. State is affected too: in a single process you don't have to maintain multiple exact copies of a piece of state, whereas in a stateful distributed service high availability is achieved by maintaining multiple replicas of the same state.
State replicas should live in different availability zones, which fail independently. Some systems keep copies even in geographically distant locations to survive natural disasters, for business continuity. State replication, especially synchronous / strongly consistent replication, is hard to get right. The good news is that you rarely have to implement it yourself, because your distributed database takes care of that.
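As for noticing failures, one common mechanism is a lease that each worker must keep renewing. A toy sketch, with a made-up lease length and worker ids, and an in-process dict standing in for a shared store:

```python
import time

LEASE_SECONDS = 10  # made-up lease length; must comfortably exceed heartbeat jitter
leases = {}         # worker_id -> expiry timestamp; in reality kept in a shared store

def heartbeat(worker_id: str) -> None:
    """A healthy worker renews its lease periodically."""
    leases[worker_id] = time.time() + LEASE_SECONDS

def expired_workers() -> list:
    """Workers whose lease ran out are presumed dead; their work gets reassigned."""
    now = time.time()
    return [w for w, expiry in leases.items() if expiry < now]

heartbeat("worker-a")
leases["worker-b"] = time.time() - 1  # simulate a worker that stopped renewing
print(expired_workers())              # ['worker-b'] -> redistribute its work
```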
Unlike a single process, deployments in a distributed system aren't atomic. Backward-incompatible changes require multiple deployments, e.g. 1) update service1 to accept both the old and the new protocol, 2) update service2 to use the new protocol, 3) remove the old-protocol code from service1.
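A tiny sketch of step 1, with made-up field names: during the rollout, service1 tolerates both shapes of the request, so it works regardless of which version of service2 is currently deployed:

```python
# Step 1 of the rollout: service1 accepts both the old and the new field name.
# The field names are made up for illustration.
def parse_request(body: dict) -> str:
    if "customer_id" in body:      # new protocol
        return body["customer_id"]
    return body["customerId"]      # old protocol; its handling is removed only in step 3

assert parse_request({"customerId": "c-1"}) == parse_request({"customer_id": "c-1"})
```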
Security
Different parts of the same process trust each other. A distributed system should minimize the impact when one node is compromised: each node may have to verify each incoming call/connection via authentication, authorization, transport encryption and/or input validation.
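As one illustration of per-request authentication, a node can verify an HMAC signature over the request body with a shared secret; real systems often use mTLS or signed tokens instead, and the secret and body here are made up:

```python
import hashlib, hmac

SHARED_SECRET = b"not-a-real-secret"  # made up; in practice fetched from a secrets manager

def sign(body: bytes) -> str:
    return hmac.new(SHARED_SECRET, body, hashlib.sha256).hexdigest()

def verify_request(body: bytes, signature_header: str) -> bool:
    """Every node checks the caller's signature instead of trusting the network."""
    return hmac.compare_digest(sign(body), signature_header)  # constant-time comparison

body = b'{"action": "refund", "amount": 10}'
assert verify_request(body, sign(body))
assert not verify_request(body, "forged-signature")
```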
You might want to extract security-sensitive parts of your application, e.g. payments, into a separate highly secure environment, e.g. Nitro Enclaves. When a monolith is compromised, ALL of its parts are compromised at once.
Testing
Testing a distributed system is harder. It may require multiple machines and/or more time. If a dependency service in the cloud doesn't provide a local emulator, you may have to create resources in your cloud provider, and forget to clean them up.
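One small mitigation is to have every test create uniquely named resources and guarantee their cleanup, e.g. with a pytest fixture. A sketch with hypothetical create_queue / delete_queue helpers standing in for your cloud SDK:

```python
import uuid
import pytest

# hypothetical helpers standing in for your cloud SDK
def create_queue(name: str) -> str:
    print(f"creating {name} in the cloud")
    return f"https://example.com/queues/{name}"

def delete_queue(queue_url: str) -> None:
    print(f"deleting {queue_url}")

@pytest.fixture
def temp_queue():
    """Create a uniquely named queue for the test and guarantee its cleanup."""
    queue_url = create_queue(f"test-queue-{uuid.uuid4()}")
    try:
        yield queue_url
    finally:
        delete_queue(queue_url)  # runs even if the test fails

def test_roundtrip(temp_queue):
    assert temp_queue  # the real test would send and receive a message
```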
It might be infeasible to run integration tests locally or before merging a change. This increases the feedback loop interval, which makes engineers less happy.
Conclusion
While distributed systems offer higher scale, availability, better latency/throughput, and security, they are MUCH more complex. Making a system more distributed is a large investment, and you must have a crisp understanding of what you are trying to achieve, i.e. the ROI.