Designing Data Intensive Applications: Part 3


JnBrymn

Apr 8, 2020, 5:36:03 PM
to Penny University
Edward Ribiero led our penultimate installment of the Designing Data Intensive Applications book review (thank you!). Here are some things that stood out to me:

Chapter 8: The Trouble with Distributed Systems (Edward)
  • The biggest takeaway is that a remote system can fail, and you may never know how it failed. You can't really even be sure it failed at all! So it's obviously very difficult to build systems around such weak guarantees.
  • A secondary point is how undependable clocks are. There are several types of clocks that computers use, but they all have problems.
    • wall clock time can not only be wrong, it can also do silly things like move in the opposite direction of normal time
    • monotonic clocks are useful on a single machine for timing events... but the timing is not terribly accurate, and a monotonic clock reading has no meaning outside the computer it lives on (see the clock sketch after this list)
    • logical clocks (like Lamport timestamps) are identifiers that increment monotonically, so they can help establish ordering, but they only seem to be useful after the fact and can't establish the ordering of concurrent events as they happen (see the Lamport sketch after this list)
  • Paxos is a famous distributed consensus algorithm... but it's really a family of algorithms. What people (Google included) tend to actually implement is "Multi-Paxos".
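
To make the wall-clock vs. monotonic-clock distinction concrete, here is a minimal Python sketch (my own illustration, not from the book). The wall clock can be stepped forwards or backwards by NTP at any moment, so it is unsafe for measuring elapsed time, while the monotonic clock is safe for durations but its absolute value means nothing outside the machine it runs on:

    import time

    # Wall-clock time: comparable across machines in principle, but NTP can
    # step it forwards or backwards at any moment, so elapsed-time math on it
    # is unsafe.
    start_wall = time.time()

    # Monotonic time: guaranteed never to go backwards on this machine, so it
    # is the right tool for measuring durations -- but its absolute value is
    # meaningless outside this process/host.
    start_mono = time.monotonic()

    _ = sum(range(1_000_000))  # stand-in for some real work

    print("wall-clock elapsed:", time.time() - start_wall)      # could even be negative
    print("monotonic elapsed:", time.monotonic() - start_mono)  # always >= 0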
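
And here is a minimal Lamport-timestamp sketch, again my own illustration (the names LamportClock, tick, and update are made up), just to show why causally related events get ordered while truly concurrent events do not:

    class LamportClock:
        """Toy Lamport timestamp: a counter that every node keeps locally."""

        def __init__(self):
            self.time = 0

        def tick(self):
            # Called for every local event, including sending a message.
            self.time += 1
            return self.time

        def update(self, received_time):
            # Called when a message arrives: jump ahead of the sender if needed.
            self.time = max(self.time, received_time) + 1
            return self.time

    # Two "nodes" exchanging one message: the receive always gets a larger
    # timestamp than the send, so causally related events are ordered.
    a, b = LamportClock(), LamportClock()
    send_ts = a.tick()           # node A sends a message stamped with its clock
    recv_ts = b.update(send_ts)  # node B receives it and advances past A's stamp
    assert recv_ts > send_ts
    # Two concurrent events on different nodes, however, can end up with
    # arbitrary relative timestamps -- hence "only useful after the fact".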

Chapter 9: Consistency and Consensus (John)
  • Linearizability is not serializability. Linearizability is the guarantee that you can make data in your system act as if there is only one copy (even if it's replicated for durability).
  • Linearizability doesn't seem so profoundly useful on the surface, but consider these two cases where it's important for data to seem as if there is only one copy:
    • If the leader node of your database cluster falls over, it's important that all nodes agree about who the new leader is.
    • If you are signing up users and each user must have a unique id, then all nodes have to agree, when a new user id is created, that it is unique (see the sketch after this list).
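
As a toy illustration of that second case (my own sketch, not from the book): the signup only works if the check-then-claim step is atomic against a single authoritative copy of the data. Here a dict plus a lock stands in for whatever linearizable store (a single-leader database, a consensus service like etcd or ZooKeeper, etc.) would play that role in a real system, and claim_username is a made-up helper:

    import threading

    # One authoritative copy of "who owns which username". In production this
    # role is played by a linearizable store; the dict + lock here just stands
    # in for the guarantee that every node sees the same single copy.
    _claimed = {}
    _lock = threading.Lock()

    def claim_username(username, user_id):
        """Atomically claim a username; returns True only for the first claimant."""
        with _lock:  # makes check-then-set one atomic step
            if username in _claimed:
                return False
            _claimed[username] = user_id
            return True

    assert claim_username("edward", 1) is True
    assert claim_username("edward", 2) is False  # the second signup is rejected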

Chapter 10: Batch Processing (Eric Goddard)
  • I found the author's metaphor relating UNIX processes to batch processing interesting. At first I didn't get it, because MapReduce feels almost nothing like piping bash commands together. But once we started talking about Spark processing it made more sense. Both UNIX and Spark work because they have a standardized interface between processing units (in UNIX the standard is byte streams; in Spark it's the streams of records flowing between operators). Also, both UNIX and Spark avoid materializing intermediate results (i.e. they avoid the expense of writing them to disk) and instead stream data from one processing stage to the next. A small generator-based sketch of the same idea follows below.
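
Here is a rough Python-generator sketch of that streaming idea (my own illustration; the access.log filename and the " 500 " pattern are made up). Each stage consumes a stream and yields a stream, so nothing is materialized between stages, much like a UNIX pipeline:

    # Each stage takes an iterator and yields an iterator, so records flow
    # through the whole pipeline one at a time -- no intermediate result is
    # written to disk (or even held fully in memory) between stages.

    def read_lines(path):
        with open(path) as f:
            for line in f:
                yield line.rstrip("\n")

    def grep(lines, needle):
        return (line for line in lines if needle in line)

    def count(lines):
        return sum(1 for _ in lines)

    # Roughly the equivalent of: cat access.log | grep " 500 " | wc -l
    # (hypothetical log file, so the call is left commented out)
    # n_errors = count(grep(read_lines("access.log"), " 500 "))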

Outside of the book conversation, I continue to be impressed with Edward's knowledge of the research. He's always throwing links to papers into the conversation. This time he threw in a link to a paper entitled Calvin: Fast Distributed Transactions for Partitioned Database Systems, which seems interesting because in my chapter, "distributed transactions" were exactly what seemed so very hard to do with any speed at all!


It was a good conversation. Thanks all!