Designing Data-Intensive Applications: Part 2


Nick Chouard

Mar 7, 2020, 12:14:17 PM
to Penny University
This Friday, John Berryman, Edward Ribeiro, Ben Miner, and I met to discuss chapters 5-7 of Designing Data-Intensive Applications. I led the discussion of chapter 5, while Edward led on chapter 6 and Ben on chapter 7. Here are my takeaways from chapter 5:

  • Replication can serve the purpose of reducing latency by placing data geographically close to users, increasing availability by not relying on a single node, and increasing read throughput by having multiple nodes with the same data.

  • The most common form of replication is single-leader replication, where one node accepts all the writes and passes them on to "follower" nodes. Reads can then be served by the leader or by any follower (a toy sketch of this flow follows).
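
To make the mechanics concrete, here is a minimal Python sketch of that flow. The class names are my own, and replication is done synchronously in-process only to keep the example short:

    # A toy single-leader setup (not any real database's API): the leader applies
    # each write locally, appends it to a replication log, and pushes it to every
    # follower. Followers apply writes in the order they arrive.
    class Follower:
        def __init__(self):
            self.data = {}

        def apply(self, key, value):
            self.data[key] = value

        def read(self, key):
            return self.data.get(key)


    class Leader:
        def __init__(self, followers):
            self.data = {}
            self.log = []  # replication log of (key, value) writes
            self.followers = followers

        def write(self, key, value):
            self.data[key] = value
            self.log.append((key, value))
            for follower in self.followers:  # asynchronous in real systems
                follower.apply(key, value)

        def read(self, key):
            return self.data.get(key)


    followers = [Follower(), Follower()]
    leader = Leader(followers)
    leader.write("greeting", "hello")
    print(followers[0].read("greeting"))  # "hello", once replication has applied it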

  • Requiring a leader to pass all of its writes on to follower nodes can lead to replication lag. Several problems follow from this, and each has a solution:

    • A user could write to the leader and then read from a follower before the write has been replicated to that follower, making it appear as if their write was lost. Read-after-write consistency solves this (sketched in the code after this list).

    • If a user first reads data from a fresh replica, then from a stale replica, it may appear that data is going back in time. Monotonic reads fix this issue. 

    • If two separate clients write to different partitions of the database (imagine two people having a conversation and a third person observing it), the observer may read from the partitions out of order and see the conversation out of sync. Fixing this requires the consistent prefix reads guarantee: if a sequence of writes happens in a certain order, anyone reading those writes sees them in that same order.
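
Two of those fixes can be sketched against the toy leader/followers above: read your own recent writes from the leader, and pin each user to one follower for monotonic reads. The one-second lag budget and the hash-based pinning are illustrative choices, not prescriptions from the book:

    import time

    REPLICATION_LAG_BUDGET = 1.0  # assumed upper bound on replication lag, in seconds

    class ReadRouter:
        """Toy client-side router providing read-after-write and monotonic reads."""

        def __init__(self, leader, followers):
            self.leader = leader
            self.followers = followers
            self.last_write_at = {}  # user_id -> time of that user's last write

        def write(self, user_id, key, value):
            self.leader.write(key, value)
            self.last_write_at[user_id] = time.monotonic()

        def read(self, user_id, key):
            # Read-after-write: if this user wrote very recently, a follower may
            # not have the change yet, so serve the read from the leader.
            wrote_at = self.last_write_at.get(user_id)
            if wrote_at is not None and time.monotonic() - wrote_at < REPLICATION_LAG_BUDGET:
                return self.leader.read(key)
            # Monotonic reads: always route the same user to the same follower,
            # so they never see an older replica after a fresher one.
            follower = self.followers[hash(user_id) % len(self.followers)]
            return follower.read(key)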

  • Certain use cases require multi-leader replication: 

    • Multi data center operation 

    • Clients with offline operation (for example, calendar apps on multiple devices, where each device's local database acts as a leader and syncs with the others)

    • Real-time collaboration. 

This brings a challenge of dealing with write conflicts, which requires the system to have some form of conflict resolution.
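
The simplest (and lossiest) resolution strategy the book describes is last write wins: attach a timestamp to every write and keep only the highest one. A minimal sketch:

    # Last-write-wins (LWW): each conflicting version carries a timestamp and the
    # one with the highest timestamp survives. Concurrent writes are silently
    # dropped, which is why the book treats LWW as a lossy strategy.
    def resolve_lww(versions):
        """versions: iterable of (timestamp, value) pairs from different leaders."""
        return max(versions, key=lambda v: v[0])[1]

    print(resolve_lww([(1583590457.2, "meeting at 10"),
                       (1583590461.9, "meeting at 11")]))  # -> "meeting at 11"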

  • The third strategy for replication is leaderless replication (also called Dynamo-style data stores), in which clients can write to several nodes and also read from several nodes in parallel in order to detect and correct nodes with stale data. These data stores use "quorums" to decide whether data is up to date: if W + R > N, where W is the number of nodes that must acknowledge a write for it to succeed, R is the number of nodes we read from, and N is the total number of replicas, we expect to get up-to-date values from the data store. You can tune these values in Dynamo-style systems for higher availability and lower latency, but that may come at the cost of more stale reads. The biggest issue with leaderless replication is dealing with concurrent writes.
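
The quorum condition itself is easy to check; here is a tiny illustration with N = 3:

    def is_quorum(n, w, r):
        """W + R > N means the set of nodes written to and the set of nodes read
        from must overlap in at least one node, so a read sees the latest
        successfully written value (barring edge cases the book discusses)."""
        return w + r > n

    print(is_quorum(n=3, w=2, r=2))  # True: a common Dynamo-style configuration
    print(is_quorum(n=3, w=1, r=1))  # False: lower latency, but reads may be stale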


Thanks for another great discussion everyone! I'm looking forward to the next one.

Edward Ribeiro

Mar 11, 2020, 2:48:41 PM
to Penny University
Hi,

I would like to thank you for allowing me to participate in this discussion. Here are my takeaways from Chapter 6 - Partitioning.

* Partitioning (sharding) splits the data and distributes it among a set of nodes in a cluster. It is usually required for scalability and load balancing, that is, when the data cannot fit on a single machine or when the read/write load must be spread across several machines. It is usually combined with replication (Chapter 5);

* Partitioning can be by key or by range of keys. The granularity of the partitioning is usually a big design concern for the app: the unit of data spread among nodes may be as small as a single row (as in Cassandra) or a range of rows (as in Bigtable/HBase). Partitioning by hash of key distributes the rows more evenly across the cluster, but you lose the range-scan efficiency of range partitioning (as in Bigtable/HBase); a sketch of both strategies follows below;
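
A minimal sketch of the two placement strategies; the partition count and range boundaries are made up for illustration:

    import hashlib

    NUM_PARTITIONS = 8

    def hash_partition(key):
        """Hash partitioning: keys spread evenly, but adjacent keys land in
        different partitions, so range scans must query every partition."""
        return int(hashlib.md5(key.encode()).hexdigest(), 16) % NUM_PARTITIONS

    # Range partitioning: keys stay sorted within a partition, so a scan over a
    # contiguous key range touches only the partitions covering that range, but
    # a skewed key distribution can overload a single partition.
    RANGE_BOUNDARIES = ["g", "n", "t"]  # partitions: [..g), [g..n), [n..t), [t..]

    def range_partition(key):
        for i, upper in enumerate(RANGE_BOUNDARIES):
            if key < upper:
                return i
        return len(RANGE_BOUNDARIES)

    print(hash_partition("apple"), hash_partition("apricot"))    # usually differ
    print(range_partition("apple"), range_partition("apricot"))  # both 0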


* Each partition is a mini-database of its own, a concept later explored by Google Megastore, a distributed transactional ACID database built on top of an eventually consistent NoSQL database (Bigtable);

* Partitions can be subject to two related phenomena: skew (one partition receives more data or queries than others) and hot spots (one partition has a much higher read/write load than the others). There is no general solution to either; a trade-off must be made, for example by carefully choosing keys so the load is spread evenly for that specific app (one common trick is sketched below);
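
One mitigation the book mentions for a single hot key is to append a small random number to it so its writes spread across several partitions, at the cost of fanning every read out over all of the sub-keys. A sketch (the key names are made up):

    import random

    SPREAD = 10  # number of sub-keys a hot key is split into

    def write_key(key, is_hot):
        """Hot keys get a random two-digit suffix, so their writes land on up
        to SPREAD different partitions instead of hammering just one."""
        if is_hot:
            return f"{key}_{random.randrange(SPREAD):02d}"
        return key

    def read_keys(key, is_hot):
        """The price: reads of a hot key must query every suffixed sub-key
        and combine the results."""
        if is_hot:
            return [f"{key}_{i:02d}" for i in range(SPREAD)]
        return [key]

    print(write_key("celebrity_user_timeline", is_hot=True))
    print(read_keys("celebrity_user_timeline", is_hot=True))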

* Partitioning makes secondary indexing hard. A secondary index can be document-partitioned (local): each node stores an index covering only the data in its own partition, so queries must be sent to all partitions and the results gathered (the scatter/gather approach, sketched below). This is expensive even with parallel queries, because tail latency (waiting for the slowest node to answer) kicks in. On the other hand, a term-partitioned (global) index covers data from all partitions; the index itself must be partitioned to avoid single points of failure and hot spots. Its lookups are more efficient than document-partitioned indexing, but writes are slower and more complex;
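
A minimal sketch of the scatter/gather read path over a document-partitioned (local) secondary index; the partition class and its query method are hypothetical stand-ins:

    from concurrent.futures import ThreadPoolExecutor

    class FakePartition:
        """Stand-in for a partition with a local (document-based) secondary index."""
        def __init__(self, docs):
            self.docs = docs

        def query_local_index(self, field, value):
            return [d for d in self.docs if d.get(field) == value]

    def scatter_gather(partitions, field, value):
        """Document-partitioned index: each partition only indexes its own
        documents, so the query is sent to every partition in parallel and the
        partial results are merged. Overall latency is set by the slowest
        partition to respond (tail latency)."""
        with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
            partials = pool.map(lambda p: p.query_local_index(field, value), partitions)
        return [doc for partial in partials for doc in partial]

    partitions = [FakePartition([{"color": "red", "id": 1}]),
                  FakePartition([{"color": "red", "id": 7}, {"color": "blue", "id": 9}])]
    print(scatter_gather(partitions, "color", "red"))  # docs 1 and 7, from both partitions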

* Partitioning can be fixed or dynamic. In a fixed scheme, neither the number of partitions nor the key-to-partition mapping changes. The alternative is dynamic partitioning, as used by Bigtable/HBase, where a partition is automatically split once it grows past a size threshold and part of it can be moved to another node. With either scheme, moving data between nodes (known as partition rebalancing) is complex because the nodes must keep serving requests while the data is moved and the associated metadata is updated;


* A common approach used in fixed partitioning is to make the number of partitions larger than the number of nodes. This is usually preferable because new nodes can "steal" partitions from loaded nodes, but it requires the cluster to maintain a mapping between partitions and the nodes that store them. Choosing a good initial value for the number of partitions and their size is not trivial, as it depends on the application's characteristics and the anticipated volume of users/data.
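
A toy sketch of that scheme: keys hash into a fixed number of partitions, a separate map assigns partitions to nodes, and rebalancing (or routing a request, as in the last point below) only touches that map, never the key-to-partition mapping:

    import hashlib

    NUM_PARTITIONS = 64  # fixed up front, deliberately larger than the node count

    def partition_of(key):
        """Key -> partition mapping; this never changes, even when nodes join."""
        return int(hashlib.md5(key.encode()).hexdigest(), 16) % NUM_PARTITIONS

    def assign_partitions(nodes):
        """Naive round-robin partition -> node assignment. A real rebalancer
        would move as few partitions as possible, and cluster metadata (e.g.
        kept in ZooKeeper) records who currently owns what."""
        return {p: nodes[p % len(nodes)] for p in range(NUM_PARTITIONS)}

    def route(key, assignment):
        """Routing tier: find the key's partition, then the node that owns it."""
        return assignment[partition_of(key)]

    assignment = assign_partitions(["node-a", "node-b", "node-c"])
    print(route("user:42", assignment))

    # Adding a node changes only the partition -> node map; keys never change
    # partition, so no data is rehashed, only whole partitions move.
    assignment = assign_partitions(["node-a", "node-b", "node-c", "node-d"])
    print(route("user:42", assignment))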

* Finally, there's the routing of requests to the right partition. Routing requires a routing tier that will either dynamically calculate what partition should hold the data (as in Cassandra or Dynamo) or use a discovery service like ZooKeeper (as HBase does).

Thanks again for taking part in the discussions! :) 

Edward