Section 1: Designing Data Intensive Applications

Nick Chouard

Feb 14, 2020, 5:29:13 PM
to Penny University
This week we had an information-packed discussion of the first section of Martin Kleppmann's "Designing Data Intensive Applications." The four chapters in the section were presented by Anthony Fox, Ben Miner, myself, and John Berryman. Getting through four chapters of such an informative book was challenging, but taking the time to go back through the chapters we had read definitely helped solidify the knowledge I gained on my initial read. Here are some of the big takeaways and questions for me.

Chapter 1:
  • Many of today's applications are data-intensive, rather than compute-intensive. Software engineers have to consider how this data impacts the reliability, scalability, and maintainability of their applications. While there are many tools available, it is important that as engineers we understand how each of these tools functions and how they can serve our applications.
  • Being able to describe the load on your system will help you come up with the best solution for your scalability problem (a short sketch of percentile-based load metrics follows this list). Remember that what works at one level of load will often not work at the next level.
  • Be wary of "accidental complexity" in your application. It can be easy to introduce future pain points while trying to quickly solve an issue or bug.
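
As a concrete (if toy) illustration of describing load and performance, here is a minimal Python sketch of my own, not from the book: it computes p50/p95/p99 response times from a list of request latencies, which says much more about the experience under load than an average does.

    # Minimal sketch (my own example): percentile view of response times.
    # Percentiles describe tail latency under load better than an average.
    def percentile(sorted_values, p):
        # Nearest-rank percentile over an already-sorted list.
        index = max(0, int(round(p / 100.0 * len(sorted_values))) - 1)
        return sorted_values[index]

    latencies_ms = sorted([12, 15, 14, 200, 16, 13, 18, 950, 17, 14])
    for p in (50, 95, 99):
        print("p%d = %d ms" % (p, percentile(latencies_ms, p)))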
Chapter 2:
  • The three main data models used today are the relational model, the document model, and the graph model. Each excels at specific use cases. In the future it will probably be useful to move towards a hybrid of these in order to provide maximum flexibility (and this is already being done in many systems).
  • Relational data models provide better support for relationships, while document models can provide better flexibility due to their schemaless design and better performance due to data locality (see the small example after this list).
  • I have yet to encounter a graph database in a real-world use case. Have any of you used one in production? How did it solve your particular problem?
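
To make the relational-vs-document trade-off concrete, here is a small sketch (my own illustration, not from the book) of the same profile stored once as a self-contained document and once as normalized rows: the document keeps related data together (locality), while the rows make relationships explicit through keys.

    # My own illustration: one profile as a self-contained document
    # (good locality, flexible schema) vs. normalized relational rows
    # (explicit relationships via keys, easier many-to-many joins).
    document = {
        "user_id": 1,
        "name": "Ada",
        "positions": [
            {"title": "Engineer", "org": "Acme"},
            {"title": "Analyst", "org": "Initech"},
        ],
    }

    users = [(1, "Ada")]                          # users(id, name)
    positions = [(10, 1, "Engineer", "Acme"),     # positions(id, user_id, title, org)
                 (11, 1, "Analyst", "Initech")]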
Chapter 3:
  • The two most popular structures for storing data are LSM-Trees and B-Trees. LSM-Trees provide faster write speeds while B-Trees provide faster read speeds.
  • Because of the way B-Trees are structured, you can store an enormous amount of data without increasing your write times significantly. B-Trees with n keys have a depth of O(logn).
  • With the rise of data analytics, the need for data warehouses arose to accommodate a different form of data processing. Data analytics requires aggregations over large amounts of data, which has led to OLAP (online analytic processing) systems, in contrast to the OLTP (online transaction processing) systems that power most user applications.
  • OLAP systems lend themselves to column-oriented storage, which has various benefits for large amounts of data, including the ability to compress your data and to run aggregations faster (since these usually operate on one column at a time); a small sketch follows this list.
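
Here is a tiny sketch of my own (not from the book) contrasting row-oriented and column-oriented layouts of the same table; the aggregation only has to touch the one column it needs, and a single homogeneous column also compresses well.

    # My own sketch: the same table stored row-wise and column-wise.
    # A column-oriented layout lets an aggregation scan only the column
    # it needs, and each homogeneous column compresses well.
    rows = [("2020-01-01", "US", 120.0),
            ("2020-01-01", "DE",  80.0),
            ("2020-01-02", "US", 150.0)]

    columns = {
        "date":    ["2020-01-01", "2020-01-01", "2020-01-02"],
        "country": ["US", "DE", "US"],
        "revenue": [120.0, 80.0, 150.0],
    }

    total_revenue = sum(columns["revenue"])   # touches a single column
    print(total_revenue)                      # 350.0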
Chapter 4:
  • The way your data is encoded affects not only the efficiency of your application, but also its architecture. Evolvability (forwards and backwards compatibility) refers to the ease of making changes to your application, and needs to be front of mind as encodings are chosen.
  • Text formats like XML and JSON are useful for their flexibility and readability, while binary formats greatly reduce the amount of space needed to store your data (a small comparison follows this list).
  • Special care and consideration of compatibility needs to be taken when examining different data flows, such as data through databases and data over networks. 
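
To give a feel for the text-versus-binary difference, here is a small sketch of my own using only the Python standard library. The binary layout is an assumption for illustration (a fixed int32 + float64 record); unlike JSON it requires both sides to agree on the schema out of band, which is exactly where the evolvability concerns come in.

    # My own sketch: the same record encoded as JSON text vs. a simple
    # fixed binary layout. Field names and quoting are repeated in the
    # text encoding; the binary layout carries only the values.
    import json
    import struct

    record = {"user_id": 42, "score": 3.5}

    as_json = json.dumps(record).encode("utf-8")
    as_binary = struct.pack("<id", record["user_id"], record["score"])  # int32 + float64

    print(len(as_json), "bytes as JSON vs", len(as_binary), "bytes as binary")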

Four chapters was quite a lot of information for just an hour of discussion. Everyone did a great job presenting their chapter, but I wish we had a little more time for discussion. In the future it may be useful to cover fewer chapters to allow for more open discussion. It was also really helpful that John and Anthony summarized their chapters in a Google doc that we could follow along with. I will definitely do this in the future to make for a more effective chat.

Thank you to everyone who participated, and especially to Anthony for organizing! Hopefully we will meet again soon to dive into more of this book.

Edward Ribeiro

Feb 23, 2020, 11:54:08 AM
to Penny University
Congratulations Nick, very cool write-up! I am very excited about the upcoming meetings. :)

I have some points I would like to discuss/clarify regarding the discussions so far. See below:


> B-Trees with n keys have a depth of O(logn).

Nope. The depth/height of a B-Tree is roughly *O(log_b N)*, where the base b is the fanout (the number of keys per node/page), not 2.
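
To see why the base matters, here is a tiny back-of-the-envelope computation (my own numbers, purely illustrative): a fanout of a few hundred keys per page keeps a billion keys within a handful of levels, whereas a binary tree would need around 30.

    # My own back-of-the-envelope numbers: depth ~ log_b(N) for fanout b.
    import math

    n_keys = 1_000_000_000
    for fanout in (2, 100, 500):
        print(fanout, "->", math.ceil(math.log(n_keys, fanout)), "levels")
    # 2   -> 30 levels
    # 100 -> 5 levels
    # 500 -> 4 levels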

> "popular structures for storing data"

popular data structures for storing and retrieving data on disk.
--------------------------

> "which has lead to OLAP (online analytic processing) systems, in contrast to the OLTP (online transaction processing) systems"

The OLAP/OLTP categories are quite old and have long been used in the DB community. What is different now is the scale of data and processing, which forced the creation of new engines specialized in either OLTP or OLAP (e.g., VoltDB, Vertica, Hive, Presto, Cassandra, etc.). See Michael Stonebraker's paper "'One Size Fits All': An Idea Whose Time Has Come and Gone" for context.

Before that era, DBs like Oracle and MS SQL Server were used for both OLTP and OLAP. Curiously, the newest trend in DB research is the creation of HTAP (Hybrid Transactional/Analytical Processing) systems, that is, history repeats itself. :)

> "better performance due to data locality"

Data locality is a feature explored in modern distributed DB systems generally (relational, NoSQL, NewSQL). It is not an exclusive feature of document DBs.

> "Relational data models provide better support for relationships"

Relational data models:
* Have strong and consistent data modeling features (constraints, normal forms, ACID transactions, etc);

* Are backed by sound math theory (relational algebra, set theory, etc);

* Have a declarative, intuitive, and expressive query language that abstracts away the on-disk data layout and access patterns (a tiny sqlite3 example follows this list).
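
As a small illustration of that last point, here is a sketch of my own using Python's standard-library sqlite3 module (table and data invented for the example): the query states which rows we want, and the engine decides how to lay out and access the data.

    # My own sketch: a declarative query via the stdlib sqlite3 module.
    # We say *what* we want; the engine chooses *how* to fetch it.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")
    conn.executemany("INSERT INTO users VALUES (?, ?, ?)",
                     [(1, "Ada", 36), (2, "Grace", 45), (3, "Alan", 41)])

    for (name,) in conn.execute("SELECT name FROM users WHERE age > 40 ORDER BY name"):
        print(name)    # Alan, then Grace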

> "sorted string tables (USED IN SEARCH ENGINES (ish))"

SSTables are used by many NoSQL systems (Cassandra, RocksDB, LevelDB, HBase, etc.). The idea came from Google's Bigtable.

> "keep a memtable in memory (skip list) and write it to disk when it gets too big"*

The memtable flushes its data to disk (as an SSTable) when it hits a threshold size or when a regular time interval elapses, whichever comes first.
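
A minimal sketch of my own (not modeled on any particular engine) of the size-threshold part of that behavior; a real engine would also flush on a timer and write to a write-ahead log first.

    # Minimal sketch: a memtable that flushes to an immutable, sorted
    # "SSTable" once it exceeds a size limit. (Real engines also flush on
    # a timer and protect the memtable with a write-ahead log.)
    class Memtable:
        def __init__(self, max_entries=4):
            self.data = {}
            self.max_entries = max_entries
            self.sstables = []          # newest last; stand-in for files on disk

        def put(self, key, value):
            self.data[key] = value
            if len(self.data) >= self.max_entries:
                self.flush()

        def flush(self):
            self.sstables.append(sorted(self.data.items()))  # written sorted by key
            self.data = {}                                   # start a fresh memtable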

> "does a merge sort and expunge dead data"

LSM-Trees are append-only data structures, and SSTables are immutable once written to disk. Even deletes and updates preserve old entries (a delete inserts a special marker, a tombstone, for the deleted row in the newer SSTable). Compaction reclaims space by merging files and discarding old/dead data.
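
A toy sketch of my own of that compaction step: merge two sorted tables, let the newer entry win, and drop keys whose newest entry is a tombstone. Real compaction is a streaming merge over files on disk; this version just uses in-memory dicts to show the idea.

    # My own sketch of compaction: newer entries win, tombstoned keys drop.
    TOMBSTONE = object()   # stand-in for the special deletion marker

    def compact(older, newer):
        merged = dict(older)     # older entries first...
        merged.update(newer)     # ...newer entries overwrite them
        return sorted((k, v) for (k, v) in merged.items() if v is not TOMBSTONE)

    old_sst = [("a", 1), ("b", 2), ("c", 3)]
    new_sst = [("b", 20), ("c", TOMBSTONE)]
    print(compact(old_sst, new_sst))   # [('a', 1), ('b', 20)]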

> "reading?"
For each reading, LSMT read request needs to merge data from SST tables on disk (discarding dead/out of date rows) and memtable in RAM. Bloom filters let skip SST table files.
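
And a matching sketch of my own for the read path: check the memtable, then the SSTables from newest to oldest. The per-table key set below is only a stand-in for a Bloom filter, which can say "definitely not here" without touching the file.

    # My own sketch of the LSM read path (newest data wins).
    def get(key, memtable, sstables_newest_first):
        if key in memtable:
            return memtable[key]
        for keys, table in sstables_newest_first:
            if key not in keys:        # Bloom-filter-style "skip this file"
                continue
            return table[key]
        return None

    memtable = {"x": 1}
    sstables = [({"y"}, {"y": 2}), ({"x", "z"}, {"x": 0, "z": 3})]
    print(get("z", memtable, sstables))   # 3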

> "b-trees have to be defragmented"

Because B-Trees do in-place insertions/updates/deletions, their pages fragment over time. LSM-Trees are append-only.

> "Search Posting lists are effectively mini-LSMTs"

Hum... sort of. Lucene's segments are similar to SSTables. Think of one posting list in RAM and one posting list on disk for every segment. Each query requires looking at both RAM and disk and merging the results. Lucene's segment merging is like LSM-Tree compaction.

> "XML - bloated, complicated, yet lives on"

XML is verbose yet simple and self-documenting; it has comments and schemas, but the ecosystem around XML is huge and very complex. JSON is more compact, but lacks schemas and comments, for example.

Best,
Edward
