Designing Data Intensive Applications: Part 4


JnBrymn

May 7, 2020, 1:03:46 PM
to Penny University
Edward Ribeiro organized our latest discussion of Designing Data-Intensive Applications, and it was a great one!

Here are a couple of my takeaways.

Chapter 11: Stream Processing

This was a good review of the history of streaming. The author first went over the (now somewhat dated) publish-subscribe model of messaging systems, and then moved into discussing the revolutionary shift we've seen toward log-style message systems. I like the simplicity of the new systems over the complexity of the old. The new style of message processing implies fewer guarantees and fewer ways of working with messages, but there is strength in the simplicity, because the simpler patterns are sufficient for most things you would want to do and are much easier to reason about.
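To make the contrast concrete, here's a minimal sketch (my own toy code, not any real broker's API) of the log-based model: the broker keeps an append-only log, and each consumer tracks its own offset into it, so "replay" is just rewinding that offset.

```python
class Log:
    """An append-only log of messages; the broker-side state."""

    def __init__(self):
        self.entries = []

    def append(self, message):
        self.entries.append(message)
        return len(self.entries) - 1  # offset of the new message

    def read(self, offset):
        """Return all messages at or after `offset`."""
        return self.entries[offset:]


class Consumer:
    """Position in the log is consumer-side state, not broker-side."""

    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self):
        messages = self.log.read(self.offset)
        self.offset = len(self.log.entries)
        return messages


log = Log()
log.append("order_created")
log.append("order_paid")

c = Consumer(log)
print(c.poll())  # ['order_created', 'order_paid']
c.offset = 0     # replaying history is just resetting the offset
print(c.poll())  # same two messages again
```

Compare this with pub-sub brokers, which delete a message once it's acknowledged: here nothing is ever deleted, which is exactly what makes reprocessing and reasoning about state so much simpler.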

One thing I was looking for in the chapter was how people deal with stream processing when you need to join the stream back to another data source, either another stream (to find related messages) or a database (to enrich messages). It turns out there's no special sauce here and no free lunch. If you want to join to another stream, then you need to keep a window of recent events for both streams and emit a result whenever you find related messages. Similarly, when enriching stream messages based on data in a database, the best way to do this is to copy the entire database into the process that is handling messages so that you can make quick joins against the data. (Ouch...)
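Here's a hedged sketch of what the windowed stream-stream join looks like (names and structure are my own, not from the book): buffer recent events from each stream keyed by the join key, expire anything older than the window, and emit a pair whenever an event finds a partner on the other side.

```python
WINDOW = 60  # seconds; events further apart than this never join

def windowed_join(left_buffer, right_buffer, event, side, now):
    """Process one event = (key, timestamp, payload) arriving on `side`.

    Returns the list of (payload, matched_payload) pairs it joins with.
    """
    own, other = ((left_buffer, right_buffer) if side == "left"
                  else (right_buffer, left_buffer))
    key, ts, payload = event

    # Expire events on the opposite stream that fell out of the window.
    other[key] = [e for e in other.get(key, []) if now - e[0] <= WINDOW]

    # Emit a joined pair for every surviving partner event.
    matches = [(payload, p) for (t, p) in other[key]]

    # Remember this event so future events on the other side can join to it.
    own.setdefault(key, []).append((ts, payload))
    return matches


left, right = {}, {}
windowed_join(left, right, ("user1", 0, "click"), "left", now=0)       # no partner yet
out = windowed_join(left, right, ("user1", 10, "purchase"), "right", now=10)
print(out)  # [('purchase', 'click')] -- within the 60s window, so it joins
```

The "ouch" about database enrichment follows the same logic: the local copy of the database plays the role of `other_buffer`, except it never expires, which is why you end up holding the whole thing in the processor.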

Chapter 12: The Future of Data Systems

I have not read this chapter yet, but from Edward's description it is going to serve a really neat purpose in framing the rest of the book. Throughout the book, the author discussed big chunks of infrastructure: databases, streams, batch processors; but in chapter 12, the author reveals that he actually thinks of an entire infrastructure as an analogy to one giant, unified database. For example, a log-based message system is really just the write-ahead log that MySQL uses to ensure durability. And a search engine serves as a sort of caching component for the "infrastructure database". So I'm looking forward to reading this chapter and then reconsidering how all the other chapters are really talking about pieces of the "infrastructure as a database". ... Maybe I wish I'd read this chapter first!

Thanks for leading the discussion Edward! Thanks for your input Chang!

Chang Lee

Jun 2, 2020, 12:52:34 PM
to Penny University
I finished the book and am contributing my own notes.

Things I learned

* Isolation levels in databases: I had no idea what they were, but now I have a much better sense of how they relate to transactions.
* Appending to a log: the idea of Event Sourcing, and tools like Kafka, essentially comes down to treating data or events by appending them to a log for processing. This makes it much easier to reason about concurrent access to data.
* The idea of schema-on-read vs schema-on-write for storing data and designing applications.
* Data encodings like protobuf, Thrift, Avro and why we need this to solve forward and backward compatibility.
* Strategies databases use to solve the consistency problem in distributed systems: single-leader, multi-leader, leaderless, and the tradeoff among them.
* Database indexes = derived data. I didn't really know what indexes were, but now I know they are denormalized data for specific purposes that need to be updated each time a new row is inserted.
* Clocks are dangerous in distributed systems!
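Two of the points above (appending to a log, and indexes as derived data) fit together nicely, so here's a tiny illustrative sketch of my own (the event names are made up, not from the book): events are only ever appended to a log, and any "view" of the data is derived by folding over that log, just as an index is derived from a table.

```python
events = []  # the append-only log is the single source of truth

def record(event):
    events.append(event)

def balances():
    """Derive account balances by replaying the whole log.

    Like an index, this is derived data: it can always be rebuilt
    from the log, and it must be refreshed as new events arrive.
    """
    state = {}
    for kind, account, amount in events:
        delta = amount if kind == "deposit" else -amount
        state[account] = state.get(account, 0) + delta
    return state


record(("deposit", "alice", 100))
record(("withdraw", "alice", 30))
record(("deposit", "bob", 50))
print(balances())  # {'alice': 70, 'bob': 50}
```

Because writes are pure appends, two concurrent writers never overwrite each other's state; they just interleave in the log, and every derived view agrees on the same history.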

This book really makes me want to build better programs that work like a stream processor - I guess that's why web developers keep talking about "stateless." I can see how this idea applies everywhere and can make writing and debugging software simpler.