Re: [penny-university] Digest for penny-university@googlegroups.com - 1 update in 1 topic


Edward Ribeiro

May 9, 2020, 11:48:06 AM
to penny-un...@googlegroups.com

Thanks John and Chang for participating in the last round of discussions about DDIA book!

Your input was really helpful, insightful and appreciated. :) And thanks to the community at Penny Univ. for allowing me to organize and lead the discussions. I hope to improve quite a bit in future chats. :)

Chapter 12: Future of Data Systems

Complementing John's excellent review of chapter 12, I would add that the chapter's argument is that the most scalable, performant, fault-tolerant and flexible architecture for dealing with massive data sets and distributed software components is one based on asynchronous processing of data in a durable log queue, using either batch processing or stream processing. The book compares and contrasts this with distributed transactions and microservices architectures. Data is stored in the log, and derived data is generated by batch or stream processors reading from (and writing to) the log queues. Stream processing operates on unbounded datasets with low latency, while batch processing has higher latency and operates on large inputs of known, finite size. Stream processing allows incremental evolution of the derived data (with the addition of new fields, for example), whereas batch processing allows a complete re-derivation of new views because it has access to all the historical input data. Ordering of the events written to the queue and fault tolerance are crucial properties of this architecture.
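To make the idea concrete, here is a minimal sketch (not from the book; all names are illustrative) of deriving a view from an append-only log. Producers only ever append events; a batch-style job can rebuild the whole view by replaying from offset 0, because the log retains history:

```python
# Sketch: an append-only log and a batch-style derivation of a view.
# In a real system the log would be a durable queue such as a Kafka topic;
# here a plain list stands in for it.

log = []

def append(event):
    """Producers only ever append; the log preserves ordering."""
    log.append(event)

def derive_view(from_offset=0, view=None):
    """Replay events from a given offset to (re)build a derived view.

    Replaying from offset 0 is the batch case (full re-derivation);
    replaying from a saved offset is the incremental, stream-like case.
    """
    view = dict(view or {})
    for event in log[from_offset:]:
        view[event["user"]] = event["email"]  # last write wins, in log order
    return view

append({"user": "alice", "email": "alice@old.example"})
append({"user": "alice", "email": "alice@new.example"})
append({"user": "bob", "email": "bob@example.com"})

print(derive_view())
# {'alice': 'alice@new.example', 'bob': 'bob@example.com'}
```

Because the log is ordered and retained, the same function serves both modes: a full replay rebuilds the view from scratch, while resuming from a saved offset updates it incrementally.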

The application code runs as stream operators. Each operator takes a stream of state changes as input and produces other streams of state changes as output. The author argues that this distributed setting is similar to the internals of a database, as John mentioned. This database turned "inside out" (also known as unbundling the database) is the key idea of the chapter. The book proceeds to show how we can achieve properties like uniqueness, integrity, idempotence and timeliness (ensuring that users observe the system in an up-to-date state) in this architecture. Naturally, trade-offs must be made, and sometimes constraints like timeliness should be relaxed in favour of more critical properties like integrity.
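A toy sketch of such an operator (my own illustration, with hypothetical event fields): it consumes a stream of state changes (account deposits), keeps local state, and emits a derived stream of state changes (updated balances). Deduplicating on an event id is one simple way to get idempotence under message redelivery:

```python
# Sketch: a stream operator in the "unbundled database" style.
# State changes in -> derived state changes out.

def balance_operator(input_stream):
    balances = {}  # operator-local state
    seen = set()   # ids of processed events, for idempotence
    for event in input_stream:
        if event["event_id"] in seen:
            continue  # duplicate delivery: skip, so output stays correct
        seen.add(event["event_id"])
        acct = event["account"]
        balances[acct] = balances.get(acct, 0) + event["amount"]
        yield {"account": acct, "balance": balances[acct]}  # output stream

events = [
    {"event_id": 1, "account": "a", "amount": 100},
    {"event_id": 2, "account": "a", "amount": -30},
    {"event_id": 2, "account": "a", "amount": -30},  # redelivered duplicate
]
print(list(balance_operator(events)))
# [{'account': 'a', 'balance': 100}, {'account': 'a', 'balance': 70}]
```

The duplicate event changes nothing downstream, which is exactly the idempotence property the chapter discusses; ordering of the input stream is what makes the emitted balances deterministic.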

The second half of the chapter talks about the ethics of data acquisition, manipulation and retention. There are many open issues regarding who owns users' data and how this data can be protected from future misuse by governments or companies. The whole cycle of data acquisition, processing and retention is very opaque and treated purely as business by the companies that provide online services. There are many crucial privacy concerns still to be dealt with, and blind belief in the supremacy of data for making decisions is not only delusional but very dangerous. Predictive analytics can amplify prejudice and injustice because it takes biased data as input and produces even more biased data as output. Automated decisions can jeopardise people who then have nowhere (courts, legislation, civil society) to turn for help.

Data-oriented computing can create feedback loops that show people only opinions they already agree with (i.e., echo chambers), perpetuating stereotypes, misinformation and polarisation (Brazil has been suffering the consequences of this phenomenon in national politics since 2018, with catastrophic results so far). Simple solutions to this problem, such as not joining a given online service, are not always possible. During the talk I gave the example of WhatsApp, which is so massively popular in Brazil that not using it is really not an option, because I would be cut off from social and professional interaction with family, friends, co-workers and service providers. Coincidentally, WhatsApp has also been a powerful source of untraceable fake news used to create and control political echo chambers.

Having privacy doesn’t mean keeping everything secret. It means having the freedom to choose which things to reveal to whom, what to make public, and what to keep secret. Privacy settings that allow a user of an online service to control which aspects of their data other users can see are a starting point for handing back some control to users. Data is a valuable asset that can live well beyond the company that acquired it. Whenever we collect data, we need to balance the benefits against the risk of it falling into the wrong hands (criminals, foreign intelligence services, unscrupulous management, totalitarian regimes, etc.). When collecting data we should consider not only the current political scenario but all future governments. The author believes in self-regulation and legislation to protect users from data abuse, but legislation alone is neither sufficient nor always effective. It’s an open question how to protect users’ right to privacy while still deriving interesting applications from data.

Thanks again, John and Chang!

On Fri, May 8, 2020 at 7:58 AM <penny-un...@googlegroups.com> wrote:
>
> penny-un...@googlegroups.com Google Groups
> Topic digest
> View all topics
>
> Designing Data Intensive Applications: Part 4 - 1 Update
>
> Designing Data Intensive Applications: Part 4
> JnBrymn <jfber...@gmail.com>: May 07 10:03AM -0700
>
> Edward Ribeiro organized our last discussion for the Designing Data
> Intensive Applications and it was a great discussion!
>  
> Here are a couple of my takeaways.
>  
> *Chapter 11: Stream Processing*
>  
> This was a good review of the history of streaming. The author first went
> over the (now somewhat dated) publish-subscribe model of messaging
> systems, and then moved into discussing the new revolutionary move we've
> seen towards log-style message systems. I like the simplicity of the new
> systems over the complexity of the old. The new style of message processing
> implies fewer guarantees and fewer ways of working with messages, but there
> is strength in the simplicity because the simpler patterns are sufficient
> for most things that you would want to do and much easier to reason about.
>  
> One thing I was looking for in the chapter was how people deal with stream
> processing when you need to join the stream back to another data source,
> either another stream (to find related messages) or to a database (to
> enrich messages). It turns out, there's no special sauce here and no free
> lunch. If you want to join to another stream then you need to keep track of
> a window of time for both streams and then emit events whenever you find
> related messages. Similarly, when enriching stream messages based on data in
> a database, the best way to do this is to copy the entire database into the
> process that is handling messages so that you can make quick joins against
> the data. (Ouch...)
>  
> *Chapter 12: The Future of Data Systems*
>  
> I have not read this chapter yet, but from Edward's description it is going
> to serve a really neat purpose in framing the rest of the book. Throughout
> the book, the author discussed big chunks of infrastructure: databases,
> streams, batch processors; but in chapter 12, the author reveals that he
> actually thinks of an entire infrastructure as an analogy to one giant,
> unified database. For example, a log-based message system is really just
> the write-ahead log like MySQL uses to ensure durability. And a search
> engine kinda serves as a sort of caching component for the "infrastructure
> database". So I'm looking forward to reading this chapter and then
> reconsidering how all the other chapters are really talking about pieces
> of the "infrastructure as a database". ... Maybe I wish I'd read this
> chapter first!
>  
> Thanks for leading the discussion Edward! Thanks for your input Chang!