Improving multi-table workflows: interesting paper

Antonin Delpeuch (lists)

Sep 11, 2020, 4:37:48 PM

to openref...@googlegroups.com

Hi all,

This paper appeared on my radar:

"Table Scraps: An Actionable Framework for Multi-Table Data Wrangling
From An Artifact Study of Computational Journalism"
S Kasica, C Berret, T Munzner
https://arxiv.org/abs/2009.02373

I have only skimmed it, but it looks like a very useful read for
identifying gaps in our collection of operations, particularly when it
comes to features involving multiple tables.

Best,
Antonin

Steve Kasica

Nov 9, 2020, 5:04:02 PM

to OpenRefine Development

Thanks for sharing,

Here's a pre-recorded talk I gave at the IEEE Information Visualization (InfoVis) conference in October on this paper.

https://www.youtube.com/watch?v=woyuvqUu52I&t=1s&ab_channel=TamaraMunzner

Thad Guidry

Nov 26, 2020, 4:05:38 PM

to openref...@googlegroups.com

For me, multi-table workflows typically come with a need for schema evolution, as well as for preserving history/changesets for FAIR workflows.

There are lots of ways to deal with schema evolution, such as schema-less systems, as well as schema-based systems like RDBMS and OLAP systems, some of which now support schema evolution.

If the need is to keep snapshots and history, with the ability to undo or roll back changes, then additional complexity comes into play, especially at scale with multi-table workflows.

Two recent technologies that build on lessons learned over the past 6 years, and that I have looked at during the COVID crisis and really like, are Apache Iceberg and Project Nessie. Iceberg's Table Evolution supports schema evolution, even for nested structures. And since history and undo capabilities are needed even through table merges and change isolation, Project Nessie and its Git-like branching, tagging, etc. operations could be incorporated.

If you step back and look at OpenRefine's modeling, it sort of mimics both of those together, except that both of those technologies allow for massive scaling.

Many of our users won't need such scale; however, the features offered are closely aligned with OpenRefine's mission of transforming data, schema, etc. through exploratory analysis and a series of operations that often need to be preserved and shared to support FAIR workflows.

Thad

https://www.linkedin.com/in/thadguidry/

Steve Kasica

Nov 27, 2020, 2:22:53 PM

to OpenRefine Development

These comments are really insightful, and I wasn't aware of Iceberg or Nessie. Thanks for sharing!

I agree that schema evolution (aka schema drift) is an important challenge for data wrangling with multiple data sources / tables. I think journalists probably run into this issue frequently, like when a government agency releases new monthly or yearly data. I also notice journalists talk a lot about finding stories at the intersection of two seemingly different datasets, which I bet involves inner, outer, and maybe even anti-joins.
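To make those terms concrete, here is a minimal sketch in plain Python, assuming tables are just lists of dicts. The function names, the county tables, and the column names are all made up for illustration; none of this comes from the paper or from OpenRefine.

```python
def schema_drift(old_columns, new_columns):
    """Report columns added or removed between two releases of a table."""
    old, new = set(old_columns), set(new_columns)
    return {"added": sorted(new - old), "removed": sorted(old - new)}


def inner_join(left, right, key):
    """Rows whose key value appears in both tables, merged into one dict."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


def anti_join(left, right, key):
    """Rows from the left table whose key has no match in the right table."""
    right_keys = {row[key] for row in right}
    return [row for row in left if row[key] not in right_keys]


# Hypothetical data: two made-up tables keyed on a county name.
inspections = [
    {"county": "Adams", "violations": 12},
    {"county": "Baker", "violations": 3},
    {"county": "Clark", "violations": 7},
]
funding = [
    {"county": "Adams", "budget": 50000},
    {"county": "Clark", "budget": 20000},
]

matched = inner_join(inspections, funding, "county")
unmatched = anti_join(inspections, funding, "county")

# Hypothetical column lists for last year's and this year's release.
drift = schema_drift(
    ["county", "violations"],
    ["county", "violations", "inspector_id"],
)
```

Here the inner join keeps only the counties present in both tables, the anti-join surfaces the county that has inspections but no funding record, and the drift check flags the column that appeared in the newer release.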

Another project in my thesis is an interview study where I hope to learn more about the contexts, processes, and pain points of journalists doing any kind of multi-table wrangling, which I'm sure generalizes at some level to other data wranglers.
