Adam Brazier

Sep 23, 2022, 11:48:12 AM
to scimma
Hi all

I hope you all enjoyed your summer break! This month in our SCiMMA public talk, we'll be hearing from Rich Wolski and Kerem Celik of UCSB on community data lakes, a topic of great importance for those of us dealing with large and heterogeneous datasets. Please do forward this to others who may be interested (and apologies to those who receive this more than once).

Title: Some Experimental Thinking about Community Data Lakes

Presenters: Rich Wolski and Kerem Celik, UCSB


Increasingly, communities focused on climate change are adopting data contribution models designed to facilitate shared, collaborative research. Unlike "open data," where data is simply published online with a permissive license, community science is turning towards a contribution model similar to the open repository model for code implemented by GitHub or Bitbucket.


To facilitate community data contribution, curation, sharing, and access control, we discuss Depot: Dependency-Eager Platform of Transformations -- an open source "data lake" for facilitating the development and sharing of community-contributed data. Depot is an experimental study of data lake design that supports full data provenance and versioning as well as data management and owner-defined access control policies. In particular, it includes policy mechanisms for data retention and data quotas that span access groups to facilitate efficient cloud-based implementation. It also uses dependency tracking and lazy "data materialization" to optimize storage footprint with the goal of enabling community sustainability as a long-running service.


We describe the Depot abstractions and provide a short demonstration of the current Depot prototype service using EIA and other publicly available data sets.

Best regards

Adam Brazier (for SCiMMA)

Adam Brazier

Sep 27, 2022, 8:01:56 AM
to scimma
Today at 3pm Eastern, 2pm Central, noon Pacific! Talk details are below; hope to see you there.
