Hello Jacques,

First of all, thanks for the immediate response.

At a high level, our system manages a large amount of data (ingestion, enrichment, tiered storage, indexing, aggregation), and its "table-like" schema is managed via Drill's Iceberg. There are "long-running" data processing pipelines/jobs running on this data that require some kind of "transactional" view of the data until the WHOLE pipeline/job is done. Otherwise we are required to "stage the data for a pipeline" and "commit" once it has fully completed. (These "commits" were done using HDFS directory moves.)
As you can understand, this processing gets complex, with a lot of reference data/tables to look up and longer pipelines, while the business wants "near real-time" visibility. The natural way to go about it is to have versioned tables, where a commit represents point-in-time data/reference data, and to let the processing job refer to everything by commit ID. We also want user-level control over versioned snapshots of data. That's where Nessie's branching model seems a natural fit for us.
At a high level:
- The "main" branch is the "system" branch, where the global data set visible to business users is committed as and when it is ready.
- "Human users" get a branch forked when they start their "analytics" (which can run from hours to months), so they get consistent data for ML without "staging/copying into personal space".
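The two bullets above can be sketched as a toy in-memory model. This is only an illustration of the idea, not Nessie's API: `VersionedCatalog`, `commit`, `fork`, `read`, and the table and snapshot names are all hypothetical.

```python
import hashlib
from typing import Dict, Optional, Tuple


class VersionedCatalog:
    """Toy catalog (hypothetical): each commit is an immutable table -> snapshot map."""

    def __init__(self) -> None:
        # commit_id -> (parent commit_id, full table -> snapshot mapping)
        self.commits: Dict[str, Tuple[Optional[str], Dict[str, str]]] = {}
        # branch name -> head commit_id (None until the first commit)
        self.branches: Dict[str, Optional[str]] = {"main": None}

    def commit(self, branch: str, changes: Dict[str, str]) -> str:
        """Apply changes on top of the branch head and advance the branch."""
        parent = self.branches[branch]
        tables = dict(self.commits[parent][1]) if parent is not None else {}
        tables.update(changes)
        digest = hashlib.sha256(repr((parent, sorted(tables.items()))).encode())
        commit_id = digest.hexdigest()[:12]
        self.commits[commit_id] = (parent, tables)
        self.branches[branch] = commit_id
        return commit_id

    def fork(self, new_branch: str, source: str = "main") -> Optional[str]:
        """Create a branch at the current head of `source`; no data is copied."""
        self.branches[new_branch] = self.branches[source]
        return self.branches[new_branch]

    def read(self, commit_id: str, table: str) -> str:
        """Read a table as of a specific commit."""
        return self.commits[commit_id][1][table]


catalog = VersionedCatalog()
c1 = catalog.commit("main", {"orders": "snapshot-1"})
pinned = catalog.fork("analyst", "main")  # a long-running job pins this commit
c2 = catalog.commit("main", {"orders": "snapshot-2"})

print(catalog.read(pinned, "orders"))                    # view seen by the fork
print(catalog.read(catalog.branches["main"], "orders"))  # view seen on main
```

The fork only records a commit ID, so a job that reads everything through `pinned` keeps a stable view while "main" moves on.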
We need guidance that can better align our efforts with Nessie's near-term API/SPI changes:
- Reliable versioned storage for Nessie (other than DynamoDB; we can use Postgres here): We are okay with depending on PG-specific features to start with. We can contribute, but we need some guidance on the final SPI we should implement. We have done some analysis of how Nessie stores data in DynamoDB and found that five main structures are used to create the Git-like branching structure:
  - Ref (where the branch and hash are stored): Contains the commit history reference from L1 and the current hash, which refers to the key of L2.
  - L1: All commit operations are stored here (adding, removing or mutating a key). The L1 also stores the corresponding L2 key and the list of parent L1 pointers.
  - L2: Points to the corresponding L3.
  - L3: Stores the current state of the keys and their respective value pointers.
  - Value: Stores the value information.
- Fast versioned storage (a commit in under 100 ms for <512Kb of metadata per commit, on 100 branches concurrently): We believe Nessie can do this at its core; this is just our internal SLA.
- Keep Iceberg interface compliance: Frankly, we are still working out how the metadata plugin in Drill can be made version-aware (a commit ID carried in the JDBC session by the driver?). (Yes, this is more of a Drill question, but I am fully aware of whom I am talking to here.)
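The five structures listed above could be modelled roughly as follows. This is a minimal sketch based only on the description in this message, not on Nessie's actual schema: all class and field names are hypothetical, and the ref is simplified to point directly at its head L1 entry.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Value:
    payload: str  # the stored value information (placeholder string here)


@dataclass
class L3:
    # current state of each key -> id of its Value
    key_to_value: Dict[str, str] = field(default_factory=dict)


@dataclass
class L2:
    l3_id: str  # points to the corresponding L3


@dataclass
class L1:
    operations: List[str]  # e.g. "add orders", "remove users"
    l2_id: str             # corresponding L2 key
    parents: List[str] = field(default_factory=list)  # parent L1 ids


@dataclass
class Ref:
    name: str     # branch name
    head_l1: str  # assumption: the current hash identifies the head L1


@dataclass
class Store:
    refs: Dict[str, Ref] = field(default_factory=dict)
    l1s: Dict[str, L1] = field(default_factory=dict)
    l2s: Dict[str, L2] = field(default_factory=dict)
    l3s: Dict[str, L3] = field(default_factory=dict)
    values: Dict[str, Value] = field(default_factory=dict)

    def resolve(self, ref_name: str, key: str) -> Optional[str]:
        """Walk Ref -> L1 -> L2 -> L3 -> Value for one key."""
        l1 = self.l1s[self.refs[ref_name].head_l1]
        l3 = self.l3s[self.l2s[l1.l2_id].l3_id]
        value_id = l3.key_to_value.get(key)
        return None if value_id is None else self.values[value_id].payload


store = Store()
store.values["v1"] = Value("orders-metadata-v1")
store.l3s["l3-a"] = L3({"orders": "v1"})
store.l2s["l2-a"] = L2("l3-a")
store.l1s["l1-a"] = L1(operations=["add orders"], l2_id="l2-a")
store.refs["main"] = Ref("main", head_l1="l1-a")

print(store.resolve("main", "orders"))
```

In this sketch each lookup is one dictionary access per level. In a relational backend each level would presumably become a keyed row lookup, which is where the number of round trips per commit or read starts to matter.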
Thanks and regards,
Jenil Shah
On Fri, Oct 30, 2020 at 11:37 PM Jacques Nadeau <jac...@dremio.com> wrote:

I've definitely been thinking a lot about it. One of the challenges is finding the right way to minimize round trips to achieve commit rates similar to Dynamo. It's easier with an RDBMS like Postgres, which has first-class support for arrays, than with systems like MySQL, which don't.

Can you share more about your preferences and environment, so we can understand the different options that might work?

Thanks,
Jacques
--
Jacques Nadeau
CTO and Co-Founder, Dremio

--
Jenil Shah