How to handle data source refresh

Pala Muthiah

Jan 11, 2018, 2:33:47 PM
to Druid User
Hi,

We have a use case where the dimensions of a data source are changed, and the entire data source needs to be re-ingested to account for the new dimensions. While re-ingestion is running for a data source, the data source itself is offline until the re-ingestion completes.

What's the best way to handle this? I'd like to keep the data source online while the re-ingestion is happening. Are there ingestion features I can use to keep the data source online while reindexing is in progress?

I've worked a bit on view support in Druid SQL. In the SQL case, we can solve this by registering a view that references different versions of the underlying data source (e.g. CREATE VIEW data_src AS SELECT * FROM data_src_v1). Could we similarly implement 'alias' support for data source names, so we have the same solution for native JSON queries as well? What do folks here think?
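To make the proposal concrete, here is a rough sketch of the alias idea in Python. Nothing below is an existing Druid API; the alias table and rewrite function are hypothetical, illustrating a broker-side mapping that resolves a public datasource name to whichever versioned datasource is currently live:

```python
# Hypothetical sketch of datasource aliasing for native (JSON) queries.
# ALIASES and resolve_datasource are NOT Druid APIs -- they just show how
# an alias table could rewrite the "dataSource" field of an incoming query
# before it is routed, so clients keep using one stable name.

ALIASES = {"data_src": "data_src_v1"}  # flip to "data_src_v2" once reingestion finishes

def resolve_datasource(query: dict) -> dict:
    """Return a copy of the query with its dataSource alias resolved."""
    resolved = dict(query)
    name = query["dataSource"]
    resolved["dataSource"] = ALIASES.get(name, name)  # pass through non-aliased names
    return resolved

query = {"queryType": "timeseries", "dataSource": "data_src", "granularity": "day"}
print(resolve_datasource(query)["dataSource"])  # -> data_src_v1
```

Flipping the alias after all re-ingestion jobs succeed would make the cutover atomic from the client's point of view, much like the SQL view above.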



Thanks,
pala

Nishant Bangarwa

Jan 13, 2018, 5:24:01 AM
to druid...@googlegroups.com
I'm not sure what you mean by the datasource going offline.
FWIW, you can re-index data using batch ingestion and run queries on the existing older segments at the same time. Once the batch ingestion completes, new segments are created and loaded on the historical nodes; older segments are dropped only after the newer-version segments are loaded. So at any point during ingestion you can still query your old data.
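As a sketch of that workflow: a batch re-ingestion task submitted to the Overlord writes new, higher-version segments under the same datasource name, and queries keep hitting the old segments until the new ones are loaded. The `/druid/indexer/v1/task` endpoint is Druid's task-submission API; the spec below is heavily abbreviated, and the host and input paths are placeholders:

```python
import json

# Hedged sketch: submit a Hadoop batch re-ingestion task to the Overlord.
# The spec is abbreviated (a real one needs parser, metrics, granularity,
# tuning config, etc.); host and paths are placeholders.
OVERLORD = "http://overlord:8090"  # hypothetical host

spec = {
    "type": "index_hadoop",
    "spec": {
        # Same dataSource name as the live one: the new segments get a
        # higher version and atomically replace the old ones once loaded.
        "dataSchema": {"dataSource": "data_src"},
        "ioConfig": {
            "type": "hadoop",
            "inputSpec": {"type": "static", "paths": "hdfs://example/input"},
        },
    },
}

# Compose the submission command rather than executing it here.
submit_cmd = (
    f"curl -X POST {OVERLORD}/druid/indexer/v1/task "
    f"-H 'Content-Type: application/json' -d '{json.dumps(spec)}'"
)
print(submit_cmd)
```

During the job, historicals continue serving the old segment versions; the drop of old segments happens only after the coordinator sees the new versions loaded, which is why the datasource stays queryable throughout.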

--
You received this message because you are subscribed to the Google Groups "Druid User" group.
To unsubscribe from this group and stop receiving emails from it, send an email to druid-user+...@googlegroups.com.
To post to this group, send email to druid...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/druid-user/eb2c47c5-a173-441b-b6b2-a8c6261b80c9%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Pala Muthiah

Jan 18, 2018, 8:12:08 PM
to druid...@googlegroups.com
Thanks Nishant.

I spoke to our internal user; here is a better explanation of the scenario:

- We are changing dimension sets, value bucketing, or other aggregation features, so we need to re-ingest the entire dataset.

- The input data for this data source is very large (~17 billion rows). A single Hadoop batch ingestion job over the entire dataset does not complete successfully; it fails after running for more than 24 hours. Therefore we break the work into a few chunks by time window and submit one Hadoop ingestion job per window.

- Prior to re-ingestion, we have to delete the datasource first (by issuing an HTTP DELETE on the datasource REST endpoint, which I believe removes the segment metadata). The reason is that if we don't, and only some of the Hadoop ingestion jobs succeed while others fail, we end up in a weird intermediate state with old and new data mixed.
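For reference, the delete step described above maps to the Coordinator's datasource endpoint: `DELETE /druid/coordinator/v1/datasources/{name}` disables the datasource by marking its segments unused in the metadata store. The host below is hypothetical; only the path is Druid's:

```python
# Hedged sketch of the cleanup step described above. The DELETE endpoint
# on the Coordinator disables a datasource (marks its segments unused);
# the host name here is a placeholder.
COORDINATOR = "http://coordinator:8081"

def disable_url(datasource: str) -> str:
    """Build the Coordinator URL that disables the given datasource."""
    return f"{COORDINATOR}/druid/coordinator/v1/datasources/{datasource}"

print("curl -X DELETE " + disable_url("data_src"))
```

Note that this step is exactly what takes the datasource offline. One alternative, in the spirit of the view/alias idea earlier in the thread, is to ingest each chunk into a fresh versioned name (e.g. data_src_v2) and cut clients over only after all chunk jobs succeed, so the intermediate mixed state is never visible.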


Are there other ways I'm not aware of to address this scenario while still keeping the datasource online?



