Duplicate data handling during batch ingestion


Manish Deora

Jan 14, 2016, 7:18:43 AM
to Druid User
Are there any ways to avoid duplicate data ingestion in Druid? I couldn't find any in the documentation.
Also, what metric spec should be used to count the unique values of a particular dimension?

Fangjin Yang

Jan 16, 2016, 2:16:10 PM
to Druid User
Hi Manish, if you use batch ingestion, it should be 100% accurate in terms of the data you put in. If you are looking for exactly-once streaming ingestion, we are working towards this for Kafka->Druid, and you should follow this PR:

https://github.com/druid-io/druid/pull/2220
Manish Deora

Jan 16, 2016, 2:26:28 PM
to druid...@googlegroups.com
Hi Fangjin,

Kafka -> Druid exactly-once ingestion is good, but I am asking about duplicate handling during batch ingestion. Say we ingest 100k data points every 30 minutes and want to ensure that the same data points were not already ingested by the previous 30-minute ingestion job.

We don't want to do any checks before ingesting.

Are there any possible ways to handle this in Druid?

Also, even if I do ingest duplicates, how do I get the unique count considering all dimensions and metric values?

Thanks
Manish

Fangjin Yang

Jan 19, 2016, 8:38:46 PM
to Druid User
Hi Manish, can you do the de-duplication as part of your ETL layer? Druid doesn't have anything built in natively to handle de-duplication. When you reindex data in Druid, it creates new versions of segments that obsolete older versions for the same interval of time.
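
For example, a minimal Python sketch of that ETL-layer de-duplication (the record layout, the hashing scheme, and how you persist the previous batch's hashes are illustrative assumptions here, not anything Druid provides):

import hashlib
import json

def record_hash(record):
    # Stable hash over all dimensions and metric values.
    canonical = json.dumps(record, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def dedupe_batch(records, previous_hashes):
    # Drop rows already seen in the previous batch (or earlier in this one).
    seen = set(previous_hashes)
    unique = []
    for rec in records:
        h = record_hash(rec)
        if h not in seen:
            seen.add(h)
            unique.append(rec)
    return unique

# Two overlapping 30-minute batches.
batch1 = [{"ts": "2016-01-16T10:00:00Z", "page": "a", "count": 1}]
batch2 = [{"ts": "2016-01-16T10:00:00Z", "page": "a", "count": 1},  # duplicate
          {"ts": "2016-01-16T10:31:00Z", "page": "b", "count": 1}]

previous = {record_hash(r) for r in batch1}
print(dedupe_batch(batch2, previous))  # only the "page": "b" row survives

You would run this between your raw data source and the Druid indexing task, persisting the hash set from each run so the next run can check against it.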

