how to ensure no duplicates?


Prashant Deva

Aug 18, 2014, 3:17:37 AM
to druid-de...@googlegroups.com
Is there any way to ensure that if, for some reason, a single row is accidentally inserted twice, the duplicate makes no change to the data source?


Prashant Deva

Aug 18, 2014, 3:26:48 AM
to druid-de...@googlegroups.com
To clarify my question: it is possible for a buggy client to accidentally send the same data twice.
In this case, we want to essentially discard the data that was sent more than once.

How can we do this practically (without, say, querying Druid before every single insert)?

Fangjin Yang

Aug 19, 2014, 12:40:56 PM
to druid-de...@googlegroups.com
Hi Prashant, as explained over IRC, trying to guarantee exactly-once semantics is extremely difficult given the current state of open source and enterprise solutions in this space. We run batch fixup to clean up imperfections.

Prashant Deva

Aug 19, 2014, 12:44:01 PM
to druid-de...@googlegroups.com
Regarding the IRC chat, I had this question:


I watched the whole Strata conference video and have a question about your ETL process.
The way the video describes it, it seems you:

1. Ingest the same data into both Hadoop and the realtime nodes.
2. The realtime nodes handle realtime queries on data that is slightly inaccurate.
3. After a while, Hadoop has finished cleaning the data and you can give precise results.

However...

The way Druid works, the realtime nodes hand their data off to historical nodes, which persist that data. So if your Hadoop cluster also pushes its cleaned data to historical nodes, you still end up with duplicated data in Druid.

Am I correct?

In fact, this way you always end up with two copies of the same data in Druid.
How and when is the data from the realtime nodes discarded?

Prashant



Xavier Léauté

Aug 19, 2014, 2:04:23 PM
to druid-de...@googlegroups.com
Prashant, Druid doesn't concern itself with duplicate data per se. If you send it the data twice, it will ingest it twice and aggregate it. This is usually the desired behavior for most applications, since Druid was built to aggregate large amounts of data in real-time.

For our purposes, de-duplication happens on a set of dimensions which is larger than the fields we put into Druid. We run a separate map-reduce job to remove duplicates prior to submitting the data through the Hadoop indexing task. This is not something Druid provides directly, since data cleaning is very much application specific and you would probably want to do other cleanup as part of the de-duplication step.
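
As an illustration only, here is a minimal single-process sketch of that kind of pre-ingestion de-duplication, assuming newline-delimited JSON events and a made-up set of dedup keys; the actual job described above is a Hadoop map-reduce, and all field names here are hypothetical:

# Illustrative sketch: drop duplicate events, keyed on a set of fields that is
# larger than the dimensions sent to Druid, before handing the cleaned file to
# the Hadoop indexing task. Field names are made up.
import json

DEDUP_KEYS = ("timestamp", "user_id", "event_id")   # hypothetical key set

def dedup(in_path, out_path):
    seen = set()
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            event = json.loads(line)
            key = tuple(event.get(k) for k in DEDUP_KEYS)
            if key in seen:
                continue  # drop rows we have already seen
            seen.add(key)
            dst.write(line)

dedup("raw_events.json", "deduped_events.json")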

Prashant Deva

Aug 19, 2014, 2:10:09 PM
to druid-de...@googlegroups.com
I understand the part about Druid not concerning itself with duplicates.

What I am curious about is your de-duplication process.

I understand that you use Hadoop to clean up the duplicates.
However, in your talk you said you still feed the realtime data through the realtime nodes.

If that is indeed the case, then aren't you feeding Druid the same data twice, once through the realtime nodes and again through your Hadoop job?

Xavier Léauté

Aug 19, 2014, 2:36:44 PM
to druid-de...@googlegroups.com
Yes, we feed the data twice. The way Druid works is that any new segments created using batch ingestion will replace the segments created via realtime ingestion.


Prashant Deva

Aug 19, 2014, 2:38:53 PM
to druid-de...@googlegroups.com
Ah, I see. Is that the default Druid behavior (replacing realtime segments with batch segments), or does it need to be configured?
Is there any documentation that explains this in more detail?

Prashant


Fangjin Yang

Aug 19, 2014, 6:49:27 PM
to druid-de...@googlegroups.com
No, nothing should need to be configured.


Gian Merlino

Aug 20, 2014, 4:45:33 PM
to druid-de...@googlegroups.com
The default behavior for batch ingestion with a given interval is to replace all prior data for that interval.

Deepak Jain

Aug 20, 2014, 9:14:34 PM
to druid-de...@googlegroups.com
I think it's all about the interval. You ingest data via realtime/batch for interval T1, and if you later index data for that same interval T1 (most likely through batch, since your realtime node will be ingesting the current hour's data), Druid will mark the later data with a higher version (MVCC). When you query, the historicals will pick data from the latest version of the segment for T1.

I am not sure when the older version of the segment for T1 gets dropped, but presumably it eventually does.
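
As a toy illustration of that versioning behavior (this is not Druid code; segment metadata is reduced to an interval string, a version, and a source):

# Toy model of MVCC-style segment versioning: for each interval, queries read
# only the segments carrying the highest version for that interval.
segments = [
    {"interval": "2014-08-20T00/2014-08-20T01", "version": "v1", "source": "realtime"},
    {"interval": "2014-08-20T00/2014-08-20T01", "version": "v2", "source": "batch"},
    {"interval": "2014-08-20T01/2014-08-20T02", "version": "v1", "source": "realtime"},
]

latest = {}
for seg in segments:
    current = latest.get(seg["interval"])
    if current is None or seg["version"] > current["version"]:
        latest[seg["interval"]] = seg

for interval, seg in sorted(latest.items()):
    print(interval, "->", seg["source"], seg["version"])
# The batch (v2) segment shadows the realtime (v1) segment for the first hour.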

Xavier Léauté

Aug 20, 2014, 9:24:04 PM
to druid-de...@googlegroups.com
A segment gets dropped when its entire interval has been overshadowed by newer segments.

Segments can be of different lengths and span various ranges; Druid knows how to pick different subsets from different segments as long as each segment has all the data for the interval it advertises.
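
A toy sketch of that overshadowing rule, with intervals simplified to half-open integer ranges (real Druid uses ISO-8601 intervals and its own timeline logic):

# Toy overshadow check: an old segment can be dropped once newer segments
# together cover its entire interval. Intervals are (start, end) half-open
# ranges in arbitrary hour offsets.
def fully_covered(old, newer):
    start, end = old
    covered_until = start
    for s, e in sorted(newer):
        if s > covered_until:
            return False          # a gap the newer segments do not cover
        covered_until = max(covered_until, e)
        if covered_until >= end:
            return True
    return covered_until >= end

old_realtime_segment = (0, 3)                  # e.g. hours 00:00-03:00
newer_batch_segments = [(0, 1), (1, 2), (2, 4)]
print(fully_covered(old_realtime_segment, newer_batch_segments))  # True -> droppable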

Prashant Deva

Aug 20, 2014, 9:26:30 PM
to druid-de...@googlegroups.com
When the segments are dropped, are they dropped from deep storage and MySQL too, or just from the historical nodes?

Prashant


Gian Merlino

Aug 20, 2014, 9:28:50 PM
to druid-de...@googlegroups.com
Only from the historical nodes. You can delete them from deep storage manually, if you want, or run a "kill" task through the indexing service.
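
For reference, a sketch of submitting such a kill task to the indexing service, assuming the standard overlord task endpoint; the host, port, datasource name, and interval below are placeholders:

# Sketch: POST a "kill" task to the overlord to permanently delete unused
# segments for a given interval. All values below are placeholders.
import json
import urllib.request

task = {
    "type": "kill",
    "dataSource": "my_datasource",
    "interval": "2014-08-01T00:00:00/2014-08-02T00:00:00",
}

req = urllib.request.Request(
    "http://overlord-host:8090/druid/indexer/v1/task",   # assumed overlord address
    data=json.dumps(task).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
print(urllib.request.urlopen(req).read())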

Prashant Deva

Aug 20, 2014, 9:31:39 PM
to druid-de...@googlegroups.com
Just curious as to why this is the case. Also, in the case of the ETL layer, where the data is constantly being cleaned up by Hadoop and ingested via the indexing service, wouldn't deleting from deep storage have to be done anyway (manually in this case, since you describe that it is not automated)?

Prashant


Fangjin Yang

Aug 21, 2014, 12:56:13 PM
to druid-de...@googlegroups.com
Segments are dropped from Druid but remain stored in deep storage in case you ever want to reload them. Also, if your entire historical cluster goes down, you still don't lose data, as all segments can be reloaded from deep storage. Deep storage provides potentially permanent storage for data, such that data is never lost no matter what occurs in your Druid cluster.


Xavier Léauté

Aug 21, 2014, 1:39:04 PM
to druid-de...@googlegroups.com
The way Druid works mainly stems from the fact that we refrain from deleting any data unless it becomes a necessity (e.g. storage cost). Overshadowed segments do not get automatically deleted. You can think of it as an automatic backup of segments.

This allows for recovery in case batch ingestion replaces good data with corrupted data because of some upstream system failure. This way you don't have to wait to fix your data and re-ingest; you can manually remove the corrupt segments and reinstate the shadowed segments.

Since different organizations usually have very different approaches to data retention, Druid leaves it up to the user to define how and when to delete data. The kill and archive tasks are there to pick out the overshadowed segments and either delete them outright or move them to a different storage location. Typically you would run those tasks on a schedule to clean up a specified data interval.
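
A sketch of what running such cleanup on a schedule might look like, reusing the kill-task submission pattern from earlier in the thread; the retention window, overlord address, and datasource name are hypothetical:

# Sketch of a scheduled cleanup: submit a kill task for the one-day interval
# that has just aged out of a hypothetical retention window. All names and
# addresses are placeholders.
import datetime
import json
import urllib.request

RETENTION_DAYS = 90                                       # hypothetical retention window
OVERLORD_TASK_URL = "http://overlord-host:8090/druid/indexer/v1/task"  # assumed address

def cleanup_expired(data_source):
    cutoff = datetime.date.today() - datetime.timedelta(days=RETENTION_DAYS)
    interval = "{}/{}".format(cutoff - datetime.timedelta(days=1), cutoff)
    task = {"type": "kill", "dataSource": data_source, "interval": interval}
    req = urllib.request.Request(
        OVERLORD_TASK_URL,
        data=json.dumps(task).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)

cleanup_expired("my_datasource")   # e.g. invoked once a day from cron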

