Is tuple de-duplication usefull?

肖康(Kang Xiao)

unread,

Dec 14, 2011, 4:38:50 AM12/14/11

to storm...@googlegroups.com

The message guranting api will call the spout's fail() interface to resend tuple if a tuple is timeout to avoid message lost. Resending a tuple in fail() may cause some bolt receive duplicated tuple. The duplicated tuple maybe harmfull in some case such as counting will get more than count.

We have tried a dedup code to avoid the harmness of duplicated tuple. Is it usefull for others? Or some other suggesstion to deal with duplicated tuple?

--
Best Regards!

肖康(Kang Xiao,<kxiao...@gmail.com>)
Distributed Software Engineer @ Baidu Inc.

qing

unread,

Dec 15, 2011, 2:40:01 AM12/15/11

to storm-user

Hi Kang,
How do you implement dedupe? The systematic way is doing checkpoint/
persistency thus
avoid global resending.

On Dec 14, 5:38 pm, 肖康(Kang Xiao) <kxiao.ti...@gmail.com> wrote:
> The message guranting api will call the spout's fail() interface to resend
> tuple if a tuple is timeout to avoid message lost. Resending a tuple in
> fail() may cause some bolt receive duplicated tuple. The duplicated tuple
> maybe harmfull in some case such as counting will get more than count.
>
> We have tried a dedup code to avoid the harmness of duplicated tuple. Is it
> usefull for others? Or some other suggesstion to deal with duplicated tuple?
>
> --
> Best Regards!
>

> 肖康(Kang Xiao,<kxiao.ti...@gmail.com>)
> Distributed Software Engineer @ Baidu <http://www.baidu.com> Inc.

nathanmarz

unread,

Dec 15, 2011, 4:32:48 AM12/15/11

to storm-user

There's no universal way to "de-duplicate". What you're asking for are
very complex transaction semantics across multiple distributed
systems. It's just not practical in anyway. You also need to realize
that such a scheme would need to work under extreme fault-tolerance
conditions: any node, process, or combination can go down at any
point. There are some cases where you can "de-dup", but there's no way
of doing so for all cases. So you should design your applications
appropriately for the at-least-once guarantee that Storm provides. I
recommend using a combination of batch and realtime processing as I
discussed in this post: http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html

-Nathan

Joy Banerjee

unread,

Dec 15, 2011, 8:59:57 AM12/15/11

to storm-user

How did you achieve de-duplications in your case ?

On Dec 14, 2:38 pm, 肖康(Kang Xiao) <kxiao.ti...@gmail.com> wrote:
> The message guranting api will call the spout's fail() interface to resend
> tuple if a tuple is timeout to avoid message lost. Resending a tuple in
> fail() may cause some bolt receive duplicated tuple. The duplicated tuple
> maybe harmfull in some case such as counting will get more than count.
>
> We have tried a dedup code to avoid the harmness of duplicated tuple. Is it
> usefull for others? Or some other suggesstion to deal with duplicated tuple?
>
> --
> Best Regards!
>

> 肖康(Kang Xiao,<kxiao.ti...@gmail.com>)
> Distributed Software Engineer @ Baidu <http://www.baidu.com> Inc.

Nathan Marz

unread,

Dec 15, 2011, 4:52:53 PM12/15/11

to storm...@googlegroups.com

As described in that post, our systems are built with a combination of batch processing and realtime processing. So if anything ever goes wrong in our realtime layer, it gets auto-corrected within a couple hours.

--
Twitter: @nathanmarz
http://nathanmarz.com

Richards Peter

unread,

Dec 16, 2011, 12:51:22 AM12/16/11

to storm...@googlegroups.com

Hi Nathan,

We are interested to know the logic that can be used for auto correction

2011/12/16 Nathan Marz <natha...@gmail.com>

Richards Peter

unread,

Dec 16, 2011, 12:56:19 AM12/16/11

to storm...@googlegroups.com

Hi Nathan,

Can you just tell us the logic that can capture real time errors during batch processing? I know it would be specific to the application. But can you provide us an example? Also can you specify the kind of auto correction that is employed in such cases?

Richards.

Nathan Marz

unread,

Dec 16, 2011, 3:47:30 AM12/16/11

to storm...@googlegroups.com

It's really simple. All data is written to both batch and realtime. Since you're looking at all the data at once in batch, it's really easy to ignore duplicate data when counting.

Continuing with the counting example, we get the portion of the count older than a few hours ago from the batch store, and then the rest from our realtime indexes. The realtime count is just the sum from those two datastores.

I go into more detail on this thread: http://groups.google.com/group/storm-user/browse_thread/thread/b26279a6f4f551f9/aeb35e83f08c1681?#aeb35e83f08c1681

-Nathan

肖康(Kang Xiao)

unread,

Jan 16, 2012, 11:56:01 AM1/16/12

to storm...@googlegroups.com

I am very sorry for late response.

There's no universal way to "de-duplicate".

There are some cases where you can "de-dup", but there's no way of doing so for all cases.

Thanks Nathan's advice and I really agree with you. And on the other hand I have implemented a framework to deal with the cases where "de-dedup" may be useful and applicable. I'd like to get more advice for the code https://github.com/xiaokang/storm-dedup.

--
Best Regards!

肖康(Kang Xiao,<kxiao...@gmail.com>)
Distributed Software Engineer @ Baidu Inc.

Nathan Marz

unread,

Jan 17, 2012, 3:51:52 AM1/17/12

to storm...@googlegroups.com

I think the upcoming TransactionalSpout stuff in 0.7.0 will subsume the need for something like this. It enables idempotence for algorithms that are vulnerable to message replay, like counting.

肖康(Kang Xiao)

unread,

Jan 17, 2012, 4:24:10 AM1/17/12

to storm...@googlegroups.com

Thanks nathan! I will try to read the code about TransactionalSpout.

2012/1/17 Nathan Marz <natha...@gmail.com>

Reply all

Reply to author

Forward