De-duplicating Site Streams


Marc Mims

Oct 26, 2010, 11:55:10 PM
to Twitter Development Talk
De-duplicating statuses in the Streaming API is fairly straightforward.
But with Site Streams, where a single status might be received multiple
times for multiple mentioned users, and/or as favorites, it is a bit
more difficult.

I'm wondering if anyone can offer advice on an efficient method for
de-duplicating Site Streams.

-Marc

M. Edward (Ed) Borasky

Oct 27, 2010, 2:05:05 PM
to twitter-deve...@googlegroups.com, Marc Mims, Twitter Development Talk
Quoting Marc Mims <marc...@gmail.com>:

If you're talking about building something "massively scalable" for
some value of "massive", you're getting into the realm of "NoSQL"
databases. I *think* Cassandra has a Perl interface but I haven't
looked at it recently. I'm by no means an expert on NoSQL databases -
I just picked Cassandra because Twitter uses it for some things.

--
M. Edward (Ed) Borasky
http://borasky-research.net http://twitter.com/znmeb

"A mathematician is a device for turning coffee into theorems." - Paul Erdos

Scott Wilcox

Oct 27, 2010, 2:09:40 PM
to twitter-deve...@googlegroups.com
Hi Marc,

I'd throw my hat in for MongoDB; it's extremely fast and I now adore it. Pop me a message on Twitter if you'd like to discuss it more.

Scott.

John Kalucki

Oct 31, 2010, 11:30:07 PM
to twitter-deve...@googlegroups.com
Create two in-memory hash sets of seen ids. Write ids to both. If the id is found on write, discard. Alternately expire them every few tens of minutes to bound growth while still providing continuous coverage.
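
One way to read the two-set scheme, as a minimal Python sketch rather than a definitive implementation: every key is written to both sets, and the sets are cleared in alternation, so a key stays visible for between one and two intervals after it was last seen. The class name `RotatingSeenSet`, the `interval` default of 20 minutes, and the injectable `clock` parameter are all assumptions made for illustration.

```python
import time


class RotatingSeenSet:
    """Two overlapping sets of seen keys, cleared alternately (illustrative sketch)."""

    def __init__(self, interval=1200.0, clock=time.monotonic):
        self.interval = interval      # seconds between expirations (assumed value)
        self.clock = clock            # injectable so the behavior can be tested
        self.sets = [set(), set()]
        self.next_expire = 0          # index of the set to clear next
        self.deadline = clock() + interval

    def _rotate(self):
        # Clear one set per elapsed interval, alternating between the two.
        now = self.clock()
        while now >= self.deadline:
            self.sets[self.next_expire].clear()
            self.next_expire ^= 1
            self.deadline += self.interval

    def seen_before(self, key):
        """Record key in both sets; return True if it was already present."""
        self._rotate()
        duplicate = key in self.sets[0] or key in self.sets[1]
        self.sets[0].add(key)
        self.sets[1].add(key)
        return duplicate
```

Because only one set is emptied at a time, the other still remembers everything written during the previous interval, which is what keeps coverage continuous while memory stays bounded.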

-John



Marc Mims

Nov 1, 2010, 3:18:11 PM
to twitter-deve...@googlegroups.com
* John Kalucki <jo...@twitter.com> [101031 20:30]:

> Create two in-memory hash sets of seen ids. Write ids to both. If the id is
> found on write, discard. Alternately expire them every few tens of
> minutes to bound growth while still providing continuous coverage.

That's what I'm doing now for the Streaming API and it works very well.
But in the Site Streams API, I might receive the same ID several times
in context of different users (for_user).

E.g., status N mentions users A, B, and C. In addition it is favorited
by user D. If I'm following all 4 users with Site Streams,
I'll see N 4 times in 4 different messages. However, if any of those
messages is repeated, I need to discard the repeats.

So, I can't simply track status IDs like I do in the Streaming API. I
need to track for_user/type/status_id.

Or am I missing something here?

-Marc

Mark McBride

Nov 1, 2010, 3:25:59 PM
to twitter-deve...@googlegroups.com
Isn't this a matter of just changing the keys? status_id becomes user_id + ":" + status_id?

-Mark

Marc Mims

Nov 1, 2010, 3:32:14 PM
to twitter-deve...@googlegroups.com
* Mark McBride <mmcb...@twitter.com> [101101 12:26]:

> Isn't this a matter of just changing the keys? status_id becomes
> user_id + ":" + status_id?

Yes. Probably needs to be user_id/type/status_id to accommodate the
case where a user favorites a status she was mentioned in. We'd get
that one twice: once for the mention and again for the favorite.
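
A rough Python sketch of that composite key, under stated assumptions: each Site Streams message is taken to be an envelope with `for_user` and `message` fields, and a favorite is taken to arrive as an event whose `target_object` is the favorited status. Those field names, and the helpers `dedup_key` and `deduplicate`, are illustrative guesses, not a confirmed description of the API.

```python
def dedup_key(envelope):
    """Build a (for_user, type, status_id) key from an assumed envelope shape."""
    for_user = envelope["for_user"]
    message = envelope["message"]
    if message.get("event") == "favorite":
        # Assumed: the favorited status is carried in "target_object".
        return (for_user, "favorite", message["target_object"]["id"])
    # Assumed: anything else is a plain status with its own "id".
    return (for_user, "status", message["id"])


def deduplicate(envelopes):
    """Yield each envelope once per distinct (for_user, type, status_id)."""
    seen = set()
    for envelope in envelopes:
        key = dedup_key(envelope)
        if key in seen:
            continue  # exact repeat for the same user and type: discard
        seen.add(key)
        yield envelope
```

With this key, a mention and a favorite of the same status for the same user are both kept, while a repeated delivery of either one is dropped. The plain `set` here grows without bound, so in practice it would need some form of expiry.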

-Marc
