De-duplicating statuses in the Streaming API is fairly straightforward. But with Site Streams, where a single status might be received multiple times for multiple mentioned users, and/or as favorites, it is a bit more difficult.
I'm wondering if anyone can offer advice on an efficient method for de-duplicating Site Streams.
> De-duplicating statuses in the Streaming API is fairly straightforward.
> But with Site Streams, where a single status might be received multiple
> times for multiple mentioned users, and/or as favorites, it is a bit
> more difficult.
>
> I'm wondering if anyone can offer advice on an efficient method for
> de-duplicating Site Streams.
>
> -Marc
If you're talking about building something "massively scalable" for some value of "massive", you're getting into the realm of "NoSQL" databases. I *think* Cassandra has a Perl interface but I haven't looked at it recently. I'm by no means an expert on NoSQL databases - I just picked Cassandra because Twitter uses it for some things.
Create two in-memory hash sets of seen ids. Write ids to both. If the id is found on write, discard. Expire the two sets alternately, every few tens of minutes, to bound growth while still providing continuous coverage.
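A minimal Python sketch of one way to read the two-set suggestion above. The class name `RotatingSeenSet`, the 20-minute default interval, and the injectable `clock` parameter are illustrative assumptions, not anything specified in the thread. Every key is written to both sets, and the sets are cleared in alternation, so a key is remembered for at least one interval and at most two.

```python
import time


class RotatingSeenSet:
    """Remembers recently seen keys in two sets that are cleared alternately."""

    def __init__(self, interval_seconds=1200.0, clock=time.monotonic):
        # interval_seconds and clock are assumptions made for this sketch.
        self._sets = [set(), set()]
        self._interval = interval_seconds
        self._clock = clock
        self._next_clear = clock() + interval_seconds
        self._victim = 0  # index of the set to clear next

    def _maybe_expire(self):
        # Clear one set per elapsed interval, alternating between the two,
        # so the other set still covers recently seen keys.
        now = self._clock()
        while now >= self._next_clear:
            self._sets[self._victim].clear()
            self._victim = 1 - self._victim
            self._next_clear += self._interval

    def seen_before(self, key):
        """Record key; return True if it was already present (a duplicate)."""
        self._maybe_expire()
        duplicate = key in self._sets[0] or key in self._sets[1]
        for s in self._sets:
            s.add(key)
        return duplicate
```

A caller would discard any message for which `seen_before(key)` returns `True`. Memory is bounded by roughly two intervals' worth of keys, at the cost of missing a repeat that arrives more than one interval after the key was last seen.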
On Tue, Oct 26, 2010 at 8:55 PM, Marc Mims <marc.m...@gmail.com> wrote:
> De-duplicating statuses in the Streaming API is fairly straightforward.
> But with Site Streams, where a single status might be received multiple
> times for multiple mentioned users, and/or as favorites, it is a bit
> more difficult.
>
> I'm wondering if anyone can offer advice on an efficient method for
> de-duplicating Site Streams.
> Create two in-memory hash sets of seen ids. Write ids to both. If the id is
> found on write, discard. Expire the two sets alternately, every few tens of
> minutes, to bound growth while still providing continuous coverage.
That's what I'm doing now for the Streaming API and it works very well. But in the Site Streams API, I might receive the same ID several times in the context of different users (for_user).
E.g., status N mentions users A, B, and C. In addition, it is favorited by user D. If I'm following all 4 users with Site Streams, I'll see N 4 times in 4 different messages. However, if any of those messages is repeated, I need to discard the repeats.
So, I can't simply track status IDs like I do in the Streaming API. I need to track for_user/type/status_id.
On Mon, Nov 1, 2010 at 12:18 PM, Marc Mims <marc.m...@gmail.com> wrote:
> * John Kalucki <j...@twitter.com> [101031 20:30]:
> > Create two in-memory hash sets of seen ids. Write ids to both. If the id is
> > found on write, discard. Expire the two sets alternately, every few tens of
> > minutes, to bound growth while still providing continuous coverage.
>
> That's what I'm doing now for the Streaming API and it works very well.
> But in the Site Streams API, I might receive the same ID several times
> in the context of different users (for_user).
>
> E.g., status N mentions users A, B, and C. In addition, it is favorited
> by user D. If I'm following all 4 users with Site Streams,
> I'll see N 4 times in 4 different messages. However, if any of those
> messages is repeated, I need to discard the repeats.
>
> So, I can't simply track status IDs like I do in the Streaming API. I
> need to track for_user/type/status_id.
* Mark McBride <mmcbr...@twitter.com> [101101 12:26]:
> Isn't this a matter of just changing the keys? status_id becomes
> user_id":"status_id?
Yes. Probably needs to be user_id/type/status_id to accommodate the case where a user favorites a status she was mentioned in. We'd get that one twice: once for the mention and again for the favorite.
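The composite key discussed above could be sketched in Python roughly as follows. This is only an illustration: it assumes each Site Streams message arrives as a dict with a `for_user` field wrapping a `message`, that events such as favorites carry an `event` name and the status in `target_object`, and that a plain delivered status (for example a mention) is labeled `"status"`. The function names `dedup_key` and `deduplicate` are hypothetical.

```python
def dedup_key(envelope):
    """Build a (for_user, type, status_id) key, or None if one cannot be built."""
    for_user = envelope.get("for_user")
    message = envelope.get("message", {})
    if "event" in message:
        # Assumed event shape: the affected status sits in target_object.
        kind = message["event"]
        status = message.get("target_object") or {}
    else:
        # Assumed: anything without an event name is a delivered status.
        kind = "status"
        status = message
    status_id = status.get("id")
    if for_user is None or status_id is None:
        return None
    return (for_user, kind, status_id)


def deduplicate(envelopes):
    """Yield envelopes, skipping any whose composite key was already seen."""
    seen = set()  # unbounded here; expiry is left out to keep the sketch short
    for envelope in envelopes:
        key = dedup_key(envelope)
        if key is not None:
            if key in seen:
                continue
            seen.add(key)
        yield envelope
```

With this key, status N delivered for users A and B, plus a favorite of N delivered for A, yields three distinct keys, while a repeat of any one of those messages is dropped. The plain `set` grows without bound, so in practice it would be swapped for the expiring sets suggested earlier in the thread.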