De-duplicating Site Streams
From: Marc Mims <marc.m...@gmail.com>
Date: Tue, 26 Oct 2010 20:55:10 -0700
Local: Tues, Oct 26 2010 11:55 pm
Subject: De-duplicating Site Streams
De-duplicating statuses in the Streaming API is fairly straightforward.
But with Site Streams, where a single status might be received multiple
times for multiple mentioned users, and/or as favorites, it is a bit
more difficult.

I'm wondering if anyone can offer advice on an efficient method for
de-duplicating Site Streams.

        -Marc


 
From: "M. Edward (Ed) Borasky" <zn...@borasky-research.net>
Date: Wed, 27 Oct 2010 11:05:05 -0700
Local: Wed, Oct 27 2010 2:05 pm
Subject: Re: [twitter-dev] De-duplicating Site Streams
Quoting Marc Mims <marc.m...@gmail.com>:

> De-duplicating statuses in the Streaming API is fairly straightforward.
> But with Site Streams, where a single status might be received multiple
> times for multiple mentioned users, and/or as favorites, it is a bit
> more difficult.

> I'm wondering if anyone can offer advice on an efficient method for
> de-duplicating Site Streams.

>    -Marc

If you're talking about building something "massively scalable" for  
some value of "massive", you're getting into the realm of "NoSQL"  
databases. I *think* Cassandra has a Perl interface but I haven't  
looked at it recently. I'm by no means an expert on NoSQL databases -  
I just picked Cassandra because Twitter uses it for some things.

--
M. Edward (Ed) Borasky
http://borasky-research.net http://twitter.com/znmeb

"A mathematician is a device for turning coffee into theorems." - Paul Erdos


 
From: Scott Wilcox <sc...@dor.ky>
Date: Wed, 27 Oct 2010 19:09:40 +0100
Local: Wed, Oct 27 2010 2:09 pm
Subject: Re: [twitter-dev] De-duplicating Site Streams
Hi Marc,

I'd throw my hat in for MongoDB; it's extremely fast and I now adore it. Pop me a message on Twitter if you'd like to discuss it more.

Scott.

On 27 Oct 2010, at 19:05, M. Edward (Ed) Borasky wrote:


 
From: John Kalucki <j...@twitter.com>
Date: Sun, 31 Oct 2010 20:30:07 -0700
Local: Sun, Oct 31 2010 11:30 pm
Subject: Re: [twitter-dev] De-duplicating Site Streams

Create two in-memory hash sets of seen ids. Write ids to both. If the id is
found on write, discard. Alternately expire them every few tens of
minutes to bound growth, but provide continuous coverage.

-John
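
The two-set scheme John describes could be sketched in Python roughly as
follows. This is an illustration only, not code from the thread: the class
name, the 20-minute default interval, and the injectable clock are all
assumptions made for the example. Each key is written to both sets, and the
sets are cleared in alternation, so at least one set always remembers
recently seen keys.

```python
import time


class RotatingDeduper:
    """Two seen-key sets, cleared alternately so coverage never drops to zero.

    Hypothetical helper: the name and defaults are assumptions for this sketch.
    """

    def __init__(self, expire_seconds=1200.0, clock=time.monotonic):
        self._sets = [set(), set()]
        self._expire_seconds = expire_seconds  # assumed "few tens of minutes"
        self._clock = clock
        self._next_to_clear = 0
        self._last_clear = clock()

    def _maybe_expire(self):
        # Expiry is checked lazily on each call; only one set is cleared
        # at a time, and the two sets take turns.
        if self._clock() - self._last_clear >= self._expire_seconds:
            self._sets[self._next_to_clear].clear()
            self._next_to_clear = 1 - self._next_to_clear
            self._last_clear = self._clock()

    def is_duplicate(self, key):
        """Return True if key was seen recently; record it either way."""
        self._maybe_expire()
        seen = key in self._sets[0] or key in self._sets[1]
        # Write the key to both sets, as described in the message above.
        self._sets[0].add(key)
        self._sets[1].add(key)
        return seen
```

With steady traffic, a key is remembered for between one and two expiry
intervals after it was last seen, which bounds memory while avoiding a window
in which nothing is remembered.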


 
From: Marc Mims <marc.m...@gmail.com>
Date: Mon, 1 Nov 2010 12:18:11 -0700
Local: Mon, Nov 1 2010 3:18 pm
Subject: Re: [twitter-dev] De-duplicating Site Streams
* John Kalucki <j...@twitter.com> [101031 20:30]:

> Create two in-memory hash sets of seen ids. Write ids to both. If the id is
> found on write, discard. Alternately expire them every few tens of
> minutes to bound growth, but provide continuous coverage.

That's what I'm doing now for the Streaming API and it works very well.
But in the Site Streams API, I might receive the same ID several times
in the context of different users (for_user).

E.g., status N mentions users A, B, and C.  In addition it is favorited
by user D.  If I'm following all 4 users with Site Streams,
I'll see N 4 times in 4 different messages.  However, if any of those
messages is repeated, I need to discard the repeats.

So, I can't simply track status IDs like I do in the Streaming API.  I
need to track for_user/type/status_id.

Or am I missing something here?

        -Marc
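
Marc's example can be made concrete with a small Python sketch. The tuples
below are hypothetical stand-ins for Site Streams messages, reduced to
(for_user, type, status_id); the type labels "mention" and "favorite" are
assumptions for illustration, not field values taken from the API.

```python
# Hypothetical deliveries for status N: mentions of A, B and C, a favorite
# by D, and one repeated delivery of B's mention.
deliveries = [
    ("A", "mention", "N"),
    ("B", "mention", "N"),
    ("C", "mention", "N"),
    ("D", "favorite", "N"),
    ("B", "mention", "N"),  # a true repeat that should be discarded
]


def dedupe(items, key):
    """Keep the first item for each distinct key, preserving order."""
    seen = set()
    kept = []
    for item in items:
        k = key(item)
        if k in seen:
            continue
        seen.add(k)
        kept.append(item)
    return kept


# Keying on the status ID alone collapses all four distinct messages.
by_status = dedupe(deliveries, key=lambda d: d[2])

# Keying on the full (for_user, type, status_id) tuple drops only the repeat.
by_triple = dedupe(deliveries, key=lambda d: d)
```

Here `by_status` keeps a single message, while `by_triple` keeps the four
distinct ones and discards only the repeated delivery.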


 
From: Mark McBride <mmcbr...@twitter.com>
Date: Mon, 1 Nov 2010 12:25:59 -0700
Local: Mon, Nov 1 2010 3:25 pm
Subject: Re: [twitter-dev] De-duplicating Site Streams

Isn't this a matter of just changing the keys?  status_id becomes
user_id + ":" + status_id?

   ---Mark

http://twitter.com/mccv


 
From: Marc Mims <marc.m...@gmail.com>
Date: Mon, 1 Nov 2010 12:32:14 -0700
Local: Mon, Nov 1 2010 3:32 pm
Subject: Re: [twitter-dev] De-duplicating Site Streams
* Mark McBride <mmcbr...@twitter.com> [101101 12:26]:

> Isn't this a matter of just changing the keys?  status_id becomes
> user_id + ":" + status_id?

Yes.  Probably needs to be user_id/type/status_id to accommodate the
case where a user favorites a status she was mentioned in.  We'd get
that one twice: once for the mention and again for the favorite.

        -Marc
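
A minimal Python sketch of the two key shapes discussed here, to show why the
type component matters. The function names, the numeric IDs, and the type
labels are all hypothetical; only the key layouts come from the thread.

```python
def pair_key(user_id, status_id):
    """Two-part key, user_id:status_id."""
    return f"{user_id}:{status_id}"


def triple_key(user_id, kind, status_id):
    """Three-part key, user_id/type/status_id."""
    return f"{user_id}/{kind}/{status_id}"


# Hypothetical case: user 42 is mentioned in status 1001 and then favorites it.
mention = (42, "mention", 1001)
favorite = (42, "favorite", 1001)

# The two-part key cannot tell the two messages apart, so a deduper keyed
# on it would wrongly discard the favorite as a repeat of the mention.
pair_collides = pair_key(mention[0], mention[2]) == pair_key(favorite[0], favorite[2])

# The three-part key keeps them distinct.
triple_collides = triple_key(*mention) == triple_key(*favorite)
```

With these inputs `pair_collides` is true and `triple_collides` is false.
Either key shape can be stored in the same seen-key sets that are used for
plain status IDs.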


 