I'm wondering if anyone can offer advice on an efficient method for
de-duplicating Site Streams.
-Marc
If you're talking about building something "massively scalable" for
some value of "massive", you're getting into the realm of "NoSQL"
databases. I *think* Cassandra has a Perl interface but I haven't
looked at it recently. I'm by no means an expert on NoSQL databases -
I just picked Cassandra because Twitter uses it for some things.
--
M. Edward (Ed) Borasky
http://borasky-research.net http://twitter.com/znmeb
"A mathematician is a device for turning coffee into theorems." - Paul Erdos
I'd throw my hat in for MongoDB: it's ridiculously fast and I now adore it. Pop me a message on Twitter if you'd like to discuss it more.
Scott.
That's what I'm doing now for the Streaming API and it works very well.
But in the Site Streams API, I might receive the same ID several times
in context of different users (for_user).
E.g., status N mentions users A, B, and C. In addition it is favorited
by user D. If I'm following all 4 users with Site Streams,
I'll see N 4 times in 4 different messages. However, if any of those
messages is repeated, I need to discard the repeats.
So, I can't simply track status IDs like I do in the Streaming API. I
need to track for_user/type/status_id.
Or am I missing something here?
-Marc
Yes. Probably needs to be user_id/type/status_id to accommodate the
case where a user favorites a status she was mentioned in. We'd get
that one twice: once for the mention and again for the favorite.
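
For illustration, here is a rough sketch of de-duplicating on that composite key, in Python rather than Perl. It is only a sketch under assumptions: the class name, the "mention"/"favorite" type labels, and the size cap are all made up for the example, and pulling user_id, type, and status_id out of a Site Streams message is left out entirely. The cap (dropping the oldest keys once the set is full) is my own addition to keep memory bounded; it is not something Site Streams requires.

```python
from collections import OrderedDict


class SiteStreamDeduper:
    """Remembers recently seen (user_id, type, status_id) keys.

    Hypothetical helper for illustration only. max_keys is an arbitrary
    cap so the key set cannot grow without bound.
    """

    def __init__(self, max_keys=100_000):
        self.max_keys = max_keys
        self._seen = OrderedDict()  # key -> None, oldest entries first

    def is_duplicate(self, user_id, event_type, status_id):
        """Return True if this exact key was seen before, else record it."""
        key = (user_id, event_type, status_id)
        if key in self._seen:
            # Refresh the key so it is the last one to be evicted.
            self._seen.move_to_end(key)
            return True
        self._seen[key] = None
        if len(self._seen) > self.max_keys:
            # Over the cap: drop the least recently seen key.
            self._seen.popitem(last=False)
        return False


dedupe = SiteStreamDeduper()

# Status 100 mentions A, B, and C and is favorited by D: four distinct keys.
for user, kind in [("A", "mention"), ("B", "mention"),
                   ("C", "mention"), ("D", "favorite")]:
    assert not dedupe.is_duplicate(user, kind, 100)

# A repeat of any of those messages is discarded.
assert dedupe.is_duplicate("B", "mention", 100)

# A favoriting a status she was mentioned in is a different key, so it is kept.
assert not dedupe.is_duplicate("A", "favorite", 100)
```

An in-memory set like this only works for a single consumer process. With several consumers, the same key could go into a shared store instead, which is where the Cassandra or MongoDB suggestions would come in.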
-Marc