Streaming API: Spritzer-stream coverage


Sven Svensson

May 24, 2009, 1:51:17 PM
to twitter-deve...@googlegroups.com
Thanks for an excellent API.

I have two questions in relation to the streaming API:

* Assume that two users are both reading the spritzer stream at the same time - will they get the same spritzer streams covering the same subset of all tweets, or will they get two separate spritzer streams covering different tweets?

* Roughly what percentage of all tweets are distributed in the spritzer stream? Is it in the region of four percent of all tweets (my guesstimate)?

Thanks!

John Kalucki

May 24, 2009, 11:23:09 PM
to Twitter Development Talk
Sven,

Excluding connection ramp-up and ramp-down skew, each spritzer feed
delivers the same statuses as all other spritzer feeds. Likewise, each
gardenhose feed delivers the same statuses as all other gardenhose
feeds. Also, spritzer feeds are a strict subset of gardenhose feeds.
There's no point in consuming multiple sampled feeds (spritzer/
spritzer, gardenhose/spritzer, gardenhose/gardenhose), as you'll just
receive duplicate data.

Multiple sessions on sampled feeds just waste scarce resources and you
also may find your access automatically limited for a period of time.
Reduce, reuse, recycle!

-John Kalucki
Services, Twitter Inc.

elversatile

May 25, 2009, 3:17:36 AM
to Twitter Development Talk
How are spritzer statuses sampled? Are they picked uniformly at
random? Or is there some logic behind it?

Also, what makes it "statistically insignificant"? Is it its
percentage in relation to the entire stream or the way it is sampled?

Thanks,
-Eldar

stephane

May 25, 2009, 5:39:44 PM
to Twitter Development Talk
Looking at the tweet IDs, it looks like the spritzer stream delivers 5
tweets out of every hundred.
This would make it 5% of the firehose.

Am I correct?

Stephane
http://www.twazzup.com

Sven Svensson

May 26, 2009, 6:24:58 AM
to twitter-deve...@googlegroups.com
Hi Stephane,

I used the following calculation to obtain a four percent estimate for
the spritzer stream:
  tweets_seen_in_stream / (max_tweet_id_seen_in_stream - min_tweet_id_seen_in_stream)

Did you use the same methodology?

The four percent is probably a bit too low: I assume private tweets
get tweet IDs too, which makes the denominator a bit too large.
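
A minimal sketch of the estimate described above, assuming the observed status IDs have been collected into a list. The function name `estimate_coverage` and the sample IDs are hypothetical, and the sketch assumes IDs are allocated sequentially, which Twitter has not confirmed in this thread. If IDs are also consumed by non-public tweets, the result underestimates the share of public tweets.

```python
def estimate_coverage(tweet_ids):
    """Estimate the fraction of all tweets seen, from observed status IDs.

    Implements tweets_seen / (max_id - min_id). Assumes IDs are handed out
    sequentially (an assumption, not something confirmed by Twitter).
    """
    unique_ids = set(tweet_ids)  # ignore duplicates, e.g. from reconnects
    if len(unique_ids) < 2:
        raise ValueError("need at least two distinct IDs")
    id_span = max(unique_ids) - min(unique_ids)
    return len(unique_ids) / id_span


# Hypothetical IDs, for illustration only.
observed = [1000, 1001, 1002, 1003, 1004, 1100]
print(f"estimated coverage: {estimate_coverage(observed):.1%}")
```

The estimate gets more stable the longer the stream is observed, since the ID span then dwarfs any edge effects at the start and end of the session.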

stephane

May 26, 2009, 9:01:06 PM
to Twitter Development Talk
Hi Sven,

Well, I merely assumed that the easiest way for Twitter to send a
subset of tweets on spritzer was to select them based on their IDs
(auto-increment integers)...
Watching the stream, I noticed that "all" the IDs were ending with
000,001,002,003,004, 100,102, ... 900,901,... 904

I did not push the analysis further though
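
One way to push that analysis a little further is to tally the observed IDs by their last two digits. The sketch below assumes, as guessed above, that sampling is by ID modulo 100; that is an inference from the observed stream, not a documented rule, and the function names and sample IDs are hypothetical.

```python
from collections import Counter


def suffix_histogram(tweet_ids, modulus=100):
    """Count observed IDs by their remainder modulo `modulus`."""
    return Counter(tweet_id % modulus for tweet_id in tweet_ids)


def implied_sample_rate(tweet_ids, modulus=100):
    """Fraction of the possible remainders that actually appear."""
    return len(suffix_histogram(tweet_ids, modulus)) / modulus


# Hypothetical IDs ending in 00 through 04, for illustration only.
observed = [1900000, 1900001, 1900102, 1900203, 1900904, 1901004]
print(sorted(suffix_histogram(observed)))  # distinct remainders seen
print(implied_sample_rate(observed))       # share of remainders covered
```

If only remainders 0 through 4 ever show up over a long observation window, that is consistent with a 5% sample, though it says nothing about how non-public tweets are handled.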

elversatile

May 27, 2009, 1:07:06 AM
to Twitter Development Talk
Makes sense. I was assuming the same. Thanks, people! John from Twitter
said that spritzer is 1/3 of the gardenhose, which would make the
gardenhose about 15%. So I guess the statistical insignificance of
spritzer is due to its low percentage. Any explanation directly from
Twitter?

Brendan O'Connor

May 27, 2009, 2:55:17 AM
to twitter-deve...@googlegroups.com
On Tue, May 26, 2009 at 10:07 PM, elversatile <elver...@gmail.com> wrote:
>
> Makes sense. I was assuming the same. Thanks people! John from Twitter
> said that spritzer is 1/3 of the gardenhose, which makes it 15%. So I
> guess statistical insignificance of spritzer is due to its low
> percentage.

I'm also curious what "statistical insignificance" means in this
context, since the Streaming API docs are pretty assiduous about
saying which feeds are "significant" vs. "insignificant". Sample sizes far
lower than 4% are of course fine for certain purposes as long as
they're drawn uniformly. And even if not all that uniform, they might
still be good enough :)

There are so many different things to do with *hose/spritzer I'm not
sure what statistical significance means in the abstract. I'm seeing
hundreds of thousands of messages per day on /spritzer. If you're
interested in computing a statistic that holds across all tweets --
say, average tweet length -- that's *plenty*. (Now, if you wanted to
compute the statistic per 1 minute time window and cared about
minute-per-minute differences, the story might be different...)
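
A rough sketch of why that sample size is plenty for a global average, assuming the sample is close to uniform: the standard error of a mean shrinks with the square root of the number of observations. The function name `mean_with_standard_error` and the input values are hypothetical.

```python
import math
import statistics


def mean_with_standard_error(values):
    """Return (sample mean, standard error of the mean).

    The standard error is the sample standard deviation divided by sqrt(n),
    which is only meaningful if the sample is drawn roughly uniformly.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = statistics.fmean(values)
    std_err = statistics.stdev(values) / math.sqrt(n)
    return mean, std_err


# Hypothetical tweet lengths in characters, for illustration only.
lengths = [42, 140, 87, 15, 103, 66, 120, 31]
mean, std_err = mean_with_standard_error(lengths)
print(f"mean length: {mean:.1f} +/- {std_err:.1f} characters")
```

As an illustration with made-up numbers: a standard deviation of 40 characters over 200,000 messages would give a standard error of roughly 0.09 characters. Split the same day into one-minute windows and each window holds only a few hundred messages at best, so the per-window error is far larger.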

I'm curious to know what the docs author meant by "statistically
(in)significant" here.

Brendan
[ http://anyall.org ]

John Kalucki

May 27, 2009, 10:10:57 AM
to Twitter Development Talk
Folks,

The significant/insignificant language currently isn't that important
or clear, as we're preparing for future changes. The spritzer will
likely remain a small public sample, the gardenhose will likely remain
a larger sample that requires an EULA. The proportions, however, are
subject to continuous change -- we want to provide a useful flow, but,
at the same time, we don't want to incur excessive cost or overwhelm
clients. Given our traffic growth, we will probably have to trim rates
down -- few clients want a 5 Mbit/sec spritzer feed.

We haven't yet worked out a model for adjusting the sampling
proportions. The sampling may be based on some public model of
statistical significance, it may be driven by practical matters, by
client requirements, some unknown factor, or some combination of them
all. We're still measuring, analyzing, and reasoning about the
Streaming API, and there's plenty we don't know just yet.

-John Kalucki
Services, Twitter Inc.



