I used the following calculation to obtain a four percent estimate for
the spritzer stream:
tweets_seen_in_stream / (max_tweet_id_seen_in_stream -
min_tweet_id_seen_in_stream)
Did you use the same methodology?
The four percent is probably a bit too low: I assume private tweets
get tweet IDs too, which makes the denominator a bit too large.
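
For anyone who wants to repeat the estimate, here is a minimal Python sketch of the calculation above. It assumes tweet IDs are handed out sequentially, so the ID span approximates the total number of tweets created while the stream was being read. The function name and the sample IDs are made up for illustration.

```python
def estimate_sample_fraction(tweet_ids):
    """Estimate the fraction of all tweets that a sampled stream delivers.

    Assumption: tweet IDs are assigned sequentially, so (max_id - min_id)
    approximates the number of tweets, public and private, created during
    the observation window.
    """
    ids = list(tweet_ids)
    if len(ids) < 2:
        raise ValueError("need at least two tweet IDs")
    id_span = max(ids) - min(ids)
    if id_span == 0:
        raise ValueError("all tweet IDs are identical")
    # tweets_seen_in_stream / (max_tweet_id_seen - min_tweet_id_seen)
    return len(ids) / id_span


# Made-up IDs, for illustration only.
seen = [1000, 1025, 1050, 1075, 1100]
print(f"estimated sample fraction: {estimate_sample_fraction(seen):.1%}")
```

Since private tweets presumably consume IDs as well, the result is better read as a lower bound on the share of public tweets.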
I'm also curious what "statistical insignificance" means in this
context, since in the Streaming API docs they're pretty assiduous about
saying which are "significant" vs. "insignificant". Sample sizes far
lower than 4% are of course fine for certain purposes as long as
they're drawn uniformly. And even if not all that uniform, they might
still be good enough :)
There are so many different things to do with *hose/spritzer that I'm
not sure what statistical significance means in the abstract. I'm seeing
hundreds of thousands of messages per day on /spritzer. If you're
interested in computing a statistic that holds across all tweets --
say, average tweet length -- that's *plenty*. (Now, if you wanted to
compute the statistic per one-minute time window and cared about
minute-to-minute differences, the story might be different...)
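
To put rough numbers on the whole-day versus per-minute contrast, here is a small Python sketch using the usual standard error of a mean, std_dev / sqrt(n). The figures are assumptions chosen only for illustration: 300,000 messages per day as a stand-in for "hundreds of thousands", and a tweet-length standard deviation of 40 characters, which is a guess and not a measurement.

```python
import math


def standard_error(std_dev, n):
    """Standard error of a sample mean, assuming independent uniform draws."""
    if n <= 0:
        raise ValueError("n must be positive")
    return std_dev / math.sqrt(n)


# Hypothetical figures, for illustration only.
messages_per_day = 300_000          # stand-in for "hundreds of thousands"
length_std_dev = 40.0               # guessed std. dev. of tweet length (chars)
messages_per_minute = messages_per_day / (24 * 60)

se_day = standard_error(length_std_dev, messages_per_day)
se_minute = standard_error(length_std_dev, messages_per_minute)

print(f"per-day standard error:    {se_day:.3f} chars")
print(f"per-minute standard error: {se_minute:.3f} chars")
```

Under these assumptions the per-minute estimate is about sqrt(1440), or roughly 38 times, noisier than the per-day one, which is why small minute-to-minute differences could be hard to tell apart from sampling noise.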
I'm curious to know what the docs author meant by "statistically
(in)significant" here.
Brendan
[ http://anyall.org ]