First, your test set is a bit small. Did you take into account the
extra data you get from your first Search API poll? Typically your
first poll will return 100 items, then subsequent polls will return
only "new" data if you are using since_id and/or deduplicating.
Make sure both your poller and stream reader start at the same item. A
trick, if you want the two result sets to match as closely as possible
from the start, is to request only a single item on the first poll
(using rpp=1), or to keep only the most recent item of your result,
and then use this item to seed your since_id on the following polls.
Another idea might be to start your stream reader first and use the
first item returned by your reader to seed the since_id in your
poller. Alternatively, you can simply ignore this during collection
but clean up your data once you're done collecting, making sure both
data sets start and end with the same item ID.
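
For the cleanup option, here is a minimal Python sketch, assuming each
collector has written plain numeric IDs and that IDs increase over
time. The function name trim_to_common_range is hypothetical, not part
of any Twitter library. It trims both sets to the span between the
first and last ID that both collectors saw:

```python
def trim_to_common_range(poll_ids, stream_ids):
    """Trim both ID collections so they start and end on a shared ID.

    Assumes IDs are integers that grow over time (an assumption of
    this sketch), so min/max of the shared IDs mark the common window.
    """
    poll, stream = set(poll_ids), set(stream_ids)
    common = poll & stream
    if not common:
        # Nothing overlaps, so there is no common window to compare.
        return [], []
    low, high = min(common), max(common)
    trimmed_poll = sorted(i for i in poll if low <= i <= high)
    trimmed_stream = sorted(i for i in stream if low <= i <= high)
    return trimmed_poll, trimmed_stream


if __name__ == "__main__":
    poll = [90, 95, 100, 101, 103, 104]    # first poll pulled older items
    stream = [100, 101, 102, 104, 105]     # stream kept running a bit longer
    print(trim_to_common_range(poll, stream))
```

Items that only one side saw inside the window (103 and 102 in the
sample data) are kept, since those are the real differences you want
to measure.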
In any case, if your difference is in fact related to the handling of
the first poll, it will become marginal as your data set grows.
I also ran some tests to compare results between both methods using a
single keyword. With result sets of about 15000 IDs, the two sets are
98.3% identical. For testing purposes both my poller and stream
reader only output IDs, so I can use cat, sort, uniq, wc and diff to
compare the results.
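
If you would rather do the comparison in a script, the same check can
be sketched in Python. This is only an illustration: the function
names are made up, the one-ID-per-line file layout is an assumption,
and "overlap" is taken here to mean shared IDs divided by all distinct
IDs, which may not be exactly how the figure above was computed.

```python
def load_ids(path):
    """Read one numeric ID per line, skipping blank lines (assumed format)."""
    with open(path) as handle:
        return {int(line) for line in handle if line.strip()}


def compare_id_sets(poll_ids, stream_ids):
    """Summarise how two ID collections differ."""
    poll, stream = set(poll_ids), set(stream_ids)
    shared = poll & stream
    union = poll | stream
    return {
        "only_poll": sorted(poll - stream),      # seen by the poller only
        "only_stream": sorted(stream - poll),    # seen by the stream only
        "shared": len(shared),
        # Shared IDs as a percentage of all distinct IDs seen by either side.
        "overlap_pct": 100.0 * len(shared) / len(union) if union else 100.0,
    }


if __name__ == "__main__":
    report = compare_id_sets([1, 2, 3, 4], [2, 3, 4, 5])
    print(report)
```

With real data you would pass load_ids("poll_ids.txt") and
load_ids("stream_ids.txt") (hypothetical file names) instead of the
sample lists.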
Colin