Pablo,
The duplicate tweet most likely happened because the "crawler" ran
before "stream_process" had picked the tweet out of the rawstream
table (which captures the tweets from the Twitter Streaming API).
Everything that comes from the Streaming API gets dumped in that table
and gets moved to the right archive [z_table] (by the stream_process
script) when it is run. If I recall correctly, that script doesn't
check for duplicates and just does straight inserts into the archive
tables.
Therefore, if stream_process has been slowed down a little, the
crawler may get ahead of it. The main reasons the crawler exists are
1) backfill and 2) redundancy checks, and there is something of an
assumption that the Streaming API / stream_process will always have
picked tweets up before the crawler sees them. When the crawler wins
that race, both it and stream_process insert the same tweet, hence
the duplicate.
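If the duplicates turn out to be a problem, one option is to make the archive insert idempotent. Below is a rough sketch of the idea only, not the actual stream_process code: it uses Python with an in-memory SQLite database, and the table layout (z_archive, the id and text columns) and the function names are made up for illustration. On MySQL the equivalent would be a unique key on the tweet id plus INSERT IGNORE.

```python
import sqlite3


def make_db():
    """Build a toy database; the schema here is hypothetical."""
    db = sqlite3.connect(":memory:")
    # Everything from the Streaming API lands here first.
    db.execute("CREATE TABLE rawstream (id INTEGER, text TEXT)")
    # The primary key on id is what lets the database reject duplicates.
    db.execute("CREATE TABLE z_archive (id INTEGER PRIMARY KEY, text TEXT)")
    return db


def archive_tweet(db, tweet_id, text):
    """Insert a tweet unless its id is already archived.

    Returns True if a row was inserted, False if it was a duplicate.
    """
    cur = db.execute(
        "INSERT OR IGNORE INTO z_archive (id, text) VALUES (?, ?)",
        (tweet_id, text),
    )
    return cur.rowcount == 1


def stream_process(db):
    """Move everything from rawstream into the archive, skipping duplicates."""
    rows = db.execute("SELECT id, text FROM rawstream").fetchall()
    inserted = sum(archive_tweet(db, tweet_id, text) for tweet_id, text in rows)
    db.execute("DELETE FROM rawstream")
    db.commit()
    return inserted
```

With this in place it no longer matters whether the crawler or stream_process reaches a tweet first: whichever arrives second is silently ignored instead of creating a second row.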
Let me know how things go... or if I didn't make any sense...
John
On Jan 18, 2:02 pm, Pablo Lemos <beterra...@gmail.com> wrote:
> Thanks a lot John.
> Just did that.
>
> I've restarted apache and tested. The first tweet I posted was recorded
> twice, weird. But just the first.
> I did other tweets and it is working fine, only no more "realtime
> recording", but I don't need that precision (even Twitter sometimes takes a
> while to show hashtags in search results :)
>
> Let's see how the server behaves now.
>
> thanks again,
>
> Pablo Cabana
>
> 2012/1/18 John O'Brien III (ob3solutions) <jobr...@ob3solutions.com>