Out of memory

Pablo Cabana

Jan 18, 2012, 1:12:33 PM

to yourTwapperKeeper

Hi, I am using yourTwapperKeeper on a Rackspace CloudServer and everything
works fine, but one thing is weird.
I am only tracking one hashtag, just one simple archive, with no more
than 5 tweets/day.
And, to my surprise, the server restarts from time to time because it
is running out of memory.
It is a small server, but I use it ONLY for yourTwapperKeeper.

Any clue why yourTwapperKeeper is so CPU intensive?

Thanks in advance,

Pablo Cabana

John O'Brien III (ob3solutions)

Jan 18, 2012, 1:20:49 PM

to yourTwapperKeeper

Pablo,
The stream_processing routine is unfortunately extremely aggressive, so that
it can keep up with a potentially high velocity of tweets coming in from the
stream (yes, probably a little too aggressive).

You could probably cut some of the intensity out by simply putting a
little sleep() in the "yourtwapperkeeper_stream_process.php" script.

Just edit the file and add the sleep function right after the WHILE
loop begins...

<?php
// load important files
require_once('config.php');
require_once('function.php');
require_once('twitteroauth_search.php');

// setup values
$pid = getmypid();
$script_key = uniqid();

// process loop
while (TRUE) {

    // ADD THIS **************************************
    echo "sleeping...\n";
    sleep(60);

    // lock up some tweets
    $q = "update rawstream set flag = '$script_key' where flag = '-1' limit $stream_process_stack_size";
    echo $q."\n";
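
A gentler variant of the same idea is to pause only when the previous pass found nothing to process, so a burst of tweets is still drained quickly. This is a minimal sketch, not code from yourTwapperKeeper: the helper name idle_sleep_seconds() is hypothetical, and it assumes the script can find out how many rows the "lock up some tweets" update flagged (for example from the affected-row count).

```php
<?php
// Hypothetical helper, not part of yourTwapperKeeper: decide how long the
// processing loop should pause based on how many rows the last pass flagged.
function idle_sleep_seconds(int $rowsFlagged, int $idleSeconds): int
{
    // Work was found: loop again immediately so a burst of tweets is drained.
    if ($rowsFlagged > 0) {
        return 0;
    }
    // Nothing waiting in rawstream: pause for at least one second so the
    // loop does not spin the CPU while idle.
    return max(1, $idleSeconds);
}

// Inside the while (TRUE) loop, after running the "lock up some tweets" update
// ($rowsFlagged is assumed to hold the affected-row count of that update):
//     $pause = idle_sleep_seconds($rowsFlagged, 60);
//     if ($pause > 0) { sleep($pause); }
```

With a low-volume archive the loop would spend almost all of its time in the idle branch, which is where the CPU saving comes from.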

Hope this helps...

John
http://twitter.com/jobrieniii
http://www.linkedin.com/in/jobrieniii

Pablo Lemos

Jan 18, 2012, 2:02:35 PM

to yourtwap...@googlegroups.com

Thanks a lot John.

Just did that.

I've restarted Apache and tested. The first tweet I posted was recorded twice, which is weird, but it was just the first.

I posted other tweets and it is working fine. There is no more "realtime recording", but I don't need that precision (even Twitter sometimes takes a while to show hashtags in search results :)

Let's see how the server behaves now.

thanks again,

Pablo Cabana

Pablo Lemos

Jan 18, 2012, 2:06:20 PM

to yourtwap...@googlegroups.com

Wow, my CPU usage dropped from 196% to 56%.

That's great! ;)

Pablo Lemos

Jan 18, 2012, 2:08:30 PM

to yourtwap...@googlegroups.com

Hmm, but sometimes the tweets are being recorded twice, seemingly at random.

Any clue why this is happening?

John O'Brien III (ob3solutions)

Jan 18, 2012, 2:11:37 PM

to yourTwapperKeeper

Pablo,

The duplicate tweet most likely came from the fact that the "crawler"
ran prior to the "stream_process" picking the tweet out of the
rawstream table (which captures the tweets from the Twitter Streaming
API).

Everything that comes from the Streaming API gets dumped in that table
and gets moved to the right archive [z_table] (by the stream_process
script) when it is run. If I recall correctly, that script doesn't
check for duplicates and just does straight inserts into the archive
tables.

Therefore, if the stream_processing has been slowed down a little, the
crawler may get ahead of it. The main reason the crawler exists is
for 1) backfill and 2) redundancy checks, and there is somewhat of
an assumption that the streaming API / stream process will always
have picked tweets up before the crawler sees them.
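
In principle the same kind of duplicate check could be applied before the stream_process insert as well. A minimal sketch of the idea follows, using an in-memory array in place of a z_ archive table. The function name insert_if_new() and the array layout are assumptions for illustration only, not code from yourTwapperKeeper.

```php
<?php
// Hypothetical illustration: $archive stands in for one z_<id> archive table,
// keyed by tweet id. In the real scripts this would be a SELECT on the table
// (or a unique index on the id column) rather than an array lookup.
function insert_if_new(array &$archive, array $tweet): bool
{
    $id = (string) $tweet['id'];   // tweet ids exceed 32-bit ints; handle as strings
    if (isset($archive[$id])) {
        return false;              // already archived (e.g. by the crawler): skip
    }
    $archive[$id] = $tweet;        // first sighting: store it
    return true;
}

// The crawler archives a tweet first, then stream_process sees the same one.
$archive = [];
$tweet = ['id' => '159712345678901234', 'text' => 'hello #tag'];
$first  = insert_if_new($archive, $tweet);  // stored
$second = insert_if_new($archive, $tweet);  // rejected as a duplicate
```

Whichever script reaches the tweet second then skips it, so the order in which the crawler and the stream process run stops mattering.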

Let me know how things go... or if I didn't make any sense...
John

Pablo Lemos

Jan 18, 2012, 2:40:36 PM

to yourtwap...@googlegroups.com

Yeah, it makes sense :)

But since I am not a PHP expert I could not solve it alone.

The crawl script seems to be checking for duplicates:

// duplicate record check and insert into proper cache table if not a duplicate
$q_check = "select id from z_".$row_archives['id']." where id = '".$value['id']."'";
$result_check = mysql_query($q_check, $db->connection);

if (mysql_numrows($result_check)==0) {
    $q = "insert into z_".$row_archives['id']." values ('twitter-search','".mysql_real_escape_string($temp_text)."','$temp_to_user_id','$temp_from_user','$temp_id','$temp_from_user_id','$temp_iso_language_code','$temp_source','$temp_profile_image_url','$geo_type','$geo_coordinates_0','$geo_coordinates_1','$temp_created_at','".strtotime($temp_created_at)."')";
    mysql_query($q, $db->connection);
    echo "[".$row['id']."-".$row['keyword']."] $page_counter - $temp_id - insert\n";
} else {
    echo "$page_counter - $temp_id - duplicate\n";
}
}

/////////////////

Do you think that if I change the "sleep" time for the crawl to 60 too, it would sync both processes?

// sleep for rate limiting
echo "sleep = $sleep\n";
sleep($sleep);

change to

// sleep for rate limiting
echo "sleep =60\n";
sleep(60);

Thanks a lot for the quick responses.

Pablo Cabana

John O'Brien III (ob3solutions)

Jan 18, 2012, 2:48:00 PM

to yourTwapperKeeper

No - unfortunately they take totally different amounts of time to do their
"run" - so syncing the wait times wouldn't help.

You may try "speeding" up the stream_process a little (sleeping for only a
second or two instead of 60 may help).

John

Pablo Lemos

Jan 18, 2012, 2:58:13 PM

to yourtwap...@googlegroups.com

Cool.

I just need to not crash the server. :)

I changed it to 2 seconds and in my first test it didn't duplicate the tweet.

Let's see.
