disk usage related to firefox profiles


António Fonseca

Feb 11, 2016, 5:10:17 AM
to Selenium Users
Hi all,

I have a disk usage issue that is related to topic https://groups.google.com/forum/#!msg/selenium-users/XgoQFfZQHKI/WgJOSv1JNNMJ 
My disk (400 GB) fills up completely (after 15k URLs fetched) because of the Firefox profiles left under /tmp/anonymous*web-driver.

also it is likely impacting 

I am running Nutch 1.11 with Selenium 2.48.2 using selenium.driver remote that goes to a phantomjs service I have running and it connects to firefox 41.0.2.
Is there any way I could work around this?

Thank you
Best regards

Krishnan Mahadevan

Feb 11, 2016, 5:51:02 AM
to Selenium Users
I'm a bit confused here.

I didn't quite understand this part: "selenium.driver remote that goes to a phantomjs service I have running and it connects to firefox 41.0.2."

Could you please elaborate a bit?

It would also help if you could show us the code that you are using to open and close a browser.



Thanks & Regards
Krishnan Mahadevan

"All the desirable things in life are either illegal, expensive, fattening or in love with someone else!"
My Scribblings @ http://wakened-cognition.blogspot.com/
My Technical Scribbings @ http://rationaleemotions.wordpress.com/

--
You received this message because you are subscribed to the Google Groups "Selenium Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to selenium-user...@googlegroups.com.
To post to this group, send email to seleniu...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/selenium-users/accde598-6768-47f7-ad10-6e2b74751f98%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

António Fonseca

Feb 11, 2016, 6:28:44 AM
to seleniu...@googlegroups.com
Hi,

The code that opens and closes the browser comes from Selenium; I didn't write any code (I'm just using Nutch out of the box from http://mirrors.fe.up.pt/pub/apache/nutch/1.11/ ).

About PhantomJS: I had to set up PhantomJS on port 4444 so that Selenium proxies through it to Firefox (I tried a lot and wasn't able to make Selenium communicate directly with Firefox, and googling didn't help solve it). I set the Nutch Selenium properties to use "remote" as the driver on port 4444.

I can send you my "installation procedure" and Nutch config files so you can see exactly what is happening here, if needed.

Thank you very much for the reply; I really need this working. If there is any other info you need, let me know.

Best regards



Krishnan Mahadevan

Feb 11, 2016, 7:31:37 AM
to seleniu...@googlegroups.com
Antonio,

Unfortunately I don't know anything about Nutch, but I think I will try it out.
Either way, it would be good if you could share all the data so that at least someone else can help you.

I can also use that to reproduce your problem.

-Krishnan Mahadevan

PeterJeffreyGale

Feb 11, 2016, 8:20:08 AM
to Selenium Users
I don't know what the underlying cause is, and in any case the temp files can be left behind when a script crashes before it completes.

When I first encountered it, though, I just added a call at the end of every test to a routine that removes any of the temp folders that it can.

I've heard of others seeing the temp folders left in place, but if you run tests through Jenkins, for example, it seems to do the clearing up for you without you necessarily realising it.

António Fonseca

Feb 11, 2016, 8:35:11 AM
to seleniu...@googlegroups.com
Hi Krishnan,

The installation and run procedure is as follows (assuming Java and Firefox 41.0.2 are installed):

sudo apt-get install ant

sudo apt-get install xvfb

sudo apt-get install phantomjs


wget http://mirrors.fe.up.pt/pub/apache/nutch/1.11/apache-nutch-1.11-src.tar.gz

tar -xvzf apache-nutch-1.11-src.tar.gz

cd apache-nutch-1.11


Replace conf/nutch-site.xml and conf/nutch-default.xml with the attached files.
Put a seed file in conf/urls/seed.csv (an example file is also attached).

ant clean

ant runtime


phantomjs --webdriver=4444 > phantomjs.log & 

# If this fails, run "lsof -i :4444", kill the process holding the port, and try again.


/usr/bin/Xvfb :11 -screen 0 1024x768x24 > xvfb.log &

export DISPLAY=:11


runtime/local/bin/crawl conf/urls/ testCrawl1/ 1 > crawlLog1.txt &


After this you will see the /tmp folder grow over time with anonymousXXXXweb-driver folders (where XXXX is a random number).
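To put numbers on that growth while a crawl runs, something like the following could be used. This is only a sketch: the helper name profile_usage is hypothetical, the folder pattern assumes the corrected name anonymousXXXwebdriver-profile given later in this thread, and -maxdepth assumes GNU find.

```shell
#!/bin/sh
# Hypothetical helper: print "<count> <kilobytes>" for the WebDriver profile
# folders found directly under the directory given as $1.
# The name pattern is an assumption based on the folder names in this thread.
profile_usage() {
  find "$1" -maxdepth 1 -type d -name 'anonymous*webdriver-profile' \
    -exec du -sk {} + |
    awk '{ kb += $1; n++ } END { printf "%d %d\n", n, kb }'
}

# Example: log the usage of /tmp once a minute while the crawl runs.
# while true; do echo "$(date +%T) $(profile_usage /tmp)"; sleep 60; done
```

Comparing the count with the number of fetcher threads should show whether folders are being left behind after their browser has gone.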


Thank you

Best regards




sample_seed.csv
nutch-default.xml
nutch-site.xml

António Fonseca

Feb 11, 2016, 8:40:42 AM
to seleniu...@googlegroups.com
Hi Peter,

The problem here is that I'm crawling/scraping 29k URLs with multiple threads in parallel, so it's not straightforward to know which folders I can already delete. But since it seems each Firefox uses its own new temp profile, I will leave a script running that deletes everything in the /tmp folder every 2 minutes and see whether it breaks the testing. If it doesn't break anything, that solution is perfectly acceptable (which is not to say that the issue should not be fixed if it is in the Selenium plugin for Nutch).

I will run some tests and let everyone know the result (they take some hours, so I will probably only report back tomorrow).

Thank you
Best regards


António Fonseca

Feb 11, 2016, 8:56:10 AM
to seleniu...@googlegroups.com
Hi,

A small correction: the temp profile folders are named anonymousXXXwebdriver-profile.

Best regards

PeterJeffreyGale

Feb 11, 2016, 8:56:37 AM
to Selenium Users
The ones that are still in use should cause an exception, so just catch that and move on.

Krishnan Mahadevan

Feb 11, 2016, 9:34:56 AM
to Selenium Users
@Peter,
I don't think there is any coding involved when using Nutch, so the cleanup that you described would need to be done separately, outside of Nutch.

@Antonio,

I think I know what is causing the problem, but what I don't know is how to go about fixing it [ I am not able to conclude clearly ].


There were some interesting things that I noticed.

The version that you are using, i.e. 1.11, lists a fix that specifically targeted the clean-up of the temporary file system [ http://archive.apache.org/dist/nutch/1.11/CHANGES.txt ]:

* NUTCH-2111 Delete temporary files location for selenium tmp files after driver quits (Kim Whitehall via lewismc)

But what I did notice was that HttpWebClient.java specifically disables the clean-up of profiles after WebDriver has quit when you run in "remote" mode [ I believe this is the mode in which you are running ]


wherein the code disables the JVM argument webdriver.reap_profile.

So even though the method cleanupDriver() [ line 132 in the above java class ] makes a call to 

TemporaryFilesystem.getDefaultTmpFS().deleteTemporaryFiles();

[ this is a Selenium-provided API that cleans up the temporary folder ], the cleanup doesn't happen, because TemporaryFilesystem.getDefaultTmpFS().deleteTemporaryFiles() first checks the value of the JVM argument webdriver.reap_profile. ONLY if it has been set to true does it go about cleaning the temp file system [ in your case, the remote mode specifically disables this ].
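One way to check this against your own checkout is to search the plugin source for the property. A minimal sketch: the helper name show_reap_profile_lines is hypothetical, and the path in the example is the one given for HttpWebClient.java later in this thread.

```shell
#!/bin/sh
# Hypothetical helper: list the lines of a source file that mention the
# webdriver.reap_profile property, prefixed with their line numbers.
show_reap_profile_lines() {
  grep -n 'webdriver\.reap_profile' "$1"
}

# Example, run from the apache-nutch-1.11 source directory:
# show_reap_profile_lines src/plugin/lib-selenium/src/java/org/apache/nutch/protocol/selenium/HttpWebClient.java
```

If the property is set inside the branch that handles the "remote" driver, that would match the behaviour described above.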


So, long story short, to me it looks like you are using Nutch in an incorrect manner.

* The current implementation in Nutch is designed to clean up the temp directory ONLY if you are NOT running in "remote" mode.
* If you would like to run in "remote" mode, then you have to either spin up a Selenium Grid and provide its IP and port in your nutch-site.xml [ you can refer to my blog https://rationaleemotions.wordpress.com/2012/01/23/setting-up-grid2-and-working-with-it/ to learn how to set up your own grid ] (or)
* follow the instructions documented at https://github.com/momer/nutch-selenium-grid-plugin and leverage Docker.

(or) switch to using the Firefox/Chrome browser by changing
 <property>
  <name>selenium.driver</name>
  <value>remote</value>
 </property>
to
 <property>
  <name>selenium.driver</name>
  <value>firefox</value>
 </property>

I hope that gives you some information to get started.

PS: I haven't tried running Nutch locally; all of this information is based on what I learned from looking at the codebase.


Thanks & Regards
Krishnan Mahadevan

PeterJeffreyGale

Feb 11, 2016, 9:45:20 AM
to Selenium Users
I don't use nutch, but I've seen issues with the temp files being left.

António Fonseca

Feb 11, 2016, 9:54:42 AM
to seleniu...@googlegroups.com
Hi Krishnan,

First of all, thank you very much for the effort and the tips.

Indeed, I have the remote option set (which might be a wrong configuration), so the rest of this message goes a bit off-topic.

The reason I use remote to reach Firefox is that when I use Firefox directly I keep getting "could not connect to Firefox on port 7055 after 45000 ms", and all the answers I found for this pointed to a wrong Firefox version, which I triple-checked and had the correct one.
With Chrome the issue is a different one, related to an issue marked as fixed in 2011: every time Chrome opens it listens on a new port instead of reusing the old one, and after a while I run out of ports.

If I can solve either of those, I can indeed stop using the "remote" option.
I am still running the test with an external script cleaning the /tmp folder every 2 minutes to see if it's a possible solution for me.

Thank you
Best regards


António Fonseca

Feb 11, 2016, 10:40:05 AM
to seleniu...@googlegroups.com
Hi,

Meanwhile, the Chrome issue is the one I reported in https://github.com/SeleniumHQ/selenium/issues/1585. I could not find the original issue page, but it was fixed on:

(Should I close this topic and open a new one for the issue I'm having with Chrome?)

Thank you
Best regards

Krishnan Mahadevan

Feb 11, 2016, 12:25:13 PM
to Selenium Users
Antonio,
So I guess 1585 resolves your Chrome issue [ ephemeral port issue ].

For Firefox, I think you should seriously consider trying to figure out what is preventing Firefox from spawning locally.

If that doesn't work, you may want to try out the Grid alternative.

On a side note, how many threads are you spawning? I think if you try to spawn more than 20-odd Firefox instances concurrently, you may see the issue that you called out earlier.

Alternatively, you can also consider adding PhantomJS support locally to your code by editing

./src/plugin/lib-selenium/src/java/org/apache/nutch/protocol/selenium/HttpWebClient.java



Thanks & Regards
Krishnan Mahadevan

António Fonseca

Feb 11, 2016, 12:42:48 PM
to seleniu...@googlegroups.com
Hi Krishnan,

I created 1585 myself and am still waiting for an answer, since I still see that happening
(although it is mentioned as fixed in 2011).

For Firefox, I am indeed using more than 20 threads; I will try that, thank you :)

About the grid: since Nutch supports clustering, my intention was to have a cluster running Nutch, each node with its own single Selenium,
but if the grid makes it work, it's a possibility.

Meanwhile, deleting the temp files destroyed the tests due to race conditions between the delete and the Firefox threads.
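A possible way to reduce that race is to delete only profile folders that have not been modified for some time, rather than everything in /tmp. The following is a minimal sketch under assumptions, not something tested with Nutch: the function name clean_old_profiles and the 10-minute threshold are hypothetical, the folder pattern follows the names reported in this thread, and -maxdepth and -mmin assume GNU find. A folder's modification time only changes when entries directly inside it are created or removed, so a long-running Firefox could still own a folder that looks old.

```shell
#!/bin/sh
# Hypothetical helper: remove WebDriver profile folders directly under $1
# whose modification time is more than $2 minutes in the past.
# Folders that do not match the name pattern are left untouched.
clean_old_profiles() {
  find "$1" -maxdepth 1 -type d -name 'anonymous*webdriver-profile' \
    -mmin +"$2" -exec rm -rf {} + 2>/dev/null
  return 0  # removal failures are ignored so a periodic loop keeps running
}

# Example: every 2 minutes, drop profiles untouched for more than 10 minutes.
# while true; do clean_old_profiles /tmp 10; sleep 120; done
```

The threshold would need to be longer than the slowest page fetch, otherwise the same race can reappear.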

Thank you
Best regards
