Using Selenium for Harvesting / Web Scraping / Data Extraction


Harald Koschinski

Aug 31, 2016, 2:10:56 PM8/31/16
to Selenium Users
Hi,
we are thinking about using Selenium not only for testing and load generation but also for harvesting / web scraping / extracting data from online databases.
Is this a good idea, or are there better tools for this? Limitations? Advantages?
Is anybody here using Selenium for such work?
Any hints are appreciated.

Harald Koschinski

Sep 16, 2016, 1:10:40 PM9/16/16
to Selenium Users
I am sad that nobody has answered :-(
Does this mean that nobody is using it for harvesting (...)?

⇜Krishnan Mahadevan⇝

Sep 16, 2016, 1:22:44 PM9/16/16
to Selenium Users
Harald,

There is no reason to be sad. You perhaps aren't getting any relevant replies because most people use Selenium for automated testing of their web applications.

But that doesn't mean you can't get adventurous with Selenium.

For example, at the recently concluded Selenium conference here in Bangalore, two guys gave a talk about how they were using Selenium for their actual business concept, viz. submitting job applications on behalf of their users.

For this they had, to some extent, re-engineered the Grid itself and adapted it in an innovative manner.


So I would say, give it a whirl and see where you go with it.

When it comes to web scraping, I am guessing that you would be required to load up the web app inside a browser. Many websites have stringent usage policies which prohibit automation being run against them, so you may have to keep an eye on that.

Limitations: as long as the website doesn't use any Flash content, you will not have any technical challenge automating it.


Thanks & Regards
Krishnan Mahadevan

"All the desirable things in life are either illegal, expensive, fattening or in love with someone else!"
My Scribblings @ http://wakened-cognition.blogspot.com/
My Technical Scribbings @ http://rationaleemotions.wordpress.com/


Harald Koschinski

Sep 17, 2016, 2:34:35 PM9/17/16
to Selenium Users
Hi Krishnan,

thank you very much for your post - very interesting, and now I am no longer sad :-)

Best regards
Harald


David

Sep 17, 2016, 9:21:53 PM9/17/16
to Selenium Users
Depending on what you're harvesting and scraping, it's typically better to do it as efficiently as possible. That usually means using a non-GUI HTTP/REST client to fetch web pages and parse the data out of them, whether the data is HTML/XML/JSON/PDF/text/binary. In some cases you can directly hit the REST APIs (if accessible) that the website itself calls via AJAX, rather than scraping the rendered site. Think code/scripts with HTTP/REST client libraries, and/or tools like curl with scripting.
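
For instance, a rough sketch of that approach in Python (assuming the requests library; the URL and field names are just placeholders):

import requests

# Hit a JSON API the site itself calls via AJAX, if one is accessible.
api_url = "https://example.com/api/items?page=1"   # hypothetical endpoint
response = requests.get(api_url, timeout=30)
response.raise_for_status()

for item in response.json().get("items", []):      # field names are placeholders
    print(item.get("name"), item.get("price"))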

But there are times when you need the content rendered in the browser and then scraped off, or when you need to capture visuals/graphics rendered in the browser rather than basic text or images you can fetch by URL alone. In that case, Selenium comes in handy. And where Selenium fails (say, Flash or Java plugins), you go to some other tool like Sikuli and image-recognition tools.
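
A minimal sketch of that, assuming the Selenium 2/3-era Python bindings (the URL and selector are placeholders):

from selenium import webdriver

driver = webdriver.Firefox()                  # any WebDriver-backed browser works
try:
    driver.get("https://example.com/app")     # hypothetical JavaScript-heavy page
    driver.implicitly_wait(10)                # poll up to 10s when looking up elements
    for row in driver.find_elements_by_css_selector("table#results tr"):
        print(row.text)
    html = driver.page_source                 # or grab the rendered DOM and parse it elsewhere
finally:
    driver.quit()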

But when using Selenium for scraping, you'd probably want to try headless options and only go to GUI mode when debugging, since headless executes faster and won't hog your display/screen, e.g. PhantomJS/GhostDriver, headless Firefox mode, or headless Chrome mode.
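
For example, a headless Firefox sketch (assuming a Selenium and Firefox release recent enough to support the headless flag; the URL is a placeholder):

from selenium import webdriver
from selenium.webdriver.firefox.options import Options

options = Options()
options.add_argument("-headless")          # run Firefox without opening a window
driver = webdriver.Firefox(options=options)
driver.get("https://example.com")          # hypothetical target
print(driver.title)
driver.quit()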

Harald Koschinski

Sep 18, 2016, 2:32:38 PM9/18/16
to Selenium Users
I need the rendered page when harvesting. That was the reason I came to Selenium, so a non-GUI HTTP/REST client is no option for me. Flash is out - that's the past - no longer important. With Selenium I can fetch all the HTML5 stuff using the newest browser. That's what I need.

Headless: I run Selenium/Firefox with a virtual X server (Xvfb), plus x11vnc when debugging. With this config I can have more than 100 Firefox instances running on one server without problems.
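For reference, one way to script that kind of setup from Python (a sketch, assuming the pyvirtualdisplay package, which wraps Xvfb; the URL is a placeholder):

from pyvirtualdisplay import Display
from selenium import webdriver

display = Display(visible=0, size=(1366, 768))   # starts an Xvfb virtual display
display.start()

driver = webdriver.Firefox()                     # Firefox renders into the virtual display
driver.get("https://example.com")                # hypothetical target
print(driver.title)

driver.quit()
display.stop()
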
Is there a better way?

David

Sep 20, 2016, 3:14:54 AM9/20/16
to Selenium Users
Sounds like you got your solution then. How's it working out so far?

Noilson Caio

Sep 20, 2016, 11:14:45 PM9/20/16
to seleniu...@googlegroups.com
Basically you would have a generic harvest loop like this
(example - sniffing email addresses; a sketch in Python, where requests and the email regex are assumptions):

# use a spider procedure to visit all pages and URLs,
# then for each fetched page:
import re
import requests

alvo = "url"                                           # alvo = target page
res = requests.get(alvo)
addresses = re.findall(r"[\w.+-]+@[\w-]+\.[\w.-]+", res.text)


"Is this a good idea or are there better tools for this ? Limitations? Advantages?"

1 - Yes, it is a good idea. Maybe not the best idea =];
2 - The simpler, the better. Sometimes a simple script may be enough.
3 - Another way is public sources (the boring way): https://github.com/laramies/theHarvester

I have never seen a harvesting process done with SeleniumHQ before. Can you test it and share with us?

PS: English is not my first language - sorry for any mistakes =]



