Re: [hackerspaces] Academic scraping


Bryan Bishop

Jan 14, 2013, 12:59:38 PM
to Lokkju Brennr, Bryan Bishop, science-libe...@googlegroups.com, Hackerspaces General Discussion List
On Mon, Jan 14, 2013 at 11:51 AM, Lokkju Brennr wrote:
> see:
> http://scraperwiki.org
> http://scrapy.org/
>
> Once you have the raw data in a central location, it becomes much easier for
> someone specialized in data processing to convert it to usable form - even
> if it is hard to parse. It does help to keep the metadata though...

One of my favorite scraping methods at the moment is phantomjs, a
headless wrapper around webkit.

http://phantomjs.org/
https://github.com/ariya/phantomjs
https://github.com/kanzure/pyphantomjs
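
A rough sketch of what driving it headlessly can look like from python
(untested; the subprocess/temp-file plumbing is just one way to do it,
but page.open and page.content are standard phantomjs API):

import subprocess
import tempfile

# phantomjs script: load a page, let webkit run its javascript,
# then dump the rendered DOM to stdout
PHANTOM_SCRIPT = """
var system = require('system');
var page = require('webpage').create();
page.open(system.args[1], function (status) {
    if (status === 'success') {
        console.log(page.content);
    }
    phantom.exit();
});
"""

def render(url):
    # assumes the phantomjs binary is on your PATH
    with tempfile.NamedTemporaryFile(suffix=".js", delete=False) as f:
        f.write(PHANTOM_SCRIPT.encode("utf-8"))
        script_path = f.name
    return subprocess.check_output(["phantomjs", script_path, url])

if __name__ == "__main__":
    print(render("http://example.com/"))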

But for academic projects, I highly recommend zotero's translators.

https://github.com/zotero/translators

Here's why: there's already a huge userbase of zotero users actively
maintaining these scrapers, so when a translator breaks it gets fixed
almost immediately. The translators are all written in javascript and
extract not just the link to the pdf but the maximum amount of
metadata. With the help of the zotero/translation-server project, they
can be run headlessly.

https://github.com/zotero/translation-server
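
Very rough sketch of talking to a local translation-server instance
with python-requests (assuming the default port 1969 and the JSON
request format for the /web endpoint; check the project README, since
the exact payload has changed between versions, and the URL below is
just a placeholder):

import json
import requests

def translate(url):
    # sessionid is an arbitrary client-chosen string
    payload = {"url": url, "sessionid": "abc123"}
    resp = requests.post(
        "http://127.0.0.1:1969/web",
        data=json.dumps(payload),
        headers={"Content-Type": "application/json"},
    )
    resp.raise_for_status()
    # a list of items with title, creators, DOI, attachment links, etc.
    return resp.json()

for item in translate("http://example.com/some-paper"):
    print(item.get("title"))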

I have a demo of this working in irc.freenode.net ##hplusroadmap
(paperbot): the bot grabs links from our conversation and posts the
pdfs so that we don't have to ask each other for copies.

- Bryan
http://heybryan.org/
1 512 203 0507

Piotr Migdal

Jan 30, 2013, 3:23:07 PM
to science-libe...@googlegroups.com, Lokkju Brennr, Bryan Bishop, Hackerspaces General Discussion List
I typically use Requests (for downloading pages) + BeautifulSoup (for extracting data from HTML files).

Links:
http://docs.python-requests.org/en/latest/
http://www.crummy.com/software/BeautifulSoup/
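
A minimal sketch of that combination (the URL is just a placeholder):

import requests
from bs4 import BeautifulSoup

resp = requests.get("http://example.com/")
resp.raise_for_status()

# parse the HTML and pull out every link on the page
soup = BeautifulSoup(resp.text)
for a in soup.find_all("a", href=True):
    print(a["href"], a.get_text(strip=True))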

Regards,
Piotr

Bryan Bishop

Jan 30, 2013, 4:09:25 PM
to Piotr Migdal, Bryan Bishop, science-libe...@googlegroups.com, Lokkju Brennr, Hackerspaces General Discussion List
On Wed, Jan 30, 2013 at 2:23 PM, Piotr Migdal <pmi...@gmail.com> wrote:
> I typically use Requests (for downloading pages) + BeautifulSoup (for
> extracting data from HTML files).
>
> Links:
> http://docs.python-requests.org/en/latest/
> http://www.crummy.com/software/BeautifulSoup/

Many years ago, someone did a comparison of lxml versus BeautifulSoup
and found that while BeautifulSoup has a non-sucky API, it tends to be
slower than lxml. I am not sure if this is still the case, because
even two years ago is ancient history by now.

I enjoy python-requests as much as everyone else. However, I find that
some servers implement non-standard HTTP, for example by rejecting
otherwise standard headers... so my solution was to write this to
patch requests:

https://github.com/kanzure/careful-requests

(because kennethreitz rejected related changes). So, this might be
helpful for scraping delicate servers. For unit testing a scraper, I
like to use:

https://github.com/gabrielfalcao/HTTPretty
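
A small sketch of what that looks like (the URL and page body are made
up for illustration): register a canned response, then let the scraper
hit it with requests without touching the network.

import httpretty
import requests

@httpretty.activate
def test_scraper_finds_title():
    httpretty.register_uri(
        httpretty.GET,
        "http://example.com/paper",
        body="<html><head><title>Some Paper</title></head></html>",
        content_type="text/html",
    )
    # any requests call to the registered URL now gets the canned body
    resp = requests.get("http://example.com/paper")
    assert "Some Paper" in resp.text

test_scraper_finds_title()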

Piotr Migdal

Jan 31, 2013, 7:07:38 AM
to science-libe...@googlegroups.com, Lokkju Brennr, Bryan Bishop, Hackerspaces General Discussion List
I switched from lxml/etree to BeautifulSoup; the latter is much cleaner.
For me parsing time is not an issue, since downloading the page takes longer anyway.

But... you can use BS on top of lxml:
http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser
(getting lxml's speed AND BS's clean API)
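
Which ends up being a one-line change (requires lxml to be installed;
the markup here is just a toy example):

from bs4 import BeautifulSoup

# same clean BeautifulSoup API, lxml doing the parsing underneath
soup = BeautifulSoup("<html><body><p>hi</p></body></html>", "lxml")
print(soup.p.get_text())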


Regards,
Piotr
