HTML parsing made easy in Refine


Thad Guidry

Nov 20, 2010, 11:29:02 PM
to google...@googlegroups.com
After spending most of the day BANGING my head against regex and GREL
to handle HTML parsing, I thought: there MUST be a better way to
parse HTML!

I know several of you have thought the same thing. So, I took the
time today to find out where and how this could be improved directly
in Refine or with an extension of our own.
It just so happens Refine already has a wonderful extension: another
language entirely, Jython.

Enter BeautifulSoup http://www.crummy.com/software/BeautifulSoup/
(love the name?), a Python library, usable from Jython, for powerful
HTML parsing and entity extraction.
(It's not the fastest parser out there, precisely because of how much
it simplifies for you; you'll understand more if you read the docs.)

Here's more on how to use it easily within Google Refine:
http://code.google.com/p/google-refine/wiki/StrippingHTML
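
For a taste, here's a minimal sketch of the kind of Jython expression
that wiki page describes, stripping all tags from a cell's HTML (it
assumes BeautifulSoup.py has been placed on Refine's Jython path, as
the page explains):

 from BeautifulSoup import BeautifulSoup

 # parse the cell's HTML; BeautifulSoup tolerates broken markup
 soup = BeautifulSoup(value)

 # findAll(text=True) yields every text node; joining them drops the tags
 return ''.join(soup.findAll(text=True))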

Enjoy!
-Thad
http://savetheworldnotyourmoney.blogspot.com/

Randall Amiel

Nov 21, 2010, 2:00:11 AM
to google...@googlegroups.com

XPath and XQuery, too.

Thad Guidry

Nov 21, 2010, 11:46:58 AM
to google...@googlegroups.com
Randall,

Yes, you could pull in an XPath library from Jython, since Jython is
Java underneath; however,

I've found that ElementTree is a bit more productive for parsing XML.
See here: http://code.google.com/p/google-refine/wiki/Jython
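
In a Refine Jython expression that might look something like this (a
sketch; the 'title' element name is only an example, not anything
Refine-specific):

 from xml.etree import ElementTree

 # parse the cell's XML string into an element tree
 root = ElementTree.fromstring(value)

 # findtext() returns the text of the first matching child element,
 # or None if nothing matches
 return root.findtext('title')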

-Thad

Randall Amiel

Nov 21, 2010, 5:01:34 PM
to google...@googlegroups.com

Universally, it seems the industry is moving towards XPath and XQuery to traverse HTML > XML.


Tim McNamara

Nov 21, 2010, 5:22:10 PM
to google...@googlegroups.com
On Mon, Nov 22, 2010 at 11:01 AM, Randall Amiel <randy1...@gmail.com> wrote:
> Universally, it seems the industry is moving towards XPath and XQuery to traverse HTML > XML.


Please don't be offended, but I find comments like this to be unproductive. Prefixing a statement with "Universally" sounds like 'my way is right, your way is wrong'. I can't think of any instances in computer science where a single approach is universally appropriate. Let's not get into debates about the right way to do something. David and co seem to have taken pains to make sure that Refine doesn't impose a particular workflow.

I think Thad has provided the community a very valuable service. He's demonstrated that it's possible to extend Google Refine using any Python library that runs on the Jython implementation. This is new to me.

I feel that evangelising a single tool for HTML parsing could lead to bitterness in the community. Deciding on the right tool is sometimes like choosing between blue and green; some things just feel different to different people. I prefer to use Parsley[1] when I am collaborating with others, CSS selectors with lxml[2], or directed machine learning[3]. While those approaches may use XPath selectors at a low level, none of them exposes XPath selectors to me as its interface. I find BeautifulSoup very slow, but I think it provides an excellent balance of performance & features when processing a single page, which is what this use case is about.
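
For the curious, the lxml approach looks roughly like this (plain
CPython, not Jython; lxml is a C extension, so it won't run inside
Refine's Jython interpreter, and the selector and page_source variable
are only illustrative):

 from lxml import html

 # parse an HTML string into an element tree
 tree = html.fromstring(page_source)

 # cssselect() takes a CSS selector rather than an XPath expression
 for div in tree.cssselect('div.address'):
     print div.text_content()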
 
Tim
@timClicks


Randall Amiel

Nov 21, 2010, 5:27:43 PM
to google...@googlegroups.com

No worries, just stating my *opinion*. I always like taking different approaches to solving problems; otherwise there would be no innovation.

I'm sorry for my word choice. I apologize.


Thad Guidry

Nov 21, 2010, 9:00:18 PM
to google...@googlegroups.com
Thanks Tim and Randall for your insight. Always good to hear how
others define their workflows.

I feel Google Refine doesn't try to be too restrictive currently, and
as Stefano stated in another thread, it's about openness and seeing
how Refine begins to be utilized.

By the way, before both of you replied, I had already updated the wiki
here to note that "It very much depends on the input which parser
works better."
http://code.google.com/p/google-refine/wiki/Jython

Yeah, BeautifulSoup is slower than other implementations, but as noted
in the author's docs and in my conversation with him, it does indeed
try to strike a balance and is not a good fit for everything. He
mentions in his docs that he prefers other XML parsers, since
BeautifulSoup was engineered to handle HTML rather than XML.

I think everyone agrees that adaptability is what usually wins hearts
and minds, and I think Google Refine has already proved that for many
of us.

-Thad

Stefano Mazzocchi

Nov 22, 2010, 11:38:00 AM
to google...@googlegroups.com
Just FYI (I don't have the cycles to do this right now), but we have a fully functional (and production-grade!) HTML parser in Acre (our application hosting environment that runs *.freebase.com and *.freebaseapps.com, and which we open-sourced at http://code.google.com/p/acre/), and I don't think it would be too much work to embed it in Refine.

Would be pretty nice to be able to do something like this:

 value.parseHTML().find("#body p.address")[2]

Randall: XQuery is massively overdesigned for scraping, and XPath has the problem that many people these days know how to use CSS selectors and/or jQuery, while hardly anybody has real-life XPath experience (or knowledge). [As the original designer of Apache Cocoon, which was heavily XSLT/XPath based, I have deep scars that tell me just that.]
--
Stefano Mazzocchi  <stef...@google.com>
Software Engineer, Google Inc.

Thad Guidry

Nov 22, 2010, 12:13:23 PM
to google...@googlegroups.com
On Mon, Nov 22, 2010 at 10:38 AM, Stefano Mazzocchi <stef...@google.com> wrote:
> Just FYI (I don't have the cycles to do this right now), but we have a fully
> functional (and production grade!) HTML parser in Acre (our application
> hosting environment that runs *.freebase.com and *.freebaseapps.com and we
> open sourced at http://code.google.com/p/acre/) and I don't think it would
> be too much work to embed it in Refine.
> Would be pretty nice to be able to do something like this
>  value.parseHTML().find("#body p.address")[2]

> Stefano Mazzocchi  <stef...@google.com>
> Software Engineer, Google Inc.
>

Stefano,

Cool!

Is this the right source area for the parsing utilities in Acre?
http://code.google.com/p/acre/source/browse/#svn/trunk/webapp/WEB-INF/src/com/google/util/DOM

-Thad

Stefano Mazzocchi

Nov 22, 2010, 12:23:52 PM
to google-refine
Yes, that's right.

Note that in order to implement something like .find(selector) in Acre I used Sizzle (the selector engine that powers jQuery) [see http://sizzle.freebaseapps.com/], but that requires JavaScript interpretation on top of the jsDOM that Acre provides.

GREL is not implemented in JavaScript, but Refine does ship with a built-in JavaScript engine (Rhino, which is used by the extension framework and is the same one that Acre uses), so it shouldn't be too hard to hook the two together (although David is the one to talk to for that, since he's the one who wrote the GREL interpreter).

--

David Huynh

Nov 22, 2010, 1:10:02 PM
to google...@googlegroups.com
How many new functions do we need? I can think of 3 to start with:

- find(xmlElement, selector)
- getAttribute(xmlElement, attributeName)
- getText(xmlNode)
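
For example, the three might compose like this (hypothetical syntax,
since none of these functions exist in GREL yet, and reusing Stefano's
earlier parseHTML idea):

 getText(find(parseHTML(value), "#body p.address")[2])
 getAttribute(find(parseHTML(value), "a")[0], "href")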

David

Stefano Mazzocchi

Nov 22, 2010, 1:23:13 PM
to google-refine
Better to call it domElement, as this doesn't need to come from XML; it could come from HTML too.

We also need a way to get the domElement from a string, so something like parseHTML(string) and parseXML(string).

Also, I would use "attr" instead of getAttribute (to keep it short and go with jQuery terminology). And I think getText() could be implicit, so something like the following could extract "bar" from

<html>
<body>
<div class="whatever">foo|bar</div>
</body>
</html>

with

 value.parseHTML().find(".whatever")[0].split("|")[1]

Note that find(selector) must return an array of nodes, so it would be cool if

 value.parseHTML().find(".whatever").split("|")[1]

worked too when the array returned by find(selector) contains only a single node.

Thad Guidry

Nov 22, 2010, 1:39:40 PM
to google...@googlegroups.com
Yeap. All that makes sense. Yeap, knew about Rhino.

Although, don't you agree it's not like we have to reinvent EVERYTHING
with GREL syntax, but only where it makes sense to do so for common
folks' needs (which apparently are ever expanding, with our Swiss Army
tool called Google Refine)?

Mostly, it would be nice to have easier screen-scraping tools/commands
in GREL to leverage with Add column by fetching URLs. They don't exist
in GREL, so I just leaned on some Jython tools. I agree that having
similar sets of tools/commands in GREL would make things easier.

I also think a GUI interface to the idea of screen scraping is
ENTIRELY optional within Refine, and I really don't see that working
itself out in Refine, since webpage metadata is on a case-by-case,
domain-by-domain basis. Do you?

-Thad

Stefano Mazzocchi

Nov 22, 2010, 1:52:29 PM
to google-refine
<shrug>

I don't have a strong opinion either way. I've used every piece of scraping technology out there, and David and I wrote several of our own (Solvent, Crowbar).

On Mon, Nov 22, 2010 at 10:39 AM, Thad Guidry <thadg...@gmail.com> wrote:
> Yeap. All that makes sense. Yeap, knew about Rhino.
>
> Although, don't you agree it's not like we have to reinvent EVERYTHING
> with GREL syntax, but only where it makes sense to do so for common
> folks' needs (which apparently are ever expanding, with our Swiss Army
> tool called Google Refine)?
>
> Mostly, it would be nice to have easier screen-scraping tools/commands
> in GREL to leverage with Add column by fetching URLs. They don't exist
> in GREL, so I just leaned on some Jython tools. I agree that having
> similar sets of tools/commands in GREL would make things easier.

I think that your experiments with Jython are very useful. 

My opinion is that GREL should be a streamlined condensation of the tools and operations that people find they need the most when working with data in their real-life workflows. The benefit of that is that we can make something very compact like

 value.parseHTML().find(".whatever").split("|")[1]

once we know what we need.

By no means should people stop experimenting with Jython and Clojure support in Refine; that's why we have support for additional languages that already come with extensive APIs.

My suggestion for integrating Acre's parsing machinery is merely that we already have it, and it's working well for us in many apps we use for Freebase, so it might be useful here too.

> I also think a GUI interface to the idea of screen scraping is
> ENTIRELY optional within Refine, and I really don't see that working
> itself out in Refine, since webpage metadata is on a case-by-case,
> domain-by-domain basis. Do you?

Agreed.

Note that David and I, in a past life, wrote a tool called "Solvent"; it's a Firefox extension that provides some very interesting capabilities (see a screencast of it here):

 http://simile.mit.edu/solvent/screencasts/solvent_screencast.swf

The code is open source here:

 http://simile.mit.edu/repository/solvent/trunk/

I'm not sure it still works, since it hasn't been touched in years, but some of the concepts there could still be very useful.

Thad Guidry

Nov 22, 2010, 2:21:01 PM
to google...@googlegroups.com
> I think that your experiments with Jython are very useful.

Glad you think so. (And I found a few issues along the way that David
patched up quickly.)

> My opinion is that GREL should be a streamlined condensation of the tools
> and operations that people find they need the most when working with data in
> their real-life workflows. The benefit of that is that we can make something
> very compact like
>  value.parseHTML().find(".whatever").split("|")[1]
> once we know what we need.

Yeap, agreed. Wire it up in GREL once we discover what is useful
based on community feedback.

> By no means people should stop experimenting with Jython and Clojure support
> in Refine, that's why we have support for additional languages that already
> come with extensive APIs.

Great. Then I'll try to find a few more diamonds in Refine through
experimentation ;)

> My suggestion for Acre's parsing machinery integration is merely that we
> already have it and it's working well for us in many apps we use for
> Freebase so it might be useful here too.

Yeah, I figured that.

> Agreed.
> Note that David and I, in a past life, wrote a tool called "solvent", it's a
> firefox extension that does provide some very interesting capabilities (see
> a screencast of it here)
>  http://simile.mit.edu/solvent/screencasts/solvent_screencast.swf
> the code is open source here
>  http://simile.mit.edu/repository/solvent/trunk/
> although I'm not sure it still works, it hasn't been touched in years, but
> some of the concepts there could still be very useful.
>

> Stefano Mazzocchi  <stef...@google.com>
> Software Engineer, Google Inc.

Oh S&*#, this team has a terrific background on scraping! Nice little
extension; it flows much better than the iMacros extension.

OK, you have some work to do and bugs to fix in Refine. Thanks,
Stefano, for the highlights and possible roadmaps ahead.

-Thad

Tim McNamara

Nov 22, 2010, 4:39:56 PM
to google...@googlegroups.com
I would like a getAllText function, which would recursively get the text from descendants as well.
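
In ElementTree terms, a rough sketch of what such a function would do
(purely illustrative; the GREL function doesn't exist yet, and the name
get_all_text is mine):

 def get_all_text(elem):
     # this element's own leading text, if any
     pieces = [elem.text or '']
     for child in elem:
         # recurse into each child, then pick up the text that
         # follows the child's closing tag (its "tail")
         pieces.append(get_all_text(child))
         pieces.append(child.tail or '')
     return ''.join(pieces)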

Tim