I know several of you have thought the same thing. So, I took the
time today to find out where and how this could be improved directly
in Refine or with an extension of our own.
It just so happens that Refine already has a wonderful extension point
built in: another language, Jython.
Enter BeautifulSoup http://www.crummy.com/software/BeautifulSoup/
(love the name?), a Python library, usable from Jython, for powerful
HTML parsing and entity extraction.
(It's not the fastest parser out there, because of how much it
simplifies for you. You'll understand more if you read the docs.)
Here's more on how to use it easily within Google Refine:
http://code.google.com/p/google-refine/wiki/StrippingHTML
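To give a feel for the kind of tag stripping discussed here, below is a minimal sketch. It deliberately does not use BeautifulSoup: it swaps in the standard library's HTMLParser so it runs without any extra install. The names TextCollector and strip_html are hypothetical helpers, not part of Refine or BeautifulSoup.

```python
try:
    from html.parser import HTMLParser   # Python 3
except ImportError:
    from HTMLParser import HTMLParser    # Python 2 / Jython


class TextCollector(HTMLParser):
    """Hypothetical helper: keeps text nodes and ignores the tags."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.chunks = []

    def handle_data(self, data):
        # Called for the text between tags.
        self.chunks.append(data)


def strip_html(value):
    """Return the text of an HTML fragment with whitespace collapsed."""
    collector = TextCollector()
    collector.feed(value)
    collector.close()
    return " ".join("".join(collector.chunks).split())
```

In a Refine Jython expression the cell content should arrive as `value`, so the last line of the expression would presumably be something like `return strip_html(value)`. Note that this sketch keeps the text of script and style elements and does nothing special for badly broken markup, which is exactly where a more forgiving parser such as BeautifulSoup earns its keep.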
Xpath xquery too
Yes, you could just 'import Xpath' using Jython, since Jython runs on
Java; however,
I've found that ElementTree is a bit more productive for parsing XML.
See here: http://code.google.com/p/google-refine/wiki/Jython?
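As a rough illustration of the ElementTree approach, here is a small sketch. The sample XML, its element names, and the helper functions book_titles and title_for_id are made up for the example. Note that ElementTree's findall only understands a limited subset of XPath, not full XPath or XQuery, so the attribute lookup is written as a plain loop to stay compatible with older versions.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; element and attribute names are invented.
SAMPLE = """<catalog>
  <book id="b1"><title>First</title><price>10.50</price></book>
  <book id="b2"><title>Second</title><price>7.25</price></book>
</catalog>"""


def book_titles(xml_text):
    """Return the text of every <title> under a <book>, in document order."""
    root = ET.fromstring(xml_text)
    # "./book/title" is a simple path expression, not full XPath.
    return [title.text for title in root.findall("./book/title")]


def title_for_id(xml_text, book_id):
    """Return the title of the book with the given id, or None."""
    root = ET.fromstring(xml_text)
    for book in root.findall("book"):
        if book.get("id") == book_id:
            return book.findtext("title")
    return None
```

ElementTree expects well-formed XML, so this suits clean XML feeds better than scraped HTML.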
-Thad
It seems the industry is universally moving towards XPath and XQuery to traverse html>xml.
On Nov 21, 2010 11:47 AM, "Thad Guidry" <thadg...@gmail.com> wrote:
Randall,
Yes, you could just 'import Xpath' using Jython, since Jython runs on
Java; however,
I've found that ElementTree is a bit more productive for parsing XML.
See here: http://code.google.com/p/google-refine/wiki/Jython?
-Thad
On Sun, Nov 21, 2010 at 1:00 AM, Randall Amiel <randy1...@gmail.com> wrote:
> Xpath xquery too
...
No worries, just stating my *opinion*. I always like taking different approaches to solving problems; otherwise there would be no innovation.
I'm sorry for my word choice; I apologize.
On Nov 21, 2010 5:22 PM, "Tim McNamara" <mcnama...@gmail.com> wrote:
On Mon, Nov 22, 2010 at 11:01 AM, Randall Amiel <randy1...@gmail.com> wrote:
>
> Universally it ...
I feel Google Refine doesn't try to be too restrictive currently, and
as Stefano stated in another thread, it's about openness and seeing
how Refine begins to be utilized.
By the way, before both of you replied, I had already updated the wiki
here that "It very much depends on the input which parser works
better."
http://code.google.com/p/google-refine/wiki/Jython
Yeah, BeautifulSoup is slower than other implementations, but as the
author notes, and as he told me in conversation, it tries to
strike a balance and is not a good fit for everything. He mentions
in his docs that he prefers other parsers for XML, since BeautifulSoup
was engineered to handle HTML rather than XML.
I think everyone agrees that adaptability is what usually wins hearts
and minds, and I think Google Refine has already proved that for many
of us.
-Thad
> Stefano Mazzocchi <stef...@google.com>
> Software Engineer, Google Inc.
>
Stefano,
Cool!
Is this the right source area for the parsing utilities in Acre?
http://code.google.com/p/acre/source/browse/#svn/trunk/webapp/WEB-INF/src/com/google/util/DOM
-Thad
Yeap. All that makes sense. Yeap, knew about Rhino.
Although, don't you agree that we don't have to reinvent EVERYTHING
in GREL syntax, only where it makes sense for the needs of common folk
(which apparently are ever expanding, with our Swiss Army tool
called Google Refine)?
Mostly, it would be nice to have easier screen scraping tools/commands
in GREL to use with "Add column by fetching URLs". They don't
exist in GREL, so I just leveraged some Jython tools. I agree
that having a similar set of tools/commands in GREL would make things
easier.
I also think a GUI for screen scraping is
ENTIRELY optional within Refine, and I really don't see that working
out in Refine, since webpage metadata varies case by case and
domain by domain. Do you?
Glad you think so. (and found a few issues along the way that David
patched up quick)
> My opinion is that GREL should be a streamlined condensation of the tools
> and operations that people find they need the most when working with data in
> their real-life workflows. The benefit of that is that we can make something
> very compact like
> value.parseHTML().find(".whatever").split("|")[1]
> once we know what we need.
Yeap, agreed. Wire it up in GREL once we discover what is useful
based on community feedback.
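For anyone who wants to experiment before such a GREL function exists, here is a rough Jython-style analogue of the quoted chain. It assumes that find(".whatever") is meant to select by CSS class and that only the first match matters; both are guesses about the proposal, not documented behavior. ClassTextFinder and second_field are hypothetical names, and the standard library HTMLParser stands in for whatever parser GREL would use.

```python
try:
    from html.parser import HTMLParser   # Python 3
except ImportError:
    from HTMLParser import HTMLParser    # Python 2 / Jython


class ClassTextFinder(HTMLParser):
    """Collects the text inside the first element carrying a given CSS class.

    Limitation: unclosed void tags such as <br> inside the matched element
    throw off the depth counter; <br/> is handled correctly.
    """

    def __init__(self, css_class):
        HTMLParser.__init__(self)
        self.css_class = css_class
        self.depth = 0       # > 0 while inside the matched element
        self.done = False    # True once the matched element has closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth > 0:
            self.depth += 1
        elif self.css_class in (dict(attrs).get("class") or "").split():
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def second_field(value, css_class="whatever", sep="|"):
    """Approximate value.parseHTML().find(".whatever").split("|")[1]."""
    finder = ClassTextFinder(css_class)
    finder.feed(value)
    finder.close()
    parts = "".join(finder.chunks).split(sep)
    # Stripping whitespace and returning None on a miss are choices made
    # for this sketch, not something the quoted GREL chain specifies.
    return parts[1].strip() if len(parts) > 1 else None
```

The amount of code needed here is a fair argument for the compact GREL form once the community knows which operations it actually needs.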
> By no means people should stop experimenting with Jython and Clojure support
> in Refine, that's why we have support for additional languages that already
> come with extensive APIs.
Great. Then I'll try to find a few more diamonds in Refine through
experimentation ;)
> My suggestion for Acre's parsing machinery integration is merely that we
> already have it and it's working well for us in many apps we use for
> Freebase so it might be useful here too.
Yeah, I figured that.
> Agreed.
> Note that David and I, in a past life, wrote a tool called "solvent", it's a
> firefox extension that does provide some very interesting capabilities (see
> a screencast of it here)
> http://simile.mit.edu/solvent/screencasts/solvent_screencast.swf
> the code is open source here
> http://simile.mit.edu/repository/solvent/trunk/
> although I'm not sure it still works, it hasn't been touched in years, but
> some of the concepts there could still be very useful.
>
> Stefano Mazzocchi <stef...@google.com>
> Software Engineer, Google Inc.
Oh S&*#, this team has a terrific background in scraping! Nice
little extension. Flows much better than the iMacros extension.
OK, you have some work to do and bugs to fix in Refine. Thanks
Stefano for the highlights and possible roadmaps ahead.
-Thad