Re: [Freebase-discuss] ScraperWiki.com and Google Refine


Max Ogden

Jan 2, 2011, 3:26:55 PM
to scrap...@googlegroups.com, Freebase.com discussion list, google...@googlegroups.com
I asked on the Google Refine group if there is a way to run Refine headlessly:

Basically, we'd need a Java dev to hack on the code. I have been learning Java for the last week to write a Refine extension (https://github.com/maxogden/refine-uploader), but it would be nice to get some help building a command-line interface to Google Refine that allows you to load in data, run extracted operation histories against that data, and then export the data.

I just made a screencast that demonstrates how to generate such an operation history: http://vimeo.com/18351837

It's the same result as writing a ScraperWiki parser, but instead of having to use a scripting language you can use an easy-to-understand web app UI. Imagine being able to train anyone who can use Gmail to write scrapers visually, and then have them hit a button in Google Refine and have their Refine operation histories become ScraperWiki entries.

Another neat thing would be to make ScraperWiki a reconciliation server... see http://code.google.com/p/google-refine/wiki/ReconciliationServiceApi?redir=1 and https://github.com/ldodds/pho-reconcile for more info on that.

Shawn Simister

Jan 4, 2011, 5:15:23 PM
to google...@googlegroups.com
Aren't there usually a lot of actions in the operation history that are specific to the current data set? Like, "Change the value at (5,7) to 'Verizon Center'" or "delete rows 13,14 & 15". It sounds like all you really need is a tool that can apply GREL expressions to spreadsheets.

From browsing through the source code, the GREL interpreter looks like it's self-contained enough that it could easily be extracted from the Refine project and used in any number of stand-alone command-line tools.

Shawn

Tim McNamara

Jan 4, 2011, 5:33:55 PM
to google...@googlegroups.com, scrap...@googlegroups.com
Apologies for cross-posting.

On Mon, Jan 3, 2011 at 9:26 AM, Max Ogden <maxo...@gmail.com> wrote:
I asked on the Google Refine group if there is a way to run Refine headlessly:

Basically, we'd need a Java dev to hack on the code. I have been learning Java for the last week to write a Refine extension (https://github.com/maxogden/refine-uploader), but it would be nice to get some help building a command-line interface to Google Refine that allows you to load in data, run extracted operation histories against that data, and then export the data.

I don't think that this is necessary*. Google Refine is a webapp. Therefore, it's conceivable that anything that understands HTTP could be used to clean dirty data.
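
To make that concrete, here is a minimal (untested) sketch in Python; the endpoint is one the Refine web UI itself calls, and it assumes a local instance on the default port 3333:

import urllib2, json

# Ask a locally running Refine instance for its project list over plain HTTP,
# the same command the browser front page uses.
response = urllib2.urlopen("http://127.0.0.1:3333/command/core/get-all-project-metadata")
print json.loads(response.read())["projects"]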

Tim
@timClicks

*probably advisable, but not necessary

Thad Guidry

Jan 4, 2011, 5:35:04 PM
to scrap...@googlegroups.com, google...@googlegroups.com
Max,

Your idea of a more visual scraping tool is what I'm hoping gets
written or conceived, maybe as a Google Refine extension, who
knows. The committers on Google Refine (I think Iain Sproat) have
mentioned that somehow, somewhere, a host could be used as the
intermediary repository for Refine cleanup processes or "scraping".

A stronger connection via extensions in Refine would perhaps allow
ScraperWiki to be that host. And as Julian Todd mentioned, it looks
like your basic Views can be extended with Python to actually
function as a reconciliation service. Coolness indeed. (All the
while, there is Google Fusion Tables as a very neat layer that folks can
add to the mix, affording even richer views and animation of the
data we scrape and refine.)

Let's all work together to see how far the two communities can
leverage each other's skills. More collaboration and idea sharing is
needed.

Let's keep up the cross-posting on this thread to both community
mailing lists to continue the discussions.

2cents,

-Thad
http://www.freebase.com/view/en/thad_guidry

Thad Guidry

Jan 4, 2011, 5:43:27 PM
to scrap...@googlegroups.com, google...@googlegroups.com
But Tim... there are two camps here, I think: those with a bit of
coding skill, and those with slim to none. I think Max was
thinking about how to enhance or extend Google Refine to allow a more
"visual" scraping ability?

David and Stefano built a cool Firefox browser extension back in
their MIT days, and I'm seriously wondering if that couldn't somehow be
hacked up (around) into a Google Refine extension? Where, if you
want your data or Refine processes public and maintainable, you could
leverage ScraperWiki and its community for hosting and
maintenance of those?

Quoting Stefano:
" Note that David and I, in a past life, wrote a tool called
"solvent", it's a firefox extension that does provide some very
interesting capabilities (see a screencast of it here)

http://simile.mit.edu/solvent/screencasts/solvent_screencast.swf

the code is open source here

http://simile.mit.edu/repository/solvent/trunk/

although I'm not sure it still works, it hasn't been touched in years,
but some of the concepts there could still be very useful. "

-Thad
http://www.freebase.com/view/en/thad_guidry

Max Ogden

Jan 5, 2011, 2:43:46 AM
to scrap...@googlegroups.com, google...@googlegroups.com
Re: "Having a read of the architecture here... Sounds like we want
something to use instead of the client, but that would talk to the
same server backend?"


I think that's the best option. We essentially need to create a client library on top of the Java Refine server. I can almost envision it manifesting itself as a Ruby gem or something similar. Basic pseudo-code functionality:

refine = Refine.new('localhost:3333')
project = refine.newProject({'data' => 'movies.csv', 'projectName' => "Favorite Movies"})
project.applyOperationHistory('my_extracted_operation_history.json')
csv = project.export("csv")

That way we can write small recipes for loading data from, say, ScraperWiki and applying operation histories to that data.

Re: A stronger connection via extensions in Refine would perhaps allow
ScraperWiki to be that host.

I want to increase the participation in and adoption of open data by making it as easy and visual as possible. I've been working on a screencast series that starts in ScraperWiki and then imports data into Google Refine. I'd recommend watching them if you're familiar with the utility of either Refine or ScraperWiki but not yet both. The following two screencasts are contiguous, and in them I talk about the strengths of each tool:

Re: Just to be clear here, this is saying you can use the Views feature to
make a Google Refine backend, based off some data of ScraperWiki.


I'm very interested in implementing reconciliation and going from Google Refine back into ScraperWiki. ScraperWiki isn't a semantic knowledge database in the way that Freebase is, so I'm not sure how useful it would be to think of reconciliation beyond duplicate detection (e.g. semantic classification would be a lot of work to implement on ScraperWiki). I need to learn more about the inner workings of what a reconciliation service should be. The Refine extension (https://github.com/maxogden/refine-uploader) that I've been working on lets you post an entire project as JSON to an HTTP API of your choosing. This is a very 'dumb' way of uploading data, as you aren't taking advantage of any of Refine's nice semantic linking and reconciliation capabilities, but I think it's a necessary low-level tool, and you don't necessarily always need to link data before it's usable.
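
The receiving end can be as simple as something like this rough sketch (it assumes the project arrives as a raw JSON request body and isn't tied to ScraperWiki at all):

import json
from BaseHTTPServer import BaseHTTPRequestHandler, HTTPServer

class UploadHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the posted project JSON and acknowledge it; a real endpoint would store it.
        length = int(self.headers.getheader("content-length"))
        project = json.loads(self.rfile.read(length))
        self.send_response(200)
        self.end_headers()
        self.wfile.write("received a project with %d top-level keys\n" % len(project))

HTTPServer(("", 8000), UploadHandler).serve_forever()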

Sorry for the huge post, but this subject matter is very exciting to me. :)

Max

Christopher Groskopf

Jan 5, 2011, 10:48:34 AM
to google...@googlegroups.com, scrap...@googlegroups.com
Without too many specifics, I want to add my enthusiasm for this idea to the thread. A way of interacting with Refine programmatically has been on my wishlist ever since I first started using it. Just being able to automate applying operation histories would be a huge step, although if I could also do things like project.columns[4].cluster(method="key collision", keying_function="fingerprint", accept_all=True), that would be even better. I considered starting to hack on something like this, but got stymied by the sheer complexity of Refine's codebase and my relatively limited time to work on it.

If someone does start working on an implementation I'd be happy to test/report/contribute/port-to-python!

C

David Huynh

Jan 5, 2011, 8:07:08 PM
to google...@googlegroups.com
A python library for this doesn't seem too hard. I'll try to hack together something soon as a proof of concept.

David

David Huynh

Jan 6, 2011, 2:04:25 PM
to google...@googlegroups.com, scrap...@googlegroups.com, freebase...@freebase.com
I've hacked up a little bit of python to drive Refine. Please give it a try

First, install urllib2_file from https://github.com/seisen/urllib2_file/#readme according to its instructions.

Second, download the attached files.

Third, obviously, make sure Refine is running :-)

Then go into python and try something like this with the attached sample files:

import sys
sys.path.append("/directory/where/you/put refine.py")  # so that "import refine" below can find it
import refine

r = refine.Refine()                                     # talks to Refine on 127.0.0.1:3333 by default
p = r.new_project("/file/path/to/sample/dates.txt")
p.apply_operations("/file/path/to/sample/operations.json")
print p.export_rows()                                   # tab-separated by default
p.delete_project()

I've tested this very briefly on python 2.6. It's only intended as a starting point.

David
dates.txt
refine.py
operations.json

David Huynh

Jan 6, 2011, 3:41:27 PM
to google...@googlegroups.com, scrap...@googlegroups.com, freebase...@freebase.com
Oops, fixed a small bug that showed up on python 2.7.
refine.py

Randall Amiel

Jan 6, 2011, 3:45:29 PM
to google-refine
David:

Say I wanted to scrape my Gmail and give Refine all the HTML; it
would be nice to automatically find keyed datasets. I guess you would
need learning algorithms to detect various types of datasets and where
they are located within the DOM, and then do operations based upon
what the dataset might be (this is under the premise that we do not
currently know what type of dataset we are dealing with).

It would also be nice if Google Refine could automatically detect
primary keys, such as the hash key listed below, and extract all of the
metadata associated with that key.

["12d577d4b25060fd","12d577d4b25060fd","12d577d4b25060fd",1,0,
["^all","^i","^smartlabel_group","^unsub","Google Refine"]
,[]
,"\u003cspan class\u003d\"yP\" email\u003d\"google-refine
+nor...@googlegroups.com\"\u003egoogle-refine+noreply\u003c/span
\u003e","\u0026nbsp;","Digest for google...@googlegroups.com - 14
Messages in 3 Topics","Today\u0026#39;s Topic Summary Group:
http://groups.google.com/group/google-refine/topics Screen
\u0026hellip;",0,"","","Jan 5","Wed, Jan 5, 2011 at 1:43 PM",
1294259681262595,,[]
,,0,["google-refine.googlegroups.com"]
,,[]
,,,[0]
]

It would be nice to have Google Refine help process and define this
kind of structured data.

Thanks
Randall

David Huynh

Jan 6, 2011, 4:35:32 PM
to google...@googlegroups.com
Randall,

I've done something like that in my past life

http://vimeo.com/808235

and have found it to be quite tricky. Since the record and field detection heuristics are not always correct, there always has to be a lever to disengage the automation and do things manually, and programming such a hybrid UI is quite painful ... :-)

I'm also thinking that scraping should be left to ScraperWiki, and Google Refine should just start when there is already structured data. We can make sure that the path between the two is smooth and reversible.

David

Max Ogden

Jan 7, 2011, 3:08:51 AM
to scrap...@googlegroups.com, google...@googlegroups.com, freebase...@freebase.com
Hey David,

I'm not seeing refine.py in the attachments on either of your emails, just dates.txt and operations.json.

Could you maybe put it in a github gist and paste a link? Maybe the attachment is getting lost in the ether

Max

David Huynh

Jan 7, 2011, 3:15:39 AM
to google...@googlegroups.com
Strange... let me just paste the code directly here. It's experimental anyway:

import urllib2_file
import urllib2, urlparse, os.path, time, json

class Refine:
  def __init__(self, server='http://127.0.0.1:3333'):
    self.server = server[:-1] if server.endswith('/') else server  # strip any trailing slash
  
  def new_project(self, file_path, options=None):
    file_name = os.path.split(file_path)[-1]
    project_name = options['project_name'] if options != None and 'project_name' in options else file_name
    data = {
      'project-file' : {
        'fd' : open(file_path),
        'filename' : file_name
      },
      'project-name' : project_name
    }
    response = urllib2.urlopen(self.server + '/command/core/create-project-from-upload', data)
    response.read()
    url_params = urlparse.parse_qs(urlparse.urlparse(response.geturl()).query)
    if 'project' in url_params:
      id = url_params['project'][0]
      return RefineProject(self.server, id, project_name)
    
    # TODO: better error reporting
    return None

class RefineProject:
  def __init__(self, server, id, project_name):
    self.server = server
    self.id = id
    self.project_name = project_name
  
  def wait_until_idle(self, polling_delay=0.5):
    while True:
      response = urllib2.urlopen(self.server + '/command/core/get-processes?project=' + self.id)
      response_json = json.loads(response.read())
      if 'processes' in response_json and len(response_json['processes']) > 0:
        time.sleep(polling_delay)
      else:
        return
  
  def apply_operations(self, file_path, wait=True):
    fd = open(file_path)
    operations_json = fd.read()
    
    data = {
      'operations' : operations_json
    }
    response = urllib2.urlopen(self.server + '/command/core/apply-operations?project=' + self.id, data)
    response_json = json.loads(response.read())
    if response_json['code'] == 'error':
      raise Exception(response_json['message'])
    elif response_json['code'] == 'pending':
      if wait:
        self.wait_until_idle()
        return 'ok'
    
    return response_json['code'] # can be 'ok' or 'pending'
  
  def export_rows(self, format='tsv'):
    data = {
      'engine' : '{"facets":[],"mode":"row-based"}',
      'project' : self.id,
      'format' : format
    }
    response = urllib2.urlopen(self.server + '/command/core/export-rows/' + self.project_name + '.' + format, data)
    return response.read()
    
  def delete_project(self):
    data = {
      'project' : self.id
    }
    response = urllib2.urlopen(self.server + '/command/core/delete-project', data)
    response_json = json.loads(response.read())
    return 'code' in response_json and response_json['code'] == 'ok'


Randall Amiel

Jan 7, 2011, 7:17:45 PM
to google-refine
David/Stefano:

Wonderful video and demonstration. This was the track that I was on,
but I guess most scrapers rely upon Firefox and developing on top of
Firefox as an extension or plugin. Also, I really like the idea of
Crowbar and the fact that it acts as a headless Gecko/XULRunner
server-side engine. However, a key problem I've encountered using
these kinds of engines seems to be that they are not running in a full
browsing environment, so the agent cannot access the DOM after the
onload JavaScript hooks are executed. Do these engines allow us to
scrape content that was not in the HTML page served initially but is
client-side included via AJAX or programmatically computed after the
page was loaded?




On Jan 6, 4:35 pm, David Huynh <dfhu...@gmail.com> wrote:
> Randall,
>
> I've done something like that in my past life
>
> http://vimeo.com/808235
>
> <http://vimeo.com/808235>and have found it to be quite tricky. Since the
> record and field detection heuristics are not always correct, there always
> has to be a lever to disengage the automation and do things manually, and
> programming such a hybrid UI is quite painful ... :-)
>
> I'm also thinking that scraping should be left to ScraperWiki, and Google
> Refine should just start when there is already structured data. We can make
> sure that the path between the two is smooth and reversible.
>
> David
>

David Huynh

Jan 7, 2011, 7:52:53 PM
to scrap...@googlegroups.com, freebase...@freebase.com, google...@googlegroups.com
Oops, I only replied to one mailing list previously. Apologies for the duplicate.

Max, any luck with the script?

David

Stefano Mazzocchi

Jan 7, 2011, 8:09:37 PM
to google...@googlegroups.com
On Fri, Jan 7, 2011 at 4:17 PM, Randall Amiel <randy1...@gmail.com> wrote:
David/Stefano:

Wonderful video and demonstration. This was the track that I was on,
but I guess most scrapers rely upon Firefox and developing on top of
Firefox as an extension or plugin. Also, I really like the idea of
Crowbar and the fact that it acts as a headless Gecko/XULRunner
server-side engine. However, a key problem I've encountered using
these kinds of engines seems to be that they are not running in a full
browsing environment, so the agent cannot access the DOM after the
onload JavaScript hooks are executed. Do these engines allow us to
scrape content that was not in the HTML page served initially but is
client-side included via AJAX or programmatically computed after the
page was loaded?

yes, crowbar was designed precisely for that use case and it does (did?) the job fine.



--
Stefano Mazzocchi  <stef...@google.com>
Software Engineer, Google Inc.

Max Ogden

Jan 7, 2011, 10:42:18 PM
to google...@googlegroups.com, scrap...@googlegroups.com, freebase...@freebase.com
Hey David, thanks so much for doing this so quickly! It works as advertised. Do you mind if I start a github repository for it?

Sorry for the delay in reviewing, I just started a fellowship at Code for America in San Francisco this week so it's been a bit hectic.

Max

David Huynh

Jan 7, 2011, 11:53:06 PM
to google...@googlegroups.com, scrap...@googlegroups.com, freebase...@freebase.com
Max, please feel free to github it :)

David

David Huynh

Jan 9, 2011, 4:39:06 AM
to google...@googlegroups.com, scrap...@googlegroups.com
So I wanted to try out ScraperWiki myself for real and got this going ... The story goes:

- I started hunting down a data set for a workshop on Refine and thought I should tackle the same data sets ProPublica used for Dollars for Docs.
My gosh, those data sets are in PDFs! The horror!

- ProPublica's tutorial on turning PDFs into text
    http://www.propublica.org/nerds/item/turning-pdfs-to-text-doc-dollars-guide
is quite extensive, but I figured maybe I could give the ScraperWiki + Google Refine dynamic duo a shot.

- So, the scraper came from the standard template, with just a few tweaks:
it saves the left and top coordinates of each text span so we can use those numbers to get the text into the right column later on (see the rough sketch at the end of this message).

- Import that data straight into Refine without any special configuration.

- And the refine operations are

From 70 pages of PDF to one fine table, all interactions inside the browser: awesomeness!
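
Roughly, the scraper tweak looks like this (reconstructed from memory of the standard PDF template, so the URL is a placeholder and the datastore helper may be named slightly differently in your copy of the library):

import scraperwiki
import lxml.etree

# Fetch the PDF and convert it to pdftohtml-style XML: one <text> element per span,
# each carrying left/top/width/height attributes.
pdfdata = scraperwiki.scrape("http://example.com/dollars-for-docs.pdf")  # placeholder URL
root = lxml.etree.fromstring(scraperwiki.pdftoxml(pdfdata))

# Save every span together with its coordinates, so Refine can later use "left"
# to decide which column a span belongs to and "top" to group spans into rows.
for el in root.iter("text"):
    scraperwiki.sqlite.save(unique_keys=[], data={
        "left": int(el.attrib["left"]),
        "top": int(el.attrib["top"]),
        "text": el.xpath("string()"),
    })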

David

Max Ogden

Jan 9, 2011, 5:34:29 AM
to google...@googlegroups.com, scrap...@googlegroups.com
That is great! 

I am working on porting your python utility to ruby as well as creating a generic API document for other developers.

I had another idea: to include an 'Export to ScraperWiki' option inside my https://github.com/maxogden/refine-uploader project. It would basically have a modal dialog in which you enter the URL of the page you wish to fetch and feed into Refine, and then your ScraperWiki scraper ID, username and password. It would authenticate you with ScraperWiki and upload your operation history and the data URL.

On the ScraperWiki end there would have to be some Refine integration, but hopefully the python and ruby libraries will lead to that.

One weird interaction in Refine is creating a new project from a URL and an entire webpage worth of raw HTML. It would be better if ScraperWiki had some sort of visual JavaScript selection aid, along the lines of what Chris from MIT was linking to, that lets the user select the portion of the page they want to grab the raw markup from. At that point a ScraperWiki API URL pointing to this markup could be presented to the user, along with instructions on how to download Refine and create a new project from the URL.

Max

David Huynh

Jan 9, 2011, 4:31:50 PM
to google...@googlegroups.com
What if ScraperWiki hosts one (or more?) instance of Refine? (Of course we need some sort of access control which Refine doesn't have right now.)  Then all the plumbing for data to go between ScraperWiki and Refine can be totally hidden from the user.

David

Max Ogden

Jan 9, 2011, 6:21:02 PM
to google...@googlegroups.com, google...@googlegroups.com, scrap...@googlegroups.com
I uploaded David's Python Refine library and a Ruby port that I created to Github:


There is also a generic API reference wiki here:

Re: ScraperWiki hosting Refine - 

I don't think hosting all the Refine instances on ScraperWiki would be as scalable. It's nice to have zero latency when dealing with your local copy of Refine, and you automatically get offline access that way as well. The easiest way to get started with Refine/ScraperWiki integration that I can think of is for ScraperWiki to have anonymous, non-publicly-accessible Refine instances that run little scripts to import, process and export data into the ScraperWiki API. Just my two cents.
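
For example, one of those little scripts could look something like this (a sketch built on David's refine.py from earlier in the thread; the ScraperWiki CSV URL and file paths are made up):

import urllib2
import refine  # David's refine.py from earlier in this thread

# Made-up example of a ScraperWiki CSV export URL; substitute a real scraper name.
csv_url = "http://api.scraperwiki.com/api/1.0/datastore/getdata?format=csv&name=example_scraper"

# Pull the scraped data down to a local file so refine.py can upload it into Refine.
local_path = "/tmp/example_scraper.csv"
open(local_path, "w").write(urllib2.urlopen(csv_url).read())

# Clean it with a saved operation history and hand back CSV for the ScraperWiki API.
r = refine.Refine()                      # assumes Refine is running on localhost:3333
p = r.new_project(local_path)
p.apply_operations("/path/to/example_operations.json")
cleaned_csv = p.export_rows("csv")
p.delete_project()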

Max

p.s. Sorry if this is a double post; I keep having issues with my posts having to get approved by the moderators of the freebase-discuss list.

Francis Irving

Jan 9, 2011, 6:23:41 PM
to google...@googlegroups.com
On Sun, Jan 09, 2011 at 01:31:50PM -0800, David Huynh wrote:
> On Sun, Jan 9, 2011 at 2:34 AM, Max Ogden <maxo...@gmail.com> wrote:
>
> > That is great!
> >
> > I am working on porting your python utility to ruby as well as creating a
> > generic API document for other developers.
> >
> > I had another idea... to include a 'Export to ScraperWiki' option inside
> > my https://github.com/maxogden/refine-uploader project. It would basically
> > have a modal dialog in which you enter the URL of the page you wish to fetch
> > and feed into Refine and then your ScraperWiki scraper ID, username and pw.
> > It would authenticate you with ScraperWiki and upload your operation history
> > and the data URL.
> >
> > On the ScraperWiki end there would have to be some Refine integration, but
> > hopefully the python and ruby libraries will lead to that.
> >
> > One weird interaction in Refine is creating a new project from a URL and an
> > entire webpage worth of raw HTML. It would be better if on ScraperWiki there
> > was some sort of visual javascript selection aide along the lines of what
> > Chris from MIT was linking to that lets the user select a portion of the
> > page that they want to grab the raw markup from. At this point a ScraperWiki
> > API URL that points to this markup could be presented to the user along with
> > instructions on how to download Refine and create a new project from the
> > URL.
> >

Max, what's the use case where you'd upload to ScraperWiki from refine
uploader? Is it to get the data easily back into ScraperWiki so you
can use views and things?



> What if ScraperWiki hosts one (or more?) instance of Refine?

> (Of course we
> need some sort of access control which Refine doesn't have right now.) Then
> all the plumbing for data to go between ScraperWiki and Refine can be
> totally hidden from the user.

I'd like to make ScraperWiki able to execute Refine scripts like yours,
David:
http://scraperwiki.com/scrapers/eli-lilly-dollars-for-docs-refine-operations/edit/
Either as a short Python/Ruby script with the operations JSON included as a
string, or even as a new language type called Refine which would just
have the Refine code.

So then yes, a hosted version to be able to easily create and update
refine scripts would be ideal.

Architecturally, what is the best way to patch Refine to be hosted and
accessible remotely by multiple users?

(Which is an even larger job, I imagine, than making running Refine
JSON scripts a headless operation you can do from the command line, or
Python/Ruby).

Francis

Francis Irving

Jan 9, 2011, 6:24:49 PM
to scrap...@googlegroups.com, google...@googlegroups.com
Aha! You've answered half my questions. Having a look at those Python
bindings...

Max Ogden

Jan 9, 2011, 7:15:58 PM
to google...@googlegroups.com, google...@googlegroups.com, scrap...@googlegroups.com
"Max, what's the use case where you'd upload to ScraperWiki from refine
uploader? Is it to get the data easily back into ScraperWiki so you
can use views and things?"


I was just figuring that since I'm already making an extension that deals with uploading JSON to a server, I could just expand the scope of the existing extension. In addition to the existing functionality there would be a new 'Upload to ScraperWiki' button that sends just the operation history, not any of the actual row data in Refine.

"I'd like to be able to make ScraperWiki able to execute refine scripts. Either as a short Python/Ruby script with the JSON included as a
string. Or even as a new language type called Refine which would just have the refine code."

I like this idea a lot, specifically the bit about a new language type for Refine that just houses the operation history. Here is what I'm brainstorming for the workflow:

1. User clicks 'New Scraper', selects Refine.
2. User is prompted to enter in a URL to grab data from.
3. User is presented with a link to copy paste into their Refine instance that will load the raw webpage data into Refine.
4. Once they have completed cleaning up the raw data in Refine, they can either copy paste the operation history JSON into ScraperWiki or they can install a Refine extension that does that for them.

The one thing that might not be immediately feasible happens between steps 2 and 3. Importing raw HTML into Refine is a little daunting, so it would be nice to come up with a way to limit the scope of the raw data before dumping it into Refine. In light of that, here is another possible workflow:

1. User clicks 'New Scraper', selects their language of choice (how it is now)
2. User writes a portion of the scraper in a scripting language that grabs very raw text out of the raw HTML (such as in this video http://vimeo.com/18351837)
3. User then has the opportunity to 'chain' together Google Refine by applying a Refine operation history to the results of their basic scraper data

So in this case it wouldn't necessarily be a new language type, but rather a means of post-production.

Max

"Architecturally, what is the best way to patch Refine to be hosted and
accessible remotely by multiple users?

(Which is an even larger job, I imagine, than making running Refine
JSON scripts a headless operation you can do from the command line, or
Python/Ruby)."

To start, you'd have to implement user accounts and a security model in Java. I think for a quick win you could run Refine privately on ScraperWiki and then just build a queue on top of it that does data transformations for any pending scrapers that are 'Refine enabled', and then deletes the data out of Refine once it's been exported and saved in the ScraperWiki API.
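
Building on the sketch from my earlier post, the queue worker itself could start out as dumb as this loop (the pending list and URLs are made-up stand-ins; the per-scraper work is the same fetch/apply/export sequence as before, plus the cleanup step):

import urllib2
import refine  # David's refine.py

# Made-up stand-in for a real queue of 'Refine enabled' scrapers: each entry
# pairs a scraper name with the operation history JSON to apply to its data.
pending = [("example_scraper", "/path/to/example_operations.json")]

r = refine.Refine("http://127.0.0.1:3333")   # the private Refine instance

for name, ops_path in pending:
    # Fetch the scraper's current data from its (made-up) ScraperWiki CSV export URL.
    csv_url = "http://api.scraperwiki.com/api/1.0/datastore/getdata?format=csv&name=" + name
    local_path = "/tmp/%s.csv" % name
    open(local_path, "w").write(urllib2.urlopen(csv_url).read())

    p = r.new_project(local_path)
    try:
        p.apply_operations(ops_path)
        cleaned = p.export_rows("csv")
        # ...push `cleaned` back into the ScraperWiki API here...
    finally:
        p.delete_project()                   # don't leave project data sitting in Refine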

Max


Francis Irving

Jan 9, 2011, 7:20:23 PM
to scrap...@googlegroups.com, google...@googlegroups.com
I've got the test.py example to work locally... Really good.

Starting to play with running it on the Eli Lilly example. I've forked
and pushed a few minor changes to the Python bindings - the main one
code to allow specification of a URL as input format.

https://github.com/frabcus/refine-python

I'm running this script, trying to get it to grab the import file
directly from the CSV download of the Eli Lilly scraped data on
ScraperWiki.

http://seagrass.goatchurch.org.uk/~francis/tmp/eli_lilly.py

Some questions.

1) I've altered new_project to take either a project_file or
project_url as a parameter, judging by the specification
in ./main/webapp/modules/core/index.vt in the Refine code.

Is that the right thing to do?

2) Running eli_lilly.py I then get an error. First of all, there's an
error reporting problem in the bindings.

If after response.read() in new_project I add the following lines to
help with debugging:
print response_body
print response.info()
print response.code

I see that an HTTP status code 200 is still being returned. What's the
best way of checking for errors back from Refine? So it can print
the (Java!) stack trace when it fails, but only when it does.

3) The stack trace I'm getting is this one. But I really have to go to bed now,
so I can't debug it. It is probably something obvious...

<h2>Failed to import file:</h2>
<pre class="errorstack">org.apache.commons.fileupload.FileUploadBase$InvalidContentTypeException: the request doesn't contain a multipart/form-data or multipart/mixed stream, content type header is application/x-www-form-urlencoded
at org.apache.commons.fileupload.FileUploadBase$FileItemIteratorImpl.<init>(FileUploadBase.java:885)
at org.apache.commons.fileupload.FileUploadBase.getItemIterator(FileUploadBase.java:331)
at org.apache.commons.fileupload.servlet.ServletFileUpload.getItemIterator(ServletFileUpload.java:148)
at com.google.refine.commands.project.CreateProjectCommand.internalImport(CreateProjectCommand.java:146)
at com.google.refine.commands.project.CreateProjectCommand.doPost(CreateProjectCommand.java:112)
at com.google.refine.RefineServlet.service(RefineServlet.java:171)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1166)
at org.mortbay.servlet.UserAgentFilter.doFilter(UserAgentFilter.java:81)
at org.mortbay.servlet.GzipFilter.doFilter(GzipFilter.java:155)
at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1157)
at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:388)
at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:765)
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:418)
at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
at org.mortbay.jetty.Server.handle(Server.java:326)
at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:938)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:755)
at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:218)
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)

4) What is the security of the refine server like? This is at two levels.

Firstly right now, anyone who can access the refine server web interface can
currently browse and alter any project, I think? (Makes sense for a local tool,
but just checking).

Secondly though, how robust should it be as a web application? It looks like
you can make it read arbitrary files off the filesystem (which in lots of
controlled circumstances, I'm not worried about).

Is it likely that someone can run arbitrary code, so we have to sandbox it as
well as a browser-edited Python/Ruby script? Or could it (apart from the first
security problem) sit outside that?

Enough for now, quite exciting to see this working as well as it is!

Francis

David Huynh

Jan 9, 2011, 7:46:16 PM
to google...@googlegroups.com, scrap...@googlegroups.com
On Sun, Jan 9, 2011 at 4:20 PM, Francis Irving <fra...@scraperwiki.com> wrote:
I've got the test.py example to work locally... Really good.

Starting to play with running it on the Eli Lilly example. I've forked
and pushed a few minor changes to the Python bindings - the main one
code to allow specification of a URL as input format.

https://github.com/frabcus/refine-python

I'm running this script, trying to get it to grab the import file
directly from the CSV download from the Ely Lilly scraped data on
ScraperWiki.

http://seagrass.goatchurch.org.uk/~francis/tmp/eli_lilly.py

Some questions.

1) I've altered new_project to take either a project_file or
project_url as a parameter, judging by the specification
in ./main/webapp/modules/core/index.vt in the Refine code.

Is that the right thing to do?

Looks good to me.
 
2) Running eli_lilly.py I then get an error. First of all, there's an
error reporting problem in the bindings.

If after response.read() in new_project I add the following lines to
help with debugging:
   print response_body
   print response.info()
   print response.code

I see that an HTTP status code 200 is still being returned. What's the
best way of checking for errors back from Refine? So it can print
the (Java!) stack trace when it fails, but only when it does.

response.geturl() would show something like "/error" which means you're seeing the error page.
Try to add a URL parameter called "url" (in the POST url) rather than using a POST body.


4) What is the security of the refine server like? This is at two levels.

Firstly right now, anyone who can access the refine server web interface can
currently browse and alter any project, I think? (Makes sense for a local tool,
but just checking).

Correct.

 
Secondly though, how robust should it be as a web application? It looks like
you can make it read arbitrary files off the filesystem (which in lots of
controlled circumstances, I'm not worried about).

I don't think it can read arbitrary files off the server machine, except perhaps through the use of Python.


Is it likely that someone can run arbitrary code, so we have to sandbox it as
well as a browser-edited Python/Ruby script? Or could it (apart from the first
security problem) sit outside that?

The native language GREL can't do much to the server machine, but other scripting languages might be able to. I suppose ScraperWiki must have solved the problem of sandboxing python scripts already, and we can just borrow that technique. Frankly, I think GREL is powerful enough, and other languages are only a matter of preference and familiarity.

David

David Huynh

Jul 24, 2012, 3:43:09 PM
to google...@googlegroups.com
It seems like the problem is that the data file (which starts with "Date") got imported by default as a line-based text file with no header row, so the column is named "Column 1" rather than "Date", which is why you saw the error message "No column named Date". I think those python scripts are out of date. Could you try Paul Makepeace's latest library?
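
(If you do want to keep using those old scripts, one workaround is to rename the column before applying the rest of the operations; something like the snippet below, though the operation JSON is written from memory, so double-check it against what the "Extract..." dialog produces.)

import refine  # the refine.py posted earlier in this thread

# A minimal rename operation; compare it against what Refine's "Extract..."
# dialog actually produces before relying on it.
rename_ops = """[
  {
    "op": "core/column-rename",
    "description": "Rename column Column 1 to Date",
    "oldColumnName": "Column 1",
    "newColumnName": "Date"
  }
]"""

open("/tmp/rename.json", "w").write(rename_ops)

r = refine.Refine()
p = r.new_project("/file/path/to/sample/dates.txt")
p.apply_operations("/tmp/rename.json")      # now the column really is called "Date"
p.apply_operations("/file/path/to/sample/operations.json")
print p.export_rows()
p.delete_project()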


David

On Tue, Jul 24, 2012 at 11:56 AM, Daniel Wu <bow...@gmail.com> wrote:
Hello David,

I would like to run the python scripts you posted here, but I found the dates are not changed. I use python 2.7 and java 1.7. I also built the grefine source code on my machine using ant 1.8.4 and run it using "./refine". Could you please take a look at this issue?
I tried the example in your previous post. The returned result is 
====
Column 1
Date
7 December 2001
July 1 2002
10/20/10
======
some exceptions are thrown on the server side
java.lang.Exception: No column named Date
at com.google.refine.operations.EngineDependentMassCellOperation.createHistoryEntry(EngineDependentMassCellOperation.java:68)
at com.google.refine.model.AbstractOperation$1.createHistoryEntry(AbstractOperation.java:52)
at com.google.refine.process.QuickHistoryEntryProcess.performImmediate(QuickHistoryEntryProcess.java:73)
at com.google.refine.process.ProcessManager.queueProcess(ProcessManager.java:82)
at com.google.refine.commands.history.ApplyOperationsCommand.reconstructOperation(ApplyOperationsCommand.java:88)
at com.google.refine.commands.history.ApplyOperationsCommand.doPost(ApplyOperationsCommand.java:69)
at com.google.refine.RefineServlet.service(RefineServlet.java:177)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1166)
at org.mortbay.servlet.UserAgentFilter.doFilter(UserAgentFilter.java:81)
at org.mortbay.servlet.GzipFilter.doFilter(GzipFilter.java:155)
at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1157)
at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:388)
at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:765)
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:418)
at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
at org.mortbay.jetty.Server.handle(Server.java:326)
at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:938)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:755)
at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:218)
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:722)
11:50:17.333 [                   refine] POST /command/core/export-rows/dates.txt.tsv (7ms)
11:50:17.337 [                   refine] POST /command/core/delete-project (4ms)

Daniel Wu

Jul 24, 2012, 6:59:52 PM
to google...@googlegroups.com
Thank you David. I tried using Column 1 as the column name, and it works.
Are there any documents on how to write the operations JSON file, or some more complex examples of one?
I would like to write operations JSON files to drive the server-side grefine.