I asked on the Google Refine group if there is a way to run Refine headlessly:

Basically, we'd need a Java dev to hack on the code. I have been learning Java for the last week to write a Refine extension (https://github.com/maxogden/refine-uploader), but it would be nice to get some help building a command line interface to Google Refine that lets you load in data, run extracted operation histories against that data, and then export the result.
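To make that concrete, here's a rough sketch of the kind of command-line wrapper I mean, written against the refine.py bindings that appear later in this thread; the script name and argument handling are illustrative only, and it assumes a Refine server is already running locally:

#!/usr/bin/env python
# Sketch of a CLI for headless Refine: load data, replay an extracted
# operation history, export the result. Assumes refine.py is importable
# and a Refine server is running on its default port.
import sys
import refine

def main(argv):
    if len(argv) != 3:
        print >>sys.stderr, 'usage: %s <data-file> <operations.json>' % argv[0]
        return 1
    r = refine.Refine()          # connect to the local Refine server
    p = r.new_project(argv[1])   # load in the data
    p.apply_operations(argv[2])  # run the extracted operation history
    print p.export_rows()        # export the cleaned rows (TSV) to stdout
    p.delete_project()           # clean up the temporary project
    return 0

if __name__ == '__main__':
    sys.exit(main(sys.argv))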
Your idea of a more visual scraping tool is what I'm hoping gets
written or conceived, perhaps as a Google Refine extension, who knows.
The committers on Google Refine (I think Iain Sprout) have mentioned
that somehow, somewhere, a host could be used as the intermediary
repository for Refine cleanup processes or "scraping". A stronger
connection via extensions in Refine would perhaps allow ScraperWiki to
be that host. And as Julian Todd mentioned, it looks like your basic
Views can be extended with Python to actually function as a
Reconciliation service (a sketch of what such an endpoint might look
like follows below). Coolness indeed. (All the while, there is Google
Fusion Tables as a very neat layer that folks can add to the mix,
affording even richer views and animation of the data we scrape and
refine.)
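Here's a minimal sketch of a ScraperWiki view acting as a reconciliation endpoint; lookup_candidates() and the identifier/schema URLs are hypothetical, and the request/response shapes follow the reconciliation protocol as I understand it (service metadata on a bare GET, a JSON object of named queries otherwise, JSONP via a callback parameter):

import json

def lookup_candidates(text):
    # Hypothetical stand-in for a real query against a scraper's datastore.
    return [{'id': 'row/1', 'name': text, 'type': ['/example/thing'],
             'score': 100, 'match': True}]

def reconcile(params):
    if 'queries' in params:
        # Refine sends a JSON object of named queries, e.g.
        # {"q0": {"query": "Eli Lilly"}, "q1": {...}}
        queries = json.loads(params['queries'])
        response = dict((key, {'result': lookup_candidates(q['query'])})
                        for key, q in queries.items())
    else:
        # A bare GET returns the service metadata Refine uses to register it.
        response = {'name': 'ScraperWiki reconciliation (sketch)',
                    'identifierSpace': 'http://example.com/ids',
                    'schemaSpace': 'http://example.com/schema'}
    body = json.dumps(response)
    if 'callback' in params:
        body = '%s(%s)' % (params['callback'], body)  # JSONP for cross-domain
    return body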
Let's all work together to see how far the two communities can
leverage each other's skills. More collaboration and idea sharing is
needed.
Let's keep up the cross-posting on this thread to both community
mailing lists to continue the discussions.
2cents,
-Thad
http://www.freebase.com/view/en/thad_guidry
David and Stefano built a cool Firefox browser extension back in their
MIT days, and I'm seriously wondering if that couldn't somehow be
hacked up into a Google Refine extension. Where, if you want your data
or Refine processes public and maintainable, you could leverage
ScraperWiki and its community for hosting and maintenance?
Quoting Stefano:
" Note that David and I, in a past life, wrote a tool called
"solvent", it's a firefox extension that does provide some very
interesting capabilities (see a screencast of it here)
http://simile.mit.edu/solvent/screencasts/solvent_screencast.swf
the code is open source here
http://simile.mit.edu/repository/solvent/trunk/
although I'm not sure it still works, it hasn't been touched in years,
but some of the concepts there could still be very useful. "
-Thad
http://www.freebase.com/view/en/thad_guidry
import sys
sys.path.append("/directory/where/you/put/refine.py")
import refine

r = refine.Refine()                                         # connect to a running Refine server
p = r.new_project("/file/path/to/sample/dates.txt")         # upload the data
p.apply_operations("/file/path/to/sample/operations.json")  # replay an extracted operation history
print p.export_rows()                                       # dump the cleaned rows as TSV
p.delete_project()
import urllib2_file  # monkey-patches urllib2 to POST multipart/form-data when a file is given
import urllib2, urlparse, os.path, time, json

class Refine:
    def __init__(self, server='http://127.0.0.1:3333'):
        self.server = server[:-1] if server.endswith('/') else server

    def new_project(self, file_path, options=None):
        file_name = os.path.split(file_path)[-1]
        project_name = options['project_name'] if options != None and 'project_name' in options else file_name
        data = {
            'project-file': {
                'fd': open(file_path),
                'filename': file_name
            },
            'project-name': project_name
        }
        response = urllib2.urlopen(self.server + '/command/core/create-project-from-upload', data)
        response.read()
        # On success Refine redirects to a URL carrying the new project's id.
        url_params = urlparse.parse_qs(urlparse.urlparse(response.geturl()).query)
        if 'project' in url_params:
            id = url_params['project'][0]
            return RefineProject(self.server, id, project_name)
        # TODO: better error reporting
        return None

class RefineProject:
    def __init__(self, server, id, project_name):
        self.server = server
        self.id = id
        self.project_name = project_name

    def wait_until_idle(self, polling_delay=0.5):
        # Poll until Refine reports no pending processes for this project.
        while True:
            response = urllib2.urlopen(self.server + '/command/core/get-processes?project=' + self.id)
            response_json = json.loads(response.read())
            if 'processes' in response_json and len(response_json['processes']) > 0:
                time.sleep(polling_delay)
            else:
                return

    def apply_operations(self, file_path, wait=True):
        fd = open(file_path)
        operations_json = fd.read()
        data = {'operations': operations_json}
        response = urllib2.urlopen(self.server + '/command/core/apply-operations?project=' + self.id, data)
        response_json = json.loads(response.read())
        if response_json['code'] == 'error':
            raise Exception(response_json['message'])
        elif response_json['code'] == 'pending':
            if wait:
                self.wait_until_idle()
                return 'ok'
        return response_json['code']  # can be 'ok' or 'pending'

    def export_rows(self, format='tsv'):
        data = {
            'engine': '{"facets":[],"mode":"row-based"}',
            'project': self.id,
            'format': format
        }
        response = urllib2.urlopen(self.server + '/command/core/export-rows/' + self.project_name + '.' + format, data)
        return response.read()

    def delete_project(self):
        data = {'project': self.id}
        response = urllib2.urlopen(self.server + '/command/core/delete-project', data)
        response_json = json.loads(response.read())
        return 'code' in response_json and response_json['code'] == 'ok'
David/Stefano:
Wonderful video and demonstration. This was the track I was on, but I
guess most scrapers rely upon Firefox, developing on top of Firefox as
an extension or plugin. Also, I really like the idea of Crowbar and
the fact that it acts as a headless Gecko/XULRunner server-side
engine. However, a key problem I've encountered with these kinds of
engines is that, since they are not running in a full browsing
environment, the agent cannot access the DOM after the onload
JavaScript hooks are executed. Do these engines allow us to scrape
content that was not in the HTML page as initially served, but is
client-side included via AJAX or programmatically computed after the
page has loaded?
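For instance, here's the kind of call I'd want these engines to support: fetch a page through a locally running Crowbar server, waiting long enough for client-side AJAX includes to run before the DOM is serialized. This sketch assumes Crowbar's default port and its documented url/delay parameters:

import urllib, urllib2

def fetch_rendered(url, delay_ms=3000):
    # delay gives onload AJAX requests time to finish before Crowbar
    # serializes and returns the rendered DOM.
    params = urllib.urlencode({'url': url, 'delay': delay_ms})
    return urllib2.urlopen('http://127.0.0.1:10000/', params).read()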
Max, what's the use case where you'd upload to ScraperWiki from refine
uploader? Is it to get the data easily back into ScraperWiki so you
can use views and things?
> What if ScraperWiki hosts one (or more?) instance of Refine?
> (Of course we
> need some sort of access control which Refine doesn't have right now.) Then
> all the plumbing for data to go between ScraperWiki and Refine can be
> totally hidden from the user.
I'd like to make ScraperWiki able to execute Refine scripts like
yours, David:
http://scraperwiki.com/scrapers/eli-lilly-dollars-for-docs-refine-operations/edit/
Either as a short Python/Ruby script with the JSON included as a
string (a sketch of that option is just below), or even as a new
language type called Refine which would just contain the Refine code.
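Something like this is what I have in mind for the first option, reusing the refine.py bindings above. apply_operations takes a file path, so the inline JSON gets written to a temporary file first; the operation itself is just an illustrative GREL date transform, and the input path is hypothetical:

import tempfile
import refine

# Illustrative operation history: one GREL text transform on a "Date" column.
OPERATIONS_JSON = '''[ {
    "op" : "core/text-transform",
    "description" : "Text transform on cells in column Date",
    "engineConfig" : { "facets" : [], "mode" : "row-based" },
    "columnName" : "Date",
    "expression" : "value.toDate()",
    "onError" : "keep-original",
    "repeat" : false,
    "repeatCount" : 10
} ]'''

r = refine.Refine()
p = r.new_project('/path/to/scraped/data.csv')  # hypothetical input file
f = tempfile.NamedTemporaryFile(suffix='.json')
f.write(OPERATIONS_JSON)
f.flush()                    # make the JSON visible before Refine reads it
p.apply_operations(f.name)
print p.export_rows()
p.delete_project()
f.close()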
So then yes, a hosted version to be able to easily create and update
refine scripts would be ideal.
Architecturally, what is the best way to patch Refine to be hosted and
accessible remotely by multiple users?
(Which is an even larger job, I imagine, than making running Refine
JSON scripts a headless operation you can do from the command line, or
Python/Ruby).
Francis
"Architecturally, what is the best way to patch Refine to be hosted and
accessible remotely by multiple users?
(Which is an even larger job, I imagine, than making running Refine
JSON scripts a headless operation you can do from the command line, or
Python/Ruby)."
To start, you'd have to implement user accounts and a security model in Java. I think for a quick win you could run Refine privately on ScraperWiki and then just build a queue on top of it that does the data transformations for any pending scrapers that are 'Refine enabled', then deletes the data out of Refine once it's been exported and saved via the ScraperWiki API (a rough sketch below).
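Roughly, the queue would be a loop like this; get_pending_refine_scrapers(), fetch_csv() and save_via_api() are hypothetical ScraperWiki-side helpers, and only the refine.py calls come from the bindings posted earlier in the thread:

import refine

def process_refine_queue():
    r = refine.Refine()  # the private Refine instance running on ScraperWiki
    for scraper in get_pending_refine_scrapers():    # hypothetical helper
        csv_path = fetch_csv(scraper)                # hypothetical helper
        p = r.new_project(csv_path)
        p.apply_operations(scraper.operations_path)  # the saved JSON history
        cleaned = p.export_rows()
        save_via_api(scraper, cleaned)               # hypothetical helper
        p.delete_project()  # delete the data out of Refine once exported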
Max
I've got the test.py example to work locally... Really good.
Starting to play with running it on the Eli Lilly example. I've forked
and pushed a few minor changes to the Python bindings - the main one
being code to allow specifying a URL as the input.
https://github.com/frabcus/refine-python
I'm running this script, trying to get it to grab the import file
directly from the CSV download of the Eli Lilly scraped data on
ScraperWiki.
http://seagrass.goatchurch.org.uk/~francis/tmp/eli_lilly.py
Some questions.
1) I've altered new_project to take either a project_file or
project_url as a parameter, judging by the specification
in ./main/webapp/modules/core/index.vt in the Refine code.
Is that the right thing to do?
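For reference, my change is roughly shaped like this; I'm not certain of the URL form field name (it needs checking against index.vt), and note that when no file is present in the data dict, urllib2_file falls back to an ordinary urlencoded POST rather than multipart/form-data, which may be what triggers the exception in question 3 below:

def new_project(self, file_path=None, project_url=None, options=None):
    # The "url" field name is a guess to be checked against index.vt.
    if file_path is not None:
        file_name = os.path.split(file_path)[-1]
        data = {'project-file': {'fd': open(file_path), 'filename': file_name}}
        default_name = file_name
    else:
        data = {'url': project_url}
        default_name = os.path.split(project_url)[-1]
    data['project-name'] = (options or {}).get('project_name', default_name)
    response = urllib2.urlopen(self.server + '/command/core/create-project-from-upload', data)
    response.read()
    url_params = urlparse.parse_qs(urlparse.urlparse(response.geturl()).query)
    if 'project' in url_params:
        return RefineProject(self.server, url_params['project'][0], data['project-name'])
    return None  # TODO: better error reporting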
2) Running eli_lilly.py I then get an error. First of all, there's an
error reporting problem in the bindings.
If after response.read() in new_project I add the following lines to
help with debugging:
print response_body
print response.info()
print response.code
I see that an HTTP status code of 200 is still being returned. What's
the best way of checking for errors back from Refine, so the bindings
can print the (Java!) stack trace when it fails, but only when it does?
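The check I'd hope for inside new_project is something like this: on success Refine redirects to a URL carrying the new project id, and failures seem to embed an "errorstack" block in the body, so either of those is a usable signal:

# Inside new_project, after the upload POST:
response = urllib2.urlopen(self.server + '/command/core/create-project-from-upload', data)
response_body = response.read()
url_params = urlparse.parse_qs(urlparse.urlparse(response.geturl()).query)
if 'project' not in url_params or 'errorstack' in response_body:
    # Refine answered 200 anyway, so surface the body (the Java stack
    # trace, when there is one) as the error.
    raise Exception('Refine import failed:\n' + response_body)
return RefineProject(self.server, url_params['project'][0], project_name)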
3) The stack trace I'm getting is this one. But I really have to go to bed now,
so I can't debug it. It is probably something obvious...
<h2>Failed to import file:</h2>
<pre class="errorstack">org.apache.commons.fileupload.FileUploadBase$InvalidContentTypeException: the request doesn't contain a multipart/form-data or multipart/mixed stream, content type header is application/x-www-form-urlencoded
at org.apache.commons.fileupload.FileUploadBase$FileItemIteratorImpl.<init>(FileUploadBase.java:885)
at org.apache.commons.fileupload.FileUploadBase.getItemIterator(FileUploadBase.java:331)
at org.apache.commons.fileupload.servlet.ServletFileUpload.getItemIterator(ServletFileUpload.java:148)
at com.google.refine.commands.project.CreateProjectCommand.internalImport(CreateProjectCommand.java:146)
at com.google.refine.commands.project.CreateProjectCommand.doPost(CreateProjectCommand.java:112)
at com.google.refine.RefineServlet.service(RefineServlet.java:171)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1166)
at org.mortbay.servlet.UserAgentFilter.doFilter(UserAgentFilter.java:81)
at org.mortbay.servlet.GzipFilter.doFilter(GzipFilter.java:155)
at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1157)
at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:388)
at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:765)
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:418)
at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
at org.mortbay.jetty.Server.handle(Server.java:326)
at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:938)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:755)
at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:218)
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
4) What is the security of the Refine server like? This is at two levels.
Firstly, right now anyone who can access the Refine server web interface can
browse and alter any project, I think? (Makes sense for a local tool,
but just checking.)
Secondly though, how robust should it be as a web application? It looks like
you can make it read arbitrary files off the filesystem (which, in lots of
controlled circumstances, I'm not worried about).
Is it likely that someone can run arbitrary code, so we have to sandbox it as
well as a browser-edited Python/Ruby script? Or could it (apart from the first
security problem) sit outside that?
Enough for now, quite exciting to see this working as well as it is!
Francis
Hello David,

I would like to run these Python scripts you post here, but I found the dates are not changed. I use Python 2.7 and Java 1.7. I also built the grefine source code on my machine using ant 1.8.4 and run it using "./refine". Could you please take a look at this issue? I tried the example in your previous post. The returned result is:

====
Column 1
Date
7 December 2001
July 1 2002
10/20/10
======

Some exceptions are thrown on the server side:

java.lang.Exception: No column named Date
at com.google.refine.operations.EngineDependentMassCellOperation.createHistoryEntry(EngineDependentMassCellOperation.java:68)
at com.google.refine.model.AbstractOperation$1.createHistoryEntry(AbstractOperation.java:52)
at com.google.refine.process.QuickHistoryEntryProcess.performImmediate(QuickHistoryEntryProcess.java:73)
at com.google.refine.process.ProcessManager.queueProcess(ProcessManager.java:82)
at com.google.refine.commands.history.ApplyOperationsCommand.reconstructOperation(ApplyOperationsCommand.java:88)
at com.google.refine.commands.history.ApplyOperationsCommand.doPost(ApplyOperationsCommand.java:69)
at com.google.refine.RefineServlet.service(RefineServlet.java:177)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1166)
at org.mortbay.servlet.UserAgentFilter.doFilter(UserAgentFilter.java:81)
at org.mortbay.servlet.GzipFilter.doFilter(GzipFilter.java:155)
at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1157)
at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:388)
at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:765)
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:418)
at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
at org.mortbay.jetty.Server.handle(Server.java:326)
at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:938)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:755)
at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:218)
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:722)

11:50:17.333 [ refine] POST /command/core/export-rows/dates.txt.tsv (7ms)
11:50:17.337 [ refine] POST /command/core/delete-project (4ms)
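The "No column named Date" error suggests the operations JSON refers to a column the import didn't create; the export above shows the data landed under "Column 1", with "Date" as an ordinary row. One way to confirm what the columns are actually called after import, assuming Refine's get-models command (the same one the web UI uses to fetch a project's column model):

import json, urllib2

def column_names(server, project_id):
    # Ask Refine for the project's column model and list the column names.
    url = server + '/command/core/get-models?project=' + project_id
    models = json.loads(urllib2.urlopen(url).read())
    return [col['name'] for col in models['columnModel']['columns']]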