Programmatic project creation + import


Raff

Jun 18, 2014, 3:51:52 PM
to openr...@googlegroups.com
Hello,

I'm trying to interface with OpenRefine 2.6 (as it comes from GitHub at the moment) from a Scala webapp. 

I need to create a project and import JSON files through one or more HTTP requests. I've been digging around the OpenRefine Java and JavaScript code to determine how to do this.

To simplify the problem, I am:
  • using the command: /command/core/create-project-from-upload
  • sending only one JSON file as binary body of an HTTP entity
  • setting properties as text parameters on the HTTP entity
For example, using the Java Apache HTTP library:

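import java.io.File
import org.apache.http.client.methods.HttpPost
import org.apache.http.entity.ContentType
import org.apache.http.entity.mime.MultipartEntityBuilder
import org.apache.http.impl.client.HttpClientBuilder
import org.apache.http.util.EntityUtils

val file = new File("myFile.json")

// One binary part for the file, plus text parts for the project properties
val httpEntity = MultipartEntityBuilder.create()
  .addBinaryBody("project-file", file, ContentType.APPLICATION_JSON, file.getName)
  .addTextBody("project-name", "testing")
  .addTextBody("format", "text/json")
  .build

// url is the address of /command/core/create-project-from-upload on the OpenRefine server
val post = new HttpPost(url)
post.setEntity(httpEntity)
val client = HttpClientBuilder.create().build
val response = client.execute(post)

EntityUtils.toString(response.getEntity())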

After this, OpenRefine returns a 500 error: "Failed to import file: java.lang.ArrayIndexOutOfBoundsException: 0"

After some digging, I found out that when options are not manually set in the HTTP request, they are built by com.google.refine.importers.JsonImporter.createParserUIInitializationData.

The method parses the JSON file, including its content (for the preview). However, it does not set a "recordPath" field, which is required when parsing the content into the OpenRefine tree. The importer looks for the field, and the following call to XmlImportUtilities.importTreeData fails because the recordPath array is empty.

If I supply recordPath as an option in an HTTP request parameter, then CreateProjectCommand does not parse the content of my JSON file and only continues with the provided fields. So the project gets created, but it's obviously empty.

Is there anyone with more experience of the code that can help me figure out what's missing? My hypothesis at this point is that the JavaScript client manages this in two steps because the user has to provide a recordPath through the UI (i.e. by clicking on the JSON object property that they want). Would it be possible to emulate this programmatically?
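
For illustration, here is a minimal sketch of serializing a record path into a JSON options string using only the Scala standard library. Only the "recordPath" field name comes from the discussion above. The object and method names are hypothetical, and the "_" placeholder for unnamed nodes is an assumption that should be checked against what the web client actually sends. Whether the server honours such a value on this endpoint is exactly the open question here.

```scala
// Hypothetical helper: only the "recordPath" key is taken from the thread.
object RecordPathOptions {

  /** Render a string as a JSON string literal, escaping quotes,
    * backslashes and control characters. */
  def jsonString(s: String): String = {
    val sb = new StringBuilder("\"")
    s.foreach {
      case '"'          => sb.append("\\\"")
      case '\\'         => sb.append("\\\\")
      case c if c < ' ' => sb.append("\\u%04x".format(c.toInt))
      case c            => sb.append(c)
    }
    sb.append('"').toString
  }

  /** Build a JSON object holding only a recordPath array. */
  def recordPathOptions(path: Seq[String]): String = {
    val array = path.map(jsonString).mkString("[", ",", "]")
    "{\"recordPath\":" + array + "}"
  }
}

// Assumed path for "each element of a top-level array"; "_" is a guess
// at how unnamed nodes are written and must be verified.
val options = RecordPathOptions.recordPathOptions(Seq("_", "_"))
```

The resulting string could then be attached as one more text part of the multipart entity (the part name is also an assumption), alongside "project-name" and "format".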

Finally, this problem does not exist with CSV files; they are imported correctly. I assume this is why the Python client only supports CSV input?

Thanks for any help you can provide!
Raff

Tom Morris

Jun 18, 2014, 11:26:46 PM
to openref...@googlegroups.com
+openrefine-dev
bcc: openrefine

This sounds much more appropriate for the audience of the developers list.  Are you building a Scala client library for OpenRefine?

First, just as a reminder, all the third-party client libraries are unsupported efforts based on reverse engineering the internal Refine client/server protocol, which is a) not documented and b) not change controlled except as needed for our internal use.

Having said that, the selection of the object that you want to import is key to the JSON (and XML) import process.  Importing the root is almost certainly not what you want.  Importing the top level objects MAY be what you want, but it varies a lot based on the JSON that you are trying to import.

If you've got a specific case that you're trying to support, your best bet would probably be to use either a browser JavaScript debugger or a protocol analyzer like Wireshark to look at what the OpenRefine web client is sending and emulate that.
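
Once a request has been captured, a URL-encoded form body is easier to compare against your own request after decoding. A minimal sketch, assuming the captured body is application/x-www-form-urlencoded; the object name is hypothetical and the sample body is made up for illustration, not taken from a real capture:

```scala
import java.net.URLDecoder

// Hypothetical helper for inspecting a captured request body.
object CapturedForm {

  /** Split a URL-encoded form body into decoded (key, value) pairs,
    * preserving their order. A key without '=' gets an empty value. */
  def decode(body: String): Seq[(String, String)] =
    body.split("&").toSeq.filter(_.nonEmpty).map { pair =>
      val idx = pair.indexOf('=')
      val (k, v) =
        if (idx < 0) (pair, "")
        else (pair.substring(0, idx), pair.substring(idx + 1))
      (URLDecoder.decode(k, "UTF-8"), URLDecoder.decode(v, "UTF-8"))
    }
}

// Made-up sample body, for illustration only.
val sample = "format=text%2Fjson&options=%7B%22recordPath%22%3A%5B%22_%22%5D%7D"
CapturedForm.decode(sample).foreach { case (k, v) => println(s"$k = $v") }
```

Printing the pairs side by side for the web client's request and your own makes any missing or differently named parameter stand out.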

Do you have a standalone reproducer which demonstrates the problem that you are trying to solve?  That would make it a lot easier for folks to help you.  I don't have time to look at the code tonight, but I'll try to have a quick look over the weekend.

Tom



Raff

Jun 19, 2014, 12:09:47 PM
to openr...@googlegroups.com, openref...@googlegroups.com
Hi Tom, thanks for your reply!

On Wednesday, June 18, 2014 11:26:46 PM UTC-4, Tom Morris wrote:
+openrefine-dev
bcc: openrefine

This sounds much more appropriate for the audience of the developers list.  

Yes, my bad, I realized there was a -dev list after sending the message.
 
Are you building a Scala client library for OpenRefine?

No, I'm dealing with a specific case and I'm bound to Scala because of existing data pre-processing tools. I wouldn't exclude putting together a client later on, but project time constraints won't allow me to do that at this stage.
 

First, just as a reminder, all the third-party client libraries are unsupported efforts based on reverse engineering the internal Refine client/server protocol, which is a) not documented and b) not change controlled except as needed for our internal use.

Having said that, the selection of the object that you want to import is key to the JSON (and XML) import process.  Importing the root is almost certainly not what you want.  Importing the top level objects MAY be what you want, but it varies a lot based on the JSON that you are trying to import.

Makes sense; I guess that a proper client would require a path as a parameter to a JSON importing method.


If you've got a specific case that you're trying to support, your best bet would probably be to use either a browser JavaScript debugger or a protocol analyzer like Wireshark to look at what the OpenRefine web client is sending and emulate that.

Thanks, sounds like Wireshark might help, will give that a try.