OpenRefine 3.0 beta API and Python 3


David PACHE

Oct 26, 2018, 9:33:50 PM
to OpenRefine
For three days I have worked on a copy of refine-client-py to get a Python 3 version working, at least for project creation, applying operations and project deletion. I am new to Python and ran into issues, but deletion now seems to work smoothly. Project creation is my biggest concern at the moment:
- I can create a project with a CSV file, but the separator does not seem to be taken into account;
- With an Excel file, the file ends up corrupted, as if its type were not recognized.

I do not know whether this is due to the Python serialization or to the server itself through the API, but it is clear to me that the problem is there. I need help, and I have a lot of questions. The good news is that my company has authorized me to share the code and to spend some time on this, so I can work on it full time over the next few days. But it must be quick. Thank you for your help. I captured the TCP traffic that I sent to OpenRefine (CSV file) with Python.


Here is the HTTP packet. If you see something, let me know:
--e848888cc6eb3c4ba87ab35ec5ce6c9f
Content-Disposition: form-data; name="format"


xslx
--e848888cc6eb3c4ba87ab35ec5ce6c9f
Content-Disposition: form-data; name="separator"


;
--e848888cc6eb3c4ba87ab35ec5ce6c9f
Content-Disposition: form-data; name="ignore-lines"


-1
--e848888cc6eb3c4ba87ab35ec5ce6c9f
Content-Disposition: form-data; name="header-lines"


1
--e848888cc6eb3c4ba87ab35ec5ce6c9f
Content-Disposition: form-data; name="skip-data-lines"


0
--e848888cc6eb3c4ba87ab35ec5ce6c9f
Content-Disposition: form-data; name="limit"


-1
--e848888cc6eb3c4ba87ab35ec5ce6c9f
Content-Disposition: form-data; name="guess-value-type"


true
--e848888cc6eb3c4ba87ab35ec5ce6c9f
Content-Disposition: form-data; name="process-quotes"


true
--e848888cc6eb3c4ba87ab35ec5ce6c9f
Content-Disposition: form-data; name="store-blank-rows"


true
--e848888cc6eb3c4ba87ab35ec5ce6c9f
Content-Disposition: form-data; name="store-blank-cells-as-nulls"


true
--e848888cc6eb3c4ba87ab35ec5ce6c9f
Content-Disposition: form-data; name="include-file-sources"


false
--e848888cc6eb3c4ba87ab35ec5ce6c9f
Content-Disposition: form-data; name="project-name"


test 9
--e848888cc6eb3c4ba87ab35ec5ce6c9f
Content-Disposition: form-data; name="fd"; filename="fd"


policyID;statecode;county;eq_site_limit;hu_site_limit;fl_site_limit;fr_site_limit;tiv_2011;tiv_2012;eq_site_deductible;hu_site_deductible;fl_site_deductible;fr_site_deductible;point_latitude;point_longitude;line;construction;point_granularity
119736;FL;CLAY COUNTY;498960;498960;498960;498960;498960;792148.9;0;9979.2;0;0;30.102261;-81.711777;Residential;Masonry;1
448094;FL;CLAY COUNTY;1322376.3;1322376.3;1322376.3;1322376.3;1322376.3;1438163.57;0;0;0;0;30.063936;-81.707664;Residential;Masonry;3
206893;FL;CLAY COUNTY;190724.4;190724.4;190724.4;190724.4;190724.4;192476.78;0;0;0;0;30.089579;-81.700455;Residential;Wood;1
333743;FL;CLAY COUNTY;0;79520.76;0;0;79520.76;86854.48;0;0;0;0;30.063236;-81.707703;Residential;Wood;3
172534;FL;CLAY COUNTY;0;254281.5;0;254281.5;254281.5;246144.49;0;0;0;0;30.060614;-81.702675;Residential;Wood;1
785275;FL;CLAY COUNTY;0;515035.62;0;0;515035.62;884419.17;0;0;0;0;30.063236;-81.707703;Residential;Masonry;3
995932;FL;CLAY COUNTY;0;19260000;0;0;19260000;20610000;0;0;0;0;30.102226;-81.713882;Commercial;Reinforced Concrete;1
223488;FL;CLAY COUNTY;328500;328500;328500;328500;328500;348374.25;0;16425;0;0;30.102217;-81.707146;Residential;Wood;1
433512;FL;CLAY COUNTY;315000;315000;315000;315000;315000;265821.57;0;15750;0;0;30.118774;-81.704613;Residential;Wood;1
142071;FL;CLAY COUNTY;705600;705600;705600;705600;705600;1010842.56;14112;35280;0;0;30.100628;-81.703751;Residential;Masonry;1
253816;FL;CLAY COUNTY;831498.3;831498.3;831498.3;831498.3;831498.3;1117791.48;0;0;0;0;30.10216;-81.719444;Residential;Masonry;1
--e848888cc6eb3c4ba87ab35ec5ce6c9f
Content-Disposition: form-data; name="filename"; filename="filename"


C:\Users\dpache\Desktop\test_project_reconc\fl_insurance_sample.csv
--e848888cc6eb3c4ba87ab35ec5ce6c9f--



Here is the result:

2018-10-26 09_43_37-test 9 - OpenRefine.png

As you can see, there is a strange column containing my directory path, probably due to incorrect serialization.

Do you have any tests, or a Python or PowerShell sample, to test the API? It would be very helpful for debugging.
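
For comparison with a capture like the one above, here is a minimal standard-library sketch that builds the same kind of multipart/form-data body locally and prints it, without contacting any server. The helper name `encode_multipart` and the shortened sample values are illustrative only; they are not part of refine-client-py, and the field name `project-file` is simply the one that appears later in this thread.

```python
import uuid


def encode_multipart(fields, file_field, filename, file_bytes):
    """Encode text fields plus one file as a multipart/form-data body.

    Returns (body_bytes, content_type_header_value).
    """
    boundary = uuid.uuid4().hex
    delimiter = b"--" + boundary.encode("ascii")
    parts = []

    # One part per plain text field: header line, blank line, value.
    for name, value in fields.items():
        header = f'Content-Disposition: form-data; name="{name}"'
        parts.extend([delimiter, header.encode("utf-8"), b"", str(value).encode("utf-8")])

    # The file part carries a filename and the raw bytes, untouched.
    file_header = (
        f'Content-Disposition: form-data; name="{file_field}"; filename="{filename}"'
    )
    parts.extend(
        [
            delimiter,
            file_header.encode("utf-8"),
            b"Content-Type: application/octet-stream",
            b"",
            file_bytes,
        ]
    )

    # The closing boundary has two extra trailing dashes.
    parts.extend([delimiter + b"--", b""])

    body = b"\r\n".join(parts)
    return body, f"multipart/form-data; boundary={boundary}"


# Hypothetical sample values, shortened from the capture above.
sample = b"policyID;statecode;county\r\n119736;FL;CLAY COUNTY\r\n"
body, content_type = encode_multipart(
    {"project-name": "test 9", "separator": ";"},
    "project-file",
    "fl_insurance_sample.csv",
    sample,
)
print(content_type)
print(body.decode("utf-8"))
```

Printing the body next to the captured packet makes it easier to spot stray blank lines, a file path sent as its own part, or a file part whose bytes were altered before upload.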

Owen Stephens

Oct 27, 2018, 12:40:05 AM
to OpenRefine
Hi David,

The first strange thing I notice in the information you've shared is that you have
Content-Disposition: form-data; name="format"


xslx

That doesn't seem right, as you are trying to import a CSV. Is setting the format working correctly?

Owen

Ettore Rizza

Oct 28, 2018, 10:46:30 AM
to OpenRefine
Hello David,

It looks like you're not the only one working on a refine-client translation to py3. Maybe you should talk to Lan Li.

Best regards,

Ettore

David PACHE

Oct 29, 2018, 8:50:18 AM
to OpenRefine
Hello Ettore,

Do you know a way to contact him?

David PACHE

Oct 29, 2018, 8:56:22 AM
to OpenRefine
Hello Owen, 

In fact, none of the options (separator, format, etc.) seem to work as expected. Please see my upcoming update.

David PACHE

Oct 29, 2018, 10:00:47 AM
to OpenRefine
Hello,
 
I have made some progress.

Now, standard CSV seems to work flawlessly, but options such as separator, format and others are not taken into account. Excel is not working at all: the file is clearly corrupted on the server side. I am not so sure the serialization is to blame this time, given that CSV is OK. To be precise, I use binary serialization, so I do not understand why CSV would work but not Excel. Could it be the server's interpreter? Here are the new details:

--d89f46505d20b1c7d8aa2e2999b033bc
Content-Disposition: form-data; name="format"

csv
--d89f46505d20b1c7d8aa2e2999b033bc
Content-Disposition: form-data; name="separator"

;
--d89f46505d20b1c7d8aa2e2999b033bc
Content-Disposition: form-data; name="ignore-lines"

-1
--d89f46505d20b1c7d8aa2e2999b033bc
Content-Disposition: form-data; name="header-lines"

1
--d89f46505d20b1c7d8aa2e2999b033bc
Content-Disposition: form-data; name="skip-data-lines"

0
--d89f46505d20b1c7d8aa2e2999b033bc
Content-Disposition: form-data; name="limit"

-1
--d89f46505d20b1c7d8aa2e2999b033bc
Content-Disposition: form-data; name="guess-value-type"

true
--d89f46505d20b1c7d8aa2e2999b033bc
Content-Disposition: form-data; name="process-quotes"

true
--d89f46505d20b1c7d8aa2e2999b033bc
Content-Disposition: form-data; name="store-blank-rows"

true
--d89f46505d20b1c7d8aa2e2999b033bc
Content-Disposition: form-data; name="store-blank-cells-as-nulls"

true
--d89f46505d20b1c7d8aa2e2999b033bc
Content-Disposition: form-data; name="include-file-sources"

false
--d89f46505d20b1c7d8aa2e2999b033bc
Content-Disposition: form-data; name="project-name"

test Yassine 2
--d89f46505d20b1c7d8aa2e2999b033bc
Content-Disposition: form-data; name="project-file"; filename="project-file"

policyID;statecode;county;eq_site_limit;hu_site_limit;fl_site_limit;fr_site_limit;tiv_2011;tiv_2012;eq_site_deductible;hu_site_deductible;fl_site_deductible;fr_site_deductible;point_latitude;point_longitude;line;construction;point_granularity
119736;FL;CLAY COUNTY;498960;498960;498960;498960;498960;792148.9;0;9979.2;0;0;30.102261;-81.711777;Residential;Masonry;1
448094;FL;CLAY COUNTY;1322376.3;1322376.3;1322376.3;1322376.3;1322376.3;1438163.57;0;0;0;0;30.063936;-81.707664;Residential;Masonry;3
206893;FL;CLAY COUNTY;190724.4;190724.4;190724.4;190724.4;190724.4;192476.78;0;0;0;0;30.089579;-81.700455;Residential;Wood;1
333743;FL;CLAY COUNTY;0;79520.76;0;0;79520.76;86854.48;0;0;0;0;30.063236;-81.707703;Residential;Wood;3
172534;FL;CLAY COUNTY;0;254281.5;0;254281.5;254281.5;246144.49;0;0;0;0;30.060614;-81.702675;Residential;Wood;1
785275;FL;CLAY COUNTY;0;515035.62;0;0;515035.62;884419.17;0;0;0;0;30.063236;-81.707703;Residential;Masonry;3
995932;FL;CLAY COUNTY;0;19260000;0;0;19260000;20610000;0;0;0;0;30.102226;-81.713882;Commercial;Reinforced Concrete;1
223488;FL;CLAY COUNTY;328500;328500;328500;328500;328500;348374.25;0;16425;0;0;30.102217;-81.707146;Residential;Wood;1
433512;FL;CLAY COUNTY;315000;315000;315000;315000;315000;265821.57;0;15750;0;0;30.118774;-81.704613;Residential;Wood;1
142071;FL;CLAY COUNTY;705600;705600;705600;705600;705600;1010842.56;14112;35280;0;0;30.100628;-81.703751;Residential;Masonry;1
253816;FL;CLAY COUNTY;831498.3;831498.3;831498.3;831498.3;831498.3;1117791.48;0;0;0;0;30.10216;-81.719444;Residential;Masonry;1
--d89f46505d20b1c7d8aa2e2999b033bc--



As you can see, the separator option is not taken into account: only the comma is interpreted. One piece of good news: the double-quote text delimiter is interpreted correctly.
And the result:

2018-10-29 09_57_45-test Yassine 2 - OpenRefine.png





Lan Li

Oct 29, 2018, 11:31:32 AM
to openr...@googlegroups.com
Hi David,
I am Lan Li. I'm so grateful that Ettore helped me find you, as you are also working on Python 3 with OpenRefine. I'm working on a research project with OpenRefine, and if the OpenRefine client library can be updated to Python 3, I think it will help my project greatly.

Thanks very much!
Regards,
Lan


Lan Li

Oct 29, 2018, 11:32:03 AM
to openr...@googlegroups.com
Thank you very much, Ettore!
Have a good day:)

Best,
Lan


David PACHE

Oct 29, 2018, 1:26:43 PM
to OpenRefine
Lan Li, your help would be greatly appreciated. I am in the Gitter chat room if you want to contact me.

la_poisse (David)

Oct 29, 2018, 1:45:55 PM
to OpenRefine
Here is the link to the chat room ... :)

Ettore Rizza

Oct 29, 2018, 1:46:49 PM
to OpenRefine
Hello Lan/David,

As I understand it, the main problem comes from the fact that refine-cli uses the deprecated module urllib2_file, whose sole purpose is to extend urllib2 to support HTTP POST file upload. But requests, if I'm not mistaken, can do the same thing. So the best thing would be to rewrite refine.py using requests instead of urllib.

la_poisse (David)

Oct 29, 2018, 2:51:49 PM
to OpenRefine
Hello Ettore,


Exactly, and that has already been done. But it does not explain the other problems. Just so we are clear, here is what I have working on my computer:
- creation of a project
- data import for the project IF it is a plain CSV using a comma as the separator and a double quote as the text delimiter (maybe a single quote too, but I doubt it)
- deletion of a project
- applying operations


For my needs, I would prefer to import an Excel file or change the separator for the CSV.

Owen Stephens

Oct 29, 2018, 6:29:36 PM
to OpenRefine
I suspect the delimiter problem is down to the problem addressed by this PR https://github.com/OpenRefine/OpenRefine/pull/1764

Ideally that PR would have some tests associated with it before we accept it into the code base, but you could try building OpenRefine with that change and see if that solves your problem

Owen

Thad Guidry

Oct 29, 2018, 8:13:18 PM
to openr...@googlegroups.com
Just advice: We don't alter data on import but highlight to the user when/where data is broken.

I would advise a cautious approach to manipulating original separators and in fact would not do that but tell the user something screwed up while parsing during import.

Still on vacation in China 1 more week.
-Thad


la_poisse (David)

Oct 30, 2018, 8:59:47 AM
to OpenRefine
Hello Thad, 

Sorry to disagree. Double quotes and commas are often used in raw data by users: after all, they have a meaning in text fields. I do not think that being able to import CSV files with other separators and qualifiers is an optional feature. Fortunately, I can work around this with my data for the moment.

Enjoy your trip.

la_poisse (David)

Oct 30, 2018, 9:21:14 AM
to OpenRefine
Hello Owen,

Hmm. It seems to be clearly related. I am not closed to the idea, but I do not feel comfortable recompiling OpenRefine with this pull request. First, it would be time-consuming for me to learn Git and install a Java development environment in the hope of a result. But I may be overestimating the time; if you have a good starting point, it could be acceptable. Second, it would mean installing a forked OpenRefine in my production environment, and I am clearly against that. I am not sure my boss would like the idea either :). I can deal with standard CSV for the moment, so I will probably stay with that for now. But if this pull request solves the Excel issue, it could be a game changer.

From what I saw, the tests for this pull request are nearly complete. Do you plan to accept it for the next version? When do you expect to release that version?

Owen Stephens

Oct 30, 2018, 9:31:03 AM
to OpenRefine
Hi David,

Thanks - I didn't mean to suggest running this in production - just to see if it fixed your problem. Building OpenRefine is relatively straightforward - but I understand it is all extra time you need to spend.

We have just started to discuss releasing a 3.1 beta version in the OpenRefine developers group. Still under discussion, but it seems likely we'll do this soon - and potentially we might include this change if we can get some tests for this.

However - I'm not sure this will resolve the Excel problem - which seems likely different to the CSV separator problem (which is just about specifying the right options and them being read by OpenRefine).

Can you share the Python code you are using to do the Project creation? If so I can do some testing and see if I can diagnose the problem

Owen
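
As a rough illustration of "specifying the right options", here is a minimal sketch that packs importer options into a single JSON string instead of sending one form field per option. This is an assumption, not something confirmed in this thread: it supposes the server reads an `options` form field containing JSON, and the key names `separator` and `headerLines` are likewise assumed and should be checked against your OpenRefine version. The helper `build_import_fields` is hypothetical.

```python
import json


def build_import_fields(project_name, project_format, **options):
    """Return form fields with importer options packed into one JSON string."""
    return {
        "project-name": project_name,
        "format": project_format,
        # Assumption: the server parses a JSON-encoded "options" field.
        "options": json.dumps(options, sort_keys=True),
    }


# Assumed option names; verify them against the server before relying on them.
fields = build_import_fields(
    "test Yassine 2",
    "text/line-based/*sv",
    separator=";",
    headerLines=1,
)
for name, value in fields.items():
    print(f"{name}: {value}")
```

These fields would then be sent alongside the file part of the upload; whether the server honours them is exactly what needs testing.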

la_poisse (David)

Oct 30, 2018, 10:50:17 AM
to OpenRefine
With pleasure. Please forgive me if the code is not very Pythonic; criticism is welcome. I am attaching the refine.py forked from refine-client-py. The new_project function is the one you need to test. As an example, I am also attaching parker_cleansing, which I used to test the refine client. If you need anything else, feel free to contact me.
refine.py
parker_cleansing.py

la_poisse (David)

Oct 30, 2018, 10:54:38 AM
to OpenRefine
By the way, since the options are not taken into account during import, the file type cannot be forced. That is why I thought the two issues might be related: the importer interpreting the Excel file as CSV ... But you probably know better :).

Thad Guidry

Oct 30, 2018, 9:27:02 PM
to openr...@googlegroups.com
OpenRefine labels the importer "CSV", but it's actually a delimited-file importer. We use the term "CSV" because it's what folks understand, but it really handles delimited files of any kind that can be parsed with the right logic. We use file extensions as a clue for parsing, but OpenRefine can in fact import any file: we just add new importers with the parsing logic when necessary, and give users the power to control how an importer parses. That's the design we follow.

- Thad
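
To make the "delimited file" point concrete, here is a small sketch using Python's standard `csv` module (not OpenRefine's importer) on a few columns taken from the sample data earlier in this thread. It shows how the same text splits into columns depending on the separator the parser is told to use. The helper name `parse_delimited` is illustrative.

```python
import csv
import io

# A few columns from the sample shared earlier in the thread.
SAMPLE = (
    "policyID;statecode;county;line;construction\n"
    "119736;FL;CLAY COUNTY;Residential;Masonry\n"
    "995932;FL;CLAY COUNTY;Commercial;Reinforced Concrete\n"
)


def parse_delimited(text, delimiter):
    """Parse delimited text into a list of rows using the given separator."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


rows_semicolon = parse_delimited(SAMPLE, ";")
rows_comma = parse_delimited(SAMPLE, ",")

print(len(rows_semicolon[0]))  # number of columns when splitting on ";"
print(len(rows_comma[0]))      # number of columns when splitting on ","
```

With the wrong separator, every row collapses into a single column, which matches the symptom of an import where the separator option is ignored.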



Owen Stephens

Oct 31, 2018, 7:36:26 AM
to OpenRefine
Hi David,

Looking at ```def new_project``` in refine.py, we can see that it currently sets project_format to text/line-based/*sv. This is hard-coded, and you won't be able to import any other type of file without extending this code to support different file types.

It may make sense to look at this code https://github.com/opencultureconsulting/openrefine-client by Felix Lohmeier (based on the original library by Paul Makepeace). It includes the necessary changes to support xls import (and other files). While this code also needs updating to run in a Python 3 environment, I suspect that part of the process will be the same as you've already had to deal with to get your version of refine.py working (note I've not done any extensive checking, but I suspect it is just the urllib2_file dependency again)

Owen
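
One way to remove the hard-coded format is to choose it from the file extension. This is a minimal sketch, not the code from either library: the helper `guess_project_format` is hypothetical, the delimited-text identifier is the one discussed above (written here without a leading slash, which is an assumption), and the Excel identifier is also an assumption to verify against your OpenRefine server.

```python
import os

# Assumed importer identifiers; verify them against your OpenRefine server.
FORMAT_BY_EXTENSION = {
    ".csv": "text/line-based/*sv",
    ".tsv": "text/line-based/*sv",
    ".xls": "binary/text/xml/xls/xlsx",
    ".xlsx": "binary/text/xml/xls/xlsx",
}


def guess_project_format(path, default="text/line-based/*sv"):
    """Pick an importer identifier from the file extension (case-insensitive)."""
    extension = os.path.splitext(path)[1].lower()
    return FORMAT_BY_EXTENSION.get(extension, default)


print(guess_project_format("fl_insurance_sample.csv"))
print(guess_project_format("fl_insurance_sample.XLSX"))
```

A caller could still pass an explicit format to override the guess, which keeps the extension only as a clue rather than a rule.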