reconcile-csv issues

592 views
Skip to first unread message

taxidriver

unread,
Apr 25, 2017, 2:46:15 PM4/25/17
to OpenRefine
Hi all,

I am using OpenRefine 2.7-rc.2 (1) on a Macbook running OSX Sierra.

I have 2 datasets 
  1. A master CSV file containing of over 7,000 schools, addresses and unique IDs. 
  2. A file containing schools, addresses and related data, but no IDs.
I want to add the unique IDs to dataset #2. There are 2 main issues with the data:
  1. The school names are not always consistently entered (e.g. JFK Middle School vs JF Kennedy MS). 
  2. School names are not unique because the same school name can appear in different cities.
I used Reconcile-CSV to host dataset #1 through the command line, loaded dataset #2 into OR, and started the reconcile by adding http://localhost:8000/reconcile as a standard service. The reconcile processed for a while but the judgment came back all none despite seeing strings that are 100% matches in the school_name field in the two files. I tried changing the different settings in the reconcile pop-up (type on/off; match checked/uncheck), but the results are still null. There is a message in the reconcile pop-up window that reads, "Sorry, we can't suggest any type for your data. Please specify a type yourself below," but I don't think that means anything to local reconciling.

I'm stuck. Is there a way to troubleshoot this issue? An error log? Is this a new incompatibility with the older reconcile-csv. 

Is there another way to do this? Can I combine the two files, cluster and edit on multiple columns (eg school_name and city) then fill down for the missing IDs? If so, what's the command for cluster/editing on multiple columns?

Thank you for your help.
Drew

Ettore Rizza

unread,
Apr 25, 2017, 3:59:51 PM4/25/17
to OpenRefine
Hi Drew,

I understand your frustration, but it's hard to help you remotely. The problem can have many causes: no unique ID in your main file, a bad column name... I don't see anything in Github about a specific Mac issue. 

To keep it short and simple, you can use as an alternative the free Excel plugin Fuzzy Lookup Add-In for Excel. It's easy to use, as you can see in this screencast. The main constraint: the two tables to be reconciled must be in Excel table format (in the screencast, I use CTRL + L to do the operation).


Drew Harris

unread,
Apr 25, 2017, 6:07:11 PM4/25/17
to openr...@googlegroups.com
Ettore,

Thank you! Your post got me thinking that the reference csv was improperly formatted. So, I cleaned out some forward slashes in a few of the school names and round trip the file through OpenRefine. The exported version worked!!

I knew right away when the data type window in the reconcile popup had a button with /csv-recon next to it. Before, it just had a message that said no suggested data type. I wish the error message was clearer, but now it’s fixed.

Thanks, again.

Drew


--
You received this message because you are subscribed to the Google Groups "OpenRefine" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openrefine+...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Drew Harris

unread,
Apr 26, 2017, 8:39:35 AM4/26/17
to openr...@googlegroups.com
Ok. My reconciliation is now running properly, generating appropriate matches. Unfortunately, cell.recon.match.id
only generates nulls. I get a 404 not found error, when I click on one of the matches or click on “choose new match.” The reconcile-csv java script appears to be running properly in the terminal window. It did generate the following when it started, but I don’t know if this is routine.
2017-04-26 08:15:08.499:INFO:oejs.Server:jetty-7.x.y-SNAPSHOT
2017-04-26 08:15:08.540:INFO:oejs.AbstractConnector:Started SelectChann...@0.0.0.0:8000

If case it’s important, this is the command I used to start the java program: 
java -Xmx2g -jar reconcile-csv-0.1.2.jar <CSV-File> <Search Column> <ID Column>

Any thoughts??

Thanks,

Drew

Ettore RIZZA

unread,
Apr 26, 2017, 9:04:56 AM4/26/17
to openr...@googlegroups.com
Maybe a firewall problem. I've juste tried to use reconcile-csv and, as soon as I typed the command, Windows asked me if I allow the Internet connection. I do not know how this happens on Mac.



To unsubscribe from this group and stop receiving emails from it, send an email to openrefine+unsubscribe@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "OpenRefine" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openrefine+unsubscribe@googlegroups.com.

Drew Harris

unread,
Apr 26, 2017, 10:57:54 AM4/26/17
to openr...@googlegroups.com
I’m running Little Snitch and it doesn’t flag it. Even if I turn the filter off, it doesn’t work. I don’t see how it could work for reconcile, but be blocked by a firewall for the lookup. Shouldn’t it be all or none?


On Apr 26, 2017, at 9:04 AM, Ettore RIZZA <ettor...@gmail.com> wrote:

Maybe a firewall problem. I've juste tried to use reconcile-csv and, as soon as I typed the command, Windows asked me if I allow the Internet connection. I do not know how this happens on Mac.
<Capture.JPG>

To unsubscribe from this group and stop receiving emails from it, send an email to openrefine+...@googlegroups.com.

Ettore RIZZA

unread,
Apr 26, 2017, 11:05:22 AM4/26/17
to openr...@googlegroups.com
Looks like it's not "all or nothing". If I disable all "Java (TM) Platform SE Binary" in my Firewall, reconciliation works, but return to a "404 not found" page when I click a blue link.

Drew Harris

unread,
May 7, 2017, 7:51:52 PM5/7/17
to openr...@googlegroups.com
I’ve done some more testing and don’t believe this is a firewall issue. After reconciling the data file against my reference file, every candidate match in the field links to localhost:8000/view/null. Clicking them yields the 404 error. Similarly, using the cell.recon.match.id in "add column based on this column" yields this Error: Cannot retrieve field from null in the result column.

Even when I create a test file and use it as both master and lookup file, I still get null returns with cell.recon.match.id. The matches are perfect but it won’t pull in the ID from reference file.

Very frustrating. Does anyone have any ideas of what I should be looking at.


To unsubscribe from this group and stop receiving emails from it, send an email to openrefine+...@googlegroups.com.

Drew Harris

unread,
May 8, 2017, 2:04:56 AM5/8/17
to openr...@googlegroups.com
All,

Problem solved. Apparently, Excel on Mac generates legacy carriage returns (CR) resulting in an improperly formatted csv file. I tried 2 ways to clean the files. 1) Import/export from Open Refine as a csv. 2) Open the file in TextWrangler and tell it to change the CRs to Unix style instead of Mac. I did this to the reference file opened by Reconcile-CSV in the terminal and the file I imported into OpenRefine.

So far, it looks good. Thanks.

Thad Guidry

unread,
May 8, 2017, 6:07:25 AM5/8/17
to openr...@googlegroups.com

Drew,

 

This might be a pretty important bug that needs addressed upstream from us as well with the Reconcile-csv team at OKFN, where Clojure-CSV is the library that they use to do the CSV parsing. 


What do you mean by improperly formatted ?

 

Mydata1,mydata2<CR>

Mydata1,mydata2<CR>

 

Or

 

“Mydata1,mydata2<CR>”<CR>

“Mydata1,mydata2<CR>”<CR>


Looks like Reconcile-csv just expects the default of Clojure-csv library which is:


        :end-of-line - A string containing the end-of-line character

                       for reading CSV files. If this setting is nil then

                       \\n and \\r\\n are both accepted.  Default value: nil


So if you had something other than a CR or CRLF , then yeah, Reconcile-csv probably didn't know what to do, and should be guided.  That change could be made as an input option for Reconcile-csv, as we do with OpenRefine with similar options and auto-detection of line-endings and parsing using the Apache POI Java library.  Besides, clojure-csv is now up to version 2.0.2  https://clojars.org/clojure-csv  which the csv-map library in Reconcile-csv depends on to actually do the parsing before it builds up the mapping for the recon process.


So I'd encourage you to open and issue and let Reconcile-csv know about your issue here:  https://github.com/okfn/reconcile-csv/issues

 

-Thad



To unsubscribe from this group and stop receiving emails from it, send an email to openrefine+...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "OpenRefine" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openrefine+...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "OpenRefine" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openrefine+...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "OpenRefine" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openrefine+...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "OpenRefine" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openrefine+...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Drew Harris

unread,
May 8, 2017, 1:08:00 PM5/8/17
to openr...@googlegroups.com
Caveat: I’m not a programmer. I’ve done some more testing and think I narrowed down the cause. 

It looks like the file Reconcile-CSV reads is the issue. This problem occurs when the file has the "Unicode (UTF-8), with BOM” encoding. When I use TextWrangler to change the encoding to “Unicode (UTF-8)” the matches are no longer null. My guess is that the default Excel CSV export adds BOM, which Reconcile-CSV can’t handle.

With other test files, the issue appeared to occur when the file had “Legacy Mac OS (CR)” according to TextWrangler. I found a similar issue with CSV files described on GitHub here. It is possible that when I changed the end-of-line code, I also fixed the encoding.

Reconcile-CSV hasn’t been updated in two years and appears to be inactive. It would be nice if it was able to pre-process or at least check the csv file to confirm it’s properly formatted. At least now I know how to fix the problem when it occurs.

Drew

Thad Guidry

unread,
May 8, 2017, 1:14:18 PM5/8/17
to openr...@googlegroups.com
" It is possible that when I changed the end-of-line code, I also fixed the encoding."

That's probably the likely scenario.

OK, thanks Drew.  We'll leave you alone now to do your polishing and refining.

Cheers,
-Thad

Reply all
Reply to author
Forward
0 new messages