How do I merge two excel spreadsheets into one using google refine?

2,208 views
Skip to first unread message

Miha Tomsic

unread,
Jan 15, 2014, 10:25:44 AM1/15/14
to openr...@googlegroups.com
Hello,

I am trying to figure out how to merge two sets of data into one and then refine it. I have created two simple excel documents and have been trying to merge them into one, but I can't do it. Several columns in both documents are very similar and in the first spreadsheet I have material no. 1, material no. 2 and material no. 3 plus some technical properties for each one of them (the numbers are totally made up) and in second spreadsheet I have material no. 1 and material no. 3 again, but there is no material no. 2, instead there is material no. 5, which is a different material. Also, only two of the technical properties are the same, while the third one is added. I would like to get one table containing all four materials and all 4 technical properties for each one (with values 0 where there is no value given). How can I do this?? I have been trying with this example http://blog.ouseful.info/2011/05/06/merging-datesets-with-common-columns-in-google-refine/ but probably I am messing something up, because I can't seem to get it right. 
Could anyone please tell me what to do, I would be very thankful.

Mihael Tomsic, faculty of civil engineering, university of Ljubljana, Slovenia 
test.xlsx
test2.xlsx

jonathan...@gmail.com

unread,
Jan 17, 2014, 4:18:33 AM1/17/14
to openr...@googlegroups.com
Hi Mihael,

I too had this problem for a while, and I didn't end up finding an easy, bulk way in Refine. The best way to merge two datasets/tables where they share a value in at least one column that I found is actually Google Fusion tables.

First you need to load both of them into Fusion tables and make sure you have the "Allow export" checkbox checked. Then from either of the new tables, select File -> Merge and then choose the other table from the list.

You will be shown the candidate columns that the app thinks match (in your case it looks like material no.1 or material no 3.—getting them standardized in Refine beforehand prevents errors here). The next option will be to select the columns you want to keep in the newly merged table, in your case, the material column and then the four technical properties.

As long as each row for a given material always has the same technical properties you should be fine to just keep one copy of each material's technical properties.

Export this newly merged table to CSV or your favorite format and bring it back into Refine. From there you should be able to create a facet to look for null values in each column and change them to 0.

Cheers,
Jonathan Schatz
BC Libraries Cooperative

Thad Guidry

unread,
Jan 17, 2014, 11:59:00 AM1/17/14
to openr...@googlegroups.com
Miha,

If you were using Refine 2.5 version, then the cross() function might not work properly, there were bugs that have been fixed.
Using OpenRefine 2.6 beta version has the fixes for the cross() function.  You might try downloading the newest beta version.


--
You received this message because you are subscribed to the Google Groups "OpenRefine" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openrefine+...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.



--

Miha Tomsic

unread,
Feb 3, 2014, 5:06:42 AM2/3/14
to openr...@googlegroups.com
Thank you!

Dne petek, 17. januar 2014 10:18:33 UTC+1 je oseba jonathan...@gmail.com napisala:

amga...@gmail.com

unread,
Feb 3, 2014, 2:51:18 PM2/3/14
to openr...@googlegroups.com
Hi Thad, 

I am running OpenRefine 2.6 beta and still running into errors in the cross() function.  I am using the expression:

cell.cross("UseThis_BARCODES_thru20140116PM csv", "Barcode").cells["DateIN"].value[0]

And all I get for a result in each cell is  "Error: Cannot retrieve field from null"

I have imported new projects and tried with those and I have closed and reopened Refine.

I would appreciate any help!

Best, 
Anne

Thad Guidry

unread,
Feb 3, 2014, 2:56:42 PM2/3/14
to openr...@googlegroups.com
Anne,

Are you sure you have a column called "Barcode" in one project....and in the other project, a column called "DateIN" ?

Tom Morris

unread,
Feb 3, 2014, 2:58:20 PM2/3/14
to openr...@googlegroups.com
On Mon, Feb 3, 2014 at 2:51 PM, <amga...@gmail.com> wrote:

I am running OpenRefine 2.6 beta and still running into errors in the cross() function.  I am using the expression:

cell.cross("UseThis_BARCODES_thru20140116PM csv", "Barcode").cells["DateIN"].value[0]

And all I get for a result in each cell is  "Error: Cannot retrieve field from null"

That expression doesn't look correct to me.  I would start smaller and build up the expression step-by-step.  That way you can check to see if you're getting the expected results at each stage before adding the next piece.

In particular, this piece:

    cell.cross("UseThis_BARCODES_thru20140116PM csv", "Barcode")

should return an array of matches, so you probably want your array index there i.e.

    cell.cross("UseThis_BARCODES_thru20140116PM csv", "Barcode")[0].cells["DateIN"].value

Tom

Thad Guidry

unread,
Feb 3, 2014, 2:59:59 PM2/3/14
to openr...@googlegroups.com
Anne,

Oh, I see your possible problem... your missing the index for "Barcode", which probably needs to be [0] , the first array, as Tom suggested...perhaps try this:

cell.cross("UseThis_BARCODES_thru20140116PM csv", "Barcode")[0].cells["DateIN"].value[0]

Anne Gaynor

unread,
Feb 3, 2014, 3:10:05 PM2/3/14
to openr...@googlegroups.com

Hi Thad,

Yes, I have the column Barcode in both projects and I am trying to fetch data from the column DateIN in the "UseThis_BARCODES..."

-Anne

--
You received this message because you are subscribed to a topic in the Google Groups "OpenRefine" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/openrefine/JNgJ2wXEsY4/unsubscribe.
To unsubscribe from this group and all its topics, send an email to openrefine+...@googlegroups.com.

amga...@gmail.com

unread,
Feb 3, 2014, 4:31:58 PM2/3/14
to openr...@googlegroups.com
Hi Tom, 

OK, then this might be where the problem begins.  When I enter: 

cell.cross("UseThis_BARCODES_thru20140116PM csv", "Barcode")

The preview shows return values of "[]"

I've attached screenshots of the two projects.  In earlier versions of UseThis_BARCODES_thru20140116PM csv, I created one of the Date and Time OUT columns using a cell.cross function: 

cell.cross("Cataloging Statistics Collection Responses thru Jan21noon xlsx", "Barcode").cells["Timestamp"].value[0]

Is something short circuiting because I'm trying to match again on a column I used to match on before?
OpenRefine.pdf

Tom Morris

unread,
Feb 3, 2014, 5:09:32 PM2/3/14
to openr...@googlegroups.com
Hi Anne.  Sounds like you're making progress in narrowing it down.

On Mon, Feb 3, 2014 at 4:31 PM, <amga...@gmail.com> wrote:

OK, then this might be where the problem begins.  When I enter: 

cell.cross("UseThis_BARCODES_thru20140116PM csv", "Barcode")

The preview shows return values of "[]"

That indicates that there aren't any matches.  Are you sure that the cell value that you are starting with has at least one match in the "Barcode" column of the target spreadsheet?  Have you accounted for whitespace and other potential problems?  Perhaps you could try cutting and pasting the values to make sure they're truly identical and not different in some subtle way.
 
I've attached screenshots of the two projects.  In earlier versions of UseThis_BARCODES_thru20140116PM csv, I created one of the Date and Time OUT columns using a cell.cross function: 

cell.cross("Cataloging Statistics Collection Responses thru Jan21noon xlsx", "Barcode").cells["Timestamp"].value[0]

Is something short circuiting because I'm trying to match again on a column I used to match on before?

No, that shouldn't cause any issues. 

Tom

amga...@gmail.com

unread,
Feb 3, 2014, 5:44:43 PM2/3/14
to openr...@googlegroups.com
Oh jeez, I feel like such an idiot. There were no matches within the first several rows that were showing up in the preview, but when I moved to a page where there was a match, the DateIN showed up. Thank you so much for your help! I probably wouldn't have discovered this error without your suggestions! 

Thad Guidry

unread,
Feb 3, 2014, 6:01:02 PM2/3/14
to openr...@googlegroups.com
Anne,

Yeah this has bitten many many many folks, including myself...

Best Practice In OpenRefine:

Always apply your GREL expressions, no matter the preview that your get, and then inspect the results to confirm your expectations of the expression.
You can always UNDO easily in OpenRefine and try again !

Christina Moraal

unread,
Aug 1, 2017, 8:25:02 AM8/1/17
to OpenRefine
Hi,

I have two different data sets, with the same type of info. I want to create a single project in OpenRefine and this one will be huge (1,7 Mil entries). I have tried to consolidate the data in excel first, but I have found it is limited to 1 mil rows only and I am hoping to consolidate the two data sets to have a single data set with all 1,7 mil rows.

Ettore Rizza

unread,
Aug 1, 2017, 8:37:57 AM8/1/17
to OpenRefine
@Christina If you have two (or more) datasets with exactly the same columns, you can create a project with both. Simply select them holding the CTRL key. Open Refine will recognize that it is identical data and will paste one below the other in the new project.

Reply all
Reply to author
Forward
0 new messages