Create complex XML from OpenRefine in record mode


Martin Magdinier

Jan 9, 2013, 10:54:37 PM
to openrefine

Hey,

Currently, thanks to record mode, OpenRefine supports importing complex XML files. Each child of the main element is stored in a new row, and record mode helps to manipulate the data. The templating export is great for converting basic CSV to XML.

However, it would be even better if it supported complex XML export based on records. In this scenario, the templating export would iterate over every non-blank row within a record and create a child element for each.

This process would be the reverse of the current import function and would bridge the gap, giving us a complete tool for XML import, transformation/cleaning, and export.
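A minimal sketch of the record-based export described above, written in plain Python rather than as an OpenRefine exporter. It assumes rows are simple dictionaries and that a new record starts whenever the key column is non-blank. The element names (`programs`, `program`, `actor`), the helper functions, and the sample data are all made up for illustration; none of this is existing OpenRefine functionality.

```python
import xml.etree.ElementTree as ET

# Hypothetical rows laid out the way record mode shows them:
# a non-blank "title" starts a new record; rows with a blank
# "title" belong to the record above.
rows = [
    {"title": "Mystery Hour", "actor": "Ann Lee"},
    {"title": "", "actor": "Bob Ray"},
    {"title": "Comedy Time", "actor": "Cy Dunn"},
]


def group_into_records(rows, key_column):
    """Group consecutive rows into records, starting a new record
    whenever the key column is non-blank."""
    records = []
    for row in rows:
        if row.get(key_column) or not records:
            records.append([])
        records[-1].append(row)
    return records


def records_to_xml(rows, key_column, root_tag, record_tag):
    """Build one element per record and one child element per
    non-blank cell in the remaining columns."""
    root = ET.Element(root_tag)
    for record in group_into_records(rows, key_column):
        record_el = ET.SubElement(root, record_tag)
        ET.SubElement(record_el, key_column).text = record[0].get(key_column, "")
        for row in record:
            for column, value in row.items():
                if column != key_column and value:
                    ET.SubElement(record_el, column).text = value
    return root


xml_text = ET.tostring(
    records_to_xml(rows, "title", "programs", "program"), encoding="unicode"
)
print(xml_text)
```

With the sample rows, both actors of the first record end up as repeated `actor` children of a single `program` element, which is the behaviour a row-based export cannot give.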

I wanted to open the discussion here and see whether it is worth creating an issue/request.

Thanks

Martin

Owen Stephens

Jan 10, 2013, 4:25:41 AM
to openr...@googlegroups.com
I'd support this not just for XML but also for JSON. This was also mentioned in David Huynh's previous list of feature requests: https://groups.google.com/d/topic/openrefine/Qs7vcbdz2zs/discussion

I wonder if this is also linked to another item in that list of feature requests which was "Internal JSON-like data model. Improve the record model and make it behave coherently in expressions, facets, and operations."

Owen

Marsha Maguire

Apr 5, 2013, 3:56:19 PM
to openr...@googlegroups.com
Thanks so much, Martin.

I would say, "Yes!" It would very much be worth creating an issue/request to add XML export in the way you describe it. If XML is still the lingua franca of the Web, then we definitely need a complete import/transform/clean/export XML tool in Refine.

Thanks again. I really appreciate your help.

Marsha


On Mon, Apr 1, 2013 at 3:00 PM, Marsha <mmagu...@gmail.com> wrote:
I hope I'm understanding this correctly (I'm not a programmer AT ALL), but are you saying that if data in OpenRefine takes advantage of Record mode (is that the same as Column Groups?), it cannot be converted to XML (without what for me would be impossible scripting)? For example, I opened an Excel spreadsheet in OpenRefine to clean up some data. The spreadsheet describes some old radio programs, and each row represents an episode of the program series. The thing is, in a given episode, there may be 2-3 actors, and there may be alternate titles for some episodes, so to preserve that hierarchical structure (2 actors under one radio program), I used the Record feature in OpenRefine. I need to convert this, ultimately, to XML. I was going to use Talend Open Source for that, but I don't know how to represent the OpenRefine concept of Records in Talend. Talend looks at a "spreadsheet" as flat, but in the OpenRefine Record structure, it's really hierarchical.
 
Are you saying I can't export Records from OpenRefine (at present) in a way that preserves that hierarchical structure represented in OpenRefine Records? Can Excel handle this kind of hierarchical structure at all?
 
I work in a library, and I can see that OpenRefine, Talend, and similar tools could be so useful to us (in converting one form of data to another), but I'm just starting to learn all this.
 
Many thanks.
 
Marsha

--
You received this message because you are subscribed to a topic in the Google Groups "Open Refine" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/openrefine/l22PL_IQyTY/unsubscribe?hl=en-US.
To unsubscribe from this group and all its topics, send an email to openrefine+...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.
 
 

Tom Morris

Apr 5, 2013, 5:42:25 PM
to openr...@googlegroups.com
I'm happy to have people create feature requests even if they're the only one who wants the feature.  It provides a context for the discussion, refinement, and prioritization of the feature. 

On Mon, Apr 1, 2013 at 3:00 PM, Marsha <mmagu...@gmail.com> wrote:
 
Are you saying I can't export Records from OpenRefine (at present) in a way that preserves that hierarchical structure represented in OpenRefine Records? Can Excel handle this kind of hierarchical structure at all?

That's correct.  All of the exporters are row based, not record based.

Record mode is really only partially implemented at the moment, and column groups for XML & JSON are computed the same way as they are for spreadsheets, rather than from the actual tree structure of the file. Additionally, column groups get deleted as soon as you move any columns around, even if they wouldn't necessarily be affected.

Excel does appear to have some limited XML functionality, but I've never tried to use it, so I don't know if it would be enough for your case:
You would need an XSD schema file which describes your desired format.

Tom

Marsha Maguire

Apr 5, 2013, 7:05:55 PM
to openr...@googlegroups.com
Thanks for your reply.

Excel, too, only likes flat structures when exporting spreadsheet data as XML. Try getting hierarchical on Excel and it will get very confused.

Not being a programmer (but not having access to one, either, even at a serious research library, because they're all too busy), I'm learning that these wonderful data conversion tools (OpenRefine, Talend, etc.) actually can't handle conversion to XML that has any kind of hierarchy to it (doesn't nearly all XML have hierarchy and repeating elements?). JSON is hierarchical; is this also difficult to map to flat structures? This is all very discouraging. Maybe someday...

Marsha


Tom Morris

Apr 10, 2013, 6:07:24 PM
to openr...@googlegroups.com
Marsha - sorry for the delayed reply.

On Fri, Apr 5, 2013 at 7:05 PM, Marsha Maguire <mmagu...@gmail.com> wrote:
 
 (doesn't nearly all XML have hierarchy and repeating elements?). JSON is hierarchical; is this also difficult to map to flat structures? 

A lot of XML does have hierarchies and repeating elements. We can read these easily, but we cannot write them back out and preserve the structure.

You are also correct that JSON is a hierarchical structure.
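To illustrate why a hierarchy is easy to read into rows but hard to write back out, here is a small Python sketch that flattens one nested JSON object into record-style rows. The sample data and the `flatten_record` helper are hypothetical, and this is not a description of how Refine's importer is actually implemented.

```python
import json

# Hypothetical nested input: one program with repeating actors.
doc = json.loads('{"title": "Mystery Hour", "actors": ["Ann Lee", "Bob Ray"]}')


def flatten_record(obj, list_key):
    """Spread one nested object over several flat rows: scalar
    fields appear on the first row only, and each item of the
    repeating list gets a row of its own."""
    scalars = {k: v for k, v in obj.items() if k != list_key}
    items = obj.get(list_key) or [""]
    rows = []
    for index, item in enumerate(items):
        row = {k: (v if index == 0 else "") for k, v in scalars.items()}
        row[list_key] = item
        rows.append(row)
    return rows


flat_rows = flatten_record(doc, "actors")
for row in flat_rows:
    print(row)
```

A row-based exporter that looks only at the second row has no way of knowing that it belongs under "Mystery Hour"; that grouping is the information a record-based exporter would need to keep.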

Unfortunately, the traditional way of processing XML (XSLT style sheets) is pretty unfriendly, even for programmers.  You might want to look into XML editors which have batch capabilities.  If you need Refine's text processing, clustering, etc, it looks like Altova, StylusStudio, Oxygen, etc can probably map from Refine's output CSV to the hierarchical XML (or they may have sufficiently powerful tools built-in to do the whole job).

Of course, if you'd like to loan us one of your programmers (or fund one of ours), we could teach Refine how to do it all :-) 

Tom

SanjayKumar Rajbhar

Apr 11, 2013, 11:07:43 AM
to openr...@googlegroups.com
It's quite easy to map in Excel 2007 and later...


--
Regards,
SanjayKumar Rajbhar
9930053346

Marsha Maguire

Apr 11, 2013, 12:06:59 PM
to openr...@googlegroups.com
Does that include nested, repeating elements? It looks like XML can be imported into Excel, edited in "XML tables," and exported as XML. But what about opening an Excel spreadsheet in Excel and exporting it as hierarchical XML that can include repeating elements? Of course, we'd need to set up the Excel spreadsheet in a way that shows which cells in which rows should be exported into the same resulting XML element (as in the OpenRefine "record" model).

Oxygen does convert Excel to XML, but again, values that should end up in repeating elements in the XML are stripped out. My spreadsheet describes radio programs, some of which feature two or more actors. Actor names beyond the first one are stripped out.

I appreciate your answers and suggestions so much, I can't tell you! Many thanks.

Marsha



Tom Morris

Apr 11, 2013, 12:16:05 PM
to openr...@googlegroups.com
Excel was my first suggestion.  Did you try the instructions on the link I forwarded?



On Thu, Apr 11, 2013 at 12:06 PM, Marsha Maguire <mmagu...@gmail.com> wrote:
Does that include nested, repeating elements? It looks like XML can be imported into Excel, edited in "XML tables," and exported as XML. But what about opening an Excel spreadsheet in Excel and exporting it as hierarchical XML that can include repeating elements? Of course, we'd need to set up the Excel spreadsheet in a way that shows which cells in which rows should be exported into the same resulting XML element (as in the OpenRefine "record" model)

When you rejected this solution as unworkable, I presumed that you had tested this and it didn't meet your needs. I'd suggest giving it a try directly.  You might find it easily meets your needs out of the box.

Tom 

Marsha Maguire

Apr 11, 2013, 12:46:29 PM
to openr...@googlegroups.com
I did. Well, I read them, but I need to actually try it out. I'm also tinkering with Oxygen because we have it here.

Great! I'm hopeful (for the first time in months) that we can really get this done. And we'll use OpenRefine for cleaning, parsing, reconciling data before we convert it.

Thank you again.

