Handle nested hierarchical XML data

363 views
Skip to first unread message

Kai Liu

unread,
Aug 17, 2022, 1:50:13 AM8/17/22
to OpenRefine
Hi,

I am about to handle a nested xml file like the following:

<design>
  <name>mydesign</name>
  <port>
    <var>
      <name>var1</name>
    </var>
    <var>
      <name>var2</name>
    </var>
  </port>
  <port>
    <var>
      <name>var3</name>
    </var>
  </port>
</design>

I loaded it to OpenRefine by using the record mode, and got one record with two columns: "design - name" and "design - port - var - name". The relationship between port and var is gone. The relationship between design and port is gone as well. My question is: how to teach OpenRefine to keep the original hierachical information in the record mode. Maybe via column transformation? In this specific case, I'd like to get one record with columns like:

| design | design - name | design - port | design - port - var | design - port - var - name |

Any suggestions are much appreciated. 

Thanks in advance.

Kai

Antoine Beaubien

unread,
Aug 17, 2022, 2:33:53 AM8/17/22
to OpenRefine
I don't have a good solution for you.

The XML importer is too limited.

My best shot would be to edit the XML (add a property to the tag <port>?). Only in OR, I would try this hack:
I would import it with a custom column delimiter set to 2 spaces.
It works with the indentation of your example.

There is still a lot of clean up, but with a few regex, magic can be done.

Regards,
   Antoine
Screenshot 2022-08-17 02.24.03.png

jonathan...@gmail.com

unread,
Aug 17, 2022, 4:11:12 AM8/17/22
to openr...@googlegroups.com

If you’re getting the data from a website, try creating the project with the “web url” option – paste the link into the box. From there OR is able to allow you a one-click conversion to a readable table (select the table within the preview pane)  – I’ve only just started using this solution myself and am amazed how easy it is.

 

Jonathan Stoneman

--
You received this message because you are subscribed to the Google Groups "OpenRefine" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openrefine+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/openrefine/c743accb-195b-4eca-b044-4487b2ef17ffn%40googlegroups.com.

Kai Liu

unread,
Aug 18, 2022, 4:21:31 PM8/18/22
to OpenRefine
Hi Antoine,

nice! I tried your hack. It worked for XML files that have fixed number of spaces between tags. 

I tried also your proposal about adding an element to <port>, 

<design>
  <name>mydesign</name>
  <port>
    <number>1</number>

    <var>
      <name>var1</name>
    </var>
    <var>
      <name>var2</name>
    </var>
  </port>
  <port>
    <number>2</number>

    <var>
      <name>var3</name>
    </var>
  </port>
</design>

and got the following:

1.png

The problem is that it can only recognize the first level of record, but the next level of record is not modelled. Meaning that the relationship between port1 and var1+var2 is gone. 

My real-world XMLs are much more complex than this small example. I believe OR might not be the right tool to process XML, it is really good for tabular files though.

Thanks for your help.

Best regards, Kai

Antoine Beaubien

unread,
Aug 18, 2022, 9:20:34 PM8/18/22
to openr...@googlegroups.com
Hi Kai,

   This hack is not very good as it really depends on the indentation, which is really guaranteed, unless you format it before.

   That being said, the problem of the single-level-grouping can be addressed by 2 actions, filling down the first column, and moving the column new grouping column at the start. Yes, it’s cumbersome, but it gets the job done.

   In your situation, I would first use a tool like xq (https://github.com/kislyuk/yq), to adapt my XML (in this case, a modification and JSON conversion).
This was just a hack, useful in particular situations.

Regards,
   Antoine



--
You received this message because you are subscribed to a topic in the Google Groups "OpenRefine" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/openrefine/cpnKPvjnWmE/unsubscribe.
To unsubscribe from this group and all its topics, send an email to openrefine+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/openrefine/a822337d-996a-4869-ac3a-99bffc255c3an%40googlegroups.com.

Owen Stephens

unread,
Aug 31, 2022, 6:12:20 AM8/31/22
to OpenRefine
As noted, the XML import is a bit limited - and to be honest round-tripping XML in and out of OpenRefine is painful as building an export template to produce the XML is also difficult.

Another option which I've used (which may or may not be appropriate for you) is to put all the XML in a single cell, and then use the GREL parseXML() to do all further manipulation. This can work although requires some care. You can split the XML into a row per top level element with somethiung like:

value.parseXml().select("design").join("|")

and then use "Split multi-valued cells" to get the content of each "design" element on it's own row. From there you can do further work to extract the values to columns / rows exactly as you need. It can be painstaking with complex XML, although if you are repeating the same process many times with the same structure of XML this can then be re-used via the Extracting/Applying the same steps to other projects

Owen
Reply all
Reply to author
Forward
0 new messages