Bug in XML parsing?


Paul Makepeace

Oct 25, 2012, 9:51:17 PM
to google-r...@googlegroups.com
I'm experimenting with loading XML data into Refine - the attached
file is the concatenation of a couple of Ohloh.net results from their
API (https://www.ohloh.net/p/gcc.xml?api_key=QAt…) wrapped in a <root>
tag.

Importing, I select <project> and it all pretty much works except I
get a third record. My cursory look makes me think: with the GCC
record the languages & factoids (four each) happen to match up. But
with refine-client-py there are three factoids but only two languages
so the third factoid seems to leak into a new record.

I'm new to this feature so maybe it's a known gotcha?

PS what do the red bars and (inactive) triangles in the headers mean?

Paul

(Refine r2407, OS X, Chrome)
ohloh.xml

Tom Morris

Oct 26, 2012, 10:57:27 AM
to google-r...@googlegroups.com
The short answer is that record mode is incomplete/buggy.

On Thu, Oct 25, 2012 at 9:51 PM, Paul Makepeace <pa...@paulm.com> wrote:
>
> I'm experimenting with loading XML data into Refine - the attached
> file is the concatenation of a couple of Ohloh.net results from their
> API (https://www.ohloh.net/p/gcc.xml?api_key=QAt…) wrapped in a <root>
> tag.

As an aside, particularly if it's only a couple, you might look at
pasting the URLs into the create project dialog and see what happens.

> Importing, I select <project> and it all pretty much works except I
> get a third record. My cursory look makes me think: with the GCC
> record the languages & factoids (four each) happen to match up. But
> with refine-client-py there are three factoids but only two languages
> so the third factoid seems to leak into a new record.

That's probably pretty close to what's happening. The missing bit of
information is that the XML (or JSON) structure isn't actually used to
create Refine records. Instead, Refine uses the same process that it
would use with an indented CSV, which is basically to look at how
populated the columns are and impute column groups, each with a "key
column" which always has a value in the group.
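
A rough sketch of the "key column" idea, not Refine's actual code: given
a group of columns, look for one that has a value on every row where the
group has any value at all. The function names and the lang/fact cell
values are made up for illustration.

```python
def is_blank(value):
    """Treat None and empty/whitespace-only strings as blank."""
    return value is None or str(value).strip() == ""


def find_key_column(rows, group):
    """Return the first column index in `group` that has a value on every
    row where any column of the group has a value, or None if none does."""
    for candidate in group:
        if all(
            not is_blank(row[candidate])
            for row in rows
            if any(not is_blank(row[c]) for c in group)
        ):
            return candidate
    return None


# Made-up rows: column 0 = language, column 1 = factoid.
matched = [
    ["lang1", "fact1"],
    ["lang2", "fact2"],
]
unmatched = [
    ["lang1", "fact1"],
    ["lang2", "fact2"],
    [None, "fact3"],  # third factoid has no language beside it
]

print(find_key_column(matched, [0, 1]))    # 0
print(find_key_column(unmatched, [0, 1]))  # 1
```

When the counts happen to match, the language column qualifies; when
they don't, only the factoid column does, so the imputed structure
depends on the data rather than on the XML nesting.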

The heuristic can fail in the other direction as well, giving merged XML records:
https://github.com/OpenRefine/OpenRefine/issues/137

I've tweaked things a little bit in the past, so it's possible that
I've biased it more towards failing in the other direction now. What
really needs to happen though is to construct the record structure
based on the XML/JSON structure, rather than reconstructing it after
the fact once all the useful information has been tossed away.

You can work around this by selecting a column with a unique value,
e.g. project - name, and moving it to the beginning. This will
discard and recreate the column groups and give you the two records
you desire.
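
To illustrate why the first column matters, here is a toy model that
assumes a new record starts at every row whose first cell is non-blank.
It is a simplification, not Refine's implementation, and it does not
reproduce the exact three-record result above. The rows and helper
names are invented.

```python
def is_blank(value):
    """Treat None and empty/whitespace-only strings as blank."""
    return value is None or str(value).strip() == ""


def move_column_to_front(rows, index):
    """Return new rows with the column at `index` moved to position 0."""
    return [[row[index]] + row[:index] + row[index + 1:] for row in rows]


def split_into_records(rows):
    """Start a new record at each row whose first cell is non-blank.
    Rows with a blank first cell join the preceding record."""
    records = []
    for row in rows:
        if not is_blank(row[0]) or not records:
            records.append([])
        records[-1].append(row)
    return records


# Made-up rows: language, factoid, project name.
rows = [
    ["lang1", "fact1", "gcc"],
    ["lang2", "fact2", None],
    ["lang3", "fact3", None],
    ["lang4", "fact4", None],
    ["lang1", "fact1", "refine-client-py"],
    ["lang2", "fact2", None],
    [None, "fact3", None],
]

print(len(split_into_records(rows)))                           # 6
print(len(split_into_records(move_column_to_front(rows, 2))))  # 2
```

With the project name in front, each project's rows hang off a single
key cell, which is the two-record outcome you want.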

> PS what do the red bars and (inactive) triangles in the headers mean?

The bars spanning the tops of columns signify column groups
(sub-records more or less). There's an implicit column group spanning
all columns which isn't shown since it exists for all projects. The
triangles, I suspect, are a forward-looking artefact meant to hold a
column group menu.

Tom