Record/column groups bug

Tom Morris

Oct 27, 2011, 9:53:17 AM

to google-refine-dev

We have a long-standing XML import bug that I've been investigating
with an eye towards getting it fixed for the 2.5 release.
http://code.google.com/p/google-refine/issues/detail?id=137

I think I've figured out what the problem is, but could use some feedback
from someone (Iain? David?) who understands how column groups are
intended to be used, either in the context of the XML importer or in
general.

The problem is that the column groups for the following XML fragment
are getting computed incorrectly:

<sponsors>
<lead_sponsor>
<agency>National Center for Research Resources (NCRR)</agency>
<agency_class>NIH</agency_class>
</lead_sponsor>
<collaborator>
<agency>HRSA/Maternal and Child Health Bureau</agency>
<agency_class>U.S. Fed</agency_class>
</collaborator>
</sponsors>

Refine is currently computing three column groups from this:
lead_sponsor (2 columns), collaborator (2 columns), and sponsors (all 4
columns). This last group is triggering incorrect row dependencies
when there is no collaborator element in the source record.

Visually the column groups look like this - note the (re)ordering of
the columns:

2222333344445555 - Col #
SSSSSSSSSSSSSSSS - outer/top Sponsor column group
CCCCccccLLLLllll - Lead/Collaborator column group

[Use a fixed width font, or squint a little, to get this to line up]

The three groups are group S (2-5), group Cc (2-3), and group Ll
(4-5). Note that the columns have been reordered with the Collaborator
columns before the Lead columns. Group S is the problematic one.
Since the Collaborator sponsors are optional, that means that the
"key" column #2 can be blank for a top level record, causing it to get
merged with the previous record (the algorithm searches back for a row
with a non-blank "key" cell if any of the cells in the column group
are non-blank).
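
To make the failure concrete, here is a minimal Python sketch of the back-search rule as described above. It is not taken from the Refine source (which is Java); the function name, the (key, first, last) tuple layout, the 0-based column indices and the cell values are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple

Row = List[Optional[str]]
# Hypothetical layout: (key column, first column, last column), inclusive.
Group = Tuple[int, int, int]


def is_blank(cell: Optional[str]) -> bool:
    return cell is None or cell == ""


def context_rows(rows: List[Row], groups: List[Group]) -> List[Optional[int]]:
    """For each row, return the earlier row it is judged to depend on, or None."""
    result: List[Optional[int]] = []
    for r, row in enumerate(rows):
        parent: Optional[int] = None
        for key, first, last in groups:
            has_data = any(not is_blank(row[c]) for c in range(first, last + 1))
            if has_data and is_blank(row[key]):
                # Search backwards for the nearest row whose key cell is filled in.
                for prev in range(r - 1, -1, -1):
                    if not is_blank(rows[prev][key]):
                        parent = prev if parent is None else min(parent, prev)
                        break
        result.append(parent)
    return result


# Columns 0-1 hold the collaborator cells, columns 2-3 the lead sponsor cells.
GROUP_S = (0, 0, 3)  # outer group, keyed on the optional collaborator column
GROUP_C = (0, 0, 1)
GROUP_L = (2, 2, 3)

rows = [
    ["collab A", "U.S. Fed", "lead A", "NIH"],  # record with a collaborator
    [None, None, "lead B", "Industry"],         # separate record, no collaborator
]

with_outer = context_rows(rows, [GROUP_S, GROUP_C, GROUP_L])
without_outer = context_rows(rows, [GROUP_C, GROUP_L])
```

With group S included, the second row is reported as depending on row 0 and would be merged into the previous record; with only the two inner groups it stands alone.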

Group S is a real column group (consisting solely of other column
groups), but since we don't appear to use column groups for anything,
I'm not sure what value it has. I'm tempted to say that column groups
which consist of nothing but other groups, without any individual
ungrouped columns of their own should be eliminated.
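
A rough Python sketch of that pruning idea, assuming a hypothetical ColumnGroup structure that records its column span and its subgroups (the real Refine classes may look quite different):

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ColumnGroup:
    # Hypothetical structure for illustration only.
    name: str
    first: int  # first column covered, inclusive
    last: int   # last column covered, inclusive
    subgroups: List["ColumnGroup"] = field(default_factory=list)


def has_own_columns(group: ColumnGroup) -> bool:
    """True if at least one column in the group's span is in no subgroup."""
    covered = set()
    for sub in group.subgroups:
        covered.update(range(sub.first, sub.last + 1))
    return any(c not in covered for c in range(group.first, group.last + 1))


def groups_for_dependency_analysis(groups: List[ColumnGroup]) -> List[ColumnGroup]:
    """Flatten the hierarchy, dropping groups made up solely of other groups."""
    kept: List[ColumnGroup] = []
    for group in groups:
        if has_own_columns(group):
            kept.append(group)
        kept.extend(groups_for_dependency_analysis(group.subgroups))
    return kept


collaborator = ColumnGroup("C", 2, 3)
lead = ColumnGroup("L", 4, 5)
sponsors = ColumnGroup("S", 2, 5, [collaborator, lead])

kept_names = [g.name for g in groups_for_dependency_analysis([sponsors])]
```

Here group S is dropped because columns 2-5 are fully covered by its two subgroups, while C and L are kept.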

I'm not sure why we can't compute row->record dependencies directly
since we have that information from the parse. We know what the path
to the top level element is and, by definition, all rows that we
create during the parse of its children are part of the same record.
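
As a sketch of what I mean, the following Python tags every emitted row with the index of the record element it came from. The <studies>/<study> wrapper elements, the shortened agency values and the one-row-per-leaf flattening are assumptions for illustration; this is not how the Refine importer lays out rows.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: two records, the second without a collaborator.
XML = """<studies>
  <study><sponsors>
    <lead_sponsor><agency>A1</agency><agency_class>NIH</agency_class></lead_sponsor>
    <collaborator><agency>A2</agency><agency_class>U.S. Fed</agency_class></collaborator>
  </sponsors></study>
  <study><sponsors>
    <lead_sponsor><agency>A3</agency><agency_class>Industry</agency_class></lead_sponsor>
  </sponsors></study>
</studies>"""


def rows_with_record_index(xml_text, record_tag):
    """Emit (record index, tag, value) for every leaf element of every record."""
    root = ET.fromstring(xml_text)
    rows = []
    for record_index, record in enumerate(root.findall(record_tag)):
        for element in record.iter():
            is_leaf = len(element) == 0
            if is_leaf and element.text and element.text.strip():
                # The record index is known at parse time; no back-search needed.
                rows.append((record_index, element.tag, element.text.strip()))
    return rows


rows = rows_with_record_index(XML, "study")
record_indexes = [record_index for record_index, _, _ in rows]
```

Every row carries its record index from the moment it is created, so a missing collaborator cannot cause two records to be merged.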

I think there are two options here:

1. Eliminate column groups which consist solely of other column groups
from the dependency analysis

2. Compute row dependencies using a different method than column
groups (eg use the tree structure directly from the parse).

Opinions? Can anyone provide background on what column groups are
used for (or, perhaps, are intended to be used for in the future)?

Tom

David Huynh

Oct 27, 2011, 10:52:42 AM

to google-r...@googlegroups.com

Hi Tom,

I intended column groups to be useful when converting records into little graphs (for loading into Freebase). I also thought they would be useful in some transformations, where you can reference the groups of cells containing a particular cell. But so far there doesn't seem to be a strong demand for column groups, and accessing the whole record seems to be enough.

If the columns are moved around (into, out of column groups), or if rows get deleted, then we need to reconstruct the row dependencies, but we can't do that from the parse anymore. So option #2 is not feasible.

Option #1 is lossy ... Do you think that in TreeImportUtilities.java, line 97 or so, we can fix up any column group that contains only other column groups? We can assign it the same key column index as the key column index of its first sub group. What do you think?
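
Roughly, the fix-up could look like the Python sketch below. The ColumnGroup structure, the function name and the reading of "first sub group" as the leftmost one are assumptions for illustration; none of this is the actual code in TreeImportUtilities.java.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ColumnGroup:
    # Hypothetical structure for illustration only.
    first: int
    last: int
    key: Optional[int] = None  # key column index, None if not yet assigned
    subgroups: List["ColumnGroup"] = field(default_factory=list)


def assign_missing_keys(group: ColumnGroup) -> None:
    """Give a keyless container group the key column of its leftmost subgroup."""
    for sub in group.subgroups:
        assign_missing_keys(sub)
    if group.key is None and group.subgroups:
        leftmost = min(group.subgroups, key=lambda g: g.first)
        group.key = leftmost.key


collaborator = ColumnGroup(first=2, last=3, key=2)
lead = ColumnGroup(first=4, last=5, key=4)
sponsors = ColumnGroup(first=2, last=5, subgroups=[lead, collaborator])

assign_missing_keys(sponsors)
```

With the column layout from Tom's message, the outer group ends up keyed on column 2, the first collaborator column.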

David

Tom Morris

Oct 28, 2011, 1:48:44 PM

to google-r...@googlegroups.com

On Thu, Oct 27, 2011 at 10:52 AM, David Huynh <dfh...@gmail.com> wrote:

> Do you think that in TreeImportUtilities.java, line
> 97 or so, we can fix up any column group that contains only other column
> groups? We can assign it the same key column index as the key column index
> of its first sub group. What do you think?

That's actually what's being done already, I think. The problem is that I don't think we can guarantee that there will be a "key" column (ie one where the cell value is always filled in for the row which is the record root). Say we've got a group G with two subgroups g1 and g2, both of which are optional.

    GGGGGGGGGGG
    g1g1g g2g2g
x1  a1 b1
x2        c2 d2

I don't think there's a way to compare those four cells for dependencies based on an analysis of the data alone (which is what we're currently doing). The corresponding XML looks like this:

<record>
<x>x1</x>
<G>
    <a>a1</a>
    <b>b1</b>
</G>
</record>
<record>
<x>x2</x>
<G>
    <c>c2</c>
    <d>d2</d>
</G>
</record>

The real dependency is to a data-less node in the graph which doesn't have a representation in the spreadsheet.

I just fixed a bug in the column group sorting which helps Iain's particular example, but I'm still doubtful that we'll be able to get correct behavior in all cases by looking solely at the rectangular data. I think we may need to keep more schema information from the parse.

Tom
