I think I've figured out what the problem is, but I could use some
feedback from someone (Iain? David?) who understands how column groups
are intended to be used, either in the context of the XML importer or
in general.
The problem is that the column groups for the following XML fragment
are getting computed incorrectly:
<sponsors>
  <lead_sponsor>
    <agency>National Center for Research Resources (NCRR)</agency>
    <agency_class>NIH</agency_class>
  </lead_sponsor>
  <collaborator>
    <agency>HRSA/Maternal and Child Health Bureau</agency>
    <agency_class>U.S. Fed</agency_class>
  </collaborator>
</sponsors>
Refine is currently computing three column groups from this:
lead_sponsor (2 columns), collaborator (2 columns), and sponsors (all 4
columns). This last group triggers incorrect row dependencies when a
source record has no collaborator element.
Visually the column groups look like this - note the (re)ordering of
the columns:
2222333344445555 - Col #
SSSSSSSSSSSSSSSS - outer/top Sponsor column group
CCCCccccLLLLllll - Lead/Collaborator column group
[Use a fixed width font, or squint a little, to get this to line up]
The three groups are group S (columns 2-5), group Cc (columns 2-3),
and group Ll (columns 4-5). The Collaborator columns now come before
the Lead columns. Group S is the problematic one.
Since Collaborator sponsors are optional, the "key" column #2 can be
blank for a top-level record, causing that record to get merged with
the previous one (the algorithm searches back for a row with a
non-blank "key" cell if any of the cells in the column group are
non-blank).
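To make the failure concrete, here is a rough Python sketch of the search-back rule as described above. It is an illustration only, not Refine's actual code: the function name, the dict-per-row layout, and the "Another agency" row are all assumptions made for the example.

```python
def assign_record_starts(rows, key_col, group_cols):
    """For each row, return the index of the row treated as the start of its record.

    Simplified model: a row with a blank key cell but data elsewhere in the
    column group is attached to the nearest earlier row with a non-blank key.
    """
    starts = []
    for i, row in enumerate(rows):
        has_key = bool(row.get(key_col))
        has_group_data = any(row.get(c) for c in group_cols)
        start = i
        if not has_key and has_group_data:
            # Search back for a row whose key cell is non-blank.
            j = i - 1
            while j >= 0 and not rows[j].get(key_col):
                j -= 1
            if j >= 0:
                start = j
        starts.append(start)
    return starts


# Columns 2-3 hold the collaborator, columns 4-5 the lead sponsor.
rows = [
    {2: "HRSA/Maternal and Child Health Bureau", 3: "U.S. Fed",
     4: "National Center for Research Resources (NCRR)", 5: "NIH"},
    # Hypothetical second record with a lead sponsor but no collaborator.
    {4: "Another agency", 5: "NIH"},
]

# Group S (columns 2-5, key column 2): the second record is wrongly merged.
merged = assign_record_starts(rows, key_col=2, group_cols=[2, 3, 4, 5])    # [0, 0]
# Group Ll (columns 4-5, key column 4): the two records stay separate.
separate = assign_record_starts(rows, key_col=4, group_cols=[4, 5])        # [0, 1]
```

With group S the second row points back at row 0, which is the incorrect dependency; with group Ll alone each row starts its own record.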
Group S is a real column group (consisting solely of other column
groups), but since we don't appear to use column groups for anything,
I'm not sure what value it has. I'm tempted to say that column groups
which consist of nothing but other groups, without any individual
ungrouped columns of their own, should be eliminated.
I'm not sure why we can't compute row->record dependencies directly,
since we have that information from the parse. We know the path to
the top-level element and, by definition, all rows that we create
while parsing its children are part of the same record.
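A minimal sketch of that idea in Python, assuming the records sit under a known top-level element. The `studies`/`study` wrapper tags, the function name, and the "one row per element whose children are all leaves" rule are hypothetical stand-ins, not what the importer actually does.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<studies>
  <study>
    <sponsors>
      <lead_sponsor><agency>A</agency><agency_class>NIH</agency_class></lead_sponsor>
      <collaborator><agency>B</agency><agency_class>U.S. Fed</agency_class></collaborator>
    </sponsors>
  </study>
  <study>
    <sponsors>
      <lead_sponsor><agency>C</agency><agency_class>NIH</agency_class></lead_sponsor>
    </sponsors>
  </study>
</studies>
"""


def rows_with_record_ids(xml_text, record_tag):
    """Return (record_id, cells) pairs, tagging each row during the parse."""
    root = ET.fromstring(xml_text)
    rows = []
    for record_id, record in enumerate(root.iter(record_tag)):
        for elem in record.iter():
            children = list(elem)
            # Hypothetical row rule: one row per element whose children are all leaves.
            if children and all(len(child) == 0 for child in children):
                cells = {f"{elem.tag}/{child.tag}": (child.text or "").strip()
                         for child in children}
                rows.append((record_id, cells))
    return rows


rows = rows_with_record_ids(SAMPLE, "study")
record_ids = [record_id for record_id, _ in rows]   # [0, 0, 1]
```

The second study has no collaborator, yet its row is still tagged with record 1, because record membership comes from the enclosing element rather than from a blank "key" cell.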
I think there are two options here:
1. Eliminate column groups which consist solely of other column groups
from the dependency analysis.
2. Compute row dependencies using a method other than column groups
(e.g. use the tree structure directly from the parse).
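Option 1 could look roughly like the following Python sketch, which drops any group whose columns are all covered by groups nested inside it. The name-to-(first, last) column range mapping is an assumed representation chosen for the example, not Refine's data model.

```python
def groups_with_own_columns(groups):
    """Keep only groups that own at least one column not covered by a nested group.

    `groups` maps a group name to an inclusive (first_col, last_col) range.
    """
    kept = []
    for name, (start, end) in groups.items():
        covered = set()
        for other, (s, e) in groups.items():
            # A nested group lies within this group's range and is not identical to it.
            if other != name and start <= s and e <= end and (s, e) != (start, end):
                covered.update(range(s, e + 1))
        own_columns = set(range(start, end + 1)) - covered
        if own_columns:
            kept.append(name)
    return kept


groups = {"S": (2, 5), "Cc": (2, 3), "Ll": (4, 5)}
kept = groups_with_own_columns(groups)   # ["Cc", "Ll"]
```

Group S is dropped because columns 2-5 are fully covered by Cc and Ll, while a group with even one ungrouped column of its own would be kept.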
Opinions? Can anyone provide background on what column groups are
used for (or, perhaps, are intended to be used for in the future)?
Tom