Comment #5 on issue 137 by tfmorris: XML import merges records together
http://code.google.com/p/google-refine/issues/detail?id=137
I think I've figured out what the problem is here. The column groups for
this XML fragment are getting computed incorrectly:
<sponsors>
<lead_sponsor>
<agency>National Center for Research Resources (NCRR)</agency>
<agency_class>NIH</agency_class>
</lead_sponsor>
<collaborator>
<agency>HRSA/Maternal and Child Health Bureau</agency>
<agency_class>U.S. Fed</agency_class>
</collaborator>
</sponsors>
Refine is currently computing three column groups from this: lead_sponsor
(2 columns), collaborator (2 columns), and sponsors (all 4 columns). This
last group is triggering unnecessary row dependencies when there is no
collaborator element.
I'm tempted to say that column groups which consist of nothing but other
groups, without any individual ungrouped columns of their own, should be
eliminated, but it's a fairly critical piece of code, so I want to look at
it a little more closely.
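
To make the proposal above concrete, here is a rough sketch in Python. It is
not Refine's importer code: column_groups and prune_pure_containers are
hypothetical names, and the model assumes that every leaf element is a column
and every container element is a column group. Under those assumptions the
fragment yields the three groups described above, and pruning drops the outer
one because it has no ungrouped columns of its own.

```python
import xml.etree.ElementTree as ET

FRAGMENT = """
<sponsors>
  <lead_sponsor>
    <agency>National Center for Research Resources (NCRR)</agency>
    <agency_class>NIH</agency_class>
  </lead_sponsor>
  <collaborator>
    <agency>HRSA/Maternal and Child Health Bureau</agency>
    <agency_class>U.S. Fed</agency_class>
  </collaborator>
</sponsors>
"""


def column_groups(element, path=""):
    """Map each container element's path to (own_columns, all_columns).

    Hypothetical helper: a leaf element is treated as a column and a
    container element as a column group.
    """
    here = f"{path}/{element.tag}" if path else element.tag
    groups = {}
    own, everything = [], []
    for child in element:
        child_path = f"{here}/{child.tag}"
        if len(child) == 0:
            # Leaf element: an individual column owned by this group.
            own.append(child_path)
            everything.append(child_path)
        else:
            # Container element: a nested group whose columns we inherit.
            nested = column_groups(child, here)
            groups.update(nested)
            everything.extend(nested[child_path][1])
    groups[here] = (own, everything)
    return groups


def prune_pure_containers(groups):
    """Drop groups that consist of nothing but other groups."""
    return {name: cols for name, cols in groups.items() if cols[0]}


groups = column_groups(ET.fromstring(FRAGMENT.strip()))
for name, (own, everything) in groups.items():
    print(name, "-", len(everything), "columns,", len(own), "of its own")

kept = prune_pure_containers(groups)
print(sorted(kept))  # only the lead_sponsor and collaborator groups remain
```

Whether pruning like this is safe for every input is exactly the open
question; the sketch only shows the effect on this one fragment.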
p.s. I suspect the reason for the sluggish browser performance is that it
was trying to deal with the entire file at once since the monster second
record effectively disabled the paging.
Attachments:
xml_small3.xml 13.8 KB
Comment #7 on issue 137 by tfmorris: XML import merges records together
http://code.google.com/p/google-refine/issues/detail?id=137
This is fixed, at least well enough for this example, with r2347. The cell
data counts weren't getting updated before sorting the column groups,
causing them all to be zero, which meant data-rich columns weren't getting
placed where Refine needed them to be for the key column.
I think there's still an underlying issue where the current counting
strategy can get fooled into choosing the wrong column. For example, if
Column A is mandatory but never has more than a single value, and Column B
is optional but high frequency (e.g. present in only 50% of the records,
but with 3 values in each of those, giving it 1.5x as many cells as
Column A), then Column B will get chosen in preference to Column A.
There's also the case where no single column in a column group always has a
value (e.g. a column group where either column A OR column B is populated
for any given record).
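
The two cases above can be checked with a little arithmetic. The sketch below
is a simplified model, not Refine's code: pick_key_column and coverage are
hypothetical helpers, and "most cells wins" is assumed as a stand-in for the
counting strategy. Each record is a dict mapping a column name to its list of
values.

```python
def pick_key_column(records, columns):
    """Pick the column with the most cells (assumed stand-in for the counting strategy)."""
    counts = {c: sum(len(r.get(c, [])) for r in records) for c in columns}
    return max(columns, key=lambda c: counts[c]), counts


def coverage(records, column):
    """Fraction of records that have at least one value in the column."""
    return sum(1 for r in records if r.get(column)) / len(records)


# Case 1: A is mandatory and single-valued; B appears in half of the
# records, with 3 values each time.
records = []
for i in range(100):
    rec = {"A": [f"a{i}"]}
    if i % 2 == 0:
        rec["B"] = [f"b{i}-{n}" for n in range(3)]
    records.append(rec)

chosen, counts = pick_key_column(records, ["A", "B"])
print(counts)   # {'A': 100, 'B': 150}
print(chosen)   # B, even though only A is present in every record
print(coverage(records, "A"), coverage(records, "B"))  # 1.0 0.5

# Case 2: either A or B is populated, never both, so no column in the
# group covers every record.
either_or = [{"A": ["x"]} if i % 2 else {"B": ["y"]} for i in range(10)]
print(coverage(either_or, "A"), coverage(either_or, "B"))  # 0.5 0.5
```

Ranking by coverage instead of raw cell count would handle the first case
but not the second, where no single column is a reliable key.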
Comment #8 on issue 137 by tfmorris: XML import merges records together
http://code.google.com/p/google-refine/issues/detail?id=137
Reopening. The theoretical issue that I suspected has been confirmed in
the wild by a new XML file received from a commenter on issue 393, so we're
going to need a different approach to dealing with column groups.
Issue 393 has been merged into this issue.
Hello all,
Outside user, working to clean up someone else's messy server. After
importing three .csv files that total 5019 records in length, I delete the
'file' column as it is not needed. The record count then drops to
1333. I do not have any facets selected. When I attempt to click around and
discover the source of the problem, Firefox tells me that I have an
unresponsive script:
http://127.0.0.1:3333/project-bundle.js:7973
And asks me if I wish to continue or not. If I continue, it will
eventually ask me again. If I stop the script, it will crash the tab I am
running Refine in.
I converted my original .csv files into .tsv format, and reopened them in
Refine. This solved my issue.
Quirky little bug with a simple workaround for those of you out there in
Userland that are not quite up to speed in programming languages.
Since you're working with CSV, not XML, it's clearly not related to this
bug. Please feel free to post on the Google Refine mailing list if you'd like
assistance. It sounds like you probably had a mostly blank initial column
in one or more of your CSVs. You can either process it in "row" mode
(instead of "record" mode) or shuffle the columns around so it's not an
issue.
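
For anyone curious why a mostly blank first column shrinks the record count,
here is a rough sketch. It is a simplified model of "record" mode, not
Refine's code: group_into_records is a hypothetical helper that assumes a row
with a non-blank cell in the key (first) column starts a new record and rows
with a blank key cell attach to the record above. The sample rows are made up
for illustration.

```python
def group_into_records(rows, key_index=0):
    """Group rows into records keyed on one column (simplified model)."""
    records = []
    for row in rows:
        key = row[key_index]
        starts_record = key is not None and str(key).strip() != ""
        if starts_record or not records:
            records.append([row])      # non-blank key: start a new record
        else:
            records[-1].append(row)    # blank key: attach to previous record
    return records


# Made-up data: a fully populated 'file' column followed by a mostly
# blank column.
rows = [
    ["a.csv", "x", "1"],
    ["a.csv", "",  "2"],
    ["b.csv", "",  "3"],
    ["b.csv", "y", "4"],
    ["c.csv", "",  "5"],
]
print(len(group_into_records(rows)))     # 5: every row has a key

# Deleting the 'file' column makes the mostly blank column the key.
trimmed = [row[1:] for row in rows]
print(len(group_into_records(trimmed)))  # 2: blank-key rows were merged
```

In "row" mode no such grouping happens, which is why switching modes or
moving a fully populated column to the front avoids the merge.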