Re: Issue 137 in google-refine: XML import merges records together

google...@googlecode.com

14 Oct 2011, 14:49:29

to google-r...@googlegroups.com

Updates:
Summary: XML import merges records together
Labels: -Priority-Medium -Component-UI -Performance -Usability
-Milestone-2.x Priority-High Component-Logic Milestone-2.5

Comment #5 on issue 137 by tfmorris: XML import merges records together
http://code.google.com/p/google-refine/issues/detail?id=137

I think I've figured out what the problem is here. The column groups for
this XML fragment are getting computed incorrectly:

<sponsors>
  <lead_sponsor>
    <agency>National Center for Research Resources (NCRR)</agency>
    <agency_class>NIH</agency_class>
  </lead_sponsor>
  <collaborator>
    <agency>HRSA/Maternal and Child Health Bureau</agency>
    <agency_class>U.S. Fed</agency_class>
  </collaborator>
</sponsors>

Refine is currently computing three column groups from this: lead_sponsor
(2 columns), collaborator (2 columns), and sponsors (all 4 columns). This
last group is triggering unnecessary row dependencies when there is no
collaborator element.

I'm tempted to say that column groups which consist of nothing but other
groups, without any individual ungrouped columns of their own, should be
eliminated, but it's a fairly critical piece of code, so I want to look at
it a little more closely.
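
To make the pruning idea concrete, here is a minimal sketch in Python. It is
not Refine's implementation (Refine's importer is Java and its group
computation is more involved); the dict layout and the function names
build_groups, all_columns, flatten and prune are hypothetical, chosen only to
illustrate dropping groups that own no columns directly.

```python
import xml.etree.ElementTree as ET

FRAGMENT = """
<sponsors>
  <lead_sponsor>
    <agency>National Center for Research Resources (NCRR)</agency>
    <agency_class>NIH</agency_class>
  </lead_sponsor>
  <collaborator>
    <agency>HRSA/Maternal and Child Health Bureau</agency>
    <agency_class>U.S. Fed</agency_class>
  </collaborator>
</sponsors>
"""

def build_groups(element, path=""):
    """Build a hypothetical group record: name, own leaf columns, subgroups."""
    name = f"{path}/{element.tag}" if path else element.tag
    group = {"name": name, "own_columns": [], "subgroups": []}
    for child in element:
        if len(child) == 0:
            # Leaf element: a column owned directly by this group.
            group["own_columns"].append(f"{name}/{child.tag}")
        else:
            # Nested element: a subgroup with its own columns.
            group["subgroups"].append(build_groups(child, name))
    return group

def all_columns(group):
    """All columns a group spans, including those of its subgroups."""
    cols = list(group["own_columns"])
    for sub in group["subgroups"]:
        cols.extend(all_columns(sub))
    return cols

def flatten(group):
    """Yield a group followed by all of its nested subgroups."""
    yield group
    for sub in group["subgroups"]:
        yield from flatten(sub)

def prune(groups):
    """Assumed heuristic: drop groups that only wrap other groups."""
    return [g for g in groups if g["own_columns"]]

root = build_groups(ET.fromstring(FRAGMENT.strip()))
groups = list(flatten(root))
spans = {g["name"]: len(all_columns(g)) for g in groups}
kept = [g["name"] for g in prune(groups)]

print(spans)  # sponsors spans 4 columns, each child group spans 2
print(kept)   # only the two child groups survive pruning
```

With this fragment, the outer sponsors group owns no columns of its own, so
the heuristic removes it and leaves lead_sponsor and collaborator.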

p.s. I suspect the reason for the sluggish browser performance is that it
was trying to deal with the entire file at once since the monster second
record effectively disabled the paging.

Attachments:
xml_small3.xml 13.8 KB

google...@googlecode.com

28 Oct 2011, 15:57:37

to google-r...@googlegroups.com

Updates:
Status: Fixed

Comment #7 on issue 137 by tfmorris: XML import merges records together
http://code.google.com/p/google-refine/issues/detail?id=137

This is fixed, at least well enough for this example, with r2347. The cell
data counts weren't getting updated before sorting the column groups,
causing them all to be zero, which meant data-rich columns weren't getting
placed where Refine needed them to be for the key column.

I think there's still an underlying issue where the current counting
strategy can get fooled into choosing the wrong column. For example, if
Column A is mandatory but never has more than a single value and Column B
values are optional but high frequency (e.g. present in 50% of the records,
but each of those records has 3 values, giving it 1.5x as many cells as
column A), then column B will get chosen in preference to column A.
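
The arithmetic can be checked with a small Python sketch. The data is made
up, and "pick the column with the most cells" is only a simplified stand-in
for the counting strategy described above; the coverage-based alternative is
an assumption shown for contrast, not something Refine is known to do.

```python
# Hypothetical data: 10 records. Column A has exactly one value per record;
# column B appears in half of the records, with three values each time.
records = []
for i in range(10):
    rec = {"A": [f"a{i}"], "B": []}
    if i % 2 == 0:
        rec["B"] = [f"b{i}-{j}" for j in range(3)]
    records.append(rec)

columns = ["A", "B"]

# Strategy 1 (simplified stand-in): total number of cells per column.
cell_counts = {c: sum(len(r[c]) for r in records) for c in columns}

# Strategy 2 (assumed alternative): number of records with at least one value.
coverage = {c: sum(1 for r in records if r[c]) for c in columns}

by_cells = max(columns, key=lambda c: cell_counts[c])
by_coverage = max(columns, key=lambda c: coverage[c])

print(cell_counts, by_cells)    # {'A': 10, 'B': 15} B
print(coverage, by_coverage)    # {'A': 10, 'B': 5} A
```

Column B wins on raw cell count (15 vs 10) even though it is missing from
half of the records, which is the failure mode described above.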

There's also the case where no single column in a column group always has a
value (e.g. a column group where either column A OR column B is populated
for any given record).
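
A tiny Python sketch of this either/or case, again with made-up data: every
record has a value somewhere in the group, yet no single column is populated
in every record, so no column qualifies as an always-present key.

```python
# Hypothetical records where exactly one of A or B is populated each time.
records = [
    {"A": ["x1"], "B": []},
    {"A": [], "B": ["y2"]},
    {"A": ["x3"], "B": []},
    {"A": [], "B": ["y4", "y5"]},
]
columns = ["A", "B"]

# Columns that have at least one value in every record.
full_coverage = [c for c in columns if all(r[c] for r in records)]

# Whether the group as a whole has a value in every record.
group_covered = all(any(r[c] for c in columns) for r in records)

print(full_coverage)  # []
print(group_covered)  # True
```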

google...@googlecode.com

6 Nov 2011, 16:15:49

to google-r...@googlegroups.com

Updates:
Status: Started

Comment #8 on issue 137 by tfmorris: XML import merges records together
http://code.google.com/p/google-refine/issues/detail?id=137

Reopening. The theoretical issue that I suspected has been confirmed in
the wild by a new XML file received from a commenter on issue 393, so we're
going to need a different approach to dealing with column groups.

google...@googlecode.com

6 Nov 2011, 16:30:54

to google-r...@googlegroups.com

Comment #9 on issue 137 by tfmorris: XML import merges records together
http://code.google.com/p/google-refine/issues/detail?id=137

Issue 393 has been merged into this issue.

google...@googlecode.com

13 Apr 2012, 10:53:41

to google-r...@googlegroups.com

Comment #11 on issue 137 by emilytgr...@gmail.com: XML import merges
records together
http://code.google.com/p/google-refine/issues/detail?id=137

Hello all,

Outside user, working to clean up someone else's messy server. After
importing three .csv files that total 5019 records in length, I delete the
'file' column as it is not needed. The record count then drops to
1333. I do not have any facets selected. When attempting to click around and
discover the source of the problem, Firefox tells me that I have an
unresponsive script:

http://127.0.0.1:3333/project-bundle.js:7973

And asks me if I wish to continue or not. If I continue, it will
eventually ask me again. If I stop the script, it will crash the tab I am
running Refine in.

I converted my original .csv files into .tsv format, and reopened them in
Refine. This solved my issue.

Quirky little bug with a simple workaround for those of you out there in
Userland that are not quite up to speed in programming languages.

google...@googlecode.com

13 Apr 2012, 12:59:01

to google-r...@googlegroups.com

Comment #14 on issue 137 by tfmorris: XML import merges records together
http://code.google.com/p/google-refine/issues/detail?id=137

Since you're working with CSV, not XML, it's clearly not related to this
bug. Please feel free to post on the Google Refine mailing list if you'd
like assistance. It sounds like you probably had a mostly blank initial
column in one or more of your CSVs. You can either process it in "row" mode
(instead of "record" mode) or shuffle the columns around so it's not an
issue.
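
To illustrate why a mostly blank first column shrinks the record count, here
is a simplified Python model. It assumes that, in record mode, a row whose
key (first) column cell is blank is attached to the record above it; the
sample data and the count_records helper are hypothetical and do not come
from Refine's code.

```python
import csv
import io

# Made-up CSV whose first column is mostly blank.
CSV_TEXT = """note,name,city
checked,Alice,Paris
,Bob,Lyon
,Carol,Nice
checked,Dave,Lille
,Eve,Metz
"""

def count_records(rows, key_index=0):
    """Count records, treating a blank key cell as 'same record as above'."""
    records = 0
    for row in rows:
        key = row[key_index].strip() if key_index < len(row) else ""
        # A non-blank key starts a new record; so does the very first row.
        if key or records == 0:
            records += 1
    return records

reader = csv.reader(io.StringIO(CSV_TEXT))
header = next(reader)
rows = list(reader)

print(len(rows))                          # 5 rows
print(count_records(rows))                # 2 records with 'note' as key
print(count_records(rows, key_index=1))   # 5 records with 'name' as key
```

Using key_index=1 stands in for shuffling the columns so that a fully
populated column comes first; "row" mode sidesteps the grouping entirely.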

google...@googlecode.com

11 Jun 2012, 07:25:11

to google-r...@googlegroups.com

Comment #15 on issue 137 by libo...@gmail.com: XML import merges records
together
http://code.google.com/p/google-refine/issues/detail?id=137

Could someone provide a pointer on how exactly the record detection works?
Is it import-time, or runtime? Where can I find the code responsible for
this?
When I provide a properly structured XML file, it merges some of the records
into one, but I can't see any features that distinguish those records from
the others... I'd like to look at the source of the problem, but currently
have no idea where to look.

I'm attaching the offending XML.
Thanks

Attachments:
gen-local-small.zip 35.8 KB

google...@googlecode.com

12 Jun 2012, 16:15:58

to google-r...@googlegroups.com

Comment #16 on issue 137 by tfmorris: XML import merges records together
http://code.google.com/p/google-refine/issues/detail?id=137

Clicking the rev above that I used to repair part of the problem (r2347)
will get you in the right ballpark. The XML and JSON importers are in
the "tree-shaped" importer family.

ImportColumnGroup is the class which manages the column groups that are
used to determine dependent rows/records. Anything that references it is
probably involved. TreeImportUtilities and XmlImportUtilities have methods
which are used with this.

Hope that helps get you started. Let us know on the dev list if you have
any questions.
