CSV import gone awry

Friesen, Darryl

Oct 25, 2017, 12:40:39 PM
to ica-ato...@googlegroups.com
One of our archives staff imported a CSV file (exported from Excel) that she thought had 415 records into our AtoM 2.4 test instance. She was excited to try a web import with a file of more than 100 records (which I gather was a limitation of AtoM pre-2.4?).
 
Excel, in its infinite wisdom, somehow exported not just the 415 lines of her spreadsheet that had data, but an additional 1,048,161 completely empty lines, yielding a CSV file with 1,048,576 records, most of which were simply this:
 
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
 
Unaware of the extra blank records, she uploaded this CSV file for import. We only noticed the problem this morning, when she realized the import was still running (after roughly 16 hours).
 
First, I’d like to suggest there might be a bug in AtoM. Rather than skipping these lines, which obviously contain no data, AtoM happily turned them into empty records/objects.
 
Which leads me into some questions:
 
What’s the proper way to terminate an import? I can see an active job in the web interface, but there’s no way to terminate it there. On the server I see a couple of symfony processes that are obviously responsible for the import; is it “safe” to just kill those? I needed to restart Percona for a config change to take effect, and doing that actually seemed to terminate the import (but it left the job listed as “Running” in the AtoM web interface when it is not).
 
Is there a way to remove that phantom “Running” job from the list of active jobs?
 
And lastly, what’s my best option for cleaning out the several hundred thousand empty objects that were created? I’m assuming the quickest method is a restore from backup? I’m doing nightly backups of the Percona databases (using something I found on GitHub: https://github.com/sixninetynine/surrogate), so that shouldn’t be a big deal; I’m just curious whether there’s some other method of undoing an import (I assume constructing SQL queries to “roll back”/undo an import would be more trouble than it’s worth?).
 
Thankfully this was all on a test server!
 
- Darryl
 
----------------------------------------------------------------------
Darryl Friesen, B.Sc., Programmer/Analyst    Darryl....@usask.ca
Library Systems & Information Technology,    http://library.usask.ca/
University of Saskatchewan Library
----------------------------------------------------------------------
"Go not to the Elves for counsel, for they will say both no and yes"
 
 

David Juhasz

Oct 25, 2017, 3:53:22 PM
to ica-ato...@googlegroups.com
Hi Darryl,

Wow! That's a lot of rows! :-O

To kill the job, you'll have to find the running worker process:
ps aux | grep symfony

Then "kill [pid]" (using the pid from above) any runaway processes using a lot of CPU.

Once the process is killed you can clear the job queue with (from the AtoM directory):
./symfony jobs:clear
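
If your deployment runs the job worker under systemd (assuming you named the service atom-worker, as in the installation docs; adjust to match your setup), you will also want to restart it afterwards so that new jobs get picked up:

sudo systemctl restart atom-worker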

Yes, the fastest way to roll back the import is to restore a backup.  You can try to delete the blank descriptions, but it's complicated because data is added to several MySQL tables for each imported row, e.g. object, information_object, information_object_i18n, object_term_relation, slug, status.
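
Just to illustrate what's involved, here is a rough, read-only sketch that counts descriptions with no title, assuming the blank rows came through with NULL titles (verify against your own data before attempting anything destructive):

mysql -u atom -p atom -e "SELECT COUNT(*) FROM information_object io JOIN information_object_i18n i18n ON i18n.id = io.id WHERE i18n.title IS NULL;"

Deleting them cleanly would mean removing the matching rows from each of the tables above, in the right order to satisfy the foreign keys, which is why a backup restore is usually simpler.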

I agree that it would be better to ignore blank rows during the import; I'll open a Redmine ticket to that effect.  As with any AtoM issue, we welcome and encourage community pull requests or funding for a fix.  We try to fix as many bugs as possible on our own time, but we have limited resources and it's a big project.  I'll post back to this thread with the Redmine ticket number.


Best regards,
David



--

David Juhasz
Director, AtoM Technical Services
Artefactual Systems Inc.
www.artefactual.com

Friesen, Darryl

Oct 26, 2017, 11:50:06 AM
to ica-ato...@googlegroups.com

> Wow! That's a lot of rows! 😮

Yah, Microsoft!  I’ve seen Excel add empty lines to an export before, but never a million.

> Once the process is killed you can clear the job queue with (from the AtoM directory):
> ./symfony jobs:clear

Thanks!  This was the piece I was missing.  I have a Bash script that clears the cache and restarts all the daemons; I noticed that when I did that, the import job re-appeared.  This helped!
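
For reference, mine is roughly this (run from the AtoM directory; the service names are from our install and may differ on yours):

php symfony cc
sudo systemctl restart php7.0-fpm atom-worker
sudo systemctl restart nginx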

 

> Yes, the fastest way to roll back the import is to restore a backup. You can try to delete the blank descriptions, but it's complicated because data is added to several MySQL tables for each imported row, e.g. object, information_object, information_object_i18n, object_term_relation, slug, status.

Ya, I assumed it would be a lot of work.  I restored from a previous nightly backup and all seems well.  I’m wondering about Elasticsearch, though: will the indexes contain data that doesn’t match the restored MySQL database?  Should I run search:populate (or any other commands) after a DB restore?

Thanks for the help, David.

- Darryl

 

Friesen, Darryl

Oct 26, 2017, 11:55:38 AM
to ica-ato...@googlegroups.com

Thanks, David.

If I were to look at this and submit a PR, would you want it against the qa/2.5.x branch or stable/2.4.x?

- Darryl


David Juhasz

Oct 26, 2017, 12:18:54 PM
to ica-ato...@googlegroups.com
Hi Darryl,

Yes, after restoring the backup you should run "search:populate" to sync the search index with the data in the database.
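
From the AtoM directory, that's:

php symfony cc
php symfony search:populate

(Clearing the Symfony cache first is a good habit after a restore; the populate task rebuilds the Elasticsearch index from MySQL.)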

Please submit any PRs against qa/2.5.x.  We will backport most bug fixes to stable/2.4.x after the PR is merged.  The only case where we would not backport a bug fix to the current stable branch is when we have concerns about it introducing regressions, but I can't see that being a risk in this case.

Best regards,
David

--

David Juhasz
Director, AtoM Technical Services
Artefactual Systems Inc.
www.artefactual.com


GR Mulcaster

Nov 1, 2017, 10:18:29 PM
to AtoM Users
Hi Darryl

I'd be interested to know whether your CSV imports have worked subsequently.

Was there an issue with the UTF-8 encoding, or an absence of "\n" characters at the ends of lines, that may have triggered the massive upload? We are about to start uploading large numbers of bulk Archival Descriptions using CSV import.

Previously, we have used LibreOffice-generated CSVs encoded in UTF-8, or CSVs created in MS Excel for macOS and encoded in UTF-8. Normally the uploads take only a few seconds, as we use only a few dozen spreadsheet rows of Archival Descriptions at a time.

I had interpreted the user guidelines to mean that the 50- or 100-line CSV limits were there to prevent timeout errors.

We always import CSVs with qubitParentSlug as the first column, so that we are only populating child-level ADs (with "levelOfDescription" set to Item) under an already existing Collection-level AD; a simplified example is below.
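
For context, our import files look roughly like this (column names as in the standard archival description CSV template; the slug and title values here are just placeholders):

qubitParentSlug,title,levelOfDescription
example-collection-slug,Example item title,Item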

But now we are experimenting with CSVs encoded in UTF-8 and created in MS Excel for Windows, as most of our users have been given updated Office suites that can cope with UTF-8.



regards
Glenn Mulcaster


Librarian (Access and Discovery)
University of Tasmania Library

Phone: +61 3 6324 3061

Email: Glenn.M...@utas.edu.au

Mail: Locked Bag 1312, Launceston TAS 7250

Dan Gillean

Nov 2, 2017, 11:29:59 AM
to ICA-AtoM Users
Hi GR, 

You're correct that the recommendation to limit CSV imports to 50-100 rows was previously about timeouts. In 2.4, imports are handled by the job scheduler, so this is no longer such an issue. However, I would still recommend breaking up CSVs of more than 5,000 rows into multiple imports (see the sketch below) - or at the very least, disabling indexing during the import and then manually running the search:populate command-line task afterward.
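
For example, splitting a large CSV while keeping the header row on each chunk might look something like this (a sketch; the filenames and chunk size are placeholders):

head -n 1 big.csv > header.csv
tail -n +2 big.csv | split -l 5000 - chunk_
for f in chunk_*; do cat header.csv "$f" > "import_$f.csv"; done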

When records are indexed as the import proceeds, it tends to eat up your system's available memory much faster, and with a large enough CSV you can exhaust all available resources before the import completes (since I think the import tries to hold everything in memory as it progresses). I recently killed my Vagrant test instance trying to import a 13,000-row CSV via the user interface with indexing turned on. The Vagrant box likely has less memory assigned than your production install would, but still - it's a good reminder.

In Darryl's case, my suspicion is that Excel was the culprit. Microsoft has made it a bit easier to preserve UTF-8 encoding in CSVs in Office 2016, but generally, and definitely in earlier versions, Microsoft is notorious for using its own character encoding (WinLatin, I believe) and line endings, which can cause all sorts of problems. My hope is that if you are careful about line endings and character encodings, you should be able to avoid running into this issue; a few quick checks are sketched below.
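
Something like this can catch encoding problems and blank rows before you import (a sketch, assuming a Windows-1252 export and common Unix command-line tools; the filenames are examples):

file -i export.csv                                            # report the detected character set
iconv -f WINDOWS-1252 -t UTF-8 export.csv > export-utf8.csv   # re-encode if needed
dos2unix export-utf8.csv                                      # normalize Windows line endings
grep -cE '^,*$' export-utf8.csv                               # count all-blank rows like Darryl's
grep -vE '^,*$' export-utf8.csv > cleaned.csv                 # and drop them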

That said, our import should be smart enough not to create a bunch of blank records - which is why David created issue ticket #11646 - so we can review the current code and hopefully improve handling in a future version. 

In the meantime, I can't stress enough how important backups are when working with data imports, especially into a production system! I'm so glad that Darryl had backups in place and was trying out the import on a test installation first. Imagine what a nightmare this would be in a production system with no backups!

Cheers, 

Dan Gillean, MAS, MLIS
AtoM Program Manager
Artefactual Systems, Inc.
604-527-2056
@accesstomemory
