untitled records with no other metadata


sarah.le...@anu.edu.au

Dec 12, 2018, 2:15:36 AM
to AtoM Users
Hello

Recently a lot (thousands) of records with no metadata have appeared in our database. Here is an example (http://archivescollection.anu.edu.au/index.php/fx9p-ny7t-2az9). Has anyone else had this problem, and can anyone suggest what's causing it? We suspect it's related to uploading item descriptions using CSV. Perhaps there are too many rows in the CSV sheet? So my two questions are: can anyone suggest how we can stop it recurring, and is there a way to do a bulk delete of all the thousands of records with no metadata?

Thanks for your help.

Sarah
Sarah Lethbridge
Senior Archivist
ANU Archives (Noel Butlin Archives Centre, University Archives and Pacific Research Archives)
R.G.Menzies Building
2 Fellows Rd
The Australian National University
Acton ACT 2601

02 6125 5919
0436608587
www.archives.anu.edu.au

pierre

Dec 12, 2018, 5:13:54 AM
to AtoM Users
Hello,

Without a direct answer to your two questions, I would suggest an export and then re-import of your database to clear out the empty metadata.

"So my two questions are can anyone suggest how we can stop it recurring and is there a way to do a bulk delete of all the thousands of record with no metadata?" 

Automating the bulk delete of records without metadata seems easier to do with a text editor on the exported CSV file.

Dan Gillean

Dec 12, 2018, 10:21:39 AM
to ICA-AtoM Users
Hi Sarah, 

My first guess on seeing this is that you've used Microsoft Excel to prepare a CSV for import. Is this correct?

As noted in our documentation (here), AtoM expects a CSV that is UTF-8 encoded with Unix-style line endings. Unfortunately, Microsoft has had a tendency to ignore de facto standards and go its own way: by default, Excel sheets are saved with a custom character encoding (WinLatin) and use Windows-style line endings. These line ending characters are not typically visible in the spreadsheet (you would need to open the CSV in a text editor and enable the display of special characters). It seems that empty rows in an Excel spreadsheet are sometimes interpreted as rows to be imported, causing this issue.
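If you want to check what you've got, a rough PHP sketch like the following (illustrative only, not part of AtoM, and mb_detect_encoding is just a heuristic) can report a file's likely encoding and its line-ending style:

<?php
// check-csv.php - illustrative sketch only, not part of AtoM.
// Usage: php check-csv.php import.csv
$data = file_get_contents($argv[1]);

// mb_detect_encoding is a heuristic; treat the answer as a hint, not proof
$encoding = mb_detect_encoding($data, ['UTF-8', 'Windows-1252', 'ISO-8859-1'], true);
echo 'Detected encoding: ', ($encoding ?: 'unknown'), PHP_EOL;

// Count Windows-style (\r\n) vs Unix-style (\n) line endings
$crlf = substr_count($data, "\r\n");
$lf = substr_count($data, "\n") - $crlf;
echo "Windows (CRLF) line endings: $crlf", PHP_EOL;
echo "Unix (LF) line endings: $lf", PHP_EOL;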

It is technically possible to use Excel safely if you dig deep enough into the settings and customize some aspects (they don't make it easy), and I've found that editing a well-formed CSV created outside of Excel won't mess it up. But if you are preparing new records for import, I strongly recommend using something like LibreOffice Calc, which allows you to set the character encoding every time you open the file, and uses the correct line ending characters. It's also open source!

We have seen this happen to users before when importing Excel-prepared CSV files - which is unfortunate in general, but fortunate for you, because we've helped users resolve this in the past!

Please see this forum thread for a script you can use that will delete all empty descriptions: 
Please use it at your own discretion - while we've tested it, it comes with no guarantees - we will not be responsible if something goes wrong, and we STRONGLY recommend that you back up your database first, just in case! 

Regards, 

Dan Gillean, MAS, MLIS
AtoM Program Manager
Artefactual Systems, Inc.
604-527-2056
@accesstomemory



Karl Goetz

Dec 12, 2018, 6:00:59 PM
to ica-ato...@googlegroups.com
Hi Dan,
Looking at this issue and the associated documentation makes me wonder if there should be some feature requests opened in AtoM to actually handle these problems:

- AtoM to handle the line endings (the extra carriage return could be stripped by dos2unix or a PHP regex)
- AtoM to handle the encoding (PHP appears to have http://php.net/manual/en/function.utf8-encode.php or http://php.net/manual/en/function.mb-convert-encoding.php, which may be useful here? - see the rough sketch after this list)
- csv:check-import to report on these issues (though in fairness it does this indirectly already - if the reported row count isn't what was expected, it implies a problem).
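Something along these lines might be enough for the conversion step - a rough sketch only, and it assumes the source file is Windows-1252, which would need confirming first:

<?php
// Rough sketch of the conversion step - not AtoM code, and it assumes the
// source file is Windows-1252 (that would need confirming first).
$data = file_get_contents('import.csv');
$data = mb_convert_encoding($data, 'UTF-8', 'Windows-1252'); // re-encode to UTF-8
$data = preg_replace("/\r\n?/", "\n", $data);                // Windows/Mac -> Unix endings
file_put_contents('import_fixed.csv', $data);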

Would you be open to issues in that vein sitting in the backlog?

Karl.



-- 
Karl Goetz,  Senior Library Officer (Library Systems)
University of Tasmania, Private Bag 25, Hobart 7001
Available Tuesday, Wednesday, Thursday




Dan Gillean

Dec 13, 2018, 11:53:41 AM
to ICA-AtoM Users
Hi Karl, 

I love this idea! I agree that it would be nice to avoid these kinds of issues for our users - or at the very least, add more checks prior to beginning the import and better reporting when issues are found. 

I think handling the conversion automatically without a) encouraging a user to make backups first, or b) giving the user some control over whether or not the attempted conversion takes place, would potentially be risky. I've taken your suggestions and run with them a bit, with these concerns in mind, and captured them on a Wishlist ticket, here:
In my suggestions building on your proposal, I recommended the following: 

First, that reporting about character encoding, line endings, and separator characters used in the CSV be added to the output of the check-import task.

Second, that the csv:check-import task include a --fix option that attempts to clean up unexpected line endings and character encodings, and possibly convert different separator characters to AtoM's expected ones as well (i.e. commas). I would suggest that when run, the task output the converted CSV in the same location as the original, with "_fixed" appended to the filename. Users could then choose to re-run the check task against the fixed version to see an updated output and confirm that the conversion was successful, or proceed directly with the import (a rough sketch of this idea follows a bit further below).

Third, that checks for character encoding and line endings be incorporated into the CSV import task. If AtoM's current expectations for these are not found, then by default the import is halted and an error message outlining the issue (e.g. "CSV is not UTF-8 encoded" etc) is provided.

Fourth, that a --fix option be included in the csv:import task. When used, the user is first prompted with a warning and asked if they have made a backup first (y/n must be entered; we could possibly allow this to be skipped if a --force option is included as well, for scripting purposes, but only from the CLI). When yes is selected, AtoM will attempt to fix any encoding/line ending/separator issues prior to proceeding with the import.

In the user interface, this could be a checkbox available to administrators that says "Fix CSV issues on import" that is selected when configuring the CSV import. When checked, it could immediately trigger a warning modal that appears, encouraging users to make sure they have a backup first (note: the idea of adding functionality to allow administrators to automatically generate a SQL database dump backup on import, store it temporarily, and possibly even load the backup automatically if the import fails has also been discussed, and would pair well with this feature).
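To make the second suggestion a bit more concrete, the --fix step might look something like this internally - purely a sketch of the idea, not actual AtoM code; the function name and the assumed Windows-1252 source encoding are hypothetical:

<?php
// Hypothetical sketch of the proposed --fix behaviour - not actual AtoM code;
// the function name and the assumed Windows-1252 source encoding are illustrative.
function fixCsv(string $path): string
{
    $data = file_get_contents($path);

    // Re-encode to UTF-8 only if the file doesn't already validate as UTF-8
    if (!mb_check_encoding($data, 'UTF-8')) {
        $data = mb_convert_encoding($data, 'UTF-8', 'Windows-1252');
    }

    // Normalize Windows/Mac line endings to Unix (separator conversion omitted)
    $data = preg_replace("/\r\n?/", "\n", $data);

    // Write alongside the original, with "_fixed" appended to the filename
    $info = pathinfo($path);
    $fixed = $info['dirname'].'/'.$info['filename'].'_fixed.'.$info['extension'];
    file_put_contents($fixed, $data);

    return $fixed;
}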

Further thoughts or alternatives are welcome - I'll gladly add revisions to the wish list ticket!  

Cheers, 

Dan Gillean, MAS, MLIS
AtoM Program Manager
Artefactual Systems, Inc.
604-527-2056
@accesstomemory


Karl Goetz

Dec 13, 2018, 7:46:32 PM
to ica-ato...@googlegroups.com
Hi Dan,
Thanks for opening that issue!

I’m looking at our dev install's CSV import page (index.php/object/importSelect?type=csv) and trying to envisage how the description applies.

I am guessing a third section would be added between the “Import options” and “Select file” areas, called something like “Data validation”, in which these options would sit?

It would contain the “Fix CSV issues on import” option you mentioned, and potentially, in future, a spot for the disposable backup you mentioned too. Doing that allows you to defer the dialog until after all parts of the form are completed, so you can see “oh, the user wants to fix things but hasn’t added a valid backup location, we’d better prompt”, and may help reduce friction.

Karl.



Dan Gillean

Dec 14, 2018, 10:30:34 AM
to ICA-AtoM Users
Hi Karl, 

Yes, I like your proposal. Note that it would also be possible to configure the page so that, if no backup location has been defined in a config file by a system administrator, the option to generate a backup prior to import does not appear on the import page. That might also be an elegant way to avoid adding unnecessary/confusing options that are not available to the user.

Cheers, 

Dan Gillean, MAS, MLIS

GS

Feb 24, 2019, 10:15:44 PM
to AtoM Users
Hi Dan,

We have been running the script from the forum thread you linked to clear out archivescollection. The script appears to be slow, taking about 4-7 seconds per deletion (single process).
Any suggestions on how to speed this process up?

Regards,

George

Dan Gillean

Feb 25, 2019, 11:03:40 AM
to ICA-AtoM Users
Hi George, 

The only suggestion I can think of without having to do a lot of analysis, testing, and new script development would be to simplify the criteria. Right now, lines 40-45 in the script include a number of criteria, all of which must be met for the information object (AKA archival description) to be deleted: 
  • $criteria->add(QubitObject::ID, 1, Criteria::NOT_EQUAL);
    • I would not remove this one - this is a check to make sure that the information object id is not the root object (to which all descriptions are associated). If you accidentally delete the root information object? Well, let's just hope you've made backups first...
  • $criteria->add(QubitInformationObject::IDENTIFIER, null, Criteria::ISNULL);
    • This checks to see if the identifier field is empty
  • $criteria->add(QubitInformationObjectI18n::TITLE, null, Criteria::ISNULL);
    • This checks to see if the title field is empty
  • $criteria->add(QubitInformationObject::LEVEL_OF_DESCRIPTION_ID, null, Criteria::ISNULL);
    • This checks to see if the level of description field is empty
  • $criteria->add(QubitInformationObject::REPOSITORY_ID, null, Criteria::ISNULL);
    • This checks to see if there is a related repository
  • $criteria->add(QubitInformationObject::PARENT_ID, 1);
    • This ensures that the description is a top-level description - i.e. its only parent is the root information object
So we have 6 criteria that need to be checked for each description: 
  • Must not be the root information object
  • Identifier must be blank
  • Title must be blank
  • Level of description must be blank
  • There must be no repository linked to the description
  • Must be a top-level description
If you look at your data, you might find that you need fewer criteria. For example, if you can verify that every description you want to keep meets 2 of these criteria (e.g. has a level of description and a title), then you could consider removing some of the others (such as the link to a repository, or the identifier). Hopefully that might reduce the time spent on each information object.
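For example, a trimmed-down version of that criteria block might look like this - a sketch only, reusing the constants from the original script, with the rest of the script staying as-is. Verify against your data first that every record you want to keep fails at least one of these checks:

// A trimmed-down version of the criteria block - sketch only; the rest of the
// script stays as it is. Keeps both safety checks, plus the blank-title check.
$criteria = new Criteria;
$criteria->add(QubitObject::ID, 1, Criteria::NOT_EQUAL);                   // never the root object
$criteria->add(QubitInformationObjectI18n::TITLE, null, Criteria::ISNULL); // title is blank
$criteria->add(QubitInformationObject::PARENT_ID, 1);                      // top-level descriptions only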

I would strongly recommend you leave the first criterion (must not be the root object) and, unless you have a compelling reason to remove it, the last criterion (must be a top-level description) in place. These should help avoid accidental deletions that could significantly impact your site. In any case, please make sure you make a backup of your database first!

The other thing that would speed this up significantly, but which is beyond the level of support we can provide via the forum, would be to use SQL queries to delete the objects directly, bypassing the Propel ORM which is currently handling the deletion. The advantage of using the ORM is that the deletion will cascade through all the related tables, making sure you aren't leaving partial data in the database. I am not personally able to tell you all the tables involved, but we do have entity relationship diagrams available on our wiki for reference.
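Purely to illustrate the idea (and deliberately incomplete - the table names below are assumptions to confirm against the ER diagrams, and deleting from only these tables would leave orphaned rows):

<?php
// Illustration of the raw-SQL idea only - deliberately incomplete. The table
// names are assumptions to check against the ER diagrams, and deleting from
// just these tables WILL leave orphaned rows. Back up your database first.
$id = 12345; // example information object id to delete
$pdo = new PDO('mysql:host=localhost;dbname=atom', 'user', 'password');
$pdo->beginTransaction();
foreach (['information_object_i18n', 'information_object'] as $table) {
    $pdo->prepare("DELETE FROM $table WHERE id = ?")->execute([$id]);
}
// ...every other related table (slug, object, etc.) would need the same...
$pdo->commit();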

Cheers, 

Dan Gillean, MAS, MLIS
AtoM Program Manager
Artefactual Systems, Inc.
604-527-2056
@accesstomemory
