CSV import: problem with titles and descriptions with punctuation

165 views
Skip to first unread message

Eleanor Mckenzie

unread,
Feb 10, 2023, 10:58:18 AM2/10/23
to AtoM Users
Hi everyone, 

When I import a CSV file into AtoM it cuts-off any titles and descriptions after punctuation. This includes umlauts and accents on letters, colons and quotation marks.  I have put an example screenshot below: 

atom question.png

For context, the title should have been "Travel memorabilia from Münster" and you can see the same problem has occurred in the scope and content field as well. 

I have tried formatting the spreadsheet different ways, such as text, custom, but this didn't resolve the issue. AtoM has no problem with punctuation when cataloguing directly into AtoM, but the collection is too large to catalogue that way. Does anyone know how to fix this in the import? 

Any help would be very much appreciated. 

Thanks!
Eleanor

Dan Gillean

unread,
Feb 10, 2023, 11:51:22 AM2/10/23
to ica-ato...@googlegroups.com
Hi Eleanor, 

Are you by chance using Microsoft Excel to prep your spreadsheet? And/or, have you copied the contents of the CSV from somewhere else, like Word doc?

I strongly suspect that the issue is in the character encoding and line endings used in your CSV. AtoM expects CSVs to use UTF-8 character encoding, and unix-style line ending characters. More info on this: 
The other related possibility is that your CSV is not in fact currently a CSV (i.e. values separated by commas), but that some other default separator character is being used. Once again, LibreOffice allows you to configure this and preview the outcome - here's a user forum thread where I show an example of resolving this for a user: 
Basically, if your  CSV is well-formed, is using the expected separators, and has the right character encoding, then this shouldn't happen. I would start by investigating these things. 

Note as well that the 2.7 release now includes a CSV validator, which can help identify problems BEFORE you run an import. See: 
Cheers, 

Dan Gillean, MAS, MLIS
AtoM Program Manager
Artefactual Systems, Inc.
604-527-2056
@accesstomemory
he / him


--
You received this message because you are subscribed to the Google Groups "AtoM Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ica-atom-user...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/ica-atom-users/8d27b814-df0b-49c6-8cee-1ac9c3a64ed4n%40googlegroups.com.

Eleanor Mckenzie

unread,
Feb 24, 2023, 9:28:14 AM2/24/23
to AtoM Users

Hi Dan,

Thank you for your reply.

I was using Microsoft Excel to prepare my import spreadsheet. Unfortunately, we are unable to use other software due to IT policies.

My spreadsheet was saved in a CSV format and I had followed the linked guidance in the Artefactual documentation to change the encoding to UTF-8 using the ‘tools’ menu (this is the article linked in the documentation: How to Save CSV in UTF-8 | It Still Works).

I eventually got the import to work by selecting the CSV UTF-8 file format instead of the CSV file format. I still used the tools menu to set web encoding to UTF-8 as per the documentation online.

atom question 2.png

atom question 3.png

Using a CSV UTF-8 format resolved the issue and the spreadsheet imported, punctuation and all, when I used this format. This format isn’t specified in the Artefactual documentation (as far as I could see) and when speaking to other archivist they have mentioned that they use CSV rather than CSV UTF-8 and haven’t encountered issues. Do you know why I have had to use the CSV UTF-8 file format for the import to work?

Thanks for your help!

Best wishes,
Eleanor 

Dan Gillean

unread,
Feb 24, 2023, 10:00:27 AM2/24/23
to ica-ato...@googlegroups.com
Hi Eleanor, 

I'm glad to hear that you were able to sort out the issue. I will try to answer your more general questions about why this was necessary:

UTF-8 is the default encoding used for web content and mail applications and has been for at least a decade. If you would like to learn more about what UTF-8 is and why it's important, here's a pretty good 101 primer:
While the RFC defining the CSV format does not specify a required character set, it does state that " Common usage of CSV is US-ASCII" - UTF-8 is the most widely supported extension of ASCII, as it was built to be backwards compatible with ASCII where the first 127 characters overlap. 

AtoM is a web-based application that follows existing standards wherever possible. To ensure that multilingual content can be displayed properly in AtoM, we require UTF-8 encoding - i.e. we use a well-established open de facto standard. 

These days most other spreadsheet applications ALSO use UTF-8 as the default encoding for saving CSVs, including both LibreOffice Calc and Google Sheets. It is Microsoft that is the outlier here - for many years they used their own custom character set (Windows-1252 for a typical EN-US installation), and I believe that by default they still do. 

In previous years, this "CSV UTF-8" save option wasn't even present in Excel - unless you went very deep into Excel's settings, it was very difficult to save a CSV with UTF-8 character encoding. I believe that this newer Save option wasn't added until Excel 2016, but I could be wrong about this. In any case, the fact that Excel has a Save option for CSV (which would use Windows-1252 encoding) and a separate CSV UTF-8 Save option is entirely a quirk of Excel, and a prime example of Microsoft refusing to update and use existing open standards - and exactly why we do not recommend using Excel with AtoM. 

Regarding your comment: 

This format isn’t specified in the Artefactual documentation (as far as I could see)

I linked to the relevant section last time - here are the links again: 
The fact that it doesn't say explicitly "Save your CSV file as CSV UTF-8, not just CSV" or something similar is because of what I explained above - a CSV not using UTF-8 by default is usually only a Microsoft problem these days. In general, the AtoM documentation does not overly concern itself with what external applications (like spreadsheet applications) you are using - but it does list what AtoM needs. In this case, the documentation states clearly that the CSV should use UTF-8 encoding and unix-style line ending characters. 
 
and when speaking to other archivist they have mentioned that they use CSV rather than CSV UTF-8 and haven’t encountered issues. Do you know why I have had to use the CSV UTF-8 file format for the import to work?

Either these archivists have been lucky, or are saving as UTF-8 by default and not realizing it, or aren't using Excel. 

Long story short: If the CSV is not UTF-8, there will be problems with special characters when it is imported into AtoM. How you get a CSV prepared that uses UTF-8 encoding is up to you! 

In any case, I'm glad you've resolved the import issue!

Cheers, 

Dan Gillean, MAS, MLIS
AtoM Program Manager
Artefactual Systems, Inc.
604-527-2056
@accesstomemory
he / him

Reply all
Reply to author
Forward
0 new messages