Elizabeth McManus

Apr 26, 2017, 11:49:20 AM
to AtoM Users
Hello,

We have special characters (e.g. ʔ ə θ) that are turned into question marks when I export my Excel sheet to CSV with UTF-8 encoding. Apparently UTF-16 encoding does support these characters, but Excel doesn't have it. I think perhaps OpenOffice does. Would AtoM support UTF-16 encoding?

Thanks
Elizabeth

Dan Gillean

Apr 26, 2017, 12:27:35 PM
to ICA-AtoM Users
Hi Elizabeth,

AtoM uses UTF-8 encoding throughout. This is especially important for CSV imports.
I admit I'm a bit suspicious of the encoding transformations of Windows products - see, for example, this excerpt from the Wikipedia page for UTF-8:

Many Windows programs (including Windows Notepad) add the bytes 0xEF, 0xBB, 0xBF at the start of any document saved as UTF-8. This is the UTF-8 encoding of the Unicode byte order mark (BOM), and is commonly referred to as a UTF-8 BOM, even though it is not relevant to byte order. A BOM can also appear if another encoding with a BOM is translated to UTF-8 without stripping it. Software that is not aware of multibyte encodings will display the BOM as three garbage characters at the start of the document, e.g. "ï»¿" in software interpreting the document as ISO 8859-1 or Windows-1252 or "∩╗┐" if interpreted as code page 437, a default for certain older Windows console applications.

The Unicode Standard neither requires nor recommends the use of the BOM for UTF-8, but does allow the character to be at the start of a file.[34] The presence of the UTF-8 BOM may cause problems with existing software that could otherwise handle UTF-8, for example:

  • Programming language parsers not explicitly designed for UTF-8 can often handle UTF-8 in string constants and comments, but cannot parse the BOM at the start of the file.
  • Programs that identify file types by leading characters may fail to identify the file if a BOM is present even if the user of the file could skip the BOM. An example is the Unix shebang syntax. Another example is Internet Explorer which will render pages in standards mode only when it starts with a document type declaration.
  • Programs that insert information at the start of a file will break use of the BOM to identify UTF-8 (one example is offline browsers that add the originating URL to the start of the file).
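
As a rough illustration of the excerpt above, here is a minimal Python sketch that detects and strips a leading UTF-8 BOM from raw file bytes. The helper name `strip_utf8_bom` and the sample column name `title` are made up for this example; nothing here is taken from AtoM itself.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (EF BB BF), if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# Hypothetical CSV header row, saved the way Notepad would save it (with a BOM).
sample = codecs.BOM_UTF8 + "title\n".encode("utf-8")
cleaned = strip_utf8_bom(sample)

# How the three BOM bytes look when misread in the legacy encodings
# mentioned in the excerpt.
bom_as_cp1252 = codecs.BOM_UTF8.decode("cp1252")
bom_as_cp437 = codecs.BOM_UTF8.decode("cp437")
```

When reading text rather than bytes, Python's built-in `utf-8-sig` codec does the same job: it decodes UTF-8 and silently drops a BOM if one is there.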

UTF-8 is quite comprehensive, so I wouldn't be surprised if the issue is with Excel's conversion. I would suggest trying LibreOffice Calc with UTF-8 settings to see if your characters are in fact supported and display and import properly when handled there. 
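
A quick way to check this outside of any spreadsheet program is sketched below in Python. It shows that the three characters from the original question round-trip through UTF-8 without loss. The `cp1252` step is only an assumption about where the question marks might come from (a lossy pass through a legacy Windows code page); the thread does not confirm what Excel actually does internally, and the column name `title` is just a placeholder.

```python
import csv
import io

special = "ʔ ə θ"

# UTF-8 can represent all three characters; each one takes two bytes.
utf8_bytes = special.encode("utf-8")
round_trip = utf8_bytes.decode("utf-8")

# Assumption for illustration: a conversion through a legacy code page
# that lacks these characters replaces each of them with "?".
lossy = special.encode("cp1252", errors="replace").decode("cp1252")

# Writing a small CSV in memory and encoding it as UTF-8 keeps the characters.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["title"])   # hypothetical column name
writer.writerow([special])
csv_bytes = buffer.getvalue().encode("utf-8")
```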

I don't really know what would happen if you tried to import a UTF-16 encoded CSV into AtoM, but I suspect it will misinterpret many characters, as they use different encoding schemes. If you do try it, consider making a backup first so you can roll back your database if needed!
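
To make the mismatch concrete, here is a small Python sketch of what a strict UTF-8 decoder does with UTF-16 bytes, and how a UTF-16 file could be re-encoded as UTF-8 before import. This says nothing about how AtoM itself would react; the sample text is invented for the example.

```python
text = "title\nʔ ə θ\n"  # invented sample content

# Python's "utf-16" codec prepends a byte order mark (FF FE or FE FF).
utf16_bytes = text.encode("utf-16")

# Neither BOM byte is valid in UTF-8, so a strict UTF-8 decoder rejects the data.
try:
    utf16_bytes.decode("utf-8")
    decoded_as_utf8 = True
except UnicodeDecodeError:
    decoded_as_utf8 = False

# Re-encoding: decode with the encoding the file really uses, then encode as UTF-8.
utf8_bytes = utf16_bytes.decode("utf-16").encode("utf-8")
```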

Cheers,

Dan Gillean, MAS, MLIS
AtoM Program Manager
Artefactual Systems, Inc.
604-527-2056
@accesstomemory

--
You received this message because you are subscribed to the Google Groups "AtoM Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ica-atom-users+unsubscribe@googlegroups.com.
To post to this group, send email to ica-atom-users@googlegroups.com.
Visit this group at https://groups.google.com/group/ica-atom-users.
To view this discussion on the web visit https://groups.google.com/d/msgid/ica-atom-users/29cfb1d5-f360-4116-b367-f62fadf49f93%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Elizabeth McManus

Apr 26, 2017, 1:00:51 PM
to AtoM Users
Hi,

Thanks for the quick response.  LibreOffice solved my problem.

Thank you
Elizabeth