Diacriticals in authority and .Arches files problems

Lucy FJ

Jan 20, 2016, 3:32:48 AM
to Arches Project
Hi all,
We have been loading customised authority files and have noticed that Arches rejects words with diacriticals (accents etc.). This is not a problem for us, as we were happy to remove them, and if we really want them we can enter them through the RDM. But will this problem occur when loading resource data through .arches files? We need to input place names as alternative names using diacriticals, and it would be much easier if we could do this via .arches files. We know we can input them using the resource data manager, but obviously, when dealing with about 3000 entries, this is time-consuming.
Any ideas?
Lucy

Alexei Peters

Jan 20, 2016, 1:24:45 PM
to Lucy FJ, Arches Project
Hi Lucy,
The .arches file should support diacritics.  I'm actually surprised that the authority files don't.  I just tested a local file and I was able to add these records:

conceptid,PrefLabel,AltLabels,ParentConceptid,ConceptType,Provider 
20000001-0000-0000-0000-000000000000,Portland,,CITY_AUTHORITY_DOCUMENT.csv,Index,GCI
20000002-0000-0000-0000-000000000000,San Francisco,The Bay Area,CITY_AUTHORITY_DOCUMENT.csv,Index,GCI
20000003-0000-0000-0000-000000000000,San Jose,San José,CITY_AUTHORITY_DOCUMENT.csv,Index,GCI

Notice that the alt label for San Jose is San José.

Can you share the authority file that you're having trouble with?
Cheers,
Alexei


Director of Web Development - Farallon Geographics, Inc. - 971.227.3173


Lucy Fletcher-Jones

Jan 21, 2016, 8:18:43 AM
to Alexei Peters, Arches Project
Hi Alexei,
 
Thank you for looking into this. I am glad to hear that Arches should support diacriticals.
 
Here is the error message on loading the 'Ruler' Authority document:
 
RULER_AUTHORITY_DOCUMENT.csv
 
ERRORS IN FILE: RULER_AUTHORITY_DOCUMENT.values.csv
 
ERRORS IN FILE: RULER_AUTHORITY_DOCUMENT.csv
 
ERROR: Make sure the file is saved with UTF-8 encoding
'utf8' codec can't decode byte 0xea in position 30: invalid continuation byte
Traceback (most recent call last):
  File "/opt/projects/ENV/lib/python2.7/site-packages/arches/management/commands/package_utils/authority_files.py", line 112, in load_authority_file
    for row in rows:
  File "/opt/projects/ENV/lib/python2.7/site-packages/unicodecsv/py2.py", line 217, in next
    row = csv.DictReader.next(self)
  File "/usr/local/lib/python2.7/csv.py", line 104, in next
    row = self.reader.next()
  File "/opt/projects/ENV/lib/python2.7/site-packages/unicodecsv/py2.py", line 128, in next
    for value in row]
  File "/opt/projects/ENV/lib/python2.7/encodings/utf_8_sig.py", line 22, in decode
    (output, consumed) = codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xea in position 30: invalid continuation byte
 
ERROR in row 31 (Legacyoid (RULER_UID:30) not found.  Make sure your ParentConceptid in the  
 
This caused further errors in the Ruler Values files as can be seen from above.
I do not have a copy of the authority file that caused the error, as I have since corrected it and changed it in a few places. But the alternative name was
 
Ptolemaîos Philadelphos
 
and I believe it was the circumflex above the 'i' that caused the problem. Certainly when I removed the circumflex, the file loaded OK.
 
Thank you,
Lucy
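
For anyone reading the traceback above: the decoder complains about byte 0xea, which is how single-byte Windows/Latin encodings store ê (î would be 0xee). The sketch below reproduces the same kind of error in Python 3. It assumes the file was saved in cp1252, and "fête" is a made-up sample value, not a name from the Ruler file.

```python
def try_utf8(raw):
    """Return the decoded text, or the decoder's error message if raw is not valid UTF-8."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        return str(exc)


# "fête" is a hypothetical sample value used only for illustration.
legacy_bytes = "fête".encode("cp1252")  # ê is stored as the single byte 0xea
utf8_bytes = "fête".encode("utf-8")     # ê is stored as the two bytes 0xc3 0xaa

print(try_utf8(legacy_bytes))  # error message naming byte 0xea
print(try_utf8(utf8_bytes))    # decodes cleanly
print("î".encode("cp1252"))    # the byte a cp1252 file would hold for î
```

In UTF-8, 0xea announces a three-byte character, so when a plain ASCII letter follows it the decoder reports an invalid continuation byte, which matches the message in the traceback.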
 
 

Adam Cox

Jan 21, 2016, 10:36:31 AM
to Lucy Fletcher-Jones, Alexei Peters, Arches Project
Hi Lucy, you can check the encoding in Notepad++.  Open your authority document with that program and click the Encoding menu.  Your file should be in "UTF-8" or "UTF-8 without BOM" (depending on the version of Notepad++ you have). The î character should work as far as I know...
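
If Notepad++ is not to hand, the same check can be scripted. This is only a rough Python 3 sketch, not part of Arches: the function names `non_utf8_lines` and `check_file` are invented for illustration, and the file name in the usage comment is just the one from the error message.

```python
def non_utf8_lines(lines):
    """Return the 1-based numbers of the byte lines that are not valid UTF-8."""
    bad = []
    for number, raw in enumerate(lines, start=1):
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError:
            bad.append(number)
    return bad


def check_file(path):
    """Open the file in binary mode so nothing is decoded before we test it."""
    with open(path, "rb") as handle:
        return non_utf8_lines(handle)


# Usage (hypothetical path):
# print(check_file("RULER_AUTHORITY_DOCUMENT.csv"))
```

An empty list means every line decodes as UTF-8. Any numbers returned point at the rows to inspect in a text editor.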

Lucy Fletcher-Jones

Jan 22, 2016, 8:50:34 AM
to Alexei Peters, Adam Cox, Arches Project
Hi Adam and Alexei
 
To edit these authority files I am using Excel 2010 on a PC (Vista) and Excel 2013 on a PC (Windows 7) and saving as a comma-delimited file. I added 5 names that use all sorts of diacriticals and see that when I open the file in Notepad++, it is actually encoded in ANSI, not UTF-8.  Obviously I am choosing the wrong save format from Excel. Which should I be using?

I have attached the file here so you can see it and play with it yourselves.  The tilde becomes a '?'. Actually the circumflex works, which is strange because when I got the error before, only the name with the circumflex was left, as I had already changed the names with a tilde.
Converting to UTF-8 doesn't help with the tilde problem.
 
Thank you for looking into this.
Lucy
 

RULER_AUTHORITY_DOCUMENTERROR.csv

Lucy Fletcher-Jones

Jan 22, 2016, 9:05:52 AM
to Adam Cox, Alexei Peters, arches...@googlegroups.com
Hi Adam and Alexei,
 
I forgot to add that the diacriticals are in the altnames at rows 132 to 136 when editing in Excel.

Koen Van Daele

Jan 22, 2016, 9:24:42 AM
to Arches Project

Hi Lucy,


as far as I know, Excel (all versions) is notoriously bad at handling things like character encodings.  This rather old Stack Overflow question seems to confirm that:

http://stackoverflow.com/questions/4221176/excel-to-csv-with-utf8-encoding

It does offer some workarounds, but none of them are very nice.


I would suggest writing your CSV files with LibreOffice/OpenOffice. You should be able to install it, and it's free. While it's not always an exact replacement for Excel, when it comes to character encodings, it just works. By default it will save things as UTF-8 (at least under Linux it does), and it will ask you if you want to save in a different encoding.


Cheers,

Koen




Lucy FJ

Jan 24, 2016, 6:28:47 AM
to Arches Project
Hi Koen,

Thank you for this information. I did try out some of the suggestions on Google for using Excel to create UTF-8 files, because I like using Excel and know it well, but they are overcomplicated and produce a CSV file in UTF-8-BOM format, which I believe will not work in Arches. It looks like I will need to download OpenOffice as you suggest. Must all files loaded into Arches be UTF-8 only?

Lucy
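
On the BOM worry: the traceback earlier in this thread passes through Python's `utf_8_sig` codec, which skips a leading byte order mark when decoding. Whether every Arches loader uses that codec is not confirmed here, so treat the following Python 3 sketch as an illustration of the codec only, not of Arches itself.

```python
import codecs

# A UTF-8 file "with BOM" simply starts with these three extra bytes.
with_bom = codecs.BOM_UTF8 + "San José".encode("utf-8")

kept = with_bom.decode("utf-8")         # plain UTF-8 keeps the BOM as U+FEFF
stripped = with_bom.decode("utf-8-sig") # utf-8-sig drops the leading BOM

print(repr(kept))
print(repr(stripped))
```

The `utf-8-sig` codec also reads files without a BOM unchanged, so a loader that uses it accepts both variants.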

Van Daele, Koen

Jan 25, 2016, 4:39:35 AM
to Lucy FJ, arches...@googlegroups.com

Hi Lucy,


character encodings are one of those nasty issues in computing that nobody likes tackling. If you want a detailed, yet fairly easy to follow analysis on why that is, see http://www.joelonsoftware.com/articles/Unicode.html (Cthulhu is waiting for you there though...)


Basically, what Arches does is the best thing possible: it expects UTF-8. That way most human languages can be integrated in Arches, and all you need to do is make sure your data is UTF-8. Unfortunately, Excel makes that bloody impossible. I think Excel saves that file in the ISO-8859-1 encoding. That encoding just doesn't know the characters you're trying to save (ISO-8859-1 only contains 191 characters). So, it's not just Arches; I can't read them either. Even if Excel told you when saving as CSV that you will lose information (as it should), it still wouldn't work, since your CSV file already contains illegal ISO-8859-1 characters.


And it's not just Excel; the whole Windows ecosystem is fundamentally flawed in that regard. I myself run Linux, where character encoding is handled correctly and UTF-8 is the default. No idea how they do it on a Mac.


So, I think using OpenOffice is your best bet. Or just open the csv file you have in Notepad++ (or similar text editor), save the file as UTF-8 and fix the problems manually. But then you'd have to do that every time you want to change something.
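
The "save the file as UTF-8" step can also be scripted. Below is a minimal Python 3 sketch, assuming the Excel export is in cp1252 (the usual Windows "ANSI" code page; ISO-8859-1 would be handled the same way by changing `src_encoding`). The function names and the output file name are invented for illustration.

```python
from pathlib import Path


def reencode_bytes(raw, src_encoding="cp1252"):
    """Decode bytes from the assumed source encoding and re-encode them as UTF-8."""
    return raw.decode(src_encoding).encode("utf-8")


def reencode_file(src_path, dst_path, src_encoding="cp1252"):
    """Write a UTF-8 copy of src_path to dst_path; the original file is left untouched."""
    Path(dst_path).write_bytes(reencode_bytes(Path(src_path).read_bytes(), src_encoding))


# Usage (hypothetical file names):
# reencode_file("RULER_AUTHORITY_DOCUMENT.csv", "RULER_AUTHORITY_DOCUMENT.utf8.csv")
```

This only helps with characters the source encoding could store. Anything Excel already replaced with '?' on export is gone and still has to be fixed by hand.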


Cheers,

Koen



 

Richard Jennings

Jan 30, 2016, 1:21:35 PM
to Arches Project
Hi Lucy,

I just wanted to say that I agree with Koen in that I use OpenOffice to manipulate my authority documents, resource graphs and in preparing my .arches files and find that it works very well in terms of handling issues such as yours. I can't recommend it highly enough after having similar problems with Excel.

Best wishes,

Richard

Lucy Fletcher-Jones

Jan 30, 2016, 2:44:36 PM
to Richard Jennings, Arches Project
Thanks Richard. It looks like this is what I had better do!
Lucy


Adam Cox

Feb 1, 2016, 12:04:49 PM
to Arches Project
Hi Koen and Richard, thanks for the OpenOffice thumbs up.  Makes me think we could do with adding a little "recommended software" section to the Arches documentation.  At least, I'll start a thread here, so people can make suggestions on the forum for now...  Would one of you mind replying with a sentence or two about OpenOffice?  I haven't used it, so I wouldn't be the best person to recommend it.