Non-English Characters - UTF-8 vs Latin-1/ISO-8859-1 dictionary/transcripts

Michael McGarrah

Jan 28, 2017, 3:16:15 PM

to FAVE (Force Alignment and Vowel Extraction) Users Group

Hello,

When I try to use a custom dictionary containing non-English characters on an Ubuntu 16.04 Linux box, I get an error that makes it seem like the special characters cannot be matched.

Both the dictionary and the transcription contain the word "PÅ", which happens to be the first instance of a special character in my test data.

Below I'm including my command line and some file type information. I've been concerned that inconsistencies in the text file format might affect the results, so I tried some variations.

To add to my confusion, when I use these same files on the http://fave.ling.upenn.edu/FAAValign.html website, it produces results and sends them to my email.

Any pointers appreciated.

mcgarrah@fave:~/FAVE/FAVE-align$ python FAAValign.py -v -i ~/DATA/CUSTOM_DIC.txt ~/DATA/READING.wav ~/DATA/READING.txt ~/DATA/READING.TextGrid

Read dictionary from file model/dict.

Added all entries in file CUSTOM_DIC.txt to CMU dictionary.

Read dictionary from file added_dict_entries.txt.

Added new entries from file CUSTOM_DIC.txt to file added_dict_entries.txt.

Encoding is UTF-16!

Encoding is UTF-8!

Read transcription file READING.txt.

Checking format of input transcription file...

Checking dictionary entries for all words in the input transcription...

Please enter the Arpabet transcription of word PÅ, or enter [s] to skip.

mcgarrah@fave:~/DATA$ file READING.txt

READING.txt: UTF-8 Unicode text

READING_GMAIL.txt: ISO-8859 text

mcgarrah@fave:~/DATA$ file CUSTOM_DIC*.txt

CUSTOM_DIC.txt: ISO-8859 text

CUSTOM_DIC_DropBox.txt: Non-ISO extended-ASCII text, with CR line terminators

CUSTOM_DIC_GMAIL.txt: ISO-8859 text

CUSTOM_DIC_ORIG.txt: Non-ISO extended-ASCII text
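The `file` output above points at one possible cause: the transcript is UTF-8 while the dictionary passed on the command line is ISO-8859 (Latin-1), and "Å" is stored as different bytes in each. Here is a minimal sketch of that mismatch. It assumes the words end up being compared as raw bytes, which has not been confirmed for FAAValign.py.

```python
# -*- coding: utf-8 -*-
# Sketch only: shows why the same word can fail a byte-level comparison
# when two files use different encodings. How FAAValign.py actually
# compares words is an assumption here, not something verified.

word = u"P\u00c5"  # the word "PÅ" (U+00C5 is LATIN CAPITAL LETTER A WITH RING ABOVE)

utf8_bytes = word.encode("utf-8")      # two bytes for "Å": 0xC3 0x85
latin1_bytes = word.encode("latin-1")  # one byte for "Å": 0xC5

# Compared as raw bytes, the two spellings do not match.
same_bytes = utf8_bytes == latin1_bytes

# Decoded with the correct encoding each, they are the same text.
same_text = utf8_bytes.decode("utf-8") == latin1_bytes.decode("latin-1")

print(repr(utf8_bytes), repr(latin1_bytes), same_bytes, same_text)
```

If this is what is happening, getting the dictionary and the transcript into the same encoding should let the lookup succeed.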

Josef Fruehwald

Feb 10, 2017, 8:50:19 AM

to FAVE (Force Alignment and Vowel Extraction) Users Group

Hi Michael,

What is the Arpabet transcription you've given "PÅ"?

It looks like the issue shows up as the Arpabet transcription for "PÅ" not being recognized, so the command line interface simply requests an Arpabet transcription as it would for any out-of-dictionary word. As for why the online interface returns a result, I can't be sure, but I believe it is actually ignoring out-of-dictionary items and omitting them from the alignment.

-Joe

Michael McGarrah

Feb 11, 2017, 12:36:44 PM

to FAVE (Force Alignment and Vowel Extraction) Users Group

In the online web interface version, those characters are being recognized and processed. They are not dropped as they are on the command line. I'm digging into the charset the browser sends on upload and working through the options to get consistency.

I'm concerned that something in my Linux, Python, or third-party tool (HTK) setup is wrong and locked to a character representation that only includes English characters.

I'm continuing to dig and will share as I learn more. Any recommendations are very welcome.
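One way to test the encoding theory is to re-encode the dictionary so that it matches the UTF-8 transcript before running the aligner. The sketch below uses a hypothetical helper, `reencode_to_utf8`, which is not part of FAVE. It assumes the source file really is Latin-1, as `file` reports for CUSTOM_DIC.txt. The files reported as "Non-ISO extended-ASCII" have an unknown encoding and would need a different `source_encoding`.

```python
import io


def reencode_to_utf8(src_path, dst_path, source_encoding="latin-1"):
    """Hypothetical helper: read text in source_encoding, write it as UTF-8.

    newline="" keeps the original line terminators (LF, CR, or CRLF) intact.
    """
    with io.open(src_path, "r", encoding=source_encoding, newline="") as src:
        text = src.read()
    with io.open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)


# Example call (the output file name is made up for illustration):
# reencode_to_utf8("CUSTOM_DIC.txt", "CUSTOM_DIC_utf8.txt")
```

Latin-1 decoding never raises an error, so a wrong guess about the source encoding silently produces wrong characters. It is worth checking the converted file with `file` and by eye before passing it to FAAValign.py.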