Non-English Characters - UTF-8 vs Latin-1/ISO-8859-1 dictionary/transcripts

Michael McGarrah

Jan 28, 2017, 3:16:15 PM
to FAVE (Force Alignment and Vowel Extraction) Users Group
Hello,

When I try to use a custom dictionary containing non-English characters on an Ubuntu 16.04 Linux box, I get an error that makes it seem like the script cannot match the special characters.
Both the dictionary and the transcription contain the word "PÅ", which happens to be the first instance of a special character in my test data.
Below I'm including my command line and some file type information. I've been concerned that inconsistencies in the text file format might affect the results, so I tried some variations.

To add to my confusion, when I use these same files on the http://fave.ling.upenn.edu/FAAValign.html website, it produces results and sends them to my email.

Any pointers appreciated.


mcgarrah@fave:~/FAVE/FAVE-align$ python FAAValign.py -v -i ~/DATA/CUSTOM_DIC.txt ~/DATA/READING.wav ~/DATA/READING.txt ~/DATA/READING.TextGrid
Read dictionary from file model/dict.
Added all entries in file CUSTOM_DIC.txt to CMU dictionary.
Read dictionary from file added_dict_entries.txt.
Added new entries from file CUSTOM_DIC.txt to file added_dict_entries.txt.
Encoding is UTF-16!
Encoding is UTF-8!
Read transcription file READING.txt.
Checking format of input transcription file...
Checking dictionary entries for all words in the input transcription...
Please enter the Arpabet transcription of word PÅ, or enter [s] to skip.


mcgarrah@fave:~/DATA$ file READING.txt
READING.txt: UTF-8 Unicode text
READING_GMAIL.txt: ISO-8859 text

mcgarrah@fave:~/DATA$ file CUSTOM_DIC*.txt
CUSTOM_DIC.txt:       ISO-8859 text
CUSTOM_DIC_DropBox.txt: Non-ISO extended-ASCII text, with CR line terminators
CUSTOM_DIC_GMAIL.txt:   ISO-8859 text
CUSTOM_DIC_ORIG.txt:         Non-ISO extended-ASCII text
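
The `file` output above shows the transcript as UTF-8 and the dictionary as ISO-8859, so the same word can end up as different bytes in the two files. The following is only a minimal Python 3 sketch of that mismatch and of one way to re-encode a file as UTF-8. It assumes the source file really is Latin-1 (ISO-8859-1); the files reported as "Non-ISO extended-ASCII" may use some other encoding, and `to_utf8` is a hypothetical helper, not part of FAVE.

```python
from pathlib import Path


def to_utf8(src, dst, src_encoding="latin-1"):
    """Re-encode a text file as UTF-8, assuming it is currently in src_encoding."""
    text = Path(src).read_bytes().decode(src_encoding)
    # Normalise CRLF and bare CR line terminators to LF.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    Path(dst).write_bytes(text.encode("utf-8"))


word = "P\u00c5"                        # the word "PÅ"
latin1_bytes = word.encode("latin-1")   # b'P\xc5'
utf8_bytes = word.encode("utf-8")       # b'P\xc3\x85'

# A byte-for-byte comparison of the two encodings fails,
# even though both represent the same word.
print(latin1_bytes == utf8_bytes)       # False
```

A call such as `to_utf8("CUSTOM_DIC.txt", "CUSTOM_DIC_utf8.txt")` (the output name is made up for illustration) would produce a copy whose encoding matches the UTF-8 transcript. Whether that resolves the lookup failure is untested here.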

Josef Fruehwald

Feb 10, 2017, 8:50:19 AM
to FAVE (Force Alignment and Vowel Extraction) Users Group
Hi Michael,

What is the Arpabet transcription you've given for "PÅ"?

It looks like the issue is that the Arpabet transcription for "PÅ" is not being recognized, so the command line interface simply requests one, as it would for any out-of-dictionary word. As for why the online interface returns a result, I can't be sure, but I believe it ignores out-of-dictionary items and omits them from the alignment.

-Joe

Michael McGarrah

Feb 11, 2017, 12:36:44 PM
to FAVE (Force Alignment and Vowel Extraction) Users Group
In the online web interface, those characters are recognized and processed, not dropped; it is only on the command line that they fail to match. I'm digging into the charset the browser sends on upload and working through the options to get consistent behavior.

I'm concerned that something in my Linux, Python, or third-party tool (HTK) setup is locked to a character representation that only includes English characters.
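
One quick way to inspect the Linux and Python side of that concern is to print the encodings the interpreter picks up from the environment. This is a small Python 3 sketch using only the standard library; `encoding_report` is a hypothetical helper name, and it says nothing about how HTK itself handles characters.

```python
import locale
import sys


def encoding_report():
    """Collect the encodings Python derives from the current environment."""
    return {
        # Encoding implied by the locale settings (LANG, LC_ALL, ...).
        "preferred": locale.getpreferredencoding(False),
        # Encoding used for file names.
        "filesystem": sys.getfilesystemencoding(),
        # Default encoding for str/bytes conversions.
        "default": sys.getdefaultencoding(),
        # Encoding of the terminal stream, or None if it is not set.
        "stdout": getattr(sys.stdout, "encoding", None),
    }


for name, value in encoding_report().items():
    print(name, "=", value)
```

If the preferred encoding comes back as something like ASCII rather than UTF-8, the locale settings would be a reasonable next place to look.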

I'm continuing to dig and will share as I learn more. Any recommendations are very welcome.

Thanks for your response.

