UTF/Unicode issues with clean_and_segment_data


Armin Oliya

Jul 27, 2017, 12:07:27 PM
to kaldi-help
Hi guys, 

I'm experimenting with steps/cleanup/clean_and_segment_data.sh using a Dutch corpus. 

In stage 3 there's a call to lattice_oracle_align.sh, which in its own stage 3 calls align-text, specifically:


align-text --special-symbol="$special_symbol" ark:$dir/text ark:$dir/oracle_hyp.txt ark,t:- | \
utils/scoring/wer_per_utt_details.pl --special-symbol "***" > $dir/analysis/per_utt_details.txt

This line breaks due to a Unicode mismatch:

align-text '--special-symbol=***' ark:exp/tri4_cleaned_input/lattice_oracle/text ark:exp/tri4_cleaned_input/lattice_oracle/oracle_hyp.txt ark,t:-
utf8 "\xE9" does not map to Unicode at utils/scoring/wer_per_utt_details.pl line 75, <STDIN> line 1.
utf8 "\xE9" does not map to Unicode at utils/scoring/wer_per_utt_details.pl line 131, <STDIN> line 12.
utf8 "\xEF" does not map to Unicode at utils/scoring/wer_per_utt_details.pl line 131, <STDIN> line 66.
utf8 "\xF6" does not map to Unicode at utils/scoring/wer_per_utt_details.pl line 131, <STDIN> line 72.
Malformed UTF-8 character (unexpected non-continuation byte 0x72, immediately after start byte 0xf6) in print at utils/scoring/wer_per_utt_details.pl line 128, <STDIN> line 72.
Code point 0x0000 is not Unicode, may not be portable at utils/scoring/wer_per_utt_details.pl line 128, <STDIN> line 72.
... 


Apparently the encoding of the text and oracle_hyp.txt files isn't UTF-8, and wer_per_utt_details.pl isn't happy about it. 

Could you suggest the best way to fix this?
Thanks.

Jan Trmal

Jul 27, 2017, 12:21:26 PM
to kaldi-help
IMO by converting everything to utf-8.
Or hack the script (supply the encoding you use).
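For example, if the files really are Latin-1 (where \xE9, \xEF and \xF6 are é, ï and ö, which would fit Dutch), a one-off conversion could look like this -- just a sketch, and the paths are placeholders:

  iconv -f ISO-8859-1 -t UTF-8 data/train/text > data/train/text.utf8
  mv data/train/text.utf8 data/train/text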
y.


Armin Oliya

Jul 27, 2017, 1:33:52 PM
to kaldi-help
Thanks Yenda, 

I've been manually re-encoding all files up until now, but I wonder if I missed a config option to force Kaldi's scripts to generate everything in UTF-8. 

It could also be my Perl settings. I tried setting PERL_UNICODE to AS to force all I/O to be UTF-8, but no luck. 



Daniel Povey

Jul 27, 2017, 3:04:58 PM
to kaldi-help
It would be great if you could debug this and figure out where the
not-valid-UTF8 text first appears. Our intention is that all scripts
should work properly with UTF8.

Actually in most cases it should not matter whether Perl is
interpreting things as UTF-8 or ASCII because it's doing things
per-word, not per-character. But if it does turn out to be necessary
to set PERL_UNICODE to AS, we should do this in the path.sh somehow.
Note to others: to learn about the PERL_UNICODE variable, type
perldoc perlrun
and search for -C. Default is DSL, and the L makes the stdin/out/err
(S) conditional on the locale variables, which in our case (LC_ALL=C)
would probably turn off unicode. Anyway, I believe that if any Perl scripts
are required to use Unicode encoding for their inputs, we set that
mode inside the script somehow.
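For reference, that lookup, plus a hypothetical way of forcing the mode for a single run from the shell (the input file name here is made up):

  perldoc perlrun          # then search for -C; the PERL_UNICODE values are described there
  # A = treat @ARGV as UTF-8, S = treat stdin/stdout/stderr as UTF-8
  PERL_UNICODE=AS perl utils/scoring/wer_per_utt_details.pl --special-symbol "***" < some_align_output.txt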


Dan

Armin Oliya

Jul 28, 2017, 12:58:35 PM
to kaldi-help, dpo...@gmail.com
Thanks Dan for the explanation.

I think I wasn't clear, but this actually seems to be the first place that causes the problem. I noticed some of the other files (if I remember correctly, lexiconp.txt and text) weren't UTF-8, so in my previous experiments I manually changed them all, just re-encoding every file that I suspected contained accented characters.

Anyway, I added the following two lines to lattice_oracle_align.sh just before the call to align-text, and the issue is resolved for now:

  recode -d h..u8 $dir/text
  recode -d h..u8 $dir/oracle_hyp.txt

I'll pay closer attention in subsequent experiments and will let you know if I spot a pattern. 


Thanks!

Daniel Povey

Jul 28, 2017, 2:57:13 PM
to Armin Oliya, kaldi-help
Hm. Well, the assumption is that all of your inputs to Kaldi are
UTF-8 encoded, so this might really be a data preparation issue. I
suspect there was some input, such as the lexicon, which was encoded
in something other than UTF-8. We could fix it by adding some statements to the
data validation to check for this, although I'm not sure what command
can do this.
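(One candidate, off the top of my head and not tested against the validation scripts: iconv exits with a non-zero status when its input is not valid UTF-8, so a check could be as simple as the following, with the path as a placeholder.)

  iconv -f UTF-8 -t UTF-8 data/train/text > /dev/null || echo "data/train/text is not valid UTF-8"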

Daniel Povey

Aug 6, 2017, 4:32:19 PM
to Armin Oliya, kaldi-help
Did you find what the problem was here?

Armin Oliya

Aug 12, 2017, 4:49:46 AM
to dpo...@gmail.com, kaldi-help
So I'm trying to trace back the issue with another experiment, but this time I'm stopped at the same point with a different error. 


lattice_oracle_align.sh: overall oracle %WER is: 6.05%
steps/cleanup/lattice_oracle_align.sh: oracle ctm is in exp/_66pct/tri4__input/lattice_oracle/ctm

align-text '--special-symbol=***' ark:exp/_66pct/tri4__input/lattice_oracle/text ark:exp/_66pct/tri4__input/lattice_oracle/oracle_hyp.txt ark,t:-

Detected incorrect separator want (expected ;).


I used a larger $nj (100) just for clean_and_segment_data.sh in the hope of making it faster; this and a bigger dataset are the only changes I made since last time. Any hints on why align-text is failing this time?

I'm still recoding text and oracle_hyp files manually:

recode -d h..u8 $dir/text
recode -d h..u8 $dir/oracle_hyp.txt


Thanks.

Daniel Povey

Aug 12, 2017, 3:18:39 PM
to Armin Oliya, kaldi-help
I previously asked you at what point in the processing the non-UTF
files are appearing. I want to know that.

You'll have to figure out which line of text is causing that problem.
That program will be splitting on space, but if you have a weird
encoding it might have trouble doing that.
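A rough way to hunt for suspicious lines (assuming GNU grep with PCRE support; the paths are the ones from your error message):

  # list lines containing bytes outside plain ASCII -- candidates for encoding problems
  LC_ALL=C grep -n -P '[\x80-\xFF]' exp/_66pct/tri4__input/lattice_oracle/text | head
  LC_ALL=C grep -n -P '[\x80-\xFF]' exp/_66pct/tri4__input/lattice_oracle/oracle_hyp.txt | head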

Armin Oliya

Aug 15, 2017, 4:15:32 AM
to Daniel Povey, kaldi-help
Sure, so as far as I can tell the problem is with the encoding of lattice_oracle/text and lattice_oracle/oracle_hyp.txt;
that would make stage 1 of lattice_oracle_align.sh the culprit. 

Daniel Povey

Aug 15, 2017, 1:16:45 PM
to Armin Oliya, kaldi-help
Great, so that's progress.
At the top of that script steps/cleanup/lattice_oracle_align.sh it does
. path.sh
which is a mistake, it should be
. ./path.sh
If there is another path.sh on your path, it could pick that up. Check
that path.sh contains "export LC_ALL=C" and try "which path.sh" to see
if another file with the same name is on your path.
And please run with --cleanup false and see if $dir/oracle_hyp.*.txt
also have the problem; that will pin the problem on either int2sym.pl
or on awk.
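Concretely, something like:

  grep LC_ALL ./path.sh    # the one in the current directory; should print: export LC_ALL=C
  which path.sh            # shows whether some other path.sh on your PATH would get picked up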

Armin Oliya

Aug 18, 2017, 3:38:58 AM
to Daniel Povey, kaldi-help
Great tips, thank you. I'll check and let you know; please give me a while. 

Armin Oliya

Sep 9, 2017, 9:41:38 AM
to kaldi-help
In my last experiment I failed again at the following line: 

align-text --special-symbol="$special_symbol" ark:$dir/text ark:$dir/oracle_hyp.txt ark,t:- | \
 utils/scoring/wer_per_utt_details.pl --special-symbol "***" > $dir/analysis/per_utt_details.txt

The issue seems to be with the special "NO-BREAK SPACE" character that appears in my original text and breaks wer_per_utt_details.pl, on a line that looks like:
epspk1003-epd1cb9aad.1-022500-025500-1 bijna bijna ; veertig veertig ;   <unk> ; nul nul ; patiënten patiënten ;


It shouldn't be a general Unicode issue, as the Perl script successfully compares text with an accented e. 
I don't speak Perl, so is this something that can be fixed within the Perl environment/code, or should I clean up my corpus?


Thanks!

Jan Trmal

Sep 9, 2017, 10:09:33 AM
to kaldi-help
I didn't analyze this in too much depth, but I believe the issue is that "NO-BREAK SPACE" is a whitespace separator and Perl's split() function treats it as a regular whitespace character, i.e. throws it away.
My suggestion would probably be: remove these Unicode whitespace characters during the initial stages of training (data preparation). I feel it's generally a bad idea to have these special characters in the data (it complicates debugging as well, as you cannot see them). Oftentimes this is also why people end up with these characters in the LM or in word lists -- they didn't see them during data preprocessing/filtering.
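For instance, NO-BREAK SPACE is the UTF-8 byte pair c2 a0, so with GNU grep/sed something along these lines would find and squash them during data prep (a sketch only -- adjust the path to wherever your text lives):

  LC_ALL=C grep -n -P '\xc2\xa0' data/train/text | head    # find lines that contain U+00A0
  LC_ALL=C sed -i 's/\xc2\xa0/ /g' data/train/text         # replace it with an ordinary space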
y.



Jan Trmal

Sep 9, 2017, 10:12:22 AM
to kaldi-help
And I mean it -- if you absolutely need some class of these characters in the output, then I suggest mapping them to ASCII characters (or sequences of ASCII characters) and only postprocessing the outputs after decoding to map them back into the text. It will make your troubleshooting easier.
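Purely as a sketch of the idea, with a made-up placeholder token and file names:

  # before data prep: map the character (here U+2019, a curly apostrophe) to an ASCII placeholder
  LC_ALL=C sed -i 's/\xe2\x80\x99/<RSQUOTE>/g' data/train/text
  # after decoding: map the placeholder back
  sed 's/<RSQUOTE>/’/g' decoded.txt > decoded.final.txt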
y

Armin Oliya

Sep 9, 2017, 10:59:32 AM
to kaldi-help
Good points, thank you Yenda!

Armin Oliya

Sep 29, 2017, 7:03:08 AM
to kaldi-help
Sorry for dragging this thread on.. 

I'm trying to clean a new dataset with clean_and_segment_data.sh. It does finish successfully, but the encoding of the cleaned 'text' file is changed to ASCII (the original is in UTF-8). What's worse, I can't seem to recode the new 'text' into UTF-8 (iconv stumbles on the first non-ASCII char it sees). This causes problems with downstream processes.

The earliest I can trace this back to is where segment_ctm_edits.py is called (stage 7), where an intermediate 'text' file - also with ASCII encoding - is generated. 

Appreciate your feedback!
Armin

Daniel Povey

Sep 29, 2017, 11:48:04 AM
to kaldi-help
Before doing anything about this I'd like to understand what the
specific issue is.
Can you verify that you have not changed any of the code in steps/ ?
And can you try to figure out what specifically is being changed in the text?
Don't trust your editor's judgement of what the encoding is-- look at
the bytes (using 'od -c' if needed). Is it removing all the
characters larger than 127? Are some lines valid utf-8, but not all?
And check that python is really python2 (e.g. python --version).
ASCII and UTF-8 encodings should be compatible. Even though those
scripts may be interpreting the data as byte strings, all they are
doing is splitting on spaces, which we don't need to understand the
utf-8 encoding to recognize. So they should leave the output still as
valid utf-8.
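i.e. something like:

  od -c text | head        # inspect the raw bytes instead of trusting what the editor guesses
  python --version         # should report 2.x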



Dan

Armin Oliya

Oct 4, 2017, 5:39:26 AM
to kaldi-help
Hi Dan,

First things first, I think I was misguided by the output of 'file <filename>' (https://stackoverflow.com/a/36030982/585642).
Looking at the text files more closely as you suggested, the original text looked like this, with the Unicode characters looking fine:



ubuntu@ip-10-0-0-13:raw/train_20h$ grep --color='auto' -P -n "[\x80-\xFF]" text
1387:AT_2013590_0435 ik ben nog maar een uur bezig en volgens het chema mag ik pas over één vijf uur pauze nemen
2680:AT_2021125_0199 volgens mij is ze nieuw en zit ze in één c maar ik weet het niet zeker
2698:AT_2021125_0217 uit één c maar ik weet het niet zeker dat kan niet
...


ubuntu@ip-10-0-0-13:~/data/raw/train_20h$ tail -n +1387 text | head -n1 | hexdump -C -s90
0000005a  65 72 20 c3 a9 c3 a9 6e  20 76 69 6a 66 20 75 75  |er ....n vijf uu|
0000006a  72 20 70 61 75 7a 65 20  6e 65 6d 65 6e 0a        |r pauze nemen.|

and after cleaning it came out like this (the line numbers won't match because the text has gone through cleaning); the Ã© in the three lines below is meant to be é:



ubuntu@ip-10-0-0-6:~/data/train_20h_cleaned$ grep --color='auto' -P -n "[\x80-\xFF]" text


2155:AT_2021125_0199-1 <unk> en zit ze in Ã©Ã©n c maar ik weet het
2167:AT_2021125_0217-1 uit Ã©Ã©n c maar ik weet het <unk>
7341:AT_2035089_0763-1 ook al gebeurt het maar Ã©Ã©n keer




ubuntu@ip-10-0-0-6:~/brugklas/data/brugklas_train_20h_cleaned$ tail -n +2155 text | head -n1 | hexdump -C -s40
00000028  7a 65 20 69 6e 20 c3 83  c2 a9 c3 83 c2 a9 6e 20  |ze in ........n |
00000038  63 20 6d 61 61 72 20 69  6b 20 77 65 65 74 20 68  |c maar ik weet h|
00000048  65 74 0a                                          |et.|




As far as I can tell by looking at the comm of the two text files (before and after cleaning), all Unicode chars are messed up: é (UTF-8 bytes c3 a9) comes out as c3 83 c2 a9, i.e. the original UTF-8 bytes re-encoded a second time as if they were Latin-1, which is the Ã© seen above. 

Other posts that helped me understand the issue:

The changes I can see by looking at "git diff .. " in the Kaldi folder are all lines that I had added before and have now commented out. The Python version is 2.7.

I thought there could be two explanations: either I've messed up somewhere (http://www.weblogism.com/item/270/why-does-e-become-a) or my Kaldi version is out of date. The latter is less likely, though, since cleaning another set didn't mess up the Unicode. 

Anyway, I pulled the latest version of Kaldi and redid the process; looking at the intermediate text files, all Unicode characters look normal and I'm able to proceed with downstream processes. 



I don't really have an explanation at this point, but I'm glad I've learned how to double-check the encoding details before proceeding with other tasks. 


Thanks 
Armin