Data Preparation Error - CR character and UTF-8 Whitespace Character

liam.h...@gmail.com

Apr 7, 2019, 1:03:53 PM
to kaldi-help
Hey out there, apologies for the beginner question. I promise I have done a lot of searching already, but I can't seem to solve this one on my own.

I think the .sh files are reading my data differently than they should be. I am running all of this in an Ubuntu VirtualBox VM because I'm on Windows; I don't know if that has anything to do with it...

I'm getting the errors:
steps/make_mfcc.sh --nj 1 --cmd run.pl data/train exp/make_mfcc/train mfcc
utils/validate_text.pl: The line for utterance 763_cora contains CR (0x0D) character
and
utils/validate_text.pl: ERROR: text file 'data/train/text' contains disallowed UTF-8 whitespace character(s)

It's also telling me:
steps/make_mfcc.sh --nj 1 --cmd run.pl data/test exp/make_mfcc/test mfcc
utils/validate_data_dir.sh: WARNING: you have only one speaker. This is probably a bad idea.
But that's not true.

My text file looks like (though not actually bolded):
763_anton i am finally done with my english homework anton told his mom his mom gave him a look and asked oh really it took me an hour and a half now i better not wait to start my math anton quickly took out his work this should not take too long in fact it should only take about ten minutes
763_cora cora and her baby brother theo are playing in the snow although its chilly the bright sun feels like its already spring as cora pushes theos sled to the top of the hill she smiles he looks exactly like a bear in his furry snowsuit theo laughs and then surprise he rolls off the sled and starts sliding down the hill he quickly gains speed over a steep icy section racing after him cora sees someone appear at the bottom of the hill with care her mother scoops up her speeding son standing up she laughs no broken bones on this little bear now lets go inside and have lunch
773_anton i am finally done with my english homework anton told his mom his mom gave him a look and asked oh really it took me an hour and a half now i better not wait to start my math anton quickly took out his work this should not take too long in fact it should only take about ten minutes
773_cora cora and her baby brother theo are playing in the snow although its chilly the bright sun feels like its already spring as cora pushes theos sled to the top of the hill she smiles he looks exactly like a bear in his furry snowsuit theo laughs and then surprise he rolls off the sled and starts sliding down the hill he quickly gains speed over a steep icy section racing after him cora sees someone appear at the bottom of the hill with care her mother scoops up her speeding son standing up she laughs no broken bones on this little bear now lets go inside and have lunch
...it goes on.

I have two passages (anton and cora) that each student read once, so each speaker ID is attached to two utterances. There are no carriage return characters except to start a new text ID.

I've tried playing with the encoding, but I don't know how to go about altering it, nor what I would need to change it to in order to satisfy the program. I created all of these files myself as instructed in the Kaldi for Dummies tutorial, so I don't see how I could have encoded whitespace or CR characters into them that weren't meant to be there.

Any help greatly appreciated. Thank you!

Daniel Povey

Apr 7, 2019, 1:12:56 PM
to kaldi-help
If you created those files on Windows, that would explain why the CR characters are there.
You could see if there is a 'dos2unix' command to fix them.
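
For readers without `dos2unix` available, a similar cleanup can be sketched in a few lines of Python. This is only an illustrative sketch, not part of Kaldi: the function name `strip_cr` is hypothetical, and it removes every CR (0x0D) byte, which is slightly more aggressive than converting CRLF line endings to LF. The file is read and written in binary mode so that no newline translation happens along the way.

```python
from pathlib import Path


def strip_cr(path):
    """Remove every CR (0x0D) byte from the file at `path`, in place.

    Returns the number of CR bytes that were removed.
    (Hypothetical helper for illustration; not a Kaldi utility.)
    """
    p = Path(path)
    data = p.read_bytes()               # binary read: no newline translation
    cleaned = data.replace(b"\r", b"")  # drop CRs, leaving LF line endings
    if cleaned != data:
        p.write_bytes(cleaned)          # only rewrite the file if it changed
    return len(data) - len(cleaned)


if __name__ == "__main__":
    import sys

    # Example: python strip_cr.py data/train/text
    for name in sys.argv[1:]:
        print(f"{name}: removed {strip_cr(name)} CR byte(s)")
```

Running it on `data/train/text` (the path from the error message) and then re-running the Kaldi validation would show whether the CR error goes away.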

--
Go to http://kaldi-asr.org/forums.html find out how to join
---
You received this message because you are subscribed to the Google Groups "kaldi-help" group.
To unsubscribe from this group and stop receiving emails from it, send an email to kaldi-help+...@googlegroups.com.
To post to this group, send email to kaldi...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/kaldi-help/7432f83b-d39b-48e6-b81e-d46aca4b9fe2%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

liam.h...@gmail.com

Apr 7, 2019, 1:22:05 PM
to kaldi-help
Thanks for the reply. No, though: I made all of the files from within the Ubuntu box, in a text editor.

Daniel Povey

Apr 7, 2019, 1:25:23 PM
to kaldi-help
It doesn't really matter how they got there, you need to remove the CR characters if they are there.
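
One way to check whether the characters are really there, and on which lines, is a small scan like the one below. It is a rough sketch under stated assumptions: `find_bad_chars` is a hypothetical helper, and since this thread does not spell out exactly which characters `utils/validate_text.pl` rejects, the sketch simply flags CR plus any whitespace other than ASCII space, tab, and newline.

```python
import unicodedata

# Assumption: only these whitespace characters are acceptable in the text file.
ALLOWED_WHITESPACE = {" ", "\t", "\n"}


def find_bad_chars(path):
    """Yield (line_number, utterance_id, description) for each suspicious character.

    Hypothetical helper for illustration; not a Kaldi utility.
    """
    # Binary mode so that CR bytes are seen exactly as stored on disk.
    with open(path, "rb") as f:
        for lineno, raw in enumerate(f, start=1):
            line = raw.decode("utf-8", errors="replace")
            # The utterance ID is the first whitespace-separated field.
            utt = line.split(maxsplit=1)[0] if line.strip() else ""
            for ch in line:
                if ch == "\r":
                    yield lineno, utt, "CR (0x0D)"
                elif ch.isspace() and ch not in ALLOWED_WHITESPACE:
                    name = unicodedata.name(ch, "UNKNOWN")
                    yield lineno, utt, f"U+{ord(ch):04X} {name}"


if __name__ == "__main__":
    import sys

    # Example: python find_bad_chars.py data/train/text
    for path in sys.argv[1:]:
        for lineno, utt, what in find_bad_chars(path):
            print(f"{path}:{lineno}: utterance {utt}: {what}")
```

If the scan prints nothing for a file that Kaldi still rejects, the characters are probably being introduced when the file is read rather than being stored in it.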

Windows is weird; it can sometimes insert those characters automatically (if you read in text mode).

I can't offer too much help. In the past I have found that dealing with Windows issues isn't really worth the time.

Dan

liam.h...@gmail.com

Apr 7, 2019, 1:32:12 PM
to kaldi-help
OK, I'll try uninstalling and reinstalling slowly and carefully. Thanks for the help.

Daniel Povey

Apr 7, 2019, 1:38:04 PM
to kaldi-help
It's not an installation issue.
Please read what I have already said more carefully (e.g. the dos2unix stuff) rather than posting follow-ups.



Daniel Povey

Apr 7, 2019, 3:38:22 PM
to kaldi-help

It's possible that your data is not located inside your VirtualBox filesystem but is mounted from your host Windows machine. In that case, when files are opened in text mode, the Windows newline -> CR + newline translation may occur.
Fixing this would be a lot of work, as we'd have to go through all the Kaldi scripts and make sure all reads are in binary mode.
It might be easier to copy your data to somewhere on the VirtualBox's own filesystem.

Dan