Error in data preparation


vinithaba...@gmail.com

Jul 10, 2018, 12:16:46 AM
to kaldi-help
Good morning all,
  While running Kaldi I am getting this error in the data preparation step. Everything seems fine with my text file, but I still get the error.

utils/validate_text.pl: ERROR: text file 'data/test_clean/text' contains disallowed UTF-8 whitespace character(s)





Thanks in advance
vinitha

Daniel Povey

Jul 10, 2018, 12:27:49 AM
to kaldi-help
Likely some weird Unicode space character that is not an ASCII space. Unicode has quite a few code points that count as whitespace.

--
Go to http://kaldi-asr.org/forums.html find out how to join
---
You received this message because you are subscribed to the Google Groups "kaldi-help" group.
To unsubscribe from this group and stop receiving emails from it, send an email to kaldi-help+unsubscribe@googlegroups.com.
To post to this group, send email to kaldi...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/kaldi-help/59d9f479-dfd4-4546-b0ef-9eb86f5f0f1a%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

vinithaba...@gmail.com

Jul 10, 2018, 12:37:06 AM
to kaldi-help
But the same file works fine on another machine; it only fails on mine.
What does "weird whitespace" mean, and how do I find it?

Thanks in advance
Vinitha



Daniel Povey

Jul 10, 2018, 12:55:31 AM
to kaldi-help
In this PR
I have updated validate_text.pl to say more specifically which line has bad UTF whitespace; that should make it easier to debug.
But I'm not going to lead you through how to view text like that.  You need to be prepared to do a little background reading.



vinithaba...@gmail.com

Jul 10, 2018, 1:43:09 AM
to kaldi-help
Thanks for the reply. I will look into it and post an update if I still have a problem.

Thanks,
vinitha


vinithaba...@gmail.com

Jul 10, 2018, 4:35:49 AM
to kaldi-help
My error got resolved. Thanks for the help.



Regards
Vinitha

mingi1...@gmail.com

Mar 25, 2019, 5:43:48 AM
to kaldi-help
Dear Vinitha,

I'm a Kaldi beginner and I have the identical problem.
How did you resolve the error?

Best,
Ming I

On Tuesday, July 10, 2018 at 4:35:49 PM UTC+8, vinithaba...@gmail.com wrote:

Jonathan K

Mar 25, 2019, 11:20:31 AM
to kaldi-help
Did you use the PR Povey suggested? https://github.com/kaldi-asr/kaldi/pull/2541
It should show you in which line the problem is.
Note that the character may be a non-breaking space (NBSP): it looks like a regular space but it is not. A regular space has the character code 32 (U+0020), while the non-breaking space has the character code 160 (U+00A0).
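
One way to locate such characters is sketched below in Python 3. This is only an illustration and not part of Kaldi: the function names `find_odd_whitespace` and `scan_file` are hypothetical, and the set of "allowed" characters (ASCII space, tab, newline) is an assumption rather than the exact rule validate_text.pl applies.

```python
import unicodedata

# Assumption: only these whitespace characters are considered harmless.
ALLOWED = {" ", "\t", "\n"}


def find_odd_whitespace(lines):
    """Yield (line, column, code point, name) for each suspicious character.

    A character is reported if it is whitespace other than ASCII space,
    tab, or newline (for example U+00A0 or a carriage return), or if it
    is an invisible format character such as U+FEFF (the BOM).
    """
    for lineno, line in enumerate(lines, start=1):
        for col, ch in enumerate(line, start=1):
            if ch in ALLOWED:
                continue
            # str.isspace() covers Unicode whitespace such as NBSP;
            # category "Cf" covers invisible format characters like U+FEFF.
            if ch.isspace() or unicodedata.category(ch) == "Cf":
                name = unicodedata.name(ch, "UNKNOWN")
                yield lineno, col, f"U+{ord(ch):04X}", name


def scan_file(path):
    """Print every suspicious character found in the UTF-8 file at `path`."""
    # newline="" keeps any carriage returns in the data so they are reported.
    with open(path, encoding="utf-8", newline="") as f:
        for hit in find_odd_whitespace(f):
            print(*hit)
```

Calling `scan_file("data/test_clean/text")` would then list the line and column of each offending character, which should show whether the problem is an NBSP, a BOM, or Windows line endings.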

Daniel Povey

Mar 25, 2019, 11:22:43 AM
to kaldi-help
I suspect the original poster's problem was that the file, which was supposed to be called 'text', was named 'text.txt'. Most likely this other person's issue is different, but the error message may have been similar.


mingi1...@gmail.com

Mar 25, 2019, 8:43:46 PM
to kaldi-help
Sure, I did.
The first line was reported as invalid.
I temporarily deleted the first line and ran the program again, and then the next line was reported with the same error.
It seems that disallowed spaces are all over the file.


On Monday, March 25, 2019 at 11:20:31 PM UTC+8, Jonathan K wrote:

Alex Hung

Jun 11, 2019, 11:44:31 PM
to kaldi-help
I think there is a BOM (byte order mark) header in your file.
Try dos2unix or sed '1s/^\xEF\xBB\xBF//'

Alex
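
For anyone without dos2unix or GNU sed, the same BOM removal can be sketched in Python 3. This is a minimal illustration under the assumption that the file is UTF-8; the function name `strip_utf8_bom` is hypothetical.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return `data` without a leading UTF-8 byte order mark (EF BB BF)."""
    # codecs.BOM_UTF8 is the three-byte sequence b"\xef\xbb\xbf".
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data
```

Reading the file in binary mode, passing the bytes through `strip_utf8_bom`, and writing the result to a new file leaves everything except the first three bytes untouched.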

On Tuesday, March 26, 2019 at 8:43:46 AM UTC+8, mingi1...@gmail.com wrote:

郝竹林

Jun 12, 2019, 11:17:11 AM
to kaldi-help

To fix this error, you need to convert your original text to Linux newlines.

Run dos2unix on your transcript.txt.

The error comes from the difference between Linux and Windows newlines (line endings).


On Tuesday, July 10, 2018 at 12:16:46 PM UTC+8, vinithaba...@gmail.com wrote:

Nani

Mar 2, 2020, 4:56:43 AM
to kaldi-help
Hi, can you please tell me how you resolved this problem? I am unable to resolve it.

Amol Bole

Mar 4, 2022, 2:13:27 AM
to kaldi-help
My problem was resolved by using the command below.


tr -d '\r' < text  > text_new
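
For reference, a rough Python 3 equivalent of the tr command above, in case tr is not available (for example on Windows). The function name `strip_carriage_returns` is hypothetical, and like the tr command it deletes every carriage return, not only those at line ends.

```python
def strip_carriage_returns(data: bytes) -> bytes:
    """Delete every CR byte (0x0D), as `tr -d '\\r'` does."""
    # Working on bytes avoids any newline translation by Python itself.
    return data.replace(b"\r", b"")
```

As with the tr command, it is safer to write the result to a new file (such as text_new) and only replace the original once validate_text.pl passes.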

