Tesseract 3.02 does not detect inter-word spacing for Bengali language.


Tawfiq Chowdhury

May 15, 2015, 4:12:48 PM
to tesser...@googlegroups.com

I am developing traineddata for the Bengali language. The problem is that Tesseract does not recognize most spaces in the input file and keeps almost all of the characters of the input image together, producing one long word instead of separate words and sentences. This is with a large traineddata, where it detects some spaces; with a small traineddata, it detects none. To test whether Tesseract detects spacing at all, I made an English traineddata with only the 26 letters of the English alphabet: it detects spaces for English but not for Bengali. I am using the 3.02.02 Windows installer. Please tell me which configuration to edit to make this work. Here are some Bengali characters as an example:

আ মা দে র দে শে র না ম বা লা দে শ

The input text in an image file might look like this: আমাদের দেশের নাম বালাদেশ

However, Tesseract generates output like this: আমাদেরদেশেরনামবালাদেশ
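
To make the symptom measurable across different traineddata files, here is a minimal Python sketch. It is not part of Tesseract, and `space_recall` is a hypothetical helper name chosen for illustration. It reports what fraction of the expected inter-word spaces survive in the OCR output, assuming the two strings differ only in whitespace, as in the sample above.

```python
def space_recall(expected: str, actual: str) -> float:
    """Fraction of the expected inter-word spaces that survive in the OCR output.

    Assumption: both strings contain the same characters apart from spaces.
    Returns 1.0 when the expected text contains no spaces at all.
    """
    expected_spaces = expected.count(" ")
    if expected_spaces == 0:
        return 1.0
    actual_spaces = actual.count(" ")
    # Cap at the expected count so spurious extra spaces cannot push the score above 1.
    return min(actual_spaces, expected_spaces) / expected_spaces


# Sample strings taken from the post: ground truth vs. Tesseract output.
expected = "আমাদের দেশের নাম বালাদেশ"
actual = "আমাদেরদেশেরনামবালাদেশ"

print(space_recall(expected, actual))  # 0.0: none of the 3 spaces were detected
```

Running the same check on the output of each traineddata (large, small, English-only) would give a simple number to compare when changing the configuration.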

I am doing my thesis on this and need help urgently. Thanks in advance. Is there a 3.03 or 3.04 build for Windows? I heard there is a 3.03 beta version.

ShreeDevi Kumar

May 19, 2015, 11:01:29 AM
to tesser...@googlegroups.com
Please try the VietOCR GUI frontend for Tesseract OCR, available from http://vietocr.sourceforge.net/
It uses a newer version of Tesseract.

You can also try using the Bengali traineddata available on the Tesseract site - 

or


ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
