I am developing traineddata for the Bengali language. The problem is that Tesseract does not recognize most of the spaces in the input and runs almost all the characters of the input image together into one long word instead of separate words and sentences. That is with a large traineddata, where it detects some spaces; with a small traineddata it detects none. To test whether Tesseract detects spacing at all, I made an English traineddata containing only the 26 letters of the English alphabet. Spacing is detected for English but not for Bengali. I am using the 3.02.02 Windows installer. Please tell me which configuration I need to edit to make this work. Here are some Bengali characters as a sample:
আ মা দে র দে শে র না ম বা লা দে শ
The text in an input image can look like this: আমাদের দেশের নাম বালাদেশ
However, Tesseract generates output like this: আমাদেরদেশেরনামবালাদেশ
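To put a number on the problem, here is a minimal Python sketch that measures how many of the expected word gaps survive in the OCR output. It is not part of Tesseract; the helper names `gap_positions` and `space_recall` are made up for illustration, and it assumes the characters themselves are recognized correctly so that the two strings differ only in spacing. The OCR output is simulated here by stripping the spaces from the expected text, which matches what I get for the sample above.

```python
def gap_positions(text: str) -> set:
    """Return, for each space, how many non-space characters precede it."""
    positions = set()
    seen = 0
    for ch in text:
        if ch == " ":
            positions.add(seen)
        else:
            seen += 1
    return positions


def space_recall(expected: str, ocr_output: str) -> float:
    """Fraction of the expected word gaps that also appear in the OCR output.

    Assumption: both strings contain the same characters once spaces
    are removed; otherwise the comparison is not meaningful.
    """
    if expected.replace(" ", "") != ocr_output.replace(" ", ""):
        raise ValueError("strings differ in more than spacing")
    expected_gaps = gap_positions(expected)
    if not expected_gaps:
        return 1.0  # nothing to detect, so nothing was missed
    found_gaps = gap_positions(ocr_output)
    return len(expected_gaps & found_gaps) / len(expected_gaps)


expected = "আমাদের দেশের নাম বালাদেশ"
# Simulated OCR result: the same text with every space dropped.
ocr_output = expected.replace(" ", "")
print(space_recall(expected, ocr_output))  # 0.0
```

A result of 1.0 would mean every word gap was detected; the small traineddata gives 0.0 on this sample.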
I am doing my thesis on this and need help urgently. Thanks in advance. Is there a 3.03 or 3.04 build for Windows? I heard there is a 3.03 beta version.