Hi,
I just started playing around with Tesseract an hour ago, and I tried Bengali first too. I do not actually know how to make it work yet.
But here is what I think I know:
1. By default tesseract only recognizes English/Latin characters. Use `tesseract --list-langs` to list the languages that are currently installed.
I get 3 on a fresh install from apt-get in Ubuntu 14.04:
$ tesseract --list-langs
List of available languages (3):
eng
osd
equ
This makes sense, because the default `tessdata` directory has exactly those traineddata files:
$ ls /usr/share/tesseract-ocr/tessdata/ | grep traineddata$
eng.traineddata
equ.traineddata
osd.traineddata
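The link between the two listings above can be sketched as a small shell function that derives language codes from the file names in a tessdata directory. The name `langs_in_dir` is made up for illustration. It only looks at file names and does not call tesseract, so it is a guess at the behaviour rather than a description of what tesseract does internally.

```shell
# Hypothetical helper: print the language codes implied by the
# *.traineddata files in a directory, one per line, sorted.
langs_in_dir() {
    dir="$1"
    for f in "$dir"/*.traineddata; do
        # Skip the unexpanded pattern when the directory has no matches.
        [ -e "$f" ] || continue
        basename "$f" .traineddata
    done | sort
}
```

On the install above, `langs_in_dir /usr/share/tesseract-ocr/tessdata` would print eng, equ and osd, the same three names that `--list-langs` reports.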
2. Clone the tessdata repository from GitHub (https://github.com/tesseract-ocr/tessdata).
3. Run tesseract with "-l ben" using the cloned tessdata directory (here $NEWTESSDATA). As a first check I only tried listing its languages:
$ tesseract --list-langs --tessdata-dir $NEWTESSDATA
but even this crashes with the message:
actual_tessdata_num_entries_ <= TESSDATA_NUM_ENTRIES:Error:Assert failed:in file tessdatamanager.cpp, line 53
Segmentation fault (core dumped)
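One plausible reading of this assertion (my assumption, nothing above confirms it) is that a traineddata file contains more components than the installed binary knows about, which could happen if the files in the repository target a newer tesseract release than the packaged one. A minimal sketch for finding out which version is installed: `tesseract --version` is a real flag, while `tesseract_version_from` is a hypothetical helper name.

```shell
# Hypothetical helper: read `tesseract --version` output on stdin and
# print just the version number from its first line.
tesseract_version_from() {
    sed -n '1s/^tesseract[[:space:]]*//p'
}

# Intended use (some releases print the version on stderr, hence 2>&1):
#   tesseract --version 2>&1 | tesseract_version_from
```

The result can then be compared with whatever release the tessdata repository says its files are meant for.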
I also tried keeping only the one file ben.traineddata in the $NEWTESSDATA folder, but so far I have not figured out how the arguments are meant to be used.
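The "only one file" experiment can be sketched as below, copying the language file into a scratch directory instead of deleting things from the clone. `isolate_lang`, `/tmp/ben-only`, `page.png` and `out` are all placeholder names of mine, and I have not verified that the commented tesseract call gets past the crash.

```shell
# Hypothetical helper: copy a single language file into its own
# directory so nothing else in the clone can interfere.
isolate_lang() {
    src="$1"; dst="$2"; lang="$3"
    [ -f "$src/$lang.traineddata" ] || {
        echo "no $lang.traineddata in $src" >&2
        return 1
    }
    mkdir -p "$dst" && cp "$src/$lang.traineddata" "$dst/"
}

# Intended use (page.png and out are placeholder names):
#   isolate_lang "$NEWTESSDATA" /tmp/ben-only ben
#   tesseract --tessdata-dir /tmp/ben-only page.png out -l ben
```

Keeping the clone untouched makes it easy to repeat the test with a different language code.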