Script Detection

rkvsraman

Nov 8, 2016, 1:38:20 AM

to tesseract-ocr

Hello,

I tried to detect the script of the attached Bengali image (ben.png) with the command

tesseract ben.png bensc -psm 0

and I get the following output in bensc.osd, which detects the script as Latin.

Page number: 0

Orientation in degrees: 90

Rotate: 270

Orientation confidence: 1.48

Script: Latin

Script confidence: 2.35

What do I need to do to make it detect the script as Bengali?

Thanks.

-Raman

ben.png
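
For anyone scripting around this, the `.osd` output quoted above is a list of `Name: value` lines, so the script name and its confidence can be read back with a small helper. This is only a sketch: `osd_field` is a hypothetical helper name, the sample data is copied from the post above rather than produced by a real run, and the thread does not say what confidence value should be considered trustworthy.

```shell
# osd_field FIELD FILE: print the value of one "Name: value" line
# from an .osd file (hypothetical helper, not part of tesseract).
osd_field() {
    awk -F': ' -v key="$1" '$1 == key { print $2; exit }' "$2"
}

# Sample data copied from the output quoted in the post above.
# A real run would read bensc.osd instead of this temporary file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
Page number: 0
Orientation in degrees: 90
Rotate: 270
Orientation confidence: 1.48
Script: Latin
Script confidence: 2.35
EOF

script=$(osd_field "Script" "$sample")
conf=$(osd_field "Script confidence" "$sample")
echo "script=$script confidence=$conf"
rm -f "$sample"
```

With the real file this would be `osd_field "Script" bensc.osd`. Printing the confidence next to the script name makes it easier to see how sure the detection was.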

Debanjan Basu

Nov 29, 2016, 9:05:56 AM

to tesseract-ocr

Hi,
I just started playing around with tesseract an hour ago, and I tried Bengali first too. I do not actually know how to make it work yet.
But I shall tell you what I think I know:
1. By default, tesseract looks for English/Latin characters. Use `tesseract --list-langs` to list the languages that are currently installed.
I get 3 on a fresh install from apt-get on Ubuntu 14.04:
    $ tesseract --list-langs
    List of available languages (3):
    eng
    osd
    equ

This makes sense, because the default `tessdata` directory contains exactly those traineddata files:
    $ ls /usr/share/tesseract-ocr/tessdata/ | grep traineddata$
    eng.traineddata
    equ.traineddata
    osd.traineddata
2. Clone the tessdata repository from GitHub (https://github.com/tesseract-ocr/tessdata).
3. Run tesseract with "-l ben" using the cloned tessdata directory:
    $ tesseract --list-langs --tessdata-dir $NEWTESSDATA

But even this crashes with the message:
   actual_tessdata_num_entries_ <= TESSDATA_NUM_ENTRIES:Error:Assert failed:in file tessdatamanager.cpp, line 53
   Segmentation fault (core dumped)

I also tried keeping only the single file ben.traineddata in the $NEWTESSDATA folder, but I do not yet understand how the arguments are meant to work.
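
The check described in step 1 can be scripted so that a missing language shows up before any OCR run. A minimal sketch, assuming the listing has one language code per line as in the output above; `has_lang` is a hypothetical helper name, and the sample listing is copied from this post rather than taken from a live install.

```shell
# has_lang LANG: succeed if LANG appears as a whole line on stdin
# (hypothetical helper; feed it the output of `tesseract --list-langs`).
has_lang() {
    grep -qx -- "$1"
}

# Listing copied from the post above, standing in for a live install.
listing='List of available languages (3):
eng
osd
equ'

for lang in osd ben; do
    if printf '%s\n' "$listing" | has_lang "$lang"; then
        echo "$lang: available"
    else
        echo "$lang: missing"
    fi
done
```

On a real system the call would be `tesseract --list-langs 2>&1 | has_lang ben`. The `2>&1` is there in case the installed version prints the list to stderr. Matching whole lines with `grep -x` avoids a false hit on a longer code that merely contains the one you asked for.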

Zdenko Podobný

Nov 29, 2016, 9:13:55 AM

to tesser...@googlegroups.com

On Tue, Nov 29, 2016 at 3:03 PM, Debanjan Basu <dbas...@gmail.com> wrote:

2. clone the tessdata repository from github (https://github.com/tesseract-ocr/tessdata)

This is totally the wrong approach!

First of all, if you installed tesseract with a package manager (apt-get), install the languages with the package manager too.

Next, cloning the whole tessdata repository (4213M of binary data) is useful only for those who create packages for distribution.

Finally, you should know what is in that repository: using tesseract with the wrong data version will cause a crash.

Zdenko

Debanjan Basu

Nov 29, 2016, 9:26:54 AM

to tesseract-ocr

@zdenop ah... great! That works!!

@rkvsraman that would be `sudo apt-get install tesseract-ocr tesseract-ocr-ben`, in case that wasn't clear!
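
The same install command can be built for other languages. This is a rough sketch only: it assumes the Debian/Ubuntu naming pattern `tesseract-ocr-<code>` seen above for `ben`, with underscores in the language code replaced by hyphens, and `pkg_for_lang` is a hypothetical helper name. Check the result with `apt-cache search tesseract-ocr` before relying on it.

```shell
# pkg_for_lang LANG: print the assumed apt package name for a
# tesseract language code (hypothetical helper; the naming pattern
# is an assumption based on tesseract-ocr-ben in the post above).
pkg_for_lang() {
    printf 'tesseract-ocr-%s\n' "$(printf '%s' "$1" | tr '_' '-')"
}

pkg_for_lang ben    # prints tesseract-ocr-ben
```

The install step would then be `sudo apt-get install tesseract-ocr "$(pkg_for_lang ben)"`, which keeps the language data at the version that matches the packaged tesseract.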
