Script Detection

117 views
Skip to first unread message

rkvsraman

unread,
Nov 8, 2016, 1:38:20 AM11/8/16
to tesseract-ocr
Hello, 


I tried to detect the script of the above bengali image with command

tesseract ben.png bensc -  -psm 0


and i get following output in bensc.osd  which detects the the script as Latin.


Page number: 0
Orientation in degrees: 90
Rotate: 270
Orientation confidence: 1.48
Script: Latin
Script confidence: 2.35


What do i need to do to make it detect it as Bengali. 

Thanks.

-Raman
ben.png

Debanjan Basu

unread,
Nov 29, 2016, 9:05:56 AM11/29/16
to tesseract-ocr
Hi,
I just started playing around with tesseract an hour ago - and I tried bengali first too. I do not actually know how to make it work yet.
But I shall tell you what I think I know -
1. The default characters tesseract looks for are english/latin. Use `tesseract --list-langs` for a list of supported languages by default.
I get 3 on a fresh install from apt-get in Ubuntu 14.04
    $tesseract --list-langs
    List of available languages (3):
    eng
    osd
    equ

This makes sense because the default `tessdata` directory has those traineddata files
    $ ls /usr/share/tesseract-ocr/tessdata/ | grep traineddata$
    eng.traineddata
    equ.traineddata
    osd.traineddata
2. clone the tessdata repository from github (https://github.com/tesseract-ocr/tessdata)
3. run tesseract with "-l ben" from the tessdata directory -
    $ tesseract --list-langs --tessdata-dir $NEWTESSDATA

but even this crashes with message
   actual_tessdata_num_entries_ <= TESSDATA_NUM_ENTRIES:Error:Assert failed:in file tessdatamanager.cpp, line 53
   Segmentation fault (core dumped)

I played around with keeping only one file ben.traineddata in the $NEWTESSDATA folder, but I do not know what the design of the arguments is till now.

Zdenko Podobný

unread,
Nov 29, 2016, 9:13:55 AM11/29/16
to tesser...@googlegroups.com

On Tue, Nov 29, 2016 at 3:03 PM, Debanjan Basu <dbas...@gmail.com> wrote:
2. clone the tessdata repository from github (https://github.com/tesseract-ocr/tessdata)

This is totally wrong approach!
First of all - if you installed tesseract with packager (apt-get) - install also languages with packager
Next cloning of all tessdata repository (4213M of binary data) is useful only for those who will create packages for distribution
Next you should know that is repository - using tesseract with wrong data version will cause crash)

Zdenko

Debanjan Basu

unread,
Nov 29, 2016, 9:26:54 AM11/29/16
to tesseract-ocr
@zdenop ah... great! That works!!

@rkvsraman that would be ` sudo apt-get install tesseract-ocr  tesseract-ocr-ben`, if that wasn't clear!
Reply all
Reply to author
Forward
0 new messages