add new characters


Timo Struppi

Oct 24, 2020, 8:01:18 AM
to tesseract-ocr

Hello,

I don't want to reinvent the wheel by creating a new language, but how do I add the letters  and ī to the OCR?

I have tried a lot over several days (VietOCR, Linux Intelligent OCR Solution, the few available tutorials, etc.), but I still have not managed to add a single letter.


Many thanks in advance

Shree Devi Kumar

Oct 24, 2020, 9:12:15 AM
to tesseract-ocr
Ray has suggested using plus-minus type of training for adding a couple of characters to the traineddata. Did you try that?
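For readers unfamiliar with it, the plus-minus approach can be sketched roughly as below. This is a minimal sketch, not the exact procedure from this thread: all paths, the base language `eng`, and the iteration count are hypothetical placeholders, and it assumes a float model from tessdata_best, since the integer "fast" models cannot be used as a starting point for fine-tuning. The script only prints the two commands so you can review them before running anything.

```shell
#!/bin/sh
# Sketch only: every path and value below is a hypothetical placeholder.
BASE_LANG="${BASE_LANG:-eng}"
BEST_DIR="${BEST_DIR:-$HOME/tessdata_best}"   # float (best) models live here
WORK_DIR="${WORK_DIR:-$HOME/plusminus}"       # scratch directory for training

plusminus_commands() {
  # Step 1: extract the LSTM network from the existing traineddata.
  echo "combine_tessdata -e $BEST_DIR/$BASE_LANG.traineddata $WORK_DIR/$BASE_LANG.lstm"
  # Step 2: continue training from that network. --old_traineddata is the
  # original file; --traineddata is a starter traineddata whose unicharset
  # already contains the extra characters. 3600 iterations is an arbitrary
  # example value.
  echo "lstmtraining --continue_from $WORK_DIR/$BASE_LANG.lstm --old_traineddata $BEST_DIR/$BASE_LANG.traineddata --traineddata $WORK_DIR/$BASE_LANG/$BASE_LANG.traineddata --train_listfile $WORK_DIR/$BASE_LANG.training_files.txt --model_output $WORK_DIR/plusminus --max_iterations 3600"
}

# Print the commands instead of executing them.
plusminus_commands
```

The training list file and the starter traineddata have to be generated from text that actually contains the new characters; that step is not shown here.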

Please share the training data you used (box/tiff pairs or lstmf files).

I have done replace-a-layer training for Sanskrit. The resulting traineddata adds the two characters you want (in addition to many others required for Sanskrit transliteration). See the sample image and the attached output. The file is available at https://github.com/Shreeshrii/tess5training-sanskrit-iast/tree/main/tessdata/fast

 



kamakoti4-Sanskrit-1017-fast-tmp.txt
kamakoti4.png

Timo Struppi

Oct 24, 2020, 11:53:55 AM
to tesseract-ocr
Perfect! Thank you very much <3 That's what I was looking for: the International Alphabet of Sanskrit Transliteration characters.

Can you tell me in which folder I must place the .traineddata file?

My configuration:
tesseract 4.1.1
 leptonica-1.79.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 2.0.3) : libpng 1.6.37 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.1
 Found AVX
 Found SSE
 Found libarchive 3.4.0 zlib/1.2.11 liblzma/5.2.4 bz2lib/1.0.8 liblz4/1.9.2 libzstd/1.4.4

Many thanks again for your fast help

Timo Struppi

Oct 27, 2020, 5:34:01 PM
to tesseract-ocr
Help! I get the following error. What am I doing wrong?

Error opening data file /usr/share/tesseract-ocr/4.00/tessdata/Sanskrit-1017-fast.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
Failed loading language 'Sanskrit-1017-fast'
Tesseract couldn't load any languages!
Could not initialize tesseract.

shree

Oct 27, 2020, 10:06:18 PM
to tesseract-ocr
Did you copy the traineddata file to /usr/share/tesseract-ocr/4.00/tessdata?
What's the value of TESSDATA_PREFIX in your 'env' output?

What's the output of the following commands?

ls -l /usr/share/tesseract-ocr/4.00/tessdata/Sanskrit-1017-fast.traineddata  

combine_tessdata -d  /usr/share/tesseract-ocr/4.00/tessdata/Sanskrit-1017-fast.traineddata 

tesseract --list-langs --tessdata-dir /usr/share/tesseract-ocr/4.00/tessdata

tesseract --list-langs

tesseract -v
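
The checks above can be bundled into one small script. This is only a sketch: the function name `check_traineddata` and the `TESSDATA_DIR` variable are made up for illustration, and the default directory is simply the one from the error message in this thread.

```shell
#!/bin/sh
# check_traineddata DIR LANG
# Reports whether DIR/LANG.traineddata exists, is readable, and is non-empty.
# (Hypothetical helper, not part of Tesseract.)
check_traineddata() {
  dir=$1
  lang=$2
  file="$dir/$lang.traineddata"
  if [ ! -e "$file" ]; then
    echo "missing: $file"
    return 1
  fi
  if [ ! -r "$file" ]; then
    echo "not readable: $file"
    return 2
  fi
  if [ ! -s "$file" ]; then
    echo "empty: $file"
    return 3
  fi
  echo "ok: $file"
  return 0
}

# Directory taken from the error message above; override it if yours differs.
TESSDATA_DIR="${TESSDATA_DIR:-/usr/share/tesseract-ocr/4.00/tessdata}"

check_traineddata "$TESSDATA_DIR" "Sanskrit-1017-fast" || true
echo "TESSDATA_PREFIX=${TESSDATA_PREFIX:-<unset>}"

# Only ask tesseract itself if it is installed and on the PATH.
if command -v tesseract >/dev/null 2>&1; then
  tesseract --list-langs --tessdata-dir "$TESSDATA_DIR" || true
fi
```

If the file is reported as missing or not readable, copying it again (or fixing its permissions) is the first thing to try before touching TESSDATA_PREFIX.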

Timo Struppi

Oct 28, 2020, 4:50:16 PM
to tesseract-ocr
Hello, problem solved.

I just did a Linux installation and the error was gone.

Thanks again for your file and help!

Timo Struppi

Oct 28, 2020, 4:50:59 PM
to tesseract-ocr
I meant a new and fresh Linux installation :)