Does unicharset affect recognition quality?

Yury

Aug 24, 2017, 3:06:52 PM
to tesseract-ocr
I think not.

I call tesseract 5.03 from Python under Windows 8 to recognize Kannada text.
Recognition quality is acceptable, at about 80%. However, some symbols are split into two halves: one half is recognized correctly, and the other is replaced by ಲ.
Example: ಕಾಂ (one character) is recognized as ಕಾಲ (two characters), ನಿಂ is recognized as ನಿಲ, and so on, although the separate characters ಕಾ, ನಿ, ... are recognized correctly.
I unpacked the .unicharset file from kan.traineddata and tried to correct the character parameters.
I summed the widths of both characters in a pair, added some gap, and put the result into min/max width (with some deviation). I also adjusted the other min/max parameters, based on the characters that are recognized well.
After that I overwrote the unicharset in the existing traineddata and saw no difference.
I tried many values and saw no change in recognition.
In the end I put ten zeros (0,0,0,0,...) into the parameters of the ಲ character, and the result is the same (ಲ is recognized as usual).
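
For readers who want to repeat the unpack-and-overwrite step described above from Python, here is a minimal sketch. It assumes the combine_tessdata tool that ships with tesseract is on the PATH. The file name kan.traineddata and the work/kan. prefix are placeholders, and the helper function names are invented for illustration. The replies further down explain why editing the unicharset this way is not expected to change recognition.

```python
import shutil
import subprocess


def unpack_cmd(traineddata, prefix):
    """Command that extracts every component, e.g. work/kan.unicharset."""
    return ["combine_tessdata", "-u", traineddata, prefix]


def overwrite_cmd(traineddata, component_file):
    """Command that writes one edited component back into the traineddata."""
    return ["combine_tessdata", "-o", traineddata, component_file]


def run_if_available(cmd):
    """Run cmd only when the tool is installed; otherwise just print it."""
    if shutil.which(cmd[0]) is None:
        print("tool not found, would run:", " ".join(cmd))
        return False
    subprocess.run(cmd, check=True)
    return True


# Placeholder paths (assumptions, not taken from the thread).
steps = [
    unpack_cmd("kan.traineddata", "work/kan."),
    # ... edit work/kan.unicharset by hand between the two steps ...
    overwrite_cmd("kan.traineddata", "work/kan.unicharset"),
]
for step in steps:
    print(" ".join(step))
```

Nothing is executed at import time; pass each entry of steps to run_if_available() to actually run it, ideally on a copy of the traineddata file.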

I think that in the new version of tesseract, recognition quality does not depend on the unicharset parameters.

So, how can I tune tesseract?
Are there any other ways to control tesseract?
I don't want to retrain tesseract from scratch, because I don't have a large text containing all the characters (my unicharset has 2851 characters).

On the other hand, I noticed that only characters with a Unicode length of 1 or 2 bytes are recognized correctly. Characters that are 3 or more bytes long are not always recognized.
Are there any additional parameters to remove the limit on the number of bytes per symbol?
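
One possible reading of the byte-length observation is that it is really about the number of Unicode code points in a cluster, since every code point in the Kannada block (U+0C80 to U+0CFF) already takes three bytes in UTF-8. That reading is an interpretation, not something confirmed in this thread. The sketch below assumes the clusters are encoded as the escape sequences shown, and prints both counts so they can be compared with what tesseract outputs.

```python
import unicodedata


def describe(cluster):
    """Return (code point, Unicode name, UTF-8 byte count) for each code point."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch), len(ch.encode("utf-8")))
        for ch in cluster
    ]


# Assumed encodings of the clusters discussed in the post.
samples = {
    "ka + aa": "\u0c95\u0cbe",
    "ka + aa + anusvara": "\u0c95\u0cbe\u0c82",
    "la": "\u0cb2",
}

for label, text in samples.items():
    # len(text) counts code points; the encoded length counts UTF-8 bytes.
    print(label, "code points:", len(text), "utf-8 bytes:", len(text.encode("utf-8")))
    for row in describe(text):
        print("   ", *row)
```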

Yury

Aug 25, 2017, 12:02:30 AM
to tesseract-ocr
Sorry, the tesseract version is 3.05.01.


Yury

Aug 25, 2017, 4:18:24 AM
to tesseract-ocr
I can add the following.
When I accidentally made a mistake in the unicharset and wrote it back into the traineddata, only Latin letters and digits were recognized (I use -l kan+eng).
So the unicharset itself is valid, and the recognizer does read it when needed.



ShreeDevi Kumar

Aug 25, 2017, 5:07:49 AM
to tesser...@googlegroups.com
Have you tried the new tessdata/best/*.traineddata with the latest github sources?

Yury

Aug 25, 2017, 6:47:56 AM
to tesseract-ocr
Hello, shree!

Can you tell me the exact path for tessdata/best/*.traineddata?


Yury

Aug 25, 2017, 7:12:41 AM
to tesseract-ocr
Hello again.

I found this: https://github.com/tesseract-ocr/tessdata/blob/master/best/Kannada.traineddata

But after recognition I see only English letters and digits, so this did not work.
In the log I see:

I have 3.05.



ShreeDevi Kumar

Aug 25, 2017, 7:52:48 AM
to tesser...@googlegroups.com

The latest GitHub source in the master branch is for 4.0alpha. You can install it via PPA.

Search Google for: tesseract PPA Alex

_sent from phone


--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscribe@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/b20f906b-db90-43f1-b9c6-b1bb40d21414%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

ShreeDevi Kumar

Aug 25, 2017, 7:56:25 AM
to tesser...@googlegroups.com
https://launchpad.net/~alex-p/+archive/ubuntu/tesseract-ocr

For the PPA.

Yury

Aug 25, 2017, 9:01:20 AM
to tesseract-ocr
Hello shree!

Thanks for your links and for taking the time.

I did not find a /best/ folder in the ~alex-p profile.
But I found kan.traineddata in the package tesseract-lang-4.00 (in tesseract-lang-3.05 the Kannada language is absent).
I took this file and ran recognition, and the result is the same.
This package is dated 08.01.17 and has 2851 characters (the same as mine).
So I think I was already using this package earlier.

ShreeDevi Kumar

Aug 25, 2017, 9:13:22 AM
to tesser...@googlegroups.com
If you are using 4.0alpha, the latest version of the program, you can use the Kannada traineddata from

or

I have not tested Kannada personally, but if it follows the pattern for Devanagari, it should be better than the older traineddata.

If you are using version 3.05 of the program,
then use the traineddata files from

Please note that the unicharset and langdata files are used during training, and just changing the unicharset file is NOT going to improve recognition.

For that, training needs to be done. Please see the wiki for more details.

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com


Yury

Aug 25, 2017, 12:16:46 PM
to tesseract-ocr
ShreeDevi,

Thanks for your answers and for taking the time.

I got the traineddata file for version 3.04 (the file is a little smaller, but the number of characters is the same, 2851) and got the same result: some symbols are split into a pair (the first is correct and the second is wrong).
I am thinking of upgrading to 4.00, so I have a question:

Can I install the new version alongside 3.05, without uninstalling it?

And another question, from my first post:
Does tesseract have a limit on the number of bytes per character in Unicode?
Are there any additional parameters to remove the limit on the number of bytes per symbol?


ShreeDevi Kumar

Aug 25, 2017, 1:23:49 PM
to tesser...@googlegroups.com

I do not know about the internal workings of tesseract.

If you unpack best/kan.traineddata, you may find a smaller unicharset with just the basic characters in it.

Tesseract 4 uses the LSTM neural net engine, versus the legacy engine in 3.05. LSTM does line-based recognition rather than character-based recognition.

Yes, it is possible to have both versions installed; however, I do not have exact instructions to make it work. It would also depend on which OS you are using.

I only have the latest GitHub version installed.



Yury

Sep 11, 2017, 1:08:32 AM
to tesseract-ocr
Thanks for your hint.

I installed Cygwin and compiled tesseract 4.0 under Cygwin. Quality has improved significantly.
However, there is another problem.
With OEM mode 1 or 3 everything works fine. When I choose mode 0 or 2, I get this error:

Failed loading language 'kan'
Tesseract couldn't load any languages!
Could not initialize tesseract.

I set TESSDATA_PREFIX to "/usr/share/tessdata". It contains the eng, kan, Kannada and osr traineddata files obtained from the best directory.
What could be the problem? Do these modes not work in version 4?

tesseract 4.00.00alpha
 leptonica-1.74.4
  libgif 4.1.6(?) : libjpeg 8d (libjpeg-turbo 1.5.0) : libpng 1.6.30 : libtiff 4.0.7 : zlib 1.2.11 : libwebp 0.4.4 : libopenjp2 2.1.2

 Found AVX
 Found SSE
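
The OEM modes mentioned in this message are selected on the command line. The sketch below builds such a command from Python. It assumes the --oem option of the tesseract 4 command-line tool; the image name page.png and the helper name ocr_cmd are hypothetical. The mode descriptions in the code are the commonly documented ones for version 4 and should be checked against the installed build.

```python
# Commonly documented meanings of the engine modes (assumption for this build).
OEM_MODES = {
    0: "legacy engine only",
    1: "LSTM neural net engine only",
    2: "legacy and LSTM engines combined",
    3: "default, based on what is available",
}


def ocr_cmd(image, lang="kan+eng", oem=1):
    """Build a tesseract command that writes recognized text to stdout."""
    if oem not in OEM_MODES:
        raise ValueError(f"unknown OEM mode: {oem}")
    return ["tesseract", image, "stdout", "-l", lang, "--oem", str(oem)]


# Print one command per mode; nothing is executed here.
for mode, meaning in OEM_MODES.items():
    print(mode, meaning, "->", " ".join(ocr_cmd("page.png", oem=mode)))
```

Each command list can be handed to subprocess.run(cmd, capture_output=True, text=True) once tesseract and the traineddata files are in place.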


Yury

Sep 11, 2017, 1:12:10 AM
to tesseract-ocr
I forgot to add:
During the make stage under Cygwin I could not run the command "sudo ldconfig".
I don't think that is essential, though, since modes 1 and 3 work fine.


ShreeDevi Kumar

Sep 11, 2017, 1:23:41 AM
to tesser...@googlegroups.com
The best traineddata files do not include the models for the legacy engine; that is why OEM modes 0 and 2 do not work.

You can try the regular 4.0 traineddata, which has both models but may not be as accurate.

The best way to check is to use the combine_tessdata command to unpack a traineddata file and see which components it contains.
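
Once a traineddata file has been unpacked, the component check can be sketched in Python. The suffix sets below are an assumption about which components belong to which engine (.inttemp, .pffmtable and .normproto for the legacy classifier, .lstm for the neural net), and the two file listings are made up for illustration rather than copied from real files.

```python
from pathlib import Path

# Assumed mapping of component suffixes to engines.
LEGACY_SUFFIXES = {".inttemp", ".pffmtable", ".normproto"}
LSTM_SUFFIXES = {".lstm"}


def engines_in(filenames):
    """Report which engines the unpacked component files appear to support."""
    suffixes = {Path(name).suffix for name in filenames}
    return {
        "legacy": LEGACY_SUFFIXES <= suffixes,   # all legacy parts present
        "lstm": bool(LSTM_SUFFIXES & suffixes),  # LSTM model present
    }


# Illustrative listings only, not the actual contents of any traineddata.
lstm_only = ["kan.lstm", "kan.lstm-unicharset", "kan.version"]
both = ["kan.unicharset", "kan.inttemp", "kan.pffmtable", "kan.normproto", "kan.lstm"]

print("lstm_only:", engines_in(lstm_only))
print("both:", engines_in(both))
```

To check real files, pass the names from the directory where the components were unpacked, for example [p.name for p in Path("work").iterdir()].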

ShreeDevi

Yury

Sep 11, 2017, 5:08:58 AM
to tesseract-ocr
Thank you for the detailed answer.
I think my customers will be satisfied with the quality from the neural net.
