Re: [tesseract-ocr] Training data gets worse as I add characters

ShreeDevi Kumar

Nov 21, 2014, 10:55:19 PM

to tesser...@googlegroups.com

Hi,

Have you added the fonts to the font_properties file?

Try removing the 'narrow' font from your training set.

Test with just one or two similar fonts and see if results are better.

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Sat, Nov 22, 2014 at 7:11 AM, Ryan Dev <software.de...@gmail.com> wrote:

I am trying to cover as many of the Latin Unicode characters in the BMP as I can.

What I find is that as I add more characters, the OCR results get worse.

For example, instead of getting the correct ö, I get Ö, and then, as I added more characters, the latest result is Ṏ.

In other words, not only is it getting worse at detecting capitalization correctly, it is also favoring more complex characters over simpler ones! This is just one example; another is getting Ȧ instead of the correct A.
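To make the confusions above concrete, here is a minimal sketch that uses Python's standard `unicodedata` module to break each of the characters mentioned into its base letter and combining marks. The helper name `describe` is hypothetical and this is not part of Tesseract; it only shows that the wrong results share a base letter with the correct one and differ in case or in stacked diacritics.

```python
import unicodedata

def describe(ch):
    """Return (base letter, combining mark names, base is uppercase) for one character."""
    # NFD splits a precomposed character into base letter + combining marks.
    decomposed = unicodedata.normalize("NFD", ch)
    base = decomposed[0]
    marks = [unicodedata.name(c) for c in decomposed[1:] if unicodedata.combining(c)]
    return base, marks, base.isupper()

# Characters from the post, written as escapes so the source encoding does not matter:
# U+00F6 (o with diaeresis), U+00D6, U+1E4E, U+0041 (A), U+0226
for ch in ["\u00f6", "\u00d6", "\u1e4e", "\u0041", "\u0226"]:
    base, marks, upper = describe(ch)
    print(f"U+{ord(ch):04X} {ch}: base={base} upper={upper} marks={marks}")
```

Grouping OCR errors this way (wrong case, extra mark, wrong base) may help tell which kind of confusion grows as the character set gets larger.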

When I train on a smaller set of characters I get better results (for the trained characters; the others are, of course, missed completely).

Should I be building several smaller traineddata files instead? That would reduce performance, but I need accuracy most of all. I've also had problems where the reported confidence is high for incorrect results and lower for correct ones.

I'm using the latest Tesseract checkout on Ubuntu, with the tesstrain.sh script.

Linked below are the files I'm using, a sample image, the traineddata, and an example image that I run OCR on.

https://drive.google.com/folderview?id=0B5ebDnF6cn8UTVhBc25OOV9JYTg&usp=sharing

The Unicode ranges I am trying to train for at the moment are:

0000 - 007f Basic Latin
0080 - 00ff Latin-1 Supplement
0100 - 017f Latin Extended-A
0180 - 024f Latin Extended-B
1e00 - 1eff Latin Extended Additional
2500 - 2594 Box Drawing and Block Elements
fb00 - fb06 Latin ligatures (Alphabetic Presentation Forms)
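As a rough illustration of how large this character set is, the following sketch enumerates the ranges listed above and keeps only code points that are assigned and visible. The names `RANGES` and `printable_chars` are hypothetical, and the filtering rule (skip control, format, unassigned and separator characters) is an assumption about what belongs in training text, not something tesstrain.sh does.

```python
import unicodedata

# The ranges from the list above, as inclusive (start, end) code point pairs.
RANGES = [
    (0x0000, 0x007F),  # Basic Latin
    (0x0080, 0x00FF),  # Latin-1 Supplement
    (0x0100, 0x017F),  # Latin Extended-A
    (0x0180, 0x024F),  # Latin Extended-B
    (0x1E00, 0x1EFF),  # Latin Extended Additional
    (0x2500, 0x2594),  # Box Drawing and part of Block Elements
    (0xFB00, 0xFB06),  # Latin ligatures
]

def printable_chars(ranges):
    """Collect characters in the given ranges, skipping invisible ones."""
    chars = []
    for start, end in ranges:
        for cp in range(start, end + 1):
            ch = chr(cp)
            # General categories starting with "C" are control, format or
            # unassigned; those starting with "Z" are separators (spaces).
            if unicodedata.category(ch)[0] in ("C", "Z"):
                continue
            chars.append(ch)
    return chars

chars = printable_chars(RANGES)
print(len(chars), "candidate characters")
```

Splitting `RANGES` into subsets and training on each one separately would be one way to find the point at which accuracy starts to drop.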

I am using the following fonts for training:
arial unicode ms
freeserif
liberation mono
liberation sans
liberation sans narrow condensed
liberation serif
segoe ui

I can certainly add more if that helps, but so far adding fonts just means it takes longer to realize how bad the trained data is.

If you are wondering why I am doing this, it is because I am trying to create a language-agnostic solution. You can see a test image in the link above; I am only looking at individual font glyphs, not doing full-page OCR.

Any suggestions/advice appreciated!

Ryan Dev

Nov 24, 2014, 12:49:01 PM

to tesser...@googlegroups.com

Yes, all the fonts are there, but I set all the flags to zero because I read somewhere that they don't really do anything anymore, and I was in a rush.

Do those flags in the font_properties file make a difference?
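For reference, each line of a Tesseract 3.x font_properties file has the form `<fontname> <italic> <bold> <fixed> <serif> <fraktur>`, where the five flags are 0 or 1 and the font name contains no spaces. The sketch below formats such lines. The helper name `font_properties_line`, the replacement of spaces with underscores, and the example flag values are assumptions guessed from the font names; they are not taken from the files in this thread, and the sketch does not answer whether the flags affect accuracy.

```python
def font_properties_line(name, italic=0, bold=0, fixed=0, serif=0, fraktur=0):
    """Format one font_properties line: name followed by five 0/1 flags."""
    flags = (italic, bold, fixed, serif, fraktur)
    if any(flag not in (0, 1) for flag in flags):
        raise ValueError("each flag must be 0 or 1")
    # Assumption: spaces in the font name are written as underscores.
    safe_name = name.replace(" ", "_")
    return " ".join([safe_name] + [str(flag) for flag in flags])

# Illustrative flag values only, guessed from the font names.
examples = [
    font_properties_line("Liberation Sans"),
    font_properties_line("Liberation Serif", serif=1),
    font_properties_line("Liberation Mono", fixed=1),
]
print("\n".join(examples))
```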

I'll try your other suggestions and let you know, thanks.
