Training data gets worse as I add characters


Ryan Dev

Nov 21, 2014, 8:41:34 PM
to tesser...@googlegroups.com
I am trying to cover as many of the Latin Unicode characters in the BMP as I can.

What I find is that as I add more characters, the OCR results get worse.

For example, instead of the correct ö I first got Ö, and after adding more characters the latest result is Ṏ.

In other words, not only is it getting worse at detecting capitalization, it is also favoring more complex characters over simpler ones! This is just one example; another is getting Ȧ instead of the correct A.
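To illustrate why these particular confusions are plausible, here is a minimal Python sketch (not part of my Tesseract setup, just standard-library `unicodedata`) that decomposes each character into its base letter and combining marks. The helper name `base_and_marks` is made up for this example. It shows that the wrong results differ from the expected ones only in case or in small diacritics stacked on the same base letter.

```python
import unicodedata

def base_and_marks(ch):
    """Split one character into its base letter and the names of its combining marks."""
    decomposed = unicodedata.normalize("NFD", ch)
    base = decomposed[0]
    marks = [unicodedata.name(c) for c in decomposed[1:] if unicodedata.combining(c)]
    return base, marks

# (expected, recognized) pairs taken from the examples above:
# ö vs Ö, ö vs Ṏ, A vs Ȧ
pairs = [("\u00f6", "\u00d6"), ("\u00f6", "\u1e4e"), ("A", "\u0226")]

for expected, got in pairs:
    print(expected, base_and_marks(expected), "->", got, base_and_marks(got))
```

If most errors turn out to be of this kind, comparing base letters and marks separately may help tell "wrong letter" apart from "wrong case" or "extra diacritic" when evaluating a traineddata file.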

When I train on a smaller character set I get better results for the trained characters (the others are, of course, missed completely).

Should I be building several smaller traineddata files instead? That will reduce performance, but I need accuracy most of all. I have also had cases where confidence is reported high on incorrect results and lower on correct ones.

I'm using the latest Tesseract checkout on Ubuntu, with the tesstrain.sh script.

Linked are the files I'm using: a sample training image, the traineddata, and an example image that I run OCR on.


The Unicode ranges I am trying to train for at the moment are:

0000 - 007f Basic Latin
0080 - 00ff Latin-1 Supplement
0100 - 017f Latin Extended-A
0180 - 024f Latin Extended-B
1e00 - 1eff Latin Extended Additional
2500 - 2594 Box Drawing and Block Elements
fb00 - fb06 Latin Ligatures
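
To give a sense of how large this character set is, here is a rough Python sketch that counts the visible characters in each range. It is only an illustration: the `RANGES` table mirrors the list above, `trainable_chars` is a made-up helper, and "trainable" is assumed to mean any assigned code point that is not a control, format, or space character. The exact counts depend on the Unicode version bundled with your Python.

```python
import unicodedata

# (start, end, label) for each range listed above; end is inclusive.
RANGES = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0080, 0x00FF, "Latin-1 Supplement"),
    (0x0100, 0x017F, "Latin Extended-A"),
    (0x0180, 0x024F, "Latin Extended-B"),
    (0x1E00, 0x1EFF, "Latin Extended Additional"),
    (0x2500, 0x2594, "Box Drawing and Block Elements"),
    (0xFB00, 0xFB06, "Latin Ligatures"),
]

def trainable_chars(start, end):
    """Return the characters in [start, end] that have a visible glyph.

    Assumption: skip general categories starting with 'C' (control, format,
    unassigned, ...) and 'Z' (separators such as space and no-break space).
    """
    return [
        chr(cp)
        for cp in range(start, end + 1)
        if unicodedata.category(chr(cp))[0] not in ("C", "Z")
    ]

total = 0
for start, end, label in RANGES:
    count = len(trainable_chars(start, end))
    total += count
    print(f"{start:04x} - {end:04x} {label}: {count}")
print("total:", total)
```

The same helper can be used to write the characters of a single range to a text file, which is one way to prepare the smaller per-range training sets mentioned above.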

I am using the following fonts for training:
arial unicode ms
freeserif
liberation mono
liberation sans
liberation sans narrow condensed
liberation serif
segoe ui

I can certainly add more if that helps, but so far adding fonts just means it takes longer to realize how bad the trained data is.

If you are wondering why I am doing this, it is because I am trying to create a language-agnostic solution. As the test image linked above shows, I am only looking at individual font glyphs, not doing full-page OCR.

Any suggestions/advice appreciated!






