Covering ASCII Extended range.


Ryan Dev

Nov 12, 2014, 7:41:58 PM
to tesser...@googlegroups.com
For the project I am working on, I need to do OCR on documents containing characters from the ISO 8859-1 "extended ASCII" range (0x20-0xFF).

I was wondering, does anyone have traineddata files for this? 

Or does anyone know which existing language traineddata files would cover this range (maybe English + Deutsch + ???)?

Thanks!


ShreeDevi Kumar

Nov 12, 2014, 10:58:39 PM
to tesser...@googlegroups.com
You can look at the unicharset of the traineddata to see the coverage.

Try with eng+deu+iast.

iast is a traineddata file that I generated for Sanskrit transliteration in Roman/Latin script.
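
A minimal sketch of such a coverage check, assuming the unicharset has already been unpacked to plain text. It assumes the usual layout (first line is the entry count, each following line starts with the character, and the space character is stored as NULL); the function names are made up for illustration. Note that 0x7F-0x9F are control codes in ISO 8859-1, so they are left out of the target set.

```python
def latin1_targets():
    """Non-control ISO 8859-1 characters: 0x20-0x7E and 0xA0-0xFF."""
    return {chr(cp) for cp in range(0x20, 0x100) if not 0x7F <= cp <= 0x9F}


def unicharset_symbols(text):
    """Collect the first field of every entry line of an unpacked unicharset."""
    symbols = set()
    for line in text.splitlines()[1:]:       # line 1 holds the entry count
        first = line.split(" ")[0]
        if first:
            # Assumption: the space character is written as the token NULL.
            symbols.add(" " if first == "NULL" else first)
    return symbols


def missing_latin1(*unicharset_texts):
    """Characters of the target range that none of the given unicharsets list."""
    covered = set()
    for text in unicharset_texts:
        covered |= unicharset_symbols(text)
    return sorted(latin1_targets() - covered)
```

Passing the contents of the eng, deu and iast unicharsets to missing_latin1() would list whatever the combination still lacks.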




ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at http://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/b3a8e8ee-00ca-4b3e-bdb7-06ea000e458c%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

ShreeDevi Kumar

Nov 13, 2014, 12:48:38 PM
to tesser...@googlegroups.com
Try with the attached traineddata file.


asc.traineddata
asc.unicharset

Ryan Dev

Nov 13, 2014, 3:11:34 PM
to tesser...@googlegroups.com
Wow! Awesome.

That file definitely helps. It fixed a few issues but introduced a few of its own, so currently I am running "eng+asc", which is giving great output and runs faster than "eng+deu".

Attached is an example image and the output using asc. Note that asc reads 'ü' as 'ū' and makes a few other errors that "deu" handles correctly. But it is still a huge help.

A BIG improvement is that it got '=' correctly, when all the other traineddata I tried, including math symbols, return it as ':' or worse. Thanks!

A couple of questions, to help me learn to fish, so to speak...
1. How do I find/get the unicharset file? I checked the English and German tessdata downloads and there is nothing.
2. How did you go about making the asc traineddata? I think I need to dive into this aspect of Tesseract. Do I follow these steps? https://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3. I am not interested in new languages, just in making one that covers extended ASCII, and then hopefully one day the Unicode BMP (0x0000 - 0xFFFF). But I am not sure how to go about that without a huge time sink.

FPDGIP+DekaFrutiger45Light_6_asc_ocr.txt
FPDGIP+DekaFrutiger45Light_6.tiff

ShreeDevi Kumar

Nov 13, 2014, 10:45:05 PM
to tesser...@googlegroups.com
The asc traineddata does not have a wordlist or dictionary, so using eng will help with that. Also, I trained using only a few fonts that support the whole range. If you train with the font you are using, you will get better results.

You can use the 'combine_tessdata' command with the -u (unpack) option to extract the unicharset from the traineddata. See http://manpages.ubuntu.com/manpages/utopic/man1/combine_tessdata.1.html
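
As a rough sketch, the unpack step could be scripted like this. The paths and helper names are hypothetical; the only thing taken from the man page is the form `combine_tessdata -u <traineddata> <output prefix>`, where the prefix ends in a dot so that the components come out as files such as eng.unicharset.

```python
import subprocess
from pathlib import Path


def unpack_command(traineddata, out_dir):
    """Build the argv list for: combine_tessdata -u <traineddata> <out_dir>/<lang>."""
    lang = Path(traineddata).stem             # "eng" for "eng.traineddata"
    prefix = f"{out_dir}/{lang}."             # the trailing dot is part of the prefix
    return ["combine_tessdata", "-u", traineddata, prefix]


def unpack(traineddata, out_dir):
    """Run the unpack; requires combine_tessdata to be installed and on PATH."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(unpack_command(traineddata, out_dir), check=True)
```

After unpack("tessdata/eng.traineddata", "unpacked"), the unicharset should be found at unpacked/eng.unicharset.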

If using the latest version from git, you can use the shell script from https://code.google.com/p/tesseract-ocr/source/browse/training/tesstrain.sh

I use jTessBoxEditor for creating box/tiff pairs, as I am not able to run text2image on Windows.

I'll upload the files I used for training and let you know. You can change the training text, fonts, dictionary etc to meet your needs.




Ryan Dev

Nov 14, 2014, 2:35:28 PM
to tesser...@googlegroups.com
> asc traineddata does not have a wordlist or dictionary, so using eng will help with that.

You mean unpack the wordlist from eng and pack it into the asc one? Or run tesseract with "eng+asc"? Currently I run each language in complete isolation from each other, and figure out the results myself.

For example, I found, when doing OCR on a Greek language file, that "eng+ell" and "ell+eng" result in the same incorrect output. I have to run "ell" on its own to get correct results.
 
> If you train with the font you are using, you will get better results.

I don't have 'a font' that I'm using. My client has thousands of documents in different languages that I need to 'fix'. I am working just on the extended ASCII range (I know that doesn't mean one encoding) right now, then moving on to the full Unicode BMP range. So I can't train in that sense.
 
A big problem I'm having now is that I rely on the per-character confidence values from Tesseract, and some traineddata, such as the asc one you provided, have "inflated" confidence scores. So I end up replacing a correct Unicode result, from say deu.traineddata, with an incorrect one from asc.traineddata, because the confidence value is higher in the latter. I'm hoping to improve that "somehow"....
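
To make the problem concrete, here is a small sketch of the selection step. The confidence numbers and the per-model scale factor are invented for illustration; the thread does not say how such a factor would be obtained, and in practice it would presumably have to be estimated from samples with known ground truth.

```python
def pick_best(candidates, scale=None):
    """Return the (model, text, confidence) tuple with the highest adjusted confidence.

    candidates: iterable of (model_name, recognised_text, confidence) tuples.
    scale: optional dict of hypothetical per-model multipliers; models not
           listed keep their raw confidence.
    """
    scale = scale or {}
    return max(candidates, key=lambda c: c[2] * scale.get(c[0], 1.0))


# Made-up confidences for one glyph, purely for illustration.
glyph = [("deu", "ü", 88.0), ("asc", "ū", 93.0)]

naive = pick_best(glyph)                    # raw scores: asc wins with the wrong glyph
adjusted = pick_best(glyph, {"asc": 0.9})   # 93.0 * 0.9 = 83.7 < 88.0, so deu wins
```

Taking the raw maximum reproduces the failure described above; damping the model that is known to over-report is one possible way to compensate, not something Tesseract provides.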


> I'll upload the files I used for training and let you know. You can change the training text, fonts, dictionary etc to meet your needs.

That would be really appreciated thanks 

ShreeDevi Kumar

Nov 14, 2014, 9:37:08 PM
to tesser...@googlegroups.com
You may get better results using the appropriate language data rather than just the ASCII range. Are the client documents sorted by language?

I am attaching the files used. I had just copied some tables of the ASCII range; you can delete symbols and add multiple copies of the letters that are needed.

Ray is supposed to release new traineddata and language data files soon; that may help you.





asc.zip

Ryan Dev

Nov 18, 2014, 1:42:07 PM
to tesser...@googlegroups.com
Thanks again.

> you may get better results using appropriate language data rather than just the ascii range. Are the client documents sorted by language?

I'm not sure how they have them organised, I just know they want an "automatic" solution...
 

> I am attaching files used - i had just copied some tables of ascii range - you can delete symbols, add multiple copies of letters that are needed.


I'm still getting up and running with training (I'm doing it on Linux as there appear to be more tools available that way). But I saw this comment from zdenop, and it leads me to believe that traineddata built from the common fonts (Arial, Georgia, Segoe, Garamond) will not be any better than what is already available?

I have complete control over the image data I send to tesseract, so I don't care about skewing, exposure, etc, as my glyphs will always be straight, clear, and separated.

For instance, I want to train for the ligatures ff, ffi, and ffl, which are not in the English or asc traineddata and are missing from even common fonts like Arial, but which my client's files may contain.

Should I train new eng or asc traineddata, or just create a new one for a smaller set of glyphs like these?
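
For reference, a short sketch showing where these ligatures sit in Unicode. They are in the Alphabetic Presentation Forms block, well outside the 0x20-0xFF range, and NFKC normalisation maps each one back to its plain letters. The describe() helper is just an illustrative name.

```python
import unicodedata

# The three ligatures mentioned above, keyed by the letters they stand for.
LIGATURES = {"ff": "\ufb00", "ffi": "\ufb03", "ffl": "\ufb04"}


def describe(ch):
    """Return (code point, Unicode name, NFKC decomposition) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.normalize("NFKC", ch))


for letters, ligature in LIGATURES.items():
    print(letters, describe(ligature))
```

So a traineddata limited to extended ASCII cannot output them as single characters; either the unicharset has to include the ligature code points, or the output has to be post-processed.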

Thanks again for your help.

shree

Nov 18, 2014, 9:16:21 PM
to tesser...@googlegroups.com
0 = Orientation and script detection (OSD) only.
1 = Automatic page segmentation with OSD.
2 = Automatic page segmentation, but no OSD, or OCR.
3 = Fully automatic page segmentation, but no OSD. (Default)

See whether using OSD to detect the script helps you choose the correct traineddata.
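
A minimal sketch of how the page segmentation mode could be passed when scripting Tesseract. The helper name and file names are hypothetical. In the 3.x releases current at the time of this thread the option is spelled `-psm`; later releases use `--psm`, hence the switch.

```python
def tesseract_command(image, output_base, langs=("eng",), psm=3, legacy_flag=True):
    """Build an argv list: tesseract <image> <outputbase> -l <langs> -psm <mode>."""
    flag = "-psm" if legacy_flag else "--psm"
    return ["tesseract", image, output_base, "-l", "+".join(langs), flag, str(psm)]


# Mode 0 from the list above: orientation and script detection only.
osd_only = tesseract_command("page.tiff", "page_osd", psm=0)

# Default segmentation (mode 3) with two traineddata files combined.
combined = tesseract_command("page.tiff", "page_out", langs=("eng", "asc"))
```

The lists can be handed to subprocess.run() on a machine where tesseract is installed.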

Ryan Dev

Nov 19, 2014, 6:37:05 PM
to tesser...@googlegroups.com
I'm dealing with font subsets, and I generate an image per font, so there is no reading order. Though I've seen Latin and CJK in the same font subset. If OSD just gives reading, orientation, and text order, it is not going to give me anything useful. Plus I have the font, so I could get some of that info from the font, just no idea what language (though maybe I should go back and take another look...).

I've got training up and running on Ubuntu. I modified the text file you gave me, just adding some missing ligatures (ff, ffi, ffl), but my asc.traineddata is way worse than yours.

Do you have a list of the fonts you used to create asc.traineddata that I could start with? For example, I think my fonts are missing the old ASCII drawing blocks that you include, and which work great on the fonts that use those (usually for bullets).


ShreeDevi Kumar

Nov 19, 2014, 10:25:08 PM
to tesser...@googlegroups.com
Hi Ryan,

Attached are a couple of training logs and their unicharsets; these have details of the fonts used (they are from 2 different trainings). I tried to use fonts that support the full range, created the box/tiff pairs using jTessBoxEditor, and did the rest of the training using a modified tesstrain.sh.

Most of the fonts used are those available on Windows.

Additionally, I am using the development version of FreeSerif (from the GNU FreeFont project - https://www.gnu.org/software/freefont/).

I also used Siddhanta (which I use mainly for Sanskrit but which supports the accented letters too); you can download it from http://www.svayambhava.org/

I can send you the box/tiff pairs that I used, in case you want them, in addition to your own training images.


asc.unicharset
tesstrain.log
asc.unicharset
tesstrain.log

ShreeDevi Kumar

Nov 21, 2014, 10:59:20 PM
to Ryan, tesser...@googlegroups.com
Ryan,

I had copied text with the extended range from Wikipedia etc. to create a quick training set. It is recommended to train with 'actual' text - I think Tesseract relies on language model data.

for more background.



On Sat, Nov 22, 2014 at 3:20 AM, Ryan <software.de...@gmail.com> wrote:
Great, thank you for the additional information.

On Wed, Nov 19, 2014 at 7:47 PM, ShreeDevi Kumar <shree...@gmail.com> wrote:
Training 2 files


On Thu, Nov 20, 2014 at 9:15 AM, ShreeDevi Kumar <shree...@gmail.com> wrote:
Training 1 files

