Re: Training tesseract 3.01 with new font, for reading non-dictionary strings - ideal training text?


Andres

Oct 19, 2012, 2:11:57 PM10/19/12
to tesser...@googlegroups.com
I thought that "abcdefghijklmn..." was not a good idea because of the segmentation problem (e.g. an r followed by an n being interpreted as an m: rn -> m). Since in my project I do the character segmentation myself, I have always used "abcdefghijklmn..." for training. It would be very interesting to know the real reason for this recommendation.

Cheers,

Andres



2012/10/19 Adam Chapam <ste...@googlemail.com>
Just a quick follow up.

I have spent the day running tests. I tried using the above linked data, pages from books, and the simple (not recommended) ABCDEFG etc., but found I get the best results by randomly generating strings with a simple algorithm that outputs characters in strings ranging from 1 to 12 chars, resulting in images like the one attached:

If anyone knows why this might be a bad idea, please post, but so far it seems the most successful (and simplest) method.
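(As a rough illustration only, not Adam's actual .NET code: a generator along these lines could be sketched in Python as below. The character set and words-per-line are assumptions; only the 1 to 12 character length range comes from his description.)

    import random
    import string

    # Characters the strings are expected to contain; adjust to the character
    # set actually being trained (letters + digits is an assumption here).
    CHARSET = string.ascii_letters + string.digits

    def random_training_line(min_len=1, max_len=12, words_per_line=8):
        # Build one line of random "words" between min_len and max_len chars each.
        words = []
        for _ in range(words_per_line):
            length = random.randint(min_len, max_len)
            words.append("".join(random.choice(CHARSET) for _ in range(length)))
        return " ".join(words)

    # Emit 50 lines of training text to render into a training tif.
    if __name__ == "__main__":
        for _ in range(50):
            print(random_training_line())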


Nick White

Oct 21, 2012, 5:55:51 AM10/21/12
to tesser...@googlegroups.com
Hi Adam,

Thanks for writing with so much detail. It was interesting to read.

On Fri, Oct 19, 2012 at 02:22:44AM -0700, Adam Chapam wrote:
> I can follow the training wiki and produce working traineddata files, and have
> written a .NET app to automate creating tif/box pairs from a font file (I know
> there are plenty of other tools out there, but I have no desire to boot into
> Linux or learn Python just for this)

OK. Someday I'll get my C program cross-compiling for Windows, and
then it will be usable there too.

> The training wiki suggests that abcdefghijklmnopqrstuvwxyz1234567890 would be a
> terrible training text, and I presume this is because it needs to learn
> baseline metrics and other such things

Yes, I think the metrics etc. are the main reason for having a
'realistic' training image. In your case, going for semi-random
strings of the type you expect to see (as you explained in your
followup email) sounds like a sensible solution, and I can't see any
potential issues with it.

> The other thing that confused me was the need to have x many representations of
> a character in the training text. If using scanned images
> with inevitable small variances between the same characters, that makes sense,
> but using digitally rendered tiffs, they will all be exactly the same, so what
> benefit is there in repeating a character? Is the frequency used to decide
> between similar characters later on, e.g.:
> This letter could be an O or a D. The letter D occurred 20 times in training,
> but O only appeared 7 times, so therefore D is the most likely outcome?

As far as I'm aware the character frequency isn't used this way. I
actually think it would be interesting to be able to specify how
common a character is generally, but I don't think frequency in a
training text would be a sensible way to specify it.

As for the need to have multiple representations of a character, you
are right that you gain less from this when using straight digitally
generated characters. There is probably still some benefit to be had
in using several samples, to get more accurate metrics for its
position relative to the baseline and other letters. Less relevant
for a monospace font, though.

Hopefully I've answered all your questions somewhat. Let me know if
I missed anything.

Nick

Gaara Sabaku

Oct 25, 2012, 2:41:27 PM10/25/12
to tesser...@googlegroups.com
For your purposes a simple approach will yield the best results. The reason it is recommended to repeat letters is that tesseract does not train or read well with small samples, due to its approximation/heuristic methods. As tesseract processes the image it improves upon itself and then takes a second pass. These benefits are lost once you are scanning another example. I have gotten the best results by making more than just one scan. How do you do this? By repeating the same image in subsequent pages of the same tiff; then I only look at the last page's data.
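(A minimal sketch of building such a multi-page tiff, assuming Python with Pillow; the file names and number of repetitions are placeholders, not from the thread.)

    from PIL import Image

    # Repeat the same rendered page several times in one multi-page tiff,
    # so tesseract sees the sample more than once in a single run.
    page = Image.open("training_page.tif")   # placeholder file name
    copies = 4                               # assumed number of repetitions

    page.save(
        "training_repeated.tif",
        save_all=True,
        append_images=[page.copy() for _ in range(copies - 1)],
    )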

On Mon, Oct 22, 2012 at 12:06 AM, Adam Chapam <ste...@googlemail.com> wrote:
@ Andres
I am afraid I do not know the answer to your question, having only looked into the internals of tesseract since last week. My followup email was purely based on an afternoon of unscientific trial and error, but I am interested enough to do further research and will post anything useful that I find.

@ Nick

I am sure more Windows-based tools can only be a good thing. I wrote mine from scratch as a learning process as much as anything, and also so I can easily compare training results (generate text > render > train > do OCR > compare output.txt to generated). If I get the time I will clean it up and comment the source, so it can be released for others.
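(For the last step of that pipeline, comparing the OCR output with the generated text, a minimal sketch might look like the following; the image, ground-truth and language names are placeholder assumptions.)

    import difflib
    import subprocess

    # Run tesseract on the rendered training image; 'page.tif' and the
    # trained language name 'myfont' are placeholders.
    subprocess.run(["tesseract", "page.tif", "ocr_out", "-l", "myfont"], check=True)

    generated = open("generated.txt", encoding="utf-8").read()
    recognised = open("ocr_out.txt", encoding="utf-8").read()

    # Character-level similarity between the generated text and the OCR output.
    ratio = difflib.SequenceMatcher(None, generated, recognised).ratio()
    print("similarity: %.3f" % ratio)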

I imagine the increasing demand for Windows-based tools is in part due to the success of the various .NET wrappers that make integrating tesseract so trivial.

As a side project I will work on my text generation algorithm to produce more realistic text (capitals at the start of sentences, punctuation etc.).

Your point about monospace fonts is interesting. In order to avoid bounding box overlaps, I am artificially creating monospaced output regardless of font. I wonder if relative spacing would be better.