Nick White
Oct 21, 2012, 5:55:51 AM
to tesser...@googlegroups.com
Hi Adam,
Thanks for writing with so much detail. It was interesting to read.
On Fri, Oct 19, 2012 at 02:22:44AM -0700, Adam Chapam wrote:
> I can follow the training wiki and produce working traineddata files, and have
> written a .net app to automate creating tif/box pairs from a font file, (i know
> there are plenty of other tools out there, but i have no desire to boot into
> linux or learn python just for this)
OK. Someday I'll get my C program cross-compiling with Windows, and
then it will be usable there too.
> The training wiki suggests that abcdefghijklmnopqrstuvwxyz1234567890 would be a
> terrible training text, and i presume this is because it needs to learn
> baseline metrics and other such things
Yes, I think the metrics etc. are the main reason for having a
'realistic' training image. In your case, going for semi-random
strings of the type you expect to see (as you explained in your
followup email) sounds like a sensible solution, and I can't see any
potential issues with it.
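As a rough illustration of the semi-random approach, here is a minimal
sketch in Python (chosen only for brevity; the same logic ports directly
to .NET). The function name make_training_lines and all of its parameters
are hypothetical and not part of Tesseract or any existing training tool.
It assumes the alphabet contains no space character, and it simply
guarantees that every character appears a fixed number of times while
being broken into word-like chunks of varying length.

```python
import random


def make_training_lines(alphabet, min_count=5, min_len=3, max_len=8,
                        words_per_line=6, seed=0):
    """Return lines of semi-random 'words' in which every character of
    `alphabet` appears exactly `min_count` times.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)  # fixed seed so the text is reproducible

    # Put each character into the pool min_count times, then shuffle.
    pool = list(alphabet) * min_count
    rng.shuffle(pool)

    # Cut the shuffled pool into chunks of random length ("words").
    # The final chunk may be shorter than min_len.
    words = []
    i = 0
    while i < len(pool):
        n = rng.randint(min_len, max_len)
        words.append("".join(pool[i:i + n]))
        i += n

    # Group the words into lines of text.
    return [" ".join(words[j:j + words_per_line])
            for j in range(0, len(words), words_per_line)]


if __name__ == "__main__":
    for line in make_training_lines("ABCDEFGHJKLMNPRSTUVWXYZ0123456789"):
        print(line)
```

The output lines could then be rendered with the target font to produce
the tif/box pairs. If the real strings follow a known pattern (for
example a fixed mix of letters and digits), shaping the chunks to match
that pattern would presumably give more representative metrics.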
> The other thing that confused me was the need to have x many representations of
> a character in the training text. If using scanned images
> with inevitable small variances between the same characters, that makes sense,
> but using digitally rendered tiffs, they will all be exactly the same, so what
> benefit is there of repeating a character? Is the frequency used to determine
> between similar characters later on, eg :
> This letter could be an O or a D. The letter D occurred 20 times in training,
> but O only appeared 7 times, so therefore D is the most likely outcome?
As far as I'm aware the character frequency isn't used this way. I
actually think it would be interesting to be able to specify how
common a character is generally, but I don't think frequency in a
training text would be a sensible way to specify it.
As for the need to have multiple representations of a character, you
are right that you gain less from this when using straight digitally
generated characters. There is probably still some benefit to be had
in using several samples, to get more accurate metrics for its
position relative to the baseline and other letters. Less relevant
for a monospace font, though.
Hopefully I've answered all your questions somewhat. Let me know if
I missed anything.
Nick