traineddata for consolas font

Marco

Oct 17, 2013, 5:52:09 AM
to tesser...@googlegroups.com
Hi everybody,

I am working on a project where I need to OCR images generated programmatically. In other words, I have one app that dumps (base64) text into images and another one that is supposed to recover the text from the images (long story...). As the default eng.traineddata failed to recognize some characters, I decided to train Tesseract with the Consolas font. The problem is that, no matter what I try, it keeps making the same errors, so I must be doing something wrong.

For example: sometimes O is detected in place of 0, sometimes 0 in place of O. It also (sometimes!) reads c and w in place of C and W. 
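One way to pin down which pairs are being swapped is to diff the known input text against the OCR output and tally the substitutions. The following is only a minimal sketch, not part of the original setup: the function name `confusions` is hypothetical, and it assumes both strings are available in memory and that errors are mostly one-for-one substitutions rather than dropped or merged characters.

```python
from collections import Counter
from difflib import SequenceMatcher


def confusions(truth: str, ocr: str) -> Counter:
    """Count (expected, recognized) character pairs that differ."""
    counts = Counter()
    # autojunk=False keeps difflib from ignoring very frequent characters.
    matcher = SequenceMatcher(None, truth, ocr, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Only equal-length replacements map cleanly to per-character swaps;
        # insertions and deletions are skipped in this sketch.
        if tag == "replace" and (i2 - i1) == (j2 - j1):
            for expected, got in zip(truth[i1:i2], ocr[j1:j2]):
                counts[(expected, got)] += 1
    return counts


if __name__ == "__main__":
    for pair, n in confusions("AB0CW", "ABOCw").most_common():
        print(pair, n)
```

Running this over a full page should show whether the errors concentrate on a few pairs (0/O, c/C, w/W) or are spread more widely.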

Here is my script:

tesseract.exe foo\foo.consolas.00000.bmp foo\foo.consolas.00000 nobatch box.train 
tesseract.exe foo\foo.consolas.00001.bmp foo\foo.consolas.00001 nobatch box.train 

unicharset_extractor.exe foo\foo.consolas.00000.box foo\foo.consolas.00001.box 

shapeclustering.exe -F font_properties -U unicharset foo\foo.consolas.00000.tr foo\foo.consolas.00001.tr 

mftraining.exe -F font_properties -U unicharset -O foo.unicharset foo\foo.consolas.00000.tr foo\foo.consolas.00001.tr 


copy shapetable foo.shapetable
copy normproto foo.normproto
copy inttemp foo.inttemp
copy pffmtable foo.pffmtable

combine_tessdata foo.

Then I run tesseract using -lang foo.

Notes:

I have checked the box files over and over and they *look* fine to me (I use JBoxEdit).
All 64 characters were found.
Images are 300 dpi.
Font size is 12 (see image).

What am I doing wrong?

Thanks!!

Marco
foo.consolas.00000.bmp

Marco

Oct 17, 2013, 9:58:59 AM
to tesser...@googlegroups.com
For those who may have run into similar problems: increasing the font's character spacing improved the accuracy considerably.

Marco

Marco

Oct 17, 2013, 11:25:52 AM
to tesser...@googlegroups.com

On Thursday, October 17, 2013 at 15:58:59 UTC+2, Marco wrote:
> For those who may have run into similar problems: increasing the font's character spacing improved the accuracy considerably.

But unfortunately I am not yet hitting 100% accuracy.

Suggestions are welcome.

Thanks

Marco

Nick White

Oct 17, 2013, 12:20:02 PM
to tesser...@googlegroups.com
Hi Marco,

On Thu, Oct 17, 2013 at 08:25:52AM -0700, Marco wrote:
>
> On Thursday, October 17, 2013 at 15:58:59 UTC+2, Marco wrote:
>
> > For those who may have run into similar problems: increasing the
> > font's character spacing improved the accuracy considerably.
>
> But unfortunately I am not yet hitting 100% accuracy.

Have you tried increasing the size of the characters? It may help.

Marco

Oct 17, 2013, 1:14:41 PM
to tesser...@googlegroups.com
ps: not seeing my answer published .. second try.

Hi Nick,

Thanks for the reply. I have tried bumping the font size up to 30 and increasing the resolution to 600 dpi, but it does not seem to make a difference. The main errors are lower-case letters (w, x, s, c) being recognized instead of their upper-case versions, as well as 0 and = being mixed up. I have tried using other fonts but got worse results.

Q: is it possible to achieve 100% accuracy with self-generated images? 

thanks,

Marco 

zdenko podobny

Oct 17, 2013, 1:20:13 PM
to tesser...@googlegroups.com
there is no 100% accuracy - in any OCR.

Zdenko


--
--
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to tesser...@googlegroups.com
To unsubscribe from this group, send email to
tesseract-oc...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en
 

Nick White

Oct 17, 2013, 1:58:35 PM
to tesser...@googlegroups.com
On Thu, Oct 17, 2013 at 07:20:13PM +0200, zdenko podobny wrote:
> there is no 100% accuracy - in any OCR.

Zdenko's correct, I'm afraid, even for self-generated images where
you can produce "perfect" specimens. Depending on what you're doing,
it may make more sense to use something like a QR code to do your
stuff, which has error correction and whatnot embedded.

Otherwise, if you know the sorts of data you'll be encountering, you
can add replace / pattern rules, but if it's base64-encoded text
then that isn't really an option.
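Short of a QR code, a lighter-weight option is to make errors at least detectable by printing a checksum next to each line of text. This is a sketch under stated assumptions, not something tried in this thread: the helper names `add_checksum` and `verify` are hypothetical, the payload is assumed to contain no spaces (true for base64), and CRC-32 only detects errors, it does not correct them.

```python
import zlib


def add_checksum(line: str) -> str:
    """Append an 8-digit hex CRC-32 of the line, separated by a space."""
    crc = zlib.crc32(line.encode("ascii")) & 0xFFFFFFFF
    return f"{line} {crc:08X}"


def verify(line_with_crc: str) -> bool:
    """Return True if the recognized line still matches its printed CRC."""
    payload, _, crc_text = line_with_crc.rpartition(" ")
    try:
        expected = int(crc_text, 16)
    except ValueError:
        # The checksum itself was misread (e.g. O instead of 0).
        return False
    # "replace" keeps stray non-ASCII OCR output from raising an exception.
    actual = zlib.crc32(payload.encode("ascii", "replace")) & 0xFFFFFFFF
    return actual == expected


if __name__ == "__main__":
    printed = add_checksum("QUJDT08w")
    print(printed, verify(printed))
```

A line that fails `verify` could then be re-scanned or flagged, which at least turns silent corruption into a visible failure.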

Marco

Oct 17, 2013, 3:49:18 PM
to tesser...@googlegroups.com
damn, there goes my hacking plan. 

While I could substitute base64 with something else, it would still not be possible to apply any linguistic pattern or error correction. While what Tesseract can do is pretty impressive, I am still shocked to learn that it is not possible to achieve 100% reliability even with full control of the fonts and all the rest.

thank you for your feedback - much appreciated.

Marco   

Martin Monperrus

Apr 25, 2020, 10:31:08 AM
to tesseract-ocr
Hi Marco, all,

The problem is that base64 contains many pairs of characters that are easily confused by OCR (but not by humans), for instance 0 and O. If you replace the default 64 characters with 64 other, carefully selected symbols, then you can reach 100% accuracy. As you say, having full control over the chosen font is also key here.

For instance, I have now achieved 100% accuracy on a document of 15k characters, using a specific alphabet, the Inconsolata font, and gocr; see https://www.monperrus.net/martin/store-data-paper
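To illustrate the idea of swapping alphabets, here is a minimal sketch. It is not the alphabet from the page linked above: as an assumption for illustration it uses a Crockford-style 32-symbol alphabet (upper-case only, without I, L, O and U) on top of Python's standard base32 codec, so it trades density (5 bits per symbol instead of 6) for fewer look-alike characters. It also drops the `=` padding, since `=` was reported earlier in the thread as being mixed up with 0.

```python
import base64

# Standard RFC 4648 base32 alphabet used by base64.b32encode.
STD_B32 = "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567"
# Crockford-style alphabet (an assumption for this sketch): no I, L, O, U.
OCR_SAFE = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"

_ENC = str.maketrans(STD_B32, OCR_SAFE)
_DEC = str.maketrans(OCR_SAFE, STD_B32)


def encode(data: bytes) -> str:
    """Encode bytes with the reduced alphabet, without '=' padding."""
    text = base64.b32encode(data).decode("ascii").rstrip("=")
    return text.translate(_ENC)


def decode(text: str) -> bytes:
    """Map symbols back to standard base32 and restore the padding."""
    std = text.translate(_DEC)
    std += "=" * (-len(std) % 8)
    return base64.b32decode(std)


if __name__ == "__main__":
    encoded = encode(b"hello world")
    print(encoded, decode(encoded))
```

Whether a given OCR engine and font actually separate all 32 symbols reliably still has to be measured; the alphabet only removes the most obvious look-alike pairs.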

With tesseract, I still have errors and my understanding is that tesseract expects words from a given language, and not long sequences of random characters. I've been trying to change the configuration and to fine-tune a tesseract model for this task but so far with no success.
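For reference, one configuration that targets the dictionary behaviour is to switch off the word lists and restrict the character set from the command line. The sketch below only builds the argument list; it assumes Tesseract 4.x option syntax (`--psm`, `-c`), the function name `tesseract_args` is hypothetical, and whether these settings actually improve base64 recognition is untested here. Depending on the Tesseract version and engine, the whitelist may be ignored.

```python
import string

# The 64 base64 symbols; '=' padding is left out of the whitelist on purpose.
BASE64_CHARS = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"


def tesseract_args(image: str, output_base: str, lang: str = "eng") -> list:
    """Build a tesseract command line with dictionaries disabled."""
    return [
        "tesseract", image, output_base,
        "-l", lang,
        "--psm", "6",                    # treat the page as one uniform block of text
        "-c", "load_system_dawg=0",      # do not load the main word list
        "-c", "load_freq_dawg=0",        # do not load the frequent-word list
        "-c", "tessedit_char_whitelist=" + BASE64_CHARS,
    ]


if __name__ == "__main__":
    print(" ".join(tesseract_args("page.png", "page")))
```

The resulting list can be passed to `subprocess.run(..., check=True)` on a machine where the `tesseract` binary is installed.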

I suspect that tweaking eng.numbers, eng.punc, eng.training_text, eng.wordlist for base64 recognition is doable.

What do you think? What should we put in eng.training_text and eng.wordlist to successfully tune a model to perform base64 recognition?

Thanks!

--Martin

Marco Peretti

Apr 27, 2020, 3:20:58 AM
to tesser...@googlegroups.com
Hello Martin,

I am afraid I won't be of much help as it was a one-off experiment, long forgotten. I was investigating exfiltrating information via the Remote Desktop Protocol (RDP) and I haven't used Tesseract since then. 

My feeling at the time was that Tesseract was better suited for regular OCR, where OCRed text can be improved by using dictionaries, compensating for any lack of accuracy that may occur. In my case, I was after a more generic solution for various kinds of data, where 100% accuracy was required. That said, I really only spent a few days experimenting and am not even sure that my conclusions were correct at the time.

Your research on storing data on paper looks very interesting though!

best,

Marco

--
You received this message because you are subscribed to a topic in the Google Groups "tesseract-ocr" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/tesseract-ocr/JM2kotm7cEo/unsubscribe.
To unsubscribe from this group and all its topics, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/32c4b26c-aaf0-4178-9177-cbf41c34f08f%40googlegroups.com.