Passing glyph vector data directly to tesseract

Ryan Dev

Oct 23, 2014, 7:45:08 PM

to tesser...@googlegroups.com

Hi, I have what I think is a unique situation, and I was hoping I could get some hints on how to proceed.

I have problematic font files whose Unicode mappings I want to fix. I also have PDF files that use these fonts, so contextual semantics are available as well.

Currently I draw all the glyphs to an image, and run OCR on them. However, there are always issues in just about every test.

The most common problems are

1. Lower-case and upper-case Latin o's being mixed up with zero.

2. Upper-case Latin i, lower-case Latin L, and the digit one being mixed up.

3. Characters "randomly" getting broken up. So instead of a Latin upper-case H, I get two vertical bars and a hyphen.
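Since the PDFs supply context for each glyph, problems 1 and 2 could in principle be settled after OCR by letting the neighbouring characters vote. The following is only a minimal sketch of that idea. The function names, the `(previous, next)` context format, and the voting rules are all assumptions made for illustration, and nothing here is part of tesseract.

```python
from collections import Counter

def kind(ch):
    """Coarse class of a candidate character."""
    if ch.isdigit():
        return "digit"
    if ch.isupper():
        return "upper"
    if ch.islower():
        return "lower"
    return None

def classify_context(prev_ch, next_ch):
    """Return the class one occurrence votes for, or None if it gives no evidence.

    prev_ch / next_ch are the neighbouring characters in the PDF text
    (None at the start or end of a text run). Heuristic rules, assumed here.
    """
    neighbours = [c for c in (prev_ch, next_ch) if c and not c.isspace()]
    if any(c.isdigit() for c in neighbours):
        return "digit"                      # sits next to a digit
    if prev_ch and prev_ch.islower():
        return "lower"                      # mid-word, after a lower-case letter
    starts_word = prev_ch is None or prev_ch.isspace()
    if starts_word and next_ch and next_ch.islower():
        return "upper"                      # word-initial, before lower case
    return None

def resolve_confusable(candidates, contexts):
    """Pick one of the look-alike candidates using votes from all occurrences."""
    votes = Counter()
    for prev_ch, next_ch in contexts:
        vote = classify_context(prev_ch, next_ch)
        if vote:
            votes[vote] += 1
    for winner, _count in votes.most_common():
        for cand in candidates:
            if kind(cand) == winner:
                return cand
    return candidates[0]                    # no evidence: keep the OCR's first guess

# Example: a glyph that OCR could not place among I, l and 1
print(resolve_confusable(["I", "l", "1"], [("e", "l"), ("l", "o"), ("a", "s")]))
```

A single occurrence can mislead (a word-initial lower-case L followed by lower case votes "upper"), which is why the sketch counts votes over every occurrence of the glyph rather than trusting one.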

Performance is very important, so I would like to avoid having to do OCR on full pages or running text (such as paragraphs or words), and instead just work with the font itself.

One approach I was considering is skipping the image rasterization step entirely, since I already have vector data. Would it not be beneficial to hook into tesseract and pass my vector data directly to some later stage (features?)?
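Whatever later stage were targeted, curved outline segments would most likely have to be flattened into straight edges before anything polygon-based could consume them. The sketch below shows only that preparatory step, assuming TrueType-style quadratic Bezier segments. The function names, the segment format, and the fixed step count are illustrative assumptions. It does not call tesseract and makes no claim about which internal structures tesseract would accept.

```python
def flatten_quadratic(p0, p1, p2, steps=8):
    """Approximate one quadratic Bezier segment with `steps` straight edges.

    p0 and p2 are the on-curve end points, p1 is the off-curve control point.
    Returns steps + 1 (x, y) points, including both end points.
    """
    points = []
    for i in range(steps + 1):
        t = i / steps
        u = 1.0 - t
        x = u * u * p0[0] + 2 * u * t * p1[0] + t * t * p2[0]
        y = u * u * p0[1] + 2 * u * t * p1[1] + t * t * p2[1]
        points.append((x, y))
    return points

def flatten_contour(segments, steps=8):
    """Flatten a list of (p0, p1, p2) segments into one polyline.

    Assumes each segment starts where the previous one ended, so the
    shared point is emitted only once.
    """
    polyline = []
    for p0, p1, p2 in segments:
        pts = flatten_quadratic(p0, p1, p2, steps)
        polyline.extend(pts if not polyline else pts[1:])
    return polyline

# Example: one arch-shaped segment split into four edges
print(flatten_quadratic((0, 0), (1, 2), (2, 0), steps=4))
```

A fixed step count keeps the sketch short; an error-tolerance-based subdivision would give fewer points on nearly straight segments.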

I am comfortable with C++, etc, so please feel free to point me to source code I should be interested in.

Thanks!

zdenko podobny

Oct 30, 2014, 9:42:56 AM

to tesser...@googlegroups.com

On Fri, Oct 24, 2014 at 1:45 AM, Ryan Dev <software.de...@gmail.com> wrote:

The most common problems are
1. lower case and upper case latin o's being mixed up with zero
2. upper case latin i and lower case latin L, and number one being mixed up

3. Characters "randomly" getting broken up. So instead of latin upper case H, I get two vertical bars and a hyphen.

IMO these (1. and 2.) are general problems, not specific to OCR: these letters are difficult to distinguish in some fonts. You can increase the chance of identifying them correctly by putting them in some context (e.g. words), but as I understand it, you are trying to avoid that. Maybe you can post some example image, so


Ryan Dev

Oct 31, 2014, 1:10:00 PM

to tesser...@googlegroups.com

Here is an example of glyphs from one font.

The upper case i was OCR'd as lower case L, and the lower case L was OCR'd as a vertical bar '|'.

In an earlier post [1] it was recommended to repeat the string, but this rarely, if ever, improved the results and was not worth the added CPU time.

I guess really #3 is my biggest concern. #1 and #2 are not huge, but #3 is very annoying. Here is an image for where the upper case M gets ocr'd as "|\/|".
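Because each rendered glyph is known to be exactly one character, a broken-up result can at least be detected after the fact, and the fragment patterns seen so far can be mapped back. This is a minimal post-processing sketch and not a fix for the chopping itself. The function name and status labels are made up for illustration, the table holds only the two cases mentioned in this thread, and the `|-|` ordering for H is an assumption.

```python
# Fragment patterns reported in this thread; the "|-|" spelling for H is assumed.
KNOWN_FRAGMENTS = {
    "|-|": "H",
    "|\\/|": "M",
}

def repair_glyph_result(ocr_text):
    """Return (character, status) for the OCR output of one rendered glyph.

    status is "ok" for a single character, "repaired" for a known fragment
    pattern, and "review" for anything else (character is None in that case).
    """
    text = "".join(ocr_text.split())        # drop spaces and newlines
    if len(text) == 1:
        return text, "ok"
    if text in KNOWN_FRAGMENTS:
        return KNOWN_FRAGMENTS[text], "repaired"
    return None, "review"

# Example: the broken-up M described above
print(repair_glyph_result("|\\/|\n"))
```

Anything returned with status "review" could then be sent through a slower path, such as OCR in word context, so the extra cost is paid only for the glyphs that actually broke.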

As for full page OCR, I've been using VietOCR.Net for testing, and confirmed that full page OCR does not break up the M. But of course processing time is orders of magnitude longer.

What I would really like to do is skip the whole image analysis part, since I already have the glyph paths in vector form, so I don't want tesseract to chop.

[1] https://groups.google.com/forum/#!searchin/tesseract-ocr/from$3Ame/tesseract-ocr/K_CHA_DGO-Y/8l7qLOtua7EJ
