Article scanning: hocr output wrong after font training?


Scott Goci

Jan 1, 2024, 5:15:27 PM1/1/24
to tesseract-ocr
Note: I'm using tesseract v5.3.3

I have a bunch of scanned magazine articles that I want to convert to another document format while keeping the article's formatting, including italic / bold / underline, etc. I originally tried --oem 0 with tessedit_debug_fonts=1, but it didn't seem to do a good job of identifying the italicized / bolded / etc. words, perhaps because the magazine fonts were different from the ones it was trained on.

Since I know the fonts that the magazine uses (ITC New Baskerville), I used text2image along with tesstrain to train tesseract on ITC New Baskerville Standard, ITC New Baskerville Bold, and ITC New Baskerville Italic.

The problem is, even after training a few different ways with tesstrain (e.g. adjusting exposure options, char_spacing options, etc), when I output to hocr (e.g. using the command tesseract sherlock-holmes-example.png output -l ITC-New-Baskerville-Std+ITC-New-Baskerville-Std-Italic -c hocr_font_info=1 hocr) it still seems to get the font info wrong (see attached files for a sample input and output). 

As an example, I was hoping the word "coup-de-maitres" would be recognized with lang='ITC-New-Baskerville-Std-Italic', but it isn't. Conversely, the word "testifying" shows with lang='ITC-New-Baskerville-Std-Italic', but it is not italic.

Any suggestions on what I am potentially doing wrong? Am I training with tesstrain with the wrong parameters, or is there a tesseract option I'm missing to improve quality?

output.hocr
sherlock-holmes-example.png

Tom Morris

Jan 3, 2024, 1:37:17 AM1/3/24
to tesseract-ocr
Font attribute recognition is a legacy-engine feature only, i.e. it doesn't exist in the new LSTM engine in Tesseract 4/5.

On Monday, January 1, 2024 at 12:15:27 PM UTC-5 sco...@gmail.com wrote:

The problem is, even after training a few different ways with tesstrain (e.g. adjusting exposure options, char_spacing options, etc), when I output to hocr (e.g. using the command tesseract sherlock-holmes-example.png output -l ITC-New-Baskerville-Std+ITC-New-Baskerville-Std-Italic -c hocr_font_info=1 hocr) it still seems to get the font info wrong (see attached files for a sample input and output). 

As an example, I was hoping the word "coup-de-maitres" would be recognized with lang='ITC-New-Baskerville-Std-Italic', but it isn't. Conversely, the word "testifying" shows with lang='ITC-New-Baskerville-Std-Italic', but it is not italic.

You appear to be training the font as a language, which is why it's being output with the `lang=` tag. That's wrong: if it were actually recognizing it as a font and outputting it as such, it would appear as `x_font <font>` in the title attribute. The hOCR output will also contain <em> tags around italic words when an italic font is recognized.
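For reference, when the legacy engine does emit font info (hocr_font_info=1 with --oem 0), it lands in the title attribute of each ocrx_word span, and italics additionally show up as <em> wrappers. Here is a minimal sketch of pulling those out with Python's standard library; the hOCR fragment and font names in `sample` are made up for illustration, not taken from the attached output:

```python
from html.parser import HTMLParser

class HocrFontParser(HTMLParser):
    """Collect (text, font, italic) for each ocrx_word span in hOCR output."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._in_word = False
        self._italic = False
        self._font = None
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "span" and "ocrx_word" in (a.get("class") or ""):
            self._in_word, self._italic, self._font = True, False, None
            self._chunks = []
            # With hocr_font_info=1 the legacy engine emits e.g.
            # title="bbox 10 20 90 45; x_wconf 96; x_font Times_New_Roman"
            for field in (a.get("title") or "").split(";"):
                parts = field.strip().split(None, 1)
                if len(parts) == 2 and parts[0] == "x_font":
                    self._font = parts[1]
        elif tag == "em" and self._in_word:
            self._italic = True

    def handle_data(self, data):
        if self._in_word:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._in_word:
            self.words.append(("".join(self._chunks).strip(),
                               self._font, self._italic))
            self._in_word = False

# Hypothetical fragment of what --oem 0 output can look like:
sample = ('<span class="ocrx_word" title="bbox 0 0 50 20; x_wconf 95; '
          'x_font Baskerville_Italic"><em>coup-de-maitres</em></span> '
          '<span class="ocrx_word" title="bbox 60 0 120 20; x_wconf 97; '
          'x_font Baskerville">testifying</span>')
parser = HocrFontParser()
parser.feed(sample)
for text, font, italic in parser.words:
    print(text, font, italic)
```

This should make it easy to see whether a trained run is actually producing x_font entries at all, as opposed to only lang= attributes.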

I tried using `--oem 0` with the eng model from https://github.com/tesseract-ocr/tessdata and it did output <strong> and <em> tags, but in the wrong places, and its accuracy on the text wasn't as good as the LSTM model's. When I used eng+fra, it output language tags, but at the paragraph level, not the word level, and they were mostly wrong. I've attached the output.

You can read more about the state of play of getting font attributes out of the current model here (it's possible, but don't look for it any time soon):

Tom
sherlock-oem0.hocr.html

Scott Goci

Jan 3, 2024, 3:17:42 PM1/3/24
to tesseract-ocr
You are indeed correct that font attribute recognition is only available via the legacy engine (e.g. --oem 0). Conversely, though, when using the LSTM engine with tesstrain, even though the output ends up with tags that look like lang="...", the tesstrain documentation doesn't seem to say that we are training tesseract only for a language rather than for a font. I would assume that if it's recognizing the "language", it's doing something similar to recognizing the font; am I wrong? Or does the lang attribute indicate the first result that matched, rather than the closest match (e.g. the closest trained font)? That would indeed make it different from x_font.

Tom Morris

Jan 4, 2024, 7:29:22 PM1/4/24
to tesseract-ocr
I believe it's returning what it considers to be the best matching model (i.e. "lang"), but, if my experiments with eng+fra are any indication, the recognition isn't reliable. If it has trouble distinguishing two Romance languages that use the same character set, I doubt it can be counted on to distinguish two closely related fonts from the same family.

Tom

p.s. My earlier comment about the lang= attribute being only at the paragraph level was wrong. It outputs both a paragraph default and per-word overrides for words which don't match the paragraph.

Scott Goci

Jan 5, 2024, 2:30:05 PM1/5/24
to tesseract-ocr
Hmmm -- makes sense (although unfortunate).

Would you offer any suggestions as to next steps I could take from here? It seems my options are:
  1. Go back and train the legacy engine (e.g. --oem 0) on the fonts as well (I've been using this guide: https://michaeljaylissner.com/posts/2012/02/11/adding-new-fonts-to-tesseract-3-ocr-engine/) and hope the results improve enough to be usable
  2. Use some sort of post-processing step after tesseract to detect italics / bold / etc. (although I'm not sure what tools / software / libraries I'd use here, so I'd really need suggestions)
  3. Wait and hope that adding WordFontAttributes back to the non-legacy engine becomes a roadmap priority
  4. Something else perhaps?
I don't mind putting in the work of learning / training / etc.; what I'd be hesitant about is individually correcting and cleaning up the ~20,000 or more articles that need to be parsed.

Let me know what you think!

Tom Morris

Jan 5, 2024, 4:27:10 PM1/5/24
to tesseract-ocr
On Friday, January 5, 2024 at 9:30:05 AM UTC-5 sco...@gmail.com wrote:
Would you offer any suggestions as to next steps I could take from here? It seems my options are:
  1. Go back and train the legacy engine (e.g. --oem 0) on the fonts as well (I've been using this guide: https://michaeljaylissner.com/posts/2012/02/11/adding-new-fonts-to-tesseract-3-ocr-engine/) and hope the results improve enough to be usable
  2. Use some sort of post-processing step after tesseract to detect italics / bold / etc. (although I'm not sure what tools / software / libraries I'd use here, so I'd really need suggestions)
  3. Wait and hope that adding WordFontAttributes back to the non-legacy engine becomes a roadmap priority
  4. Something else perhaps?
I'm afraid I don't have any magic solutions (or even good suggestions). The only thing I can offer is to perhaps not be so fixated on Tesseract as a solution.

- would a different OCR package (including commercial) give you better results?
- do you really *need* the italics?
- could you implement a crowdsourced annotation facility that let people add the italics later?

Good luck!

Tom

Scott Goci

Jan 5, 2024, 5:48:26 PM1/5/24
to tesseract-ocr
Hey Tom,

Thanks for all your guidance here; I appreciate our back and forth!

RE: "[...] do you really *need* the italics?", I think there is actually a lot lost without font attributes (e.g. bold / italic / underline). Consider the following sentences / quotes:
  • "I never said she stole the money"
  • "I never said she stole the money"
  • "I never said she stole the money"
  • "I never said she stole the money"
The meaning of the above varies drastically depending on which word (if any) is italicized.

For other font attributes (e.g. bold / underline) the case isn't as strong, but I still believe we'd miss some things. E.g. consider the following:
  • Not ten eggs, eaten eggs (here, underlining emphasizes the specific part of the text that changes the meaning of the word at hand)
  • Scott: What is your biggest accomplishment? (in an interview context, bolding highlights who is asking the question, especially when a different person is responding)
----

I can definitely try other OCR packages, but as this is the biggest non-commercial OCR library, I assume other non-commercial libraries might not yield results as good. I can also try commercial libraries as you suggest, although then I'd be beholden to potentially steep pricing.

Let me know if you have any final thoughts; otherwise I'll take the advice you've given and go from here!

Art Rhyno

Jan 8, 2024, 3:31:51 PM1/8/24
to tesser...@googlegroups.com

There is a related thread on Stack Overflow that might be helpful for your processing [1]. The thread is about italics and bolding; full font detection seems a tougher challenge. This repository [2] links to Adobe's work in the area and has an interesting implementation. In either case you would still probably want Tesseract to get the bounding boxes for the characters.
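Along the lines of those Stack Overflow answers, one crude post-processing heuristic is to binarize each word crop (using Tesseract's bounding boxes) and estimate stroke slant from second-order image moments: upright text gives a slant near zero, while a typical 10-12 degree italic shear gives roughly 0.2. Below is a toy sketch with synthetic strokes standing in for real word images; the 0.1 threshold is an assumption you'd need to tune on real scans:

```python
import numpy as np

def slant(mask):
    """Return tan(slant angle from vertical) for a binary glyph mask.

    Moment-based skew estimate: -mu11 / mu02 over the ink pixels
    (negated because image rows grow downward, so a right-leaning
    stroke has a negative raw covariance between x and y).
    """
    ys, xs = np.nonzero(mask)
    x = xs - xs.mean()
    y = ys - ys.mean()
    return -(x * y).sum() / (y * y).sum()

def looks_italic(mask, threshold=0.1):
    # Threshold is a guess (~tan 6 degrees); tune it on your own scans.
    return slant(mask) > threshold

# Synthetic 3-px-wide strokes: one vertical, one sheared by ~0.2 (about 11 deg).
H, W = 60, 40
upright = np.zeros((H, W), dtype=np.uint8)
upright[:, 18:21] = 1
italic = np.zeros((H, W), dtype=np.uint8)
for row in range(H):
    x0 = 8 + int(round(0.2 * (H - 1 - row)))
    italic[row, x0:x0 + 3] = 1

print(slant(upright), looks_italic(upright))
print(slant(italic), looks_italic(italic))
```

In practice you'd crop each word using the bbox values from the hOCR output and binarize (e.g. Otsu) before measuring. Note this only detects slant, not the font itself, and bold would need a separate stroke-width measure, so it's a complement to, not a substitute for, the approaches in [1] and [2].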


Best,


art

---

1. https://stackoverflow.com/questions/67577793/detecting-bold-and-italic-text-in-an-image

2. https://github.com/robinreni96/Font_Recognition-DeepFont

