The problem is that, even after training a few different ways with tesstrain (e.g. adjusting the exposure options, char_spacing options, etc.), when I output to hOCR (e.g. using the command tesseract sherlock-holmes-example.png output -l ITC-New-Baskerville-Std+ITC-New-Baskerville-Std-Italic -c hocr_font_info=1 hocr) it still seems to get the font info wrong (see attached files for a sample input and output). As an example, I was hoping the word "coup-de-maitres" would be recognized with lang='ITC-New-Baskerville-Std-Italic', but it isn't. Conversely, the word "testifying" shows up with lang='ITC-New-Baskerville-Std-Italic', even though it is not italic.
Would you offer any suggestions as to next steps I could take from here? E.g. it seems my options are:
- I can go back and train the legacy engine (e.g. --oem 0) on the fonts as well (I've been using this guide: https://michaeljaylissner.com/posts/2012/02/11/adding-new-fonts-to-tesseract-3-ocr-engine/), and hope the font info improves enough to be usable
- I can add a post-processing step after tesseract to detect italics / bold / etc. (although I'm not sure which tools or libraries I'd use for this, so I'd really need suggestions)
- I could wait and hope that adding WordFontAttributes back to the non-legacy engine becomes a priority on the roadmap
- Something else perhaps?
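For anyone wanting to check the output the same way, here is a rough sketch of how the per-word lang values can be pulled out of the hOCR file, using only the Python standard library. The names HocrWordParser and parse_hocr_words are made up for this example, and it assumes a word without its own lang attribute inherits the lang of the enclosing element.

```python
import re
from html.parser import HTMLParser

# Elements that never get a closing tag; skipped so the stack stays balanced.
VOID_TAGS = {"meta", "br", "link", "img", "hr", "input"}


class HocrWordParser(HTMLParser):
    """Collects text, bbox and effective lang for every ocrx_word element."""

    def __init__(self):
        super().__init__()
        self.lang_stack = []    # effective lang of each currently open element
        self.current = None     # word currently being collected, if any
        self.word_depth = None  # stack depth at which that word was opened
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attr = dict(attrs)
        inherited = self.lang_stack[-1] if self.lang_stack else None
        lang = attr.get("lang") or inherited
        self.lang_stack.append(lang)
        if "ocrx_word" in (attr.get("class") or "").split():
            match = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", attr.get("title") or "")
            bbox = tuple(int(g) for g in match.groups()) if match else None
            self.current = {"text": "", "lang": lang, "bbox": bbox}
            self.word_depth = len(self.lang_stack)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.lang_stack:
            return
        # The word is finished when its own element closes (not a nested <em>).
        if self.current is not None and len(self.lang_stack) == self.word_depth:
            self.current["text"] = self.current["text"].strip()
            self.words.append(self.current)
            self.current = None
        self.lang_stack.pop()

    def handle_data(self, data):
        if self.current is not None:
            self.current["text"] += data


def parse_hocr_words(hocr_text):
    """Return a list of dicts with keys 'text', 'lang' and 'bbox'."""
    parser = HocrWordParser()
    parser.feed(hocr_text)
    parser.close()
    return parser.words
```

Reading the file produced by the command above (output.hocr) and passing its contents to parse_hocr_words gives one entry per word, which makes it easier to count how often the italic traineddata was chosen for words that are or are not italic.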
There is a related thread on Stack Overflow that might be helpful for your post-processing [1]. The thread is about detecting italics and bold; font detection seems a tougher challenge. This repository [2] links to Adobe's work in the area (DeepFont) and has an interesting implementation. In either case you would probably still want Tesseract to get the bounding boxes for the characters.
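To make the post-processing idea a bit more concrete, below is a minimal sketch of one simple heuristic for italics: estimate the slant of a binarized word crop by trying several shear values and keeping the one that makes the vertical strokes line up best. This is a generic deslanting trick, not something taken from the linked thread or from Tesseract. The function name estimate_slant, the candidate range and the scoring rule are all assumptions for illustration, and any threshold for calling a word italic would need tuning on your own scans.

```python
from collections import Counter


def estimate_slant(rows, candidates=None):
    """Estimate the slant of a binarized word image.

    rows: list of rows (top to bottom), each a list of 0/1 values, 1 = ink.
    Returns the shear (horizontal pixels per pixel of height) that best
    straightens the strokes: about 0 for upright text, positive when the
    strokes lean to the right, as in italics.
    """
    if candidates is None:
        candidates = [i / 10 for i in range(-10, 11)]  # -1.0 .. 1.0
    height = len(rows)
    best_shear, best_score = 0.0, -1
    for shear in candidates:
        columns = Counter()
        for y, row in enumerate(rows):
            lift = height - 1 - y        # distance above the bottom row
            shift = round(shear * lift)  # undo a lean of `shear` at this height
            for x, value in enumerate(row):
                if value:
                    columns[x - shift] += 1
        # Straight vertical strokes concentrate ink in few columns, which
        # maximizes the sum of squared column counts.
        score = sum(count * count for count in columns.values())
        if score > best_score:
            best_shear, best_score = shear, score
    return best_shear
```

The word crops would come from the Tesseract bounding boxes. Comparing each word's estimate against the typical value for the page is probably more robust than a fixed cutoff, since scans are rarely perfectly straight.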
Best,
art
---
1. https://stackoverflow.com/questions/67577793/detecting-bold-and-italic-text-in-an-image
2. https://github.com/robinreni96/Font_Recognition-DeepFont
From: tesser...@googlegroups.com <tesser...@googlegroups.com>
On Behalf Of Scott Goci
Sent: Friday, January 5, 2024 12:48 PM
To: tesseract-ocr <tesser...@googlegroups.com>
Subject: [tesseract-ocr] Re: Article scanning: hocr output wrong after font training?