Post OCR Verification and Editing


Mark Pellegrino

Mar 7, 2024, 2:17:28 PM
to tesseract-ocr
Hello,
I'm trying to check PDFs made with Tesseract 5.2 for correctness using an OCR editor but am unable to open them in either Abbyy or Acrobat.

If I try to open a Tesseract PDF with Abbyy FineReader/OCR Editor, the software just hangs and crashes. I can open Tesseract PDFs with Acrobat Pro, but when I enable the 'Make OCR text visible' option in Preflight, all of the text layer turns into unreadable black boxes. The font used shows as 'GlyphLessFont' and appears to be embedded in the file.

It doesn't matter what training data I use or what the source image was; I always get these results. Any PDF not made with Tesseract works just fine. I'm guessing that the issue is a missing font? I don't have much of an understanding of how embedded PDF fonts work, and I haven't found anything about this in the Tesseract docs. Can someone please point me in the right direction? Thanks.


Mark Pellegrino

Mar 7, 2024, 2:53:28 PM
to tesseract-ocr
I found more info here:
https://github.com/tesseract-ocr/tesseract/issues/1769#issuecomment-509490277

Glyphless appears to be an 'invisible font' and all that Tesseract supports. It seems like the solution is to use Tesseract to generate hOCR, then use another tool to combine the source image with the hOCR?

Does anyone have a simple workflow for editing/correcting Tesseract OCR documents that they can share?

Thanks again,

Zdenko Podobny

Mar 8, 2024, 6:14:21 AM
to tesser...@googlegroups.com
Hello,


I am not sure if OCRmyPDF (https://ocrmypdf.readthedocs.io/en/latest/) allows redaction.

If you would like to implement the text layer yourself with a custom font, have a look at PyMuPDF.

Zdenko



Merlijn B.W. Wajer

Mar 8, 2024, 7:03:16 AM
to tesser...@googlegroups.com
Hi Mark,

On 07/03/2024 20:53, Mark Pellegrino wrote:
> I found more info here:
> https://github.com/tesseract-ocr/tesseract/issues/1769#issuecomment-509490277
>
> Glyphless appears to be an 'invisible font' and all that Tesseract
> supports. It seems like the solution is to use Tesseract to generate
> hOCR, then use another tool to combine the source image with the hOCR?
>
> Does anyone have a simple workflow for editing/correcting Tesseract OCR
> documents that they can share?

If you're looking to do OCR and PDF generation separately, you might
want to look into the Internet Archive's PDF generation tooling, which
is designed to do exactly this (plus some aggressive compression):
https://github.com/internetarchive/archive-pdf-tools (disclaimer: I'm
the author of the tooling)

As for viewing and editing hOCR, there are a lot of different tools
around, not all fully functional (I haven't tried most of these):

* https://www.not-implemented.de/hocr-proofreader/
* https://github.com/kba/hocrjs
* https://github.com/GeReV/hocr-editor-ts /
https://github.com/GeReV/HocrEditor

There are also some GUI tools that I recall for editing hOCR, but they
might require you to convert to another format first.

Regards,
Merlijn



Mark Pellegrino

Mar 8, 2024, 2:13:43 PM
to tesser...@googlegroups.com
Thanks Zdenko, PyMuPDF is an intriguing option. I'll check it out further.


Mark Pellegrino

Mar 8, 2024, 2:24:34 PM
to tesser...@googlegroups.com
Thank you Merlijn, this is very helpful.  I'm very interested in IA's process so I'll have a deep dive through those tools.  This confirms my suspicions that there's no way to use an off-the-shelf text editor with a glyphless font. I'll explore these hOCR editor options. All the best,


Merlijn B.W. Wajer

Mar 8, 2024, 2:38:42 PM
to tesser...@googlegroups.com
Hi Mark,

On 08/03/2024 20:24, Mark Pellegrino wrote:
> Thank you Merlijn, this is very helpful. I'm very interested in IA's
> process so I'll have a deep dive through those tools.  This confirms my
> suspicions that there's no way to use an off-the-shelf text editor with
> a glyphless font. I'll explore these hOCR editor options. All the best,

As I understand it, the main reason that there is no 'editor' for PDFs
with text is that the text in PDFs is inherently not structured in a
hierarchical manner, so by going from hOCR (or another format) -> PDF
text you lose a lot of structure. Even the PDF text reading order might
differ per PDF renderer - it's just text rendered in a coordinate space,
so it's not a particularly good fit for 'editing'.

Regards,
Merlijn
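
To illustrate the structure that hOCR keeps and a PDF text layer does not: in hOCR, words are nested inside line elements, so reading order can be recovered by walking the tree instead of guessing from coordinates. The following is a minimal sketch only. The function name and the sample markup are illustrative assumptions, it assumes the hOCR parses as well-formed XML, and it only looks at the `ocr_line` and `ocrx_word` classes.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the general shape of hOCR output (illustrative only).
SAMPLE_HOCR = """<div class='ocr_page' title='bbox 0 0 1000 1400'>
<p class='ocr_par'>
<span class='ocr_line' title='bbox 100 200 900 240'>
<span class='ocrx_word' title='bbox 100 200 300 240'>Gallia</span>
<span class='ocrx_word' title='bbox 320 200 400 240'>est</span>
<span class='ocrx_word' title='bbox 420 200 600 240'>omnis</span>
</span>
<span class='ocr_line' title='bbox 100 260 900 300'>
<span class='ocrx_word' title='bbox 100 260 300 300'>divisa</span>
<span class='ocrx_word' title='bbox 320 260 380 300'>in</span>
<span class='ocrx_word' title='bbox 400 260 600 300'>partes</span>
</span>
</p>
</div>"""


def lines_from_hocr(hocr_text):
    """Return one string per ocr_line, with its words joined in document order."""
    root = ET.fromstring(hocr_text)
    lines = []
    for element in root.iter():
        if element.get("class") != "ocr_line":
            continue
        # Words are descendants of the line, so the grouping comes for free.
        words = [
            "".join(word.itertext()).strip()
            for word in element.iter()
            if word.get("class") == "ocrx_word"
        ]
        lines.append(" ".join(words))
    return lines


for line in lines_from_hocr(SAMPLE_HOCR):
    print(line)
```

Once the same words are written into a PDF as positioned text, this line grouping is no longer stored explicitly, which is the loss of structure described above.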

Zdenko Podobny

Mar 9, 2024, 12:51:59 PM
to tesser...@googlegroups.com
"there's no way to use an off-the-shelf text editor with a glyphless font."

    tesseract 8087_054.3B.tif 8087_054.3B pdf

I could open 8087_054.3B.pdf without a problem in Adobe Acrobat Pro Version 2023.008.20555 64-bit (on Windows 11).
However, it seems that it ignored the Tesseract text layer and ran its own text recognition (including font identification).

I tried opening 8087_054.3B.pdf at https://www.pdffiller.com/jsfiller-desk14/?flat_pdf_quality and could modify the text:

image.png

Also https://tinywow.com/pdf/edit seems to work:

image.png

IMO, if a PDF tool offers text editing, it should work with Tesseract output too.

BR,

Zdenko



Mark Pellegrino

Mar 13, 2024, 11:25:06 AM
to tesser...@googlegroups.com
Hi Zdenko,

Thank you so much for your continued interest. I'll provide a little more context: I work for a rare book library in Canada and I have around 10,000 pages of digitized, hand-written Latin manuscripts that I'm trying to OCR.

I normally use Abbyy OCR Editor, which has good recognition but struggles with Latin, particularly with ligatures or antiquated characters like the long s (ſ). Tesseract, used with the training data available from latirocr.org, has much better recognition, near perfect. However, my issue with Tesseract is that I am unable to define a recognition area in the image, so many unwanted elements on the page (smudges, pen marks, tears, decorative elements, etc.) are also recognized as jumbled characters. I understand that I can preprocess the image in Photoshop to remove these unwanted elements, generate hOCR with Tesseract, then merge the hOCR with the original unprocessed image, but at my scale that's particularly laborious. I was hoping to OCR all of the images and then use an OCR editor like Acrobat or Abbyy to edit out any unwanted characters and inspect the OCR for accuracy, but it appears that Tesseract's use of a glyphless font makes that impossible.

Here's what happens if I try to open a Tesseract-made PDF in Acrobat. Like you mentioned, it opens just fine, but when the 'Make OCR Visible' option is enabled all of the text turns into black boxes (it's not an issue of redaction). My understanding is that because of the lack of any embedded font information in the file, Acrobat can't make sense of the text layer because there are no associated glyphs to present on screen. Tesseract PDFs won't open in Abbyy OCR Editor or FineReader at all, I'm guessing for the same reason.
tesseract ocr in acrobat.PNG

Thanks for reading. I'll look further into hocr editing tools. I'm hoping other institutions can share their procedures for similar projects.

All the best,
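
One way to approach the stray-mark problem described above without retouching every image is to post-filter the hOCR: drop words whose bounding box falls outside a chosen text region, or whose confidence is low. The following is a minimal sketch, not a workflow anyone in this thread has tested. The function names, the sample markup, the region, and the confidence threshold are all illustrative assumptions, and it assumes the hOCR parses as well-formed XML.

```python
import re
import xml.etree.ElementTree as ET

# Keep the XHTML namespace as the default when real hOCR files are re-serialized.
ET.register_namespace("", "http://www.w3.org/1999/xhtml")

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")
CONF_RE = re.compile(r"x_wconf (\d+(?:\.\d+)?)")

# Hand-written sample in the general shape of hOCR output (illustrative only).
SAMPLE_PAGE = """<div class='ocr_page' title='bbox 0 0 1000 1400'>
<span class='ocr_line' title='bbox 100 200 900 240'>
<span class='ocrx_word' title='bbox 100 200 300 240; x_wconf 96'>Gallia</span>
<span class='ocrx_word' title='bbox 320 200 400 240; x_wconf 93'>est</span>
<span class='ocrx_word' title='bbox 950 1350 990 1390; x_wconf 21'>~|</span>
</span>
</div>"""


def parse_title(title):
    """Extract ((x0, y0, x1, y1), confidence) from an hOCR title attribute."""
    bbox_match = BBOX_RE.search(title)
    conf_match = CONF_RE.search(title)
    bbox = tuple(int(v) for v in bbox_match.groups()) if bbox_match else None
    conf = float(conf_match.group(1)) if conf_match else None
    return bbox, conf


def inside(bbox, region):
    """True if bbox lies entirely within region (both are x0, y0, x1, y1)."""
    x0, y0, x1, y1 = bbox
    rx0, ry0, rx1, ry1 = region
    return x0 >= rx0 and y0 >= ry0 and x1 <= rx1 and y1 <= ry1


def filter_words(hocr_text, region, min_conf=0.0):
    """Remove ocrx_word elements outside region or below min_conf.

    Returns the filtered markup and the list of removed word strings.
    """
    root = ET.fromstring(hocr_text)
    # ElementTree has no parent pointers, so build a child -> parent map.
    parents = {child: parent for parent in root.iter() for child in parent}
    removed = []
    for element in list(root.iter()):
        if element.get("class") != "ocrx_word":
            continue
        bbox, conf = parse_title(element.get("title", ""))
        if bbox is None:
            continue  # no geometry to judge by, so leave the word alone
        keep = inside(bbox, region) and (conf is None or conf >= min_conf)
        if not keep:
            removed.append("".join(element.itertext()))
            parents[element].remove(element)
    return ET.tostring(root, encoding="unicode"), removed


cleaned, removed = filter_words(SAMPLE_PAGE, (50, 100, 920, 1300), min_conf=60)
print(removed)
```

The region would have to be chosen per page or per volume, and a confidence threshold can also discard genuine but hard-to-read words, so the removed list is worth reviewing rather than trusting blindly.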


Art Rhyno

Mar 13, 2024, 3:00:35 PM
to tesser...@googlegroups.com

In addition to hOCR, Tesseract can produce the ALTO format, which allows the use of the Aletheia editor [1] from the PRImA folks. I haven't done much correction of hand-written materials, but Aletheia seems flexible for a Windows environment and exports the PAGE format. You can also start with hOCR and/or round-trip between ALTO, hOCR, PAGE, and other XML formats with the ocr-fileformat project [2], which includes some PRImA plumbing. Merlijn and the IA folks have great tools for combining hOCR and images to make a lightweight PDF if that's your end goal [3].

 

Best,
art

---

1. https://www.primaresearch.org/tools/Aletheia

2. https://github.com/UB-Mannheim/ocr-fileformat

3. https://git.archive.org/merlijn/archive-pdf-tools

 


Mark Pellegrino

Mar 15, 2024, 3:12:39 PM
to tesser...@googlegroups.com
Hi Art,

Thanks so much for this. These are very intriguing tools. I'll definitely give Aletheia a try. It seems more suited to my needs than Abbyy. I'll report back once I've done some experimentation.

Best,
Mark

Jeremiah

Mar 30, 2024, 3:41:09 AM
to tesseract-ocr
You can proofread and correct .hocr files made by Tesseract using scribeocr.com, an open source program I wrote to address difficulties proofreading OCR data. A video demo can be seen here, and the GitHub repo is at https://github.com/scribeocr/scribeocr. The program positions the glyphs precisely over the source image, which (in my experience) reduces the time spent proofreading by 90% versus other methods. A screenshot is below.

scribe_screenshot.PNG


Proofreading .pdfs created by Tesseract is unfortunately not possible, given that (as you experienced personally) the precise glyph metrics and positioning data are lost when exporting to .pdf. However, if you upload the source image alongside a .hocr file from Tesseract (with `hocr_char_boxes: '1'` to include glyph-level data), the program should have much more information to position glyphs with. After proofreading is done, a .pdf can be exported using the site. Alternatively, you can run recognition directly in the browser using a built-in build of Tesseract, which will produce the most accurate overlay due to several changes to Tesseract to improve positioning. The site is still under active development, so if you try it and experience any issues, please let me know via a GitHub issue or email to ad...@scribeocr.com
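
For batch work from the command line, the same option can be passed to the `tesseract` executable as `-c hocr_char_boxes=1` together with the `hocr` output config. The following is a minimal sketch for scripting that; the helper names and file names are hypothetical, and it assumes `tesseract` is installed and on the PATH.

```python
import subprocess


def build_hocr_command(image_path, output_base, lang=None, char_boxes=True):
    """Build the argument list for a tesseract run that writes <output_base>.hocr."""
    command = ["tesseract", image_path, output_base]
    if lang:
        # Name of the traineddata to use, e.g. a custom Latin model (assumption).
        command += ["-l", lang]
    if char_boxes:
        # Ask for character-level boxes in the hOCR output.
        command += ["-c", "hocr_char_boxes=1"]
    command.append("hocr")
    return command


def run_hocr(image_path, output_base, **options):
    """Run tesseract and return the path of the hOCR file it should have written."""
    subprocess.run(build_hocr_command(image_path, output_base, **options), check=True)
    return output_base + ".hocr"


print(" ".join(build_hocr_command("page_0001.tif", "page_0001")))
```

Calling `run_hocr("page_0001.tif", "page_0001")` in a loop over a directory of scans would produce one .hocr file per page, which can then be loaded next to its source image.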

Zdenko Podobny

Mar 30, 2024, 3:25:34 PM
to tesser...@googlegroups.com
Hello Jeremiah,

This looks like a very interesting and nice app. Are there any instructions for installation?

I just downloaded the code from GitHub, but text recognition doesn't work for me:

image.png

BR,


Zdenko



Jeremiah

Mar 31, 2024, 4:18:09 AM
to tesseract-ocr
There currently is no desktop application, so running requires either (1) using the public site on scribeocr.com or (2) serving the files on your local system using an HTTP server. I added instructions to the README for running locally, which I will also paste below.

    git clone --recursive https://github.com/scribeocr/scribeocr.git
    cd scribeocr
    npm i
    npx http-server

The site can then be visited from a browser at the location printed by `npx http-server`.

Mark Pellegrino

Apr 10, 2024, 3:03:19 PM
to tesseract-ocr
Hi Jeremiah,

Thanks so much, this is a fantastic tool. I just tried using the Scribe OCR website to edit an hocr file that was generated with Tesseract against its source image, and it worked perfectly. I was also able to make some edits then successfully generate and download a PDF containing the image and edited text. This is great, just what I needed.

The only issue that I ran into was that it doesn't seem to support the Latin characters and ligatures that I need, like æ, œ, ſ, etc. That's probably not a complicated fix on my end; I'll just have to dig around in the source code. If you could point me in the right direction, it would be greatly appreciated.

Thanks again for your hard work on this, I'll certainly be in touch with more questions about Scribe. 

 Mark

Greg Jay

Apr 10, 2024, 7:44:04 PM
to tesser...@googlegroups.com
Try this: https://ocr.sanskritdictionary.com/#

It worked for the glyphs you mentioned for me.

Greg

Jeremiah

Apr 11, 2024, 10:38:19 PM
to tesseract-ocr
Hi Mark,

Glad you found Scribe OCR useful.  Regarding character support, all characters in the Windows-1252 character set should currently be supported.  This includes æ and œ, so if you encountered issues with those characters that can be replicated, please let me know and I can investigate.  Unfortunately, the ſ character is not included.

Including characters outside of this set is actually fairly involved, as it requires switching to a different encoding and embedded font format when writing the PDF (from simple [type 1] to composite [type 0]).  However, I am already working on implementing this as it is required to support non-Latin languages, so it will probably be possible to add characters outside of the Windows-1252 set at some point in the next month. 

-Jeremiah
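
To check a transcription against that limit before exporting, each character can be tested against the Windows-1252 code page. The following is a minimal sketch using Python's built-in `cp1252` codec; the function name is hypothetical and this is not part of Scribe OCR.

```python
def unsupported_chars(text, encoding="cp1252"):
    """Return the distinct characters in text that the encoding cannot represent."""
    missing = []
    for char in text:
        try:
            char.encode(encoding)
        except UnicodeEncodeError:
            if char not in missing:
                missing.append(char)
    return missing


print(unsupported_chars("æ œ ſ"))
```

For the three characters mentioned in this thread, only the long s (ſ) is reported as outside the set, which matches the description above.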
