--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/b43c0ea6-fd81-49af-b74f-e93b0a682574n%40googlegroups.com.
In addition to hOCR, Tesseract can produce the ALTO format, which allows the use of the Aletheia editor [1] from the PRImA folks. I haven’t done much correction of handwritten materials, but Aletheia seems flexible for a Windows environment and exports the PAGE format. You can also start with hOCR and/or round-trip between ALTO, hOCR, PAGE, and other XML formats with the ocr-fileformat project [2], which includes some PRImA plumbing. Merlijn and the IA folks have great tools for combining hOCR and images to make a lightweight PDF if that’s your end goal [3].
Best,
art
---
1. https://www.primaresearch.org/tools/Aletheia
2. https://github.com/UB-Mannheim/ocr-fileformat
3. https://git.archive.org/merlijn/archive-pdf-tools
From: tesser...@googlegroups.com <tesser...@googlegroups.com>
On Behalf Of Mark Pellegrino
Sent: Wednesday, March 13, 2024 11:25 AM
To: tesser...@googlegroups.com
Subject: Re: [tesseract-ocr] Re: Post OCR Verification and Editing
Hi Zdenko,
Thank you so much for your continued interest. I'll provide a little more context: I work for a rare book library in Canada and I have around 10,000 pages of digitized, handwritten Latin manuscripts that I'm trying to OCR.
I normally use Abbyy OCR Editor, which has good recognition but struggles with Latin, particularly with ligatures or antiquated characters like the long s. Tesseract, used with the training data available from latirocr.org, has much better recognition, near perfect. However, my issue with Tesseract is that I am unable to define a recognition area in the image, so many unwanted elements on the page (smudges, pen marks, tears, decorative elements, etc.) are also recognized as jumbled characters. I understand that I can preprocess the image in Photoshop to remove these unwanted elements, then generate hOCR with Tesseract, then merge the hOCR with the original unprocessed image, but at my scale that's particularly laborious. I was hoping to OCR all of the images and then use an OCR editor like Acrobat or Abbyy to edit out any unwanted characters or inspect the OCR for accuracy, but it appears that Tesseract's use of a glyphless font makes that impossible.
Here's what happens if I try to open a Tesseract-made PDF in Acrobat. Like you mentioned, it opens just fine, but when the 'Make OCR Visible' option is enabled all of the text turns into black boxes (it's not an issue of redaction). My understanding is that because of the lack of any embedded font information in the file, Acrobat can't make sense of the text layer because there are no associated glyphs to present on screen. Tesseract PDFs won't open in Abbyy OCR Editor or FineReader at all, I'm guessing for the same reason.
Thanks for reading. I'll look further into hocr editing tools. I'm hoping other institutions can share their procedures for similar projects.
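For anyone following along, here is a rough sketch of how the "edit out unwanted characters" step might be scripted against hOCR rather than done by hand. It assumes hOCR as Tesseract usually writes it, with each word in an element of class `ocrx_word` whose `title` attribute carries `bbox x0 y0 x1 y1` and `x_wconf`. The function names, the region rectangle, and the confidence threshold are all made up for illustration and would need tuning per collection.

```python
import re
import xml.etree.ElementTree as ET

# Keep XHTML as the default namespace when serializing, so the output
# is not rewritten with ns0: prefixes.
ET.register_namespace("", "http://www.w3.org/1999/xhtml")

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")
CONF_RE = re.compile(r"x_wconf (\d+(?:\.\d+)?)")


def parse_title(title):
    """Return ((x0, y0, x1, y1), confidence) from an hOCR title attribute.

    Either value is None when the property is missing.
    """
    bbox_match = BBOX_RE.search(title)
    conf_match = CONF_RE.search(title)
    bbox = tuple(int(v) for v in bbox_match.groups()) if bbox_match else None
    conf = float(conf_match.group(1)) if conf_match else None
    return bbox, conf


def inside(bbox, region):
    """True when bbox lies entirely within region (both are x0, y0, x1, y1)."""
    x0, y0, x1, y1 = bbox
    rx0, ry0, rx1, ry1 = region
    return x0 >= rx0 and y0 >= ry0 and x1 <= rx1 and y1 <= ry1


def blank_unwanted_words(hocr_text, region, min_conf=0.0):
    """Blank the text of words outside `region` or below `min_conf`.

    `region` and `min_conf` are hypothetical knobs, not Tesseract settings.
    Returns (cleaned_hocr, number_of_words_blanked).
    """
    root = ET.fromstring(hocr_text)
    blanked = 0
    for elem in root.iter():
        if elem.get("class") != "ocrx_word":
            continue
        bbox, conf = parse_title(elem.get("title", ""))
        if bbox is None:
            continue  # no geometry to judge by; leave the word alone
        too_uncertain = conf is not None and conf < min_conf
        if not inside(bbox, region) or too_uncertain:
            # Drop nested markup (e.g. <strong>) together with the text.
            for child in list(elem):
                elem.remove(child)
            elem.text = ""
            blanked += 1
    return ET.tostring(root, encoding="unicode"), blanked
```

The word elements are kept but emptied, so the page structure stays intact for whatever tool later merges the hOCR with the page image. A margin-only region would not catch smudges inside the text block, which is why the confidence threshold is there as a second, cruder filter.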
All the best,
On Sat, Mar 9, 2024 at 12:52 PM Zdenko Podobny <zde...@gmail.com> wrote:
" there's no way to use an off-the-shelf text editor with a glyphless font."
I converted https://github.com/tesseract-ocr/test/blob/main/testing/8087_054.3B.tif to pdf
tesseract 8087_054.3B.tif 8087_054.3B pdf
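For batch work, the same invocation can be assembled from a script. This is only a sketch: the helper name and its parameters are hypothetical, and it assumes the usual `tesseract imagename outputbase [options] [configfile...]` calling convention, with `hocr`, `alto`, `pdf`, `tsv`, and `txt` available as output config files in the local install.

```python
# Output config names assumed to be present in the local Tesseract install.
KNOWN_OUTPUTS = ("txt", "pdf", "hocr", "alto", "tsv")


def tesseract_command(image, output_base, outputs=("pdf",), lang=None):
    """Build the argument list for one Tesseract run without executing it.

    `outputs` lists the output config files to request; `lang` is passed
    through to -l when given.
    """
    for fmt in outputs:
        if fmt not in KNOWN_OUTPUTS:
            raise ValueError(f"unknown output format: {fmt}")
    cmd = ["tesseract", image, output_base]
    if lang:
        cmd += ["-l", lang]
    cmd += list(outputs)
    return cmd
```

With the defaults this reproduces the command above; passing `outputs=("hocr", "alto")` requests the XML formats instead. The returned list can be handed to `subprocess.run(cmd, check=True)` once Tesseract is on the PATH.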
I could open 8087_054.3B.pdf without a problem in Adobe Acrobat Pro Version 2023.008.20555 64 bit (on Windows 11).
However, it seems that it ignored Tesseract's text layer and ran its own text recognition (including font identification).
I also opened 8087_054.3B.pdf at https://www.pdffiller.com/jsfiller-desk14/?flat_pdf_quality and could modify the text there.