There are some interesting tools from Ben Lee's Newspaper Navigator project that you might want to look at [1], and the OCR-D project includes both Tesseract and layout detection support [2]. A good and fairly recent overview of what's available for historical newspaper digitization projects can be found in [3]. If you want to work from hOCR files directly, you might be able to use font metrics to identify headlines and advertisements, since their text is typically set larger than the body text, but I think most approaches to newspaper segmentation work from the page images.
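In case it helps, here is a rough sketch of the font-metric idea, not a tested segmentation method. It assumes the hOCR file has line-level spans (ocr_line, ocr_header, ocr_caption, ocr_textfloat) whose title attribute carries a bbox and, as Tesseract's hOCR output usually does, an x_size value; when x_size is missing it falls back to the bbox height. The names HocrLineParser and find_large_lines and the 1.5 ratio are my own placeholders, and the threshold would need tuning per newspaper.

```python
import re
from html.parser import HTMLParser
from statistics import median

# Line-level hOCR classes to inspect (assumption: the file uses these).
LINE_CLASSES = {"ocr_line", "ocr_header", "ocr_caption", "ocr_textfloat"}
BBOX_RE = re.compile(r"bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")
XSIZE_RE = re.compile(r"x_size\s+(\d+(?:\.\d+)?)")


class HocrLineParser(HTMLParser):
    """Collect text, bbox and an approximate font size for each hOCR line."""

    def __init__(self):
        super().__init__()
        self.lines = []       # finished lines, in document order
        self._current = None  # line currently being read, if any
        self._depth = 0       # nested <span> depth inside the current line

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        if self._current is not None:
            self._depth += 1  # a word span nested inside the line span
            return
        attr = dict(attrs)
        classes = set((attr.get("class") or "").split())
        if not classes & LINE_CLASSES:
            return
        title = attr.get("title") or ""
        bbox = BBOX_RE.search(title)
        if bbox is None:
            return
        x0, y0, x1, y1 = map(int, bbox.groups())
        xsize = XSIZE_RE.search(title)
        # Prefer x_size; otherwise use the line's bbox height as a proxy.
        size = float(xsize.group(1)) if xsize else float(y1 - y0)
        self._current = {"bbox": (x0, y0, x1, y1), "size": size, "parts": []}

    def handle_endtag(self, tag):
        if tag != "span" or self._current is None:
            return
        if self._depth > 0:
            self._depth -= 1
            return
        line, self._current = self._current, None
        line["text"] = " ".join("".join(line.pop("parts")).split())
        if line["text"]:
            self.lines.append(line)

    def handle_data(self, data):
        if self._current is not None:
            self._current["parts"].append(data)


def find_large_lines(hocr_text, ratio=1.5):
    """Return lines whose size is at least `ratio` times the page median."""
    parser = HocrLineParser()
    parser.feed(hocr_text)
    parser.close()
    if not parser.lines:
        return []
    typical = median(line["size"] for line in parser.lines)
    return [line for line in parser.lines if line["size"] >= ratio * typical]
```

You would read the hOCR file into a string and pass it to find_large_lines; each returned entry has the line text, its bbox and the size used. This only flags candidate headline or display-ad lines. Grouping them into article or advertisement blocks is a separate step, and that is where the image-based tools tend to do better.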
Best,
art
---
1. https://github.com/LibraryOfCongress/newspaper-navigator
3. https://drops.dagstuhl.de/entities/document/10.4230/DagRep.12.7.112
From: tesser...@googlegroups.com <tesser...@googlegroups.com>
On Behalf Of shacky
Sent: Saturday, November 25, 2023 10:55 AM
To: tesser...@googlegroups.com
Subject: [tesseract-ocr] Newspaper segmentation techniques