Newspaper segmentation techniques

shacky

Nov 25, 2023, 11:12:04 AM

to tesser...@googlegroups.com

Hello everyone,

I’m using Tesseract to OCR some newspapers, and it works very well.

I am researching how to automatically segment a newspaper page into its individual articles by post-processing the hOCR files that Tesseract generates. So far I have found some academic papers that discuss neural networks and machine learning techniques.

I am writing because I wonder whether there are any “de facto” standard techniques for this, or any ready-to-run programs that do this kind of post-processing on Tesseract output.

I know this may not be strictly related to Tesseract, but I cannot find a better place to ask.

Could you help me, please? Do you have any ideas or hints about how or where to start?

Thank you very much!

Bye

Art Rhyno

Nov 28, 2023, 1:24:57 PM

to tesser...@googlegroups.com

There are some interesting tools from Ben Lee's Newspaper Navigator project that you might want to look at [1], and the OCR-D project includes both Tesseract and layout detection support [2]. A good and fairly recent overview of what's available for historical newspaper digitization projects can be found here [3]. If you want to work from hOCR files directly, you might be able to use font metrics to identify headlines and advertisements, since their text is typically larger than body text, but I think most approaches to newspaper segmentation work from the page images.
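As a rough illustration of the font-metrics idea, here is a minimal Python sketch, not a tested segmentation tool. It assumes Tesseract-style hOCR, where line-level elements (`ocr_line`, `ocr_header`, `ocr_caption`, `ocr_textfloat`) carry a `bbox` and usually an `x_size` value in their `title` attribute. The names `HocrLineParser` and `headline_candidates`, the sample markup, and the 1.5 ratio are all made up for illustration and would need tuning on real pages.

```python
import re
from html.parser import HTMLParser
from statistics import median


class HocrLineParser(HTMLParser):
    """Collect text, bounding box, and a size estimate for each hOCR line."""

    # Assumption: these are the line-level classes present in the hOCR file.
    LINE_CLASSES = {"ocr_line", "ocr_header", "ocr_caption", "ocr_textfloat"}

    def __init__(self):
        super().__init__()
        self.lines = []       # finished lines: {"text", "bbox", "size"}
        self._current = None  # line being collected, if any
        self._depth = 0       # nesting depth inside the current line element

    def handle_starttag(self, tag, attrs):
        if self._current is not None:
            self._depth += 1  # a nested element, e.g. an ocrx_word span
            return
        attrs = dict(attrs)
        if not set((attrs.get("class") or "").split()) & self.LINE_CLASSES:
            return
        title = attrs.get("title") or ""
        bbox = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", title)
        if bbox is None:
            return
        x0, y0, x1, y1 = map(int, bbox.groups())
        x_size = re.search(r"x_size ([\d.]+)", title)
        # Prefer x_size when present; otherwise fall back to the box height.
        size = float(x_size.group(1)) if x_size else float(y1 - y0)
        self._current = {"parts": [], "bbox": (x0, y0, x1, y1), "size": size}
        self._depth = 1

    def handle_endtag(self, tag):
        if self._current is None:
            return
        self._depth -= 1
        if self._depth == 0:
            text = " ".join("".join(self._current["parts"]).split())
            self.lines.append({"text": text,
                               "bbox": self._current["bbox"],
                               "size": self._current["size"]})
            self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self._current["parts"].append(data)


def headline_candidates(hocr, ratio=1.5):
    """Return lines whose size is at least `ratio` times the median line size."""
    parser = HocrLineParser()
    parser.feed(hocr)
    lines = [line for line in parser.lines if line["text"]]
    if not lines:
        return []
    typical = median(line["size"] for line in lines)
    return [line for line in lines if line["size"] >= ratio * typical]


# Hand-written sample in the style of Tesseract hOCR output (not real data).
SAMPLE_HOCR = """
<div class='ocr_page' title='bbox 0 0 1000 1400'>
 <span class='ocr_line' title='bbox 50 40 900 110; x_size 60'>
  <span class='ocrx_word' title='bbox 50 40 200 110'>City</span>
  <span class='ocrx_word' title='bbox 220 40 460 110'>Council</span>
  <span class='ocrx_word' title='bbox 480 40 720 110'>Approves</span>
  <span class='ocrx_word' title='bbox 740 40 900 110'>Budget</span>
 </span>
 <span class='ocr_line' title='bbox 50 140 480 165; x_size 20'>The council met on Tuesday</span>
 <span class='ocr_line' title='bbox 50 170 480 196; x_size 21'>to vote on next year's plan</span>
 <span class='ocr_line' title='bbox 50 200 480 225; x_size 20'>after a long public debate.</span>
</div>
"""

if __name__ == "__main__":
    for line in headline_candidates(SAMPLE_HOCR):
        print(line["bbox"], line["text"])
```

Comparing against the page median keeps the threshold relative to each scan's resolution. This only flags large-type lines; grouping body text under a headline would still need column and reading-order information, which is where the image-based approaches come in.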

Best,

art

---

1. https://github.com/LibraryOfCongress/newspaper-navigator

2. https://ocr-d.github.io

3. https://drops.dagstuhl.de/entities/document/10.4230/DagRep.12.7.112
