How to extract non-text regions


Zisha

Feb 3, 2023, 1:53:24 AM
to tesseract-ocr
I want to OCR documents containing images, figures, etc. Is there a way to detect non-text items and extract them to png, and then OCR the rest of the document?

Muneeb Khurram

Feb 3, 2023, 2:13:48 AM
to tesser...@googlegroups.com

You can use Layout Parser in Python. 

Zdenko Podobny

Feb 4, 2023, 4:26:45 AM
to tesser...@googlegroups.com
The task you describe is called "document layout segmentation" or "document layout analysis" (https://en.wikipedia.org/wiki/Document_layout_analysis).

As Muneeb mentioned, you can try https://layout-parser.github.io/. https://github.com/qurator-spk/eynollah also looks promising.
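
Whichever detector you pick, the step after detection could look roughly like the sketch below: split the detected regions into non-text boxes (to crop and save as PNG) and text boxes (to pass to OCR). This is only an illustration under assumptions, not the API of Layout Parser or eynollah. The (label, box) input format, the label names ("Figure", "Text", "Title") and the helpers clamp_box and split_regions are all hypothetical; adapt them to whatever your detector actually returns.

```python
from typing import Iterable, List, Tuple

# (left, top, right, bottom) in pixels, origin at the top-left of the page.
Box = Tuple[int, int, int, int]

# Assumption: the detector labels picture-like regions "Figure".
# Add other labels (e.g. "Table") if you want them exported as images too.
NON_TEXT_LABELS = {"Figure"}


def clamp_box(box: Box, page_size: Tuple[int, int], pad: int = 0) -> Box:
    """Grow a box by `pad` pixels on each side and clip it to the page."""
    left, top, right, bottom = box
    width, height = page_size
    return (
        max(0, left - pad),
        max(0, top - pad),
        min(width, right + pad),
        min(height, bottom + pad),
    )


def split_regions(
    regions: Iterable[Tuple[str, Box]],
    page_size: Tuple[int, int],
    pad: int = 0,
) -> Tuple[List[Box], List[Box]]:
    """Return (non_text_boxes, text_boxes), each padded and clipped."""
    non_text: List[Box] = []
    text: List[Box] = []
    for label, box in regions:
        target = non_text if label in NON_TEXT_LABELS else text
        target.append(clamp_box(box, page_size, pad))
    return non_text, text


if __name__ == "__main__":
    # Hypothetical detector output for a 595x800 pixel page.
    detected = [
        ("Title", (40, 20, 550, 60)),
        ("Figure", (60, 80, 540, 400)),
        ("Text", (40, 420, 550, 780)),
    ]
    figures, text_blocks = split_regions(detected, (595, 800), pad=5)
    print("save as PNG:", figures)
    print("send to OCR:", text_blocks)
```

The boxes use the same (left, top, right, bottom) order that Pillow's Image.crop expects, so each non-text box can be cropped and saved as PNG, and each text box can be cropped and handed to Tesseract (for example through pytesseract.image_to_string).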


More code and tools can be found via GitHub topics:

Zdenko

