How to extract non-text regions

Zisha

Feb 3, 2023, 1:53:24 AM

to tesseract-ocr

I want to OCR documents containing images, figures, etc. Is there a way to detect non-text items and extract them to png, and then OCR the rest of the document?

Muneeb Khurram

Feb 3, 2023, 2:13:48 AM

to tesser...@googlegroups.com

You can use Layout Parser in Python.
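Once a layout detector has returned labelled boxes, the remaining work is to crop each non-text box to a PNG and blank it out of the page before OCR. Below is a minimal sketch of that step only. It does not call Layout Parser itself: the `Region` type, the `TEXT_LABELS` set, and the helper names are hypothetical, and it assumes the detector output has already been converted to pixel boxes of the form (left, top, right, bottom). Pillow is assumed for the image handling.

```python
from dataclasses import dataclass
from pathlib import Path

# Labels treated as text; everything else is cropped out. This set is an
# assumption and depends on the label map of whichever detector you use.
TEXT_LABELS = {"Text", "Title", "List"}


@dataclass(frozen=True)
class Region:
    """One detected block: a label plus a pixel box (left, top, right, bottom)."""
    label: str
    box: tuple


def clamp_box(box, width, height):
    """Truncate a box to whole pixels and clip it to the page bounds."""
    left, top, right, bottom = box
    return (
        max(0, int(left)),
        max(0, int(top)),
        min(width, int(right)),
        min(height, int(bottom)),
    )


def split_regions(regions):
    """Return (text_regions, non_text_regions), preserving input order."""
    text = [r for r in regions if r.label in TEXT_LABELS]
    non_text = [r for r in regions if r.label not in TEXT_LABELS]
    return text, non_text


def extract_and_mask(image_path, regions, out_dir):
    """Save each non-text region as a PNG and return (masked_page, saved_paths).

    The masked page is a copy of the input with non-text regions painted
    white, so an OCR engine only sees the remaining text.
    """
    # Imported here so the helpers above work without Pillow installed.
    from PIL import Image, ImageDraw  # third-party: pip install Pillow

    page = Image.open(image_path).convert("RGB")
    masked = page.copy()
    draw = ImageDraw.Draw(masked)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    _, non_text = split_regions(regions)
    saved = []
    for i, region in enumerate(non_text):
        left, top, right, bottom = clamp_box(region.box, page.width, page.height)
        if right <= left or bottom <= top:
            continue  # box lies outside the page or is empty
        target = out_dir / f"{i:03d}_{region.label.lower()}.png"
        page.crop((left, top, right, bottom)).save(target)
        # ImageDraw rectangles include the far corner, hence the -1.
        draw.rectangle((left, top, right - 1, bottom - 1), fill="white")
        saved.append(target)
    return masked, saved
```

The returned `masked` image can then be handed to an OCR wrapper, for example `pytesseract.image_to_string(masked)` if pytesseract is installed. Painting the regions white rather than cutting them out keeps the page geometry intact, so any coordinates Tesseract reports still match the original scan.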

Zdenko Podobny

Feb 4, 2023, 4:26:45 AM

to tesser...@googlegroups.com

The task you mention is called "document layout segmentation" or "document layout analysis" (https://en.wikipedia.org/wiki/Document_layout_analysis).

As Muneeb mentioned, you can try https://layout-parser.github.io/; https://github.com/qurator-spk/eynollah also looks promising.

If you would like to do custom training, have a look at https://towardsdatascience.com/object-detection-on-newspaper-images-using-yolov3-85acfa563080

More code and tools can be found via GitHub topics:

https://github.com/topics/document-layout-analysis

Zdenko
