Error in Layout Analysis with Tesseract OCR 4.0.0alpha

Nirajan Pant

Aug 23, 2017, 12:33:28 AM

to tesseract-ocr

I am working on a GUI for Tesseract OCR 4.0.0 (Nepali language). When I started analyzing the recognition results, I found that some words or sentences were missing. To find the reason, I drew the boxes that Tesseract detected, taken from the hOCR recognition output. The detection is shown here:

This is part of a document with a paragraph detection error. The red line is the boundary of the detected paragraph (the second column of the original image given below).

The original image is:

Please help me deal with this issue.
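For anyone who wants to reproduce the box check described above, here is a minimal sketch of pulling bounding boxes out of hOCR text so they can be drawn over the page image. The `Box` struct and `ExtractBoxes` function are hypothetical names made up for this illustration, and the regex assumes the `class` attribute appears before `title` inside each tag, which is an assumption about the hOCR being parsed. It is not a full HTML parser.

```cpp
#include <regex>
#include <string>
#include <vector>

// Axis-aligned box in image pixel coordinates, as written in an hOCR
// "bbox x0 y0 x1 y1" property.
struct Box {
    int x0, y0, x1, y1;
};

// Collects the bounding boxes of all elements with the given hOCR class,
// for example "ocr_par" for paragraphs or "ocrx_word" for words.
// Assumption: class comes before title within the same tag.
std::vector<Box> ExtractBoxes(const std::string& hocr,
                              const std::string& css_class) {
    const std::regex pattern(
        "class=['\"]" + css_class +
        "['\"][^>]*title=['\"][^'\"]*bbox (\\d+) (\\d+) (\\d+) (\\d+)");
    std::vector<Box> boxes;
    for (std::sregex_iterator it(hocr.begin(), hocr.end(), pattern), end;
         it != end; ++it) {
        const std::smatch& m = *it;
        boxes.push_back({std::stoi(m.str(1)), std::stoi(m.str(2)),
                         std::stoi(m.str(3)), std::stoi(m.str(4))});
    }
    return boxes;
}
```

Calling `ExtractBoxes(hocr, "ocr_par")` and `ExtractBoxes(hocr, "ocrx_word")` on the same output gives the paragraph and word rectangles separately, which makes it easier to see which regions of the page received no box at all.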

ShreeDevi Kumar

Aug 23, 2017, 7:45:32 AM

to tesser...@googlegroups.com

You could try doing your own layout analysis instead of relying on Tesseract's auto mode.

Have you tried gImageReader and VietOCR as GUI front ends for Tesseract with Nepali?

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscribe@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/ae0aa097-93ba-4424-baf5-b4ed93ca574a%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Nirajan Pant

Aug 23, 2017, 11:46:51 AM

to tesseract-ocr

Yes, I have tried both gImageReader and VietOCR as GUI front ends for Tesseract with Nepali. The results from both GUIs skip words.

ShreeDevi Kumar

Aug 23, 2017, 12:05:26 PM

to tesser...@googlegroups.com

Skipping words is an issue in Tesseract itself. Amit has a proposed patch for it. Look in the Tesseract issues.

You can see whether it helps in your case.

-- Excuse the brevity, msg sent from phone.

Hoang Vu

Aug 23, 2017, 8:30:40 PM

to tesseract-ocr

Are you using the C++ Tesseract API?

In my case I'm using PSM = 11, 4, 6:

api->SetPageSegMode(tesseract::PSM_SINGLE_BLOCK);   // PSM 6
api->SetPageSegMode(tesseract::PSM_SPARSE_TEXT);    // PSM 11
api->SetPageSegMode(tesseract::PSM_SINGLE_COLUMN);  // PSM 4

I think PSM 4 gives good results for the words of sentences, and PSM 11 gives good OCR results.

I don't know how it works internally, but if you have a problem with missing words or sentences, you should try changing the default PSM value.
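Since the PSM numbers and the `tesseract::PSM_*` enum names are easy to mix up, here is a small lookup sketch that lists them side by side. The names follow the `PageSegMode` enum in Tesseract's public headers as of 4.0. The `kPsmNames` table and the `PsmName` helper are hypothetical names made up for this illustration and are not part of the Tesseract API.

```cpp
#include <string>

// PageSegMode names indexed by the numeric value passed to --psm.
// Mirrors tesseract::PageSegMode in Tesseract 4.0 (14 modes, 0 to 13).
static const char* const kPsmNames[] = {
    "PSM_OSD_ONLY",                // 0: orientation and script detection only
    "PSM_AUTO_OSD",                // 1: automatic segmentation with OSD
    "PSM_AUTO_ONLY",               // 2: automatic segmentation, no OSD or OCR
    "PSM_AUTO",                    // 3: fully automatic, no OSD (the default)
    "PSM_SINGLE_COLUMN",           // 4: single column of variable-size text
    "PSM_SINGLE_BLOCK_VERT_TEXT",  // 5: single block of vertical text
    "PSM_SINGLE_BLOCK",            // 6: single uniform block of text
    "PSM_SINGLE_LINE",             // 7: single text line
    "PSM_SINGLE_WORD",             // 8: single word
    "PSM_CIRCLE_WORD",             // 9: single word in a circle
    "PSM_SINGLE_CHAR",             // 10: single character
    "PSM_SPARSE_TEXT",             // 11: as much text as possible, any order
    "PSM_SPARSE_TEXT_OSD",         // 12: sparse text with OSD
    "PSM_RAW_LINE",                // 13: single line, no Tesseract-specific hacks
};

// Returns the enum name for a numeric PSM, or an empty string if the
// number is outside the 0..13 range.
std::string PsmName(int psm) {
    const int count = static_cast<int>(sizeof(kPsmNames) / sizeof(kPsmNames[0]));
    return (psm >= 0 && psm < count) ? kPsmNames[psm] : "";
}
```

With the real API, the numeric value can be passed as `api->SetPageSegMode(static_cast<tesseract::PageSegMode>(4));`, which is the same as using `tesseract::PSM_SINGLE_COLUMN`.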

Hoang Vu

Aug 23, 2017, 8:37:56 PM

to tesseract-ocr

Here is my sample based on Tesseract 4.0: