Error in Layout Analysis with Tesseract OCR 4.0.0alpha


Nirajan Pant

Aug 23, 2017, 12:33:28 AM
to tesseract-ocr
I am working on a GUI for Tesseract OCR 4.0.0 (Nepali language). When I started analyzing the recognition results, I found some missing words and sentences. To find the reason, I drew the boxes detected by Tesseract, taken from the hOCR recognition output. The detection is shown here:

This is part of a document with a paragraph detection error. The red line is the boundary of the detected paragraph (the second column of the original image given below).

The original image is:


Please help me deal with this issue.
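
For readers who want to reproduce the box-drawing check described above, here is a minimal sketch of reading a bounding box out of an hOCR `title` attribute and testing whether a word box lies inside a paragraph box. It assumes the usual hOCR form `bbox x0 y0 x1 y1`; the names `BBox`, `parseHocrBBox`, and `contains` are hypothetical helpers, not part of Tesseract.

```cpp
#include <regex>
#include <string>

// Pixel bounding box as written in an hOCR title attribute: x0 y0 x1 y1.
struct BBox { int x0, y0, x1, y1; };

// Hypothetical helper: extracts "bbox x0 y0 x1 y1" from a title string such as
// "bbox 36 92 618 361; x_wconf 93". Returns false if no bbox is present.
bool parseHocrBBox(const std::string& title, BBox& out) {
    static const std::regex re(R"(bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+))");
    std::smatch m;
    if (!std::regex_search(title, m, re)) return false;
    out.x0 = std::stoi(m[1].str());
    out.y0 = std::stoi(m[2].str());
    out.x1 = std::stoi(m[3].str());
    out.y1 = std::stoi(m[4].str());
    return true;
}

// True if `inner` lies completely inside `outer` (edges may touch).
bool contains(const BBox& outer, const BBox& inner) {
    return inner.x0 >= outer.x0 && inner.y0 >= outer.y0 &&
           inner.x1 <= outer.x1 && inner.y1 <= outer.y1;
}
```

Comparing a hand-measured box around a missing word against every detected paragraph box with `contains` is one way to confirm that the word fell outside all detected regions.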

ShreeDevi Kumar

Aug 23, 2017, 7:45:32 AM
to tesser...@googlegroups.com
You could try doing your own layout analysis instead of relying on Tesseract's auto mode.
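
As one illustration of what "your own layout analysis" could look like for a two-column page, the sketch below splits a binarized image into columns using a vertical projection profile. This is a swapped-in textbook technique, not something Tesseract does for you; the `Binary` type and `splitColumns` function are hypothetical, and the image is assumed to be already binarized and deskewed.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Binary image: image[y][x] != 0 means an ink (text) pixel.
using Binary = std::vector<std::vector<unsigned char>>;

// Returns [start, end) x-ranges of text columns. Two columns are separated
// when at least minGap consecutive pixel columns contain no ink at all.
std::vector<std::pair<std::size_t, std::size_t>>
splitColumns(const Binary& image, std::size_t minGap) {
    std::vector<std::pair<std::size_t, std::size_t>> columns;
    if (image.empty()) return columns;
    if (minGap == 0) minGap = 1;  // a zero-width gap cannot separate anything
    const std::size_t width = image[0].size();

    // Vertical projection: mark every x position that has ink in any row.
    std::vector<bool> hasInk(width, false);
    for (const auto& row : image)
        for (std::size_t x = 0; x < width && x < row.size(); ++x)
            if (row[x]) hasInk[x] = true;

    std::size_t x = 0;
    while (x < width) {
        while (x < width && !hasInk[x]) ++x;  // skip blank margin or gutter
        if (x == width) break;
        const std::size_t start = x;
        std::size_t end = x;   // one past the last ink position seen so far
        std::size_t gap = 0;   // length of the current run of blank positions
        while (x < width && gap < minGap) {
            if (hasInk[x]) { gap = 0; end = x + 1; } else { ++gap; }
            ++x;
        }
        columns.emplace_back(start, end);
    }
    return columns;
}
```

Each returned range could then be recognized separately, for example by restricting recognition with `TessBaseAPI::SetRectangle` and a single-column or single-block page segmentation mode. Whether that recovers the missing words on this particular page is untested.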

Have you tried gImageReader and VietOCR as GUI front ends for Tesseract with Nepali?



ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscribe@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/ae0aa097-93ba-4424-baf5-b4ed93ca574a%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Nirajan Pant

Aug 23, 2017, 11:46:51 AM
to tesseract-ocr
Yes, I have tried both gImageReader and VietOCR as GUI front ends for Tesseract with Nepali. The results from both GUIs skip the words as well.



ShreeDevi Kumar

Aug 23, 2017, 12:05:26 PM
to tesser...@googlegroups.com
Skipping words is a known issue in Tesseract. Amit has proposed a patch for it; look in the Tesseract issues.

You can see if it helps in your case.

-- Excuse the brevity, msg sent from phone.


Hoang Vu

Aug 23, 2017, 8:30:40 PM
to tesseract-ocr
Are you using the C++ Tesseract API?
In my case I switch between these page segmentation modes (PSM), one at a time:
api->SetPageSegMode(tesseract::PSM_SINGLE_BLOCK);   // PSM 6
api->SetPageSegMode(tesseract::PSM_SPARSE_TEXT);    // PSM 11
api->SetPageSegMode(tesseract::PSM_SINGLE_COLUMN);  // PSM 4

I think PSM 4 gives good results for the words of a sentence, and PSM 11 gives good OCR results.
I don't know how it works, but if you have a problem with missing words or sentences, you should try changing the default PSM value.
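
To compare the PSM values mentioned above without recompiling, one option is to run the command-line tool once per mode and diff the hOCR outputs. The sketch below only builds the command strings; `buildCommand` is a hypothetical helper, `page.png` and the output base names are placeholders, and it assumes the 4.0-style `--psm` flag (3.x builds spell it `-psm`) with the `nep` traineddata installed.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: builds a tesseract CLI call that writes hOCR output
// to <outBase>.hocr using the given language and page segmentation mode.
std::string buildCommand(const std::string& image, const std::string& outBase,
                         const std::string& lang, int psm) {
    return "tesseract " + image + " " + outBase + " -l " + lang +
           " --psm " + std::to_string(psm) + " hocr";
}

// Builds one command per PSM value so the resulting hOCR files can be compared.
std::vector<std::string> buildPsmSweep(const std::string& image,
                                       const std::string& lang,
                                       const std::vector<int>& psms) {
    std::vector<std::string> commands;
    for (int psm : psms)
        commands.push_back(
            buildCommand(image, "out_psm" + std::to_string(psm), lang, psm));
    return commands;
}
```

For example, `buildPsmSweep("page.png", "nep", {4, 6, 11})` yields three commands covering single column, single block, and sparse text.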

Hoang Vu

Aug 23, 2017, 8:37:56 PM
to tesseract-ocr
Here is my sample, based on Tesseract 4.0.



