Empty result with marginally low-resolution images - Nepali

Nirajan Pant

Jan 11, 2018, 11:14:46 AM
to tesseract-ocr

Tesseract 4.0 is not working with the image provided here. This is a page from a Nepali novel. The resolution is slightly low, but not by much. The OCR result contains only a few words, and for other pages it is empty.


ShreeDevi Kumar

Jan 11, 2018, 11:14:32 PM
to tesser...@googlegroups.com
Works fine for me. What traineddata and options did you use?

Attaching the output from the following script. I did not change the DPI of the image.

#!/bin/bash
# Run every nepali*.png through three traineddata sets (LSTM engine, psm 6).
# Globbing directly avoids parsing ls output; quoting keeps paths with spaces intact.
for img_file in ./nepali*.png; do
  echo "****************************" "${img_file}" oem 1"**********************************"
  # The output base name is the image name without its extension plus a suffix.
  time tesseract --tessdata-dir /mnt/c/Users/User/shree/tessdata_best/ "${img_file}" "${img_file%.*}-Devanagari-best" --oem 1 --psm 6 -l Devanagari
  time tesseract --tessdata-dir /mnt/c/Users/User/shree/tessdata_fast/ "${img_file}" "${img_file%.*}-Devanagari-fast" --oem 1 --psm 6 -l Devanagari
  time tesseract --tessdata-dir /mnt/c/Users/User/shree/tessdata/ "${img_file}" "${img_file%.*}-nep" --oem 1 --psm 6 -l nep
done



ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscribe@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/534d9a5c-342e-447f-b4cd-7792f7bd7718%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

nepali-Devanagari-fast.txt
nepali-Devanagari-best.txt
nepali-nep.txt

Nirajan Pant

Jan 12, 2018, 9:12:41 AM
to tesseract-ocr
I was using automatic page segmentation mode. Why does automatic mode not work? Here is a sample command:

tesseract.exe "E:\Projects\NeOCR_rev1\Text Image Segmenter\bin\Debug\tesseract\tmp_20180111201447661_page-6.png" out --tessdata-dir "E:\Projects\NeOCR_rev1\Text Image Segmenter\bin\Debug\tesseract\tessdata" -l nep --psm 1 --oem 1



ShreeDevi Kumar

Jan 12, 2018, 9:28:30 AM
to tesser...@googlegroups.com
--psm 1 is "Automatic page segmentation with OSD."

--psm 3 is "Fully automatic page segmentation, but no OSD. (Default)"
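
For reference, the modes discussed in this thread (0 to 6) can be looked up with a small helper like the one below. This is only a sketch: `psm_description` is a hypothetical function name, the wording is taken from Tesseract 4's built-in help as far as I know it, and the higher-numbered modes are left out. Running `tesseract --help-psm` should print the authoritative list for your build.

```shell
#!/bin/bash
# Print the description of a Tesseract page segmentation mode (0-6 only).
# psm_description is a hypothetical helper, not part of Tesseract itself.
psm_description() {
  case "$1" in
    0) echo "Orientation and script detection (OSD) only." ;;
    1) echo "Automatic page segmentation with OSD." ;;
    2) echo "Automatic page segmentation, but no OSD, or OCR." ;;
    3) echo "Fully automatic page segmentation, but no OSD. (Default)" ;;
    4) echo "Assume a single column of text of variable sizes." ;;
    5) echo "Assume a single uniform block of vertically aligned text." ;;
    6) echo "Assume a single uniform block of text." ;;
    *) echo "unknown"; return 1 ;;  # modes above 6 are not covered here
  esac
}
```

For example, `psm_description 6` prints the description of the mode used in the script earlier in the thread.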


Nirajan Pant

Jan 12, 2018, 9:34:01 AM
to tesseract-ocr
--psm 3 is also not working.

ShreeDevi Kumar

Jan 12, 2018, 9:41:12 AM
to tesser...@googlegroups.com
Please file an issue with full details.

ShreeDevi Kumar

Jan 12, 2018, 10:12:22 AM
to tesser...@googlegroups.com
It seems a bug has crept into the processing of the different psm modes. OCR worked only for psm 4 and 6.

**************************** ./nepali.png oem 1**********************************
psm 1
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Warning. Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 434
OSD: Weak margin (6.54) for 34 blob text block, but using orientation anyway: 0
Empty page!!
Estimating resolution as 434
OSD: Weak margin (6.54) for 34 blob text block, but using orientation anyway: 0
Empty page!!

real    0m3.201s
user    0m2.250s
sys     0m0.641s

psm 2
Warning. Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 434
Empty page!!

real    0m1.860s
user    0m1.094s
sys     0m0.516s

psm 3
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Warning. Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 434
Empty page!!
Estimating resolution as 434
Empty page!!

real    0m2.086s
user    0m1.375s
sys     0m0.484s

psm 4
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Warning. Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 434

real    0m7.769s
user    0m7.016s
sys     0m0.453s

psm 5
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Warning. Invalid resolution 0 dpi. Using 70 instead.

real    0m4.139s
user    0m3.359s
sys     0m0.484s

psm 6
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Warning. Invalid resolution 0 dpi. Using 70 instead.

real    0m7.727s
user    0m6.844s
sys     0m0.516s
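
A sweep like the one logged above could be reproduced with something along these lines. It is a rough sketch, not the exact script used for the log: `psm_sweep` and the `TESSERACT_BIN` override are hypothetical names introduced here, and "text" only means the output file contains at least one non-whitespace character, not that the recognition is good.

```shell
#!/bin/bash
# Run one image through psm 1-6 and report, per mode, whether the resulting
# text file contains any non-whitespace characters.
# TESSERACT_BIN is a hypothetical override for the binary to call.
psm_sweep() {
  local img="$1" lang="$2" tessdata_dir="$3"
  local bin="${TESSERACT_BIN:-tesseract}"
  local tmpdir psm out
  tmpdir="$(mktemp -d)" || return 1
  for psm in 1 2 3 4 5 6; do
    out="${tmpdir}/psm-${psm}"
    # Tesseract appends .txt to the output base name.
    "${bin}" --tessdata-dir "${tessdata_dir}" "${img}" "${out}" \
      --oem 1 --psm "${psm}" -l "${lang}" >/dev/null 2>&1
    # An empty page may still yield a file holding only a form feed or newline.
    if [ -f "${out}.txt" ] && grep -q '[^[:space:]]' "${out}.txt"; then
      echo "psm ${psm}: text"
    else
      echo "psm ${psm}: empty"
    fi
  done
  rm -rf "${tmpdir}"
}
```

Usage would be something like `psm_sweep ./nepali.png nep /mnt/c/Users/User/shree/tessdata/`, which prints one line per mode.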


ShreeDevi Kumar

Jan 13, 2018, 12:14:11 AM
to tesser...@googlegroups.com
Nirajan,

Please check with the 'best' traineddata for nep. That seemed to work.
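
In case it helps, here is one way the 'best' model could be fetched into its own directory. Treat it as a sketch under assumptions: `tessdata_best_url` and `fetch_best_model` are hypothetical helper names, and the URL layout (the tessdata_best repository on GitHub, `main` branch) is my assumption about where the file lives and may need adjusting.

```shell
#!/bin/bash
# Build the download URL for a language model in the tessdata_best repository.
# Assumption: the file sits at the top level of the "main" branch.
tessdata_best_url() {
  echo "https://github.com/tesseract-ocr/tessdata_best/raw/main/$1.traineddata"
}

# Download the model into a directory that can be passed to --tessdata-dir.
fetch_best_model() {
  local lang="$1" dest_dir="$2"
  mkdir -p "${dest_dir}" || return 1
  # -f fails on HTTP errors, -L follows GitHub's redirect to the raw file.
  curl -fL -o "${dest_dir}/${lang}.traineddata" "$(tessdata_best_url "${lang}")"
}
```

After `fetch_best_model nep ./tessdata_best`, the earlier command can be rerun with `--tessdata-dir ./tessdata_best -l nep`.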

Nirajan Pant

Jan 13, 2018, 10:11:49 AM
to tesseract-ocr
Thank you Shree.