How to improve the OCR reader?


Teo

Mar 25, 2020, 1:04:14 AM
to tesseract-ocr
The quality is already very good, but it is lower than ABBYY FineReader's. The attachment shows a comparison between ABBYY and gImageReader OCR output, and you can see the difference. How can we improve it?



Schermata da 2020-03-24 02-59-00.png

Essam Zaky

Mar 25, 2020, 2:25:11 AM
to tesseract-ocr
You first need to know which part to improve: the Tesseract engine or the PDF generation.

So compare the text files produced by ABBYY and Tesseract.
If the results differ greatly, you need to improve the image quality or the LSTM model.

If the Tesseract result is good, then you need to improve the PDF generation module.
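
One possible way to do that comparison is a word-level diff of the two text exports. The following is only a sketch using Python's standard difflib module; the function name and the sample strings are made up for illustration and are not taken from either program.

```python
import difflib
from typing import List, Tuple


def compare_ocr_text(text_a: str, text_b: str) -> Tuple[float, List[str]]:
    """Compare two OCR outputs word by word.

    Returns a similarity ratio between 0 and 1 and a list of
    human-readable descriptions of the spans that differ.
    """
    words_a = text_a.split()
    words_b = text_b.split()
    matcher = difflib.SequenceMatcher(a=words_a, b=words_b, autojunk=False)
    diffs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            left = " ".join(words_a[i1:i2])
            right = " ".join(words_b[j1:j2])
            diffs.append(f"{tag}: {left!r} -> {right!r}")
    return matcher.ratio(), diffs


if __name__ == "__main__":
    # Illustrative strings only; in practice, read the two exported .txt files.
    ratio, diffs = compare_ocr_text(
        "the cat sat on the mat",
        "thè cat sat on the mat",
    )
    print(f"similarity: {ratio:.2f}")
    for line in diffs:
        print(line)
```

A ratio close to 1 with only a handful of differing words would point toward the PDF generation rather than the recognition itself.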

Teo

Mar 25, 2020, 5:25:46 AM
to tesseract-ocr
OK, I think it's the PDF generation module, because the txt output is almost the same, except for some instances of "the" that Tesseract reads as "thè".

Teo

Mar 25, 2020, 5:39:45 AM
to tesseract-ocr
I discovered that the problem is not with the reading but with the export to PDF: I saved both readings as txt files and they are almost the same. So how can I make the export more like ABBYY's? I mean with the text placed precisely on the document, all aligned.

Essam Zaky

Mar 25, 2020, 5:41:07 AM
to tesseract-ocr
Now you need to check the coordinates returned by Tesseract. Use the hOCR output and check whether the word coordinates are returned correctly. If they are, it is a bug in the PDF generation.

If the coordinates are wrong, it's a bug in Tesseract.

In my case, I previously used a library called iTextSharp to generate searchable PDFs. It is a port of the Java iText library, and it gives good PDF output.

Teo

Mar 26, 2020, 7:10:22 AM
to tesseract-ocr
Thanks for your help. How can I get the coordinates, and how do I check whether they are correct?

Essam Zaky

Mar 26, 2020, 2:13:52 PM
to tesseract-ocr
Read this document.

The following command returns the coordinates:
tesseract testing/eurotext.png testing/eurotext-eng -l eng hocr

The hOCR output contains each word as text together with its coordinates.
You can open the image in any image editor, such as MS Paint, and check that the returned coordinates match the position of the word in the image.
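
To make that check less tedious, the words and their boxes can be listed from the hOCR file with a short script. This is only a sketch: it assumes each word is a span with class ocrx_word whose title attribute starts with "bbox x0 y0 x1 y1" (pixels, origin at the top-left of the image), and the class and function names are invented for this example.

```python
import re
from html.parser import HTMLParser

# The hOCR title attribute looks like: "bbox 105 66 178 97; x_wconf 96"
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWordParser(HTMLParser):
    """Collects (text, (x0, y0, x1, y1)) for every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None   # bbox of the word currently being read
        self._depth = 0     # nesting depth of <span> inside that word
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._bbox is not None:
            if tag == "span":
                self._depth += 1
            return
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(v) for v in match.groups())
                self._depth = 1
                self._text = []

    def handle_endtag(self, tag):
        if self._bbox is not None and tag == "span":
            self._depth -= 1
            if self._depth == 0:
                self.words.append(("".join(self._text).strip(), self._bbox))
                self._bbox = None

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)


def extract_words(hocr_text):
    """Return a list of (word, bbox) tuples found in an hOCR document."""
    parser = HocrWordParser()
    parser.feed(hocr_text)
    parser.close()
    return parser.words
```

Reading the .hocr file produced by the command above and passing its contents to extract_words gives one box per word, which can then be compared with the pixel positions shown in the image editor.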

Best Regards

Teo

Mar 26, 2020, 4:54:50 PM
to tesseract-ocr
OK, the coordinates seem correct.
pho-eng.txt

Essam Zaky

Mar 26, 2020, 10:13:40 PM
to tesseract-ocr
So I guess the error is in the PDF generation module.
You have the following options:
- Try to fix the bug yourself.
- Raise an issue in the Tesseract issue tracker, but first check that it is not already in the list of issues.
- Use another external library to create the searchable PDF from the hOCR output.

Before Tesseract added the PDF generation feature, I used a library called iTextSharp to generate the PDF, and the result was very good for me.
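
If you go the external-library route, the main step is usually converting each hOCR box into PDF coordinates. The following is a rough sketch of that arithmetic, not code from iTextSharp or Tesseract. It assumes the PDF page has the same physical size as the scanned image, that the scan resolution (dpi) is known, and that the PDF uses its default user space (units of 1/72 inch, origin at the bottom-left). The function name is hypothetical.

```python
def bbox_to_pdf_points(bbox, image_height_px, dpi):
    """Convert an hOCR bbox (pixels, top-left origin) to PDF points.

    Returns (left, bottom, width, height) in the default PDF user space,
    where the origin is the bottom-left corner of the page.
    """
    x0, y0, x1, y1 = bbox
    left = x0 * 72.0 / dpi
    # hOCR measures y downward from the top, PDF measures it upward
    # from the bottom, so the bottom edge of the box is flipped.
    bottom = (image_height_px - y1) * 72.0 / dpi
    width = (x1 - x0) * 72.0 / dpi
    height = (y1 - y0) * 72.0 / dpi
    return left, bottom, width, height


if __name__ == "__main__":
    # Illustrative values only: a 300 dpi scan that is 3300 pixels tall.
    print(bbox_to_pdf_points((300, 600, 600, 660), 3300, 300))
```

If the text layer ends up shifted or scaled relative to the image, a wrong dpi value or a missing vertical flip is a likely cause worth ruling out.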

Teo

Mar 28, 2020, 1:24:07 PM
to tesseract-ocr
Thanks for the reply.
I just opened an issue on GitHub for Tesseract. Then I tried to create a PDF with Tesseract alone, without gImageReader, using:
tesseract pho.png pho-eng -l eng pdf
but this is the result...
Schermata da 2020-03-28 18-23-34.png

Essam Zaky

Mar 28, 2020, 1:32:26 PM
to tesseract-ocr
Please attach the original image so I can check it on my machine.

Teo

Mar 28, 2020, 1:34:59 PM
to tesseract-ocr

Ok
pho.png

Essam Zaky

Mar 28, 2020, 1:48:17 PM
to tesseract-ocr
It works fine on my machine.
It seems to be a problem with your PDF viewer.
I used Adobe PDF Reader V9.0.

Some PDF readers fail to read searchable PDFs, so try another reader.

Best Regards
Essam

Teo

Mar 28, 2020, 1:55:05 PM
to tesseract-ocr
With the same command?
tesseract pho.png pho-eng -l eng pdf



Essam Zaky

Mar 28, 2020, 2:04:25 PM
to tesseract-ocr
Yes, with the same command. The result is attached.
pho.pdf

Lorenzo Bolzani

Mar 28, 2020, 2:24:12 PM
to tesser...@googlegroups.com
If you'd like to improve the OCR accuracy too, a simple contrast enhancement (with a simple S-shaped curve) and a little sharpening help with the left border. See the attached file.
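
As an illustration of what an S-shaped contrast curve can look like, here is a sketch that builds a 256-entry lookup table from a sigmoid. The gain and midpoint values are arbitrary assumptions, not the settings used for the attached file, and the function name is made up.

```python
import math


def s_curve_lut(gain=8.0, midpoint=0.5):
    """Build a lookup table mapping 8-bit gray levels through an S-curve.

    Darker levels are pushed darker and lighter levels lighter,
    which increases contrast around the midpoint.
    """
    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-gain * (x - midpoint)))

    low, high = sigmoid(0.0), sigmoid(1.0)
    lut = []
    for level in range(256):
        # Rescale so that level 0 maps to 0 and level 255 maps to 255.
        y = (sigmoid(level / 255.0) - low) / (high - low)
        lut.append(int(round(y * 255)))
    return lut
```

With Pillow, for example, such a table could be applied to a grayscale image with img.point(lut), followed by img.filter(ImageFilter.SHARPEN) for the sharpening step. Whether these exact steps were used for the attachment is an assumption.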



Lorenzo

pho2.png

Teo

Mar 28, 2020, 2:37:49 PM
to tesseract-ocr
Ok thanks a lot.

Teo

Mar 28, 2020, 2:38:14 PM
to tesseract-ocr
Ok thanks, I'll keep this.