Underlay text wrong size/position in PDF with Tesseract OCR


Edson Luis Moretti

Mar 10, 2016, 8:03:17 AM
to tesseract-ocr
Hello everyone,

I'm using Tesseract in VB.Net to write the OCR data as underlay text and build a searchable PDF.

Tesseract is recognizing the text well. My problem is that the underlay text is in the wrong position, as you can see in the attached image.

Has anyone had this problem before?

I'm passing the HTML generated by Tesseract.GetHOCRText to the hDocument of hOcr2Pdf.Net, but the positions and sizes seem to be wrong.

My code to create the PDF:
    With tesseract.Process(currentPageImage)
        OCRParser.ParseHOCR(hdoc, .GetHOCRText(0, True), True)
        pdfCreator.AddPage(hdoc.Pages(hdoc.Pages.Count - 1), currentPageImage)
        hdoc.Pages.RemoveAt(hdoc.Pages.Count - 1)
        .Dispose()
    End With
    pdfCreator.SaveAndClose()

The OCRParser class is a copy of the Parser class from hOcr2Pdf.Net; the original is in a private namespace that I can't access.
I did this because adding a new HTML page to hDocument requires the path of an HTML file, and I don't want to save the Tesseract output to disk just to pass it as an argument.
So I changed the Parser class to build the HTML object from text instead of from a file. Now I can pass the HTML text rather than the path of an HTML file.
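
For reference, here is a minimal Python sketch of the same idea: reading word boxes from an hOCR string held in memory rather than from a file, then converting each box to PDF units. It is not the hOcr2Pdf.Net code and not a confirmed diagnosis. The names `HocrWordParser`, `pixels_to_points` and `SAMPLE_HOCR` are hypothetical, and the 300 DPI and 3300-pixel page height are assumed values. hOCR `bbox` values are in image pixels with the origin at the top left, while PDF coordinates are in points (1/72 inch) with the origin at the bottom left, so a converter has to know the image resolution to place and size the text.

```python
import re
from html.parser import HTMLParser

# Matches the "bbox x0 y0 x1 y1" entry inside an hOCR title attribute.
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")

# Hypothetical one-word hOCR fragment, held in memory as a string.
SAMPLE_HOCR = (
    "<div class='ocr_page' title='bbox 0 0 2550 3300'>"
    "<span class='ocrx_word' title='bbox 300 600 900 660'>Hello</span>"
    "</div>"
)


class HocrWordParser(HTMLParser):
    """Collects (text, bbox) for each ocrx_word element in an hOCR string."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None   # bbox of the word currently being read
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "ocrx_word" in (attrs.get("class") or "").split():
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(v) for v in match.groups())
                self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._bbox is not None:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None


def pixels_to_points(bbox, dpi, page_height_px):
    """Convert an hOCR pixel box to PDF points, flipping the y axis."""
    x0, y0, x1, y1 = bbox
    return (
        x0 * 72.0 / dpi,
        (page_height_px - y1) * 72.0 / dpi,  # bottom edge in PDF space
        x1 * 72.0 / dpi,
        (page_height_px - y0) * 72.0 / dpi,  # top edge in PDF space
    )


parser = HocrWordParser()
parser.feed(SAMPLE_HOCR)  # parse straight from the string, no temp file
for text, bbox in parser.words:
    # Assumed values: 300 DPI scan, page 3300 pixels tall.
    print(text, pixels_to_points(bbox, dpi=300, page_height_px=3300))
```

If the resolution assumed by the converter differs from the real resolution of the page image, every box is scaled by the same wrong factor, which would show up as text that is both misplaced and mis-sized.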

Could my problem be related to Tesseract training? Is it recognizing the wrong font size or something like that?

I'm using the default English trained data. If I created my own trained data from my samples, would the underlay text be created at the right size and position?

Many thanks!
Edson Luis Moretti.



Capture.JPG

Tom Morris

Mar 10, 2016, 7:37:24 PM
to tesseract-ocr
What version of Tesseract? The issue isn't likely to have anything to do with training.

Tom

Edson Luis Moretti

Mar 11, 2016, 4:21:56 AM
to tesseract-ocr
Hello Tom,

I'm sorry, I forgot to put the version in my post!
I'm using version 3.0.2.0.
This is the latest version for which we have a .NET wrapper.

Tom Morris

Mar 11, 2016, 10:54:48 AM
to tesser...@googlegroups.com
On Fri, Mar 11, 2016 at 4:21 AM, Edson Luis Moretti <edsonlui...@gmail.com> wrote:

I'm sorry, I forgot to put the version in my post!
I'm using version 3.0.2.0.
This is the latest version for which we have a .NET wrapper.

I'd recommend testing with 3.04 or 3.04.01. Even without the .NET wrapper, you should be able to test from the command line to see if it makes a difference. The release notes mention changes in the PDF support in recent releases: https://github.com/tesseract-ocr/tesseract/blob/master/ReleaseNotes

Tom
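
A command-line test along the lines suggested above could be scripted roughly as follows. This is only a sketch: it assumes a Tesseract build that supports PDF output and a `tesseract` executable on the PATH, and the helper names and the file names `page.png` and `out` are hypothetical. The trailing `pdf` argument names the Tesseract config that enables PDF output.

```python
import shutil
import subprocess


def build_pdf_command(image_path, output_base):
    """Return the argument list for `tesseract <image> <outputbase> pdf`."""
    # Tesseract appends the ".pdf" extension to output_base itself.
    return ["tesseract", image_path, output_base, "pdf"]


def run_pdf_test(image_path, output_base):
    """Run Tesseract on one image and return the expected PDF path."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    subprocess.run(build_pdf_command(image_path, output_base), check=True)
    return output_base + ".pdf"


# Hypothetical file names; this only prints the command that would be run.
print(" ".join(build_pdf_command("page.png", "out")))
```

Opening the resulting PDF and selecting the text shows whether the underlay lines up with the image when the .NET wrapper and hOcr2Pdf.Net are left out of the pipeline.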
