Underlay text wrong size/position in PDF with Tesseract OCR

Edson Luis Moretti

Mar 10, 2016, 8:03:17 AM

to tesseract-ocr

Hello everyone,

I'm using Tesseract in VB.Net with

to write underlay text from the OCR data and build a searchable PDF.

Tesseract is recognizing the text well. My problem is that the underlay text is in the wrong position, as you can see in the attached image.

Has anyone already had this problem?

I'm passing the HTML generated by Tesseract.GetHOCRText to the hDocument of HOcr2Pdf.Net, but it seems the positions and sizes are wrong.
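
For reference, hOCR gives each word's `bbox` in image pixels with the origin at the top-left, while PDF user space is measured in points (1/72 inch) with the origin at the bottom-left by default. A converter therefore has to scale by the image resolution and flip the vertical axis, and if the resolution it assumes differs from the scan's real DPI, every box comes out at the wrong size and position. Nothing in this thread confirms that this is the cause here; the Python sketch below only illustrates the arithmetic, and the function names, page size, and DPI values are assumptions for illustration, not part of HOcr2Pdf.Net.

```python
def px_to_pt(pixels, dpi):
    """Convert a length in image pixels to PDF points (1/72 inch)."""
    return pixels * 72.0 / dpi


def hocr_bbox_to_pdf_rect(bbox, image_height_px, dpi):
    """Convert an hOCR bbox (pixels, origin top-left) into a PDF rectangle
    (points, origin bottom-left), returned as (x, y, width, height)."""
    x0, y0, x1, y1 = bbox
    x = px_to_pt(x0, dpi)
    # hOCR measures y downwards from the top; PDF measures upwards from the bottom.
    y = px_to_pt(image_height_px - y1, dpi)
    width = px_to_pt(x1 - x0, dpi)
    height = px_to_pt(y1 - y0, dpi)
    return (x, y, width, height)


# Hypothetical word box on a 2550 x 3300 px scan (US Letter at 300 dpi).
word_bbox = (300, 600, 900, 660)

# Same box, once with the true resolution and once with a wrong assumption.
rect_at_300_dpi = hocr_bbox_to_pdf_rect(word_bbox, image_height_px=3300, dpi=300)
rect_at_96_dpi = hocr_bbox_to_pdf_rect(word_bbox, image_height_px=3300, dpi=96)

print("300 dpi:", rect_at_300_dpi)
print(" 96 dpi:", rect_at_96_dpi)
```

With these numbers the same 600 px wide word is 144 pt wide at 300 dpi but 450 pt wide if 96 dpi is assumed, so a DPI mismatch alone would be enough to shift and resize the whole text layer.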

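My code to create the PDF:

            ' Using guarantees the page is disposed even if parsing or AddPage throws.
            Using page = tesseract.Process(currentPageImage)
                ' Parse the hOCR text directly; the parsed page ends up last in hdoc.Pages.
                OCRParser.ParseHOCR(hdoc, page.GetHOCRText(0, True), True)
                ' Add the page just parsed to the PDF, then drop it from hdoc.
                pdfCreator.AddPage(hdoc.Pages(hdoc.Pages.Count - 1), currentPageImage)
                hdoc.Pages.RemoveAt(hdoc.Pages.Count - 1)
            End Using
            pdfCreator.SaveAndClose()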

This OCRParser class is the same as the Parser class of hOcr2Pdf.Net, but that class is in a private namespace and I can't access it.

I did this because, to add a new HTML page to hDocument, you need to pass the path of an HTML file, and I don't want to save the Tesseract output just to pass it as an argument.

So I changed the Parser class to build the HTML object from text rather than from a file; now I can pass the HTML text instead of the path of an HTML file.

Could my problem be related to Tesseract training? Is it recognizing the wrong font size or something like that?

I'm using the default English trained data. If I made my own trained data with my samples, would the underlay text be created in the right size/position?
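
Parsing hOCR from an in-memory string, as described above, can be sketched as follows. This is a minimal Python illustration using only the standard library, not the modified Parser class from hOcr2Pdf.Net; the class name `HocrWordParser`, the function `parse_hocr_words`, and the sample markup are hypothetical. It assumes word spans carry the class `ocrx_word` and a `title` containing `bbox x0 y0 x1 y1`, and that word spans do not contain nested spans.

```python
import re
from html.parser import HTMLParser

# Matches the "bbox x0 y0 x1 y1" entry inside an hOCR title attribute.
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWordParser(HTMLParser):
    """Collects (text, bbox) pairs for every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []    # list of (text, (x0, y0, x1, y1))
        self._bbox = None  # bbox of the word span currently open, if any
        self._text = []    # text fragments seen inside that span

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(g) for g in match.groups())
                self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside a word span.
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._bbox is not None:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None


def parse_hocr_words(hocr_text):
    """Parse hOCR markup held in a string; no temporary file is needed."""
    parser = HocrWordParser()
    parser.feed(hocr_text)
    parser.close()
    return parser.words


# Hypothetical hOCR fragment standing in for the text returned by the OCR engine.
sample_hocr = """
<div class='ocr_page' title='bbox 0 0 2550 3300; ppageno 0'>
 <span class='ocr_line' title='bbox 300 600 1400 660'>
  <span class='ocrx_word' title='bbox 300 600 900 660; x_wconf 93'>Hello</span>
  <span class='ocrx_word' title='bbox 940 600 1400 660; x_wconf 91'><strong>world</strong></span>
 </span>
</div>
"""

for text, bbox in parse_hocr_words(sample_hocr):
    print(text, bbox)
```

Printing the parsed boxes next to the image dimensions is a quick way to see whether the coordinates coming out of the OCR step are already wrong or whether the error is introduced later, when the PDF is written.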

Many thanks!

Edson Luis Moretti.

Capture.JPG

Tom Morris

Mar 10, 2016, 7:37:24 PM

to tesseract-ocr

What version of Tesseract? The issue isn't likely to have anything to do with training.

Tom

Edson Luis Moretti

Mar 11, 2016, 4:21:56 AM

to tesseract-ocr

Hello Tom,

I'm sorry, I forgot to put the version in my post!

I'm using version 3.0.2.0.

This is the latest version that has a wrapper for .Net.

Tom Morris

Mar 11, 2016, 10:54:48 AM

to tesser...@googlegroups.com

On Fri, Mar 11, 2016 at 4:21 AM, Edson Luis Moretti <edsonlui...@gmail.com> wrote:

I'm sorry, I forgot to put the version in my post!
I'm using the version 3.0.2.0

This is the latest version that we have a wrapper for .Net

I'd recommend testing with 3.04 or 3.04.01. Even without the .net wrapper, you should be able to test from the command line to see if it makes a difference. The release notes mention changes in the PDF support in recent releases: https://github.com/tesseract-ocr/tesseract/blob/master/ReleaseNotes
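
A quick command-line check along these lines could look like the sketch below, written in Python so it can be scripted. It relies on the `tesseract <image> <outputbase> pdf` invocation, which writes `<outputbase>.pdf` with the text layer generated by Tesseract itself; the file names `page.png` and `page_test` are placeholders, and the helper name is made up for illustration.

```python
import os
import shutil
import subprocess


def build_pdf_command(image_path, output_base):
    """Return the argument list for `tesseract <image> <outputbase> pdf`."""
    return ["tesseract", image_path, output_base, "pdf"]


if __name__ == "__main__":
    image_path = "page.png"    # placeholder: one of the scanned pages
    output_base = "page_test"  # Tesseract appends ".pdf" to this name

    if shutil.which("tesseract") is None:
        print("tesseract was not found on PATH")
    elif not os.path.exists(image_path):
        print("sample image not found:", image_path)
    else:
        # Show which version is actually being run before comparing results.
        subprocess.run(["tesseract", "--version"], check=True)
        subprocess.run(build_pdf_command(image_path, output_base), check=True)
        print("wrote", output_base + ".pdf")
```

If the text layer in the resulting PDF lines up with the image, the recognition data is fine and the misplacement is more likely introduced in the hOCR-to-PDF step or by the older library version.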

Tom
