Hello everyone,
I'm using Tesseract in VB.Net with HOcr2Pdf.Net
to write an underlay text layer from the OCR data and build a searchable PDF.
Tesseract is recognizing the text well. My problem is that the underlay text ends up in the wrong position, as you can see in the attached image.
Has anyone already had this problem?
I'm passing the HTML generated by Tesseract's GetHOCRText to the hDocument of HOcr2Pdf.Net, but it seems the positions and sizes are wrong.
My code to create the PDF:
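Using page = tesseract.Process(currentPageImage)
    ' Parse this page's hOCR output straight from the string (no temp file)
    OCRParser.ParseHOCR(hdoc, page.GetHOCRText(0, True), True)
    ' Add the page just parsed (the last one in hdoc) together with its image
    pdfCreator.AddPage(hdoc.Pages(hdoc.Pages.Count - 1), currentPageImage)
    ' Drop the page from hdoc once it has been added to the PDF
    hdoc.Pages.RemoveAt(hdoc.Pages.Count - 1)
End Using ' disposes the Tesseract page, even if an exception is thrown
pdfCreator.SaveAndClose()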
The OCRParser class is a copy of the Parser class from hOcr2Pdf.Net; the original sits in a private namespace, so I can't access it.
I did this because adding a new HTML page to hDocument requires the path of an HTML file, and I don't want to save the Tesseract output to disk just to pass it as an argument.
In my copy I changed the Parser class to build the HTML object from text instead of from a file, so I can now pass the HTML text rather than a file path.
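To show what "from text instead of from a file" means, here is a minimal, self-contained sketch. It is not the actual hOcr2Pdf.Net Parser code: the module and function names are hypothetical, and it relies only on the fact that hOCR stores each box in the title attribute as "bbox x0 y0 x1 y1", in image pixels. It reads every bbox directly from an hOCR string held in memory:

```vb
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module HocrFromTextSketch
    ' Hypothetical helper (not part of hOcr2Pdf.Net): collects every
    ' "bbox x0 y0 x1 y1" entry found in an hOCR string held in memory.
    ' The coordinates are image pixels, as written by the OCR engine.
    Function ReadBoxes(hocrText As String) As List(Of Integer())
        Dim boxes As New List(Of Integer())
        For Each m As Match In Regex.Matches(hocrText, "bbox (\d+) (\d+) (\d+) (\d+)")
            boxes.Add(New Integer() {
                Integer.Parse(m.Groups(1).Value),
                Integer.Parse(m.Groups(2).Value),
                Integer.Parse(m.Groups(3).Value),
                Integer.Parse(m.Groups(4).Value)})
        Next
        Return boxes
    End Function

    Sub Main()
        ' Made-up sample element, for illustration only
        Dim sample As String =
            "<span class='ocrx_word' title='bbox 36 92 96 116; x_wconf 95'>Hello</span>"
        For Each b In ReadBoxes(sample)
            Console.WriteLine(String.Join(", ", b))
        Next
    End Sub
End Module
```

Printing the boxes like this for one page and comparing them with the pixel size of currentPageImage may help show whether the coordinates coming out of Tesseract are already off, or whether they only go wrong later when the PDF is built.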
Could my problem be related to Tesseract training? Is it recognizing the wrong font size or something like that?
I'm using the default English trained data. If I create my own trained data from my samples, should the underlay text then be created with the right size and position?
Many thanks!
Edson Luis Moretti.