How to overlay hocr output on original scanned pdf.


monica kumari

Sep 17, 2018, 8:12:46 AM
to tesseract-ocr
To OCR a scanned PDF, the file is first converted to an image and then OCRed, which produces a temporary file in PDF/text format that is overlaid on the original scanned PDF.
I want the output format to be hOCR. For this, I ran the command
"convert scannedFile.pdf scannedFile.png" and then "tesseract scannedFile.png scanned.pdf -l eng hocr"
I got hOCR output.
Now I need help overlaying it on the scanned PDF file.

Does anybody have any idea how to do this?

Zdenko Podobny

Sep 17, 2018, 8:14:27 AM
to tesser...@googlegroups.com
Something like this?

tesseract scannedFile.png scanned.pdf -l eng hocr pdf

Zdenko
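
A note on how the command above behaves: tesseract takes an input image, an output base name, and then a list of config names, and it writes one file per config by appending that format's extension to the base name. The following is a minimal sketch of that naming rule, not part of the original reply. The helper names `tesseract_command` and `expected_outputs` are hypothetical, and the extension mapping is assumed only for the `hocr` and `pdf` configs used in this thread.

```python
def tesseract_command(image, output_base, lang="eng", configs=("hocr", "pdf")):
    """Build the argument list for one tesseract run that writes several formats."""
    return ["tesseract", str(image), str(output_base), "-l", lang, *configs]


def expected_outputs(output_base, configs=("hocr", "pdf")):
    """List the files tesseract is expected to write.

    Assumption: for the "hocr" and "pdf" configs the extension equals the
    config name and is appended to the output base as-is.
    """
    return [f"{output_base}.{ext}" for ext in configs]


cmd = tesseract_command("scannedFile.png", "scanned")
print(" ".join(cmd))                    # tesseract scannedFile.png scanned -l eng hocr pdf
print(expected_outputs("scanned"))      # ['scanned.hocr', 'scanned.pdf']
print(expected_outputs("scanned.pdf"))  # ['scanned.pdf.hocr', 'scanned.pdf.pdf']
```

Because the extension is appended, passing "scanned.pdf" as the output base, as in the commands quoted in this thread, yields files named scanned.pdf.hocr and scanned.pdf.pdf. The list can be handed to `subprocess.run(cmd, check=True)` on a machine where tesseract is installed.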


On Mon, Sep 17, 2018 at 14:12, monica kumari <monicak...@gmail.com> wrote:
--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/c5b4f9c7-67e5-41d8-8c24-b4e5e4c39ed3%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Monica

Sep 17, 2018, 8:17:42 AM
to tesser...@googlegroups.com
Thanks, Zdenko, for your response.
Will "tesseract scannedFile.png scanned.pdf -l eng hocr pdf" overlay the output on the PDF file?

Monica

Sep 17, 2018, 8:41:35 AM
to tesser...@googlegroups.com
I have tried this, but it only shows the default behaviour. I think the PDF is being overlaid with the default output rather than with the hOCR output.

Shree Devi Kumar

Sep 17, 2018, 12:17:01 PM
to tesser...@googlegroups.com
I think PDF creation only adds a text layer, and there isn't an option to add hOCR to it.

@jbreiden can confirm.

--

____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

Jeff Breidenbach

Sep 17, 2018, 2:11:42 PM
to Shree, tesser...@googlegroups.com
Tesseract produces searchable PDF directly. If you really want to use hOCR as an
intermediate format, you can, but you will need external software. There are a couple
of "hocr2pdf" programs floating around, and "OCRmyPDF" does an admirable job of
tying things together. That said, going direct should give the best results.
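
For readers wondering what such external software has to do: hOCR stores each word box in image pixels with the origin at the top left, while PDF pages are measured in points (1/72 inch) with the origin at the bottom left. Below is a minimal sketch of that coordinate mapping, assuming the scan resolution is known. The function names `parse_bbox` and `bbox_to_pdf_points` are hypothetical and are not taken from any of the tools mentioned above, which also handle fonts, text rendering and image embedding.

```python
import re

# An hOCR element carries its box in the title attribute: "bbox x0 y0 x1 y1".
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


def parse_bbox(title):
    """Extract (x0, y0, x1, y1) in pixels from an hOCR title attribute."""
    match = BBOX_RE.search(title)
    if match is None:
        raise ValueError(f"no bbox found in title: {title!r}")
    return tuple(int(value) for value in match.groups())


def bbox_to_pdf_points(bbox, image_height_px, dpi):
    """Map a top-left-origin pixel box to a bottom-left-origin box in points."""
    x0, y0, x1, y1 = bbox
    left = x0 * 72.0 / dpi
    right = x1 * 72.0 / dpi
    # Flip the vertical axis: the lower pixel edge (y1) becomes the PDF bottom.
    bottom = (image_height_px - y1) * 72.0 / dpi
    top = (image_height_px - y0) * 72.0 / dpi
    return (left, bottom, right, top)


# Assumed example values: one word box on a 300 dpi scan that is 3300 px tall.
title = "bbox 300 600 900 660; x_wconf 95"
box = parse_bbox(title)
print(bbox_to_pdf_points(box, image_height_px=3300, dpi=300))  # (left, bottom, right, top) in points
```

If the resolution used here differs from the one at which the page image was rasterized, the text layer will not line up with the scan.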



Monica

Sep 19, 2018, 12:51:59 AM
to tesser...@googlegroups.com
Yes, I agree. I have tried that, but the quality is not very good; it is being compromised here. Is there any other way to OCR PDFs with little or no loss of quality?
