Best export method


Dayton

Mar 19, 2020, 3:04:05 AM
to tesseract-ocr
Hi All,

I'm using Tesseract for Windows to OCR scanned documents, and I format the layout in Word at a later stage.

The text I get in the .TXT output has no hard returns or any other separation between paragraphs, so I have to spend a lot of time guessing where each line ends.

Is there a command-line parameter that adds separations between paragraphs?

Should I use another output format instead of TXT to make formatting in Word easier?

Thanks!

Zdenko Podobny

Mar 19, 2020, 6:07:06 PM
to tesser...@googlegroups.com
Check out the hOCR output (which is HTML), or the TSV or PDF output. See the documentation.

Zdenko
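
One possible way to act on this suggestion: the TSV output (produced with something like `tesseract page.tif page tsv`, where `page.tif` and `page` are placeholder names) tags every word with `page_num`, `block_num`, `par_num` and `line_num` columns, so line and paragraph breaks can be rebuilt from it. The Python sketch below is only an illustration under that assumption; the function name `tsv_to_paragraphs` is made up for this example and is not part of Tesseract.

```python
import csv
import io
from itertools import groupby


def tsv_to_paragraphs(tsv_text):
    """Rebuild paragraphs from Tesseract TSV text (hypothetical helper).

    Returns a list of strings, one per paragraph, with "\n" between lines.
    """
    # QUOTE_NONE: the fields are not quoted, so a literal '"' in the
    # recognized text must not be treated as a CSV quote character.
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    # Word rows have level 5; rows for pages, blocks, paragraphs and lines
    # carry no text and are skipped, as are empty recognitions.
    words = [
        row for row in reader
        if row["level"] == "5" and (row["text"] or "").strip()
    ]

    def paragraph_key(row):
        return (row["page_num"], row["block_num"], row["par_num"])

    paragraphs = []
    # Rows are already in reading order, so consecutive grouping is enough.
    for _, par_rows in groupby(words, key=paragraph_key):
        lines = []
        for _, line_rows in groupby(par_rows, key=lambda row: row["line_num"]):
            lines.append(" ".join(row["text"] for row in line_rows))
        paragraphs.append("\n".join(lines))
    return paragraphs
```

Joining the returned list with blank lines (`"\n\n".join(...)`) and saving it as a text file should give something Word opens with visible paragraph breaks.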


On Thu, Mar 19, 2020 at 8:04, Dayton <montan...@gmail.com> wrote:
--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/a6e27031-89a2-4800-a574-48f738b439a0%40googlegroups.com.

Dayton

Mar 20, 2020, 8:25:33 AM
to tesseract-ocr
I have tried the hOCR and TSV outputs, but I still get all the text without hard returns or any separation between paragraphs.

Is there an hOCR tool that can export to Microsoft Word?

The original document is in PDF format. It's actually an official document.

First, I ran ImageMagick and got a cleaned TIFF file.

After that, I ran Tesseract, so I don't think it makes sense to convert the TIFF back to PDF.

I simply need an export format from Tesseract that lets MS Word display the text properly, rather than as lines of code.

Thanks!
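
For readers with the same question, here is a minimal sketch of turning hOCR into plain text that keeps line and paragraph breaks. It assumes the hOCR marks paragraphs as `<p class='ocr_par'>`, lines as `<span class='ocr_line'>` (or a sibling class such as `ocr_caption`) and words as `<span class='ocrx_word'>`. The names `HocrParagraphs` and `hocr_to_text` are invented for this example; it is not an official converter.

```python
from html.parser import HTMLParser

# Line-level classes assumed to appear in the hOCR output.
LINE_CLASSES = {"ocr_line", "ocr_caption", "ocr_header", "ocr_textfloat"}


class HocrParagraphs(HTMLParser):
    """Collects the words of each ocr_par paragraph, line by line."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []  # finished paragraphs as strings
        self._lines = None    # word lists for the paragraph being read
        self._spans = []      # kinds of the currently open <span> tags
        self._word = []       # text pieces of the word being read

    def handle_starttag(self, tag, attrs):
        classes = set((dict(attrs).get("class") or "").split())
        if tag == "p" and "ocr_par" in classes:
            self._lines = []
        elif tag == "span":
            if classes & LINE_CLASSES and self._lines is not None:
                self._lines.append([])
                kind = "line"
            elif "ocrx_word" in classes:
                self._word = []
                kind = "word"
            else:
                kind = "other"
            self._spans.append(kind)

    def handle_data(self, data):
        # Keep text only while inside a word span (including nested tags).
        if "word" in self._spans:
            self._word.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._spans:
            if self._spans.pop() == "word":
                text = "".join(self._word).strip()
                if text and self._lines:
                    self._lines[-1].append(text)
        elif tag == "p" and self._lines is not None:
            lines = [" ".join(words) for words in self._lines if words]
            if lines:
                self.paragraphs.append("\n".join(lines))
            self._lines = None


def hocr_to_text(hocr):
    """Return the hOCR text with one line per OCR line and a blank line
    between paragraphs."""
    parser = HocrParagraphs()
    parser.feed(hocr)
    parser.close()
    return "\n\n".join(parser.paragraphs)
```

The resulting string can be saved as a `.txt` file and opened in Word. Writing a real `.docx` would need a third-party library such as python-docx, which this sketch deliberately leaves out.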


Shree Devi Kumar

Mar 20, 2020, 8:27:22 AM
to tesseract-ocr
Take a look at gImageReader, which uses Tesseract. It has the options you are looking for.


Dayton

Mar 20, 2020, 5:30:42 PM
to tesseract-ocr
Thanks, Shree. I'll have a look at gImageReader. It looks promising.