How to process PDF files line by line with tesseract

jcr

Nov 8, 2019, 9:15:43 AM
to tesseract-ocr
When processing PDF files to obtain their text content (converting to TIF with ImageMagick, then running Tesseract 4.1.0 on the output), I observe that in many cases the input is read "vertically": words or numbers that sit close to each other in the input (e.g. on the same line) are torn apart in the txt output.

Is there any way to prevent this? And are there any recommendations for configuring DPI etc. when processing PDF to text?

Alex Giokas

Nov 8, 2019, 9:33:08 AM
to tesseract-ocr
If your PDF is not a bitmap, then you don't need OCR; simply extract the text.
If the PDF is a bitmap, then convert it to an image format and OCR that.
If your PDF is sparse, you can experiment with the PSM option (instead of 3, try 11 or 12); I get better accuracy that way.
If you really need a line-by-line approach, then you have to use some pre-processing algorithm (e.g., use OpenCV to find rows of text, crop each row as an ROI, and feed the ROIs to Tesseract one at a time).
This can be easily achieved, but it increases computation time tremendously.

Regards,
Alex
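
For readers who want to try the row-finding step described above, here is a minimal sketch. It is not Alex's code: it swaps OpenCV for a plain horizontal projection profile (count the dark pixels in each pixel row, then group consecutive rows that contain ink). It assumes a grayscale page with dark text on a light background, given as a 2-D sequence of 0-255 values (nested lists or a NumPy array, e.g. from cv2.imread(path, cv2.IMREAD_GRAYSCALE)). The names find_text_rows and crop_rows and all threshold values are hypothetical.

```python
def find_text_rows(gray, ink_threshold=128, min_ink_pixels=1, min_height=2):
    """Return (top, bottom) pixel-row ranges that contain ink; bottom is exclusive."""
    # Horizontal projection profile: does each pixel row contain enough dark pixels?
    row_has_ink = [
        sum(1 for pixel in row if pixel < ink_threshold) >= min_ink_pixels
        for row in gray
    ]
    bands, start = [], None
    # The trailing False is a sentinel that closes a band touching the bottom edge.
    for y, has_ink in enumerate(row_has_ink + [False]):
        if has_ink and start is None:
            start = y                       # a band of text begins
        elif not has_ink and start is not None:
            if y - start >= min_height:     # skip specks shorter than min_height
                bands.append((start, y))
            start = None
    return bands


def crop_rows(gray, bands, pad=2):
    """Cut one ROI per band, keeping `pad` extra pixel rows above and below."""
    height = len(gray)
    return [gray[max(0, top - pad):min(height, bottom + pad)] for top, bottom in bands]
```

Each crop could then be passed to Tesseract with --psm 7 (treat the image as a single text line). The pure-Python loop is slow on full-resolution scans; with NumPy the profile is simply (gray < 128).sum(axis=1). This approach also assumes the page is deskewed and single-column, since skewed or multi-column pages leave no clean blank rows between lines.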

farhad khalafi

Nov 8, 2019, 11:45:24 AM
to tesser...@googlegroups.com
I also experienced a similar problem with images, especially if they used fixed-pitch fonts (older scanned documents often did).
Tesseract groups characters vertically, assuming rotated text. Using PSM 6 instead of 3 gave some improvement, but in return it missed significant portions of text.
I was processing old student records looking for personal information, like SS#, to redact. I ended up running Tesseract multiple times with different PSM modes. Each mode picked up certain parts and missed others. It was time-consuming.
Is there an engine flag (or can one be added) to force a "no-rotate" policy on layout analysis? I think it would be of tremendous help.
Thanks,
Farhad
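
The multi-pass workaround described above can be scripted. The following is only a sketch of that idea, not Farhad's actual setup: it assumes the tesseract executable is on PATH, uses the standard CLI options --psm and --dpi with stdout as the output base, and merges passes by taking the union of whitespace-normalized lines. The function names and the default of 300 DPI are hypothetical.

```python
import subprocess


def tesseract_cmd(image_path, psm, dpi=300):
    """Build a Tesseract CLI call that prints the recognized text to stdout."""
    return ["tesseract", str(image_path), "stdout", "--psm", str(psm), "--dpi", str(dpi)]


def run_pass(image_path, psm, dpi=300):
    """Run one OCR pass; requires the tesseract executable to be installed."""
    result = subprocess.run(
        tesseract_cmd(image_path, psm, dpi),
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def merge_passes(texts):
    """Union of non-empty lines across passes, in first-seen order."""
    seen, merged = set(), []
    for text in texts:
        for line in text.splitlines():
            key = " ".join(line.split())    # collapse runs of whitespace
            if key and key not in seen:
                seen.add(key)
                merged.append(key)
    return merged
```

Usage would look like merge_passes(run_pass("page.tif", psm) for psm in (3, 6, 11)). The merge only drops exact duplicates, so the same line read with different OCR errors in two passes is kept twice. That is acceptable when searching for strings to redact, but not for producing a clean transcript.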

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/2bc3c616-d82f-4056-8f99-0ed4029fb880%40googlegroups.com.

Shree Devi Kumar

Nov 11, 2019, 1:53:22 AM
to tesseract-ocr

On Sat, Nov 9, 2019 at 3:10 AM Aaron Stewart <bigbowlo...@gmail.com> wrote:
If you have any suggestions on how to split input images into individual text lines, I would appreciate them. I am able to use Python and OpenCV, but I don't have a lot of experience with either. I can read publications if necessary.

I'm using Tesseract 5.0.0-alpha from UB Mannheim (Windows 10) to process pages from a directory. The line spacing is very narrow. In my project, increasing the line spacing improves the recognition accuracy.

I believe that splitting the input image into separate lines of text would improve the results in my case.


=== Original ===
FLOYD. THOMAS J.—La.1,°07; (1°07).
ao LOWNDES = (b’64)-~Ala.2,°90:

=== Spaced ===
FLOYD, THOMAS J.—La.1,"07; (1°07).
HENDRICK. LOWNDES  (b’64)-—~Ala.2,°90:
(1°90).

In the original output, the name HENDRICK and the entire third line are missing.
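
One way to get the "Spaced" effect automatically is to rebuild the page with extra blank rows between the text lines. The sketch below is an illustration under assumptions, not something from this thread: it takes the page as a 2-D sequence of grayscale values plus a list of (top, bottom) pixel-row ranges, one per text line with bottom exclusive, which would have to come from a separate line-detection step. The name respace_lines and the default gap of 20 pixels are hypothetical.

```python
def respace_lines(page, bands, gap=20, background=255):
    """Stack the given row bands with `gap` blank pixel rows around each one."""
    width = len(page[0])
    blank = [background] * width
    out = [list(blank) for _ in range(gap)]              # top margin
    for top, bottom in bands:
        out.extend(list(row) for row in page[top:bottom])  # copy one text line
        out.extend(list(blank) for _ in range(gap))        # spacing below it
    return out
```

The result is a list of pixel rows. To OCR it, it would need to be written back to an image first, for example with cv2.imwrite("spaced.png", numpy.array(out, dtype=numpy.uint8)). A suitable gap depends on the scan resolution and is best found by trial on a few pages.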

--

____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

Aaron Stewart

Nov 11, 2019, 4:43:59 PM
to tesseract-ocr
Thank you, that is helpful.