Tesseract 4.0 extracting multiple columns where one is wanted

peter.b...@playfultechnology.co.uk

May 2, 2018, 3:23:21 PM

to tesseract-ocr

I am using Tesseract 4.0 to extract text from scanned PDF documents. I first use pdftoppm to split the document into pages, each saved as a PNG file, and then run the following command on each page to perform OCR:

tesseract page.png stdout -l eng --psm 4

The pages generally have section numbers down the left-hand side. Sometimes these are extracted as one column of text, and the body text is extracted as a second column. Since I have set --psm 4, I expect the entire page to be returned as a single column, and indeed for some pages I do get what I want.

Why is tesseract sometimes extracting the text in columns even when I tell it not to, and what can I do about it?
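
For anyone reproducing this, the workflow described above (pdftoppm to split the PDF, then tesseract on each page image) might look roughly like the sketch below. The file name document.pdf, the output prefix "page", and the 300 DPI resolution are assumptions for illustration and are not stated in the post. The helper names tesseract_cmd and ocr_pdf are also hypothetical.

```shell
# Sketch only: split a scanned PDF into one PNG per page, then OCR each page.
# Assumed names: the "page" output prefix and 300 DPI are illustrative choices.

# Print the tesseract command line for one page image and one --psm value.
tesseract_cmd() {
    image=$1
    psm=$2
    printf 'tesseract %s stdout -l eng --psm %s\n' "$image" "$psm"
}

# Convert the PDF to PNGs and OCR every page; runs only when called explicitly.
ocr_pdf() {
    pdf=$1
    psm=$2
    # -png selects PNG output, -r sets the rendering resolution in DPI.
    pdftoppm -png -r 300 "$pdf" page || return 1
    # pdftoppm names its output page-<number>.png.
    for image in page-*.png; do
        tesseract "$image" stdout -l eng --psm "$psm"
    done
}
```

Calling `ocr_pdf document.pdf 4` would correspond to the setup in the question; `tesseract_cmd page-1.png 4` only prints the command so it can be inspected before running.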

ShreeDevi Kumar

May 3, 2018, 4:37:48 AM

to tesser...@googlegroups.com

Try with --psm 6

ShreeDevi
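
For context on the suggestion above: `tesseract --help-psm` describes mode 4 as "Assume a single column of text of variable sizes" and mode 6 as "Assume a single uniform block of text". A rough way to check whether mode 6 helps on a given page is to run both modes and diff the results. The sketch below assumes a page image such as page-1.png exists; the helper names output_base and compare_psm are hypothetical.

```shell
# Sketch only: OCR one page with --psm 4 and --psm 6, then compare the text.

# Build the output base name tesseract should write to, e.g. page-1.psm6
# (tesseract appends .txt to this base itself).
output_base() {
    printf '%s.psm%s\n' "${1%.*}" "$2"
}

# Run both segmentation modes on one image and show where the output differs.
compare_psm() {
    image=$1
    for psm in 4 6; do
        base=$(output_base "$image" "$psm")
        tesseract "$image" "$base" -l eng --psm "$psm"
    done
    diff "${image%.*}.psm4.txt" "${image%.*}.psm6.txt"
}
```

Running `compare_psm page-1.png` on a page where the section numbers were split off as a separate column should show whether mode 6 keeps them on the same lines as the body text.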