delimiting paragraphs


rkomar

Jul 21, 2009, 9:49:28 PM
to tesseract-ocr
I've started using tesseract and have been pleasantly surprised by how
well it works. I'm scanning some old books and creating LaTeX files
from them. One of the most tedious parts of the job is looking at the
original pages, finding the paragraph breaks (by the indentation), and
then inserting an empty line in the OCR'ed text by hand at each
location. Is there some way to do this automatically with tesseract?
I'm willing to hack the source code if necessary.

Ray Smith

Aug 5, 2009, 2:52:01 PM
to tesser...@googlegroups.com

If there are blank lines between paragraphs, the new page layout analysis in 3.00 will do this for you. If not, it will probably do this in the future.

If you want to have a crack at it yourself, you would have to modify the page layout analysis or add it as a postprocess based on the word boxes.

Ray.
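
For anyone trying the postprocess route, here is a minimal sketch of the idea in Python. It is not part of tesseract: it assumes you have already obtained the word boxes somehow and grouped them into text lines in reading order. The names `WordBox` and `insert_paragraph_breaks` and the `min_indent` threshold (in pixels) are hypothetical and would need tuning for a given scan. A line whose first word starts noticeably to the right of the left margin is treated as the start of a new paragraph, and an empty line is emitted before it.

```python
from typing import List, NamedTuple


class WordBox(NamedTuple):
    """One recognized word and its bounding box in pixels (hypothetical layout)."""
    text: str
    left: int
    top: int
    right: int
    bottom: int


def insert_paragraph_breaks(lines: List[List[WordBox]], min_indent: int = 20) -> str:
    """Join text lines, adding an empty line before each indented line.

    `lines` holds the word boxes of each text line, in reading order.
    `min_indent` is an assumed threshold; pick it relative to the scan resolution.
    """
    non_empty = [line for line in lines if line]
    if not non_empty:
        return ""

    # Estimate the left margin as the smallest left edge of any line's first word.
    # Stray marks in the margin would throw this off; a real tool may want
    # a more robust estimate.
    margin = min(line[0].left for line in non_empty)

    out: List[str] = []
    for line in non_empty:
        indented = line[0].left - margin >= min_indent
        if indented and out:
            out.append("")  # an empty line is a paragraph break in LaTeX
        out.append(" ".join(word.text for word in line))
    return "\n".join(out)


if __name__ == "__main__":
    page = [
        [WordBox("It", 60, 10, 80, 30), WordBox("was", 90, 10, 130, 30)],
        [WordBox("dark.", 20, 40, 70, 60)],
        [WordBox("Then", 62, 70, 110, 90)],
        [WordBox("rain.", 21, 100, 70, 120)],
    ]
    print(insert_paragraph_breaks(page))
```

This only handles a single column of left-aligned text with indented first lines; centered headings, block quotes, and skewed scans would need extra handling.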
