Compressing a sequence of spaces

Rob

Jun 29, 2009, 11:43:01 AM
to tesseract-ocr
Tesseract is compressing a sequence of spaces in an input TIFF into a
single space in the output text. I want to preserve the original
spaces.

Tesseract 2.03
Debian 4 (2.6.18-5-686 kernel)
libtiff-tools
libtiff-dev

I'd appreciate any advice.

Thanks,
Rob

Ray Smith

Jun 29, 2009, 9:10:41 PM
to tesser...@googlegroups.com
You can achieve this with a minor hack to the code.
In baseapi.cpp, in the function TessBaseAPI::TesseractToText, find the statement *ptr++ = ' '; and repeat it a number of times equal to word->word->space(), instead of executing it once.
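
To illustrate the idea outside of Tesseract, here is a minimal standalone sketch. It is not the baseapi.cpp code: RecognizedWord, blanks_before and JoinPreservingSpaces are hypothetical names, and it assumes the blank count describes the gap in front of each word.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for one recognized word. In Tesseract the blank
// count would come from word->word->space(); here it is a plain field.
struct RecognizedWord {
  std::string text;
  int blanks_before;  // assumed: number of blanks detected before this word
};

// Joins the words, writing one ' ' per detected blank rather than
// collapsing every gap to a single space.
std::string JoinPreservingSpaces(const std::vector<RecognizedWord>& words) {
  std::string out;
  for (const RecognizedWord& w : words) {
    // Same effect as repeating "*ptr++ = ' ';" blanks_before times.
    if (w.blanks_before > 0) {
      out.append(static_cast<std::string::size_type>(w.blanks_before), ' ');
    }
    out += w.text;
  }
  return out;
}
```

In the real function the output buffer is presumably sized for one separator per word, so it would likely need to be enlarged to hold the extra spaces. That is an assumption to verify against the source before applying the change.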

Note that the result will be very inaccurate: except for fixed-pitch text, there is no easy way of telling how many spaces it takes to match the original gap in the font that you are going to use to display the output.
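
For the fixed-pitch case the estimate reduces to simple arithmetic: divide the gap width by the character pitch and round. A rough sketch follows, where EstimateSpaces, gap_width and char_pitch are hypothetical names and not part of Tesseract. Both widths are assumed to be in the same unit, for example pixels.

```cpp
#include <cmath>

// Estimates how many spaces fill a gap in a fixed-pitch font.
// gap_width and char_pitch must be in the same unit (e.g. pixels).
// Returns 0 for invalid input, otherwise at least 1 for any positive gap.
int EstimateSpaces(double gap_width, double char_pitch) {
  if (char_pitch <= 0.0 || gap_width <= 0.0) return 0;
  int n = static_cast<int>(std::lround(gap_width / char_pitch));
  return n < 1 ? 1 : n;
}
```

With a proportional font the pitch varies per character, so this kind of division no longer gives a reliable answer.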

Ray.