PATCH: preserve interword spaces


dhara...@gmail.com

Jan 21, 2015, 5:28:52 PM
to tesser...@googlegroups.com
I am attaching a small patch that adds a configurable parameter which, when enabled, preserves multiple interword spaces in the output text. The strategy is based on Ray's suggestion from several years ago in [1]. This seems to be useful to others as well [2] [3] [4].

As Ray mentioned originally, the number of spaces output will vary and may be inaccurate. That's OK for our use case, as we are mainly interested in distinguishing one space from multiple spaces.
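
To illustrate the "one space vs. multiple spaces" use case, here is a minimal sketch of how a consumer of the text output might split a line into fields wherever a run of two or more spaces appears. The helper name SplitOnWideGaps and the min_gap threshold are hypothetical and not part of the patch or of Tesseract.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical post-processing helper (not part of the patch): splits one
// output line into fields at every run of at least `min_gap` spaces.
// Shorter runs (a single space by default) stay inside the field.
std::vector<std::string> SplitOnWideGaps(const std::string& line,
                                         std::size_t min_gap = 2) {
  std::vector<std::string> fields;
  std::string current;
  std::size_t i = 0;
  while (i < line.size()) {
    if (line[i] != ' ') {
      current += line[i];
      ++i;
      continue;
    }
    // Measure the length of this run of spaces.
    std::size_t run_end = i;
    while (run_end < line.size() && line[run_end] == ' ') ++run_end;
    const std::size_t run = run_end - i;
    if (run >= min_gap) {
      // Wide gap: close the current field, if any.
      if (!current.empty()) {
        fields.push_back(current);
        current.clear();
      }
    } else {
      // Narrow gap: ordinary interword space, keep it in the field.
      current.append(run, ' ');
    }
    i = run_end;
  }
  if (!current.empty()) fields.push_back(current);
  return fields;
}
```

Because only "one" versus "more than one" matters here, the exact number of spaces the engine emits does not need to be accurate.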

I am happy to make adjustments if necessary.

Thanks,

David

preserve_interword_spaces.patch

zdenko podobny

Jan 22, 2015, 6:02:10 AM
to tesser...@googlegroups.com
Hi,

Thanks for the patch. Could you also provide a test case/image that demonstrates the problem?

Zdenko

--
You received this message because you are subscribed to the Google Groups "tesseract-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-de...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at http://groups.google.com/group/tesseract-dev.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-dev/5c5bdde5-4886-469f-a5af-b8097b0b5587%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

dhara...@gmail.com

Jan 22, 2015, 11:20:45 AM
to tesser...@googlegroups.com
Here is a sample image, with differing outputs...

tesseract -l eng -c preserve_interword_spaces=1 109359.tiff output-preserve-enabled
tesseract -l eng 109359.tiff output-default

Thanks.

David
output-default.txt
output-preserve-enabled.txt
109359.tiff

zdenko podobny

Jan 27, 2015, 5:09:25 PM
to tesser...@googlegroups.com
Thanks. Committed in Revision: 36883b4fafcd.

Zdenko

Jan Ruzicka

Jan 30, 2015, 1:01:16 AM
to tesser...@googlegroups.com
Hi,
There is one small problem with the patch.
With preserve_interword_spaces_ == false, there is now a space at the beginning of each line.
Previously, a space was added after each word and the last one was removed.

Adding (words_appended > 0) instead of the 1 in the numSpaces calculation would work, but it looks ugly.

so instead of:
int numSpaces = preserve_interword_spaces_ ? it_->word()->word->space() : 1;

it would be:
int numSpaces = preserve_interword_spaces_ ? it_->word()->word->space() : (words_appended > 0);
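
The effect of that change can be sketched in isolation. The following is a simplified stand-in, not the actual ResultIterator code: the Word struct and the JoinLine function are hypothetical, and space_before plays the role of it_->word()->word->space().

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a recognized word: its text plus the number of
// blanks measured in front of it.
struct Word {
  std::string text;
  int space_before;
};

// Joins the words of one line, emitting the separator *before* each word.
// With preserve == false, (words_appended > 0) evaluates to 0 for the first
// word and 1 for every later word, so the line gets no leading space.
std::string JoinLine(const std::vector<Word>& words, bool preserve) {
  std::string line;
  int words_appended = 0;
  for (const Word& w : words) {
    const int num_spaces = preserve ? w.space_before : (words_appended > 0);
    if (num_spaces > 0) {
      line.append(static_cast<std::size_t>(num_spaces), ' ');
    }
    line += w.text;
    ++words_appended;
  }
  return line;
}
```

With a constant 1 in place of (words_appended > 0), the first word would also be preceded by a space, which is the leading-space problem described above.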

Jan

dhara...@gmail.com

Jan 30, 2015, 1:37:41 AM
to tesser...@googlegroups.com
Oops, my apologies.  Thank you Jan.  I suppose I was too focused on the output with the parameter enabled :-)
Since the fix is so minor, perhaps it is easiest if Zdenko just changes it directly?

resultiterator.cpp:648

David

zdenko podobny

Jan 30, 2015, 4:31:30 PM
to tesser...@googlegroups.com
Committed.


Zdenko

Dmitri Silaev

Feb 4, 2015, 11:09:05 AM
to tesser...@googlegroups.com
FYI: this patch causes a compilation error under MSVC 2010 and earlier. As far as I know, the same applies to later versions (not tested). The cause is the C++11 member initialization inside the class declaration:

File: "ccmain\resultiterator.h"
Ln 239: bool preserve_interword_spaces_ = false;

Best regards,
Dmitri Silaev
www.CustomOCR.com


dhara...@gmail.com

Feb 5, 2015, 11:22:19 AM
to tesser...@googlegroups.com
The initialization can be moved to the ResultIterator constructor... ccmain/resultiterator.cpp:37 
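
A minimal sketch of that change, using a simplified hypothetical class (ResultIteratorSketch and its accessors are illustrative only; the real ResultIterator has more members and a different constructor signature):

```cpp
// Simplified, hypothetical stand-in for the iterator class; only the member
// under discussion is shown.
class ResultIteratorSketch {
 public:
  // The default value is set in the constructor's initializer list, which
  // compilers without C++11 support also accept.
  ResultIteratorSketch() : preserve_interword_spaces_(false) {}

  void set_preserve_interword_spaces(bool value) {
    preserve_interword_spaces_ = value;
  }
  bool preserve_interword_spaces() const {
    return preserve_interword_spaces_;
  }

 private:
  // Declared without "= false", so no in-class member initializer is needed.
  bool preserve_interword_spaces_;
};
```

The behavior is unchanged: the flag still defaults to false; only the place where the default is written moves from the header declaration to the constructor.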

Thanks Dmitri.

David