Parameters to increase tolerance of whitespace between characters?

Timothy Snyder

Jul 10, 2019, 11:16:55 AM
to tesseract-ocr
Hello all,

Does anyone know of any config parameters that will increase the tolerance of whitespace between characters, i.e., increase the amount of whitespace needed to trigger word segmentation?

I have many cases in my text where extra whitespace between characters causes a single word to be segmented into multiple words.

Any suggestions would be appreciated!

-Tim
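
To make the question concrete, here is a minimal sketch of what "tolerance of whitespace" means for word segmentation, assuming per-character bounding boxes are available for one text line. It is not Tesseract's segmentation code and does not use any Tesseract parameter; the function name `group_into_words`, the `gap_factor` argument, and the box coordinates are all hypothetical.

```python
from statistics import median


def group_into_words(chars, gap_factor=0.5):
    """Group characters on one text line into words.

    chars: list of (text, left, right) tuples (hypothetical format).
    A new word starts when the horizontal gap to the previous character
    exceeds gap_factor * median character width.
    """
    if not chars:
        return []
    chars = sorted(chars, key=lambda c: c[1])  # left-to-right order
    threshold = gap_factor * median(right - left for _, left, right in chars)

    words = [chars[0][0]]
    prev_right = chars[0][2]
    for text, left, right in chars[1:]:
        if left - prev_right > threshold:
            words.append(text)   # gap is wide enough: start a new word
        else:
            words[-1] += text    # gap is tolerated: extend the current word
        prev_right = right
    return words


# Made-up boxes: "w o r d" is loosely spaced (gap of 4),
# followed by a real word break (gap of 12) and a tight "two" (gap of 2).
line = [
    ("w", 0, 10), ("o", 14, 24), ("r", 28, 38), ("d", 42, 52),
    ("t", 64, 74), ("w", 76, 86), ("o", 88, 98),
]

print(group_into_words(line, gap_factor=0.3))  # low tolerance splits "word"
print(group_into_words(line, gap_factor=0.5))  # higher tolerance keeps it whole
```

Raising `gap_factor` is the kind of control being asked about: the loosely spaced characters stay in one word while the genuinely wide gap still separates words.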

Stephane Charette

Aug 27, 2019, 5:11:43 AM
to tesseract-ocr
I joined for similar/opposite reasons: in my case, Tesseract is removing critical whitespace between non-dictionary words, and I was looking for tips on what to tweak in Tesseract's configuration to get it to treat whitespace differently.

Anyone know?

Stéphane

Timothy Snyder

Aug 27, 2019, 9:14:27 AM
to tesser...@googlegroups.com
Yes, any info would be very useful. I've tried modifying a large number of config variables to no effect with Tesseract 4.0+. Having some control over line/word/character segmentation would be a very useful feature.

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/8e6878a9-655d-41c7-9d4d-bcb7dcfb6419%40googlegroups.com.

Juanjo Serrano Lloria

Oct 19, 2019, 7:16:58 AM
to tesseract-ocr
Same problem in Tesseract 4.1.0: it removes whitespace. I've tried a lot of parameters.