More context here. I'm trying to get Tesseract to split some of its detected boxes in half or thirds.
My approach has been to draw white vertical lines through the joined letters, so from before:
to after:
If you can't see the lines, here they are in red:
I would have expected that drawing the white lines would split these boxes apart. It does that, but it also has a side effect: it joins the "9" on the first line with the "s" below it on the next line:
Even if I draw a white line below the "9" and the "0", this still happens. As you might expect, these tall letters wreak havoc on the resulting OCR'd text.
I'm baffled why this is happening. Based on this SO answer, my understanding was that Tesseract looked at connected components to find boxes, so I would have expected the white lines to force apart two components.
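To make that expectation concrete, here is a minimal pure-Python sketch of what I assumed would happen. It is not Tesseract's actual code, and the helper names `draw_white_column` and `count_components` are made up for illustration. It assumes a grayscale image stored as a list of rows, with dark pixels below a threshold of 128 and 8-connectivity between them.

```python
from collections import deque

def draw_white_column(img, x, white=255):
    """Return a copy of img with every pixel in column x set to white."""
    return [row[:x] + [white] + row[x + 1:] for row in img]

def count_components(img, threshold=128):
    """Count 8-connected groups of dark pixels (value < threshold)."""
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    count = 0
    for y in range(h):
        for x in range(w):
            if img[y][x] >= threshold or seen[y][x]:
                continue
            # New dark pixel: flood-fill everything connected to it.
            count += 1
            seen[y][x] = True
            queue = deque([(y, x)])
            while queue:
                cy, cx = queue.popleft()
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and not seen[ny][nx]
                                and img[ny][nx] < threshold):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
    return count

# Toy example: two "letters" joined by a one-pixel bridge at column 3.
W, B = 255, 0
joined = [
    [W, W, W, W, W, W, W],
    [W, B, B, W, B, B, W],
    [W, B, B, B, B, B, W],
    [W, B, B, W, B, B, W],
    [W, W, W, W, W, W, W],
]
split = draw_white_column(joined, 3)
print(count_components(joined), count_components(split))  # prints "1 2"
```

In this toy case the white column turns one component into two, which is the behavior I was expecting from the boxes. It says nothing about why the "9" and the "s" end up merged across lines.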
Is it possible to give Tesseract an explicit list of boxes? If not, is there a more effective way to force apart two letters than what I'm doing?
Thanks!
- Dan