I am building an automatic trainer for Tesseract, but Tesseract has trouble with tall, thin characters when they appear at the end of a word.
For instance, in "XLSAjasLi", Tesseract fails on the "i" at the end. I am using Tesseract 3.02.
These are the box file entries around the error:
2 73 29876 101 29926 0
g 105 29876 133 29926 0
K 137 29876 167 29926 0
8 171 29876 199 29926 0
f 203 29876 225 29926 0
s 229 29876 257 29926 0
K 261 29876 291 29926 0
5 295 29876 323 29926 0
l 327 29876 344 29926 0
I 348 29876 365 29926 0 -- Error
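To make the geometry easier to inspect, here is a minimal sketch that parses these entries and prints each box's width and height. It assumes the usual Tesseract 3.x box layout (character, left, bottom, right, top, page); the `Box` type and `parse_box_line` helper are names made up for this illustration, not part of Tesseract.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """One box file entry: character, left, bottom, right, top, page."""
    char: str
    left: int
    bottom: int
    right: int
    top: int
    page: int

    @property
    def width(self) -> int:
        return self.right - self.left

    @property
    def height(self) -> int:
        return self.top - self.bottom


def parse_box_line(line: str) -> Box:
    """Parse the first six fields of a box line, ignoring trailing notes."""
    parts = line.split()
    left, bottom, right, top, page = (int(p) for p in parts[1:6])
    return Box(parts[0], left, bottom, right, top, page)


SAMPLE = """\
2 73 29876 101 29926 0
g 105 29876 133 29926 0
K 137 29876 167 29926 0
8 171 29876 199 29926 0
f 203 29876 225 29926 0
s 229 29876 257 29926 0
K 261 29876 291 29926 0
5 295 29876 323 29926 0
l 327 29876 344 29926 0
I 348 29876 365 29926 0 -- Error
"""

boxes = [parse_box_line(line) for line in SAMPLE.splitlines() if line.strip()]

for box in boxes:
    print(f"{box.char!r}: width={box.width} height={box.height}")
```

On this sample the two narrowest boxes are the "l" and the failing "I", both 17 pixels wide, while the other characters are 22 to 30 pixels wide.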
I've also attached a crop of the multipage TIFF file that was created. Note: the rectangular boxes are not in the original images; they were added to debug the box coordinates. I did not train Tesseract on the image with the rectangular boxes.
The multipage TIFF consists of two pages, each around 36000x449 pixels.
The training command and the specific error from Tesseract:
./tesseract-install/bin/tesseract OCR_Trainer_Output/tests/TestLargeImageBW.tiff test.arial.exp0 nobatch box.train
FAIL!
APPLY_BOXES: boxfile line 349/I ((348,29876),(365,29926)): FAILURE! Couldn't find a matching blob
This is just one of many such failures. Tesseract fails every time "I", "i", "j", or "l" is the very last character of a text.
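To check this pattern across the whole training log, a rough sketch like the following tallies the failures by character. The regular expression is inferred from the single APPLY_BOXES line shown above, so it is an assumption about the log format and may need adjusting for other messages or other Tesseract versions; `parse_failures` is a hypothetical helper name.

```python
import re
from collections import Counter

# Pattern inferred from the one failure line in this post (an assumption,
# not a documented Tesseract log format).
FAILURE_RE = re.compile(
    r"boxfile line (\d+)/(\S+) "
    r"\(\((\d+),(\d+)\),\((\d+),(\d+)\)\): FAILURE! (.*)"
)


def parse_failures(log_text):
    """Return (line_no, char, (left, bottom, right, top), reason) tuples."""
    failures = []
    for line in log_text.splitlines():
        match = FAILURE_RE.search(line)
        if match is None:
            continue
        line_no, char, left, bottom, right, top, reason = match.groups()
        coords = (int(left), int(bottom), int(right), int(top))
        failures.append((int(line_no), char, coords, reason.strip()))
    return failures


LOG = """\
FAIL!
APPLY_BOXES: boxfile line 349/I ((348,29876),(365,29926)): FAILURE! Couldn't find a matching blob
"""

failures = parse_failures(LOG)
by_char = Counter(char for _, char, _, _ in failures)

for char, count in by_char.most_common():
    print(f"{char!r}: {count} failure(s)")
```

Feeding it the full stderr output of the training run should show whether the failures really are limited to those four characters.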