Box file syntax for tesseract 4 (LSTM) training.

Amit Man

Jan 19, 2018, 3:23:43 AM

to tesseract-ocr

I've managed to improve Tesseract's results on some real-life documents by running "tesseract ... batch.nochop makebox" and correcting the resulting box file (in addition to adding spaces and EOLs).

I do have some questions about the correct syntax for the box file.

1) If some of the characters in the TIFF image are not represented in the box file, will that harm the training (in the sense that it will train Tesseract to ignore those characters)? Do I have to "get them all"?

2) How important are the coordinates of each character? Should I invest time in making sure they are exact? I understand the LSTM works as a "line recognizer"; how does that affect the training?

3) makebox generates a "~" character for lines in the document. Will fixing those before training help Tesseract detect them better?
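
For reference, here is a minimal sketch of how I am reading the box file while correcting it. It assumes each line has the form "symbol left bottom right top page", with coordinates measured from the bottom-left corner of the image, and that the space and EOL entries I added use a literal space or tab as the symbol. The names Box and parse_box_line are my own helpers, not part of Tesseract.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """One box file entry (hypothetical helper type, not a Tesseract API)."""
    symbol: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line: str) -> Box:
    """Parse one line of the assumed form 'symbol left bottom right top page'."""
    # Strip only the line terminator, then split from the right so that a
    # symbol which is itself a space or a tab is preserved.
    fields = line.rstrip("\r\n").rsplit(" ", 5)
    if len(fields) != 6:
        raise ValueError(f"malformed box line: {line!r}")
    symbol, left, bottom, right, top, page = fields
    return Box(symbol, int(left), int(bottom), int(right), int(top), int(page))


if __name__ == "__main__":
    # Made-up coordinates, for illustration only.
    print(parse_box_line("a 10 20 30 40 0\n"))
```

Splitting from the right is deliberate: a plain split() would swallow the symbol on the space and tab lines.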