Annotation process and wordlist requirement for fine-tuning Tesseract 4 on handwritten text

Mashrur Mahmud

Jul 7, 2020, 8:27:29 AM

to tesseract-ocr

Hi there. I've tried fine-tuning Tesseract 4 with `lstmtraining` on handwritten text, using box/tif pairs I generated myself. The overall training process worked without a hitch.

I now want to apply this fine-tuning process on a larger scale to form images. Here's my conundrum: my forms often contain a mixture of printed and handwritten text. Do I have to annotate both? Annotating the printed text as well would take extra effort, so I'm wondering whether it is sufficient to draw boxes only around the handwritten portions. However, I'm worried that if I box only the handwritten parts and leave out the printed parts, it might confuse my model somehow.

My second question: when I run inference with my trained model, it emits a warning: `Failed to load any lstm-specific dictionaries for lang X`. I understand that this is caused by the absence of word lists, punctuation lists, etc. (although inference still produces output).

How much does a word list affect the inference process? I could simply take the base language's word list from the GitHub repository and combine it into my newly trained traineddata file. However, the forms I will run Tesseract on will contain many people's names, which may not be present in that word list. In such a case, do I have to compile a new word list, or is it sufficient to do without one?
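To make the "compile a new word list" option concrete, here is a minimal sketch of what I have in mind: merging the base language's word list with a list of names into one deduplicated list, one word per line. The function name `merge_wordlists` and the sample words are hypothetical and only for illustration. As I understand it, the merged file would then still have to be converted with Tesseract's training tools (`wordlist2dawg`, then `combine_tessdata`) before the model could use it, and that step is not shown here.

```python
from typing import Iterable, List


def merge_wordlists(base_words: Iterable[str], extra_words: Iterable[str]) -> List[str]:
    """Merge two word lists, dropping blank entries and duplicates.

    Order is preserved: base words first, then any extra words not seen yet.
    (Hypothetical helper, not part of Tesseract.)
    """
    seen = set()
    merged = []
    for word in list(base_words) + list(extra_words):
        word = word.strip()  # remove surrounding whitespace and newlines
        if word and word not in seen:
            seen.add(word)
            merged.append(word)
    return merged


if __name__ == "__main__":
    # Made-up sample data; real input would be read from the word list files.
    base = ["form", "name", "date"]
    names = ["Mashrur", "Mahmud", "name"]
    print("\n".join(merge_wordlists(base, names)))
```

Matching is case-sensitive here, so `Name` and `name` would both be kept; whether that is desirable for a names-heavy list is part of what I'm unsure about.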
