Hi there. I've been fine-tuning Tesseract 4 with lstmtraining on handwritten text, using box/tiff pairs I generated myself. The overall training process has worked without a hitch.
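For context, the fine-tuning invocation I ran looked roughly like this (file names, paths, and the iteration count are placeholders, not my exact values):

```shell
# The base LSTM model was extracted first with:
#   combine_tessdata -e eng.traineddata eng.lstm
# train_files.txt lists the .lstmf files built from my box/tiff pairs.
lstmtraining \
  --model_output output/handwriting \
  --continue_from eng.lstm \
  --traineddata eng.traineddata \
  --train_listfile train_files.txt \
  --max_iterations 4000
```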
I now wish to apply this fine-tuning process on a larger scale, to form images. Here's my conundrum: my forms often contain a mixture of printed and handwritten text. Do I have to annotate both? Annotating both would take considerably more effort, so I'm wondering if it is sufficient to draw boxes only around the handwritten portions. However, I'm worried that if I box only the handwritten parts and leave out the printed parts, it might confuse my model somehow.
My second question: when I run inference with my trained model, it emits a warning: `Failed to load any lstm-specific dictionaries for lang X`. I understand this is caused by the absence of word lists, punctuation lists, etc. (it does still produce output).
I'm wondering how much a word list actually affects inference. I could simply take the base language's word list from the GitHub repository and combine it into my newly trained tessdata. However, the forms I will run Tesseract on contain lots of people's names, which may not be present in a generic word list. In that case, do I have to compile a new word list, or is it fine to do without one?
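In case it helps frame the question, here is the procedure I was planning to use to merge a word list into my traineddata, sketched with placeholder names (`X.traineddata` for my fine-tuned model, `words.txt` for the plain-text list, one word per line):

```shell
# 1. Unpack the traineddata into its components (X.lstm, X.lstm-unicharset, ...)
combine_tessdata -u X.traineddata X.

# 2. Compile the word list into an LSTM word dawg using the model's unicharset
wordlist2dawg words.txt X.lstm-word-dawg X.lstm-unicharset

# 3. Repack the components back into X.traineddata
combine_tessdata X.
```

Is this the right approach, and would adding names to `words.txt` before step 2 be the way to handle the people's names?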