Hello All,
I know that Tesseract is not intended to be used on handwritten data, but I'm trying to tackle a problem that has no straightforward solution at the moment: recognizing handwritten source code. There are no labelled datasets of handwritten source code from which to build a model from scratch.
A study done in 2017 evaluated the performance of the commercial engine MyScript on handwritten source code. The authors created and published an evaluation dataset of handwritten Python code samples.
My aim is to compare their results with the performance of Tesseract 4.0 after using the training tools to train Tesseract to recognize their evaluation dataset.
As a first step, I fine-tuned tessdata_best by giving it the following langdata:
1. eng.training_text - for this file I used the actual ground-truth source code of the handwritten samples. (Ultimately I would like the NN to learn a more generalized model by feeding it a lot of Python code, but as a first step I thought of just going with the target data itself.)
2. eng.wordlist - I gave this file the set of Python keywords, ordered from most frequent to least frequent.
3. eng.punc and eng.numbers - I removed the expressions that I know will never appear in source code and kept the rest. (Keep in mind that the dataset contains only source code; the comments are all removed.)
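For anyone who wants to reproduce item 2, here is a minimal sketch of how a frequency-ordered keyword list could be built from the ground-truth sources. It is not the script from my attachment; the function names are hypothetical, and it assumes the wordlist file takes one word per line. It uses a simple regex, so keywords that occur inside string literals are counted too.

```python
import keyword
import re
from collections import Counter

# Matches identifier-like words; a rough approximation, not a real Python tokenizer.
WORD_RE = re.compile(r"[A-Za-z_]\w*")


def keyword_frequencies(source: str) -> Counter:
    """Count how often each Python keyword occurs in one source string."""
    return Counter(w for w in WORD_RE.findall(source) if keyword.iskeyword(w))


def ordered_wordlist(sources):
    """Return all Python keywords, most frequent first.

    Keywords with equal counts (including those never seen) are sorted
    alphabetically so the output is deterministic.
    """
    totals = Counter()
    for src in sources:
        totals.update(keyword_frequencies(src))
    return sorted(keyword.kwlist, key=lambda kw: (-totals[kw], kw))


def wordlist_text(sources) -> str:
    """Format the ordered keywords one per line (assumed wordlist layout)."""
    return "\n".join(ordered_wordlist(sources)) + "\n"
```

The string returned by `wordlist_text(...)` could then be written to eng.wordlist, passing in the contents of the ground-truth files as `sources`.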
I created the training data using about 27 handwriting fonts I found online.
I have attached the data and scripts I used, along with the results for the two images 1.png and 9.png in Results.txt.
For 9.png, as you can see, the fine-tuned model shows a slight improvement: its output has no out-of-vocabulary characters and the WER is lower. I noticed that the model works well for block letters, as in 9.png, but still cannot recognize messy handwriting, which makes sense.
In 1.png, where the handwriting is a bit cursive, we can't really say that the trained model is better.
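To make the comparison concrete, here is a rough sketch of the two measures mentioned above. It assumes WER means the word-level edit distance divided by the number of reference words, with words split on whitespace, and that an "out-of-vocab character" is any character in the OCR output that never appears in the ground truth. Both definitions are my assumptions for this sketch, the function names are hypothetical, and the numbers in Results.txt may have been computed differently.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (each edit costs 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference item
                curr[j - 1] + 1,     # insertion of a hypothesis item
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hyp_words) / len(ref_words)


def oov_chars(reference: str, hypothesis: str) -> set:
    """Non-whitespace characters in the OCR output that are absent from the ground truth."""
    return {c for c in hypothesis if not c.isspace()} - set(reference)
```

Because source code has few spaces, a single misread symbol can turn a whole token such as `range(10):` into an error, so a character-level rate (the same `edit_distance` applied to the raw strings) may be worth reporting alongside WER.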
My question is: what else can I try in order to decrease the WER relative to default Tesseract? What could I do differently? Again, I know the results won't be perfect, but my objective is to use the training tools and show that, after training, the model performs better than default Tesseract.
I'm going to try training from scratch and training a few layers next; any thoughts regarding those approaches would also be helpful.
I have attached all my files and the training scripts used. 
Any feedback would be highly appreciated!
Thanks!
Rajeev.