Hello All,
I know that Tesseract is not intended to be used on handwritten data, but I'm trying to tackle a problem that has no straightforward solution at the moment: recognizing handwritten source code. There are no datasets of labelled handwritten source code from which to build a model from scratch.
There was a study done in 2017 that evaluated the commercial engine MyScript's performance on handwritten source code. The authors created and published an evaluation dataset of handwritten Python code samples.
My aim is to compare their results with Tesseract 4.0's performance after using the training tools to train Tesseract to recognize their evaluation dataset.
As a first step, I fine-tuned tessdata_best by giving it the following langdata:
1. eng.training_text - for this file I used the actual ground-truth source code of the handwritten samples (ultimately I would like the NN to learn a more general model by feeding it a lot of Python code, but as a first step I thought of just going with the target data itself).
2. eng.wordlist - I gave this file the set of Python keywords, ordered from most frequent to least.
3. eng.punc and eng.numbers - I got rid of the expressions that I know will never appear in source code and kept the rest (keep in mind the dataset has only source code; the comments are all removed).
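For anyone who wants to reproduce item 2, here is a minimal sketch of how such a frequency-ordered keyword list could be built from the ground-truth code. It is not the script I attached; the function name `keywords_by_frequency` is only illustrative, and it assumes the ground truth is valid Python that the standard `tokenize` module can read.

```python
import io
import keyword
import tokenize
from collections import Counter


def keywords_by_frequency(source):
    """Return the Python keywords used in `source`, most frequent first."""
    counts = Counter()
    # tokenize needs a readline callable; StringIO provides one for a string.
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME and keyword.iskeyword(tok.string):
            counts[tok.string] += 1
    # Sort by descending count; break ties alphabetically so the
    # ordering is deterministic between runs.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [kw for kw, _ in ordered]


if __name__ == "__main__":
    sample = "for i in range(3):\n    if i:\n        pass\n"
    # One keyword per line, which is the layout a wordlist file uses.
    print("\n".join(keywords_by_frequency(sample)))
```

Keywords that never occur in the samples are left out by this sketch; they could be appended from `keyword.kwlist` if the full set is wanted.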
I created the training data using about 27 handwriting fonts I found online.
I have attached the data and scripts I used, along with the results for the two images 1.png and 9.png in Results.txt.
For 9.png, as you can see, the fine-tuned model shows a slight improvement: it produces no out-of-vocabulary characters and the WER is lower. I noticed that the model works well for block letters, as in 9.png, but still cannot recognize messy handwriting, which makes sense.
In 1.png, where the handwriting is a bit cursive, we can't really say that the trained model is better.
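To make the WER comparison concrete, here is a minimal sketch of the usual definition: the word-level edit distance (substitutions, insertions, deletions) divided by the number of words in the reference. It assumes whitespace tokenization, which may differ from the tool used to produce the numbers in Results.txt, and the function name `word_error_rate` is only illustrative.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    truth = "def foo ( x ) :"
    ocr = "def fo ( x ) :"
    print(word_error_rate(truth, ocr))
```

Because source code is dense with punctuation, the way the text is split into words changes the score considerably, so the same tokenization should be used for the default and the fine-tuned model.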
My question is: what else can I try to decrease the WER relative to default Tesseract? What can I do differently? Again, I know the results won't be perfect, but my objective is to use the training tools and show that, after training, the model performs better than default Tesseract.
I'm going to try training from scratch and training a few layers next; any thoughts regarding those approaches would also be helpful.
I have attached all my files and the training scripts used.
Any feedback would be highly appreciated!
Thanks!
Rajeev.