Training tesseract 4.0 to recognize handwritten python code samples

Rajeev Kodippily

unread,

Feb 14, 2022, 4:41:33 AM2/14/22

to tesseract-ocr

Hello All,

I know that tesseract is not intended to be used on handwritten data, but I'm trying to tackle a problem that does not really have a straightforward solution at the moment, which is recognizing handwritten source code. There are no datasets of labelled handwritten source code to build a model from scratch.

There was a study done in 2017 where they evaluated the commercial engine myscript's performance on handwritten source code. They created and published an evaluation dateset of handwritten python code samples.

My attempt is to compare their results with tesseract 4.0 's performance after using the training tools to train tesseract to recognize their evaluation data set.

As a first step, I fine tuned tessdata_best by giving it the following langdata

1. eng.training_text - for this file I gave it the actual ground truth source code of the handwritten samples ( I ultimately would like the NN to create a more generalized model by feeding it a lot of python code but as a first step I thought of just going with the target data itself)

2. eng.wordlist - I gave this file the set of python keywords from most frequent to least

3. eng.punc and eng.numbers - I got rid of the expressions that I know will never appear on source code and kept the rest. ( keep in my mind the dateset has only source code, the comments are all removed)

I created the training data using about 27 handwriting fonts I found online.

I have attached the data and scripts I used and attached the results of the two images 1.png and 9.png in Results.txt

For 9.png as you can see it shows a slight improvement as it doesn't have out of vocab characters and the WER is lower. I noticed that the model works well for block letters as in 9.png but still cannot recognize when the handwriting is messy, which makes sense.

In 1.png where the handwriting is a bit cursive we can't really say that the trained model is better.

My question is, what other things that I can try to decrease the WER from default tesseract. What can I try differently ? Again, I know the results won't be perfect but my objective is to use the training tools and show that after training, the model will perform better than default tesseract.

I'm going to try training from scratch and training a few layers next, any thoughts regarding those approaches would also be helpful.

I have attached all my files and the training scripts used.

Any feedback would be highly appreciated!

Thanks!

Rajeev.

9.png

1.png

langdata.zip

Scripts

Results

Rajeev Kodippily

unread,

Feb 14, 2022, 4:49:08 AM2/14/22

to tesseract-ocr

These are the fonts I used.

-Rajeev

fonts.zip

MYRO STEL ANO

unread,

May 24, 2022, 2:27:06 AM5/24/22

to tesseract-ocr

Hi Can I see your traineddata, I wanna try it on my project

Reply all

Reply to author

Forward