Hello
I am using Tesseract OCR to perform number plate recognition on vehicles.
I am using jTessBoxEditor v1.1 to check my box and tif files.
At the moment, each iteration of my training uses about 250 - 300 number plates.
I have read in many places that one should train fonts separately. This is difficult in my case, as my source images contain number plates in varying fonts, so I would have to manually look through each of the 100 initial images I use per training iteration to separate them into different groups. Would this really be necessary?
I have been training for over a month now, on probably more than 1000 images and 3000 number plates, and cannot seem to get the accuracy above 86%.
I was wondering if you have any suggestions, as ideally I would like to see in excess of 90% accuracy.
What I have picked up is that the OCR struggles with certain problem characters: O vs 0, 5 vs S, 2 vs Z, B vs 8.
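In case it helps to quantify this, below is a minimal Python sketch of how such substitutions could be tallied from pairs of (expected plate, OCR output). The function name `confusion_counts` and the sample plates are made up for illustration; this is not part of Tesseract or jTessBoxEditor, and it only compares reads that have the same length as the expected plate.

```python
from collections import Counter


def confusion_counts(pairs):
    """Tally character substitutions between expected and OCR'd plates.

    pairs: iterable of (expected, actual) strings.
    Returns (Counter keyed by (expected_char, actual_char), skipped_count).
    Reads whose length differs from the expected plate are skipped, since
    a position-by-position comparison is not meaningful for them.
    """
    subs = Counter()
    skipped = 0
    for expected, actual in pairs:
        if len(expected) != len(actual):
            skipped += 1
            continue
        for e, a in zip(expected, actual):
            if e != a:
                subs[(e, a)] += 1
    return subs, skipped


# Hypothetical sample data, not real results.
sample = [
    ("CA5028", "CAS028"),
    ("BZ208O", "8Z2080"),
    ("ND1234", "ND1234"),
    ("XY99", "XY9"),
]

subs, skipped = confusion_counts(sample)
for (e, a), n in subs.most_common():
    print(f"expected {e!r} read as {a!r}: {n}")
print(f"skipped (length mismatch): {skipped}")
```

A tally like this would show which of the pairs above account for most of the misreads, which may help decide where to focus the training data.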
Is there a specific way of training that I should use to improve correct reads of these characters? While editing the tif/box files in jTessBoxEditor, I am torn between two approaches: discarding the poor-quality characters and keeping only the good-quality ones, or correcting every character to what it should be regardless of its quality in the tif file. Which is the better approach, and why?
Any other suggestions on how to improve my training using jTessBoxEditor would be greatly appreciated.
Thanks