Hi, I have a display that I would like to OCR (an example image is at the bottom).
I have found a font ('5x7-dot-matrix') that, as far as I can tell, matches the number format exactly.
I have created 40k image files similar to pil_image_10445.png, each with a corresponding .gt.txt file, and trained a new traineddata file from them.
My character set is limited to the digits 0-9 and the decimal point (.).
I have tried training on random sequences of these characters and on a more structured format (nnnnn.nnn); the results from all of the resulting traineddata files are poor.
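A minimal sketch of how the two kinds of ground-truth text described above (random sequences and the structured nnnnn.nnn layout) could be generated, one .gt.txt file per image. The helper names, the label length of 9, and the seed are hypothetical and not the script actually used; rendering the matching PNG files with the font is not shown.

```python
import random
from pathlib import Path

CHARSET = "0123456789."  # the limited character set described above


def random_label(rng, length=9):
    """Unstructured label: characters drawn uniformly from CHARSET (length is an assumption)."""
    return "".join(rng.choice(CHARSET) for _ in range(length))


def structured_label(rng):
    """Structured label in the nnnnn.nnn layout: five digits, a point, three digits."""
    return f"{rng.randrange(100000):05d}.{rng.randrange(1000):03d}"


def write_ground_truth(out_dir, count, make_label, seed=0):
    """Write pil_image_<i>.gt.txt files, one single-line label per file."""
    rng = random.Random(seed)
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    labels = []
    for i in range(count):
        label = make_label(rng)
        (out / f"pil_image_{i}.gt.txt").write_text(label + "\n", encoding="utf-8")
        labels.append(label)
    return labels
```

Each label would then be paired with an image of the same base name (for example pil_image_0.png next to pil_image_0.gt.txt).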
I have also tried converting the image to grayscale, cropping, enhancing the contrast, etc., to no avail. I am lucky to get one digit recognised.
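To make the preprocessing steps above concrete, here is a rough sketch of grayscale conversion, cropping, and a simple linear contrast stretch on a plain 2D list of pixel values. It is library-free for illustration only; the function names are hypothetical, the luminance weights are the common ITU-R BT.601 ones, and in practice an imaging library would do the same work on real image files.

```python
def to_gray(rgb_rows):
    """Convert rows of (r, g, b) tuples to rows of 0-255 luminance values."""
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
        for row in rgb_rows
    ]


def crop(rows, left, top, right, bottom):
    """Keep the pixels with left <= x < right and top <= y < bottom."""
    return [row[left:right] for row in rows[top:bottom]]


def stretch_contrast(rows):
    """Linearly rescale grayscale values so the darkest maps to 0 and the brightest to 255."""
    lo = min(min(row) for row in rows)
    hi = max(max(row) for row in rows)
    if hi == lo:
        # Flat image: nothing to stretch, return a copy unchanged.
        return [row[:] for row in rows]
    # Integer arithmetic keeps the result deterministic.
    return [[(v - lo) * 255 // (hi - lo) for v in row] for row in rows]
```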
Bizarrely, I get the same output no matter which input image file I use!
Using the attached traineddata file and the attached image I get "6.", which is the same output for every file I have tried.
I have stuck to either --psm 7 or --psm 13, as the other modes largely give no output.
I would like some advice on whether continuing to increase the size of the training set will help, or any hints on getting better OCR results for these digits.
I am using Tesseract 5.3.0 with Leptonica 1.83.0 on a Debian 11 machine.
I built tesstrain as per the GitHub instructions.
The command I am running is:
./tesseract -l dot_gas_int --psm 7 ~/tesstrain/data/dot_gas_int-ground-truth/pil_image_10445.png stdout
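A small sketch of how the same command could be run over several images and both page segmentation modes, to check whether the output really is identical for every input. The function names are hypothetical; the argument order simply mirrors the command above, and the script assumes it is run from the directory containing the ./tesseract binary.

```python
import os
import subprocess


def build_command(image_path, lang="dot_gas_int", psm=7, tesseract="./tesseract"):
    """Build the argument list for the command above; ~ is expanded because no shell is used."""
    return [tesseract, "-l", lang, "--psm", str(psm),
            os.path.expanduser(str(image_path)), "stdout"]


def run_ocr(image_path, psm=7):
    """Run tesseract once and return its stripped standard output."""
    result = subprocess.run(build_command(image_path, psm=psm),
                            capture_output=True, text=True, check=False)
    return result.stdout.strip()


def compare_outputs(image_paths, psms=(7, 13), ocr=run_ocr):
    """Return every (image, psm) result and whether all results are identical."""
    results = {(str(p), psm): ocr(p, psm=psm) for p in image_paths for psm in psms}
    all_same = len(set(results.values())) <= 1
    return results, all_same
```

If all_same comes back True across visibly different images, the recognised text does not depend on the input image.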
Apologies if I am doing something dumb; this is new to me and I am having a go :-)
Thanks
Simon