I am trying to OCR documents that we receive over FTP. The documents are PDF files that contain images. We process each PDF, extracting every page as a TIFF (CCITT T.6) file that is 2509x3530 pixels, 300 dpi, 1-bit depth.
As accuracy is not the best, I would like to better understand how to train Tesseract. As a first step, I was wondering which fonts were used to generate eng.traineddata? I have unpacked eng.traineddata using "combine_tessdata -u" and extracted the word list using dawg2wordlist, and am now trying to understand what the various artifacts are and how they are used. Is there a description available?
I was also wondering how one might improve processing speed. On an i7-4800MQ @ 2.7 GHz I was getting approximately 6 pages per minute (PPM) using 1 thread with Tess4J 3.0.0.
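To make the speed question concrete: 6 PPM works out to roughly 10 seconds per page on one thread, so the obvious thing to try is spreading pages across several threads. Below is a rough sketch of what I have in mind, not something I have measured. The class name ParallelOcrSketch, the method ocrAll and the ocrPage parameter are all made up for illustration; ocrPage stands in for whatever actually calls the OCR engine. I am assuming an engine instance should not be shared between threads, so each worker would need its own.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper: fans page files out over a fixed pool of worker threads.
final class ParallelOcrSketch {

    // Applies ocrPage to every path and returns the recognized text
    // in the same order as the input list.
    static List<String> ocrAll(List<String> pagePaths, int threads,
                               Function<String, String> ocrPage) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit one task per page; the pool runs up to `threads` at once.
            List<Future<String>> pending = new ArrayList<>();
            for (String path : pagePaths) {
                Callable<String> task = () -> ocrPage.apply(path);
                pending.add(pool.submit(task));
            }
            // Collect results in submission order, blocking until each is done.
            List<String> results = new ArrayList<>();
            for (Future<String> f : pending) {
                results.add(f.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for OCR", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("OCR failed for a page", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

In real use ocrPage would wrap the Tess4J call, for example creating a Tesseract object per worker and calling doOCR on the page file. Whether throughput scales with the number of cores this way is exactly what I am unsure about.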
Thanks
- viraf