Training tesseract - eng.traineddata

viraf

Feb 14, 2016, 11:15:29 AM

to tesseract-ocr

I am trying to OCR documents that we receive over FTP. The documents are PDF files that contain images. We process each PDF, extracting every page as a TIFF (CCITT T.6) file that is 2509x3530 pixels at 300 dpi with 1-bit depth.

As accuracy is not the best, I would like to better understand how to train Tesseract. As a first step, I was wondering which fonts were used to generate eng.traineddata. I have unpacked eng.traineddata using "combine_tessdata -u" and extracted the word list using dawg2wordlist, and am now trying to understand what the various artifacts are and how they are used. Is there a description available?
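
For anyone wanting to reproduce the unpacking step, here is a minimal sketch that builds and runs the two commands from Java. The class name, the "eng." output prefix, the eng.word-dawg component name, and the output file name are assumptions for illustration; it also assumes combine_tessdata and dawg2wordlist are on the PATH.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: builds the command lines for unpacking a traineddata
// file and dumping one of its word lists. File names are assumptions.
class UnpackTraineddataSketch {

    // combine_tessdata -u <traineddata file> <output path prefix>
    static List<String> unpackCommand(String traineddata, String prefix) {
        return Arrays.asList("combine_tessdata", "-u", traineddata, prefix);
    }

    // dawg2wordlist <unicharset file> <dawg file> <output word list>
    static List<String> wordlistCommand(String unicharset, String dawg, String wordlist) {
        return Arrays.asList("dawg2wordlist", unicharset, dawg, wordlist);
    }

    // Runs a command, forwarding its output to this process, and returns the exit code.
    static int run(List<String> command) throws IOException, InterruptedException {
        return new ProcessBuilder(command).inheritIO().start().waitFor();
    }

    public static void main(String[] args) throws Exception {
        // Unpack every component next to the original file, using the prefix "eng."
        run(unpackCommand("eng.traineddata", "eng."));
        // Convert the (assumed) word DAWG component back into a plain word list
        run(wordlistCommand("eng.unicharset", "eng.word-dawg", "eng.wordlist.txt"));
    }
}
```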

I was also wondering how one might improve processing speed. On an i7-4800MQ @ 2.7 GHz I was getting approximately 6 pages per minute (PPM) using 1 thread with Tess4J 3.0.0.
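
To put the number in context, 6 PPM is about 10 seconds per page. One obvious direction is to process several pages at once. The sketch below spreads page files over a fixed thread pool using only the JDK; the `ocrPage` function is a hypothetical stand-in for the real OCR call (for example a wrapper around a Tess4J instance), and it assumes that function is safe to call from several threads, e.g. by using a separate engine instance per call or per thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical sketch: OCR many page files concurrently on a fixed thread pool.
class ParallelOcrSketch {

    // Converts a throughput in pages per minute into seconds spent per page.
    static double secondsPerPage(double pagesPerMinute) {
        return 60.0 / pagesPerMinute;
    }

    // Applies ocrPage to every path using `threads` workers.
    // Results are returned in the same order as the input paths.
    static List<String> ocrAll(List<String> pagePaths,
                               Function<String, String> ocrPage,
                               int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String path : pagePaths) {
                Callable<String> task = () -> ocrPage.apply(path);
                futures.add(pool.submit(task));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> future : futures) {
                results.add(future.get()); // blocks until that page is done
            }
            return results;
        } catch (Exception e) {
            throw new RuntimeException("OCR task failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Stand-in OCR function; a real one would call the OCR engine here.
        Function<String, String> fakeOcr = path -> "text of " + path;
        List<String> pages = Arrays.asList("page1.tif", "page2.tif", "page3.tif");
        System.out.println(ocrAll(pages, fakeOcr, 2));
        System.out.println(secondsPerPage(6) + " seconds per page at 6 PPM");
    }
}
```

Whether this scales well depends on memory use per engine instance and on how many physical cores are available, so the thread count is something to measure rather than assume.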

Thanks

- viraf
