Training tesseract - eng.traineddata


viraf

Feb 14, 2016, 11:15:29 AM
to tesseract-ocr
I am trying to OCR documents that we receive over FTP. The documents are PDF files that contain images. We process each PDF, extracting every page as a TIFF (CCITT T.6) file that is 2509x3530 pixels, 300 dpi, 1-bit depth.
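For reference, the page geometry implied by those numbers can be checked with a little arithmetic. This is only a sketch: the function names are made up for illustration, and the byte count assumes each row is padded to a whole byte, as is usual for uncompressed 1-bit rasters.

```python
import math

def page_size_inches(width_px, height_px, dpi):
    """Physical page size implied by the pixel dimensions and resolution."""
    return width_px / dpi, height_px / dpi

def raw_bilevel_bytes(width_px, height_px):
    """Uncompressed size of a 1-bit image, assuming rows padded to whole bytes."""
    return math.ceil(width_px / 8) * height_px

w_in, h_in = page_size_inches(2509, 3530, 300)
print(f"{w_in:.2f} x {h_in:.2f} in")
print(raw_bilevel_bytes(2509, 3530), "bytes before CCITT T.6 compression")
```

That works out to roughly 8.4 x 11.8 inches per page, slightly larger than A4, and a little over 1 MB per page before compression.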

As accuracy is not the best, I am looking to better understand how to train tesseract. As a first step, I was wondering which fonts were used in generating eng.traineddata? I have unpacked eng.traineddata using "combine_tessdata -u" and extracted the wordlist using dawg2wordlist, and am now trying to understand what the various artifacts are and how they are used. Is there a description available?
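The unpacking steps described above can be sketched roughly as follows. The file names are assumptions: the "eng." output prefix, the eng.word-dawg component name (as found in the 3.0x English data), and the eng.wordlist.txt output file are only illustrative. The helper names unpack_commands and run_all are hypothetical wrappers, not part of tesseract.

```python
import subprocess

def unpack_commands(traineddata="eng.traineddata", prefix="eng."):
    """Build the two command lines; all file names are assumptions."""
    return [
        # Unpack every component into files named <prefix><component>.
        ["combine_tessdata", "-u", traineddata, prefix],
        # Usage: dawg2wordlist <unicharset> <dawg file> <output wordlist>
        ["dawg2wordlist", prefix + "unicharset", prefix + "word-dawg",
         "eng.wordlist.txt"],
    ]

def run_all(commands):
    """Run each command in order, stopping on the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)

if __name__ == "__main__":
    # Print the commands rather than running them, so this is safe to try.
    for cmd in unpack_commands():
        print(" ".join(cmd))
```

Calling run_all(unpack_commands()) would execute them, provided the tesseract training tools are on the PATH.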

I was also wondering how one might improve processing speed. On an i7-4800MQ @ 2.7GHz I was getting approximately 6 pages per minute (PPM) using 1 thread with Tess4J 3.0.0.
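To put that figure in perspective, 6 PPM is about 10 seconds per page. One obvious thing to try is OCRing several pages at once. The sketch below is an assumption-laden illustration, not a measured result: it shells out to the tesseract command-line tool rather than Tess4J, the names ocr_page and ocr_pages are hypothetical, and the default of 4 workers is a guess based on the four physical cores of the i7-4800MQ. Whether throughput actually scales with the worker count would have to be measured.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def seconds_per_page(pages_per_minute):
    """Convert a pages-per-minute rate into seconds spent on each page."""
    return 60.0 / pages_per_minute

def ocr_page(tiff_path):
    """Run the tesseract CLI on one page and return the recognized text."""
    result = subprocess.run(
        ["tesseract", tiff_path, "stdout", "-l", "eng"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

def ocr_pages(tiff_paths, workers=4, ocr=ocr_page):
    """OCR pages concurrently; results come back in input order."""
    # Threads suffice here because each worker just waits on a child process.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(ocr, tiff_paths))

print(seconds_per_page(6))
```

The ocr parameter exists so that a different backend (or a stub for testing) can be swapped in without touching the pooling logic.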

Thanks

- viraf