Speed
On an Intel i7-4800MQ @ 2.7GHz I am getting approximately 6 PPM using 1 thread. I am looking for suggestions on how to speed up page processing. I use parallelStream to process each page in a separate thread.
Training
I am trying to learn about training Tesseract for improved accuracy. Given that the fonts / box files used to generate eng.traineddata are not available, can one specify the fonts used for English?
Also, is there a description of the various training artifacts? I used "combine_tessdata -u" to unpack eng.traineddata and "dawg2wordlist" to extract the word list, but I am looking for documentation to better understand what each artifact contains.
Tom, the images are TIFF (CCITT T.6) images - 2509 x 3530 @ 300 dpi (1 bit - i.e. BW). The language is English.
I am using Tess4J 3.0, which includes Tesseract 3.0.4. I am instantiating a new Tesseract object for each page; however, the cost of doing so was minimal (74 ms) over the total run.
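For reference, here is a minimal sketch of how one engine per worker thread (instead of one per page) could be combined with a parallel stream, in case the per-page construction cost ever grows. FakeEngine is a hypothetical stand-in I made up so the example runs on its own; it is not the Tess4J API, and a real engine would be constructed in its place.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

final class PerThreadEngineDemo {
    // Hypothetical stand-in for an OCR engine; NOT the Tess4J API.
    static final class FakeEngine {
        static final AtomicInteger CREATED = new AtomicInteger();

        FakeEngine() {
            CREATED.incrementAndGet(); // count constructions to show reuse
        }

        String recognize(int pageNumber) {
            return "text of page " + pageNumber;
        }
    }

    // One engine per worker thread, created lazily on that thread's first page.
    static final ThreadLocal<FakeEngine> ENGINE = ThreadLocal.withInitial(FakeEngine::new);

    // Processes pages 1..pageCount in parallel; results stay in page order.
    static List<String> ocrAll(int pageCount) {
        return IntStream.rangeClosed(1, pageCount)
                .parallel()
                .mapToObj(page -> ENGINE.get().recognize(page))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> texts = ocrAll(100);
        System.out.println(texts.size() + " pages, "
                + FakeEngine.CREATED.get() + " engines created");
    }
}
```

The number of engines created is bounded by the number of threads that actually pick up work, not by the number of pages.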
When you state "taking a big hit on image processing", how would I be able to isolate the issue to image processing?
Tom, on the subject of fonts, eng.inttemp is a binary file in 3.0.4. I did not see a command to extract its contents. Do you have suggestions on how to review this file? Thanks - viraf
I ran perf tool and noticed that 40% of the time is spent in IntegerMatcher::UpdateTablesForFeatures.
I am new to Tesseract and am using it through Tess4J. I am trying to OCR faxes where pages are represented as TIFF (CCITT T.6) images - 2509 x 3530 @ 300 dpi (1 bit - i.e. BW). I have two sets of questions.
Speed
On an Intel i7-4800MQ @ 2.7GHz I am getting approximately 6 PPM using 1 thread. I am looking for suggestions on how to speed up page processing. I use parallelStream to process each page in a separate thread.
- viraf
I created a large (1800 page) multi-page TIFF and am feeding it to Tesseract via the command line (on Ubuntu). This way I am testing the performance of Tesseract itself.
This is about 25% of the performance of a commercial engine that I am evaluating (it gets about 24 PPM with 2 cores on my laptop).
Tom, I created a multi-page TIFF as per the earlier recommendation on this thread (avoid multiple inits). Running it on Linux from the command line gave me a reference PPM that I could target with Tess4J. I had hoped to get 10+ PPM / core and then shift focus to accuracy. I am at about 6 PPM and it is unclear to me where / how to improve performance (speed).
Pages | Time (ms) | PPM      |
372   | 2395903   | 9.315903 |
372   | 2293524   | 9.731749 |
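For anyone checking the numbers, the PPM figures above are just pages divided by elapsed minutes. A small sketch of that arithmetic follows; the class and method names are mine, for illustration only.

```java
import java.util.Locale;

final class Ppm {
    // Pages per minute from a page count and elapsed wall-clock time in ms.
    static double pagesPerMinute(int pages, long elapsedMs) {
        if (elapsedMs <= 0) {
            throw new IllegalArgumentException("elapsedMs must be positive");
        }
        return pages * 60_000.0 / elapsedMs;
    }

    public static void main(String[] args) {
        // The two runs reported above: 372 pages each.
        long[] elapsedMs = {2_395_903L, 2_293_524L};
        for (long ms : elapsedMs) {
            System.out.println(String.format(Locale.ROOT,
                    "372 pages in %d ms = %.6f PPM", ms, pagesPerMinute(372, ms)));
        }
    }
}
```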