2 min on 1 page TIFF using Fast trained data


Ravil R

Apr 12, 2020, 10:00:33 AM
to tesseract-ocr
I have my own simple Windows DLL based on the tesseractmain.cpp code. It has worked fine since Tesseract 3.x (I have now moved it to the latest 5 build), and the only issue that still persists is its low speed: a 1-page TIFF takes around 2 minutes even with the Fast version of tessdata ('eng+rus'). Is this how it actually works, or is there something I don't understand?
Almost all of the time is spent in this line:
api.ProcessPages("c:\\1.tif", NULL, 0, NULL);
Sample file is attached
1.tif.zip
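
A minimal sketch for measuring how long a call like the one above takes, so that numbers can be compared across builds and tessdata sets. `TimeSeconds` is a hypothetical helper name, and the callable only stands in for the `ProcessPages` call; nothing here is part of the Tesseract API.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Runs `work` once and returns the elapsed wall-clock time in seconds.
// `work` is a hypothetical stand-in for the call being measured, for example
// a lambda that wraps the api.ProcessPages(...) call from the post.
double TimeSeconds(const std::function<void()>& work) {
  assert(work);  // an empty std::function cannot be called
  const auto start = std::chrono::steady_clock::now();
  work();
  const auto stop = std::chrono::steady_clock::now();
  return std::chrono::duration<double>(stop - start).count();
}
```

Usage would look something like `double s = TimeSeconds([&] { api.ProcessPages("c:\\1.tif", NULL, 0, NULL); });`, assuming `api` is the already initialized object from the post. `steady_clock` is used because it is not affected by system clock adjustments.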

Zdenko Podobny

Apr 13, 2020, 3:08:08 AM
to tesser...@googlegroups.com
Why did you decide to ignore the instructions in the comment?
Why should we care about your problems if you do not care?

Zdenko


--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/759d47df-da5f-4683-ab13-0f8ffb08b159%40googlegroups.com.

Ravil R

Apr 13, 2020, 3:50:50 AM
to tesseract-ocr
Sorry, I have only just seen your full answer with the questions. Yesterday I only got an email with the advice to go to the forum, which I did.
Here are the answers:
1) I tested the latest 5.0.0-alpha build with all types of data files: the modern ones (best, fast, normal) and the old ones for version 3.0.
2) Yesterday I also tested versions 3.05 (with the old tessdata files) and 4.0 (with both the old data files and the modern "Fast" data files).
3) My PC is a notebook: i7-7700HQ, 32 GB RAM, Windows 10, MS VC 2015. During recognition, one core is fully loaded.
4) I read the issues about performance but didn't find them useful; when someone complains that 2 seconds is too slow, it just makes me laugh.
5) 2 minutes per page with the "Fast" data is an approximate value; if the tested app is compiled using a Release build, it is 30% faster, but still very slow. Recognition with the "Best" data files takes around 5 minutes.
6) The Tesseract version doesn't significantly affect the results.
7) The old data files are about the same size as the "best" data files and run a little faster than the "fast" data files, but produce worse output than "fast". So the recognition quality is improving.


Shree Devi Kumar

Apr 13, 2020, 10:56:24 AM
to tesseract-ocr
> if a tested app is compiled using Release build it is 30% faster, but still very slow.
Debug builds are going to be slower.

I tested with the command line on Linux. The TIFF file does take a long time to recognize. Converting the file to 300 dpi and a smaller size sped up recognition somewhat.

If all your images use the same font, you can try some fine-tuning to see if it helps.


Ravil R

Apr 13, 2020, 11:22:54 AM
to tesseract-ocr
1) It has the standard fax resolution of 204x196 dpi. Of course I can convert it to 300x300 and then recognize it. Does that make sense?
2) The font could be of any type and in either language (eng or rus), so no fine-tuning is possible.
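
Since 204x196 dpi means the pixels are not square, converting to 300x300 needs a different scale factor per axis. A rough sketch of that arithmetic follows; `Size` and `ResampledSize` are hypothetical names, and the actual resampling would still have to be done by an image library.

```cpp
#include <cassert>
#include <cmath>

struct Size {
  int width;
  int height;
};

// Returns the pixel dimensions an image needs after resampling from
// (src_xdpi, src_ydpi) to target_dpi on both axes. Each axis gets its own
// scale factor, so a 204x196 dpi fax page ends up with square pixels.
Size ResampledSize(Size src, double src_xdpi, double src_ydpi,
                   double target_dpi) {
  assert(src_xdpi > 0 && src_ydpi > 0 && target_dpi > 0);
  const double sx = target_dpi / src_xdpi;  // about 1.47 for 204 -> 300
  const double sy = target_dpi / src_ydpi;  // about 1.53 for 196 -> 300
  return Size{static_cast<int>(std::lround(src.width * sx)),
              static_cast<int>(std::lround(src.height * sy))};
}
```

For example, a page that is 1728 pixels wide at 204 dpi (an assumed width, not taken from the attached file) would come out 2541 pixels wide at 300 dpi.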


Zdenko Podobny

Apr 13, 2020, 2:02:42 PM
to tesser...@googlegroups.com
OS Name:                   Microsoft Windows 10 Pro
OS Version:                10.0.18362 N/A Build 18362
System Model:              Latitude E5570
System Type:               x64-based PC
Processor(s):              1 Processor(s) Installed.
                           [01]: Intel64 Family 6 Model 78 Stepping 3 GenuineIntel ~2801 Mhz

tesseract -v
tesseract 5.0.0-alpha-638-gef4f
 leptonica-1.80.0 (Mar 12 2020, 12:47:16) [MSC v.1916 LIB Release x64]
  libgif 5.1.2 : libjpeg 6b (libjpeg-turbo 2.0.2) : libpng 1.6.36 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 1.0.2 : libopenjp2 2.3.0
 Found AVX2
 Found AVX
 Found FMA
 Found SSE
 Found libarchive 3.3.3 zlib/1.2.11 liblzma/5.2.4 libzstd/1.3.8

-l eng:
tessdata_best duration: 22.839419659999997
tessdata_fast duration: 3.3998838399999984
tessdata duration: 5.028869279999998


-l eng+rus:
tessdata_best duration: 42.03311656
tessdata_fast duration: 4.122473539999999
tessdata duration: 9.4696169


-l eng+rus -c tessedit_do_invert=0
tessdata_best duration: 33.66898392
tessdata_fast duration: 1.7703644200000042
tessdata duration: 6.849705899999998
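
To put the durations above in perspective, the ratios can be worked out directly. This is only arithmetic on the reported `-l eng+rus` numbers; `Speedup` and the constant names are made up for the illustration.

```cpp
#include <cassert>

// Returns how many times faster `fast_seconds` is than `slow_seconds`.
double Speedup(double slow_seconds, double fast_seconds) {
  assert(fast_seconds > 0.0);
  return slow_seconds / fast_seconds;
}

// Durations in seconds for "-l eng+rus", copied from the results above.
constexpr double kBest = 42.03311656;
constexpr double kFast = 4.122473539999999;
constexpr double kFastNoInvert = 1.7703644200000042;
```

`Speedup(kBest, kFast)` comes out at roughly 10, and `Speedup(kFast, kFastNoInvert)` at roughly 2.3, so on this machine both the choice of tessdata and `tessedit_do_invert=0` make a visible difference.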


tested with script:

I built tesseract with cmake and clang 10 with VS 2017 compatibility.

Zdenko



Ravil R

Apr 13, 2020, 11:56:06 PM
to tesseract-ocr
Oh you gave so much info, thanks!
My test exe file shows this version information:
tesseract 5.0.0
 leptonica-1.79.0 (Apr 14 2020, 06:42:43) [MSC v.1900 LIB Debug x86]
  libjpeg 9b : libpng 1.6.32 : libtiff 4.0.7 : zlib 1.2.11


Looks like I need to add (upgrade) the whole package


Zdenko Podobny

Apr 14, 2020, 6:25:03 AM
to tesser...@googlegroups.com
Without AVX support, tesseract 4/5 will be slow(er). So try to focus on this.
Using more than one language will slow down OCR too.

Zdenko



Ravil R

Apr 15, 2020, 1:10:46 PM
to tesseract-ocr
Yes, exactly. I updated the libraries (without turbojpeg and libarchive) and added AVX2 support; now it works at least 10 times faster than before. Problem solved. Thank you very much!
Ravil


Zdenko Podobny

Apr 15, 2020, 2:45:39 PM
to tesser...@googlegroups.com
Just for future reference: for AVX (and ...) support, only tesseract needs to be rebuilt; it depends on the compiler and the hardware.
Of course it makes sense to use the latest versions of the tesseract dependencies (because of security, bug fixes, etc.), but they have (AFAIK) minimal effect on tesseract speed (they are used for reading input images).
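
As a quick sanity check on the compiler side, a source file can report which instruction sets it was compiled for by looking at predefined macros. This is a generic sketch with a hypothetical function name; it does not reproduce the detection behind the `Found AVX2` lines in `tesseract -v`, and it says nothing about what the CPU itself supports.

```cpp
#include <string>

// Lists the SIMD instruction sets this translation unit was compiled for,
// based on predefined compiler macros. It reflects compiler flags only
// (for example /arch:AVX2 for MSVC or -mavx2 for GCC/Clang).
// Assumption: GCC and Clang define all four macros below when the matching
// flag is set; MSVC may define only __AVX__ and __AVX2__.
std::string CompiledSimdFeatures() {
  std::string features;
#if defined(__AVX2__)
  features += "AVX2 ";
#endif
#if defined(__AVX__)
  features += "AVX ";
#endif
#if defined(__FMA__)
  features += "FMA ";
#endif
#if defined(__SSE4_1__)
  features += "SSE4.1 ";
#endif
  return features.empty() ? std::string("none") : features;
}
```

Printing the result from a file built with the same project settings as the tesseract sources shows whether the AVX-related flags are actually reaching the compiler.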

Zdenko



Ravil R

Apr 16, 2020, 4:35:49 AM
to tesseract-ocr
OK, got it: no need to pay too much attention to the libraries other than tesseract itself.
