All-caps, small-caps


bácsi Kazi

Dec 15, 2015, 2:33:27 AM
to tesseract-ocr
Hi there!

I'm experimenting with Tesseract 3.02 and need precise recognition of capital letters. Unfortunately, my files are full of all-caps and small-caps text. When I included such words in the training sample, I got random capitals in the rest of the text. I tried putting them into a separate font, with the same result. Adding them to the dictionary files helped somewhat, but letters such as o, u, and v are still problematic: I get HELLo WoRLD and the like, despite having HELLO WORLD in the dictionary.
It's quite similar to this:
https://code.google.com/p/tesseract-ocr/issues/detail?id=691
What is your experience? How should Tesseract be trained for caps? Is it better in later versions? Is there a configuration parameter I can set?
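Until the recognition itself improves, one possible workaround is to repair the case after OCR. The following is only a minimal sketch and not a Tesseract feature: the function name `restore_all_caps`, the word set, and the 0.5 threshold are all assumptions made for illustration. It uppercases a token only when most of its letters are already capitals and the uppercased form is in a list of known all-caps words.

```python
import re


def restore_all_caps(text, all_caps_words, min_upper_ratio=0.5):
    """Uppercase tokens that look like damaged all-caps words.

    Hypothetical helper, not part of Tesseract. A token is rewritten only if
    it mixes cases, at least `min_upper_ratio` of its letters are uppercase,
    and its uppercased form is in `all_caps_words`.
    """
    def fix(match):
        token = match.group(0)
        candidate = token.upper()
        # The pattern below matches letters only, so len(token) >= 1.
        upper_count = sum(1 for c in token if c.isupper())
        if (token != candidate
                and upper_count / len(token) >= min_upper_ratio
                and candidate in all_caps_words):
            return candidate
        return token

    # [^\W\d_]+ matches runs of Unicode letters (no digits, no underscore).
    return re.sub(r"[^\W\d_]+", fix, text)


# Assumed word list; in practice it would come from the same dictionary
# that was given to Tesseract.
known = {"HELLO", "WORLD"}
fixed = restore_all_caps("HELLo WoRLD and hello", known)
```

Ordinary lowercase or capitalized words are left alone because too few of their letters are uppercase, so the threshold is the main thing to tune.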
Thanks!

Kazi

bácsi Kazi

Dec 27, 2015, 3:43:28 PM
to tesseract-ocr
Could you help? Have I missed something blatantly trivial?
Any help would be highly appreciated!

Kazi

zdenko podobny

Dec 28, 2015, 4:08:35 AM
to tesser...@googlegroups.com
When you ask for support, please provide example files.
Did you try the latest version of Tesseract?

Zdenko


bácsi Kazi

Dec 28, 2015, 1:00:56 PM
to tesseract-ocr
Dear Zdenko,

I have attached an example file. In the output you can see Enrico, Antonio, and Roberto with this mistake, even though all these names are in the dictionary in all caps.
I haven't tried later versions, because you have a policy of not providing Windows installers, and I was busy with other programming. But if you say it's worth it, I'll try. Is there a guide on how to create a portable version for Windows?
Thanks again!

Kazi
061094 3405,22,24,26,-7,30,35,38,44.pdf27.png
XII_PDS_GrassoG_061094 3405,22,24,26,-7,30,35,38,44_p0027.txt

zdenko podobny

Dec 28, 2015, 2:23:35 PM
to tesser...@googlegroups.com
First of all, there is no policy of not providing Windows installers. There is no installer because there is nobody to maintain it and provide a solution (e.g. NSIS destroys my PATH variable on Windows ;-) ). Everybody is busy programming :-) (something else).

Next: there is a Windows build based on Cygwin, so if you need a portable Windows version, you can get it (search this forum).

Next: in the attachment you can find output produced by the current Tesseract code with:
    tesseract example.png example -l spa
(I renamed your file, and I hope I chose the correct language for OCR.) It seems that the result is better than yours, including capitalization.
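For batch use, the same command-line call can be wrapped in a small script. This is only a sketch: the helper names `build_tesseract_command` and `run_ocr` are made up for illustration, and it assumes the `tesseract` executable is on the PATH and the traineddata for the chosen language is installed. The language is a parameter so that, for example, `ita` can be passed for an Italian page instead of `spa`.

```python
import subprocess


def build_tesseract_command(image_path, output_base, lang):
    """Return the argument list for: tesseract IMAGE OUTPUTBASE -l LANG."""
    return ["tesseract", image_path, output_base, "-l", lang]


def run_ocr(image_path, output_base, lang):
    """Run Tesseract and return the path of the text file it writes.

    Assumes `tesseract` is on the PATH; raises CalledProcessError if the
    program exits with a non-zero status.
    """
    subprocess.run(build_tesseract_command(image_path, output_base, lang),
                   check=True)
    # Tesseract appends ".txt" to the output base name.
    return output_base + ".txt"


# Build (but do not run) the command for an Italian page.
command = build_tesseract_command("example.png", "example", "ita")
```

Calling `run_ocr("example.png", "example", "ita")` would then write `example.txt` next to the script, provided the Italian language data is available.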

IMO the tesseract executable is a nice example of how to use the Tesseract library. Maybe you should try using the Tesseract library directly.


Zdenko

example.txt

bácsi Kazi

Dec 29, 2015, 6:41:37 PM
to tesseract-ocr
Dear Zdenko!

Thank you for your reply! Even though the original file was in Italian, your output is quite impressive!
I found a guide on compiling with Cygwin: http://vorba.ch/2014/tesseract-cygwin.html
So I installed Cygwin64 with the necessary packages. Everything went fine with Leptonica, but the Tesseract build failed. During make, when processing ccutil/ambigs.cpp, it cannot find the strtok_r.h file, although the file is in the vs2010/port folder (if I place it there, the build finds it ambiguous). I used: CPPFLAGS="-I/usr/local/include" LDFLAGS="-L/usr/local/lib" ./configure because of my Leptonica installation.
So I can't even get a "normal" installation, let alone the one described here: https://github.com/tesseract-ocr/tesseract/wiki/Compiling
I'm not familiar with this stuff, which is why I was asking for an installer (I couldn't find the one you were referring to).
I also couldn't work out what exactly you were suggesting in your last line.
Greetings:

Kazi
config.log

ShreeDevi Kumar

Dec 30, 2015, 3:42:25 AM
to tesser...@googlegroups.com

On Cygwin, Marco Atzeri has packaged Tesseract 3.04.00 as well as the training utilities, along with some training data. Instructions for installing Cygwin are here: https://cygwin.com/cygwin-ug-net/setup-net.html

Tesseract-specific packages to install:

tesseract-ocr                           3.04.00-2
tesseract-ocr-eng                       3.04-1
tesseract-training-core                 3.04-1
tesseract-training-eng                  3.04-1
tesseract-training-util                 3.04.00-2

ShreeDevi

bácsi Kazi

Dec 30, 2015, 5:20:31 PM
to tesseract-ocr
Thanks!
So I suppose the files from the download page caused the error, while the newer files on Git work well when building Tesseract on Cygwin.
Greetings:

Kazi