How to test the file kan.unicharambigs used for tesseract-ocr 3.04 or 3.05

sriranga(83yrsold)

Dec 2, 2015, 5:55:40 AM
to tesseract-ocr

I have created kan.unicharambigs (attached below) based on the output text of the Kan.training_text file (which is big). I could not work out how to test the attached file and find out whether it works.
Kindly point out any mistakes in the attached file; I shall be thankful to you. I would prefer a command-line test if possible.

==========================================================================
Based on the wiki instructions (extract reproduced below for ready reference):

The rules are not bidirectional, so if you want 'rn' to be considered when 'm' is detected and vice versa you need a rule for each.

Version 3.03 and on supports a new, simpler format for the unicharambigs file:

v2
'' " 1
m rn 0
iii m 0

In this format, the "error" and "correction" are simple UTF-8 strings separated by a space, and, after another space, the same type specifier as v1 (0 for optional and 1 for mandatory substitution). Note the downside of this simpler format is that Tesseract has to encode the UTF-8 strings into the components of the unicharset. In complex scripts, this encoding may be ambiguous. In this case, the encoding is chosen so as to use the fewest UTF-8 characters for each component, i.e. the shortest unicharset components will make up the encoding.

Like most other files used in training, the 'unicharambigs' file must be encoded as UTF8, and must end with a newline character. The unicharambigs format is also described in the unicharambigs(5) man page.
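The rules in this extract can be checked mechanically before a file goes into a traineddata. Below is a minimal Python sketch, not part of Tesseract: the function name `check_unicharambigs_v2` is hypothetical, and it assumes that fields are split on single spaces and that the first line is the `v2` header, as in the wiki example. It only checks the layout; it cannot tell whether Tesseract can encode the strings with the unicharset.

```python
def check_unicharambigs_v2(text):
    """Return (line_number, message) pairs for layout problems in v2-format text.

    Assumptions (from the wiki extract, not from the Tesseract source):
    first line is 'v2', every other line is 'error correction type',
    separated by single spaces, with type 0 or 1.
    """
    problems = []
    if not text.endswith("\n"):
        problems.append((0, "file does not end with a newline"))
    lines = text.split("\n")
    if text.endswith("\n"):
        lines = lines[:-1]  # drop the empty piece after the final newline
    if not lines or lines[0].strip() != "v2":
        problems.append((1, "first line is not the 'v2' header"))
    for number, line in enumerate(lines[1:], start=2):
        if not line.strip():
            problems.append((number, "empty line"))
            continue
        fields = line.split(" ")
        if len(fields) != 3:
            problems.append(
                (number, "expected 3 space-separated fields, found %d" % len(fields))
            )
            continue
        if fields[2] not in ("0", "1"):
            problems.append((number, "type must be 0 or 1"))
    return problems


# Small demonstration: an empty line and a stray fourth column are reported.
sample = "v2\nm rn 0\n\niii m 0 junk\n"
for number, message in check_unicharambigs_v2(sample):
    print(number, message)
```

To check a real file, read it with `open(path, encoding="utf-8").read()` and pass the result in; a file that is not valid UTF-8 will raise `UnicodeDecodeError` at that point, which covers the encoding requirement as well.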


kan.unicharambigs

Sriranga(83yrsold)

Dec 4, 2015, 10:36:13 PM
to tesser...@googlegroups.com
Solution is requested urgently.

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at http://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/0d30025d-cc11-4f69-9e98-ec919d3f43df%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Tom Morris

Dec 7, 2015, 1:14:48 PM
to tesseract-ocr
Hi Sriranga.  I haven't used the training tools, but since no one else has answered, I'll give it my best attempt.  Shree might have better insights.

First, a question of clarification.  Are you having problems with the file or are you just trying to determine whether it is working properly or not?

If you just want to see if it's working correctly, my impression is that most people do this empirically by a) visual inspection of the file to see if the substitutions look correct and b) running a corpus of text through to see how the contents of the file affect accuracy.
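One way to put a number on step (b) is a character error rate: OCR the same pages once with the old traineddata and once with the new one, then compare each output against hand-corrected ground truth. The following is a minimal Python sketch under that assumption; the function names are hypothetical, and it counts Unicode code points, which for Kannada is finer-grained than the visible characters on the page.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, counted in code points."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(
                min(
                    previous[j] + 1,         # delete ca
                    current[j - 1] + 1,      # insert cb
                    previous[j - 1] + cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]


def character_error_rate(ocr_text, ground_truth):
    """Edit distance divided by the length of the ground truth."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return edit_distance(ocr_text, ground_truth) / len(ground_truth)


# Compare two OCR outputs of the same line against the same ground truth.
truth = "Now is the time to go down"
print(character_error_rate("Novv is the time to go dovvn", truth))
print(character_error_rate("Now is the time to go down", truth))
```

If the rate drops on a realistic corpus after the unicharambigs change, the file is helping; if it rises, some substitution is probably too aggressive.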

To my untrained eye, the things I wonder about are:
- are all those mandatory substitutions (lines ending in 1) correct? i.e. is it true that the string in column 1 can *never* be a valid word?
- there is an empty line which probably should be removed
- there are a few lines which have junk after the third column which don't match the specified format e.g.:

ಚಟಿಲ್ಕೆ ಚಟ್ನಿ,, 1   "
ಹೊರಿದಿವೆ ಹೊಂದಿವೆ.1   .

Some of the words with embedded punctuation also look a little suspicious to me.  Not knowing the script or language I don't know how common these errors are, but I'd probably start with a very basic list of substitutions and add to it as I found more common errors.

Hopefully someone else can give you better advice which is based on more than bystander guesswork!

Tom

Sriranga(83yrsold)

Dec 8, 2015, 3:48:37 AM
to tesser...@googlegroups.com
Hi Tom,
Attached is a sample of the post-proc.txt used in FreeOCR, a feature that its creator, Ralph Richardson, incorporated at my special request more than 3 years ago. The attached screenshots speak for themselves. I have made the sample in English so that it is easy for you to follow.
You can test it in any language. FreeOCR is available as a free download.
You will notice that the post-processor text sample (apart from having no 0 or 1 option) offers a feature similar to <lang>.unicharambigs.
The advantage of the built-in unicharambigs is that misspellings are corrected automatically at the time of the final OCR output: the corrections are in place before the <lang>.traineddata is generated, so the resulting tessdata can be used on any image to correct the output text.
The disadvantage of the post-processor, being an external program, is that one has to update post-proc.txt every time, for each OCR run.
I am puzzled why unicharambigs does not work correctly as a built-in mechanism when the post-processor program works fine.
With regards,
sriranga(83yrs)


2post-processing for english in FreeOCR program.jpg
1post-processing for english in FreeOCR program.jpg
ABCD.png
3 final output ofpost-processing for english in FreeOCR program.jpg
post_proc.txt 2

Sriranga(83yrsold)

Dec 8, 2015, 4:12:20 AM
to tesser...@googlegroups.com
Another question is how to test the <lang>.unicharambigs in tesseract-ocr and add more entries to it. If it can be tested in CMD or a terminal, what command line should be used?

Tom Morris

Dec 8, 2015, 1:34:48 PM
to tesser...@googlegroups.com
FreeOCR is closed source and Windows only, so it's difficult for me to tell what it's doing (or even what version of Tesseract it includes).  However, the test case that you're using doesn't appear realistic.  Tesseract is optimized for recognizing words, not short random strings of characters, so rather than testing on "vv w" I think you'd get more representative results if you used something like "Novv is the time to go dovvn" and see if it turns the vv's into w's.  Having said that, vv ==> w isn't an entry in the standard eng.unicharambigs.  The only mandatory entries are for quotes, so you could try things like `' or '` to see if they get turned into ".

As far as I know, there's no way to specify a different unicharambigs file on the command line.  You need to replace it in the kan.traineddata file for it to be found.  The combine_tessdata utility is used for packing and unpacking the traineddata files.  For example:

    $ combine_tessdata -e kan.traineddata kan.unicharambigs
    $ combine_tessdata -o kan.traineddata kan.unicharambigs
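For repeated test runs, the repack step and the OCR run can be scripted together. This is a minimal Python sketch, not a tested recipe: the helper name `ambigs_test_commands` and the file names `test_page.tif` and `test_page` are hypothetical, and it assumes the repacked traineddata sits in (or is copied to) the tessdata directory that the `tesseract` binary actually reads. It only builds and prints the commands; pass each one to `subprocess.run(command, check=True)` to execute it.

```python
import shlex


def ambigs_test_commands(lang, image, output_base):
    """Build the commands to repack <lang>.unicharambigs and OCR one image.

    Only prepares argument lists; nothing is executed here.
    """
    traineddata = lang + ".traineddata"
    ambigs = lang + ".unicharambigs"
    return [
        # Overwrite the unicharambigs component inside the traineddata file.
        ["combine_tessdata", "-o", traineddata, ambigs],
        # OCR the test image; the text ends up in <output_base>.txt.
        ["tesseract", image, output_base, "-l", lang],
    ]


# Print the commands in a form that can be pasted into a terminal.
for command in ambigs_test_commands("kan", "test_page.tif", "test_page"):
    print(" ".join(shlex.quote(part) for part in command))
```

Running the OCR step once before and once after the repack, on the same image, gives two text files that can be compared to see what the new entries changed.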

One thing that I noticed when looking at the source is that there's an upper limit of 10 characters for the bad and replacement strings, which I'm not sure is documented anywhere.  This should be plenty for most applications, but it's something to keep in mind.

Good luck.  Let us know how you make out.

Tom




Sriranga(83yrsold)

Dec 9, 2015, 5:31:18 AM
to tesser...@googlegroups.com
Tom,
Thanks for the hints. I have just tested the eng.unicharambigs I created and found that it works; the attached files speak for themselves. I have also attached the output "unicharamtest.txt" for perusal. In it, however, I noticed that the last line, "luck good", was not changed to "good luck". Where did I make a mistake? Your suggested sentence,
"Novv is the time to go dovvn", was also corrected. Please note that I regenerated eng.traineddata on Ubuntu 15.10.
With regards, sriranga(83yrs)

unicharmtest.txt
eng.training_text
eng.unicharambigs
eng.Arial.exp0.tif
eng.traineddata

Sriranga(83yrsold)

Dec 9, 2015, 5:38:54 AM
to tesser...@googlegroups.com
This may be useful for your investigation.
tesstrain.log

Sriranga(83yrsold)

Dec 10, 2015, 7:30:07 AM
to tesser...@googlegroups.com
Hi,
I am awaiting your considered feedback on my post.
With regards, sriranga


Tom Morris

Dec 10, 2015, 12:16:26 PM
to tesser...@googlegroups.com
On Wed, Dec 9, 2015 at 5:30 AM, Sriranga(83yrsold) <withblessing....@gmail.com> wrote:
Tom,
Thanks for the hints. I have just tested the eng.unicharambigs I created and found that it works; the attached files speak for themselves. I have also attached the output "unicharamtest.txt" for perusal. In it, however, I noticed that the last line, "luck good", was not changed to "good luck". Where did I make a mistake?

This mechanism is really intended to fix a small number of characters, not reorder entire word strings.  The "good luck" case may be running into the maximum string size (10) limit, depending on whether or not the count includes the string terminator, but whatever the cause of the failure, it is not a very realistic use case.  I would focus on the actual texts that you're trying to correct.
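Entries that come close to the size limit mentioned above can be listed before training. This is a minimal Python sketch with loud assumptions: the name `overlong_entries` is hypothetical, the limit of 10 is taken from this thread rather than from documentation, and length is measured in Unicode code points, whereas Tesseract may count unicharset components instead. Lines that do not split into exactly three space-separated fields are skipped; note that an entry containing a space, such as "luck good", would not fit that three-field layout under a plain split on spaces.

```python
ASSUMED_LIMIT = 10  # value reported in this thread; the unit is an assumption


def overlong_entries(lines, limit=ASSUMED_LIMIT):
    """Return (line_number, text, length) for strings longer than the limit."""
    flagged = []
    for number, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split(" ")
        if len(fields) != 3:
            continue  # header, blank, or malformed line: not checked here
        for text in fields[:2]:  # the "error" and "correction" strings
            if len(text) > limit:
                flagged.append((number, text, len(text)))
    return flagged


# Small demonstration with one deliberately long "error" string.
example = ["v2", "m rn 0", "abcdefghijk x 0"]
for number, text, length in overlong_entries(example):
    print(number, text, length)
```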

Tom
 

Sriranga(83yrsold)

Dec 11, 2015, 3:00:51 AM
to tesser...@googlegroups.com
Tom,
Thanks for the response. I would like to know whether you have tested "eng.unicharambigs" at your end, and to have your considered experience and comments, if any. Based on your comments and suggestions, I am thinking of trying this for my language, Kannada, which is a complex Indian language.

