Problem with using two traineddata files in combination for a better result.

da...@maxcommunications.co.uk

Aug 8, 2018, 8:34:28 AM
to tesseract-ocr
I'm trying to use a combination of two traineddata dictionaries because one of them recognises specific numbers better than the other.

Here is an example of the relevant code lines:

                 // Convert the source image to a 300 DPI JPEG before OCR
                 $codeLine .= '<br>magick convert "'.$filePath.'" -quality 90 -density 300x300 -units PixelsPerInch "'.$output.'.jpg"';
                 // Run Tesseract on the JPEG with the fo and eng dictionaries combined
                 $codeLine .= '<br>tesseract "'.$output.'.jpg" "'.$output.'" -l fo+eng txt pdf';

Despite the fact that I put "fo" in front (this is the one that recognises the number 4 better), it still gives me an output text file that is identical to the output I get when I run the "eng" dictionary on its own.

For some reason, it not only prioritises eng but also seems to ignore the fo traineddata file completely.

The "fo" file definitely works, as I've tested it on its own.

I have attached an example image of the text I'd like to OCR and the two relevant traineddata files.

example.jpg
eng.traineddata
fo.traineddata
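
For reference, a minimal sketch of one way to compare the outputs directly from a shell (assuming both traineddata files are installed in the tessdata folder; the output base names are just placeholders):

     # OCR the image with each dictionary on its own, then combined in both orders
     tesseract example.jpg out_fo -l fo txt
     tesseract example.jpg out_eng -l eng txt
     tesseract example.jpg out_fo_eng -l fo+eng txt
     tesseract example.jpg out_eng_fo -l eng+fo txt

     # Compare the combined runs against each solo run
     # (diff on Unix, fc on Windows)
     diff out_fo.txt out_fo_eng.txt
     diff out_eng.txt out_fo_eng.txt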

Shree Devi Kumar

Aug 8, 2018, 2:07:02 PM
to tesser...@googlegroups.com
I think this could happen if your new traineddata is not trained to as high an accuracy level as the eng traineddata.

You can set up a debug log to verify this. See https://github.com/tesseract-ocr/tesseract/issues/1275#issuecomment-360367865 for details.
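
As a minimal sketch of the idea (the specific debug variables to raise are listed in that comment), debug output can be redirected to a file with the -c option, or via a config file under tessdata\configs, which is the route followed later in this thread:

     # Send Tesseract's debug messages to tesseract.log in the current directory;
     # add the debug variables from the linked comment as further -c options
     tesseract example.jpg output -l fo+eng -c debug_file=tesseract.log txt pdf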




da...@maxcommunications.co.uk

Aug 9, 2018, 6:18:38 AM
to tesseract-ocr
Hello Shree, thank you for your prompt reply.

I have now changed the logfile as instructed. Where can I find the output tesseract.log file? Will it be produced in the same location as the logfile, i.e. in C:\Program Files (x86)\Tesseract-OCR\tessdata\configs? I'm guessing the tesseract.log file will be produced once I've used logfile in the commands.

Kind Regards,

Damon

Shree Devi Kumar

Aug 9, 2018, 6:29:11 AM
to tesser...@googlegroups.com
The tesseract.log file should be produced in the directory from which you are running the command, usually where your OCR output is created.



Damon Kwong

Aug 9, 2018, 6:55:33 AM
to tesser...@googlegroups.com
Ah, I see. I will report back once I have the output file if I can't figure out the reason. You've been very helpful, thanks again :)


da...@maxcommunications.co.uk

Aug 10, 2018, 6:31:28 AM
to tesseract-ocr
Hi Shree, thanks for your patience and help!

I have managed to produce the tesseract.log file with your help. Now I'm trying to understand it a bit more. Here is a quick snippet of the output I want to show you:
Rejecter: 5 [35 ]0 3 [33 ]0 . [2e ]p  (word=n, case=y, unambig=y, multiple=y)
Best choice: accepted=0, adaptable=0, done=0 : Lang result : 53. : R=54.2836, C=-5.08463, F=1.5, Perm=2, xht=[0,256], ambig=0
pos NORM NORM NORM
str 5 3 .
state: 1 1
C -5.085 -3.497 -1.978
1 new words worse than 1 old words: r: 54.2836 v 1.81739 c: -5.08463 v -3.90478 valid dict: 0 v 0
Already done word with lang eng at:Bounding box=(499,2)->(514,1361)
Processing word with lang eng at:Bounding box=(672,1253)->(762,1288)
Trying word using lang eng, oem 1
Best choice: accepted=1, adaptable=0, done=1 : Lang result : Date : R=2.05422, C=-0.662761, F=1, Perm=8, xht=[0,3.40282e+038], ambig=0
pos NORM NORM NORM NORM
str D a t e
state: 1 1 1
C -0.085 -0.095 -0.088 -0.085
1 new words better than 0 old words: r: 2.05422 v 0 c: -0.662761 v 0 valid dict: 1 v 0
Processing word with lang eng at:Bounding box=(521,1084)->(842,1156)
Trying word using lang eng, oem 1
Best choice: accepted=1, adaptable=0, done=1 : Lang result : May : R=1.64554, C=-0.733805, F=1, Perm=8, xht=[0,3.40282e+038], ambig=0
pos NORM NORM NORM
str M a y
state: 1 1
C -0.092 -0.085 -0.105
Best choice: accepted=0, adaptable=0, done=1 : Lang result : 182.2. : R=4.51301, C=-4.37332, F=1, Perm=6, xht=[0,3.40282e+038], ambig=0
pos NORM NORM NORM NORM NORM NORM
str 1 8 2 . 2 .
state: 1 1 1 1 1
C -0.116 -0.204 -0.176 -0.612 -0.210 -0.625
1 new words better than 0 old words: r: 1.64554 v 0 c: -0.733805 v 0 valid dict: 1 v 0
1 new words better than 0 old words: r: 4.51301 v 0 c: -4.37332 v 0 valid dict: 0 v 0
Trying word using lang fo, oem 0

As you can see on the very last line, it says "Trying word using lang fo". I can see this line repeated about 5 times, so it seems that sometimes it does use the fo dictionary. However, I wonder how this works. How does it know when to use fo after looking at eng? Does it only look at fo when it has a box coordinate for a letter/word but is unable to find letters to assign to it, and so falls back to the next dictionary? If so, why does entering "fo+eng" in the command instead of "eng+fo" make no difference to which dictionary is given priority in the search?

da...@maxcommunications.co.uk

Aug 10, 2018, 8:04:22 AM
to tesseract-ocr
I just realised some of the output underneath "Trying word using lang fo, oem 0" might be useful information! Here it is:
Running NoDangerousAmbig() for 5 [35 ]0 3 [33 ]0 . [2e ]p 
Looking for replaceable ngrams starting with 5 [35 ]0:
Looking for replaceable ngrams starting with 3 [33 ]0:
Looking for replaceable ngrams starting with . [2e ]p:
Looking for ambiguous ngrams starting with 5 [35 ]0:
Looking for ambiguous ngrams starting with 3 [33 ]0:
Looking for ambiguous ngrams starting with . [2e ]p:
53. ViterbiStateEntry(NEW) with ratings_sum=43.4269 length=3 cost=54.283619 top_choice_flags=0x19 XH_GOOD
New Best Word Choice : 53. : R=54.2836, C=-5.08463, F=1.5, Perm=2, xht=[0,256], ambig=0
pos NORM NORM NORM
str 5 3 .
state: 1 1
C -5.085 -3.497 -1.978

Stopper:  53. (word=n, case=y, xht_ok=NORMAL=[0,256])

Running NoDangerousAmbig() for 5 [35 ]0 3 [33 ]0 n [6e ]a 
Looking for replaceable ngrams starting with 5 [35 ]0:
Looking for replaceable ngrams starting with 3 [33 ]0:
Looking for replaceable ngrams starting with n [6e ]a:
candidate ngram: n ( 20 )
current ngram from spec: n p C ( 20 6 24 )
comparison result: -1
Looking for ambiguous ngrams starting with 5 [35 ]0:
Looking for ambiguous ngrams starting with 3 [33 ]0:
Looking for ambiguous ngrams starting with n [6e ]a:
candidate ngram: n ( 20 )
current ngram from spec: n ( 20 )
comparison result: 0
fixpt+=(2 3 0 1 ri)
found ambiguity: ri ( 85 )
candidate ngram: n ( 20 )
current ngram from spec: n ( 20 )
comparison result: 0
fixpt+=(2 3 0 1 tr)
found ambiguity: tr ( 114 )
candidate ngram: n ( 20 )
current ngram from spec: n ( 20 )
comparison result: 0
fixpt+=(2 3 0 1 ij)
found ambiguity: ij ( 116 )
candidate ngram: n ( 20 )
current ngram from spec: n i ( 20 16 )
comparison result: -1

Resulting ambig_blob_choices:
r0.00 c0.00 x[0,1]: 3 5 [35 ]0

r0.00 c0.00 x[0,1]: 27 3 [33 ]0

r0.00 c0.00 x[0,1]: 20 n [6e ]a
r-1.00 c0.00 x[0,1]: 85 ri [72 69 ]
r-1.00 c0.00 x[0,1]: 114 tr [74 72 ]
r-1.00 c0.00 x[0,1]: 116 ij [69 6a ]

53n ViterbiStateEntry(NEW) with ratings_sum=43.4676 length=3 cost=67.374825 top_choice_flags=0x2 inconsistent=(punc 0 case 0 chartype 1 script 0 font 0) XH_GOOD
New Secondary Word Choice : 53n : R=67.3748, C=-5.08463, F=1.5, Perm=2, xht=[0,256], ambig=0
pos NORM NORM NORM
str 5 3 n
state: 1 1
C -5.085 -3.497 -2.159

Running NoDangerousAmbig() for 5 [35 ]0 3 [33 ]0 H [48 ]A 
Looking for replaceable ngrams starting with 5 [35 ]0:
Looking for replaceable ngrams starting with 3 [33 ]0:
Looking for replaceable ngrams starting with H [48 ]A:
candidate ngram: H ( 51 )
current ngram from spec: H p p ( 51 6 6 )
comparison result: -1
Looking for ambiguous ngrams starting with 5 [35 ]0:
Looking for ambiguous ngrams starting with 3 [33 ]0:
Looking for ambiguous ngrams starting with H [48 ]A:
53H ViterbiStateEntry(NEW) with ratings_sum=43.4944 length=3 cost=67.416374 top_choice_flags=0x4 inconsistent=(punc 0 case 0 chartype 1 script 0 font 0) XH_GOOD
New Secondary Word Choice : 53H : R=67.4164, C=-5.08463, F=1.5, Perm=2, xht=[0,256], ambig=0
pos NORM NORM NORM
str 5 3 H
state: 1 1
C -5.085 -3.497 -2.279

Filtering against best choice : 53. : R=54.2836, C=-5.08463, F=1.5, Perm=2, xht=[0,256], ambig=0
pos NORM NORM NORM
str 5 3 .
state: 1 1
C -5.085 -3.497 -1.978

Best Raw Choice : 53. : R=43.4269, C=-5.08463, F=1, Perm=2, xht=[0,256], ambig=0
pos NORM NORM NORM
str 5 3 .
state: 1 1
C -5.085 -3.497 -1.978

Cooked Choice #0 : 53. : R=54.2836, C=-5.08463, F=1.5, Perm=2, xht=[0,256], ambig=0
pos NORM NORM NORM
str 5 3 .
state: 1 1
C -5.085 -3.497 -1.978

Cooked Choice #1 : 53n : R=67.3748, C=-5.08463, F=1.5, Perm=2, xht=[0,256], ambig=0
pos NORM NORM NORM
str 5 3 n
state: 1 1
C -5.085 -3.497 -2.159

Cooked Choice #2 : 53H : R=67.4164, C=-5.08463, F=1.5, Perm=2, xht=[0,256], ambig=0
pos NORM NORM NORM
str 5 3 H
state: 1 1
C -5.085 -3.497 -2.279

Rejecter: 5 [35 ]0 3 [33 ]0 . [2e ]p  (word=n, case=y, unambig=y, multiple=y)
Best choice: accepted=0, adaptable=0, done=0 : Lang result : 53. : R=54.2836, C=-5.08463, F=1.5, Perm=2, xht=[0,256], ambig=0
pos NORM NORM NORM
str 5 3 .
state: 1 1
C -5.085 -3.497 -1.978

da...@maxcommunications.co.uk

Aug 10, 2018, 11:09:37 AM
to tesseract-ocr
Hi Shree, just a quick update.

I've now looked into the tesseract.log output further and understand how it works: it goes through different choices and eventually settles on a "best choice". However, the output doesn't explain how it then decides which result has overriding priority. Even after it scours the "fo" dictionary and picks a best choice from it, it immediately moves on to the eng dictionary and seems to use the eng output because (I'm guessing) it regards that as more accurate. This suggests your theory that our custom "fo" dictionary is not trained to a high enough accuracy level is correct. Is there any way I can train either eng or fo to improve its accuracy, or to override the other dictionary on specific characters it gets wrong? For example, in our case, the eng.traineddata dictionary sometimes gets 3s and 5s mixed up and has a lot of trouble with 4s.

Your help on this would be greatly appreciated!

Kind Regards,

Damon 



Shree Devi Kumar

Aug 10, 2018, 1:22:50 PM
to tesser...@googlegroups.com
I do not know about the internal algorithms used by tesseract.

If you are having accuracy issues with certain letters and digits, I suggest that you fine-tune for impact using your images or a similar font.

Please see the wiki page on training Tesseract 4.0 for the command; look for fine-tuning for a new font/impact. Use eng.traineddata as the base, 50-100 lines of training text, and 300-400 iterations at most.
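
For completeness, a rough sketch of the fine-tuning commands from that wiki page (assuming the line images and .lstmf files have already been generated and listed in eng.training_files.txt; the paths and output names here are only illustrative):

     # Extract the LSTM model from the base eng.traineddata
     combine_tessdata -e tessdata/eng.traineddata eng.lstm

     # Fine-tune from the extracted model for a few hundred iterations
     lstmtraining --model_output finetune/fo \
       --continue_from eng.lstm \
       --traineddata tessdata/eng.traineddata \
       --train_listfile eng.training_files.txt \
       --max_iterations 400

     # Pack the best checkpoint back into a usable traineddata file
     lstmtraining --stop_training \
       --continue_from finetune/fo_checkpoint \
       --traineddata tessdata/eng.traineddata \
       --model_output tessdata/fo_finetuned.traineddata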
