traineddata file size too small: a sign of an error?


Andres

Jun 14, 2017, 2:09:45 PM
to tesser...@googlegroups.com
Dear all,

I've been training Tesseract with a multipage TIFF file with 5 pages and approx. 12000 boxes.

I have now increased the number of samples in the TIFF file: it has 12 pages and 29241 boxes.

My concern is that my previous traineddata file was 321817 bytes and the new one is 318022 bytes. I don't know whether it should be bigger, since I have no idea about the file format, but I downloaded a version of eng.traineddata from the tesseract repository and its size is 21876572 bytes. Could it be that it is computing the results of just the first page? The log shows that, at least at the beginning of the process, it is processing all the pages.

I am using Tesseract 3.02 on Windows.
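
One way to check from the log itself whether every page contributed is to add up the per-page summary lines. The following is only a sketch in Python, which is not part of the training workflow in this thread. The `LOG` text is the set of per-page summary lines copied from the log below (the individual `FAIL!` lines are left out), and `summarize` is a hypothetical helper name.

```python
import re

# Per-page summary lines copied from the box.train log in this thread.
# The individual "FAIL!" lines are omitted; only the totals are kept.
LOG = """\
Page 1 of 12
   Boxes read from boxfile:    1671
Page 2 of 12
   Boxes read from boxfile:    2003
Page 3 of 12
   Boxes read from boxfile:    2128
   Boxes failed resegmentation:       2
Page 4 of 12
   Boxes read from boxfile:    2257
Page 5 of 12
   Boxes read from boxfile:    2381
Page 6 of 12
   Boxes read from boxfile:    2460
   Boxes failed resegmentation:       1
Page 7 of 12
   Boxes read from boxfile:    2568
   Boxes failed resegmentation:       1
Page 8 of 12
   Boxes read from boxfile:    2680
Page 9 of 12
   Boxes read from boxfile:    2818
   Boxes failed resegmentation:       1
Page 10 of 12
   Boxes read from boxfile:    3000
   Boxes failed resegmentation:       2
Page 11 of 12
   Boxes read from boxfile:    3370
   Boxes failed resegmentation:       4
Page 12 of 12
   Boxes read from boxfile:    1905
   Boxes failed resegmentation:      76
"""


def summarize(log_text):
    """Return (pages, boxes_read, boxes_failed) totals for a box.train log."""
    pages = len(re.findall(r"^Page \d+ of \d+", log_text, re.M))
    read = sum(int(n) for n in
               re.findall(r"Boxes read from boxfile:\s+(\d+)", log_text))
    failed = sum(int(n) for n in
                 re.findall(r"Boxes failed resegmentation:\s+(\d+)", log_text))
    return pages, read, failed


pages, read, failed = summarize(LOG)
print(f"{pages} pages, {read} boxes read, {failed} failed, {read - failed} usable")
```

For this log the totals are 12 pages and 29241 boxes read, which equals the box count stated above, so the box.train step did see all pages. 87 boxes failed resegmentation, 76 of them on page 12.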

My log is pasted below, followed by the batch file that I use for training.

Log:
A:\training>tesseract.exe patentesar.normal.exp0.tif patentesar.normal.exp0 nobatch box.train.stderr
Tesseract Open Source OCR Engine v3.02 with Leptonica
Page 1 of 12
row xheight=88.6667, but median xheight = 59.6
row xheight=81.8333, but median xheight = 59.6
row xheight=75, but median xheight = 59.6
row xheight=71.1875, but median xheight = 59.6
row xheight=71.1875, but median xheight = 59.6
row xheight=71.1875, but median xheight = 59.6
row xheight=68.5333, but median xheight = 59.6
row xheight=67.3333, but median xheight = 59.6
APPLY_BOXES:
   Boxes read from boxfile:    1671
   Found 1671 good blobs.
TRAINING ... Font name = normal
Generated training data for 52 words
Page 2 of 12
APPLY_BOXES:
   Boxes read from boxfile:    2003
   Found 2003 good blobs.
Generated training data for 58 words
Page 3 of 12
FAIL!
APPLY_BOXES: boxfile line 358/0 ((383,4901),(428,4980)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 529/D ((146,4401),(187,4480)): FAILURE! Couldn't find a matching blob
APPLY_BOXES:
   Boxes read from boxfile:    2128
   Boxes failed resegmentation:       2
   Found 2126 good blobs.
Generated training data for 60 words
Page 4 of 12
APPLY_BOXES:
   Boxes read from boxfile:    2257
   Found 2257 good blobs.
Generated training data for 62 words
Page 5 of 12
APPLY_BOXES:
   Boxes read from boxfile:    2381
   Found 2381 good blobs.
Generated training data for 64 words
Page 6 of 12
FAIL!
APPLY_BOXES: boxfile line 2070/D ((2141,967),(2182,1037)): FAILURE! Couldn't find a matching blob
APPLY_BOXES:
   Boxes read from boxfile:    2460
   Boxes failed resegmentation:       1
   Found 2459 good blobs.
Generated training data for 65 words
Page 7 of 12
FAIL!
APPLY_BOXES: boxfile line 2082/B ((867,1084),(910,1151)): FAILURE! Couldn't find a matching blob
APPLY_BOXES:
   Boxes read from boxfile:    2568
   Boxes failed resegmentation:       1
   Found 2567 good blobs.
Generated training data for 67 words
Page 8 of 12
APPLY_BOXES:
   Boxes read from boxfile:    2680
   Found 2680 good blobs.
Generated training data for 68 words
Page 9 of 12
FAIL!
APPLY_BOXES: boxfile line 2391/D ((1184,910),(1220,973)): FAILURE! Couldn't find a matching blob
APPLY_BOXES:
   Boxes read from boxfile:    2818
   Boxes failed resegmentation:       1
   Found 2817 good blobs.
Generated training data for 70 words
Page 10 of 12
FAIL!
APPLY_BOXES: boxfile line 1248/0 ((1468,3440),(1502,3501)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 2211/0 ((342,1491),(382,1550)): FAILURE! Couldn't find a matching blob
APPLY_BOXES:
   Boxes read from boxfile:    3000
   Boxes failed resegmentation:       2
   Found 2998 good blobs.
Generated training data for 73 words
Page 11 of 12
FAIL!
APPLY_BOXES: boxfile line 1280/6 ((2054,3645),(2087,3702)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 2750/0 ((496,1051),(528,1105)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 3098/D ((2229,530),(2254,583)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 3347/Q ((1167,90),(1197,142)): FAILURE! Couldn't find a matching blob
APPLY_BOXES:
   Boxes read from boxfile:    3370
   Boxes failed resegmentation:       4
   Found 3366 good blobs.
Generated training data for 77 words
Page 12 of 12
row xheight=28.6667, but median xheight = 33.5161
row xheight=28.0889, but median xheight = 33.5161
row xheight=27.1, but median xheight = 33.5161
row xheight=29, but median xheight = 33.5161
row xheight=29, but median xheight = 33.5161
row xheight=29, but median xheight = 33.5161
FAIL!
APPLY_BOXES: boxfile line 0/P ((20,5928),(52,5980)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 1/7 ((73,5928),(89,5980)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 2/4 ((110,5928),(141,5980)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 3/1 ((162,5928),(189,5980)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 44/M ((20,5855),(48,5907)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 45/M ((69,5855),(96,5907)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 46/B ((117,5855),(148,5907)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 47/O ((169,5855),(198,5907)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 90/D ((20,5783),(50,5834)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 91/P ((71,5783),(102,5834)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 92/O ((123,5783),(148,5834)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 93/N ((169,5783),(202,5834)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 136/6 ((20,5711),(46,5762)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 137/P ((67,5711),(103,5762)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 138/X ((124,5711),(146,5762)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 139/M ((167,5711),(190,5762)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 183/M ((20,5639),(51,5690)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 184/1 ((72,5639),(92,5690)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 185/G ((113,5639),(144,5690)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 186/6 ((165,5639),(189,5690)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 229/1 ((20,5567),(44,5618)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 230/T ((65,5567),(89,5618)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 231/N ((110,5567),(141,5618)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 232/O ((162,5567),(196,5618)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 276/T ((20,5496),(44,5546)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 277/F ((65,5496),(91,5546)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 278/G ((112,5496),(140,5546)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 279/5 ((161,5496),(191,5546)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 322/8 ((20,5425),(45,5475)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 323/W ((66,5425),(94,5475)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 324/R ((115,5425),(145,5475)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 325/G ((166,5425),(192,5475)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 370/W ((20,5354),(52,5404)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 371/0 ((73,5354),(102,5404)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 372/G ((123,5354),(155,5404)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 373/H ((176,5354),(201,5404)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 416/2 ((20,5283),(43,5333)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 417/I ((64,5283),(89,5333)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 418/1 ((110,5283),(137,5333)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 419/D ((158,5283),(186,5333)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 463/I ((20,5212),(45,5262)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 464/Q ((66,5212),(92,5262)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 465/K ((113,5212),(144,5262)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 466/E ((165,5212),(186,5262)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 511/G ((20,5142),(48,5191)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 512/Q ((69,5142),(97,5191)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 513/T ((118,5142),(140,5191)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 514/D ((161,5142),(189,5191)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 517/D ((305,5142),(328,5191)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 558/M ((20,5072),(45,5121)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 559/E ((66,5072),(95,5121)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 560/E ((116,5072),(140,5121)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 561/H ((161,5072),(191,5121)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 606/5 ((20,5002),(51,5051)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 607/I ((72,5002),(102,5051)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 608/M ((123,5002),(149,5051)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 609/I ((170,5002),(192,5051)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 653/0 ((20,4932),(50,4981)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 654/0 ((71,4932),(102,4981)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 655/O ((123,4932),(151,4981)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 656/8 ((172,4932),(199,4981)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 700/0 ((20,4862),(49,4911)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 701/W ((70,4862),(93,4911)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 702/0 ((114,4862),(144,4911)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 703/G ((165,4862),(193,4911)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 747/M ((20,4793),(51,4841)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 748/T ((72,4793),(94,4841)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 749/0 ((115,4793),(150,4841)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 750/R ((171,4793),(198,4841)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 795/C ((20,4724),(46,4772)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 796/7 ((67,4724),(96,4772)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 797/1 ((117,4724),(147,4772)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 843/H ((20,4655),(47,4703)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 844/8 ((68,4655),(95,4703)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 1903/0 ((1824,3398),(1823,3397)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 1904/0 ((1844,3398),(1843,3397)): FAILURE! Couldn't find a matching blob
APPLY_BOXES:
   Boxes read from boxfile:    1905
   Boxes failed resegmentation:      76
   Found 1829 good blobs.
Generated training data for 48 words

A:\training>unicharset_extractor patentesar.normal.exp0.box
Extracting unicharset from patentesar.normal.exp0.box
Wrote unicharset file ./unicharset.
Press any key to continue . . .

A:\training>mftraining -F font_properties -U unicharset patentesar.normal.exp0.tr
Read shape table shapetable of 36 shapes
Reading patentesar.normal.exp0.tr ...
Warning: no protos/configs for g in CreateIntTemplates()
Done!

A:\training>mftraining -F font_properties -U unicharset -O patentesar.normal.exp0.unicharset patentesar.normal.exp0.tr
Read shape table shapetable of 36 shapes
Reading patentesar.normal.exp0.tr ...
Warning: no protos/configs for g in CreateIntTemplates()
Done!
Press any key to continue . . .

A:\training>cntraining patentesar.normal.exp0.tr
Reading patentesar.normal.exp0.tr ...
Clustering ...

Writing normproto ...
Press any key to continue . . .

A:\training>wordlist2dawg frequent_words_list patentesar.freq-dawg unicharset
Loading unicharset from 'unicharset'
Reading word list from 'frequent_words_list'
Reducing Trie to SquishedDawg
Writing squished DAWG to 'patentesar.freq-dawg'
Press any key to continue . . .

A:\training>wordlist2dawg words_list patentesar.word-dawg unicharset
Loading unicharset from 'unicharset'
Reading word list from 'words_list'
Reducing Trie to SquishedDawg
Writing squished DAWG to 'patentesar.word-dawg'
Press any key to continue . . .

A:\training>copy /Y normproto patentesar.normal.exp0.normproto
        1 file(s) copied.

A:\training>copy /Y inttemp patentesar.normal.exp0.inttemp
        1 file(s) copied.

A:\training>copy /Y pffmtable patentesar.normal.exp0.pffmtable
        1 file(s) copied.

A:\training>copy /Y Microfeat patentesar.normal.exp0.Microfeat
The system cannot find the file specified.

A:\training>copy /Y shapetable patentesar.normal.exp0.shapetable
        1 file(s) copied.

A:\training>copy /Y unicharset patentesar.normal.exp0
        1 file(s) copied.

A:\training>copy /Y patentesar.normal.exp0.unicharset patentesar.normal.exp0
        1 file(s) copied.

A:\training>move /Y patentesar.normal.exp0.normproto tessdata
        1 file(s) moved.

A:\training>move /Y patentesar.normal.exp0.inttemp tessdata
        1 file(s) moved.

A:\training>move /Y patentesar.normal.exp0.pffmtable tessdata
        1 file(s) moved.

A:\training>move /Y patentesar.normal.exp0.Microfeat tessdata
The system cannot find the file specified.

A:\training>move /Y patentesar.normal.exp0.shapetable tessdata
        1 file(s) moved.

A:\training>move /Y unicharset tessdata
        1 file(s) moved.

A:\training>move /Y patentesar.normal.exp0.unicharset tessdata
        1 file(s) moved.
Press any key to continue . . .

A:\training>combine_tessdata tessdata/patentesar.normal.exp0.
Combining tessdata files
TessdataManager combined tesseract data files.
Offset for type 0 is -1
Offset for type 1 is 140
Offset for type 2 is -1
Offset for type 3 is 2559
Offset for type 4 is 309717
Offset for type 5 is 309988
Offset for type 6 is -1
Offset for type 7 is -1
Offset for type 8 is -1
Offset for type 9 is -1
Offset for type 10 is -1
Offset for type 11 is -1
Offset for type 12 is -1
Offset for type 13 is 317370
Offset for type 14 is -1
Offset for type 15 is -1
Offset for type 16 is -1
Press any key to continue . . .
Batch file:

@rem #############################

@call set_environment.cmd
@SET PATH="%TESSDATA_PREFIX%";%PATH%

tesseract.exe patentesar.normal.exp0.tif patentesar.normal.exp0 nobatch box.train.stderr
@pause

unicharset_extractor patentesar.normal.exp0.box
@pause

mftraining -F font_properties -U unicharset patentesar.normal.exp0.tr
mftraining -F font_properties -U unicharset -O patentesar.normal.exp0.unicharset patentesar.normal.exp0.tr
@pause

cntraining patentesar.normal.exp0.tr
@pause

wordlist2dawg frequent_words_list patentesar.freq-dawg unicharset
@pause

wordlist2dawg words_list patentesar.word-dawg unicharset
@pause

copy /Y normproto patentesar.normal.exp0.normproto 
copy /Y inttemp patentesar.normal.exp0.inttemp 
copy /Y pffmtable patentesar.normal.exp0.pffmtable 
copy /Y Microfeat patentesar.normal.exp0.Microfeat
copy /Y shapetable patentesar.normal.exp0.shapetable

copy /Y unicharset patentesar.normal.exp0
copy /Y patentesar.normal.exp0.unicharset patentesar.normal.exp0

move /Y patentesar.normal.exp0.normproto tessdata
move /Y patentesar.normal.exp0.inttemp tessdata
move /Y patentesar.normal.exp0.pffmtable tessdata
move /Y patentesar.normal.exp0.Microfeat tessdata
move /Y patentesar.normal.exp0.shapetable tessdata

move /Y unicharset tessdata
move /Y patentesar.normal.exp0.unicharset tessdata



@pause
combine_tessdata tessdata/patentesar.normal.exp0.

@pause
copy tessdata\patentesar.normal.exp0.traineddata "%TESSDATA_PREFIX%\tessdata"

@pause
tesseract patentesar.normal.exp0.tif output -l patentesar.normal.exp0

type output.txt


Best regards and thank you,

Andres



ShreeDevi Kumar

Jun 14, 2017, 11:31:27 PM
to tesser...@googlegroups.com
Traineddata size depends on many things, not just the number of images.

If your unicharset and number of fonts haven't changed, then the size may be similar.

The traineddata file also contains the wordlists, so if you are using a smaller wordlist than the one in the original eng.traineddata, the size may be smaller.
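
To see which components actually make up the 318022 bytes, the `Offset for type N is X` lines printed by combine_tessdata earlier in the thread can be turned into per-component sizes. The sketch below is in Python rather than batch. It assumes that components are stored back to back in offset order, so each one ends where the next begins. The type names are an assumption based on a reading of the `TessdataType` enum in Tesseract 3.02 and should be checked against the source. `component_sizes` is a hypothetical helper name.

```python
import re

# Offsets copied from the combine_tessdata output earlier in this thread.
# An offset of -1 means no file of that type was combined.
COMBINE_LOG = """\
Offset for type 0 is -1
Offset for type 1 is 140
Offset for type 2 is -1
Offset for type 3 is 2559
Offset for type 4 is 309717
Offset for type 5 is 309988
Offset for type 6 is -1
Offset for type 7 is -1
Offset for type 8 is -1
Offset for type 9 is -1
Offset for type 10 is -1
Offset for type 11 is -1
Offset for type 12 is -1
Offset for type 13 is 317370
Offset for type 14 is -1
Offset for type 15 is -1
Offset for type 16 is -1
"""

TRAINEDDATA_SIZE = 318022  # bytes, the new file size reported in the first post

# ASSUMPTION: names follow the TessdataType enum order in Tesseract 3.02
# (tessdatamanager.h); verify against your own source tree.
ASSUMED_NAMES = {1: "unicharset", 3: "inttemp", 4: "pffmtable",
                 5: "normproto", 13: "shapetable"}


def component_sizes(log_text, file_size):
    """Return {type: size_in_bytes} for every component with a real offset."""
    pairs = re.findall(r"Offset for type (\d+) is (-?\d+)", log_text)
    present = sorted((int(off), int(typ)) for typ, off in pairs if int(off) >= 0)
    # ASSUMPTION: each component runs up to the next present offset,
    # and the last one runs to the end of the file.
    ends = [off for off, _ in present[1:]] + [file_size]
    return {typ: end - off for (off, typ), end in zip(present, ends)}


for typ, size in component_sizes(COMBINE_LOG, TRAINEDDATA_SIZE).items():
    print(f"type {typ:2d} ({ASSUMED_NAMES.get(typ, '?')}): {size} bytes")
```

Under those assumptions only five components are present, matching the five files the batch file moves into tessdata under the `patentesar.normal.exp0.` prefix, and type 3 alone accounts for 307158 bytes. The batch file writes `patentesar.freq-dawg` and `patentesar.word-dawg` but never copies them into tessdata under that prefix, so it may be worth checking whether the word lists are in this traineddata at all.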

You can also try the latest version from https://github.com/UB-Mannheim/tesseract/wiki

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscribe@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/CALk3cjShXCkVdOz87_Oyscxy-qTVrZuwc1cUm%3DBy1MKH1hQfQg%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

Andres

Jun 15, 2017, 8:46:52 AM
to tesseract-ocr
Thank you very much for your answer, Shree.

One strange thing is that it prints lines like "Generated training data for 67 words", but my words_list file has just 36 words (one for each alphabetic symbol and one for each numeric symbol). Could it be because I have the same words repeated in frequent_words_list, so there are 72 words in total?
