Training Devanagari: »Encoding of string failed!«


ba...@ub.uni-heidelberg.de

Apr 18, 2019, 8:39:21 AM
to tesseract-ocr
Dear reader,
I want to improve Devanagari recognition.
I have images and manually corrected text with line coordinates.
From those, I've generated .box files;
see the attached file, which produces the error above.

Complete error message from lstmtraining:
»Encoding of string failed! Failure bytes: 9 32 37 38 ffffffe0 ffffffa4 ffffff98 ffffffe0 ffffffa5 ffffff8d ffffffe0 ffffffa4 ffffffa8 ffffffe0 ffffffa5 ffffff87 ffffffe0 ffffffa4 ffffffb6 ffffffe0 fffff...
Can't encode transcription: 'श्रीगणेशायनमः ।। अलिकुलमण्डितगण्डं प्रत्यूहतिमिरमार्त्तण्डं सिन्दूरारुणशुण्डं देवंवेतण्डमुण्डमवलम्बे १ वि     278घ्नेश्वरायवरदायसुरप्रियाय लम्बोदरा...

...
...«

The .lstmf files are generated using »tesseract $tiff $box --tessdata-dir ~/tessdata_best -l script/Devanagari lstm.train«

Training is run with
»combine_tessdata -u ~/tessdata_best/script/Devanagari.traineddata /tmp/Deva.trta
mkdir /tmp/deva
ls -1 *.lstmf >/tmp/list.txt
lstmtraining --model_output /tmp/deva --continue_from /tmp/Deva.trta.lstm  --traineddata ~/tessdata_best/script/Devanagari.traineddata --train_listfile /tmp/list.txt«

I have double-checked that the .box files contain only characters from Devanagari.traineddata.lstm-unicharset.
No tabs, no control characters.

But the "9" at the start of the failure bytes above looks like a tab (0x09)...?
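The failure bytes can be decoded to see what the trainer was given. This is a minimal sketch, assuming the dump is a list of hex values in which bytes above 0x7f appear sign-extended (`ffffffe0` for `e0`); the function name is made up for illustration. For the bytes quoted above it yields a tab, the digits `278` and then the start of the next text line, which matches the `278` visible in the transcription and suggests that part of a coordinate entry was read as text.

```python
def decode_failure_bytes(dump: str) -> str:
    """Decode a 'Failure bytes:' dump such as '9 32 37 38 ffffffe0 ...'."""
    # Each token is a hex value; keep only the low 8 bits so that
    # sign-extended values like ffffffe0 become the byte 0xe0.
    raw = bytes(int(token, 16) & 0xFF for token in dump.split())
    # The log line is truncated, so the last UTF-8 sequence may be
    # incomplete; errors="replace" avoids an exception in that case.
    return raw.decode("utf-8", errors="replace")


# Complete tokens copied from the error message (the truncated tail is left out).
dump = (
    "9 32 37 38 ffffffe0 ffffffa4 ffffff98 ffffffe0 ffffffa5 ffffff8d "
    "ffffffe0 ffffffa4 ffffffa8 ffffffe0 ffffffa5 ffffff87 "
    "ffffffe0 ffffffa4 ffffffb6"
)
print(repr(decode_failure_bytes(dump)))
```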

Any ideas?

Kind regards, Jochen

PS: latest tesseract 4.1.0-rc1; tessdata_best: commit 95593f0b017280...
durggapatha1890_-_001.box

Shree Devi Kumar

Apr 18, 2019, 9:06:35 AM
to tesser...@googlegroups.com
> I have images and manually corrected Text with line coordinates. From those, I've generated .box files;

What method did you use for generating the .box files?

Please provide the image for the box file for test.

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/3f411945-e3d5-4b70-bce6-b33e2aab7bfc%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


--

____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

Jochen Barth

Apr 18, 2019, 9:12:14 AM
to tesser...@googlegroups.com
I created the .box files from our ALTO files, which were generated by Transkribus (manually corrected OCR).

For the image, see the attached file (converted from TIFF, which was too large; same pixel dimensions).


Kind regards,
Jochen


On 18.04.19 at 15:05, Shree Devi Kumar wrote:



-- 
Jochen Barth * Universitätsbibliothek Heidelberg, IT * Telefon 06221 54-2580
durggapatha1890_-_001.jpg

Shree Devi Kumar

Apr 18, 2019, 11:41:52 AM
to tesser...@googlegroups.com
The following format (as in your box file) will not work for Devanagari.

श 278 1253 2860 1413 0
् 278 1253 2860 1413 0
र 278 1253 2860 1413 0
ी 278 1253 2860 1413 0
ग 278 1253 2860 1413 0
ण 278 1253 2860 1413 0
े 278 1253 2860 1413 0
श 278 1253 2860 1413 0
ा 278 1253 2860 1413 0
य 278 1253 2860 1413 0
न 278 1253 2860 1413 0
म 278 1253 2860 1413 0
ः 278 1253 2860 1413 0
  278 1253 2860 1413 0

See the files in the attached zip file, which show the box/tiff pairs created by text2image from the same text with the Murty Sanskrit font.

श्री 112 4669 160 4708 0
ग 156 4669 189 4701 0
णे 185 4668 225 4708 0
शा 221 4667 272 4700 0
य 268 4668 301 4700 0
न 297 4668 329 4700 0
मः 326 4668 370 4700 0
  370 4667 402 4701 0
। 402 4667 407 4701 0
। 428 4667 433 4700 0
  433 4667 451 4700 0

The above format works for training.
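The difference between the two listings is that the first has one entry per Unicode code point, while the second keeps each consonant together with its virama-joined consonants and dependent signs. A rough sketch of that grouping, assuming a simple rule (combining marks attach to the previous entry, and a letter directly after a virama stays in the same entry); text2image's real segmentation comes from font shaping and may differ, and `split_clusters` is a made-up helper name:

```python
import unicodedata

VIRAMA = "\u094d"  # DEVANAGARI SIGN VIRAMA (halant)


def split_clusters(text: str) -> list[str]:
    """Group code points into approximate Devanagari clusters."""
    clusters: list[str] = []
    for ch in text:
        category = unicodedata.category(ch)
        joins_previous = bool(clusters) and (
            # Vowel signs, virama, anusvara, visarga are all marks (Mn/Mc).
            category.startswith("M")
            # A letter right after a virama forms a conjunct with it.
            or (category == "Lo" and clusters[-1].endswith(VIRAMA))
        )
        if joins_previous:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


for cluster in split_clusters("श्रीगणेशायनमः"):
    print(cluster)
```

For the first word of the page this gives the same seven units as the text2image listing above.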

Box files created by using the new `lstmbox` or `wordstrbox` formats should also work.

WordStr 33 628 1417 684 0 #श्रीगणेशायनमः ।। अलिकुलमण्डितगण्डं प्रत्यूहतिमिरमार्त्तण्डं सिन्दूरारुणशुण्डं देवंवेतण्डमुण्डमवलम्बे १ वि
1430 586 1434 644 0
WordStr 32 586 1429 644 0 #घ्नेश्वरायवरदायसुरप्रियाय लम्बोदरायसकलायजद्धिताय नागाननायक्षितियज्ञविभूषिताय गौरीसुतायगणनाथनमो 
1416 533 1420 601 0
WordStr 115 533 1415 601 0 #नमस्ते २ श्लोकः ।। तपस्यन्तंमहात्मानं मार्क्कण्डेयंमहामुनिम् ।। व्यासशिष्योमहातेजा जैमिनिः पर्य्यपृच्छत १ व्याख्या व्यासशिष्यः 
1416 508 1420 556 0

I ran fine-tuning on the single image with wordstrbox. With only 15 lines and 400 iterations the model has probably overfitted, but recognition does improve.

ubuntu@tesseract-ocr:~/TEST$   lstmeval \
>   --verbosity -1 \
>   --model ./san_impact.traineddata \
>   --eval_listfile ./san.training_files.txt
Loaded 13/13 lines (1-13) of document ./durggapatha1890_-_001.lstmf
Warning: LSTMTrainer deserialized an LSTMRecognizer!
At iteration 0, stage 0, Eval Char error rate=3.2031613, Word error rate=12.715244

ubuntu@tesseract-ocr:~/TEST$   lstmeval \
>   --model ./san_impact.traineddata \
>   --eval_listfile ./san.training_files.txt
Loaded 13/13 lines (1-13) of document ./durggapatha1890_-_001.lstmf
Warning: LSTMTrainer deserialized an LSTMRecognizer!
Truth:मार्क्कण्डेयउवाच ॥ सावर्णिस्सू्र्य्यतनयोयोमनुःकथ्यतेष्टमः ।। निशामयतदुत्पत्तिंविस्तराद्गदतोमम १
OCR  :मार्क्कण्डेयउवाच ॥ सावर्णिस्सूर्य्यतनयोयोमनुःकथ्यतेष्टमः ।। निशामयतदुत्पत्तिंविस्तराद्गदतोमम १
Truth:नमस्ते २ श्लोकः ।। तपस्यन्तंमहात्मानं मार्क्कण्डेयंमहामुनिम् ।। व्यासशिष्योमहातेजा जैमिनिः पर्य्यपृच्छत १ व्याख्या व्यासशिष्यः
OCR  :नमस्ते २ श्लोकः ।। तपस्यन्तंमहात्मानं मार्क्कणडेयंमहामुनिम । व्यासशिष्योमहातेजा जैमिनिःपर्य्यपृच्छत व ाय
Truth:शिष्यः महातेजोयस्यसः महातेजाः तपस्यन्तं तपस्यतीति तपस्यन् तं तपस्यन्तं महानआत्मा यस्यसः महात्मा तं महा
OCR  :शिष्यः महातेजोयस्यसः महातेजाः तपस्यन्तं तपस्यतीति तपस्यन् तं तपस्यन्तं महानआत्मा यस्यसः महात्मा तं महा
Truth:महातेजाः जैमिनिः तपस्यन्तं महात्मानं महामुनिं मार्क्कणडेयं पर्य्यपृच्छत इत्यन्वयः व्यासस्यशिष्यःव्यास
OCR  :्वसि्यः महातेजाः जैमिनिः तपस्यन्तं महात्मानं महामुनिं मार््कण्डेयं पर्य्यपृच्छत इत्यन्वयः व्यासस्यशिष्यःव्यास
Truth:प्राज्ञा यस्यसःमहाप्राज्ञा तत्सम्बोधने हे महाप्राज्ञ प्रज्ञाबुद्धिः सर्व्वाणिचतानि शास्त्राणि सर्व्वशास्त्राणि सर्व्वाशास्त्रेषु विशा
OCR  :प्राज्ञा यस्यसःमहाप्राज्ञा तत्सम्बोधने हे महाप्राज्ञ प्रज्ञाबुद्धिः सर्व्वाणिचतानि शास्त्राणि सर्व्वशास्त्राणि सर्व्वशास्तरेषु विशा
Truth:र्क्कण्डेयं मृगान् हरिणान् १ श्लोकः ।। मार्कण्डेयमहाप्राज्ञ सर्व्वशास्त्रविशारद ।। श्रोतुमिच्छाम्यशेषेण देवीमाहात्म्यमुत्तमम् २
OCR  :क्कण्डेयं मृगान् हरिणान् १ श्लोकः ।। मार्कण्डेयमहाप्राज्ञ सर्व्वशास्त्रविशारद ।। श्रोतुमिच्छाम्यशेषेण देवीमाहात्म्यमुत्तमम् २
At iteration 0, stage 0, Eval Char error rate=3.2031613, Word error rate=12.715244

For comparison, the results with the stock script/Devanagari model are:

ubuntu@tesseract-ocr:~/TEST$
ubuntu@tesseract-ocr:~/TEST$   lstmeval \
>   --verbosity -1 \
>   --model ~/tessdata_best/script/Devanagari.traineddata \
>   --eval_listfile ./san.training_files.txt
Loaded 13/13 lines (1-13) of document ./durggapatha1890_-_001.lstmf
Warning: LSTMTrainer deserialized an LSTMRecognizer!
At iteration 0, stage 0, Eval Char error rate=21.136278, Word error rate=62.680429



sanJochen.zip
durggapatha1890_-_001-wordstr.zip

Shree Devi Kumar

Apr 18, 2019, 11:47:47 AM
to tesser...@googlegroups.com

Jochen Barth

Apr 23, 2019, 4:23:11 AM
to tesser...@googlegroups.com
Dear Shree,
I've tried the format you described, with each letter and its signs combined into one entry (see attached file),
and also the WordStr format (see attached file),
but I still get the same error...

Kind regards, Jochen

On 18.04.19 at 17:40, Shree Devi Kumar wrote:
zz-durggapatha1890_-_001.box.lstmbox
durggapatha1890_-_001.box

Shree Devi Kumar

Apr 23, 2019, 5:01:32 AM
to tesser...@googlegroups.com
Hello Jochen,

I prefer the WordStr format since it is easier to correct the text against the ground truth, so I have not tested with the lstmbox file.
A cursory glance at the file shows that the lstmbox file does not have lines with spaces between words.

Another point to remember when training with images is that the transcription used as ground truth needs to be of gold-standard quality; otherwise the training will not improve the results. I noticed a few typos in the corrected OCR text for the 1st page and many for page 10.

I used pages 1-12 for a test run of training with WordStr boxes, and that does lead to improved results. I used smaller images, so the coordinates may not match yours. I created the ground truth files using text from the website and corrected errors (mainly in pages 1 and 10); I did not review all of it for accuracy.

I will zip all files and training script so that you can test at your end. I am not getting the encoding related errors.

Please use `--psm 6` with the lstm.train command.  
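Put together with the command from the first message, the call would look roughly like this. It is only a sketch that builds the argument list; the helper name and the example paths are placeholders, and the options are the ones already quoted in this thread (`--tessdata-dir`, `-l script/Devanagari`, `--psm 6`, `lstm.train`).

```python
from pathlib import Path


def lstmf_command(image, output_base, tessdata_dir, lang="script/Devanagari"):
    """Build the tesseract call that writes <output_base>.lstmf."""
    return [
        "tesseract",
        str(image),
        str(output_base),
        "--tessdata-dir",
        # Expand "~" here because no shell does it for an argument list.
        str(Path(tessdata_dir).expanduser()),
        "-l",
        lang,
        "--psm",
        "6",  # treat the page as a single uniform block of text
        "lstm.train",
    ]


cmd = lstmf_command("page_001.tif", "page_001", "~/tessdata_best")
print(" ".join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)` once the image and its box file are in place.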




Shree Devi Kumar

Apr 23, 2019, 5:23:03 AM
to tesser...@googlegroups.com
The zip file is too big. Let me upload it another way.

Training runs OK for me:

Warning: LSTMTrainer deserialized an LSTMRecognizer!
Continuing from /home/ubuntu/tessdata_best/script/Devanagari.lstm
Loaded 13/13 lines (1-13) of document NKP/dp10.lstmf
Loaded 13/13 lines (1-13) of document NKP/dp1.lstmf
Loaded 13/13 lines (1-13) of document NKP/dp2.lstmf
Loaded 13/13 lines (1-13) of document NKP/dp11.lstmf
Loaded 13/13 lines (1-13) of document NKP/dp12.lstmf
Loaded 13/13 lines (1-13) of document NKP/dp4.lstmf
Loaded 12/12 lines (1-12) of document NKP/dp6.lstmf
Loaded 13/13 lines (1-13) of document NKP/dp3.lstmf
Loaded 13/13 lines (1-13) of document NKP/dp5.lstmf
Iteration 0: GROUND  TRUTH : दु°टी° श्च दाराश्च ते पुत्रदाराः पुत्रदाराः आदिर्येषां ते पुत्रदारादयः तैः पुत्रदारादिभिः दाराः कलत्राणि भवान् त्वं निरस्त
Iteration 0: BEST OCR TEXT : हु०ी० *च दाराशच ते पुत्रदाराः पुत्रदाराः आदिर्येषां ते पुत्रदारादयः तैः पुत्रदारादिभिः दाराः कलत्राणि भवान्‌ त्वं निरस्त
File NKP/dp10.lstmf line 0 :
Mean rms=1.216%, delta=2.957%, train=10.656%(25%), skip ratio=0%
Loaded 13/13 lines (1-13) of document NKP/dp7.lstmf
Iteration 1: GROUND  TRUTH : पविष्ठौ तौ वैश्यपार्थिवौ काश्चित्कथाः चक्रतुः यथान्यायं यथाशास्त्रं यथायोग्यं तेन मुनिना संबिदं भाषां उपविष्टौ
Iteration 1: BEST OCR TEXT :  T
File NKP/dp11.lstmf line 0 :
Mean rms=3.579%, delta=50.635%, train=55.328%(62.5%), skip ratio=0%
Loaded 13/13 lines (1-13) of document NKP/dp8.lstmf
Iteration 2: GROUND  TRUTH : कृत्वातुतौयथान्यायंयथार्हन्तेनसंविदम् २८ उपविष्टौकथाःकाचिच्चक्रतुर्व्वैश्यपार्थिवौ ॥ राजो
Iteration 2: ALIGNED TRUTH : कृत्वातुतौयथान्यायंयथार्हन्तेनसंविदम् २८ उपविष्टौकथाःकाचिच्चकरतुर्व्वैश्यपार्थिवौ ॥ राजो
Iteration 2: BEST OCR TEXT : कृत्वातुलोयथान्यायंयथार्हन्तेनसंविदम्‌ २८' उपविंष्लोकथाःकाचिन्चकतुव्वश्यपा्धिबोौ ॥ राजो
File NKP/dp12.lstmf line 0 :
Mean rms=3.012%, delta=37.121%, train=46.249%(61.667%), skip ratio=0%
Loaded 13/13 lines (1-13) of document NKP/dp9.lstmf
Iteration 3: GROUND  TRUTH : प्राज्ञा यस्यसःमहाप्राज्ञा तत्सम्बोधने हे महाप्राज्ञ प्रज्ञाबुद्धिः सर्व्वाणिचतानि शास्त्राणि सर्व्वशास्त्राणि सर्व्वाशास्त्रेषु विशा
Iteration 3: ALIGNED TRUTH : प्राज्ञा यस्यसःमहाप्राज्ञा तत्सम्बोधने हे महाप्राज्ञ प्रज्ञाबुद्धिः सर्व्वाणिचतानि शास्त्राणि सव्वशास्त्राणि स्व्वाशास्रेषु विशा
Iteration 3: BEST OCR TEXT : म्राज्ञा यस्यसःमहाप्राज्ञा तत्सम्बीधने हे महाप्रा्ञ प्रज्ञाबुद्धिः सरव्वाणिचतानि शास्राणि सव्वशास्त्राणि सव्वशाख्रषु विशा
File NKP/dp1.lstmf line 0 :
Mean rms=2.611%, delta=29.082%, train=38.07%(62.159%), skip ratio=0%
Iteration 4: GROUND  TRUTH : महाभागः भागः भाग्यं सःअष्टमःमनुः महामायानुभावेन महतीमाया यस्याःसा महामाया महामाहायाःअनुभावः महा
Iteration 4: BEST OCR TEXT :  महाभागः भागः भाग्यं सःअष्टमःमनुः महामायानुभावेन महतीमाया यस्याःसा महामाया महामाहायाःअनुभावः महा
File NKP/dp2.lstmf line 0 :
Mean rms=2.324%, delta=24.051%, train=30.666%(49.727%), skip ratio=0%
Iteration 5: GROUND  TRUTH : अपि तैः कोलाविध्वंसिभिः सहयुद्धेजितः अतिप्रबलदण्डिनः तस्य तैःसह युद्धम् अतिप्रबलश्वासौदण्डश्च अतिप्रबल
Iteration 5: BEST OCR TEXT : अपि तैः कोलाविध्वंसिभिः सहयुद्धेज्ञितः अतिप्रबलदणिडनः तस्थ तेःसह युद्धम्‌ अतिप्रबलश्चासौदण्डशच अतिप्रबल
File NKP/dp3.lstmf line 0 :
Mean rms=2.187%, delta=21.104%, train=27.189%(51.439%), skip ratio=0%
Iteration 6: GROUND  TRUTH : क्षीणबलस्य ततः तस्यइति ततः तस्य राज्ञ: सुरथस्य कोशः अर्थसंचयः अपहृतः आत्मसात्कृतः स्वाधीनः किंच बलं
Iteration 6: BEST OCR TEXT : क्षाणबलस्य ततः तस्यईति ततः तस्य राज्ञः सुरथस्य कोडाः अर्थसंचयः अपद्वतः आत्मसात्रुतः स्वाधीनः किंच बत्त
File NKP/dp4.lstmf line 0 :
Mean rms=2.103%, delta=19.153%, train=26.624%(51.234%), skip ratio=0%
Iteration 7: GROUND  TRUTH : श्वापदाकीर्णः तं प्रशान्तश्वापदाकीर्णं प्रशान्ताः परहिंसारहिताः श्वापदाः व्याघ्रादयः आकीर्णं व्याप्तं मुनिशिष्योप
Iteration 7: ALIGNED TRUTH : श्वापदाकीर्णः तं प्रशान्तशवापदाकीर्णं प्रशान्ताः परहिंसारहिताः श्वापदाः व्याघ्रादयः आकीर्णं व्याप्तं मुनिशिष्योप
Iteration 7: BEST OCR TEXT : इवापदाकीणः तं प्रशान्तरवापदाकीर्णं प्रशञान्ताः परहिंसारहिताः इंवापदाः व्याघ्रादयः आकीर्णं व्याप्त सुनिशिष्योप

Shree Devi Kumar

Apr 23, 2019, 6:02:51 AM
to tesser...@googlegroups.com

See NKP.sh and folder NKP

The first part of the script loops through the images and creates WordStr box files for them using tesseract.
It then uses sed to replace the recognized text with the ground truth.
The corrected box file is then used to create the lstmf files.
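The replacement step can be pictured as follows. This is not the content of NKP.sh (which uses sed and is not shown here); it is an illustrative Python equivalent that assumes one ground-truth line per `WordStr` entry, in the same order, and leaves every other line of the box file untouched. The function name is hypothetical.

```python
def replace_wordstr_text(box_text: str, truth_lines: list[str]) -> str:
    """Swap the text after '#' in each WordStr line for the ground truth."""
    truth = iter(truth_lines)
    out = []
    for line in box_text.split("\n"):
        if line.startswith("WordStr ") and "#" in line:
            replacement = next(truth, None)
            if replacement is None:
                raise ValueError("fewer ground-truth lines than WordStr entries")
            # Keep 'WordStr left bottom right top page ' and replace the rest.
            head, _, _ = line.partition("#")
            line = head + "#" + replacement
        out.append(line)
    if next(truth, None) is not None:
        raise ValueError("more ground-truth lines than WordStr entries")
    return "\n".join(out)


box = "WordStr 115 533 1415 601 0 #recognized text\n\t 1416 533 1420 601 0\n"
print(replace_wordstr_text(box, ["नमस्ते २ श्लोकः"]))
```

Raising an error on a count mismatch is a deliberate choice: a silent shift by one line would pair every image line with the wrong transcription.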

lstmtraining works on text lines.
Using --psm 6 causes text in the margins to be included as part of the line, e.g.:
दु°टी° श्रीगणेशायनमः ।। अलिकुलमण्डितगण्डं प्रत्यूहतिमिरमार्त्तण्डं सिन्दूरारुणशुण्डं देवंवेतण्डमुण्डमवलम्बे १ वि

You can use any other alternative mechanism to create/correct box files.

With 12 pages used for training for 700 iterations and one page used for eval, the results are as follows:

tessdata_best/san
      At iteration 0, stage 0, Eval Char error rate=22.924007, Word error rate=62.127595

      NKP/NKP-eval.gt.txt: 142 words  63 44% common  0 0% deleted  79 56% changed
      build/NKP-eval-san.txt: 142 words  63 44% common  2 1% inserted  77 54% changed

tessdata_best/script/Devanagari
      At iteration 0, stage 0, Eval Char error rate=13.307604, Word error rate=47.984793

      NKP/NKP-eval.gt.txt: 142 words  85 60% common  0 0% deleted  57 40% changed
      build/NKP-eval-deva.txt: 141 words  85 60% common  0 0% inserted  56 40% changed

san_NKP
      At iteration 0, stage 0, Eval Char error rate=7.8737359, Word error rate=33.76221

      NKP/NKP-eval.gt.txt: 142 words  108 76% common  0 0% deleted  34 24% changed
      build/NKP-eval.txt: 142 words  108 76% common  0 0% inserted  34 24% changed

san_NKP_int
      At iteration 0, stage 0, Eval Char error rate=7.5598106, Word error rate=32.463509

      NKP/NKP-eval.gt.txt: 142 words  106 75% common  0 0% deleted  36 25% changed
      build/NKP_int-eval.txt: 142 words  106 75% common  0 0% inserted  36 25% changed
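To put these figures side by side, the relative error reduction of the fine-tuned model over script/Devanagari can be computed from the numbers above. A small sketch; only the error rates are taken from the lstmeval output, the function name is made up:

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Relative error reduction in percent."""
    return 100.0 * (baseline - improved) / baseline


# Error rates copied from the lstmeval lines above.
cer = relative_reduction(13.307604, 7.8737359)  # script/Devanagari vs san_NKP
wer = relative_reduction(47.984793, 33.76221)
print(f"char error rate reduced by {cer:.1f}%")
print(f"word error rate reduced by {wer:.1f}%")
```

With a single evaluation page of 142 words, these percentages are only a rough indication.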







Jochen Barth

Apr 23, 2019, 6:54:48 AM
to tesser...@googlegroups.com
Thanks a lot.

The error seems to have been caused by a missing space after the tab character in the line below each »WordStr« line!
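A check along these lines would have flagged the problem before training. It is a sketch under one assumption taken from this thread: every line that follows a `WordStr` entry must start with a tab and a space, followed by five integers. It is not Tesseract's own parser, and `find_bad_marker_lines` is a hypothetical helper.

```python
import re

# Tab, space, then five space-separated integers (left bottom right top page).
MARKER = re.compile(r"^\t (?:\d+ ){4}\d+$")


def find_bad_marker_lines(box_text: str) -> list[int]:
    """Return 1-based numbers of end-of-line entries that look malformed."""
    bad = []
    for number, line in enumerate(box_text.split("\n"), start=1):
        line = line.rstrip("\r")  # tolerate Windows line endings
        if line == "" or line.startswith("WordStr "):
            continue
        if not MARKER.match(line):
            bad.append(number)
    return bad


good = "WordStr 32 586 1429 644 0 #text\n\t 1430 586 1434 644 0\n"
bad = "WordStr 32 586 1429 644 0 #text\n\t1430 586 1434 644 0\n"
print(find_bad_marker_lines(good), find_bad_marker_lines(bad))
```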

Kind regards,
Jochen


On 23.04.19 at 12:02, Shree Devi Kumar wrote:


-- 
Jochen Barth * Universitätsbibliothek Heidelberg, IT * Telefon 06221 54-2580

Shree Devi Kumar

Apr 23, 2019, 8:14:53 AM
to tesser...@googlegroups.com
Glad you figured out the problem.

Please consider sharing the improved traineddata file (once you have completed training) for the tessdata_contrib repo.
