training devanagari: »Encoding of string failed!«

ba...@ub.uni-heidelberg.de

Apr 18, 2019, 8:39:21 AM

to tesseract-ocr

Dear reader,

I want to improve devanagari recognition.

I have images and manually corrected text with line coordinates.

From those, I've generated .box files;

see attached file which produces the error above.

Complete error message from lstmtraining:

»Encoding of string failed! Failure bytes: 9 32 37 38 ffffffe0 ffffffa4 ffffff98 ffffffe0 ffffffa5 ffffff8d ffffffe0 ffffffa4 ffffffa8 ffffffe0 ffffffa5 ffffff87 ffffffe0 ffffffa4 ffffffb6 ffffffe0 fffff...
Can't encode transcription: 'श्रीगणेशायनमः ।। अलिकुलमण्डितगण्डं प्रत्यूहतिमिरमार्त्तण्डं सिन्दूरारुणशुण्डं देवंवेतण्डमुण्डमवलम्बे १ वि 278घ्नेश्वरायवरदायसुरप्रियाय लम्बोदरा...

...

...«

The .lstmf files are generated using »tesseract $tiff $box --tessdata-dir ~/tessdata_best -l script/Devanagari lstm.train«

Training is run by

»combine_tessdata -u ~/tessdata_best/script/Devanagari.traineddata /tmp/Deva.trta

mkdir /tmp/deva
ls -1 *.lstmf >/tmp/list.txt

lstmtraining --model_output /tmp/deva --continue_from /tmp/Deva.trta.lstm --traineddata ~/tessdata_best/script/Devanagari.traineddata --train_listfile /tmp/list.txt«

I have double-checked that only characters from Devanagari.traineddata.lstm-unicharset are in the .box files.

No tabs, no control characters.

But the "9" from the error message above sounds like tab...?
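One way to check that suspicion is to decode the failure bytes back into text. The following is only a minimal sketch, not part of Tesseract: the function name `decode_failure_bytes` is made up for illustration, and it assumes the dump prints each byte in hex, with bytes above 0x7f sign-extended (`ffffffe0` standing for `e0`). Only the complete tokens from the message are used; the truncated tail is left out.

```python
def decode_failure_bytes(dump: str) -> str:
    """Turn a dump such as '9 32 ffffffe0 ...' back into text (hypothetical helper)."""
    raw = bytearray()
    for token in dump.split():
        # Assumption: tokens are hex and high bytes are sign-extended,
        # so only the low 8 bits carry the actual byte value.
        raw.append(int(token, 16) & 0xFF)
    return raw.decode("utf-8", errors="replace")


# Complete tokens copied from the error message above.
DUMP = (
    "9 32 37 38 ffffffe0 ffffffa4 ffffff98 ffffffe0 ffffffa5 ffffff8d "
    "ffffffe0 ffffffa4 ffffffa8 ffffffe0 ffffffa5 ffffff87 "
    "ffffffe0 ffffffa4 ffffffb6"
)

decoded = decode_failure_bytes(DUMP)
print(repr(decoded))
```

Under those assumptions the dump starts with a tab character followed by the digits 278 and the beginning of घ्नेश्वराय, which is consistent with the "278घ्ने" fragment visible in the transcription above. That would suggest a coordinate from a box line is ending up inside the text.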

Any ideas?

Kind regards, Jochen

PS: latest tesseract 4.1.0-rc1; tessdata_best: commit 95593f0b017280...

durggapatha1890_-_001.box

Shree Devi Kumar

Apr 18, 2019, 9:06:35 AM

to tesser...@googlegroups.com

> I have images and manually corrected Text with line coordinates. From those, I've generated .box files;

What method did you use for generating the .box files?

Please provide the image for the box file for test.

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/3f411945-e3d5-4b70-bce6-b33e2aab7bfc%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

Jochen Barth

Apr 18, 2019, 9:12:14 AM

to tesser...@googlegroups.com

I did create the .box files from our alto files, which were generated by transkribus (manually corrected OCR).

Image (converted from tiff - too large, but same number of pixels) see attached file.

Same as here: https://digi.ub.uni-heidelberg.de/diglit/durggapatha1890/0005/image

Kind regards,

Jochen

On 18.04.19 at 15:05, Shree Devi Kumar wrote:


-- 
Jochen Barth * Universitätsbibliothek Heidelberg, IT * Telefon 06221 54-2580

durggapatha1890_-_001.jpg

Shree Devi Kumar

Apr 18, 2019, 11:41:52 AM

to tesser...@googlegroups.com

The following format (as in your box file) will not work for Devanagari.

श 278 1253 2860 1413 0

् 278 1253 2860 1413 0

र 278 1253 2860 1413 0

ी 278 1253 2860 1413 0

ग 278 1253 2860 1413 0

ण 278 1253 2860 1413 0

े 278 1253 2860 1413 0

श 278 1253 2860 1413 0

ा 278 1253 2860 1413 0

य 278 1253 2860 1413 0

न 278 1253 2860 1413 0

म 278 1253 2860 1413 0

ः 278 1253 2860 1413 0

278 1253 2860 1413 0

See files in attached zip file which show the box/tiff pairs as created by text2image using the text with Murty Sanskrit font.

श्री 112 4669 160 4708 0

ग 156 4669 189 4701 0

णे 185 4668 225 4708 0

शा 221 4667 272 4700 0

य 268 4668 301 4700 0

न 297 4668 329 4700 0

मः 326 4668 370 4700 0

370 4667 402 4701 0

। 402 4667 407 4701 0

। 428 4667 433 4700 0

433 4667 451 4700 0

The above format works for training.
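The difference between the two listings is that the working one keeps each consonant cluster together with its vowel signs and other marks in a single box. A rough way to approximate that grouping is sketched here; this is an assumption-laden simplification, not the logic text2image or Tesseract actually uses, and `split_aksharas` is a name invented for the example. It attaches combining marks (Unicode categories Mn and Mc) to the preceding cluster, and attaches a letter to the cluster if that cluster ends in a virama.

```python
import unicodedata

VIRAMA = "\u094d"  # DEVANAGARI SIGN VIRAMA


def split_aksharas(text: str) -> list[str]:
    """Group code points into rough akshara-like clusters (simplified heuristic)."""
    clusters: list[str] = []
    for ch in text:
        category = unicodedata.category(ch)
        # Vowel signs, virama, visarga etc. are combining marks.
        is_mark = category in ("Mn", "Mc")
        # A letter directly after a virama continues the conjunct.
        joins_conjunct = (
            bool(clusters) and clusters[-1].endswith(VIRAMA) and category == "Lo"
        )
        if clusters and (is_mark or joins_conjunct):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


for cluster in split_aksharas("श्रीगणेशायनमः"):
    print(cluster)
```

For the first word of the page this yields the same seven units as the text2image box file above (श्री, ग, णे, शा, य, न, मः). It does not handle every case, so output for other text should be checked by eye.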

Box files created by using the new `lstmbox` or `wordstrbox` formats should also work.

WordStr 33 628 1417 684 0 #श्रीगणेशायनमः ।। अलिकुलमण्डितगण्डं प्रत्यूहतिमिरमार्त्तण्डं सिन्दूरारुणशुण्डं देवंवेतण्डमुण्डमवलम्बे १ वि

1430 586 1434 644 0

WordStr 32 586 1429 644 0 #घ्नेश्वरायवरदायसुरप्रियाय लम्बोदरायसकलायजद्धिताय नागाननायक्षितियज्ञविभूषिताय गौरीसुतायगणनाथनमो

1416 533 1420 601 0

WordStr 115 533 1415 601 0 #नमस्ते २ श्लोकः ।। तपस्यन्तंमहात्मानं मार्क्कण्डेयंमहामुनिम् ।। व्यासशिष्योमहातेजा जैमिनिः पर्य्यपृच्छत १ व्याख्या व्यासशिष्यः

1416 508 1420 556 0
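To make the layout of a WordStr line explicit, here is a small parsing sketch. It assumes the fields are `WordStr left bottom right top page #text`, which matches the samples above and the usual box file coordinate order; `WordStrLine` and `parse_wordstr` are hypothetical names, not Tesseract APIs.

```python
from typing import NamedTuple, Optional


class WordStrLine(NamedTuple):
    left: int
    bottom: int
    right: int
    top: int
    page: int
    text: str


def parse_wordstr(line: str) -> Optional[WordStrLine]:
    """Parse one 'WordStr l b r t page #text' line; return None for anything else."""
    # Split at most 6 times so spaces inside the transcription are preserved.
    parts = line.rstrip("\n").split(" ", 6)
    if len(parts) != 7 or parts[0] != "WordStr" or not parts[6].startswith("#"):
        return None
    try:
        left, bottom, right, top, page = (int(p) for p in parts[1:6])
    except ValueError:
        return None
    # Drop the leading '#' that introduces the line text.
    return WordStrLine(left, bottom, right, top, page, parts[6][1:])


parsed = parse_wordstr("WordStr 115 533 1415 601 0 #नमस्ते २ श्लोकः")
print(parsed)
```

The lines that begin with a tab are deliberately not parsed here; the function returns None for them.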

I ran fine-tuning on the one image using wordstrbox. 15 lines for 400 iterations will probably have overfitted, but it gives improved recognition.

ubuntu@tesseract-ocr:~/TEST$ lstmeval \

> --verbosity -1 \

> --model ./san_impact.traineddata \

> --eval_listfile ./san.training_files.txt

Loaded 13/13 lines (1-13) of document ./durggapatha1890_-_001.lstmf

Warning: LSTMTrainer deserialized an LSTMRecognizer!

At iteration 0, stage 0, Eval Char error rate=3.2031613, Word error rate=12.715244

ubuntu@tesseract-ocr:~/TEST$ lstmeval \

> --model ./san_impact.traineddata \

> --eval_listfile ./san.training_files.txt

Loaded 13/13 lines (1-13) of document ./durggapatha1890_-_001.lstmf

Warning: LSTMTrainer deserialized an LSTMRecognizer!

Truth:मार्क्कण्डेयउवाच ॥ सावर्णिस्सू्र्य्यतनयोयोमनुःकथ्यतेष्टमः ।। निशामयतदुत्पत्तिंविस्तराद्गदतोमम १

OCR :मार्क्कण्डेयउवाच ॥ सावर्णिस्सूर्य्यतनयोयोमनुःकथ्यतेष्टमः ।। निशामयतदुत्पत्तिंविस्तराद्गदतोमम १

Truth:नमस्ते २ श्लोकः ।। तपस्यन्तंमहात्मानं मार्क्कण्डेयंमहामुनिम् ।। व्यासशिष्योमहातेजा जैमिनिः पर्य्यपृच्छत १ व्याख्या व्यासशिष्यः

OCR :नमस्ते २ श्लोकः ।। तपस्यन्तंमहात्मानं मार्क्कणडेयंमहामुनिम । व्यासशिष्योमहातेजा जैमिनिःपर्य्यपृच्छत व ाय

Truth:शिष्यः महातेजोयस्यसः महातेजाः तपस्यन्तं तपस्यतीति तपस्यन् तं तपस्यन्तं महानआत्मा यस्यसः महात्मा तं महा

OCR :शिष्यः महातेजोयस्यसः महातेजाः तपस्यन्तं तपस्यतीति तपस्यन् तं तपस्यन्तं महानआत्मा यस्यसः महात्मा तं महा

Truth:महातेजाः जैमिनिः तपस्यन्तं महात्मानं महामुनिं मार्क्कणडेयं पर्य्यपृच्छत इत्यन्वयः व्यासस्यशिष्यःव्यास

OCR :्वसि्यः महातेजाः जैमिनिः तपस्यन्तं महात्मानं महामुनिं मार््कण्डेयं पर्य्यपृच्छत इत्यन्वयः व्यासस्यशिष्यःव्यास

Truth:प्राज्ञा यस्यसःमहाप्राज्ञा तत्सम्बोधने हे महाप्राज्ञ प्रज्ञाबुद्धिः सर्व्वाणिचतानि शास्त्राणि सर्व्वशास्त्राणि सर्व्वाशास्त्रेषु विशा

OCR :प्राज्ञा यस्यसःमहाप्राज्ञा तत्सम्बोधने हे महाप्राज्ञ प्रज्ञाबुद्धिः सर्व्वाणिचतानि शास्त्राणि सर्व्वशास्त्राणि सर्व्वशास्तरेषु विशा

Truth:र्क्कण्डेयं मृगान् हरिणान् १ श्लोकः ।। मार्कण्डेयमहाप्राज्ञ सर्व्वशास्त्रविशारद ।। श्रोतुमिच्छाम्यशेषेण देवीमाहात्म्यमुत्तमम् २

OCR :क्कण्डेयं मृगान् हरिणान् १ श्लोकः ।। मार्कण्डेयमहाप्राज्ञ सर्व्वशास्त्रविशारद ।। श्रोतुमिच्छाम्यशेषेण देवीमाहात्म्यमुत्तमम् २

At iteration 0, stage 0, Eval Char error rate=3.2031613, Word error rate=12.715244

As a comparison, the Devanagari results are:

ubuntu@tesseract-ocr:~/TEST$ lstmeval \

> --verbosity -1 \

> --model ~/tessdata_best/script/Devanagari.traineddata \

> --eval_listfile ./san.training_files.txt

Loaded 13/13 lines (1-13) of document ./durggapatha1890_-_001.lstmf

Warning: LSTMTrainer deserialized an LSTMRecognizer!

At iteration 0, stage 0, Eval Char error rate=21.136278, Word error rate=62.680429
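For comparing several models it can help to pull the two rates out of the lstmeval output automatically. The sketch below assumes the summary line has exactly the wording seen in the runs above; `parse_error_rates` and `relative_reduction` are illustrative names only.

```python
import re
from typing import Optional, Tuple

# Pattern written against the summary line printed in the runs above.
RATE_RE = re.compile(r"Eval Char error rate=([0-9.]+), Word error rate=([0-9.]+)")


def parse_error_rates(log_text: str) -> Optional[Tuple[float, float]]:
    """Return (char_error_rate, word_error_rate) from lstmeval output, or None."""
    match = RATE_RE.search(log_text)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


def relative_reduction(before: float, after: float) -> float:
    """Fraction by which the error rate dropped relative to the starting value."""
    return (before - after) / before


base = parse_error_rates(
    "At iteration 0, stage 0, Eval Char error rate=21.136278, Word error rate=62.680429"
)
tuned = parse_error_rates(
    "At iteration 0, stage 0, Eval Char error rate=3.2031613, Word error rate=12.715244"
)
print(f"char error rate reduced by {relative_reduction(base[0], tuned[0]):.1%}")
print(f"word error rate reduced by {relative_reduction(base[1], tuned[1]):.1%}")
```

Keep in mind that the fine-tuned model is evaluated here on the same page it was trained on, so these figures are likely optimistic.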


sanJochen.zip

durggapatha1890_-_001-wordstr.zip

Shree Devi Kumar

Apr 18, 2019, 11:47:47 AM

to tesser...@googlegroups.com

Also see https://github.com/OCR-D/ocrd-train/pull/66

https://github.com/tesseract-ocr/tesseract/issues/2357#issuecomment-477239316

Jochen Barth

Apr 23, 2019, 4:23:11 AM

to tesser...@googlegroups.com

Dear Shree,

I've tried it with the format below and combined letter-and-sign-symbols (see attached file)

and with WordStr-Format (see attached file),

but still the same error...

Kind regards, Jochen

On 18.04.19 at 17:40, Shree Devi Kumar wrote:

zz-durggapatha1890_-_001.box.lstmbox

durggapatha1890_-_001.box

Shree Devi Kumar

Apr 23, 2019, 5:01:32 AM

to tesser...@googlegroups.com

Hello Jochen,

I prefer the Wordstr format since it is easier to correct the text with ground truth, so I have not tested with the lstmbox file.

A cursory glance at the file shows that the lstmbox file does not have lines with spaces between words.

Another point to remember when training with images is that the transcription used as ground truth needs to be of `gold standard` quality, otherwise the training will not improve the results. I noticed a few typos in the corrected OCR text for the 1st page and many for page 10.

I used pages 1-12 to do a test run of training using Wordstr boxes, and that does lead to improved results. I used smaller images, so the coordinates may not match yours. I created the ground truth files using text from the website and corrected errors (mainly in pages 1 and 10). I did not review all of it for accuracy.

I will zip all files and training script so that you can test at your end. I am not getting the encoding related errors.

Please use `--psm 6` with the lstm.train command.


Shree Devi Kumar

Apr 23, 2019, 5:23:03 AM

to tesser...@googlegroups.com

The zip file is too big. Let me do an alternative upload.

Training runs ok for me -

Warning: LSTMTrainer deserialized an LSTMRecognizer!

Continuing from /home/ubuntu/tessdata_best/script/Devanagari.lstm

Loaded 13/13 lines (1-13) of document NKP/dp10.lstmf

Loaded 13/13 lines (1-13) of document NKP/dp1.lstmf

Loaded 13/13 lines (1-13) of document NKP/dp2.lstmf

Loaded 13/13 lines (1-13) of document NKP/dp11.lstmf

Loaded 13/13 lines (1-13) of document NKP/dp12.lstmf

Loaded 13/13 lines (1-13) of document NKP/dp4.lstmf

Loaded 12/12 lines (1-12) of document NKP/dp6.lstmf

Loaded 13/13 lines (1-13) of document NKP/dp3.lstmf

Loaded 13/13 lines (1-13) of document NKP/dp5.lstmf

Iteration 0: GROUND TRUTH : दु°टी° श्च दाराश्च ते पुत्रदाराः पुत्रदाराः आदिर्येषां ते पुत्रदारादयः तैः पुत्रदारादिभिः दाराः कलत्राणि भवान् त्वं निरस्त

Iteration 0: BEST OCR TEXT : हु०ी० *च दाराशच ते पुत्रदाराः पुत्रदाराः आदिर्येषां ते पुत्रदारादयः तैः पुत्रदारादिभिः दाराः कलत्राणि भवान्‌ त्वं निरस्त

File NKP/dp10.lstmf line 0 :

Mean rms=1.216%, delta=2.957%, train=10.656%(25%), skip ratio=0%

Loaded 13/13 lines (1-13) of document NKP/dp7.lstmf

Iteration 1: GROUND TRUTH : पविष्ठौ तौ वैश्यपार्थिवौ काश्चित्कथाः चक्रतुः यथान्यायं यथाशास्त्रं यथायोग्यं तेन मुनिना संबिदं भाषां उपविष्टौ

Iteration 1: BEST OCR TEXT : T

File NKP/dp11.lstmf line 0 :

Mean rms=3.579%, delta=50.635%, train=55.328%(62.5%), skip ratio=0%

Loaded 13/13 lines (1-13) of document NKP/dp8.lstmf

Iteration 2: GROUND TRUTH : कृत्वातुतौयथान्यायंयथार्हन्तेनसंविदम् २८ उपविष्टौकथाःकाचिच्चक्रतुर्व्वैश्यपार्थिवौ ॥ राजो

Iteration 2: ALIGNED TRUTH : कृत्वातुतौयथान्यायंयथार्हन्तेनसंविदम् २८ उपविष्टौकथाःकाचिच्चकरतुर्व्वैश्यपार्थिवौ ॥ राजो

Iteration 2: BEST OCR TEXT : कृत्वातुलोयथान्यायंयथार्हन्तेनसंविदम्‌ २८' उपविंष्लोकथाःकाचिन्चकतुव्वश्यपा्धिबोौ ॥ राजो

File NKP/dp12.lstmf line 0 :

Mean rms=3.012%, delta=37.121%, train=46.249%(61.667%), skip ratio=0%

Loaded 13/13 lines (1-13) of document NKP/dp9.lstmf

Iteration 3: GROUND TRUTH : प्राज्ञा यस्यसःमहाप्राज्ञा तत्सम्बोधने हे महाप्राज्ञ प्रज्ञाबुद्धिः सर्व्वाणिचतानि शास्त्राणि सर्व्वशास्त्राणि सर्व्वाशास्त्रेषु विशा

Iteration 3: ALIGNED TRUTH : प्राज्ञा यस्यसःमहाप्राज्ञा तत्सम्बोधने हे महाप्राज्ञ प्रज्ञाबुद्धिः सर्व्वाणिचतानि शास्त्राणि सव्वशास्त्राणि स्व्वाशास्रेषु विशा

Iteration 3: BEST OCR TEXT : म्राज्ञा यस्यसःमहाप्राज्ञा तत्सम्बीधने हे महाप्रा्ञ प्रज्ञाबुद्धिः सरव्वाणिचतानि शास्राणि सव्वशास्त्राणि सव्वशाख्रषु विशा

File NKP/dp1.lstmf line 0 :

Mean rms=2.611%, delta=29.082%, train=38.07%(62.159%), skip ratio=0%

Iteration 4: GROUND TRUTH : महाभागः भागः भाग्यं सःअष्टमःमनुः महामायानुभावेन महतीमाया यस्याःसा महामाया महामाहायाःअनुभावः महा

Iteration 4: BEST OCR TEXT : महाभागः भागः भाग्यं सःअष्टमःमनुः महामायानुभावेन महतीमाया यस्याःसा महामाया महामाहायाःअनुभावः महा

File NKP/dp2.lstmf line 0 :

Mean rms=2.324%, delta=24.051%, train=30.666%(49.727%), skip ratio=0%

Iteration 5: GROUND TRUTH : अपि तैः कोलाविध्वंसिभिः सहयुद्धेजितः अतिप्रबलदण्डिनः तस्य तैःसह युद्धम् अतिप्रबलश्वासौदण्डश्च अतिप्रबल

Iteration 5: BEST OCR TEXT : अपि तैः कोलाविध्वंसिभिः सहयुद्धेज्ञितः अतिप्रबलदणिडनः तस्थ तेःसह युद्धम्‌ अतिप्रबलश्चासौदण्डशच अतिप्रबल

File NKP/dp3.lstmf line 0 :

Mean rms=2.187%, delta=21.104%, train=27.189%(51.439%), skip ratio=0%

Iteration 6: GROUND TRUTH : क्षीणबलस्य ततः तस्यइति ततः तस्य राज्ञ: सुरथस्य कोशः अर्थसंचयः अपहृतः आत्मसात्कृतः स्वाधीनः किंच बलं

Iteration 6: BEST OCR TEXT : क्षाणबलस्य ततः तस्यईति ततः तस्य राज्ञः सुरथस्य कोडाः अर्थसंचयः अपद्वतः आत्मसात्रुतः स्वाधीनः किंच बत्त

File NKP/dp4.lstmf line 0 :

Mean rms=2.103%, delta=19.153%, train=26.624%(51.234%), skip ratio=0%

Iteration 7: GROUND TRUTH : श्वापदाकीर्णः तं प्रशान्तश्वापदाकीर्णं प्रशान्ताः परहिंसारहिताः श्वापदाः व्याघ्रादयः आकीर्णं व्याप्तं मुनिशिष्योप

Iteration 7: ALIGNED TRUTH : श्वापदाकीर्णः तं प्रशान्तशवापदाकीर्णं प्रशान्ताः परहिंसारहिताः श्वापदाः व्याघ्रादयः आकीर्णं व्याप्तं मुनिशिष्योप

Iteration 7: BEST OCR TEXT : इवापदाकीणः तं प्रशान्तरवापदाकीर्णं प्रशञान्ताः परहिंसारहिताः इंवापदाः व्याघ्रादयः आकीर्णं व्याप्त सुनिशिष्योप

Shree Devi Kumar

Apr 23, 2019, 6:02:51 AM

to tesser...@googlegroups.com

Uploaded the files at https://github.com/Shreeshrii/tessdata_sanskrit

See NKP.sh and folder NKP

The first part of the script loops through the images and creates Wordstr box files for them using tesseract.

It then uses sed to replace the recognized text with the ground truth.

This corrected box file is then used to create the lstmf files.

lstmtraining is done on textlines.
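The replacement step can also be done without sed. The sketch below is not what NKP.sh does; it is a hypothetical Python equivalent that assumes one ground-truth line per WordStr line, in the same order, and leaves every other line (including the tab lines) untouched. The name `apply_ground_truth` and the coordinates in the example are made up for illustration.

```python
from typing import List


def apply_ground_truth(box_text: str, truth_lines: List[str]) -> str:
    """Replace the text after '#' on each WordStr line with the matching truth line."""
    lines = box_text.split("\n")
    wordstr_count = sum(1 for line in lines if line.startswith("WordStr "))
    if wordstr_count != len(truth_lines):
        raise ValueError(
            f"{wordstr_count} WordStr lines but {len(truth_lines)} ground-truth lines"
        )
    truth = iter(truth_lines)
    out = []
    for line in lines:
        if line.startswith("WordStr "):
            # Keep 'WordStr l b r t page ' and swap only the transcription.
            head, _, _ = line.partition("#")
            out.append(head + "#" + next(truth))
        else:
            out.append(line)
    return "\n".join(out)


# Illustrative input: recognized text with an error, followed by a tab line.
box = "WordStr 33 628 1417 684 0 #wrong text\n\t 1418 628 1422 684 0\n"
print(apply_ground_truth(box, ["right text"]), end="")
```

Raising an error on a count mismatch is a deliberate choice here: if a line is missing from the ground truth, every following line would otherwise be paired with the wrong text.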

Using --psm 6 causes text in the margins to be included as part of the line, e.g.

दु°टी° श्रीगणेशायनमः ।। अलिकुलमण्डितगण्डं प्रत्यूहतिमिरमार्त्तण्डं सिन्दूरारुणशुण्डं देवंवेतण्डमुण्डमवलम्बे १ वि

You can use any other alternative mechanism to create/correct box files.

With 12 pages used for training for 700 iterations and one page used for eval, the results are as follows:

tessdata_best/san

At iteration 0, stage 0, Eval Char error rate=22.924007, Word error rate=62.127595

NKP/NKP-eval.gt.txt: 142 words 63 44% common 0 0% deleted 79 56% changed

build/NKP-eval-san.txt: 142 words 63 44% common 2 1% inserted 77 54% changed

tessdata_best/script/Devanagari

At iteration 0, stage 0, Eval Char error rate=13.307604, Word error rate=47.984793

NKP/NKP-eval.gt.txt: 142 words 85 60% common 0 0% deleted 57 40% changed

build/NKP-eval-deva.txt: 141 words 85 60% common 0 0% inserted 56 40% changed

san_NKP

At iteration 0, stage 0, Eval Char error rate=7.8737359, Word error rate=33.76221

NKP/NKP-eval.gt.txt: 142 words 108 76% common 0 0% deleted 34 24% changed

build/NKP-eval.txt: 142 words 108 76% common 0 0% inserted 34 24% changed

san_NKP_int

At iteration 0, stage 0, Eval Char error rate=7.5598106, Word error rate=32.463509

NKP/NKP-eval.gt.txt: 142 words 106 75% common 0 0% deleted 36 25% changed

build/NKP_int-eval.txt: 142 words 106 75% common 0 0% inserted 36 25% changed

Jochen Barth

Apr 23, 2019, 6:54:48 AM

to tesser...@googlegroups.com

Thanks a lot.

The error seems to have been the missing space after the tab character in the line below each »WordStr« line!
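For anyone generating WordStr box files with their own scripts, a quick check for this problem might look like the sketch below. It only tests for the one condition described here, a line that starts with a tab not followed by a space; the function names are hypothetical, and it makes no claim about other requirements of the format.

```python
from typing import List


def find_bad_tab_lines(box_text: str) -> List[int]:
    """Return 1-based numbers of lines that start with a tab but not with tab+space."""
    bad = []
    for number, line in enumerate(box_text.split("\n"), start=1):
        if line.startswith("\t") and not line.startswith("\t "):
            bad.append(number)
    return bad


def fix_tab_lines(box_text: str) -> str:
    """Insert the missing space after a leading tab; other lines are left as they are."""
    fixed = []
    for line in box_text.split("\n"):
        if line.startswith("\t") and not line.startswith("\t "):
            line = "\t " + line[1:]
        fixed.append(line)
    return "\n".join(fixed)


# Example with the space missing after the tab on the second line.
sample = "WordStr 32 586 1429 644 0 #text\n\t1430 586 1434 644 0\n"
print(find_bad_tab_lines(sample))
print(repr(fix_tab_lines(sample)))
```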

Kind regards,

Jochen

On 23.04.19 at 12:02, Shree Devi Kumar wrote:



Shree Devi Kumar

Apr 23, 2019, 8:14:53 AM

to tesser...@googlegroups.com

Glad you figured out the problem.

Please consider sharing the improved traineddata file (when you complete training) for tessdata_contrib repo.

