Diacriticals Training

shreyansh dwivedi

unread,

Sep 28, 2020, 1:13:55 AM9/28/20

to tesser...@googlegroups.com

Hello everyone,

I want to train some diacritical which are not present in latin.trained model, apart from latin i used vietnamese and latvian trained model but the some of the diacriticals are missed in those models too, some of missed characters are mentioned below which i need to recognise.

ṭ

Ṭ

ṅ

ṭh

ḍ

ḍh

ṇ

ṃ

ṣ

Ḥ

ḥ

I want to train the above diacritical to recognise the characters in the text image, through the tesseract engine.

Any help would be appreciated and from the scratch would be a great way to understand.

Thank you!

Shree Devi Kumar

unread,

Sep 28, 2020, 2:49:36 AM9/28/20

to tesseract-ocr

I am currently running a training run based on synthetic training data for Sanskrit to support both Devanagari script with vedic accents as well as iAST (Roman with diacritics support). I will share the traineddata for you and others who are interested to test how well it works with real life images.

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/CAMREWd6R%2Bec5r%3D77%2BRWGM7PUKZPqqJT%2BkNX6r9zwijvW5sxykQ%40mail.gmail.com.

shreyansh dwivedi

unread,

Oct 1, 2020, 2:00:01 AM10/1/20

to tesser...@googlegroups.com

Hello Shree,

Firstly, thank you for looking into it. Secondly, I would be grateful if you share the piece of code with the explanation part of how to train new characters for the tesseract engine.Procedural approach will make the things better for understanding. Thank you! Regards

Shreyansh Dwivedi

To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/CAG2NduW7TbFaTCNbsSQBfVw8L%2BHf0AXOC-iJPtg4LG4sg9vPDw%40mail.gmail.com.

Shree Devi Kumar

unread,

Oct 1, 2020, 3:44:38 AM10/1/20

to tesseract-ocr

Please read tesseract documentation regarding lstm training by replacing a layer.

To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/CAMREWd4bdNumFEmNOBWuYXejxPfVfJ%2BsNFUpDpenu558e6vxtQ%40mail.gmail.com.

Shree Devi Kumar

unread,

Oct 8, 2020, 8:48:19 AM10/8/20

to tesseract-ocr

I have uploaded the results of various trainings for IAST (with diacritics) and Devanagari for Sanskrit at https://github.com/Shreeshrii/tess5training-sanskrit-iast/tree/main/tessdata/best . The traineddata files and the corresponding lstm-unicharset has been uploaded there.

The training has been done mostly with line images of synthetic training data in various fonts. On evaluation datasets of synthetic training data, not seen during training, I get a CER of 2-3%. I am curious to know how well these perform with real life images.

I will appreciate if those who are testing can send me a few of their test images along with the ground truth text.

Virus-free. www.avg.com

On Mon, Sep 28, 2020 at 12:19 PM Shree Devi Kumar <shree...@gmail.com> wrote:

--

____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

Shree Devi Kumar

unread,

Oct 31, 2020, 6:50:11 AM10/31/20

to tesseract-ocr

>ṣ -> it recognises as ş

I cannot reproduce the issue. I am getting the following

Line 120: praise of Viṣṇu. Lz. 1388.
Line 147: lakṣmī XXXIX. 51.

Complete output is attached. It uses https://github.com/Shreeshrii/tess5training-sanskrit-iast/blob/main/tessdata/fast/Sanskrit-1017-fast.traineddata

Hello Shree,

I have a image comprising of sanskrit text and Romal Text comprising of diacritical a, ā, ś, Ś, ṛ, ṇ, ṃ, ū, ī, ṭ, ṅ, ḍ, ṛ, ṣ. I am using the sanskrit_int.tarinedata created by you, it recognises sanskrit text quite good for properly scanned images but for the diacritical part only a few characters could be identified namely ā, ū, but for

ṣ -> it recognises as ş

right now i am using QTBoxEditor to correct the wrongly recognised characters like the one above.

I want to ask while training for the new language model some rules are defined and one of them is the naming convention od image, here in this i want to ask what is the font type and how to identify which font name is used in the image :

[language name].[font name].exp[number].[file extension]

how to identify what should bethe font name for the image

for better understanding i am attaching the image file.

On Mon, Oct 19, 2020 at 4:45 PM Shree Devi Kumar <shree...@gmail.com> wrote:

Please share the groundtruth for the test images also.

Yes, you can certainly try to train on basis of these models.

On Mon, Oct 19, 2020, 15:51 shreyansh dwivedi <advoca...@gmail.com> wrote:
Hello Shree,
Subh navratri,
I used the trained model build by you but unfortunately they are not giving results, please refer to the picture and the text inscribed in it, what if we may build the model on the basis of it. PFA.

Regards,
Shreyansh Dwivedi

To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/CAG2NduWRgU8JFRm2RP3ndzrsVVeS%3DFF%2BDg8w3LTkjR_kv9eU7g%40mail.gmail.com.

page1_2-2-Sanskrit-1017-tmp.txt

page1_2-2.jpg

shreyansh dwivedi

unread,

Nov 5, 2020, 6:16:08 AM11/5/20

to tesser...@googlegroups.com

hello shree i am attaching the image file , box file and the train.bash script in this email along with the error generated while running the script, FYIP currently i am using windows so run the bash script on msys2 terminal

font_properties

san_NKP_int.inttemp

san_NKP_int.normproto

san_NKP_int.ocrb.exp0.box

san_NKP_int.ocrb.exp0.png

san_NKP_int.ocrb.exp0.tr

san_NKP_int.ocrb.exp1.box

san_NKP_int.ocrb.exp1.png

san_NKP_int.ocrb.exp1.tr

san_NKP_int.ocrb.exp2.box

san_NKP_int.ocrb.exp2.png

san_NKP_int.ocrb.exp2.tr

san_NKP_int.ocrb.exp3.box

san_NKP_int.ocrb.exp3.png

san_NKP_int.ocrb.exp3.tr

san_NKP_int.ocrb.exp4.box

san_NKP_int.ocrb.exp4.png

san_NKP_int.ocrb.exp4.tr

san_NKP_int.pffmtable

san_NKP_int.shapetable

san_NKP_int.traineddata

san_NKP_int.unicharset

train.bash

unicharset

below is the error screen shot generated while running the bash script

.

To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/CAG2NduUFM%3D%3DW%2BpybX69BpLgvEWa5a%3DjG5X4sMEk4T0C98P5sYA%40mail.gmail.com.

Shree Devi Kumar

unread,

Nov 5, 2020, 6:44:55 AM11/5/20

to tesseract-ocr

Are you trying to train for the legacy tesseract engine?

To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/CAMREWd7c14tPPHB2xqJf1FvCgEep_pr6CMYLhuSoFT9GNsqvtA%40mail.gmail.com.

shreyansh dwivedi

unread,

Nov 5, 2020, 6:54:05 AM11/5/20

to tesser...@googlegroups.com

no nothing like that there is no specific i have decided out of four (0 - 3 ), you may suggest which works best for this problem statement.

To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/CAG2NduWqnPAoB25M08u7cR_%2B0n31LpTV-j8e-xjZtMBHY5-Lcg%40mail.gmail.com.

shreyansh dwivedi

unread,

Nov 5, 2020, 6:57:26 AM11/5/20

to tesser...@googlegroups.com

No nothing like that there is no specific engine i have decided out of four (0 - 3 ), you may suggest which works best for this problem statement.

Shree Devi Kumar

unread,

Nov 5, 2020, 9:45:57 AM11/5/20

to tesseract-ocr

Legacy engine training won't work for Devanagari. The cube engine which was used in tesseract for Hindi has been removed.

If you are only training for English and diacritics it may work for you. But note that there are no fine-tuning options for it. You have to train a model from scratch.

,.....

shapetable, tr etc are all files for legacy engine, 3.0x and before.

It is supported in tesseract4 with --oem 0

shreyansh dwivedi

unread,

Nov 12, 2020, 1:51:13 AM11/12/20

to tesser...@googlegroups.com

Hello shree,

Than, what is the way to train the sanskrit along with roman diacritical and achieve accuracy too or the alternative ways to do achieve this ?

Regards,

To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/CAG2NduUJu%2B4fRB2vL0T_%3D6CMT4CZ%3DRccGRw24Pnc84QcTxtDLQ%40mail.gmail.com.

Shree Devi Kumar

unread,

Nov 12, 2020, 5:38:59 AM11/12/20

to tesseract-ocr

Please see tesseract-ocr/tesstrain repo

You need line images and their groundtruth text and the makefile will make box, lstmf and do the training.

Many blog posts and tutorials about tesseract training are for tesseract 3.0x. They will not work for Devanagari.

You can also look at tesstutorial for 4.0. you can try plusminus or replace top layer type of training.

For good results you need a lot of training data, eg. 50000 text lines.

To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/CAMREWd45DEt_y5EcXLQR0_gecJdEPJY1fNyGkmMDugYnGCDG%2BQ%40mail.gmail.com.

Shree Devi Kumar

unread,

Nov 12, 2020, 10:15:32 AM11/12/20

to tesseract-ocr

You can clone https://github.com/Shreeshrii/tesstrain-sanPlusMinus

and then add your owm training data and try.

Virus-free. www.avg.com

shreyansh dwivedi

unread,

Nov 13, 2020, 12:26:26 AM11/13/20

to tesser...@googlegroups.com

Hello Shree,

The mentioned repository does not exist, please check the screenshots.

Regards,

To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/CAG2NduUz8rw7hfuLXm_ke2Ji0f_xUDrKHEOF5AZeZguWjos2-g%40mail.gmail.com.

Capture_2.JPG

Shree Devi Kumar

unread,

Nov 13, 2020, 12:57:21 AM11/13/20

to tesseract-ocr

Looks like it was marked as private. Please check now.

To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/CAMREWd7bDKt4cf7rFHNk6yCD0tTHE3qMMh0Q%2BsNmqpPdSRRGAQ%40mail.gmail.com.

Shree Devi Kumar

unread,

Nov 14, 2020, 10:14:27 PM11/14/20

to tesseract-ocr

I have updated the repo to use latest version of tesstrain. Also added a plotting option to see the results of training.

I have used only synthetic images in Devanagari script.

For Sanskrit iAST + Devanagari, use script/Devanagari as start model.

shreyansh dwivedi

unread,

Nov 30, 2020, 10:25:54 AM11/30/20

to tesser...@googlegroups.com

Shree I have gone through it, but I might need proper workflow to understand the in depth of the training model process of sanskrit in devanagari script, can you suggest any tutorial for the same, and what are your views over the pyimagesearch tutorials, can i purchase the OCR expert bundle for it.

Regards,

To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/CAG2NduXqTryK3skoB3a9Y-fseYs88Pv7yHX0PLPzvqUKw8X1og%40mail.gmail.com.

shree

unread,

Dec 3, 2020, 8:16:19 AM12/3/20

to tesseract-ocr

1. git clone https://github.com/Shreeshrii/tesstrain-sanPlusMinus

2. cd tesstrain-sanPlusMinus

3. nohup make training MODEL_NAME=sanPlusMinus START_MODEL=san TESSDATA=/home/ubuntu/tessdata_best MAX_ITERATIONS=500000 LANG_TYPE=Indic DEBUG_INTERVAL=-1 > plot/sanPlusMinus.LOG &

In the last command which starts training, change the TESSDATA directory to point to wherever you have the tessdata_best/san.traineddata model.

Greg Jay

unread,

Dec 7, 2020, 2:28:17 AM12/7/20

to tesseract-ocr

I am very interested in having a working version of Tesseract that can do OCR of Devanagari including all Unicode Extended and Vedic extensions. I also wish to be able to OCR IAST and ISO15919 Latin script glyphs with Diacritics for representing Sanskrit and other Indic languages in transliteration. Also I am interested in any progress in training Tesseract to recognize Grantha script which traditionally was used in South India (especially Tamil Nadu) to represent Sanskrit. Grantha letters are also used to represent Sanskrit in Manipravala a medieval language which is a mixture of Tamil and Sanskrit. Any information and news on these issues is of interest to me.

Thanks in advance to all those training Tesseract along these lines.

Greg

shree

unread,

Dec 11, 2020, 12:13:11 PM12/11/20

to tesseract-ocr

For Sanskrit in Devanagari and IAST, you can try the traineddata files from https://github.com/Shreeshrii/tesstrain-Sanskrit-IAST

For Sanskrit alone, you can try the traineddata file from https://github.com/Shreeshrii/tesstrain-sanPlusMinus

These have the float models, to improve speed they can be compressed using `combine_tessdata -c`

I would appreciate feedback on how well these work compared to the official `san` and `Devanagari` files.

I had done some training for grantha using the Noto fonts. But to be usable, I need more training data of actual line images and their groundtruth transcription. If you can provide that, I'll be happy to retrain it.

Greg Jay

unread,

Dec 14, 2020, 1:40:59 AM12/14/20

to tesser...@googlegroups.com

Thank you

On Dec 11, 2020, at 7:13 AM, shree <shree...@gmail.com> wrote:

For Sanskrit in Devanagari and IAST, you can try the traineddata files from https://github.com/Shreeshrii/tesstrain-Sanskrit-IAST

For Sanskrit alone, you can try the traineddata file from https://github.com/Shreeshrii/tesstrain-sanPlusMinus

Thanks I’m not sure exactly what to do with these links or the files they access?

These have the float models, to improve speed they can be compressed using `combine_tessdata -c`

Sorry but I don’t know what all this means?

I would appreciate feedback on how well these work compared to the official `san` and `Devanagari` files.

I would be happy to give feedback. I have been using san. But was unaware that you can also use Devanagari. What is the difference?

I had done some training for grantha using the Noto fonts. But to be usable, I need more training data of actual line images and their groundtruth transcription. If you can provide that, I'll be happy to retrain it.

I would be happy to provide more examples of Grantha. If you tell me how to make “actual line images” and “groundtruth transcription”?

I can make images of Grantha. Let me know the format?

I can also provide the transliteration in IAST or ISO15919 or some other Indic script like Devanagari.

Sorry if I show my lack of understanding here.

Shree Devi Kumar

unread,

Dec 14, 2020, 4:47:29 AM12/14/20

to tesseract-ocr

Appreciate your offer to help and provide feedback as well as training data.

Let me try to answer your queries:

1. > I have been using san. But was unaware that you can also use Devanagari. What is the difference?

san has been trained for Sanskrit. But it is missing certain Devanagari characters. See https://github.com/tesseract-ocr/tessdata/issues/64

script/Devanagari has been trained for san, hin, mar, nep and eng. So the missing characters are all trained in this, though the language model is not strictly for san.

2. >>These have the float models, to improve speed they can be compressed using `combine_tessdata -c`

Tesseract has two kinds of traineddata files, those with best/float/double models and those with fast/integer models.

tessdata_best repo has the best/float/double models. These have better accuracy but are much slower. These can be used as START_MODEL for further finetune training.

tessdata_fast repo has fast/integer models. These are 'best value for money' models and are the ones included in the official distributions. They have slightly less accuracy but are much faster.

The traineddata files I had uploaded were only the `best/float` models after finetune training. These can be compressed to `fast/iinteger` models using the command

`combine_tessdata -c my.traineddata`

I will upload the fast versions also to the repo so that both types are available without the need for the extra step.

3. >> I’m not sure exactly what to do with these links or the files they access?

See https://github.com/tesseract-ocr/tessdoc/blob/master/Data-Files.md and https://github.com/tesseract-ocr/tessdoc/blob/master/Compiling.md#language-data

The traineddata files are the files in the tessdata folder eg. eng.traineddata, san.traineddata script/Devanagari.traineddata

https://github.com/Shreeshrii/tesstrain-Sanskrit-IAST/tree/main/data/tessdata_best has links to traineddata files after different runs of finetuning.

Sample script on Linux, if the finetuned traineddata files are in $HOME/tess5training-iast/tessdata

```

my_files=$(ls */*{*.jpg,*.tif,*.tiff,*.png,*.jp2,*.gif})
for my_file in ${my_files}; do
for LANG in Sanskrit-1017 ; do
echo -e "\n ***** " $my_file "LANG" $LANG PSM $PSM "****"
OMP_THREAD_LIMIT=1 tesseract $my_file "${my_file%.*}" --oem 1 --psm 3 -l "$LANG" --dpi 300 --tessdata-dir $HOME/tess5training-iast/tessdata -c page_separator='' -c tessedit_char_blacklist="¢£¥€₹™$¬©®¶‡†&@"
done
done
```
4. tell me how to make “actual line images” and “groundtruth transcription”?

For using tesstrain repo for training, we use single line images and their groundtruth transcription in UTF-8 text.

Files names need to have same basename with groundtruth extension being .gt.txt

Example

https://github.com/Shreeshrii/tesstrain-sanPlusMinus/blob/master/data/sanPlusMinus-ground-truth/Adishila/san.Adishila.0000001.exp0_0.png

https://github.com/Shreeshrii/tesstrain-sanPlusMinus/blob/master/data/sanPlusMinus-ground-truth/Adishila/san.Adishila.0000001.exp0_0.gt.txt

I have generated a lot of synthetic data using fonts and training text. It will be useful to have line images from scanned pages with their transcription. These can be used first to evaluate the different models and also for further finetuning.

Reply all

Reply to author

Forward