Appreciate your offer to help and provide feedback as well as training data.
Let me try to answer your queries:
1. > I have been using san. But was unaware that you can also use Devanagari. What is the difference?
`script/Devanagari` has been trained for san, hin, mar, nep and eng, so it covers all the missing characters, though its language model is not strictly for san.
2. >>These have the float models, to improve speed they can be compressed using `combine_tessdata -c`
Tesseract has two kinds of traineddata files, those with best/float/double models and those with fast/integer models.
tessdata_best repo has the best/float/double models. These have better accuracy but are much slower. These can be used as START_MODEL for further finetune training.
tessdata_fast repo has fast/integer models. These are 'best value for money' models and are the ones included in the official distributions. They have slightly less accuracy but are much faster.
The traineddata files I had uploaded were only the `best/float` models after finetune training. These can be compressed to `fast/integer` models using the command
`combine_tessdata -c my.traineddata`
I will upload the fast versions also to the repo so that both types are available without the need for the extra step.
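Until then, here is a minimal sketch of doing the compression yourself while keeping the float model. As far as I know, `combine_tessdata -c` rewrites the file it is given, so the sketch works on a copy. The helper name `make_fast_copy` and the folder names are only assumptions for illustration.

```shell
# Make a compressed (integer) copy of a float model, leaving the original untouched.
# make_fast_copy is a hypothetical helper name, not part of Tesseract.
make_fast_copy() {
    src=$1          # float model, e.g. best/my.traineddata
    dest_dir=$2     # folder for the compressed copy, e.g. fast
    mkdir -p "$dest_dir" || return 1
    dest="$dest_dir/$(basename "$src")"
    cp "$src" "$dest" || return 1
    # Convert the LSTM component of the copy to integer.
    combine_tessdata -c "$dest"
}
```

Usage would be something like `make_fast_copy best/my.traineddata fast`, after which the `fast` folder can be passed to `--tessdata-dir`.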
3. >> I’m not sure exactly what to do with these links or the files they access?
The traineddata files are the files in the tessdata folder, e.g. eng.traineddata, san.traineddata, script/Devanagari.traineddata. To use one, pass its folder with `--tessdata-dir` and its name with `-l`, for example:
```
PSM=3  # page segmentation mode 3: fully automatic, no OSD
# Loop over the globs directly so file names with spaces survive; skip patterns with no match.
for my_file in */*.jpg */*.tif */*.tiff */*.png */*.jp2 */*.gif; do
  [ -e "$my_file" ] || continue
  for LANG in Sanskrit-1017; do
    echo -e "\n ***** $my_file LANG $LANG PSM $PSM ****"
    OMP_THREAD_LIMIT=1 tesseract "$my_file" "${my_file%.*}" --oem 1 --psm "$PSM" -l "$LANG" --dpi 300 --tessdata-dir "$HOME/tess5training-iast/tessdata" -c page_separator='' -c tessedit_char_blacklist='¢£¥€₹™$¬©®¶‡†&@'
  done
done
```
4. >> Tell me how to make “actual line images” and “groundtruth transcription”?
For using tesstrain repo for training, we use single line images and their groundtruth transcription in UTF-8 text.
Each image and its transcription must share the same basename, with the groundtruth file using the extension `.gt.txt`.
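A quick way to check the naming before training is a small sketch like the one below. It lists line images that have no matching `.gt.txt` file. The helper name `check_gt_pairs` is hypothetical, and the sketch assumes the line images are plain `.png` or `.tif` files in one folder.

```shell
# List line images in a folder that lack a matching .gt.txt transcription.
# check_gt_pairs is a hypothetical helper; the image extensions are an assumption.
check_gt_pairs() {
    dir=$1
    missing=0
    for img in "$dir"/*.png "$dir"/*.tif; do
        [ -e "$img" ] || continue          # skip glob patterns that matched nothing
        gt="${img%.*}.gt.txt"              # line1.png -> line1.gt.txt
        if [ ! -f "$gt" ]; then
            echo "missing: $gt"
            missing=$((missing + 1))
        fi
    done
    [ "$missing" -eq 0 ]                   # succeed only if every image has a transcription
}
```

Running `check_gt_pairs path/to/ground-truth` prints one line per image without a transcription and returns a non-zero status if any are missing.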
Example