tesseract-ocr


Navaneetha Bitla

Jun 19, 2018, 5:30:30 AM
to tesseract-ocr
Hi, this is Navaneetha.

I'm working on a handwritten character recognition project.

I have trained 1300 different English handwriting fonts and moved the resulting files into the tessdata directory.

I tested tesseract using the commands below:

$ convert -density 300 input.png -depth 8 -strip -background white -alpha off out.tiff

$ tesseract out.tiff eng

The input.png is in the Alanis Hand font, which I have trained, but I'm not getting even 40% accuracy.

Can someone help me?


Thanks in advance.
out.txt

Shree Devi Kumar

Jun 19, 2018, 6:31:16 AM
to tesser...@googlegroups.com
Which version of tesseract?

How did you train the fonts? What was accuracy level for training? How many iterations?

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/24acbae0-13e3-4eac-a55a-802629665854%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


--

____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

Navaneetha Bitla

Jun 19, 2018, 7:24:39 AM
to tesser...@googlegroups.com
I trained the 1300 handwritten fonts using Serak trainer. It doesn't show the accuracy level or the number of iterations.

Is that important? I don't actually know, which is why I'm asking.

Thank you for the immediate reply.


Navaneetha Bitla

Jun 19, 2018, 12:19:54 PM
to tesser...@googlegroups.com
I am using Serak trainer to train tesseract 3.5.



On Tue, Jun 19, 2018 at 9:29 PM, James Q <james.qu...@taina.tech> wrote:
Hi Navaneetha
I am also looking to start training tesseract using handwritten fonts and am about to start setting up my training environment. Are you training tesseract 4 by following the guide at https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00 ?

If so, are you fine-tuning the existing English model, retraining just the top layer(s), or training from scratch with your additional fonts?

Thanks
Jim


Navaneetha Bitla

Jun 20, 2018, 11:22:45 AM
to tesser...@googlegroups.com

The link above has 1900+ fonts. From that site I downloaded the TTF files of the fonts and converted them to TIFF files online.

Then I trained on the TIFF files (fonts) using Serak trainer.


If you get good accuracy, please share the results so everyone can learn from them and follow your approach.

Thank you

On Wed, Jun 20, 2018 at 3:13 PM, James Q <james.qu...@taina.tech> wrote:
I'm going to be using tesseract 4 and using the tesstrain.sh script. If I come across things that improve accuracy though I will let you know.

Where did you find 1300 handwriting fonts?



Shree Devi Kumar

Jun 20, 2018, 11:30:41 AM
to tesser...@googlegroups.com
You will have better control over training if you use the tesstrain.sh script provided with tesseract.


Navaneetha Bitla

Jun 20, 2018, 11:35:32 AM
to tesser...@googlegroups.com
Can you help us by explaining how to train with tesstrain.sh?

It will help all of us, and we would be thankful to you.


Shree Devi Kumar

Jun 20, 2018, 11:45:14 AM
to tesser...@googlegroups.com


Shree Devi Kumar

Jun 20, 2018, 4:11:30 PM
to tesser...@googlegroups.com
Attached is a BASH script for Finetune training for 'Impact' (refer to Ray's tutorial in wiki for more details).
Use this when you want to finetune a model for a single new font.

You will need to change the paths for directories and filenames based on your system.

The script assumes that you have tesseract 4.0.0-beta installed along with the training tools. Refer to the wiki main page for information on how to download the latest version of the code, from the PPA etc.

Please read through the script first, change as needed, create the required training texts and then run the script.

#!/bin/bash
#####################################################
# Script to finetune a language traineddata file for one new font
# for tesseract4.0.0-beta
# Modify directory paths and filenames as required for your setup.
#####################################################
# Choose which parts of script are to be run?
MakeData=yes
RunTraining=yes
RunEval=yes
#####################################################

# Language 
Lang=eng

# downloaded directory with language data
langdata_dir=~/langdata

# Make about 150 lines of representative training text for finetuning
finetune_training_text=$langdata_dir/$Lang/$Lang.finetune.training_text 

# Make about 150 lines of representative training text for evaluation
eval_training_text=$langdata_dir/$Lang/$Lang.eval.training_text 

# fonts directory for this system
fonts_dir=~/.fonts

# Finetune training for IMPACT - ONE font ONLY  
fonts_for_training=" \
'Alanis Hand'  \
"
 
# directory with the old 'best' language training set to continue from eg. ara, eng, san
bestdata_dir=~/tessdata_best

# tessdata-dir which has osd.traineddata, eng.traineddata, config and tessconfigs folder and pdf.ttf
tessdata_dir=~/tessdata

# directory with training scripts - tesstrain.sh etc.
tesstrain_dir=~/tesseract/src/training

# output directories for this run
trained_output_dir=./$Lang-finetune-impact
eval_output_dir=./$Lang-finetune-impact-eval

if [ $MakeData = "yes" ]; then

echo "###### MAKING EVAL DATA ######"
 rm -rf $eval_output_dir
 mkdir $eval_output_dir

echo "#### running tesstrain.sh for eval text ####"

eval bash $tesstrain_dir/tesstrain.sh \
--lang $Lang \
--linedata_only \
--noextract_font_properties \
--exposures "0" \
--fonts_dir $fonts_dir \
--fontlist $fonts_for_training \
--langdata_dir $langdata_dir \
--tessdata_dir  $tessdata_dir \
--training_text $eval_training_text \
--output_dir $eval_output_dir

echo "###### MAKING TRAINING DATA ######"
 rm -rf $trained_output_dir
 mkdir $trained_output_dir

echo "#### running tesstrain.sh for training text ####"

eval bash $tesstrain_dir/tesstrain.sh \
--lang $Lang \
--linedata_only \
--noextract_font_properties \
--exposures "0" \
--fonts_dir $fonts_dir \
--fontlist $fonts_for_training \
--langdata_dir $langdata_dir \
--tessdata_dir  $tessdata_dir \
--training_text $finetune_training_text \
--output_dir $trained_output_dir

echo "#### running combine_tessdata to extract lstm model from 'tessdata_best' for $Lang ####"

combine_tessdata -e $bestdata_dir/$Lang.traineddata $bestdata_dir/$Lang.lstm

fi

if [ $RunTraining = "yes" ]; then

echo "###### LSTM TRAINING ######"

echo "#### running lstmtraining for finetuning from $bestdata_dir/$Lang.traineddata #####"

lstmtraining \
--continue_from  $bestdata_dir/$Lang.lstm \
--traineddata    $bestdata_dir/$Lang.traineddata \
--max_iterations 1000 \
--debug_interval 0 \
--train_listfile $trained_output_dir/$Lang.training_files.txt \
--model_output  $trained_output_dir/finetune

echo "###### BUILD FINETUNED MODEL ######"

echo "#### Building final trained file $Lang-finetune-$Lang.traineddata  ####"

lstmtraining \
--stop_training \
--continue_from $trained_output_dir/finetune_checkpoint \
--traineddata    $bestdata_dir/$Lang.traineddata \
--model_output "$trained_output_dir/$Lang-finetune-$Lang.traineddata"

fi

if [ $RunEval = "yes" ]; then

echo "###### EVAL ORIGINAL MODEL ######"

lstmeval \
--model  $bestdata_dir/$Lang.traineddata \
--eval_listfile $eval_output_dir/$Lang.training_files.txt \
--verbosity 0

echo "###### EVAL FINETUNED MODEL ######"

lstmeval \
--model  $trained_output_dir/$Lang-finetune-$Lang.traineddata \
--eval_listfile $eval_output_dir/$Lang.training_files.txt \
--verbosity 0

fi
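
Once the script has finished, the fine-tuned file can be tried on a test image. The following is only a sketch under assumptions: it takes Lang=eng and the output directory and file name used in the script above, reuses out.tiff from the first message, and uses a hypothetical output base name `result`. When tesseract or the model file is not present, it only prints the command it would run.

```shell
#!/bin/bash
# Sketch: run OCR with the fine-tuned model produced by the script above.
# Assumes Lang=eng and the script's output directory (adjust for your setup).
Lang=eng
trained_output_dir=./$Lang-finetune-impact
model_name=$Lang-finetune-$Lang   # the script writes $Lang-finetune-$Lang.traineddata

# Print the tesseract command line for a given image and output base name.
ocr_command() {
  local image=$1 outbase=$2
  echo "tesseract $image $outbase --tessdata-dir $trained_output_dir -l $model_name"
}

# Run the command only when tesseract, the model and the image are all present;
# otherwise just show what would be run.
if command -v tesseract >/dev/null 2>&1 \
   && [ -f "$trained_output_dir/$model_name.traineddata" ] \
   && [ -f out.tiff ]; then
  tesseract out.tiff result --tessdata-dir "$trained_output_dir" -l "$model_name"
else
  ocr_command out.tiff result
fi
```

Note that the second argument to tesseract is the output base name, not the language. A command such as `tesseract out.tiff eng` writes its text to eng.txt using the default language; a specific traineddata file is selected with `-l`.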

Shree Devi Kumar

Jun 20, 2018, 4:56:27 PM
to tesser...@googlegroups.com
Here are the bash script files:

1. for finetune for impact training - add a font
2. for finetune plus-minus training - for adding a new character 
lstmtrain_finetune_impact.sh
lstmtrain_finetune_plus.sh

Navaneetha Bitla

Jun 20, 2018, 11:08:02 PM
to tesser...@googlegroups.com

Shree Devi Kumar

Jun 21, 2018, 2:25:59 AM
to tesser...@googlegroups.com
> Thank you very much sir

Ma'am, not Sir. I am Mrs. Kumar.

Let me know if you have any questions or need clarification regarding the scripts. I will post them on the wiki after any needed changes.

Navaneetha Bitla

Jun 21, 2018, 2:45:26 AM
to tesser...@googlegroups.com
ok fine


chandra churh chatterjee

Jun 21, 2018, 2:55:14 AM
to tesser...@googlegroups.com
Excuse me @Shree Devi Kumar, can you please tell me whether training data for tesseract 4.0 would be better as images of handwritten paragraphs or as images of single characters, like the following:

hsf_1_00000.png

Navaneetha Bitla

Jun 21, 2018, 4:19:38 AM
to tesser...@googlegroups.com
Yes, I tried to train with these images, but it gives DPI and other errors.

Then I moved to TTF fonts, converted the TTF files to TIFF, and finally trained on that data, but the output is very bad. I don't know whether the bad results come from the training process or from the dataset.

Still trying to make progress.


Shree Devi Kumar

Jun 21, 2018, 5:22:11 AM
to tesser...@googlegroups.com
Tesseract4 LSTM training is line based. 

Shree Devi Kumar

Jun 21, 2018, 5:24:11 AM
to tesser...@googlegroups.com
I tried training with the handwriting font you mentioned in the first message.

I think that font has the same shapes for capitals and lower case letters.

So recognition rates will be lower for it.


Shree Devi Kumar

Jun 21, 2018, 8:05:42 AM
to tesser...@googlegroups.com
> Quite a few of these handwriting fonts are uppercase letters only (so lowercase come out as uppercase when typed) . What is the best type of [lang].training_text data to use for training these - is it uppercase only?

It would depend on the application where training is being used.

If you want support for both upper case and lower case, then make a list of the fonts that have only uppercase letters and create LSTMF files for them with a training text that has only capitals. For the rest of the fonts, use a normal training text with both upper and lower case. While running lstmtraining, use both sets of lstmf files.
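
One way to hand both sets of lstmf files to lstmtraining is to merge the list files written by the two tesstrain.sh runs. This is a minimal sketch, not a tested recipe: the two directory names and the combined file name are hypothetical, and it assumes each run wrote an eng.training_files.txt as in the script posted earlier.

```shell
#!/bin/bash
# Sketch: merge the lstmf list files from two tesstrain.sh runs into one list.
# All directory and file names below are hypothetical.

# Usage: combine_listfiles OUTPUT LIST1 LIST2 ...
combine_listfiles() {
  local out=$1
  shift
  # Concatenate the lists, skip blank lines, drop duplicates, keep the order.
  cat "$@" | awk 'NF && !seen[$0]++' > "$out"
}

caps_list=./eng-caps-only/eng.training_files.txt    # uppercase-only fonts
mixed_list=./eng-mixed-case/eng.training_files.txt  # fonts with both cases
combined_list=./eng.combined.training_files.txt

if [ -f "$caps_list" ] && [ -f "$mixed_list" ]; then
  combine_listfiles "$combined_list" "$caps_list" "$mixed_list"
  echo "Use --train_listfile $combined_list when running lstmtraining"
fi
```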

Shree Devi Kumar

Jun 21, 2018, 11:06:15 AM
to tesser...@googlegroups.com
You can use ALL fonts at once. However, I have had errors where box files were not created for some fonts, and the tesstrain_utils.sh script dies only at the end, while checking whether the files are readable. In that case you have to restart the process.

On Thu, Jun 21, 2018 at 8:28 PM James Q <james.qu...@taina.tech> wrote:
Hi Shree, I'm trying out the script you posted earlier which is great so thank you! I was wondering how many fonts I can specify at once in the 'fonts_for_training' list. I have run it with 9 fonts at once and that seems fine but I would like to do 100s or even 1000s if I can. Is this the best way or would I be better off creating the lstmf files in batches first?
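
If a restart after one bad font is too costly, the fonts could instead be processed in batches. The sketch below is only an illustration under assumptions: fonts.txt (one font name per line), the batch size of 100 and the font_batches directory are all hypothetical. It prints a `fonts_for_training` value for each batch, quoted in the same style as the script posted earlier.

```shell
#!/bin/bash
# Sketch: split a file of font names (one per line) into batches and print a
# fonts_for_training value for each batch. File names and sizes are hypothetical.

# Turn the lines of a file into: 'Font One' 'Font Two' ...
quote_fontlist() {
  awk 'NF { printf "%s\047%s\047", (n++ ? " " : ""), $0 } END { print "" }' "$1"
}

# Usage: make_batches FONTS_FILE BATCH_SIZE OUT_DIR
make_batches() {
  local fonts_file=$1 batch_size=$2 out_dir=$3
  mkdir -p "$out_dir"
  # split writes batch_aa, batch_ab, ... with at most batch_size lines each.
  split -l "$batch_size" "$fonts_file" "$out_dir/batch_"
}

if [ -f fonts.txt ]; then
  make_batches fonts.txt 100 ./font_batches
  for batch in ./font_batches/batch_*; do
    echo "fonts_for_training=\"$(quote_fontlist "$batch")\""
  done
fi
```

Each batch would then need its own output directory, and a failure in one batch would only require rerunning that batch.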


fadif...@gmail.com

Jun 21, 2018, 12:33:32 PM
to tesseract-ocr
@Shree

Thanks for providing the two bash scripts.
I want to ask about tesstrain.sh and tesstrain_utils.sh: is there anything that must be edited before running lstmtrain_finetune_impact.sh?

Shree Devi Kumar

Jun 21, 2018, 12:37:08 PM
to tesser...@googlegroups.com
# Make about 150 lines of representative training text for finetuning
finetune_training_text=$langdata_dir/$Lang/$Lang.finetune.training_text 

# Make about 150 lines of representative training text for evaluation
eval_training_text=$langdata_dir/$Lang/$Lang.eval.training_text 
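
The two lines quoted above name the training text files that the script expects to find. A small check before running it could look like the sketch below. It assumes Lang=eng and the ~/langdata path from the script; the helper name `check_training_text` is hypothetical and not part of the tesseract tools.

```shell
#!/bin/bash
# Sketch: verify that both training text files exist and show their line counts
# before running the fine-tuning script. Paths mirror the script's variables.
Lang=eng
langdata_dir=~/langdata

# Print "<file>: N lines", or fail if the file is missing or empty.
check_training_text() {
  local file=$1
  if [ ! -s "$file" ]; then
    echo "missing or empty: $file" >&2
    return 1
  fi
  echo "$file: $(wc -l < "$file" | tr -d ' ') lines"
}

status=0
for kind in finetune eval; do
  check_training_text "$langdata_dir/$Lang/$Lang.$kind.training_text" || status=1
done
[ "$status" -eq 0 ] || echo "create the missing training text files first" >&2
```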




