Custom Tiff/Box pairs support in tesstrain.sh


hrishikesh kaulwar

Jun 18, 2019, 5:24:09 AM
to tesseract-ocr
Greetings,
    I just learned that tesstrain.sh was modified to support user-provided box/tiff pairs by adding a tiff/box directory flag. I used that version of the Tesseract source with my own tiff/box pairs. But when I ran tesstrain.sh, I found that it only copies my tiff/box pairs to the training directory, while the .lstmf file is generated from the eng.training_text file. My tiff/box pairs are not used in creating the training data. Can someone point out what mistake I am making, or suggest a way to use only user-provided tiff/box pairs to create training data?
 Thanks in advance.

Shree Devi Kumar

Jun 18, 2019, 5:38:19 AM
to tesser...@googlegroups.com
It should work if your files follow a similar naming convention:

lang.xxxnnn.exp0.tif
lang.xxxnnn.exp0.box

where lang is your language code, e.g. eng, and

xxxnnn is any unique string (it is the font name in files generated by text2image).
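
A quick way to check a directory against this convention before running tesstrain.sh is sketched below. It is only an illustration, not part of Tesseract: check_pairs and its arguments are hypothetical names, and the pattern assumes the middle part of the file name contains no dots.

```shell
#!/usr/bin/env bash
# Sketch only: check_pairs is a hypothetical helper, not a Tesseract tool.
# It reports .tif/.box files that do not look like lang.xxxnnn.exp0.{tif,box}
# (assuming the middle part has no dots) or that have no partner file.

check_pairs() {
    local dir="$1" lang="$2" status=0 f base stem
    local re="^${lang}\.[^.]+\.exp0\.(tif|box)$"
    for f in "$dir"/*.tif "$dir"/*.box; do
        [ -e "$f" ] || continue          # the glob matched nothing
        base="$(basename "$f")"
        if [[ ! "$base" =~ $re ]]; then
            echo "bad name: $base"
            status=1
            continue
        fi
        stem="${f%.*}"                   # path without .tif/.box
        if [ ! -e "$stem.tif" ] || [ ! -e "$stem.box" ]; then
            echo "unpaired: $base"
            status=1
        fi
    done
    return "$status"
}

# Example call (the directory name is a placeholder):
#   check_pairs ./my_boxtiff_dir eng
```

The function returns a non-zero status when it finds a problem, so it can be used as a guard in front of the training command.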

  

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/f49566cf-0b6c-4b84-8c47-014ee31d3f60%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


--

____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

hrishikesh kaulwar

Jun 19, 2019, 4:12:02 AM
to tesseract-ocr
Thank you for your help. I have checked it many times. Could you tell me what I am doing wrong? For example, it takes my 3 tiff/box pairs and copies them into the training directory. Then it overwrites the exp0.tif file with randomly generated text rendered by the text2image tool. Although the 3 tiff/box pairs are accepted, it only creates an lstmf file for the first file generated by text2image and ignores the rest. I have attached the generate_training_data.sh script and a screenshot of the folder where the lstmf files are generated.

One more question: when I use the lstm.train command, a text file is also generated along with the lstmf file.
I have named the image files as per the convention:
tesseract  eng.Arial_Regular.exp0.png eng.Arial_Regular.exp0 lstm.train
The image and the two generated files are also attached.
eng.Arial_Regular.exp0.txt
eng.Arial_Regular.exp0.png
eng.Arial_Regular.exp0.lstmf
generate_training_data.sh
Screenshot from 2019-06-19 13-40-16.png

Shree Devi Kumar

Jun 19, 2019, 4:32:54 AM
to tesser...@googlegroups.com
> eng.Arial_Regular.exp0.png 

The script expects .tif files, not .png.
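
If the images exist only as PNG, they can be converted before being placed in the box/tiff directory. The sketch below assumes ImageMagick is installed and provides the convert command (newer releases may call it magick instead); tif_name_for and convert_pngs are hypothetical helper names. The matching .box file must carry the same base name as the resulting .tif.

```shell
#!/usr/bin/env bash
# Sketch only: assumes ImageMagick's `convert` is on PATH.
# tif_name_for and convert_pngs are hypothetical helper names.

# Print the .tif path that corresponds to a .png path.
tif_name_for() {
    printf '%s\n' "${1%.png}.tif"
}

# Convert every .png in a directory to a .tif with the same base name.
convert_pngs() {
    local dir="$1" png
    for png in "$dir"/*.png; do
        [ -e "$png" ] || continue        # the glob matched nothing
        convert "$png" "$(tif_name_for "$png")"
    done
}

# Example call (the directory name is a placeholder):
#   convert_pngs ./my_boxtiff_dir
```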


hrishikesh kaulwar

Jun 19, 2019, 5:14:52 AM
to tesseract-ocr
Hello Shree,
     I tried again with a .tif file, and the lstm.train command again generated a .txt file along with the lstmf file, so I don't think that is the error. Thanks for helping.

Shree Devi Kumar

Jun 19, 2019, 5:48:12 AM
to tesser...@googlegroups.com
>Also one more doubt is when I use lstm.train command a text file also gets generated with lstmf file
You can ignore that txt file. Only lstmf is used for further processing.


hrishikesh kaulwar

Jun 19, 2019, 6:53:07 AM
to tesseract-ocr

Okay, I will ignore it. I just wanted to know what the generation of the text file signifies in the lstm.train step, since it seems unusual. Is it some decoding or encoding error? Does it indicate incomplete LSTM training? I have attached a sample text file for you to check. Please tell me if you can see what is wrong. Thanks again.
eng.Arial_Regular.exp0.txt
eng.Arial_Regular.exp0.box
eng.Arial_Regular.exp0.lstmf
eng.Arial_Regular.exp0.tif

hrishikesh kaulwar

Jun 20, 2019, 1:25:11 AM
to tesseract-ocr

Hey Shree, could you tell me which lines in tesstrain.sh handle user-provided tiff/box pairs? That is, which lines create the lstmf files from those pairs and then put the names of the lstmf files in the training list? Thanks in advance.

Shree Devi Kumar

Jun 20, 2019, 2:52:52 AM
to tesser...@googlegroups.com
See tesstrain_utils.sh 


Shree Devi Kumar

Jun 20, 2019, 4:25:20 AM
to tesser...@googlegroups.com
if [[ ${MY_BOXTIFF_DIR} != "" ]]; then
    tlog "\n=== Copy existing box/tiff pairs from '${MY_BOXTIFF_DIR}'"
    cp  ${MY_BOXTIFF_DIR}/*.box ${TRAINING_DIR} | true
    cp  ${MY_BOXTIFF_DIR}/*.tif ${TRAINING_DIR} | true
    ls -l  ${TRAINING_DIR}
fi

copies the files to the training directory.

phase_I_generate_image 8

generates box/tiff pairs from the training text and the fonts specified. Please note that any files with the same names that were copied from my_boxtiff_dir will be overwritten.

phase_UP_generate_unicharset

generates the unicharset from all box files in the training directory (those meeting the file naming convention lang.xxx.exp0.box).

phase_E_extract_features " --psm 6 lstm.train " 8 "lstmf"

creates lstmf files from all the box/tiff pairs.

make__lstmdata

creates the list of lstmf files and
moves all required files from the tmp directory to the output directory.
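
Because of the overwrite noted above, it can help to check in advance whether any user-provided pair shares a name with a pair the image-generation phase would produce. The sketch below is not taken from tesstrain_utils.sh. It assumes the generated files are named lang.FONTNAME.exp0, with spaces in the font name replaced by underscores and commas dropped; generated_stem and find_collisions are hypothetical names.

```shell
#!/usr/bin/env bash
# Sketch only: both helpers are hypothetical. The naming rule below
# (spaces -> underscores, commas removed, exposure 0) is an assumption
# about how the generated box/tiff pairs are named.

# Print the assumed base name of the pair generated for one font.
generated_stem() {
    local lang="$1" font="$2"
    printf '%s.%s.exp0\n' "$lang" "$(printf '%s' "$font" | tr ' ' '_' | tr -d ',')"
}

# find_collisions DIR LANG FONT...
# Print user-provided files in DIR that a generated pair would replace.
find_collisions() {
    local dir="$1" lang="$2" font stem ext
    shift 2
    for font in "$@"; do
        stem="$(generated_stem "$lang" "$font")"
        for ext in tif box; do
            if [ -e "$dir/$stem.$ext" ]; then
                echo "would be overwritten: $stem.$ext"
            fi
        done
    done
}

# Example call (directory and font names are placeholders):
#   find_collisions ./my_boxtiff_dir eng "Arial Bold" "Times New Roman,"
```

Renaming any reported pair to a string that does not match a font name, while keeping the lang.xxxnnn.exp0 shape, avoids the clash.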



hrishikesh kaulwar

Jun 20, 2019, 7:50:39 AM
to tesseract-ocr
That was a crystal-clear explanation. Thank you for explaining, Shree. I get it now.

hrishikesh kaulwar

Jun 21, 2019, 2:19:08 AM
to tesseract-ocr
Hey Shree,
       Thanks for helping so much. Can you tell me whether fine-tuning Tesseract doesn't work for images like the attached eng.Arial_Regular.exp0.png? Could you suggest a way to get Tesseract to recognize it correctly? The S is not detected even after fine-tuning in many ways.
eng.Arial_Regular.exp0.png

Shree Devi Kumar

Jun 21, 2019, 4:45:28 AM
to tesser...@googlegroups.com
Dewarp the image for better recognition, without training.

I used scantailor.

 tesseract dewarp.tif -  --psm 6
Page 1
SAFE SURGERY CHECKLIST
I dapted from WHO Safe Surgery Checklist

eng.Arial_Regular.exp0.tif

hrishikesh kaulwar

Jul 1, 2019, 8:32:51 AM
to tesseract-ocr
Okay thanks for the suggestion. I will try it.