Tesstrain.sh fails when provided > 7 tif/box pairs


tc...@zips.uakron.edu

Jan 4, 2019, 11:27:37 AM
to tesseract-ocr
Hey all,

I'm currently working on a program that explores the handwritten OCR capabilities of Tesseract.

I have ~1400 images, each containing ~8 lines of handwritten text, with accompanying BOX files. Additionally, I've got a couple of handwritten fonts that I'm using to bootstrap the training process.

One problem I'm having is that when I invoke tesstrain.sh, it consistently fails at some point (mostly around Phase E) when more than 7 box/tif pairs or fonts are provided as input. I've tried combinations where all the inputs are font files, where all the inputs are handwritten tif/box pairs, and where the inputs are a mix of the two.

I had originally tried using Shree's modified boxtrain files but was receiving an error that had to do with failing to read in a unicharset file. So, I modified tesstrain.sh and tesstrain_utils.sh (referencing Shree's modified scripts) myself to work with my own provided tif/box pairs.

Is there a limit on the number of inputs to tesstrain.sh, or should I be able to give it all 1400 of my images without problems?

Thanks,
Tim Snyder


tc...@zips.uakron.edu

Jan 4, 2019, 11:31:42 AM
to tesseract-ocr


The program usually comes to this point before it indefinitely hangs.

fail.png

Shree Devi Kumar

Jan 4, 2019, 12:45:01 PM
to tesser...@googlegroups.com
tesstrain.sh is set up to process files in batches of 8 simultaneously. Are you allowing the script to run to completion?
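
For readers unfamiliar with the pattern, batch-wise processing of this kind can be sketched roughly as follows. This is a generic illustration and not the actual tesstrain_utils.sh code: process_file, run_in_batches, OUT_DIR and the page1..page20 inputs are hypothetical names, and the worker only creates a marker file where the real script would call text2image or tesseract.

```bash
#!/usr/bin/env bash
# Generic sketch of batch-wise parallelism; NOT the actual tesstrain_utils.sh code.

# Hypothetical worker: stands in for one text2image or tesseract call.
process_file() {
    : > "${OUT_DIR}/$1.done"
}

# Start one background job per input and pause after every batch_size jobs.
run_in_batches() {
    local batch_size=$1
    shift
    local counter=0 input
    for input in "$@"; do
        process_file "${input}" &
        counter=$((counter + 1))
        if (( counter % batch_size == 0 )); then
            wait    # block until every job of the current batch has exited
        fi
    done
    wait            # collect the final, possibly partial, batch
}

OUT_DIR=$(mktemp -d)
inputs=()
for i in $(seq 1 20); do
    inputs+=("page${i}")
done

run_in_batches 8 "${inputs[@]}"
finished=("${OUT_DIR}"/*.done)
batch_total=${#finished[@]}
echo "processed ${batch_total} of ${#inputs[@]} inputs"
```

With 20 inputs and a batch size of 8, the jobs run as groups of 8, 8 and 4. A plain wait blocks until the slowest job of a batch has finished before the next batch starts.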

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/dba86440-e325-4156-bfc7-85a1a680c63e%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Shree Devi Kumar

Jan 4, 2019, 12:51:57 PM
to tesser...@googlegroups.com
You can also try the ocr-d/train project which can train using scanned images.

tc...@zips.uakron.edu

Jan 4, 2019, 1:09:46 PM
to tesseract-ocr
Yeah, I gave it quite a while to complete and it was still stuck on the same text2image call. Upon inspection, I see that it's hanging after the eighth call to text2image during Phase I, when the synthetic images are being generated. I'm getting the same behavior using the unmodified tesstrain scripts as well. Do you know if there's an easy way to force tesstrain.sh to process files sequentially?

I'll be sure to check out ocr-d/train.

tc...@zips.uakron.edu

Jan 4, 2019, 1:15:42 PM
to tesseract-ocr
Disregard my last question. I figured out how to modify the batch size and found that it will hang indefinitely after processing the first batch of files if the specified batch size is smaller than the number of files I want to process. I set the batch size to 9999 and everything seems to be working fine now. Odd.


Shree Devi Kumar

Jan 4, 2019, 3:18:22 PM
to tesser...@googlegroups.com
That's indeed strange. What are your Tesseract version and OS? You should not be getting such errors with the unmodified tesstrain.sh script.



--

____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

tc...@zips.uakron.edu

Jan 4, 2019, 3:49:21 PM
to tesseract-ocr
I'm using Tesseract v4.0.0.20181030 which I cloned from the main GitHub page two days ago.

I built Tesseract and the training tools from source with the Autotools and Make files.

Tesseract and the training tools are being run on a WSL install of Ubuntu v18.04.1 LTS on a VirtualBox VM running Windows 10 Pro v1803.

Thanks for the reply, Shree.

bohdan.mo...@gmail.com

Jan 5, 2019, 8:00:05 AM
to tesseract-ocr
Change the loop inside phase_I_generate_image() in tesstrain_utils.sh to:

        local counter=0
        for font in "${FONTS[@]}"; do
            sleep 1
            generate_font_image "${font}" &
            let counter=counter+1
            if [[ "${counter}" -ge par_factor ]]; then
              wait -n
            fi
        done
The current version has a bash error; moreover, you waste time with wait instead of wait -n.
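
To illustrate the difference outside of the training scripts, here is a minimal, self-contained sketch of throttling with wait -n. It assumes bash 4.3 or newer (where wait -n is available); process_item, run_throttled, OUT_DIR and the font1..font5 arguments are hypothetical names used only for this demo, with process_item standing in for generate_font_image.

```bash
#!/usr/bin/env bash
# Sketch only: keep at most max_jobs background jobs running at a time.
# Requires bash 4.3+ for `wait -n`.

# Hypothetical stand-in for generate_font_image: pretend to work for a
# second, then leave a marker file so the result can be checked.
process_item() {
    sleep 1
    : > "${OUT_DIR}/$1.done"
}

run_throttled() {
    local max_jobs=$1
    shift
    local running=0 item
    for item in "$@"; do
        process_item "${item}" &
        running=$((running + 1))
        if (( running >= max_jobs )); then
            wait -n                   # returns once any single job has exited
            running=$((running - 1))
        fi
    done
    wait                              # collect the jobs that are still running
}

OUT_DIR=$(mktemp -d)
run_throttled 2 font1 font2 font3 font4 font5
done_files=("${OUT_DIR}"/*.done)
done_count=${#done_files[@]}
echo "finished ${done_count} of 5 jobs"
```

Because wait -n returns as soon as any one job exits, a free slot is refilled immediately, whereas a plain wait holds the loop until every running job has finished.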

Zdenko Podobny

Jan 5, 2019, 2:08:57 PM
to tesser...@googlegroups.com
Can you make a PR or issue so your suggestion for improvement is not lost?

Zdenko



Timothy Snyder

Jan 7, 2019, 11:43:39 AM
to tesser...@googlegroups.com
Unfortunately this did not work for me. I still have to change these lines in tesstrain.sh to successfully run it.

phase_I_generate_image 9999
...
phase_E_extract_features " --psm 6  lstm.train " 9999 "lstmf"
...
phase_E_extract_features "box.train" 9999 "tr"

In my setup, 9999 can be replaced by any number greater than the total number of fonts and box/tif pairs that you wish to train with.


Stefan Weil

Jan 23, 2019, 11:28:26 AM
to tesseract-ocr
Thank you for reporting this issue. That was a bug in the training script. It is now fixed in the latest Tesseract. See commit https://github.com/tesseract-ocr/tesseract/commit/ecf73f5bc7422f17ab68ad6daa08954324bd3ab5.