OCR-D failed at unicharset line - Help!

May

Aug 2, 2018, 7:08:11 PM

to tesseract-ocr

Hey all,

I am following Shree's script for OCR-D from the Google Groups thread on ocrd-training (https://groups.google.com/forum/#!topic/tesseract-ocr/be4-rjvY2tQ). I managed to get past the combine tessdata stage but got stuck at the unicharset stage:

I have edited the script to direct it to my path:

I do find a unicharset file named "unicharset" but not as "my.unicharset". Changing the script by removing "my." also did not solve the problem. Do you know what's causing the issue?

Best

May

Aug 2, 2018, 7:11:08 PM

to tesseract-ocr

Here are the attached photos.

Shree Devi Kumar

Aug 2, 2018, 11:52:38 PM

to tesser...@googlegroups.com

Please use the latest scripts from https://github.com/OCR-D/ocrd-train

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/48347dd8-7b7e-4d0d-9cb5-b21e3ec23f31%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

--

____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

May

Aug 7, 2018, 1:20:40 AM

to tesseract-ocr

Hey Shree

I also tried the original script from GitHub, but I faced the same issue: the process gets stuck at unicharset_output.

These are the versions:

tesseract 3.05.02

leptonica-1.75.3

libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.3) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.2.0

Shree Devi Kumar

Aug 7, 2018, 1:26:12 AM

to tesser...@googlegroups.com

The OCR-D scripts are geared towards Tesseract 4.0.x. You are trying to use them with Tesseract 3.05.
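For anyone hitting the same wall, a quick way to catch this mismatch before starting a training run is to parse the first line of the `tesseract --version` output. The following is only a sketch: the function names are made up for illustration, and the "major version 4 or later" rule is simply a restatement of the advice above, not an official compatibility check.

```python
import re
import subprocess


def parse_tesseract_version(version_output):
    """Return (major, minor, patch) from the first line of `tesseract --version`."""
    lines = version_output.strip().splitlines()
    if not lines:
        raise ValueError("empty version output")
    match = re.search(r"tesseract\s+v?(\d+)\.(\d+)(?:\.(\d+))?", lines[0])
    if match is None:
        raise ValueError("unrecognized version line: %r" % lines[0])
    major, minor, patch = match.groups()
    # A missing patch component (e.g. "4.1") is treated as 0.
    return int(major), int(minor), int(patch or 0)


def supports_lstm_training(version_output):
    """Assumption taken from this thread: LSTM training needs Tesseract 4.x or later."""
    return parse_tesseract_version(version_output)[0] >= 4


def installed_version_output():
    """Run the locally installed tesseract binary (hypothetical helper).

    stdout and stderr are concatenated because some builds print the
    version banner to stderr instead of stdout.
    """
    result = subprocess.run(["tesseract", "--version"],
                            capture_output=True, text=True)
    return result.stdout + result.stderr
```

With the version string posted above, `supports_lstm_training("tesseract 3.05.02")` is false, which matches the behaviour seen here.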

May

Aug 7, 2018, 2:42:40 AM

to tesseract-ocr

Thanks a lot, Shree. I tried Tesseract 4.0 and the training works well until it reaches the lstm-training step, where it gets stuck. I am totally new to training, so I hope you don't mind if I ask silly questions. Do you know why it got stuck? Also, would you call this training fine-tuning? I just want to improve the accuracy of the existing eng.langdata.

May

Aug 7, 2018, 3:09:42 AM

to tesseract-ocr

Oh, the training started by itself after a long while and is still processing. Does it normally take that long to train on 6 images?

Shree Devi Kumar

Aug 7, 2018, 4:30:00 AM

to tesser...@googlegroups.com

LSTM training can take hours, days, or weeks, depending on the options chosen.

You have given a complete network spec, so that is training from scratch.

Please see the following training wiki page for training related info:

https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00
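To make the distinction concrete, here is a rough sketch of how the two modes differ on the `lstmtraining` command line as described on that wiki page: training from scratch passes `--net_spec`, while fine-tuning passes `--continue_from` with an LSTM model extracted from an existing traineddata file (for example with `combine_tessdata -e eng.traineddata eng.lstm`). The helper function and all file paths below are hypothetical, and the iteration counts are arbitrary placeholders.

```python
def lstmtraining_command(model_output, traineddata, train_listfile,
                         max_iterations, continue_from=None, net_spec=None):
    """Build an lstmtraining argument list for exactly one training mode.

    continue_from: path to an extracted .lstm model  -> fine-tuning
    net_spec:      a full network specification      -> training from scratch
    """
    if (continue_from is None) == (net_spec is None):
        raise ValueError("pass exactly one of continue_from or net_spec")
    cmd = [
        "lstmtraining",
        "--model_output", model_output,
        "--traineddata", traineddata,
        "--train_listfile", train_listfile,
        "--max_iterations", str(max_iterations),
    ]
    if continue_from is not None:
        cmd += ["--continue_from", continue_from]
    else:
        cmd += ["--net_spec", net_spec]
    return cmd


# Fine-tuning: start from the weights of an existing model (hypothetical paths).
fine_tune = lstmtraining_command(
    "output/eng_tuned", "eng.traineddata", "list.train", 400,
    continue_from="eng.lstm")

# From scratch: a full network spec; the last number in O1c... must match
# the number of classes in the unicharset (111 is just an example value).
from_scratch = lstmtraining_command(
    "output/base", "eng.traineddata", "list.train", 10000,
    net_spec="[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]")
```

Either list can be handed to `subprocess.run`. If the new training text adds characters that the original model does not know, the wiki describes additional options for that case which this sketch leaves out.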

Shree Devi Kumar

Aug 7, 2018, 5:08:10 AM

to tesser...@googlegroups.com

Question: why are you trying to do training?

There are hundreds of languages already supported by tesseract. Have you tried them?

If none of them work, then you need to define what is required, e.g. is a particular typeface required? Is the traineddata missing some required characters? Is the language not fully supported?

Answering these questions will help you decide what training, if any, is required.

May

Aug 7, 2018, 1:24:45 PM

to tesseract-ocr

I'm trying to extract data from scanned PDF forms that contain geotechnical data like these. But Tesseract is not recognizing them accurately: some numbers and characters are misread, especially keywords like 'N1' and the numbers that I am looking for. I have tried pre-processing the images and -psm 4, which improved accuracy somewhat, but I think they have reached their limit, so I thought of fine-tuning on the scanned images. Do you think fine-tuning is the right way? Also, if I fine-tune, how many raw single-line image files and their corresponding typed text do I need?

Shree Devi Kumar

Aug 7, 2018, 1:45:10 PM

to tesser...@googlegroups.com

Re fine-tuning: see https://github.com/tesseract-ocr/tesseract/issues/1782#issuecomment-411018986

Have you tried providing each word separately (e.g. using OpenCV) for recognition?
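One simple way to get per-word crops from a single text line is a vertical projection profile: mark the columns that contain ink and cut wherever the blank run is wide enough. The sketch below is a dependency-free stand-in for that idea, not the OpenCV approach itself. It assumes the line image has already been binarized into rows of 0/1 values (something OpenCV's `cv2.imread` and `cv2.threshold` could produce), and the function name and the `min_gap` default are made up for illustration.

```python
def split_into_words(binary_rows, min_gap=3):
    """Return (start, end) column ranges of words in a binarized text line.

    binary_rows: list of equal-length rows, 1 = ink, 0 = background.
    min_gap:     number of consecutive blank columns that separates two words.
    The end column is exclusive, so row[start:end] is the word crop.
    """
    if not binary_rows:
        return []
    width = len(binary_rows[0])
    # Projection profile: does column x contain any ink at all?
    ink = [any(row[x] for row in binary_rows) for x in range(width)]

    words = []
    start = None  # first column of the word currently being scanned
    gap = 0       # length of the current run of blank columns
    for x, has_ink in enumerate(ink):
        if has_ink:
            if start is None:
                start = x
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                # The word ended just before this blank run began.
                words.append((start, x - gap + 1))
                start = None
                gap = 0
    if start is not None:
        # The line ended inside a word; drop any trailing blank columns.
        words.append((start, width - gap))
    return words
```

Each resulting crop could then be passed to Tesseract with `--psm 8`, which treats the image as a single word. Narrow gaps inside a word (between characters) are kept together as long as they are shorter than `min_gap`, so that value needs tuning for the font size in the scans.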
