not training on image after loading data


Kumar Rajwani

Feb 5, 2021, 10:35:21 AM2/5/21
to tesseract-ocr
Hi,
I am trying to fine-tune eng.traineddata on my images. I have tried to train, but I always get stuck somewhere; can you tell me how I can proceed further?
Current steps:
step 1: make box files

%%bash
for file in *.jpg; do
  echo "$file"
  base=$(basename "$file" .jpg)
  tesseract "$file" "$base" lstmbox
done


step 2: make lstmf files

%%bash
for file in *.jpg; do
  echo "$file"
  base=$(basename "$file" .jpg)
  tesseract "$file" "$base" lstm.train
done

step 3: create the unicharset
%%bash
function wrap {
    for i in $(seq 0 "$1"); do
        echo "$2$i$3"
    done
}
N=0
# the command substitution is left unquoted on purpose, so it expands
# to one .box filename per exporter index (0..N)
unicharset_extractor $(wrap "$N" "eng.arial.exp" ".box")
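
Note that step 4 below reads list.train and list.eval, yet none of the steps above create them. A minimal sketch of producing them from the .lstmf files of step 2 (the 90/10 split ratio and the current-directory layout are my assumptions, not something from this thread):

```python
import glob

def write_lists(lstmf_files, train_path="list.train", eval_path="list.eval"):
    """Split the sample files roughly 90/10 into training and eval list files."""
    n_eval = max(1, len(lstmf_files) // 10)   # hold out at least one file
    with open(train_path, "w") as f:
        f.writelines(name + "\n" for name in lstmf_files[n_eval:])
    with open(eval_path, "w") as f:
        f.writelines(name + "\n" for name in lstmf_files[:n_eval])

# All .lstmf files produced in step 2, in the current directory.
write_lists(sorted(glob.glob("*.lstmf")))
```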

step 4: start training

!lstmtraining \
 --model_output output/ \
 --continue_from lstm_model/eng.lstm \
 --traineddata /usr/share/tesseract-ocr/5/tessdata/eng.traineddata \
 --train_listfile list.train \
 --eval_listfile list.eval \
 --max_iterations 400
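
The thread never shows the finalizing step: once lstmtraining does iterate, the best checkpoint still has to be packed into a usable .traineddata, per the lstmtraining documentation's --stop_training mode. A sketch (the checkpoint path output/_checkpoint follows from the --model_output output/ prefix above, and eng_new.traineddata is a made-up output name; both are assumptions):

```python
import subprocess

# Pack the training checkpoint into a standalone traineddata file.
cmd = [
    "lstmtraining",
    "--stop_training",
    "--continue_from", "output/_checkpoint",
    "--traineddata", "/usr/share/tesseract-ocr/5/tessdata/eng.traineddata",
    "--model_output", "eng_new.traineddata",
]
print(" ".join(cmd))                 # review the command first
# subprocess.run(cmd, check=True)    # uncomment where lstmtraining is installed
```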


In step 4 it gives the following output:
Loaded file lstm_model/eng.lstm, unpacking...
Warning: LSTMTrainer deserialized an LSTMRecognizer!
Continuing from lstm_model/eng.lstm
Loaded 128/128 lines (1-128) of document eng.arial.exp0.lstmf
Loaded 131/131 lines (1-131) of document eng.arial.exp9.lstmf
Loaded 135/135 lines (1-135) of document eng.arial.exp7.lstmf
Loaded 114/114 lines (1-114) of document eng.arial.exp2.lstmf
Loaded 93/93 lines (1-93) of document eng.arial.exp6.lstmf
Loaded 104/104 lines (1-104) of document eng.arial.exp4.lstmf
Loaded 88/88 lines (1-88) of document eng.arial.exp5.lstmf
Loaded 131/131 lines (1-131) of document eng.arial.exp3.lstmf

It does not train after this.
So can you tell me what changes I can make for training to succeed?

Shree Devi Kumar

Feb 5, 2021, 10:58:14 AM2/5/21
to tesseract-ocr
Add the following to your lstmtraining command and see.
--debug_interval -1



--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/977d82fc-c2a6-4c3d-8db5-c6c917e9c8c0n%40googlegroups.com.


--

____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

Kumar Rajwani

Feb 5, 2021, 11:07:00 AM2/5/21
to tesseract-ocr
Hi,
I have tried that; it shows the following output:
Starting sh -c "trap 'kill %1' 0 1 2 ; java -Xms1024m -Xmx2048m -jar ./ScrollView.jar & wait"
ScrollView: Waiting for server...
Error: Unable to access jarfile ./ScrollView.jar
sh: 1: kill: No such process

Shree Devi Kumar

Feb 5, 2021, 11:10:14 AM2/5/21
to tesseract-ocr
Have you tried with value -1??
minus 1

Kumar Rajwani

Feb 5, 2021, 11:14:27 AM2/5/21
to tesseract-ocr
Hi,
I tried minus 1 and got the following result:
Iteration 0: GROUND  TRUTH : ) @®
Iteration 0: BEST OCR TEXT : Yo
File eng.arial.exp0.lstmf line 0 :



Shree Devi Kumar

Feb 5, 2021, 11:25:43 AM2/5/21
to tesseract-ocr
On Fri, Feb 5, 2021 at 4:44 PM Kumar Rajwani <kumarraj...@gmail.com> wrote:
hi,
i have tried minus 1 and got following result
Iteration 0: GROUND  TRUTH : ) @®
Iteration 0: BEST OCR TEXT : Yo
File eng.arial.exp0.lstmf line 0 :
 
What's your version of tesseract? What o/s?

Without your files, it's difficult to know what's causing the issue.

With --debug_interval -1 you should get the info for every iteration.

Kumar Rajwani

Feb 5, 2021, 12:03:10 PM2/5/21
to tesseract-ocr
I have tried to do the same thing in tesseract 4, which gets stuck at the following line:
Compute CTC targets failed!

On Friday, February 5, 2021 at 5:04:42 PM UTC+5:30 Kumar Rajwani wrote:
!tesseract -v
tesseract 5.0.0-alpha-20201231-171-g04173
 leptonica-1.78.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
 Found AVX2
 Found AVX
 Found FMA
 Found SSE
 Found OpenMP 201511
 Found libarchive 3.2.2 zlib/1.2.11 liblzma/5.2.2 bz2lib/1.0.6 liblz4/1.7.1

image example:
I have added one image from my training data.

I am using Colab, which runs Ubuntu.
https://colab.research.google.com/drive/1_Bn4wbK6dE5zYAuFyC4Eczq_eNU2shuz?usp=sharing
This is my notebook; you can see the complete process in the "finetune 2" section.

Shree Devi Kumar

Feb 5, 2021, 12:16:22 PM2/5/21
to tesseract-ocr
I see the tabular image that you shared. I don't think training is going to help you here; eng.traineddata should be able to recognize it quite well. You should select the different areas of interest and just OCR those sections.


Kumar Rajwani

Feb 5, 2021, 12:20:30 PM2/5/21
to tesseract-ocr
The main thing is that I want to learn about training tesseract at the image level, so can you please tell me how I can proceed further? I want to know where the main problem is.

Kumar Rajwani

Feb 5, 2021, 12:44:19 PM2/5/21
to tesseract-ocr
I have tried a lot of images; it gets about 90% accuracy but always misses one side of the image. That's the reason I want to train the model: if it can improve even a little bit, that would be great.
If you can provide a script or steps that can help me, that would be good for me.

Shree Devi Kumar

Feb 5, 2021, 2:23:26 PM2/5/21
to tesseract-ocr

Kumar Rajwani

Feb 5, 2021, 3:08:31 PM2/5/21
to tesseract-ocr
Thanks for this. I know how to use tesseract. I have multiple images where I can't improve the image quality, so I want to improve my model to get text from them.
Are you saying that text detection will not improve with training?
Because I don't have an issue with text recognition; most of the time it is right.
Can you tell me how I can improve the model to get more text from the image? I am using psm 11, where it finds lots of text but some is missing.

Kumar Rajwani

Feb 7, 2021, 2:43:56 AM2/7/21
to tesseract-ocr
Hey, can you please tell me how I can improve text detection for this kind of image?

Kumar Rajwani

Feb 8, 2021, 12:47:49 PM2/8/21
to tesseract-ocr
Hey, I am still waiting for your reply. Can you please resolve my doubts?

Ger Hobbelt

Feb 11, 2021, 7:34:29 PM2/11/21
to tesser...@googlegroups.com
Have you read the two pages linked to in the answer from February 5th?
Have you executed those procedures, or anything similar, to extract the individual table cell images and feed those to tesseract?
So far you have not shown images or any results indicating you have used a tabular recognition and cell extraction process at all (a preprocess required by the type of input image you have provided so far, if you want to significantly improve OCR output quality). So, *hey*, what are your results so far following the sage advice (Feb 5)?
(quoted below for convenience:)


On Friday, February 5, 2021 at 7:53:26 PM UTC+5:30 shree wrote:




Met vriendelijke groeten / Best regards,

Ger Hobbelt

--------------------------------------------------
web:    http://www.hobbelt.com/
        http://www.hebbut.net/
mail:   g...@hobbelt.com
mobile: +31-6-11 120 978
--------------------------------------------------


Kumar Rajwani

Feb 12, 2021, 6:56:06 AM2/12/21
to tesseract-ocr
Hey, both of those pages are about steps after some text has been detected, right?
I have many images of this type where I am not able to detect the date at the top right; the contact name, phone, and fax are also not read correctly every time, or are missed in the detection step.
That's the reason I am asking: I have documents in a similar format, so if I train the model on them, will it help the model in the detection and recognition part?
I don't know how tesseract detects the text in the whole form.
I have tried thresholding, scaling, and sharpening, but these don't give me results every time.

Ger Hobbelt

Feb 12, 2021, 5:14:12 PM2/12/21
to tesser...@googlegroups.com
Ah, a misunderstanding there.

Ok, the key message of those pages is: you must extract each "table cell" as a /separate/ image to help OCR, then, if needed, combine the text results for each of those smaller images to form the text of your page. 

That's often referred to as "segmentation".

Tesseract has an algorithm for that built in AFAICT, but it is geared towards pages of text (reams of text, lines of text) and picking out the individual words in there. That task gets very confused when you feed it a table layout, which has all kinds of edges in the image that are /not/ text, but table cell /borders/.

So what those links are hinting at is that you need to come up with an image *preprocess* which can handle your type of table. This depends on your particular table layout, as there are many ways to "design / style" a table. 

So you will have to write some script which will find and then cut out each table cell as an image to feed tesseract. 

When you look for segmentation approaches on the net, leptonica and opencv get mentioned a lot. 

Unfortunately, most segmentation work you find when googling is about object and facial recognition. Not a problem per se; isn't a table cell an object too? Well, not really, not in the sense they're using it, as those algorithms approach image segmentation from the concept of each object being an area filled with color(s). This would be applicable if the table were styled as cells with an alternating background, for instance, but yours is all white with just some thin black borders.

There's a couple of ideas for that:

1: conform the image to an (empty) form template, i.e. seek a way to make your scanned form overlay near perfectly on a template image. Then you have to define your areas of interest (box coordinates in the template) and clip those parts out, save them as individual files and feed those to tesseract. This is often done for government application forms: there is a reason you're supposed to only write within the boxes. 😉

That is what that first link alludes at. It's just one idea among many to try. 
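
Idea 1 ultimately amounts to clipping fixed box coordinates out of the aligned scan. A toy sketch on a nested-list "image" (the box coordinates and page content here are invented purely for illustration):

```python
def clip_box(img, left, top, right, bottom):
    """Cut one region of interest out of a row-major 2D image."""
    return [row[left:right] for row in img[top:bottom]]

# A 4x6 toy "page"; in practice img would be the scanned form after it
# has been aligned (conformed) to the empty template.
img = [list(row) for row in ("......",
                             ".AB...",
                             ".CD...",
                             "......")]
cell = clip_box(img, 1, 1, 3, 3)   # the box holding AB / CD
print(cell)                        # → [['A', 'B'], ['C', 'D']]
```

Each clipped cell would then be saved as its own image and fed to tesseract separately.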

2: what if you cannot or must not apply idea 1? Can we perhaps detect those table borders through image processing and /then/ come up with something that can take that data and help us extract the cell images? 

I must say I haven't done this myself yet, but some googling uncovered this link (after quickly scanning several false positives in my google results and several altered search attempts): https://stackoverflow.com/questions/33949831/whats-the-way-to-remove-all-lines-and-borders-in-imagekeep-texts-programmatic

Here are a couple of fellows who have thought "out of the box" (pun intended) and gotten some results by phrasing my question in an entirely different way. Instead of wondering how we can detect and extract those table cells, they try to answer the question: "what if we are able to *remove* those cell borders visually?" Yes, we will worry later about the text in the cells looking like a haphazard ream of text, and we expect trouble discerning which bit of recognized text was in exactly which cell (tesseract can output hOCR and other formats which deliver text plus placement coordinates; you may have to work on that *afterwards* when you do something like they're doing).

Looks promising to me. What I'd attempt next with their approach is to see if I can make those detected borders extend, and then extract each individual black area (cell!) as a pixel *mask*, to be applied to my (conformed) page image so that everything is thrown out except the pixels in that cell, thus giving me one image of one cell's worth of text. Repeat that for each black area (see the answers at https://stackoverflow.com/questions/33949831/whats-the-way-to-remove-all-lines-and-borders-in-imagekeep-texts-programmatic to see what I mean: the result image he gets is pure black with the table borders (lines) in white).

/They/ tackle the problem similarly, but conceptually in a very different way than I am thinking about now: they go and mask out the detected table borders in one go.
That can work very well and is much faster, as they are not extracting subimages by masking or other means.
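
The Stack Overflow approach linked above does this with long morphological kernels in OpenCV; the underlying principle can be shown on a toy bitmap: treat any row or column that is almost entirely black as a table border and blank it (the 0.8 threshold here is an arbitrary choice of mine):

```python
def remove_borders(img, ratio=0.8):
    """Blank rows/columns whose black-pixel ratio exceeds `ratio` (1 = black)."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    # Rows that are mostly black are horizontal rules.
    for y in range(h):
        if sum(img[y]) / w >= ratio:
            out[y] = [0] * w
    # Columns that are mostly black are vertical rules.
    for x in range(w):
        if sum(img[y][x] for y in range(h)) / h >= ratio:
            for y in range(h):
                out[y][x] = 0
    return out
```

Real scans need the OpenCV version (deskewing, noise, thick borders), but the zoning idea is the same.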

Their *potential* trouble will be deciding which bits of text were together in which cell. That can be done with bbox analysis after ocr/tesseract has done its job. (Again, google can provide hints; again, it depends on your particular circumstances.)
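
That bbox analysis can start from tesseract's TSV output (tesseract input.png out tsv), whose columns are level, page_num, block_num, par_num, line_num, word_num, left, top, width, height, conf, text. A sketch that buckets recognized words into table columns by their left coordinate (the column edges are placeholder values you would take from the detected borders):

```python
def group_by_column(tsv_text, col_edges):
    """Assign each recognized word to the column whose x-range contains it."""
    cols = [[] for _ in range(len(col_edges) - 1)]
    for line in tsv_text.splitlines()[1:]:          # skip the header row
        fields = line.split("\t")
        if len(fields) < 12 or not fields[11].strip():
            continue                                # not a word entry
        left = int(fields[6])                       # the `left` column
        for i in range(len(col_edges) - 1):
            if col_edges[i] <= left < col_edges[i + 1]:
                cols[i].append(fields[11])
                break
    return cols
```

The same bucketing by `top` gives rows; together they reconstruct the cell grid from flat OCR output.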

My (very probable) trouble will be identifying the black cell areas singularly: doing a simple flood fill with a color, then extracting anything covered by that color, is troublesome, as the table border detection might very well not be perfect and thus cause my simple flood fill to color adjacent cells too. 😢 So, if I had your task, I'd be looking at ways to extract, say, each individual *minimum rectangle* which does not contain white pixels (uh-oh, need noise removal then!) OR perhaps a way where each detected line segment is described as a vector, and then extend those lines out across the page to get my rectangles in between: those would be my cells then. That's a bother when the table has cells spanning columns or rows. So more research is needed before I'd code that preprocess.

Another issue with the line detection + removal/zoning techniques would be making sure the lines are all either near-perfectly horizontal and vertical (*orienting*/*deskewing* the image will help some there) OR you must come up with an algorithm that's able to find angled lines (while ignoring the curvy text characters). Again, yet another area of further investigation if I were at it.

The key here is that you'll have to do some work on your images before you can call tesseract and expect success. 

HTH. 

Kumar Rajwani

Mar 5, 2021, 5:53:33 AM3/5/21
to tesseract-ocr
Great answer.
Can you please guide me: if a word is not recognized correctly by tesseract, how can we insert it into the dictionary?
As I read here (https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/33418.pdf), section 6, "Linguistic Analysis", says it has dictionary words.
