mftraining Segmentation fault error

Tom De Costere

Nov 2, 2016, 12:44:52

to tesseract-ocr

Hello,

We are trying to train Tesseract with a new font built from the handwriting of multiple customers.

The training itself works nicely and the OCR results are very good (85-90% correct detection).

However, today something strange started to happen during the training process (which we have automated using Python on Ubuntu 10.04).

During training with mftraining we encountered the error "Ouch! number of protos = 513, vs max of 512!" followed by "Segmentation fault (core dumped)".

As a result, the pffmtable file is not created, and that file is required in the next step.

This started after we reached a certain number of font files to train on (130 concatenated .tr files).
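
A minimal sketch of how an automated pipeline could catch this failure at the mftraining step, rather than at the next step when pffmtable turns out to be missing. The helper name `run_training_step` is hypothetical and not part of Tesseract; it only wraps Python's standard `subprocess` module and assumes a POSIX system, where a negative return code means the process was killed by a signal.

```python
import os
import signal
import subprocess


def run_training_step(cmd, expected_outputs, cwd=None):
    """Run one training command and verify that it produced its output files.

    Raises RuntimeError if the process was killed by a signal (for example
    SIGSEGV), exited with a non-zero status, or left an expected file missing.
    Returns the combined stdout/stderr text on success.
    """
    result = subprocess.run(cmd, cwd=cwd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, universal_newlines=True)
    if result.returncode < 0:
        # On POSIX, a negative return code -N means "terminated by signal N".
        sig_name = signal.Signals(-result.returncode).name
        raise RuntimeError("%s was killed by %s:\n%s"
                           % (cmd[0], sig_name, result.stdout))
    if result.returncode != 0:
        raise RuntimeError("%s exited with status %d:\n%s"
                           % (cmd[0], result.returncode, result.stdout))
    # A zero exit status does not prove the outputs exist, so check them too.
    base = cwd or "."
    missing = [name for name in expected_outputs
               if not os.path.exists(os.path.join(base, name))]
    if missing:
        raise RuntimeError("%s finished but did not create: %s"
                           % (cmd[0], ", ".join(missing)))
    return result.stdout
```

In the pipeline this might be called as `run_training_step(["mftraining", "-F", "font_properties", "-U", "unicharset", "-O", "ncorp.unicharset"] + tr_files, ["pffmtable"])`. The argument list here is an assumption based on the usual 3.0x training instructions and should be replaced by whatever the existing script already passes.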

Can anybody help us with this problem?

Software details:

OS: Ubuntu 16.04.1 LTS

Codename: xenial

Tesseract: 3.04 installed through apt-get

tesseract-ocr/xenial,now 3.04.01-4 amd64 [installed]

tesseract-ocr-eng/xenial,xenial,now 3.04.00-1 all [installed,automatic]

tesseract-ocr-equ/xenial,xenial,now 3.04.00-1 all [installed,automatic]

tesseract-ocr-osd/xenial,xenial,now 3.04.00-1 all [installed,automatic]

ShreeDevi Kumar

Nov 2, 2016, 13:03:26

to tesser...@googlegroups.com

Please see https://groups.google.com/forum/#!msg/tesseract-dev/u5CSn3B3mYc/U39zS6MeCQAJ

There seems to be a limit.

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscribe@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/fc4f92b3-d9e0-497e-806f-4de580b07a80%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

RKVS Raman

Nov 2, 2016, 14:41:54

to tesseract-ocr

But why would you need 130 .tr files?

Are you using 130 fonts?

I think there is a limit of 64 fonts in Tesseract.

If it is just 1 font (or 1 kind of handwriting in your case), then you can put it in 1 multi-page TIFF file that does not exceed 120 pages.

Best Regards
-Raman

-----------------------------------------------
RKVS Raman
http://sites.google.com/site/rkvsraman
------------------------------------------------


Tom De Costere

Nov 3, 2016, 4:51:28

to tesseract-ocr

Hello,

Thank you for your responses!

Let me clarify our training setup, so you understand why we have 130+ .tr files.

We have fill-in forms that our customers hand over to our workers, specifying when and what work our workers have performed at their house. These forms contain fill-in boxes, such as date, name and work hours.

The biggest waste of time at our company is the manual entry of these documents into our electronic bookkeeping application.

Currently, our staff have to manually retype the filled-in values from the paper forms into the application.

As you can guess, this is a very long and time-consuming task, which nobody loves to do every day.

Since, at the moment, almost no other OCR technology gives a good recognition rate for handwriting, we are trying Tesseract to improve this job.

Our automated training process uses these fill-in forms as the basis for training Tesseract.

We created a .NET program for generating the box files and correcting the OCR values, which some of our workers use at the moment.

The corrected box files are then sent to our OCR server (running Tesseract), which trains the language file with the new inputs.

So, to improve the recognition rate, we are creating one big language file for our entire customer base, with a unique font for each customer, since every customer has his or her own handwriting.

At the moment we have generated over 1000 box files for around 130 customers (130 out of 3000+ customers).

So to give an example:

ncorp.traineddata consists of these fonts:

- ocrB (standard printer font)

- customerA (handwriting for customer A)

- customerB (handwriting for customer B)

- customerC (handwriting for customer C)

- ...

This is why we have over 130 .tr files at the moment, and the number is steadily rising every hour.
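
For readers following along: in Tesseract 3.x training, each font name used in the .tr files needs a matching line in the font_properties file, in the form `fontname italic bold fixed serif fraktur`, with each flag 0 or 1. A minimal sketch of generating those lines for per-customer fonts follows. The function name is hypothetical, and setting all flags to 0 for handwriting is an assumption, not something stated in this thread.

```python
def font_properties_lines(font_names, flags=(0, 0, 0, 0, 0)):
    """Build font_properties lines for the given font names.

    Each line has the form '<fontname> <italic> <bold> <fixed> <serif> <fraktur>'.
    The same flags are applied to every font; all-zero is only an assumed
    default for handwriting samples.
    """
    if len(flags) != 5 or any(f not in (0, 1) for f in flags):
        raise ValueError("flags must be five values, each 0 or 1")
    lines = []
    for name in font_names:
        # The file is whitespace-separated, so a font name cannot contain spaces.
        if not name or any(ch.isspace() for ch in name):
            raise ValueError("invalid font name: %r" % (name,))
        lines.append("%s %s" % (name, " ".join(str(f) for f in flags)))
    return lines


if __name__ == "__main__":
    for line in font_properties_lines(["customerA", "customerB", "customerC"]):
        print(line)
```

The printed lines could be appended to the font_properties file that the training script passes to mftraining.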

Ideally, Tesseract would have a re-train function, so that we would not have to train the whole file again and again and could simply inject a new font when a new customer needs one.

Correct me if I'm wrong, but as far as I know, and as far as I have found on the internet, Tesseract doesn't have a re-train function that takes an existing traineddata file as input and outputs an improved version of it.

@Shree

@Rkvsraman

If there is a limit for Tesseract training, why does it ship a font_properties file with around 4000 fonts?

Or is this purely to be able to train using these fonts?

Might there be another way to train on such a large number of fonts?

Could training the fonts into multiple language files be the solution?
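
A rough sketch of what the multiple-language-file idea could look like: group the customer fonts into batches that stay under a chosen per-file limit, and train one traineddata file per batch. The limit of 64 is only the figure guessed earlier in this thread, not a confirmed constant, and the `ncorp0`, `ncorp1`, ... naming is hypothetical. Tesseract 3.x accepts several languages joined with `+` in the `-l` option, so the batches could be combined at recognition time. Whether this avoids the protos limit, or gives acceptable accuracy and speed, is untested here.

```python
def batch_fonts(font_names, max_fonts_per_lang=64, prefix="ncorp"):
    """Split font names into batches, one batch per language file.

    Returns a dict mapping a language name (prefix + batch index) to the
    list of font names that would be trained into that file.
    """
    if max_fonts_per_lang < 1:
        raise ValueError("max_fonts_per_lang must be at least 1")
    names = sorted(font_names)  # stable order, so batches are reproducible
    batches = {}
    for start in range(0, len(names), max_fonts_per_lang):
        lang = "%s%d" % (prefix, start // max_fonts_per_lang)
        batches[lang] = names[start:start + max_fonts_per_lang]
    return batches


def lang_argument(batches):
    """Join the batch language names for use as the value of tesseract's -l option."""
    return "+".join(sorted(batches))


if __name__ == "__main__":
    fonts = ["customer%03d" % i for i in range(130)]
    batches = batch_fonts(fonts)
    for lang, members in sorted(batches.items()):
        print(lang, len(members))
    print("-l", lang_argument(batches))
```

A drawback of this design is that adding a customer to an existing batch still means retraining that batch, although only that one file rather than everything.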

Kind regards,

Tom

On Wednesday, November 2, 2016 at 19:41:54 UTC+1, rkvsraman wrote:


ShreeDevi Kumar

Nov 3, 2016, 12:53:51

to tesser...@googlegroups.com

Please see https://github.com/tesseract-ocr/tesseract/blob/master/training/language-specific.sh

The maximum number of fonts for each language there is not very large.

I am not even sure whether increasing the number of fonts beyond a certain limit will improve recognition.

I think it is unlikely that Tesseract can handle the thousands of box/tif pairs that you are planning.

I hope one of the developers will reply with a more definitive response.


Tom De Costere

Nov 4, 2016, 5:37:35

to tesseract-ocr

Just to be sure, are the developers watching this Google Group or should I make a topic under the "tesseract-dev" group?

FYI: we passed the 5k mark for fonts this morning.

I'm thinking of telling the users to create box files only for documents containing terrible handwriting, since I'm seeing quite good recognition rates on new documents, even for handwriting that has not been trained yet.

On Thursday, November 3, 2016 at 17:53:51 UTC+1, shree wrote:

ShreeDevi Kumar

Nov 4, 2016, 8:21:56

to tesser...@googlegroups.com

Probably better to post on tesseract-dev, though there is no guarantee that the developers will reply.


Tom De Costere

Nov 8, 2016, 8:32:37

to tesseract-ocr

It seems my topic is not suitable for the dev forum (topic creation was refused).

I would sincerely appreciate it if anyone could bring this topic to the attention of the developers.

Thanks in advance!

Tom

On Friday, November 4, 2016 at 13:21:56 UTC+1, shree wrote:

ShreeDevi Kumar

Nov 8, 2016, 9:11:02

to tesser...@googlegroups.com

Tom,

Please see https://github.com/tesseract-ocr/tesseract/pull/466

I think the developers may want to focus on the merge of Google's private new LSTM codebase with the public github repo.

ShreeDevi
