Issue with Fine-Tuning eng.traineddata on Large Dataset: Negative Mean RMS Error


Ilyas

Jan 30, 2024, 11:13:06 AM
to tesseract-ocr
Hello everyone, 

I've been successfully fine-tuning the eng.traineddata model with smaller datasets, but when I try to scale up to a larger dataset to include a more diverse range of documents, I encounter an unusual error. The training process starts, but it immediately reports a negative Mean RMS error, which seems to be an anomaly.

Environment
Tesseract Version: 4.1.3
Platform: Ubuntu 20.04

I run the following command for fine-tuning:
lstmtraining --debug_interval 0 \
  --traineddata tesstrain/data/experiments/5PX1000D_rs42/model_eng_psm7_mi100000_5PX1000D_rs42/model_eng_psm7_mi100000_5PX1000D_rs42.traineddata \
  --old_traineddata tesstrain/src/tessdata_best/eng.traineddata \
  --continue_from tesstrain/data/experiments/5PX1000D_rs42/model_eng_psm7_mi100000_5PX1000D_rs42/model_eng_psm7_mi100000_5PX1000D_rs42.lstm \
  --model_output tesstrain/data/experiments/5PX1000D_rs42/model_eng_psm7_mi100000_5PX1000D_rs42/checkpoints/model_eng_psm7_mi100000_5PX1000D_rs42 \
  --train_listfile tesstrain/data/experiments/5PX1000D_rs42/list.train \
  --eval_listfile tesstrain/data/experiments/5PX1000D_rs42/list.eval \
  --max_iterations 100000 \
  --target_error_rate 0.01

The output I'm wondering about is:
At iteration 1/600/600, Mean rms=-2147483.6%, delta=0.033%, char train=275.696%, word train=100%, skip ratio=0%,  New worst char error = 275.696 wrote checkpoint.

I expected the training process to proceed normally, with the Mean RMS error showing sensible values as it does when training on smaller datasets. With around 100k lstmf files this behaviour doesn't occur, but with 400k it does.

Am I looking in the wrong direction, or am I missing something?
I tried to look for something similar in the groups and discussions but couldn't find anything.
Thanks

Ger Hobbelt

Jan 30, 2024, 7:47:50 PM
to tesseract-ocr
On Tue, 30 Jan 2024, 17:13 Ilyas, <ilyas.o...@gmail.com> wrote:


The output I'm wondering about is:
At iteration 1/600/600, Mean rms=-2147483.6%,

I don't know why or what is causing this; I just notice the value is quite remarkable: it looks like INT32_MIN got fed into some per-mille/percentage calculation for the rms value there.

(From Google, for those who don't recognize the 2^31 value right away:


INT_MIN is a macro that specifies that an integer variable cannot store any value below this limit. It represents the minimum value (lower limit) of the integer data type. The value of INT_MIN is: INT_MIN = -2147483648 (for 32-bit integers), INT_MIN = -9,223,372,036,854,775,808 (for 64-bit integers).)


The rms value is clearly this value divided by 1000. Without having had a look at the source code, I'd say that might happen if some code path produced an error, or hard-clipped a larger negative value to the limit of int32_t.
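You can verify the arithmetic quickly (a throwaway check, nothing Tesseract-specific):

```shell
# INT32_MIN divided by 1000 reproduces the logged value exactly:
awk 'BEGIN { printf "Mean rms=%.1f%%\n", -2147483648 / 1000 }'
# prints: Mean rms=-2147483.6%
```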

Someone will need to run this in a debugger to find the culprit. If you can do this, then that would help trace back the source of this weirdness.

Sorry I can't be of much more help. The 2^31 value (or rather, the 7 most significant digits thereof) jumped out at me.

Smells like a potential bug somewhere...


Regards,
Ger



Tom Morris

Jan 31, 2024, 12:59:03 PM
to tesseract-ocr
On Tuesday, January 30, 2024 at 11:13:06 AM UTC-5 Ilyas wrote:

The output I'm wondering about is :
At iteration 1/600/600, Mean rms=-2147483.6%, delta=0.033%, char train=275.696%, word train=100%, skip ratio=0%,  New worst char error = 275.696 wrote checkpoint.

I expected the training process to proceed normally, with the Mean RMS error showing sensible values as it does when training on smaller datasets. With around 100k lstmf files this behaviour doesn't occur, but with 400k it does.

Am I looking in the wrong direction, or am I missing something?

As Ger pointed out, the underflow is likely the symptom of a bug, but no one is likely to be able to help much without a much smaller reproducer.

The first thing I'd try would be to eliminate possible bad data in the 300K new files as a source of the error. Can you run 100K chunks of the added files separately without any error?

If that works, I'd try to figure out the upper limit that works - 200K? 300K? 350K? Perhaps you'll find an upper bound that's high enough for your use case and you can avoid the hard work of tracking down the bug.
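The chunking above can be sketched with coreutils `split` (a hypothetical sketch: the demo generates stand-in .lstmf paths rather than touching real data; substitute your actual list.train and use `-l 100000` for real 100k chunks):

```shell
# Generate a stand-in listfile (400 fake .lstmf paths) for the demo.
seq 1 400 | sed 's#.*#tesstrain/data/ground-truth/line_&.lstmf#' > list.train.demo

# Split into fixed-size chunks; each chunk file can then be passed
# as --train_listfile to a separate lstmtraining run.
split -l 100 list.train.demo chunk_

wc -l chunk_*   # four chunks of 100 lines each: chunk_aa .. chunk_ad
```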

There's unlikely to be any easy way to figure out what's going on.

Tom

Ilyas

Feb 6, 2024, 7:13:21 AM
to tesseract-ocr
Hi Ger,

Thank you so much for your insights. The connection to INT32_MIN and the potential for an underflow or overflow error does seem plausible, especially given the remarkable error value that aligns with INT32_MIN/1000. I haven't delved into debugging at that level yet, but your suggestion provides a solid starting point for further investigation. 

Best regards,
Ilyas

Ilyas

Feb 6, 2024, 7:14:43 AM
to tesseract-ocr
Hi Tom,

I appreciate your suggestions. I hadn't considered that bad data in the larger dataset might be contributing to this error. I'll start there to identify whether any specific subset is causing the problem, and I'll report back with what I find from these tests.

Thank you for pointing me in this direction.

Best,
Ilyas