Training Fonts, mftraining hangs

John Alway

Aug 30, 2022, 10:18:13 PM

to tesseract-ocr

Hello,

I've been following a tutorial on youtube titled "Tesseract OCR - Lesson 2: Training Tesseract for new font" here:

https://www.youtube.com/watch?v=1v8BPw0Dn0I&ab_channel=TheCode

I'm using tesseract 4.0 on Windows 10.

I went through the steps he used, and everything seemed to go smoothly until I got to the actual training. When I run "mftraining", the program hangs. It seems to get stuck and doesn't indicate why or what it's doing.

I'm using a set of characters in an image: the full alphabet in upper and lower case and the digits 0 to 9, all on one PNG image, which I've attached. Unlike him, I'm using English. I don't know the font, so I'm just calling it tiktok to give it a name. My training file is called eng.tiktok.exp0.tr

I used jTessBoxEditor to correct mistakes and set the box sizes and positions precisely.

When I run this command:

mftraining -F font_properties -U unicharset -O eng.unicharset eng.tiktok.exp0.tr

The program just hangs. I've waited over twenty minutes.

Should I wait longer? What could cause it to hang?
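One thing that can be ruled out before waiting longer: the legacy training documentation describes font_properties as one line per font (the font name followed by five 0/1 flags for italic, bold, fixed, serif and fraktur), and the name has to match the font part of the .tr file name, here tiktok. As far as that documentation goes, a mismatch makes mftraining abort rather than hang, so this is a sanity check and not a diagnosis. A minimal Python sketch of the check; the function names are made up for illustration:

```python
def font_name_from_tr(tr_filename):
    """Return the font part of a '<lang>.<font>.exp<N>.tr' file name."""
    parts = tr_filename.split(".")
    if len(parts) != 4 or parts[3] != "tr" or not parts[2].startswith("exp"):
        raise ValueError(f"unexpected training file name: {tr_filename!r}")
    return parts[1]


def check_font_properties(tr_filename, font_properties_text):
    """Return a list of problems; an empty list means the entry looks fine."""
    font = font_name_from_tr(tr_filename)
    problems = []
    fonts_seen = set()
    for lineno, line in enumerate(font_properties_text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue  # ignore blank lines
        # Expected shape: <fontname> <italic> <bold> <fixed> <serif> <fraktur>
        if len(fields) != 6 or any(f not in ("0", "1") for f in fields[1:]):
            problems.append(f"line {lineno}: malformed entry {line!r}")
            continue
        fonts_seen.add(fields[0])
    if font not in fonts_seen:
        problems.append(f"no entry for font {font!r}")
    return problems


if __name__ == "__main__":
    # Example contents for a plain, non-italic, non-bold font (an assumption).
    print(check_font_properties("eng.tiktok.exp0.tr", "tiktok 0 0 0 0 0\n"))
```

An empty list only says the file is well-formed for this font; it does not prove that the box/tr data itself is sound.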

Thanks!

...John

alphabet-numbs.png

Zdenko Podobny

Aug 31, 2022, 8:27:14 AM

to tesser...@googlegroups.com

First of all: if you follow any tutorial on internet - report the problem to the author of the tutorial.

Next: use official documentation for training. I see there are a bunch of folks just "generating content" - to gain an audience. Without insight and therefore also without support, using old/outdated information...

Tesseract 4 was released 29 Oct 2018. Almost 4 years ago! The latest tesseract version is 5.2 and the training process was also improved: https://github.com/tesseract-ocr/tesstrain

Zdenko

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/534c3f74-420b-4c96-83dd-609bcb002f81n%40googlegroups.com.

John Alway

Aug 31, 2022, 1:40:29 PM

to tesser...@googlegroups.com

"First of all: if you follow any tutorial on internet - report the problem to the author of the tutorial.

"Next: use official documentation for training. I see there are a bunch of folks just "generating content" - to gain an audience. Without insight and therefore also without support, using old/outdated information..."

People are trying to find a nice, easy tutorial to help them get through the forest. I think that's the bottom line. Thanks for the link.

"Tesseract 4 was released 29 Oct 2018. Almost 4 years ago! The latest tesseract version is 5.2 and the training process was also improved: https://github.com/tesseract-ocr/tesstrain"

I understand this, but I'm using C# .Net, and I don't think version 5 is available in C#. Unless I'm mistaken? There are costly packages, such as IronOcr which uses tesseract 5, but there is no way I can take that route.

Regards,

...John

Adrian Paul Ciobanita

Aug 31, 2022, 1:57:50 PM

to tesser...@googlegroups.com

I don't think the GitHub link is very helpful, tbh. I've had this issue with training something particular for my case since 2020. I've not had much time lately, but there's still no clean and easy tutorial on retraining that correctly describes how to create and use the ground truth files with the jTessBoxEditor boxes, and which one is which. The documentation is written by people who are close to the tool and know the ins and outs of it.

With this context in mind, it's no wonder there are so many questions and requests for help on "how to retrain/fine tune".

Thank you for taking the time to respond. I am more than happy to write such a tutorial/example, but it's hard even for me to do it and understand it, let alone have the knowledge to pull that off. If someone would be interested in showing me the "starting" steps, as explained earlier (creating the ground truth, unpacking the trained data, repackaging it, correctly using the bounding boxes), I'm willing to help out others and answer any questions they might have.

Thank you for your contribution and help, genuinely!

Zdenko Podobny

Aug 31, 2022, 3:04:55 PM

to tesser...@googlegroups.com

Trained data from tesseract 5 are compatible with 4, so definitely I would suggest using the latest tesseract version for training - there was a lot of bug fixing and speed improvements.

IMO tesseract training has never been easy. I always suggest focusing on image preprocessing rather than training.

Following an easy-looking training tutorial could also mean you are on the wrong path (=> you waste your time and it increases your frustration). For example: Tesseract 4 has 2 engines, LSTM and the legacy engine. In this particular video, which engine is that training for? For hints see [1]. Did you plan to train that engine or the other one?

[1] https://tesseract-ocr.github.io/tessdoc/tess5/TrainingTesseract-5.html#overview-of-training-process
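For the command in the original post, the answer can be read off the tool name: mftraining and cntraining belong to the legacy engine's training pipeline, while the LSTM engine is trained with lstmtraining. A small Python sketch of that lookup; the helper name is hypothetical and the tool sets are illustrative, not exhaustive:

```python
# Training tools grouped by the engine they train (illustrative subset).
LEGACY_TOOLS = {"mftraining", "cntraining", "shapeclustering"}
LSTM_TOOLS = {"lstmtraining", "lstmeval"}


def engine_for_command(command_line):
    """Guess which engine a training command targets from its tool name."""
    parts = command_line.split()
    if not parts:
        return "unknown"
    # Drop any directory prefix and a Windows ".exe" suffix.
    tool = parts[0].replace("\\", "/").rsplit("/", 1)[-1].lower()
    if tool.endswith(".exe"):
        tool = tool[: -len(".exe")]
    if tool in LEGACY_TOOLS:
        return "legacy"
    if tool in LSTM_TOOLS:
        return "lstm"
    return "unknown"


if __name__ == "__main__":
    cmd = "mftraining -F font_properties -U unicharset -O eng.unicharset eng.tiktok.exp0.tr"
    print(engine_for_command(cmd))
```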

Zdenko

Adrian Paul Ciobanita

Aug 31, 2022, 3:10:31 PM

to tesser...@googlegroups.com

Can you recommend tutorials or books about how to do image pre-processing effectively and efficiently?

Do we need to do different types of image pre-processing for each image? If we have 100+ images, how do we ensure that the pre-processing is helping the prediction accuracy 100%?

Zdenko Podobny

Aug 31, 2022, 3:21:13 PM

to tesser...@googlegroups.com

Shreeshrii, bertky and many others from the tesseract community invested a lot of time to improve training and documentation (e.g. tesstrain.sh was abandoned and replaced with python training). This is a community project, so any improvements (code, documentation) are welcome. We try to collect and keep the best information in our GitHub repositories.

IMO training requires understanding of the OCR process and the training process (e.g. why do I need to run training?). For example, training on images like alphabet-numbs.png is useless, and it is quite common that users get worse results after retraining than with the standard trained data from the tesseract repository.

Zdenko

Zdenko Podobny

Aug 31, 2022, 3:52:56 PM

to tesser...@googlegroups.com

There is no such thing as 100% OCR accuracy. You simply cannot get good results from a bad image (maybe Google Vision is close ;-), but that is a different story).

Our best experiences are collected in the docs (https://tesseract-ocr.github.io/tessdoc/ImproveQuality.html).

For different images/problems you need different solutions. E.g. in the case of historical documents you will need to focus on thresholding, in the case of natural scenes on text detection, in the case of invoices, document layout processing...
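To make the thresholding case concrete: Otsu's method is a common global thresholding technique that picks the gray level maximizing the variance between the dark and the light class. The sketch below is plain Python on a flat list of 0-255 gray values, chosen here for illustration only; a real pipeline would use an image library and may need an adaptive method for unevenly lit scans.

```python
def otsu_threshold(pixels):
    """Return the gray level t that best separates pixels <= t from pixels > t."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels in the dark class so far
    sum0 = 0    # sum of gray values in the dark class so far
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue            # no dark pixels yet
        w1 = total - w0
        if w1 == 0:
            break               # every pixel is already in the dark class
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        between_var = w0 * w1 * (mean0 - mean1) ** 2
        if between_var > best_var:
            best_var, best_t = between_var, t
    return best_t


def binarize(pixels, threshold):
    """Map each gray value to 0 (dark) or 255 (light)."""
    return [255 if p > threshold else 0 for p in pixels]


if __name__ == "__main__":
    sample = [10] * 50 + [200] * 50   # synthetic two-tone "image"
    t = otsu_threshold(sample)
    print(t, binarize(sample, t)[:3], binarize(sample, t)[-3:])
```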

There is much research on this; some older papers are available on academia.edu, e.g.:

https://www.academia.edu/4790395/PhotoOCR_Reading_Text_in_Uncontrolled_Conditions

https://www.academia.edu/2793675/End_to_end_scene_text_recognition

https://www.academia.edu/6030087/Tex_Binarization_In_Color_Documents

https://www.academia.edu/39052965/OCR_to_read_embossed_text_from_Credit_Debit_card

https://www.academia.edu/1171645/A_variational_approach_to_degraded_document_enhancement

https://www.academia.edu/19957320/Multi_spectral_document_image_binarization_using_image_fusion_and_background_subtraction_techniques

https://www.academia.edu/1171639/Low_quality_document_image_modeling_and_enhancement

Zdenko

Adrian Paul Ciobanita

Sep 1, 2022, 3:00:06 AM

to tesser...@googlegroups.com

Thank you for the links and knowledge!

It definitely makes a good read, and a fine introduction to missing pieces.

Stay safe and healthy,

Ciobanita Paul Adrian.

~ SATCOM Sr. Systems Engineer / DevOps engineer / Test Engineer ~
Skype: adrian_iss_consult

John Alway

Sep 1, 2022, 5:13:40 AM

to tesser...@googlegroups.com

Hello Zdenko,

Thank you for the advice. I ended up being able to tweak the tesseract parameters and was able to improve performance so that it was good enough without having to train.

And I do appreciate the hard work and cleverness that has gone into creating and improving tesseract. It's a beautiful piece of work.

Regards,

...John

Adrian Paul Ciobanita

Sep 1, 2022, 6:02:20 AM

to tesser...@googlegroups.com

Hey John,

Are you able to share which parameters you tweaked to get better performance?

Thank you

John Alway

Sep 1, 2022, 5:36:33 PM

to tesser...@googlegroups.com

Hello Adrian,

I can try. I'm using C# .Net, btw. Tesseract 4.1.1, which is the latest for this. I do think my settings are very specific to my purpose, so it may be of no benefit to you.

I tried several different settings and most of them didn't work. So, I experimented.

Here are the things I tried. The things commented out with "//" are things that I tried and didn't work for me. I'm trying to extract text from messages of a specific font type from an image. The messages have several lines of alphanumeric text.

Here is the code section where I did most of my experimenting:

var page = _engine.Process(img, PageSegMode.Auto);
//var page = _engine.Process(img, PageSegMode.AutoOnly); // Performs okay, but still no A or B

//var page = _engine.Process(img, PageSegMode.AutoOsd); // Performs okay, but still no A or B
//var page = _engine.Process(img, PageSegMode.RawLine); // terrible performance

//var page = _engine.Process(img, PageSegMode.SingleColumn);
//var page = _engine.Process(img, PageSegMode.SparseText);
//var page = _engine.Process(img, PageSegMode.SingleBlockVertText); // terrible performance
_engine.SetVariable("tessedit_char_whitelist", " abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ");
//_engine.SetVariable("tessedit_char_whitelist", "AB");

string result = page.GetText();

You can see from above that I settled on the following settings:

var page = _engine.Process(img, PageSegMode.Auto);

_engine.SetVariable("tessedit_char_whitelist", " abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ");

string result = page.GetText();
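For anyone not using the .NET wrapper, roughly the same settings can be expressed on the tesseract command line. This is a sketch under assumptions: PageSegMode.Auto is taken to correspond to --psm 3, tesseract is assumed to be on PATH, and the helper names are invented. Passing the arguments as a list avoids shell quoting problems with the leading space in the whitelist.

```python
import string
import subprocess

# Same characters as the whitelist above: a space plus a-z and A-Z.
WHITELIST = " " + string.ascii_lowercase + string.ascii_uppercase


def build_tesseract_command(image_path, whitelist=WHITELIST, psm=3):
    """Build the argument list for a tesseract CLI call that prints text to stdout."""
    return [
        "tesseract", image_path, "stdout",
        "--psm", str(psm),                          # 3 = automatic page segmentation
        "-c", f"tessedit_char_whitelist={whitelist}",
    ]


def run_ocr(image_path):
    """Run tesseract and return the recognized text (needs tesseract installed)."""
    result = subprocess.run(
        build_tesseract_command(image_path),
        capture_output=True, text=True, check=True,
    )
    return result.stdout


if __name__ == "__main__":
    # "screenshot.png" is a placeholder file name.
    print(build_tesseract_command("screenshot.png"))
```

Whether a character whitelist is honored can depend on the tesseract version and engine in use, so it is worth comparing the output with and without it.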

The images I use are from screenshots. That might not help. Hope it does!

Regards,

...John

Jaspreet Kaur

Sep 3, 2022, 2:26:03 AM

to tesser...@googlegroups.com

Preprocessing helps when your images are not clear, because it enhances the image quality. You can also work with the box files by correcting the boxes and putting the right characters into them.

Adrian Paul Ciobanita

Sep 3, 2022, 6:25:43 AM

to tesser...@googlegroups.com

Hello Jaspreet,

Do you know of resources / documentation that explain step by step how to correctly use those box files together with ground truth files? I know jTessBoxEditor can help out with that, but I have never been able to use those box files correctly afterwards. I cannot find a good article that explains how to generate the ground truth files either.

PS: I know how to generate the box files and how to change them correctly. Last time I did it, I was doing it per letter, rather than per word.

Thanks for feedback.
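For the tesstrain workflow linked earlier in the thread, the README describes ground truth as pairs of files in data/MODEL_NAME-ground-truth: one line image (.tif, .png, .bin.png or .nrm.png) and one plain-text transcription with the same base name and the extension .gt.txt. That is line-level ground truth, not per-letter boxes. A minimal Python sketch that lists line images still missing a transcription; the function names are hypothetical:

```python
from pathlib import Path

# Longer suffixes first so "x.bin.png" is not treated as base name "x.bin".
IMAGE_SUFFIXES = (".bin.png", ".nrm.png", ".png", ".tif")


def transcription_path(image):
    """Return the .gt.txt path that should accompany a line image."""
    image = Path(image)
    for suffix in IMAGE_SUFFIXES:
        if image.name.endswith(suffix):
            return image.with_name(image.name[: -len(suffix)] + ".gt.txt")
    raise ValueError(f"not a recognized line image: {image.name!r}")


def missing_transcriptions(ground_truth_dir):
    """List image file names in the directory that have no matching .gt.txt."""
    missing = []
    for entry in sorted(Path(ground_truth_dir).iterdir()):
        if entry.name.endswith(IMAGE_SUFFIXES) and not transcription_path(entry).is_file():
            missing.append(entry.name)
    return missing


if __name__ == "__main__":
    # "data/tiktok-ground-truth" is a placeholder following the README's naming.
    folder = Path("data/tiktok-ground-truth")
    if folder.is_dir():
        print(missing_transcriptions(folder))
```

Each .gt.txt file holds the exact text of its line image on a single line.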
