Improving results based on manually transcribed pages


ThisIsTheTruthUncut

Sep 3, 2021, 6:33:47 AM
to tesseract-ocr
Hi all

I have approximately 13 variations of the same book to run OCR on.

I have done the first variation and manually corrected all the errors, but at nearly 600 pages for the first book alone, this process has taken far too long.

The font is the same throughout the variations, so I'd like to know how I can start from the current English traineddata files and use my scans plus corrected text files to improve accuracy.

This is the kind of image quality I have manually checked and want to use to improve accuracy:

Source.png

And the subsequent images I need to OCR are mostly like this:
Source2.png
Note there are slight differences in the text between variations, for example "Record" versus "record"; finding those differences is the purpose of my project.

Can anyone recommend an article or video on training for a new font? Or perhaps someone might be willing to help me with this for payment?

Thanks




Max Richey

Sep 4, 2021, 5:16:03 AM
to tesseract-ocr
Like you, I found that most of the online help about training for a single font was no help at all.

So, please know that it is easy to go even further down the rabbit hole than you are now.  

I have successfully trained a Tesseract 5.0 LSTM (not to be confused with LDS, lol) language model on a single font. But I did it from scratch, because the font and manuscript were so unique and included many ligatures. I also had to develop the font first, in three scripts (Greek, Hebrew, Latin). And I am just now scanning and correcting the source images.

Along with learning to code in Python to prepare the images, I used the repository here:


You may be in a good place to use this method, too, except that you would need to designate a pre-existing model to start training from, which I'm sure you have.
You also need a good quad-core computer with Tesseract and Python installed. Any remaining prerequisites are listed on the repo site.
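
If the repo you end up with is tesseract-ocr/tesstrain, which follows this same ground-truth-lines approach, fine-tuning from the stock English model is a single make call. To be clear, the model name and paths here are made-up placeholders, not anything from your setup:

  make training \
    MODEL_NAME=mybook \
    START_MODEL=eng \
    TESSDATA=/path/to/tessdata_best \
    GROUND_TRUTH_DIR=data/mybook-ground-truth \
    MAX_ITERATIONS=10000

START_MODEL=eng is what makes this fine-tuning instead of training from scratch, and TESSDATA has to point at a directory containing eng.traineddata from the tessdata_best models, since LSTM training needs the "best" (float) versions.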

I would have to research it to know for sure if this would work in your case.  And I am up to my neck in my own project. 
I have, however, developed some Python tools that might be helpful: they split the image and text files into lines with matching names (there is a rough sketch of the idea below).
But there are many ways to skin those two cats, and you probably won't have to learn to code as I did. There may also be others who will post more insights.
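
To give you the flavor, here is a minimal sketch of that line-splitting idea, not my actual tool. It assumes pytesseract and Pillow are installed, and that your corrected transcript has exactly one text line per printed line on the page; all the file names in it are made up:

import sys
from pathlib import Path

import pytesseract
from pytesseract import Output
from PIL import Image

def split_page(image_path, transcript_path, out_dir):
    img = Image.open(image_path)
    text_lines = [ln for ln in Path(transcript_path).read_text(encoding="utf-8").splitlines()
                  if ln.strip()]

    # image_to_data reports one row per layout element; level 4 rows are
    # whole text lines, which is exactly the crop we want.
    data = pytesseract.image_to_data(img, output_type=Output.DICT)
    boxes = [(data["left"][i], data["top"][i], data["width"][i], data["height"][i])
             for i in range(len(data["level"])) if data["level"][i] == 4]

    if len(boxes) != len(text_lines):
        sys.exit(f"{image_path}: {len(boxes)} detected lines vs "
                 f"{len(text_lines)} transcript lines -- match them by hand")

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    stem = Path(image_path).stem
    for n, ((x, y, w, h), text) in enumerate(zip(boxes, text_lines)):
        # The shared stem is what pairs each .tif with its .gt.txt.
        img.crop((x, y, x + w, y + h)).save(out / f"{stem}_{n:03d}.tif")
        (out / f"{stem}_{n:03d}.gt.txt").write_text(text + "\n", encoding="utf-8")

split_page("page_001.png", "page_001.txt", "ground-truth")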

Training is accomplished by providing ground-truth .tif files, each one a cropped, individual line from the source image.
Each must be paired with a one-line text file; the filename matching is critical. And there must be a lot of them.
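
For example, with tesstrain's naming convention (assuming that is the toolchain you land on), the pair for line 12 of page 34 would be:

  page_034_line_012.tif
  page_034_line_012.gt.txt

where the .gt.txt file contains that line's corrected text and nothing else.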

More is better.
I started with 440 lines of ground truth at 20,000 iterations (several hours of training).
Using the scanned images I now have, I am still re-training at 250K iterations (2 days) with 7 times that number of lines.
I have achieved a character accuracy of 99.3% and a word accuracy of 96.5%.  Case sensitivity is very good, too.

It was worth the effort, but like you, I must still manually correct the OCR results.
Now, though, I am only correcting about 2-4% of the text instead of 100%, as you had to.

The rough stuff? I am 5 years into it, and I still have a lot of hope that the effort will be historic. I can't stop now.
Perhaps you won't, either. Hang in there. Remember, it's hard to kick against the goads.

Maybe this will help you.

Max