
Using Tesseract on Fortran code from late 60's


Mixotricha

Feb 10, 2025, 5:29:29 AM
to tesseract-ocr

I have a question about using Tesseract to recover the source code from a printed listing that most likely came off a line printer in the early 70s, was probably first scanned by photocopier, and more recently by a modern digital scanner.

I have two copies of the document: one is the original scan, and the other was recently made for me by the archive department of the university that houses the document. Unfortunately, both have different problems!

Here are two sample images of the same content from the two different documents:

https://gitlab.com/mixotricha/d-17b-computer/-/blob/main/ocr_work/output-111.png?ref_type=heads

https://gitlab.com/mixotricha/d-17b-computer/-/blob/main/new_scan/output-107.png?ref_type=heads

Now, some things were in my favour. It is computer code, so I was able to guess much of it through human translation. It's a limited subset of the English language (the code is written in Fortran IV), and certain combinations are repeated over and over.

This is the start of my human translation here:

https://gitlab.com/mixotricha/d-17b-computer/-/blob/main/fortran.code?ref_type=heads

What is proving to be more of a problem is the numerical content of the document. The code uses lots of GOTO statements that jump to numbered labels in the leftmost column, and a similar arrangement of numbered FORMAT references for its input and output formatting.

To have any hope of recovering this code, I really need a way to recover the numerical information, most particularly the numbers in the leftmost column.

So I wonder what a good approach with Tesseract would be? I've tried to watch some tutorials and read the docs. I am comfortable using Python, but I am not entirely sure whether Tesseract fits this use case.
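For what it's worth, one way Tesseract is commonly steered toward a digits-only problem like this is to crop the label column and restrict the character set. This is only a sketch: the file name and the 60-pixel column width are assumptions about the scan geometry, and the whitelist option behaves best with Tesseract 4.1 or later.

```python
# Sketch: OCR only the leftmost (label) column, restricted to digits.
# The file name and 60-pixel column width below are assumptions about
# the real scan geometry, not values taken from the documents.
import re

def clean_labels(raw: str) -> list[str]:
    """Keep only plausible Fortran statement labels (1-5 digits)."""
    out = []
    for line in raw.splitlines():
        digits = re.sub(r"\D", "", line)   # drop OCR noise characters
        if 1 <= len(digits) <= 5:
            out.append(digits)
    return out

def ocr_label_column(path: str, col_width: int = 60) -> list[str]:
    # Imported here so clean_labels stays usable without Tesseract installed.
    from PIL import Image
    import pytesseract
    img = Image.open(path)
    strip = img.crop((0, 0, col_width, img.height))  # leftmost column only
    raw = pytesseract.image_to_string(
        strip,
        config="--psm 6 -c tessedit_char_whitelist=0123456789",
    )
    return clean_labels(raw)

# e.g. ocr_label_column("output-111.png")
```

`--psm 6` tells Tesseract to treat the strip as a single uniform block of text, which suits a column of labels better than the default page analysis.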

Months ago I started to build a project using Tesseract but got confused by the different versions available. This is what I thought I should do:

- First, build up an image set of as many different versions of the digits 0-9 as I can pick out of one of the documents.

- Put those into an image grid, then use that image set (with a script I will write in Python) to generate as much sample image data as possible, along with the matching text translation.

- Then this is where I start to draw a bit of a blank ... :) 
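The generation step in the plan above can be sketched without committing to a training pipeline: given digit crops cut from the scans, compose random label strings together with the list of glyph images needed to render them. The `<digit>_<variant>.png` naming scheme here is an invented convention, not anything from the documents.

```python
# Sketch of the sample-generation step: pair synthetic label strings
# with the glyph crops needed to render them. The naming scheme
# "<digit>_<variant>.png" is an assumption, not an existing convention.
import random

def compose_sample(glyphs: dict[str, list[str]], length: int,
                   rng: random.Random) -> tuple[str, list[str]]:
    """Return (ground-truth text, glyph files to paste left to right)."""
    text = "".join(rng.choice("0123456789") for _ in range(length))
    files = [rng.choice(glyphs[d]) for d in text]
    return text, files

# Three hypothetical crop variants per digit, e.g. "7_02.png".
glyphs = {d: [f"{d}_{v:02d}.png" for v in range(3)] for d in "0123456789"}
rng = random.Random(42)
text, files = compose_sample(glyphs, 5, rng)
# Every chosen file's name starts with the digit it depicts.
assert all(f[0] == c for c, f in zip(text, files))
```

Pasting the listed crops side by side into a training strip (with Pillow, say) is then a mechanical step, and the returned `text` is the matching ground truth.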

I'd be grateful for any suggestions as to what the best approach is!

Thanks

Graham Toal

Feb 10, 2025, 1:33:36 PM
to tesser...@googlegroups.com
I can't help with tesseract advice - when I wanted to do the same thing I found it easier to write a custom OCR for this specific problem from scratch.  It's very much an experiment and a work-in-progress (although I've not worked on it for about a year I'm afraid) but you might find something helpful from the discussion or the code: https://retrocomputingforum.com/t/custom-ocr-for-printer-listings/4016 and http://gtoal.com/src/OCR/

However, you *will* need to do better scans using a flatbed scanner if you still have access to the originals. Those scans are unusable: the pages in the recent one had not been laid flat, and it looks like they were taken with an overhead camera.

Graham

Mixotricha

Feb 11, 2025, 6:52:28 PM
to tesseract-ocr
Thanks, that is a really helpful link. Unfortunately I do not have much chance of getting better documents. The second scan came from a helpful archivist at an installation that requires a security clearance to enter; otherwise I would literally get on a plane and go look myself. I was gratified that they were as helpful as they were.

Really, the halting point in this translation is not the human words. It is the jump vectors (the GOTO statements), so now I am back to seeing if I can figure out some sort of relationship among the jump targets in the left-hand column. Unfortunately they do not match the line numbers on the right-hand side, but maybe I have just not figured out what that relationship might be. Basically, I am back to searching for context.

Some other things in my favour: the thesis itself is an excellent piece of work, really well explained, and it includes what are basically unit tests that are themselves quite legible. I feel getting this code back is right on the edge of possibility if I just think about it a bit more.

Graham Toal

Feb 11, 2025, 8:09:40 PM
to tesser...@googlegroups.com

I sympathise on the access problem - we submitted a bunch of listings and docs to our local museum for safe keeping and haven't seen it since.  I guess they're being kept very safe :-/

But don't give up hope on getting better access.  I was quite impressed that the folks working on restoring the Bloodhound at bmpg.org.uk were able to get access to the original Coral66 source code.  I myself managed to get the MOD's Defence Procurement Agency to give me permission to post the Coral 66 manual, just by asking via the contact page at the HMSO.  So you never know... sometimes these people can be surprisingly reasonable.

So my fixed-pitch stuff isn't going to help you. I have two other suggestions: 1) classic re-keying by 2 or 3 independent people (if 2, then someone has to go over the differences and explicitly make a selection; if 3, use a 2-out-of-3 consensus to pick the preferred version; neither is foolproof, but both considerably lower the error rate); and 2) there is some experimental dewarping software worth trying, such as https://mzucker.github.io/2016/08/15/page-dewarping.html, which might do better than the sort of software used in things like CZUR scanners, which assume a very specific model of a V-shaped spine between the pages of a book.
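The 2-out-of-3 consensus in suggestion 1 is easy to mechanise once the three transcriptions are line-aligned. A minimal sketch, assuming that alignment has already been done (on real re-keyed text the lines would first need aligning, e.g. with difflib):

```python
# Sketch: character-level 2-of-3 vote across three line-aligned
# transcriptions. Assumes the lines already correspond one-to-one;
# aligning independently keyed text is the hard part and is skipped here.
from collections import Counter

def vote_line(a: str, b: str, c: str, pad: str = " ") -> str:
    width = max(len(a), len(b), len(c))
    cols = zip(a.ljust(width, pad), b.ljust(width, pad), c.ljust(width, pad))
    out = []
    for chars in cols:
        best, n = Counter(chars).most_common(1)[0]
        # No majority: keep the first keyer's reading (flag it for review).
        out.append(best if n >= 2 else chars[0])
    return "".join(out).rstrip()

print(vote_line("GO TO 105", "GO TO 1O5", "G0 TO 105"))  # GO TO 105
```

In the example, each keyer makes a different O/0 slip, but no two make the same one, so the vote recovers the correct line.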

Looking at your hand-tidied source, I would expect that a custom Fortran parser could find a lot of corrections simply by keeping a name-and-frequency table of variables, to catch things like CCMREG vs COMREG and automatically suggest the preferred version. I found that a hacked-up parser for Algol 60 was extremely helpful at that sort of correction, leaving only a few minor errors to catch with a real compiler once the sources were clean enough to compile.
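The name-and-frequency idea can be sketched with nothing heavier than the standard library: rare identifiers get checked against frequent ones and a preferred spelling is suggested. The crude regex tokeniser and the frequency cut-off here are assumptions, not a real Fortran parser.

```python
# Sketch: flag rare identifiers that are one OCR slip away from a
# frequent one (CCMREG vs COMREG). The regex tokeniser and the
# frequency cut-off are assumptions, not a real Fortran parser.
import re
from collections import Counter
from difflib import get_close_matches

def suggest_fixes(source: str, min_count: int = 3) -> dict[str, str]:
    names = Counter(re.findall(r"[A-Z][A-Z0-9]{2,}", source))
    common = [n for n, c in names.items() if c >= min_count]
    fixes = {}
    for name, count in names.items():
        if count < min_count:
            match = get_close_matches(name, common, n=1, cutoff=0.8)
            if match:
                fixes[name] = match[0]
    return fixes

code = "COMREG = 1\n" * 5 + "CCMREG = COMREG + 1\n"
print(suggest_fixes(code))  # {'CCMREG': 'COMREG'}
```

Any suggested fix would still want a human eye: a genuinely rare variable that happens to resemble a common one must not be silently rewritten.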

Good luck with your project.

G

jesterjunk

Feb 12, 2025, 2:58:14 AM
to tesseract-ocr
Howdy Mixotricha,


I just happened upon your post and thought that I would share this playlist, as it is a deep dive into a lot of the complexities of OCR.

Preprocessing is a major factor in getting optimal OCR results, which is why I would point you in particular at video 04 below.

OCR in Python https://www.youtube.com/playlist?list=PLTejunv6WZfHQcHsNkHLtUN5beWZEHdez

  01  12:08  Introduction to OCR (OCR in Python Tutorials 01.01)
  02  11:14  How to Install the Libraries (OCR in Python Tutorials 01.02)
  03   7:46  How to Open an Image in Python with PIL (Pillow) (OCR in Python 02.01)
  04  53:24  How to Preprocess Images for Text OCR in Python (OCR in Python Tutorials 02.02)
  05   6:18  Introduction to PyTesseract (OCR in Python Tutorials 02.03)
  06   5:37  How to OCR an Index in Python with PyTesseract (OCR in Python Tutorials 03.01)
  07  18:27  How to use Bounding Boxes with OpenCV (OCR in Python Tutorials 03.02)
  08  12:58  How to Create a List of Named Entities from an Index with OpenCV (OCR in Python Tutorials 03.03)
  09  15:48  How to OCR a Text with Marginalia by Extracting the Body (OCR in Python Tutorials 04.01)
  10   7:14  How to Separate a Footnote from Body Text in Python with OpenCV


Remember to Breathe,
jesterjunk

Tom Morris

Mar 17, 2025, 5:41:56 PM
to tesseract-ocr
 Mixotricha wrote on a separate thread:
I had a thought that the vectors will probably be reasonably sized units: 5, 10, 15 and so on. If I was writing this Fortran, that is probably what I would do. And then if I came back I might add smaller units between. Context helps.

Yes, this was standard practice for assigning labels, because having to relabel things was such an enormous pain. The initial labels were chosen with a big enough delta to allow space to insert additional labels when needed, so you might start with 10, 20, 30, then add 25 when needed, then 22, always hoping that you're not going to need two additional labels between 20 and 22.
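That practice also gives a usable prior when an OCR'd label is ambiguous: among candidate readings, round numbers are the more likely originals. A toy scorer under exactly that assumption (the weights are pure guesswork, and an unambiguous reading should of course win outright):

```python
# Sketch: prefer "round" statement labels when OCR offers several
# candidate readings of the same digits. The weights reflect the
# labelling practice described above and are pure assumption.
def roundness(label: int) -> int:
    if label % 10 == 0:
        return 2          # 10, 20, 30, ... were assigned first
    if label % 5 == 0:
        return 1          # 25, 35, ... typically inserted later
    return 0

def pick_label(candidates: list[int]) -> int:
    # Highest roundness wins; ties keep the first (best-OCR) candidate.
    return max(candidates, key=roundness)

print(pick_label([706, 700, 708]))  # 700
```

Here 706/700/708 might be three readings of the same smudged label; the heuristic picks 700 as the most plausible original.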

Tom

Dhvani Gajjar

Mar 18, 2025, 2:10:26 AM
to tesser...@googlegroups.com

Sure

