90/30 Update

sboy...@gmail.com

Feb 20, 2022, 12:15:58 PM

to Univac Emulators

Well, we haven't been able to find a BEM tape and I got curious about how BEM actually did some of the things that it did, like displaying the status of currently executing jobs, displaying a list of mounted volumes, listing the VTOC of a volume, etc.

So, I started a project to recreate some of BEM's functionality so that I would have it available when I needed it. So far, I have recreated the monitor and the following commands:

/LOGON

/LOGOFF

/STATUS

/DISPLAY

/VTOC

The attached screen recording shows some features of these commands in action. First I log on, then I show the status of the current terminal session, then I display running jobs, the mounted volumes, and the VTOC of SYSRES (REL042).

For those of you who remember BEM or even Interactive Services, this should look very familiar.

Steve B

U200 Emulator _ Telnet 2022-02-20 12-06-50.mp4

Charlie Gibbs

Feb 20, 2022, 12:36:34 PM

to sboy...@gmail.com, Univac Emulators

Wow. Impressive. One of these days you're going to shame me into
digging out my box of assembly listings for all the utilities I wrote,
and try to scan them. I've fiddled with pdftotext a bit, and it seems
to work well enough that I might be able to build some source files.
Heck, maybe you could assemble my own assembler, which I got to the
point where it could do sysgens and is (IMNSHO) much nicer.

--
cgi...@surfnaked.ca (Charlie Gibbs)

Stephen Boyd

Feb 20, 2022, 1:58:18 PM

to Charlie Gibbs, Univac Emulators

The hardest part of getting this stuff working is stupid finger mistakes
(and crappy documentation). I'll bet you remember how long it takes to
track down something like typing LH instead of LA or SH instead of STH.
Looks just fine when you are looking at the assembly listing but doesn't
work for shit! :)

I'll bet you would recognize the things that I did to implement the
monitor because you would have had to do the same things for your IMS
simulator. The BEM commands are even programmed like IMS action
programs. Probably nothing like the original but it works.

I would love to see your old utilities working again. I'm not sure how
well OCR will work on source code. I tried it on some of the old
494/1290 stuff and got mostly garbage. But I was working with low
quality scans.

Steve B

Charlie Gibbs

Feb 20, 2022, 7:29:48 PM

to Stephen Boyd, Univac Emulators

On 2022-02-20 10:58 a.m., Stephen Boyd wrote:

> The hardest part of getting this stuff working is stupid finger mistakes
> (and crappy documentation). I'll bet you remember how long it takes to
> track down something like typing LH instead of LA or SH instead of STH.
> Looks just fine when you are looking at the assembly listing but doesn't
> work for shit! :)

My favourite was the time I typed DS instead of DC. I couldn't figure
out why that damned variable was getting clobbered. Since I was just
getting my own assembler working at the time, I added code to issue a
warning if a value was given on a DS. I never made that mistake
again... :-)
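
A rough sketch of that kind of check, written here in Python as a stand-alone lint pass and not taken from Charlie's assembler. It assumes the usual column layout (label in column 1, then operation, then operand) and treats a quote in the operand of a DS as a sign that a value was supplied. The name ds_value_warnings is made up for illustration.

```python
def ds_value_warnings(source):
    """Return (line number, text) for every DS statement that carries a value."""
    warnings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        if not line.strip() or line.startswith("*"):
            continue  # blank line or full-line comment
        fields = line.split()
        # A label, if present, starts in column 1; otherwise the first
        # field on the line is the operation.
        op_index = 0 if line[0] == " " else 1
        if len(fields) <= op_index + 1:
            continue  # no operand field at all
        operation, operand = fields[op_index], fields[op_index + 1]
        # Assumption: a quote in the operand means a nominal value was
        # written, e.g. DS F'0' where DC F'0' was probably intended.
        if operation == "DS" and "'" in operand:
            warnings.append((lineno, line.rstrip()))
    return warnings
```

Fed a source file, it reports lines such as `COUNT DS F'0'` and stays quiet about `BUFFER DS CL80`.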

> I'll bet you would recognize the things that I did to implement the
> monitor because you would have had to do the same things for your IMS
> simulator. The BEM commands are even programmed like IMS action
> programs. Probably nothing like the original but it works.
>
> I would love to see your old utilities working again. I'm not sure how
> well OCR will work on source code. I tried it on some of the old
> 494/1290 stuff and got mostly garbage. But I was working with low
> quality scans.

I'll dig the listings out next chance I get. Initial results with
pdftotext were sufficiently encouraging that I might be able to get them
within hand-editing distance.

--
cgi...@surfnaked.ca (Charlie Gibbs)

Sebastian Rasmussen

Feb 20, 2022, 10:17:39 PM

to Stephen Boyd, Charlie Gibbs, Univac Emulators

Hi!

> I'm not sure how well OCR will work on source code. I tried it on some of the old
> 494/1290 stuff and got mostly garbage. But I was working with low quality scans.

I got into vintage computer emulators because the group working on old
UNIX needed transcription of source code. I would be happy to help out
transcribing things here too if you need it..? :)

/ Sebastian

Stephen Boyd

Feb 21, 2022, 10:11:47 AM

to Sebastian Rasmussen, Charlie Gibbs, Univac Emulators

Help is always appreciated. Maybe you and Charlie can talk when he gets
the listings scanned.

Charlie Gibbs

Feb 23, 2022, 1:31:52 PM

to Sebastian Rasmussen, Stephen Boyd, Univac Emulators

Here's an assembly listing and test run of a little utility I wrote to
scan low memory for a given data string. I'd love to be able to OCR it
and get a source code file out of it. I tried playing with ocrmypdf but
got mostly garbage. If anyone has any experience OCRing old mainframe
listings I'd love to get some pointers. I have a lot of stuff to scan
so if I can get a Linux utility running here it would make life a lot
easier.

P.S. If the assembly listing looks a little different from what you're
used to, it's because it's from my own assembler, which IMHO is much
nicer than the stock assembler.

--
cgi...@surfnaked.ca (Charlie Gibbs)

mscan.pdf

Stephen Boyd

Feb 23, 2022, 2:38:10 PM

to Charlie Gibbs, Sebastian Rasmussen, Univac Emulators

Attached is a copy of the text that I got using Nuance PaperPort to OCR
mscan.pdf. The result is poorly formatted but the interesting bits don't
look too badly garbled. It would be possible to extract the source code
via a little creative cutting and pasting.

This is not ideal but it would be OK for small projects.

I have no idea where I got Nuance from. It must have come with one of my
scanners or pre-installed on my PC.

A quick look at your code and it seems as though you knew about the
negative addressing trick back in the day because that is what you seem
to be using here.

Nice looking assembly listing BTW!

Steve B

mscan.txt

Charlie Gibbs

Feb 23, 2022, 4:51:50 PM

to Stephen Boyd, Sebastian Rasmussen, Univac Emulators

On 2022-02-23 11:38 a.m., Stephen Boyd wrote:

> On 2/23/22 1:31 p.m., Charlie Gibbs wrote:
>
>> On 2022-02-20 7:17 p.m., Sebastian Rasmussen wrote:
>>
>> Here's an assembly listing and test run of a little utility I wrote to
>> scan low memory for a given data string. I'd love to be able to OCR
>> it and get a source code file out of it. I tried playing with
>> ocrmypdf but got mostly garbage. If anyone has any experience OCRing
>> old mainframe listings I'd love to get some pointers. I have a lot of
>> stuff to scan so if I can get a Linux utility running here it would
>> make life a lot easier.
>>
>> P.S. If the assembly listing looks a little different from what you're
>> used to, it's because it's from my own assembler, which IMHO is much
>> nicer than the stock assembler.
>
> Attached is a copy of the text that I got using Nuance PaperPort to OCR
> mscan.pdf. The result is poorly formatted but the interesting bits don't
> look too badly garbled. The formatting sucks but it would be possible to
> extract the source code via a little creative cutting and pasting.
>
> This is not ideal but it would be OK for small projects.

I couldn't resist. Attached is the cleaned-up source code. Hope I got
everything fixed...

> I have no idea where I got Nuance from. It must have come with one of my
> scanners or pre-installed on my PC.

It's not that bad considering what it has to deal with. I've posted a
message on comp.os.linux.misc to see whether anyone else has any pointers.

> A quick look at your code and it seems as though you knew about the
> negative addressing trick back in the day because that is what you seem
> to be using here.

Yes, I forgot I was doing that.

> Nice looking assembly listing BTW!

Thank you. I made a lot of usability enhancements. In case you're
wondering, the asterisk after some of the line numbers in the
cross-reference listing indicates lines which modify the variable in
question.

This program was the smallest general-purpose utility I could find (137
lines). One of the next smallest ones (just under 250 lines) is another
one that I forgot about - it finds the currently-loaded ICAM and
displays its network configuration(s) on the console. It might be
tailored to release 6 too tightly to work on other versions, but it
could be fun to try it.

--
cgi...@surfnaked.ca (Charlie Gibbs)

mscan.asm

Sebastian Rasmussen

Feb 23, 2022, 7:27:20 PM

to Charlie Gibbs, Stephen Boyd, Univac Emulators

> Here's an assembly listing and test run of a little utility I wrote

Alright, mscan.txt is my transcribed result of the entire file.
mscan-seb.asm is the part that you OCRed using Nuance and mscan.diff
is the result of diffing that with mscan-seb.asm. There are a few
things it missed during OCRing. Even your manually fixed mscan.asm has
the same mistakes:

* a mysteriously introduced "SPACE 3"
* ADDR becoming AODR
* 0F becoming OF
* some comments ending up at the wrong line
* ABCDEF becoming 43CDEF in the HXTR table

But I agree with you, in general Nuance OCR did a really good job!

I have not verified all the hex dumped numbers in my transcription btw.

/ Sebastian

mscan.txt

mscan-seb.asm

mscan.diff

Charlie Gibbs

Feb 23, 2022, 10:54:13 PM

to Sebastian Rasmussen, Stephen Boyd, Univac Emulators

On 2022-02-23 4:27 p.m., Sebastian Rasmussen wrote:

>> Here's an assembly listing and test run of a little utility I wrote
>
> Alright, mscan.txt is my transcribed result of the entire file.
> mscan-seb.asm is the part that you OCRed using nuance and mscan.diff
> is the result of diffing that with mscan-seb.asm. There are a few
> things it missed during OCRing. Even your manually fixed mscan.asm has
> the same mistakes:
>
> * a mysteriously introduced "SPACE 3"

Aha, here's where inside knowledge comes in. Those three blank lines
there are the result of a SPACE 3 directive. The assembler doesn't
print it, but just leaves three blank lines. There's no way you could
know this - but being the author of the program, I do. :-)

> * ADDR becoming AODR
> * 0F becoming OF
> * some comments ending up at the wrong line
> * ABCDEF becoming 43CDEF in the HXTR table

Those are errors that I missed.

Excellent work, Sebastian! I've attached a corrected mscan.asm.

> But I agree with you, in general Nuance OCR did a really good job!
>
> I have not verified all the hex dumped numbers in my transcription btw.

I did, and found a few errors. I've renamed mscan.txt to mscan-seb.txt
and attached a new mscan.txt, along with mscantxt.dif.

Stephen, this mscan.asm should be ready to assemble (and, hopefully,
run). Give it a try when you get a chance.

--
cgi...@surfnaked.ca (Charlie Gibbs)

mscan.asm

mscan-seb.txt

mscantxt.dif

Sebastian Rasmussen

Feb 23, 2022, 11:46:05 PM

to Charlie Gibbs, Stephen Boyd, Univac Emulators

> Those are errors that I missed.
> Excellent work, Sebastian! I've attached a corrected mscan.asm.

Thank you for a good collaboration!

> > I have not verified all the hex dumped numbers in my transcription btw.
> I did, and found a few errors. I've renamed mscan.txt to mscan-seb.txt
> and attached a new mscan.txt, along with mscantxt.dif.

How did you verify them? By looking up the opcodes somewhere?
I have never used this computer, so I might need a few pointers. :)

/ Sebastian

Charlie Gibbs

Feb 24, 2022, 12:26:45 AM

to Sebastian Rasmussen, Stephen Boyd, Univac Emulators

On 2022-02-23 8:45 p.m., Sebastian Rasmussen wrote:

>> Those are errors that I missed.
>> Excellent work, Sebastian! I've attached a corrected mscan.asm.
>
> Thank you for a good collaboration!

Teamwork at its best. :-)

>>> I have not verified all the hex dumped numbers in my transcription btw.
>> I did, and found a few errors. I've renamed mscan.txt to mscan-seb.txt
>> and attached a new mscan.txt, along with mscantxt.dif.
>
> How did you verify them? By looking up the opcodes somewhere?
> I have never used this computer, so I might need a few pointers. :)

I've done enough assembly language programming on this beast that I
could recite this stuff backwards in my sleep - including a list of
opcodes. But I was checking your work against the hard copy, so I have
two sources.

I had a wall of 90/30 documentation, which I've scanned and uploaded to
Bitsavers. More information than you'll ever want, probably...

--
cgi...@surfnaked.ca (Charlie Gibbs)

Stephen Boyd

Feb 24, 2022, 9:51:53 AM

to Charlie Gibbs, Sebastian Rasmussen, Univac Emulators

I've set up a new virtual 8418 on my emulator here named CG0001 with a
full set of library files so that I can keep Charlie's stuff separate
from mine.

So far, we've managed to fool the assembler. I've attached a clean
assembly listing of mscan.asm.

I haven't tried to run it yet.

Steve B

MscanCompile.pdf

Stephen Boyd

Feb 24, 2022, 10:25:15 AM

to Charlie Gibbs, Sebastian Rasmussen, Univac Emulators

Well, I ran it and it seems to work.

See attached job log.

MscanRun.pdf

Charlie Gibbs

Feb 24, 2022, 12:25:30 PM

to Stephen Boyd, Sebastian Rasmussen, Univac Emulators

On 2022-02-24 7:25 a.m., Stephen Boyd wrote:

> Well, I ran it and it seems to work.
>
> See attached job log.

Yay! I'll see about scanning more stuff. I'd love to see whether my
ICAM probe works.

I've heard back from a few people about OCR and will organize my notes.
I've uploaded all the manuals I've scanned so far to Bitsavers, and Al
Kossow has done some magic on them; he's shrunk them to a third their
size with no apparent loss of quality. In addition he's done some sort
of OCR on them so that the text is searchable, and pdftotext can extract
it. I'll keep you posted...

--
cgi...@surfnaked.ca (Charlie Gibbs)

Sebastian Rasmussen

Feb 24, 2022, 11:21:02 PM

to Charlie Gibbs, Univac Emulators

Hi!

> I've heard back from a few people about OCR and will organize my notes.
> I've uploaded all the manuals I've scanned so far to Bitsavers,

Thank you for investing the time in doing this. :)
And yes, Al Kossow/Bitsavers are the best!
Do you have a link to where the things you've scanned are located on bitsavers?

/ Sebastian

Charlie Gibbs

Feb 25, 2022, 2:52:35 AM

to Sebastian Rasmussen, Univac Emulators

Go to http://bitsavers.trailing-edge.com/pdf/univac/ and look in some of
the subdirectories. The 90/30 stuff that we're working with right now
is in system_80_and_series_90, but I've also uploaded a lot of the stuff
in the following other subdirectories:

1004
1005
9300
9400
terminals/uts_20
terminals/uts_400
terminals/Uniscope_100

Yes, I've used all this stuff at one time or another.

I'm getting ready to scan more listings. Things are pretty busy here
right now, though, so I'll be slowing down a bit.

--
cgi...@surfnaked.ca (Charlie Gibbs)

Charlie Gibbs

Feb 27, 2022, 6:51:12 PM

to Sebastian Rasmussen, Stephen Boyd, Univac Emulators

On 2022-02-20 7:17 p.m., Sebastian Rasmussen wrote:

I've tried converting my PDF file to TIFF and running it through
Tesseract, with disappointing results. Tesseract is quite picky about
what it reads, and it either rejects the file outright or generates a
few bits of garbage. The original scan was at 300 dpi - I tried
scanning at 600 dpi but it didn't make a noticeable difference.

From searching the web I've concluded that OCR is either a life study
or a black art - possibly both. If anyone has any suggestions I'd love
to hear them. I have a lot of stuff to scan and it would be nice to be
able to OCR it in-house.

--
cgi...@surfnaked.ca (Charlie Gibbs)

Sebastian Rasmussen

Feb 27, 2022, 10:33:45 PM

to Charlie Gibbs, Univac Emulators

Hi!

I have recently OCRed a number of PDFs containing articles
about other old computers through this website with good results:
https://tools.pdf24.org/en/ocr-pdf

You do have to indicate what language is used. But I have not
submitted any source code listings so I don't know how well
it would process that type of document.

/ Sebastian

PS. I don't know if that website keeps the documents, so only
OCR things that are public or you will make public anyway.
Better to be overly cautious. :)

Stephen Boyd

Feb 28, 2022, 10:35:20 AM

to Charlie Gibbs, Sebastian Rasmussen, Univac Emulators

You're right. OCR is something of a black art. I have very little direct
experience with OCR but I have spent the last 20 years of my life
messing with document imaging systems. I had to be able to read bar
codes and QR codes and experimented with OCR. For what it is worth, here
are some tips that have helped me over the years.

Always scan documents in black and white. One bit per pixel images are
easier to decode than colour or grey scale.

If your OCR tool supports scanning TIFF images, always scan to TIFF
using CCITT3 or CCITT4 compression. These compression algorithms are
lossless, meaning that you get out exactly what you put in. Lossy
compression schemes like JPEG are different: what you get out is
"fuzzier" than what you put in. It is good enough to fool the human eye
but terrible for bar coding or OCR.

In general, higher resolution images are better than lower. Having said
that I always had good luck bar coding images at 300DPI or 400DPI.
Anything less was chancy at best.

I experimented with OCR to try to extract things like freight bill
numbers and bill of lading numbers from scanned documents. I never found
anything that came close to being accurate enough for my purposes but I
did learn a few things.

In general OCR is geared to reading prose, like novels or newspapers. So
the first thing that many OCR kits do is try to find the columns in the
image so that they can format the scanned text similarly. For our
purposes, if your OCR kit has an option I would suggest telling it that
everything is in a single column.

Many OCR kits will ask you for the language of the text. These are
usually horrible for things like program listings because they use a
heuristic algorithm to try to guess what the next character should be
based on what the previous characters were. Since assembler listings
don't follow the rules of English spelling and grammar, OCR kits like
these just get hopelessly confused. Some kits of this class use a
dictionary to help with word recognition. If your kit allows it try
adding the assembler mnemonics to the dictionary. That might help. My
guess would be that a plain old dumb ass OCR kit with no heuristics
might work best for us.
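
Where the OCR kit itself has no dictionary, the same idea can be applied afterwards to its output. The sketch below uses Python's standard difflib to snap a garbled operation field to the closest known mnemonic. KNOWN_OPS is only a small illustrative subset, and the name snap_opcode and the 0.6 cutoff are assumptions, not anything from a real tool.

```python
import difflib

# Illustrative subset only; a real list would hold every opcode and
# assembler directive.
KNOWN_OPS = ["LA", "LH", "L", "ST", "STH", "SH", "MVC", "CLC",
             "BAL", "BC", "DC", "DS", "USING", "START", "END"]

def snap_opcode(token, known=KNOWN_OPS, cutoff=0.6):
    """Return the closest known mnemonic, or the token unchanged."""
    if token in known:
        return token
    matches = difflib.get_close_matches(token, known, n=1, cutoff=cutoff)
    return matches[0] if matches else token
```

This only helps when the damaged token is not itself a valid mnemonic. A misread that turns LA into LH passes straight through, so the result still needs checking against the hex column of the listing.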

Another problem is that OCR is notoriously bad at recognizing strings of
mixed numbers and letters, like hex. It really has no way of knowing if
that O should be an O or a zero, the 1 a 1 or an I or an L. I don't know
of any way around this particular problem.

I don't know if any of the above will be of any help to you. If you
don't have any luck I would be more than happy to run your scanned
listing through Nuance. Nuance only supports PDF format.

BTW, I tried about 15 or 20 of the free, online PDF to text tools on
mscan.pdf. None of them gave anything but garbage.

Steve B

Stephen Boyd

Feb 28, 2022, 11:55:52 AM

to Charlie Gibbs, Sebastian Rasmussen, Univac Emulators

On 2/27/22 6:51 p.m., Charlie Gibbs wrote:
>

> From searching the web I've concluded that OCR is either a life study
> or a black art - possibly both. If anyone has any suggestions I'd
> love to hear them. I have a lot of stuff to scan and it would be nice
> to be able to OCR it in-house.
>

One more thing. I also have ABBYY Fine Reader that will accept TIFF
files. If you want to scan to TIFF and send me the files I can run them
thru ABBYY and see what we get.

Steve B

Charlie Gibbs

Feb 28, 2022, 12:24:52 PM

to Sebastian Rasmussen, Univac Emulators

On 2022-02-27 7:33 p.m., Sebastian Rasmussen wrote:

> Hi!
>

> I have recently OCRed a number of PDFs containing articles
> about other old computers through this website with good results:
> https://tools.pdf24.org/en/ocr-pdf
>
> You do have to indicate what language is used. But I have not
> submitted any source code listings so I don't know how well
> it would process that type of document.

Ah yes, that newfangled "software as a service" paradigm.
I'll keep it in mind as a last resort. I have a _lot_ of stuff to scan.

If I ever get into things like training OCR engines, it might simplify
things if I can specify the 64-character set supported by those old
printers. I've seen lots of lower case come out of Tesseract, and
that's one thing that I can guarantee isn't in the originals.

--
cgi...@surfnaked.ca (Charlie Gibbs)

Charlie Gibbs

Feb 28, 2022, 12:51:27 PM

to Stephen Boyd, Sebastian Rasmussen, Univac Emulators

Thanks for the scanning tips. I'm using a Brother ADS-2700W, which
scans to PDF. I'm scanning in black and white at 300 dpi, which is the
same setting I used for the manuals. I did try scanning my listing at
600 dpi, but it didn't make much difference.

I've been playing with Tesseract. It doesn't accept PDF, so I tried
converting my PDF to TIFF using the ImageMagick tool:

convert -density 288 mscan.pdf -resize 25% -alpha Off mscan.tiff

Adding the density and resize parameters gives a much better looking
image in the TIFF file. Still, Tesseract spits out garbage with the
occasional recognizable phrase.

I presume that an OCR engine's work can be simplified if we can tell it
to only recognize the 63 glyphs that the printer is capable of printing.
The garbage that came out of Tesseract contained a lot of lower case,
and I can guarantee there is none of that in the original listings.
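
One possible way to try this with Tesseract, offered as a sketch only. Its command line accepts `-c tessedit_char_whitelist=...` to restrict the characters it may output, and `--psm 6` to treat the page as a single uniform block of text. The character set below is a guess, since the printer's actual 63 glyphs are not listed in this thread. tesseract_command is a made-up helper name, and whether the whitelist is honoured depends on the Tesseract version and engine in use.

```python
import string

# Assumption: upper-case letters, digits and some punctuation. The real
# printer repertoire should replace this placeholder.
ASSUMED_CHARSET = string.ascii_uppercase + string.digits + "+-*/=.,()'$&#@%:;<>?_!\""

def tesseract_command(image_path, output_base, charset=ASSUMED_CHARSET):
    """Build an argument list for the tesseract command-line program."""
    return [
        "tesseract", image_path, output_base,
        "--psm", "6",                                # one uniform block of text
        "-c", f"tessedit_char_whitelist={charset}",  # no lower case allowed
        "-c", "preserve_interword_spaces=1",         # try to keep column spacing
    ]
```

The list can be handed to `subprocess.run(cmd, check=True)`. Passing it as a list rather than as a shell string avoids having to quote the punctuation in the whitelist.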

I've found a script called textcleaner, which people claim does a good
job of prepping a file for OCR, but it has so many parameters that I'd
need to find a book to figure out how to use it.

Sigh... I suppose I could always retype the source code. As I've
learned from long experience, the second time only takes half as long.

--
cgi...@surfnaked.ca (Charlie Gibbs)

Stephen Boyd

Mar 1, 2022, 10:30:15 AM

to Charlie Gibbs, Sebastian Rasmussen, Univac Emulators

On 2/28/22 12:51 p.m., Charlie Gibbs wrote:
> I've been playing with Tesseract. It doesn't accept PDF, so I tried
> converting my PDF to TIFF using the ImageMagick tool:
>

I've downloaded a trial of the full version of the latest ABBYY
Finereader. It has lots of tweaks that are going to take a while to
figure out. Initial testing looks promising. I'll keep you posted.

>
> Sigh... I suppose I could always retype the source code. As I've
> learned from long experience, the second time only takes half as long.
>

If you decide that you need to go this route don't forget that Sebastian
has volunteered to help with transcription.

Sebastian Rasmussen

Mar 1, 2022, 10:58:24 AM

to Univac Emulators

> > Sigh... I suppose I could always retype the source code. As I've
> > learned from long experience, the second time only takes half as long.
> If you decide that you need to go this route don't forget that Sebastian
> has volunteered to help with transcription.

Yes, I'd be happy to help if needed. I would wager that the most important
thing is to scan the material using a decent procedure, then worry about
the OCR:ing/transcription afterwards. Helping out with the scanning is
a bit more difficult for me. :)

/ Sebastian

Stephen Boyd

Mar 1, 2022, 11:08:30 AM

to Charlie Gibbs, Sebastian Rasmussen, Univac Emulators

Here is what I ended up with after a couple of hours messing around with
ABBYY Finereader. The .TXT file is unretouched. This is what Finereader
spit out.

First I converted the PDF to TIFF using pdf2tiff.com, then I created a
template to scan only the area of the page with source code in it. That
was basically it. Finereader has the ability to be trained to recognize
difficult fonts but that doesn't really seem to be necessary. The only
downside to Finereader so far is that it costs money. Looks like $260 CAD.

If it looks like it would be useful I would be willing to fund 1 copy to
get the files converted.

mscan.txt

Charlie Gibbs

Mar 1, 2022, 12:46:21 PM

to Stephen Boyd, Sebastian Rasmussen, Univac Emulators

The results aren't bad, but compared to whatever you used last time I
wonder whether it's worth laying out a bunch of money for it. In
particular, most of the horizontal spacing information seems to be lost.
Still, it might be worthwhile if we can't find anything better.

I'll continue looking into Tesseract when I get the chance. There are a
number of enthusiastic reviews of it, so maybe it's just a matter of
finding the right magical incantation.

--
cgi...@surfnaked.ca (Charlie Gibbs)

Charlie Gibbs

Mar 2, 2022, 1:10:58 PM

to Stephen Boyd, Sebastian Rasmussen, Univac Emulators

While out for a walk last night a few thoughts rolled around in my head
which might make this project easier.

After looking into various OCR solutions and the hoops you have to jump
through to use them, I've come to the realization that a general-purpose
OCR routine is overkill - by a lot - when trying to scan old printouts.
We don't need a routine that will search for recognizable glyphs
anywhere on a page, in any size, with an unlimited character set - and
trying to do so would require a lot of unnecessary effort.

A mainframe printer of the era printed lines that are usually 132
characters long, at a horizontal pitch of 10 characters per inch, with a
vertical spacing of either 6 or 8 lines per inch. We don't need to
search for randomly-placed characters. All we have to do is establish a
grid that's 13.2 inches wide and as deep as the page. This grid
consists of one cell per character position. At 300 dots per inch,
cells are 30 pixels wide, and 50 pixels high at 6 lines per inch. At 8
lines per inch, cells are 37.5 pixels high (ugh - I might have to scan
at 600 dpi to make things come out even).
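
The arithmetic above can be worked through in a few lines. This is just a sketch; cell_size is an illustrative name, and exact fractions are used so that the awkward 37.5 shows up as a non-integer instead of being rounded away.

```python
from fractions import Fraction

def cell_size(dpi, chars_per_inch=10, lines_per_inch=6):
    """Return (width, height) of one character cell in pixels, exactly."""
    return Fraction(dpi, chars_per_inch), Fraction(dpi, lines_per_inch)

if __name__ == "__main__":
    for dpi in (300, 600):
        for lpi in (6, 8):
            w, h = cell_size(dpi, lines_per_inch=lpi)
            whole = w.denominator == 1 and h.denominator == 1
            print(f"{dpi} dpi, {lpi} lpi: {float(w)} x {float(h)} px, whole pixels: {whole}")
```

At 300 dpi the 132-column grid is 132 x 30 = 3960 pixels wide, which matches 13.2 inches. At 600 dpi and 8 lines per inch the cell comes out at 60 x 75 pixels, so everything is a whole number.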

Given a bitmap of a page, all we have to do is position this grid
properly, and each nonempty cell (for suitable values of "nonempty")
will contain a glyph. This will make positioning text as close to
perfect as we're going to get.
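
A minimal sketch of the grid idea, assuming the page is already in memory as a list of rows of 0/1 values (1 = ink) and that cell sizes are whole pixels. The grid origin (x0, y0) and the min_ink threshold are placeholders to be tuned, and the function names are made up for illustration.

```python
def cell_pixels(bitmap, row, col, cell_w=30, cell_h=50, x0=0, y0=0):
    """Cut out the pixels of the character cell at (row, col) of the grid."""
    top = y0 + row * cell_h
    left = x0 + col * cell_w
    return [line[left:left + cell_w] for line in bitmap[top:top + cell_h]]

def is_nonempty(cell, min_ink=20):
    """One 'suitable value of nonempty': at least min_ink black pixels."""
    return sum(sum(line) for line in cell) >= min_ink
```

A threshold above zero keeps specks of dirt on the scan from being mistaken for a character; the right value would have to be found by experiment.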

The second shortcut is that the printer is only capable of generating 63
glyphs, so there's no need to search the entire ASCII character set (let
alone UTF-8). All we need to do is create 63 templates - one for each
glyph - and try to match each cell's pixels with each of the 63
templates and take the one that comes closest. There are some
heuristics involved here, but it wouldn't be nearly as hairy as a full
OCR routine.
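
As a sketch of the matching step: if each template and each cell is a same-sized grid of 0/1 pixels, "closest" can be taken as the smallest number of differing pixels (Hamming distance). That distance measure is an assumption chosen for simplicity, not something settled in this thread, and the names below are illustrative.

```python
def hamming(a, b):
    """Count the pixels that differ between two same-sized 0/1 bitmaps."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))

def best_match(cell, templates):
    """Return (character, distance) for the template closest to the cell.

    templates maps each printable character to its reference bitmap.
    """
    scores = {ch: hamming(cell, tpl) for ch, tpl in templates.items()}
    ch = min(scores, key=scores.get)
    return ch, scores[ch]
```

Returning the distance as well as the character makes it possible to flag doubtful cells for a human to check, for example when the best score is high or the runner-up is nearly as good.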

I'll start puttering with some code. In a perfect world I'd be able to
tell my scanner to give me a raw bitmap, but I think it's hard-wired to
generate PDFs. My first step, then, will be to write code that converts
a page image to a bitmap - if PDF turns out to be too complex I could
first convert it to another format such as TIFF, PBM, or whatever.

Then the fun begins...

--
cgi...@surfnaked.ca (Charlie Gibbs)

Stephen Boyd

Mar 2, 2022, 3:12:39 PM

to Charlie Gibbs, Sebastian Rasmussen, Univac Emulators

On 3/02/22 1:10 p.m., Charlie Gibbs wrote:
>
> Given a bitmap of a page, all we have to do is position this grid
> properly, and each nonempty cell (for suitable values of "nonempty")
> will contain a glyph. This will make positioning text as close to
> perfect as we're going to get.
>

There is the trick. Even a very small mis-positioning of the page or the
least bit of skew makes the whole decoding of the bitmap that much more
difficult.
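
For the positioning half of the problem (not the skew), one possible approach is a projection profile: add up the ink in every pixel row, fold those sums by the line pitch, and pick the vertical offset that puts the ink nearest the middle of the cells. The sketch below assumes a 0/1 bitmap held as a list of rows and a whole-pixel cell height; the names are made up, and a skewed page would still need straightening first.

```python
def fold_profile(bitmap, cell_h):
    """Total ink per pixel row, folded modulo the line pitch."""
    profile = [0] * cell_h
    for y, row in enumerate(bitmap):
        profile[y % cell_h] += sum(row)
    return profile

def best_vertical_offset(bitmap, cell_h):
    """Return the row offset y0 (0 <= y0 < cell_h) that best centres the text."""
    profile = fold_profile(bitmap, cell_h)

    def score(y0):
        # Weight ink by its distance from the cell's top and bottom edges,
        # so ink in the middle of a cell scores highest.
        return sum(profile[(y0 + k) % cell_h] * min(k, cell_h - 1 - k)
                   for k in range(cell_h))

    return max(range(cell_h), key=score)
```

The same folding trick applied to pixel columns would give a horizontal offset.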

> The second shortcut is that the printer is only capable of generating
> 63 glyphs, so there's no need to search the entire ASCII character set
> (let alone UTF-8). All we need to do is create 63 templates - one for
> each glyph - and try to match each cell's pixels with each of the 63
> templates and take the one that comes closest. There are some
> heuristics involved here, but it wouldn't be nearly as hairy as a full
> OCR routine.

Another plus that you would have is the domain specific knowledge
relating to 90/30 assembly language. For example, if you see X' you know
that a hex number follows, if you see F' expect an integer, etc. You
could also build a dictionary of all opcodes and other known symbols.
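
A small sketch of using that context on OCR output: inside an X'...' constant only hex digits are legal, so a letter O can safely be read as zero and an I or L as one. The regular expression and the name fix_hex_literals are illustrative assumptions, and the pattern only covers plain X'..' and XLn'..' forms.

```python
import re

# Inside a hex constant these letters can only be misread digits.
HEX_FIXES = str.maketrans({"O": "0", "I": "1", "L": "1"})

# X or XLn, not preceded by another letter, followed by a quoted string.
HEX_LITERAL = re.compile(r"(?<![A-Z])(X(?:L\d+)?)'([0-9A-Z]*)'")

def fix_hex_literals(line):
    """Repair O/0 and I/L/1 confusion inside X'...' constants only."""
    return HEX_LITERAL.sub(
        lambda m: f"{m.group(1)}'{m.group(2).translate(HEX_FIXES)}'", line)
```

Character constants such as C'...' are left alone, since any letter is legal there. Confusions between two valid hex digits, such as B and 8, cannot be resolved this way.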

>
> I'll start puttering with some code. In a perfect world I'd be able
> to tell my scanner to give me a raw bitmap, but I think it's
> hard-wired to generate PDFs. My first step, then, will be to write
> code that converts a page image to a bitmap - if PDF turns out to be
> too complex I could first convert it to another format such as TIFF,
> PBM, or whatever.

PDF files are horribly complex. I never did bother to figure them out.
Google has an open source toolkit named Pdfium. This is one of the PDF
toolkits that I used over the years. None of them were perfect and I
wound up having to try one toolkit then if that failed the program would
automatically try another. Ugly. Your best bet for ease of programming
is to use an imaging toolkit of some kind to convert your image file to
an in-memory bitmap. You really don't want to have to mess with decoding
image files yourself if you can avoid it. Not least because image file
standards aren't really standards. TIFF files from one place will be
radically different than TIFF files from another place. Just like PDF
toolkits I wound up having to use multiple imaging toolkits in case the
first one found the imaging file to be invalid for some reason.

Good luck. I'm interested to see how you make out.

Steve B

>
> Then the fun begins...
>

Charlie Gibbs

Mar 2, 2022, 4:51:04 PM

to Stephen Boyd, Sebastian Rasmussen, Univac Emulators

On 2022-03-02 12:12 p.m., Stephen Boyd wrote:

> On 3/02/22 1:10 p.m., Charlie Gibbs wrote:
>
>> Given a bitmap of a page, all we have to do is position this grid
>> properly, and each nonempty cell (for suitable values of "nonempty")
>> will contain a glyph. This will make positioning text as close to
>> perfect as we're going to get.
>
> There is the trick. Even a very small mis-positioning of the page or the
> least bit of skew makes the whole decoding of the bitmap that much more
> difficult.

Yes, I expect that will be one of the trickier parts. But I have a few
ideas and I'm ready to give it a try.

>> I'll start puttering with some code. In a perfect world I'd be able
>> to tell my scanner to give me a raw bitmap, but I think it's
>> hard-wired to generate PDFs. My first step, then, will be to write
>> code that converts a page image to a bitmap - if PDF turns out to be
>> too complex I could first convert it to another format such as TIFF,
>> PBM, or whatever.
>
> PDF files are horribly complex. I never did bother to figure them out.
> Google has an open source toolkit named Pdfium. This is one of the PDF
> toolkits that I used over the years. None of them were perfect and I
> wound up having to try one toolkit then if that failed the program would
> automatically try another. Ugly. Your best bet for ease of programming
> is to use an imaging toolkit of some kind to convert your image file to
> an in-memory bitmap. You really don't want to have to mess with decoding
> image files yourself if you can avoid it. Not least because image file
> standards aren't really standards. TIFF files from one place will be
> radically different than TIFF files from another place. Just like PDF
> toolkits I wound up having to use multiple imaging toolkits in case the
> first one found the imaging file to be invalid for some reason.

I've found a Linux utility that quickly converts a PDF file to a series
of PBM files, one per page. The PBM format is very simple: a couple of
header lines followed by pure bitmap data. It'll be easy to write code
to load it into a big array in memory - then I can start playing with it.
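
For anyone following along, a loader for the raw ("P4") variant of PBM might look roughly like this. It is a sketch in Python, not Charlie's code: load_pbm is a made-up name, and it assumes the header is the magic number, width and height separated by whitespace, with optional # comment lines, followed by rows packed eight pixels to a byte.

```python
def load_pbm(data):
    """Parse raw PBM bytes into a list of rows of 0/1 ints (1 = black)."""
    tokens, pos = [], 0
    while len(tokens) < 3:                       # magic, width, height
        while data[pos:pos + 1].isspace():
            pos += 1
        if data[pos:pos + 1] == b"#":            # comment runs to end of line
            while data[pos:pos + 1] not in (b"\n", b""):
                pos += 1
            continue
        start = pos
        while pos < len(data) and not data[pos:pos + 1].isspace():
            pos += 1
        tokens.append(data[start:pos])
    if tokens[0] != b"P4":
        raise ValueError("only raw PBM (P4) is handled in this sketch")
    width, height = int(tokens[1]), int(tokens[2])
    pos += 1                                     # one whitespace byte ends the header
    stride = (width + 7) // 8                    # rows are padded to whole bytes
    rows = []
    for y in range(height):
        packed = data[pos + y * stride: pos + (y + 1) * stride]
        # The most significant bit of each byte is the leftmost pixel.
        rows.append([(packed[x >> 3] >> (7 - (x & 7))) & 1 for x in range(width)])
    return rows
```

Typical use would be `rows = load_pbm(open("page.pbm", "rb").read())`, where the file name is just an example.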

> Good luck. I'm interested to see how you make out.

I'll keep you posted.

--
cgi...@surfnaked.ca (Charlie Gibbs)
