Hi Nick,
Thanks for the heads up. As a colonial, I'm afraid I'm woefully unaware of the dates of the editions of the OED and, consequently, whether this edition is in the public domain. I'm guessing it's not, given the reference to CD-ROM -- a rather new-ish technology.
The overarching goal is to OCR, and then crowdsource corrections to, a public-domain edition of the Oxford English Dictionary. Of course, the devil is in the details. The OKFN's starting point is to use the Internet Archive's ABBYY FineReader transcription, with which I'm somewhat unimpressed. Nonetheless, I've been attempting to push that front as far as it will go. I have previous experience reading and interpreting FineReader XML, so it's a relatively easy path to start exploring.
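For what it's worth, pulling per-character confidences out of the FineReader XML is pretty mechanical. Here's a minimal sketch of the kind of thing I mean -- the file name, the charConfidence attribute, and the threshold are all just illustrative assumptions about what's in the IA's _abbyy.xml files:

    # Minimal sketch: list characters ABBYY was unsure about.
    # Assumes the IA file uses ABBYY's FineReader schema, where each OCR'd
    # character is a <charParams> element with a charConfidence attribute.
    import xml.etree.ElementTree as ET

    def low_confidence_chars(path, threshold=60):
        """Yield (char, confidence) pairs below the given confidence threshold."""
        for _, elem in ET.iterparse(path):
            # Tags are namespaced; match on the local name only.
            if elem.tag.endswith("charParams"):
                conf = int(elem.get("charConfidence", -1))
                if 0 <= conf < threshold:
                    yield elem.text or "", conf
                elem.clear()  # keep memory bounded on very large volumes

    if __name__ == "__main__":
        for ch, conf in low_confidence_chars("oed_vol1_abbyy.xml"):
            print(repr(ch), conf)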
The ultimate pipeline will exploit lots of lexical, typographic, layout, and semantic information to generate starting canonical word entries, to be corrected in some type of crowd-sourcing environment. I'm not an OCR expert, but I love the multi-layered, machine-plus-human approach this will require.
I've got hundreds of pages of vol. 1 as generated HTML, with color-coded highlighting of low OCR accuracy, cleaned-up layout, and a few other things. I've committed to the OKFN that I'll get them up somewhere for folks to review (probably github.io). Of course, I've only just scratched the surface of the analysis that will be required. When I look at low-accuracy segments, I find blocks of Persian and other languages, in addition to all other manner of variability.
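The color coding itself is nothing fancy -- roughly this sort of thing, though the class names and confidence cut-offs here are placeholders rather than what's actually in my generated pages:

    # Rough sketch of the color-coding step (CSS classes and cut-offs are
    # placeholders, not the ones used in my generated HTML).
    from html import escape

    def span_for(char, confidence):
        """Wrap a single OCR'd character in a span reflecting its confidence."""
        if confidence < 40:
            cls = "ocr-bad"       # e.g. red background
        elif confidence < 70:
            cls = "ocr-doubtful"  # e.g. yellow background
        else:
            return escape(char)   # leave confident text unmarked
        return f'<span class="{cls}" title="conf {confidence}">{escape(char)}</span>'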
I'd love to have more collaborators on this. I should write up a more formal description of my current plan of attack and possible alternatives. The OKFN isn't big on that kind of formality, though; they're more in the "Let's do this thing!" vein.
Thanks for your interest. I should tell you up front that this is a sideline to my sidelines, so it may not get tons of attention.
Tom