Conversion to HTML

Robert Prins

Apr 2, 2022, 12:29:20 PM

I'm about to start the conversion of a 1980s book from book to html. One of
the quirky things is its layout: 635 pages of text, 40-character-wide pages,
a fixed-pitch font (Courier?), and justified text, which sometimes looks a bit
silly when only five words fit on a line, like this (spaces replaced by
underscores):

that___appear___here.___So__here's__the

Whereas in earlier conversions, like
<https://prino.neocities.org/www/a-und-a/anhalterwesen-und-anhaltergefahren.html>, I've
used proper HTML, converted footnotes to end-of-chapter notes, etc., I'm not sure
what I should do here, as part of the appeal of this long-out-of-print book is
its formatting.

Any suggestions?

Robert
--
Robert AH Prins
robert(a)prino(d)org
The hitchhiking grandfather - https://prino.neocities.org/
Some REXX code for use on z/OS - https://prino.neocities.org/zOS/zOS-Tools.html

Dale

Apr 2, 2022, 12:46:00 PM

Microsoft Word saves as HTML better than LibreOffice Writer does.

I seem to remember that Save As is better than Export, too.

--
Mystery? -> https://www.dalekelly.org/

Philip Herlihy

Apr 2, 2022, 1:58:19 PM

I'd question why you want to convert it to "HTML" if you want to preserve the
layout. These days we expect decent web pages to be "responsive" to the size
of the viewport (think small mobile to super-wide monitor) and to any user
settings users (including those with visual acuity problems) might apply. If
you want to put it online "as it is", then print it to PDF and offer a link to
that! Otherwise, it is a lot of work for questionable value. If you are determined
to do this, then the max-width CSS property, set to 40em, would be a good place
to start.

--

Phil, London

David E. Ross

Apr 2, 2022, 2:51:43 PM

DO NOT DO IT! At least do not try to replicate the appearance of the
book. You might break the book down into chapters, each in a separate
Web page subsidiary to a Table of Contents page. But do not try to
replicate each of the book's pages as a separate Web page. No one will
want to read it.

Also, do not replicate the book as a PDF file. Jakob Nielsen, an expert
on Web design and usability, says:

> Users hate coming across a PDF file while browsing, because it breaks
> their flow. Even simple things like printing or saving documents are
> difficult because standard browser commands don't work. Layouts are
> often optimized for a sheet of paper, which rarely matches the size
> of the user's browser window. Bye-bye smooth scrolling. Hello tiny
> fonts.
>
> Worst of all, PDF is an undifferentiated blob of content that's hard
> to navigate.
>
> PDF is great for printing and for distributing manuals and other big
> documents that need to be printed. Reserve it for this purpose and
> convert any information that needs to be browsed or read on the
> screen into real web pages.

Here is what I would do --

Scan the book a page at a time with optical character recognition (OCR)
software. Edit the result since OCR software is generally not perfect.
Combine the resulting files into one file per chapter.

Create a master CSS file for formatting. Specify a
variable-width font such as Georgia or Verdana, which are easier to read
on a computer monitor than Courier or other fixed-width fonts. Also,
variable-width fonts render fully justified without the problem you
cite. Avoid specifying a font that is not generally available or that
is overly artistic. Also avoid specifying colors for either the font or
background.

Manually insert the necessary HTML markup into each chapter's file,
including links to the master CSS file. Create the Table of Contents
HTML file, listing only the chapters by number and title and with links
to them. Having previously proofed the OCR results, now proofread the
Web pages; better, have someone else read the Web pages aloud to you.

Test the chapters and Table of Contents at <http://validator.w3.org/>.
Test the CSS file at <http://jigsaw.w3.org/css-validator/>. Correct ALL
errors.

Yes, this will be tedious, especially for a 635 page book. I have done
this with smaller documents, including a two-page newsletter that
becomes a single Web page and multiple page newspaper articles that
again become single Web pages.

--
David E. Ross
"A Message to Those Who Are Not Vaccinated"
See my <http://www.rossde.com/index.html#vaccine>.

😉 Good Guy 😉

Apr 2, 2022, 2:53:43 PM

On 02/04/2022 19:29, Robert Prins wrote:


> Any suggestions?


I am assuming you are able to scan the book, OCR the text, and paste it into an HTML document. If so, and you don't want to spend time arranging the text manually, then simply create the page in the normal way, but make sure it is centred (using margin: auto) and give it a width of 50%. The page will be narrow, but at least people will be able to see it properly in their browser.

The link you provided tells me that you have not bothered to centre your pages. Did you know that they are very easy to centre? You just need to create a container for your body content and set its margin-left and margin-right to auto.



--
Similar to Windows 11 Home edition, Windows 11 Pro edition now requires internet connectivity during the initial device setup (OOBE) only. If you choose to setup device for personal use, MSA will be required for setup as well. You can expect Microsoft Account to be required in subsequent WIP flights.

Now this is not the end. It is not even the beginning of the end. But it is, perhaps, the end of the beginning

Allodoxaphobia

Apr 2, 2022, 8:23:18 PM

On Sat, 2 Apr 2022 18:29:11 +0000, Robert Prins wrote:
> I'm about to start the conversion of a 1980s book from book to html.

Are you referring to IBM's BookManager `book` format,
or something on paper?

Robert Prins

Apr 3, 2022, 5:15:07 AM

On 2022-04-02 18:51, David E. Ross wrote:
> DO NOT DO IT! At least do not try to replicate the appearance of the
> book. You might break the book down into chapters, each in a separate
> Web page subsidiary to a Table of Contents page. But do not try to
> replicate each of the book's pages as a separate Web page. No one will
> want to read it.

That's definitely not something I am going to do. It will (eventually) be
"chapterised" and maybe even split by US state / CA province, if I decide to
actually include the AD 1982 data. I'm still trying to contact the authors; their
names are too generic to Google, but the publisher still seems to exist, and I've
sent a message to the owner.

> Also, do not replicate the book as a PDF file. Jakob Nielsen, an expert
> on Web design and usability, says:
>
>> Users hate coming across a PDF file while browsing, because it breaks
>> their flow. Even simple things like printing or saving documents are
>> difficult because standard browser commands don't work. Layouts are
>> often optimized for a sheet of paper, which rarely matches the size
>> of the user's browser window. Bye-bye smooth scrolling. Hello tiny
>> fonts.
>>
>> Worst of all, PDF is an undifferentiated blob of content that's hard
>> to navigate.

Yes and no. Some PDFs are just that, blobs, but the ones I use most, IBM's
documentation for z/OS, are quite OK, although, like every other PDF, they
suffer from not having a "Back" function.

>> PDF is great for printing and for distributing manuals and other big
>> documents that need to be printed. Reserve it for this purpose and
>> convert any information that needs to be browsed or read on the
>> screen into real web pages.
>
> Here is what I would do --
>
> Scan the book a page at a time with optical character recognition (OCR)
> software. Edit the result since OCR software is generally not perfect.
> Combine the resulting files into one file per chapter.

I already have an incredibly badly OCR'ed M$ Word DOC version of the book,
and a TIFF file containing all 324 scanned pages (200 dpi), each holding two
physical pages, which might (and actually seems to) explain the OCR mess.

> Create a master CSS file for formatting. Specify a
> variable-width font such as Georgia or Verdana, which are easier to read
> on a computer monitor than Courier or other fixed-width fonts. Also,
> variable-width fonts render fully justified without the problem you
> cite. Avoid specifying a font that is not generally available or that
> is overly artistic. Also avoid specifying colors for either the font or
> background.

My site uses a "style.css" file that has grown organically, and could probably
do with some TLC. Some other parts use per-subsection CSS, and there's even still
some in-line CSS that should really go to an external file, but there are only 24
hours in a day, and Mrs Prins also needs attention. ;)

My site, if you've looked at it, uses a simple "brown-bag" background, and other
than "Georgia", "Courier New", "Arial", I only use generic ("serif",
"monospace", "cursive") fonts. This book, like the one from the German police
and the California Highway Department would not get any background. Oops, the
rather weird looking z/OS part also uses "Liberation Mono" and Inconsolata but
these are both free fonts, so anyone can install them should they not have them.
(And yes that part uses the old IBM "green screen" colours)

> Manually insert the necessary HTML markup into each chapter's file,
> including links to the master CSS file. Create the Table of Contents
> HTML file, listing only the chapters by number and title and with links
> to them. Having previously proofed the OCR results, now proofread the
> Web pages; better, have someone else read the Web pages aloud to you.

I'm sure I will leave a typo every now and then. I actually found some in the
paper version, so I'm not (too) worried about that.

> Test the chapters and Table of Contents at <http://validator.w3.org/>.
> Test the CSS file at <http://jigsaw.w3.org/css-validator/>. Correct ALL
> errors.

I know those, and use them all the time.

> Yes, this will be tedious, especially for a 635 page book. I have done
> this with smaller documents, including a two-page newsletter that
> becomes a single Web page and multiple page newspaper articles that
> again become single Web pages.

Actually, a few years ago I typed in some 220 pages of calculator-related
newsletters, <https://prino.neocities.org/calculators/52-notes/52index.html>,
including adding backward links; the forward ones are still on the back burner.
And yes, those also use a fixed-pitch font.

Once in a while I go over them again, occasionally still spotting typos, despite
the fact that my daughter also proofread them.

Anyway, thanks for your useful comments!

Robert

PS: I very much like the vaccine page! I'm triple-vaxxed myself here in Belgium, and
could get three more vaccines in my native Netherlands; I might actually do so
when the vaccines are updated for the new variants.

Robert Prins

Apr 3, 2022, 5:24:05 AM

No, real paper book to HTML.

I don't know if there are tools to convert IBM's BookManager format to HTML, but
that's sadly not relevant anymore, as IBM no longer provides .BOO files. That
is bad, as they were directly usable on z/OS. I've got the full final set (~5 GB)
that came with z/OS 1.13, but only loaded a small subset on my system, and am
accessing the others via the PC-based tools.

Jukka K. Korpela

Apr 4, 2022, 3:36:37 AM

Robert Prins wrote:

> I'm about to start the conversion of a 1980s book from book to html.
> One of the quirky things is its layout: 635 pages of text,
> 40-character-wide pages, a fixed-pitch font (Courier?), and justified text,

If the entire book is in exactly that format and you wish to preserve
the format, then it is pointless to convert it to HTML, since plain text
is so much more natural. Well, if there is a technical reason to make it
HTML, just slap the entire content inside a <pre> element. And if you
want to keep the exact pagination as well, make each page a <pre>
element and use the CSS code

pre { page-break-before: always }

> which sometimes looks a bit silly when only five words fit on a line,
> like this (spaces replaced by underscores):
>
> that___appear___here.___So__here's__the

To preserve the formatting, in a plain text file or inside <pre>, you
would need to have the same number of spaces as in the printed book.

I first thought you could do things more flexibly. Since digital scanning
hardly gets the number of spaces right automatically, you could let it
produce text with single spaces between words and then use e.g.

<p>that appear here. So here's the...</p>

with

p { width: 40ch; font-family: monospace; text-align: justify; }

But it does not produce the same result. Text justification in browsers
does not simply add space characters but instead stretches spacing
between words evenly.
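
A minimal sketch, in JavaScript, of the typewriter-style justification described above: it pads a line to a fixed width with whole space characters, so the result can go into a plain text file or a <pre> element. The function names justifyLine and justifyParagraph are made up for illustration, and handing the leftover spaces to the leftmost gaps is an assumption; the book may well have distributed them differently.

```javascript
// Pad one line of words to `width` columns using whole spaces only.
// Assumption: leftover spaces go to the leftmost gaps.
function justifyLine(words, width) {
  if (words.length === 0) return "";
  if (words.length === 1) return words[0]; // a single word cannot be stretched
  const letters = words.reduce((total, word) => total + word.length, 0);
  const gaps = words.length - 1;
  const spaces = width - letters;
  if (spaces < gaps) return words.join(" "); // too long to justify, leave as is
  const base = Math.floor(spaces / gaps);
  let extra = spaces % gaps;
  let line = words[0];
  for (let i = 1; i < words.length; i++) {
    line += " ".repeat(base + (extra > 0 ? 1 : 0)) + words[i];
    if (extra > 0) extra--;
  }
  return line;
}

// Greedy word wrap; every line except the last one is justified.
function justifyParagraph(text, width = 40) {
  const words = text.split(/\s+/).filter(Boolean);
  const lines = [];
  let current = [];
  let length = 0; // length of `current` when joined with single spaces
  for (const word of words) {
    if (current.length > 0 && length + 1 + word.length > width) {
      lines.push(justifyLine(current, width));
      current = [];
      length = 0;
    }
    length += (current.length > 0 ? 1 : 0) + word.length;
    current.push(word);
  }
  if (current.length > 0) lines.push(current.join(" "));
  return lines;
}
```

As long as the words of a line fit with at least one space between them, each justified line comes out exactly `width` characters long; the last line of a paragraph is left ragged. The output only lines up when displayed in a monospace font.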

Perhaps you should first analyze how the book content is best presented
in modern media, instead of assuming that the original format should be
preserved, even if it was simply caused by the limitations of tools.

(About 40 years ago, I co-authored a book that was published as
typewritten, in a monospace font. If someone wanted to publish it now in
digital format, it would be pointless to imitate that format, and I
would surely use my rights to forbid that!)

Yucca


Robert Prins

Apr 4, 2022, 2:26:02 PM

On 2022-04-04 07:36, Jukka K. Korpela wrote:
> Robert Prins wrote:
>
>> I'm about to start the conversion of a 1980s book from book to html. One of
>> the quirky things is its layout: 635 pages of text, 40-character-wide
>> pages, a fixed-pitch font (Courier?), and justified text,
>
> If the entire book is in exactly that format and you wish to preserve the
> format, then it is pointless to convert it to HTML, since plain text is so much
> more natural. Well, if there is a technical reason to make it HTML, just slap
> the entire content inside a <pre> element. And if you want to keep the exact
> pagination as well, make each page a <pre> element and use the CSS code
>
> pre { page-break-before: always }

The book is long out of print, but parts are still a bit relevant, or
relevantish, and webifying makes it available to a (somewhat) bigger audience.

>> which sometimes looks a bit silly when only five words fit on a line,
>> like this (spaces replaced by underscores):
>>
>> that___appear___here.___So__here's__the
>
> To preserve the formatting, in a plain text file or inside <pre>, you would need
> to have the same number of spaces as in the printed book.

I'm now at page 245, working my way through the abysmal DOC file next to the TIFF.

However, I'm already making changes to the presentation of some elements by
adding CSS that lets me change ALL CAPS text to boldface, and the same for
mixed-case "headings". At the moment it's still "white-space: pre" on the
<body>, but that may (eventually) also change. Adding real <OL>s would also
be useful, although I don't know if it would be possible to revert those to the
old format. For the above, it's just a matter of swapping

text-transform: uppercase; and font-weight: bold;

to revert to the old typewriter style, and for <p>s it would just be a matter
of using or not using a "white-space: pre;" style.

> I first thought you could do things more flexibly. Since digital scanning hardly
> gets the number of spaces right automatically, you could let it produce text
> with single spaces between words and then use e.g.
>
> <p>that appear here. So here's the...</p>
>
> with
>
> p { width: 40ch; font-family: monospace; text-align: justify; }
>
> But it does not produce the same result. Text justification in browsers does not
> simply add space characters but instead stretches spacing between words evenly.

Which of course makes things look much better! The OCR'ed text never leaves more
than two spaces between words, and most of the time only one.

> Perhaps you should first analyze how the book content is best presented in
> modern media, instead of assuming that the original format should be preserved,
> even if it was simply caused by the limitations of tools.

Probably not at all. It's just that I like this kind of stuff, it's almost the
same as buying (usually for peanuts) old massive wooden lounge tables that are
black, and sanding them down and polishing them back to the light oak colour the
raw wood once had, and then selling them, not for the nice amount I get for them
(buy: eur 2.50, sell: eur 250) but just to not see them end up as firewood!

> (About 40 years ago, I co-authored a book that was published as typewritten, in
> a monospace font. If someone wanted to publish it now in digital format, it
> would be pointless to imitate that format, and I would surely use my rights to
> forbid that!)

:)

Robert Prins

May 16, 2022, 5:02:46 PM

On 2022-04-02 18:29, Robert Prins wrote:
> I'm about to start the conversion of a 1980s book from book to html. One
> of the quirky things is its layout: 635 pages of text, 40-character-wide
> pages, a fixed-pitch font (Courier?), and justified text,

OK, I've finished the initial conversion!

The result, minus the covers (front/back), can be found at
<https://prino.neocities.org/temp/hey_now_hitchhikers!.html>, and it's exactly
the same as the old book, except for the removal of a few typos (and yes, I'm
sure I've introduced additional typos; I need to proofread again, and again, and
again...) and the swapping back of some pages that were swapped in the original
paper version, or at least in the scan I have. My grandson will only come back
from the US with the real paper copy in two weeks or so.

And yes, it's ugly as, well you know, fluck...

However, during the conversion I've already added some CSS to make things more
bearable. Obviously I will have to do a lot more, but I would like to allow the
user to optionally select the original format, or maybe optionally change the
original format into something more AD 2022-ish. There seems to be plenty of
JavaScript available to do so; for now I'm using something very simple from
<https://www.geeksforgeeks.org/how-to-switch-between-multiple-css-stylesheets-using-javascript/>.

Right now I have some questions, but I guess more will follow:

1) How do I keep, at least for now, the "Switch" button visible?
2) How do I hide the page-numbers, and with them the space around them when in
"Modern" mode? CSS? Or do I need JS? Examples would be useful...
3) What would be a good (web) font to replace the "Courier New"/monospace?

Obviously any other hints would also be appreciated.
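
For questions 1 and 2, one possible approach (a sketch, not a description of how the page is actually built) is to toggle a single class on the root element and let two CSS rules do the rest: `position: fixed` keeps the button in view while scrolling, and `display: none` removes the page numbers along with their own margins. The class names "modern", "page-number" and "mode-switch" are hypothetical; the real markup may use different ones.

```javascript
// Hypothetical class names: "modern" on <html>, "page-number" on each
// page-number element, "mode-switch" on the button.
const MODE_CSS = `
  .mode-switch { position: fixed; top: 1em; right: 1em; }
  .modern .page-number { display: none; }
`;

// Flip between the two modes; `root` is any element with a classList.
// classList.toggle returns true when the class is present afterwards.
function toggleMode(root) {
  return root.classList.toggle("modern") ? "modern" : "original";
}

// Browser-only wiring; load the script with `defer` or at the end of
// <body> so that the button already exists when this runs.
if (typeof document !== "undefined") {
  const style = document.createElement("style");
  style.textContent = MODE_CSS;
  document.head.appendChild(style);

  const button = document.querySelector(".mode-switch");
  if (button) {
    button.addEventListener("click", () => toggleMode(document.documentElement));
  }
}
```

The two rules could equally live in the external stylesheet; they are injected from the script here only to keep the sketch in one piece. If the space around a page number comes from neighbouring elements rather than from the number's own box, those would need a `.modern` rule of their own.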

I've only started on the actual "modernisation", so please bear with me!

Thanks,

Robert Prins

May 29, 2022, 4:52:29 PM

On 2022-04-04 07:36, Jukka K. Korpela wrote:
> Robert Prins wrote:
>
>> I'm about to start the conversion of a 1980s book from book to html. One of
>> the quirky things is its layout: 635 pages of text, 40-character-wide
>> pages, a fixed-pitch font (Courier?), and justified text,

<snip>

> Perhaps you should first analyze how the book content is best presented in
> modern media, instead of assuming that the original format should be preserved,
> even if it was simply caused by the limitations of tools.
>
> (About 40 years ago, I co-authored a book that was published as typewritten, in
> a monospace font. If someone wanted to publish it now in digital format, it
> would be pointless to imitate that format, and I would surely use my rights to
> forbid that!)

I've now finished the conversion
<https://prino.neocities.org/temp/hey_now_hitchhikers!.html>, although there are
a few minor loose ends to tie up, and an annoying format-changing button that
needs to be put somewhere "out-of-the-way".

The default display is in "modern" format, although I'm still using a
monospace font, as I haven't been able to figure out how to replace the typed-out
numbered lists with real ordered lists. That would also allow me to change the
font; doing so now messes up the indentation somewhat.

If anyone has any suggestions as to how I can improve things further, I'd be
grateful.
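
One way to get from typed-out numbered lists to real ordered lists is to convert them in a script before pasting the result into the page. The sketch below assumes each item starts with a number followed by "." or ")" and that continuation lines carry no such prefix; the book's actual list layout is not shown in this thread, so the pattern will probably need adjusting. The function names are made up for illustration.

```javascript
// Escape the characters that are special in HTML text content.
function escapeHtml(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Turn lines such as "1. first item" plus continuation lines into an <ol>.
// Assumption: an item starts with digits followed by "." or ")".
function linesToOrderedList(lines) {
  const items = [];
  for (const line of lines) {
    const match = /^\s*\d+[.)]\s+(.*)$/.exec(line);
    if (match) {
      items.push(match[1].trim()); // start of a new item
    } else if (items.length > 0 && line.trim() !== "") {
      items[items.length - 1] += " " + line.trim(); // continuation line
    }
  }
  const body = items.map((item) => `  <li>${escapeHtml(item)}</li>`).join("\n");
  return `<ol>\n${body}\n</ol>`;
}
```

Once the numbers come from the <ol> itself rather than from typed characters, the indentation no longer depends on a monospace font. A continuation line that happens to begin with something like "1982." would be mistaken for a new item, so the output needs a read-through.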

For what it's worth, I've managed to track down one of the authors, Larry Evans,
although my attempts to contact him have so far been unsuccessful: the
telephone rings but isn't answered, a LinkedIn connect request is still
outstanding, and the only email address I found bounces, so I might have to
resort to snail mail. ;)

Thanks,