Converting a scanned PDF to html

Robert Prins

Jan 17, 2021, 12:15:34 PM
At <http://docplayer.org/12394012-An-halterwvesen-und-an-haltergefah-ren.html>
there's a report by the German Bundeskriminalamt about hitchhiking. Access to it
is, to say the least (and sticking to German), not very "anwenderfreundlich",
and the text version is readable but lacks the tables.

As I link to this page from one of the pages on my site
<https://prino.neocities.org/sylvain_viard/sylvain_viard.html>, and as I don't
have anything better to do right now, thanks to those little spikey balls
floating around, I decided that it might be useful to convert the PDF to html.

The BKA has given me permission to do so:

<quote>
Dear Mr. Prins,

thank you for contacting the German Federal Criminal Police Office
(Bundeskriminalamt - BKA).

I can happily inform you, that you have the permission from the
“Bundeskriminalamt” to create a html-version out of the pdf-version of the book
and use it on your website.

Unfortunately we do not have a copy that we can send you. [RP: I had asked if
they might still have a copy lying around in pre-PDF format]

I hope I could help you.

Kind regards

by order

Dimitrakis

________________________
Bundeskriminalamt
Internet: https://www.bka.de
</quote>

It turns out that the option to download the PDF as Word on the above site
doesn't work (I gave up after Ms PacMan was still biting after nearly an hour),
but the text in the PDF is selectable, although with plenty of spelling errors,
but those are easy to correct when looking at the PDF.

The current version can be found at
<https://prino.neocities.org/www/Anhalterwesen_und_Anhaltergefahren.html> and
it's far from final. I'm doing a basic conversion of the text (even inserting
<h3> tags with the page numbers), and at the rate I'm going, I might convert the
whole PDF in two or three weeks...

However, there are some items I would like to have suggestions on:

1) Font

Do I go for monospace, like the original report, or do I go for something more(?)
friendly on the eyes?

2) Footnotes

Obviously they don't make sense in html, so I'm thinking about using
<details><summary> </summary> </details> tags to place them in-line,
probably/possibly underlining (on hover) of the "xx)" text.

3) Tables

They don't cut & paste, so I'll have to convert them, and here I've hit a snag:
I can code my way around it, but it's ugly.

Explanation: If you look at the tables in the PDF, the first is on page 14 (26
in the PDF), it has a double outside border and a single inside one, but most
cells don't have top or bottom borders.

I've tried removing them with in-<td> styles, to no avail, so for table 3 (on
page 16, PDF page 27) I've hacked my way around it by putting all per-column
items in a single cell, separated by <BR> tags. You can find the original and
hacked copies by doing a find on "Tab. 3: Straftaten durch Anhalter und an
Anhaltern".

It works, except of course the inside borders are still double, but I can live
with that, but as I wrote, it's ugly. Is there a better way?

And how do you create the inverted "L" shaped tables that are on PDF pages 83
and 117, to name just two?

Obviously I will ***not*** rotate any tables!

4) Images? Not there yet. The first one is on PDF page 71 (just cut & paste),
and then there are the graphs on PDF page 106, where SVG would seem to be the
logical option (having also converted many of the original PNGs in the Sylvain
Viard document to that format).

Those are the questions for now, looking forward to your suggestions,

Thanks,

Robert
--
Robert AH Prins
robert(a)prino(d)org
The hitchhiking grandfather - https://prino.neocities.org/indez.html
Some REXX code for use on z/OS - https://prino.neocities.org/zOS/zOS-Tools.html

Thomas 'PointedEars' Lahn

Jan 17, 2021, 6:37:54 PM
Robert Prins wrote:

> At
> <http://docplayer.org/12394012-An-halterwvesen-und-an-haltergefah-ren.html>
>
> It turns out that the option to download the PDF as Word on the above site
> doesn't work (I gave up after Ms PacMan was still biting after nearly an
> hour),

WFM.

> but the text in the PDF is selectable, although with plenty of
> spelling errors, but those are easy to correct when looking at the PDF.

As PDF is based on PostScript, there are tools like ps2txt (alias for
ps2ascii(1) which is an alias for gs(1), the GhostScript binary) which can
extract text from PDF documents automatically. It appears to work quite
well with the downloaded PDF document, in case you are still unable to
download the Word document.

There are also tools called “pdf2html”. One is an npm package and requires
a JRE [1], but there are others, both command-line tools and Web sites.
Just google it.

[1] <https://www.npmjs.com/package/pdf2html>

> 1) Font
>
> Do I go for monospace, like the original report, or do I something more(?)
> friendly on the eyes?

That depends on the degree to which you want to preserve the original document.

If you are not doing this for archiving purposes, I suggest declaring a
list of sans-serif variable-width font families instead, with the more
preferable font family in front and ending the list with the generic
“sans-serif”. A possible list that can be recommended is

body {
  font-family: Verdana, Geneva, Arial, Helvetica, sans-serif;
}

(YMMV. For example, typographers would probably frown at me for including
“Arial” there, or because I put it before “Helvetica”.)

If you are not into typography, or do not have the time to educate yourself
about it, simply declare only “sans-serif”.

You might need to set the font-family for some descendant elements as well.
(Implementations are inconsistent.)

> 2) Footnotes
>
> Obviously they don't make sense in html,

They do, just not as page-end notes as, contrary to popular belief, there
are no “_HTML_ pages”. They could be footnotes in the table footer,
section-end notes, or text-end notes.

> so I'm thinking about using
> <details><summary> </summary> </details> tags to place them in-line,
> probably/possibly underlining (on hover) of the "xx)" text.

I do not think this is the correct HTML markup for footnotes. See also:

<https://developer.mozilla.org/en-US/docs/Web/HTML/Element/details>

Footnotes as small linked superscript text are working for me. I would
suggest inspecting Wikipedia for how footnotes should be done (BTDT). You
can also combine that with my Accessible Pure CSS Tooltips (license is
GPLv3) that I am using on <http://PointedEars.de/es-matrix>.

> 3) Tables
>
> Don't cut & paste, so I'll have to convert them and here I've hit a snag,
> I can code myself around it, but it's ugly.

The problem may be solved now that you can download the Word document.
However, if you cannot, then you may be able to make your life a little easier
by changing the text (if still necessary) to the following (CSV) format
(without indentation):

td_content;td_content;td_content …
td_content;td_content;td_content …

Then you can first apply the replacement

; → </td><td>

and then (e.g. using regular expressions)

^ (start of line) → <tr><td>
$ (end of line) → </td></tr>

(Use another delimiter if it is obvious that the delimiter occurs in the
data.)

Then surround all rows with

<table>

and

</table>

after which you can make adjustments like <td> → <th>, rowspan, colspan and
accessibility attributes.
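
As a small illustration of the steps above, this is roughly what two semicolon-separated rows would look like once the replacements have been applied and the first row has been promoted to header cells. The cell contents are placeholders invented for this sketch, not data from the report.

```html
<!-- Input text before the replacements (placeholder values):
     Header 1;Header 2;Header 3
     value a;value b;value c
-->
<table>
  <thead>
    <!-- first row: <td> changed to <th> by hand afterwards -->
    <tr><th scope="col">Header 1</th><th scope="col">Header 2</th><th scope="col">Header 3</th></tr>
  </thead>
  <tbody>
    <!-- result of "; → </td><td>", "^ → <tr><td>" and "$ → </td></tr>" -->
    <tr><td>value a</td><td>value b</td><td>value c</td></tr>
  </tbody>
</table>
```

The "thead"/"tbody" wrappers are an extra manual step, not something the three replacements produce by themselves.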

I also remember having seen a tool that can do this conversion from text
rows to HTML tables automatically, but I do not remember its name and the
circumstances.

> Explanation: If you look at the tables in the PDF, the first is on page 14
> (26 in the PDF), it has a double outside border and a single inside one,
> but most cells don't have top or bottom borders.

Although it may look old-fashioned, the latter is actually how *simple*
*data* tables SHOULD be done. For example, it is a standing recommendation
for LaTeX tables in scientific works: Only draw horizontal lines (“\hline”
or “\midrule”) to separate *groups* of rows. (In HTML this can be achieved
with a “thead” and one or more “tbody” elements.)

That the original table style may not be suitable for the Web does not mean
that copy-and-paste is necessarily a bad idea. In my PDF reader “Okular”
(version 1.3.2) at least, only the text from that table is copied then.
Once you have the text in the cells using proper table markup, the borders
can be easily styled with CSS. For example, something like

table { border-collapse: collapse; border: 3px double black; } /* "double" needs at least 3px to show two lines */
thead tr { border-bottom: 2px solid black; }
th, td { padding: 0.25em; border-right: 2px solid black; }

would come closest to the original table style. (Whether you want to do
that depends on how much you want to preserve the original.)

I would put the table footnotes in the “tfoot” element (BTDT).

> And how do you create the inverted "L" shaped tables that are on PDF pages
> 83 and 117, to name just two?

In the case of the table on page 83 of the PDF document, simply omit the
last 4 table cells in each row, or add empty cells but style them so that
they are not visible.

> Obviously I will ***not*** rotate any tables!

I do not see the need for any rotation in the first place :)

> 4) Images? Not yet there, first one on PDF page 71 (Just Cut & Paste) or
> the graphs on PDF page 106, where SVG would seem to the logical option,

Unless you want to do some fancy visualization, if you only want to link to
further information about the area of the map, a simple image map (“map” and
“img” element) will suffice (and will be most backwards-compatible). Since
the map contours only have to be approximate, this will be a lot easier to
do than to recreate the map exactly with SVG (unless you have an image
editor that can convert bitmaps to SVG easily – let me know which one,
then).

Otherwise only extract the image using e.g. The GIMP or ImageMagick
convert(1), and add an “img” element.
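
A minimal sketch of the image-map approach mentioned above might look as follows. The file name, the image dimensions, the coordinates and the link targets are all placeholders assumed for illustration; the real values would have to be read off the extracted image.

```html
<!-- All names and numbers below are hypothetical. -->
<img src="page-71.png" alt="Image from PDF page 71"
     width="600" height="400" usemap="#areas">
<map name="areas">
  <!-- rect: left,top,right,bottom -->
  <area shape="rect" coords="20,30,180,150"
        href="#area-1" alt="First area">
  <!-- poly: x1,y1,x2,y2,... following an approximate outline -->
  <area shape="poly" coords="200,160,320,170,300,300,190,280"
        href="#area-2" alt="Second area">
</map>
```

The "usemap" value must match the "name" of the "map" element, prefixed with "#".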

--
PointedEars
<https://github.com/PointedEars> | <http://PointedEars.de/wsvn/>
Twitter: @PointedEars2
Please do not cc me. /Bitte keine Kopien per E-Mail.

Jukka K. Korpela

Jan 19, 2021, 3:44:02 AM
Robert Prins wrote:

> The current version can be found at
> <https://prino.neocities.org/www/Anhalterwesen_und_Anhaltergefahren.html> and
> it's far from final.

Starting from a PDF document in order to produce an HTML version is
rather awkward, but in this case perhaps necessary (assuming you really
want to create an HTML version). Making the officials find you another
version might take a very long time and may well fail.

Doing the conversion basically by hand, just extracting the content as
plain text and adding markup, is probably what I would do, too. There
are various ways to try to automate the process, but there are many
problems and even if it were somehow successful, you would probably
still need to do manual fixes (or program tuning) a lot.

> however, there are some items I would like to have suggestions on:
>
> 1) Font
>
> Do I go for monospace, like the original report, or do I something
> more(?) friendly on the eyes?

This depends on your goals and limitations. If you are just creating a
“facsimile” reproduction of the PDF document in HTML, you would try to
preserve its visual appearance as far as possible. But how far could you
go then, and what would then be the point of the whole process?

Since the use of a monospace font is probably a tradition from the
typewriter era and since it makes the text more difficult to read, the
simplest approach is to omit all font settings, letting each browser use
its default font. Alternatively, set a reasonable proportional font.

Using justified text does not work well unless you use some hyphenation.
German text has so many long words that especially in a narrow viewing
area, the appearance becomes poor. You might consider some hyphenation
(like manually added &shy; in longest compound words) even if you keep
using justification.
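
To illustrate the hyphenation hint, here is a small sketch with manually inserted soft hyphens in two of the long compounds from the report's title. The "hyphens: auto" declaration is an addition of this sketch, not something suggested above; it only has an effect where the browser has hyphenation support for the language declared in "lang", so the manual &shy; remains the fallback. The class name is an assumption.

```html
<style>
  /* class name "report" is hypothetical */
  .report { text-align: justify; hyphens: auto; }
</style>
<div class="report" lang="de">
  <!-- &shy; marks a permitted break point; it is invisible unless the word is broken there -->
  <p>Anhalter&shy;wesen und Anhalter&shy;gefahren</p>
</div>
```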

> 2) Footnotes
>
> Obviously they don't make sense in html, so I'm thinking about using
> <details><summary> </summary> </details> tags to place them in-line,
> probably/possibly underlining (on hover) of the "xx)" text.

You would run into the problem that <details> is a block element.

The simplest approach is probably to put the footnotes in a separate
file and make the footnote references links to elements in that file.
You might consider embedding that file in the main document with
<iframe>, but you can do that later. (Well, an even simpler approach is
to make the references links to elements at the end of the document,
where you would put the footnote texts. But then you would probably need
to have back-references, like on Wikipedia pages, so that after
following a link to a footnote, the user can easily get back to the place
where the reference is.)
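
The end-of-document variant with back-references could be sketched like this. The id values and the texts are placeholders; only the pairing of "href" and "id" in both directions matters.

```html
<!-- In the running text: the reference links down to the note. -->
<p>Text with a footnote reference.<sup><a id="ref-1" href="#note-1">1)</a></sup></p>

<!-- At the end of the document (or of a chapter): the note links back up. -->
<ol>
  <li id="note-1">Text of the footnote.
    <a href="#ref-1" aria-label="Back to reference 1">&#8617;</a></li>
</ol>
```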

> 3) Tables
>
> Don't cut & paste, so I'll have to convert them and here I've hit a
> snag, I can code myself around it, but it's ugly.
>
> Explanation: If you look at the tables in the PDF, the first is on page
> 14 (26 in the PDF), it has a double outside border and a single inside
> one, but most cells don't have top or bottom borders.

I’m not sure I see what the problem is. Do you think you need to
replicate such use of borders, instead of simply having a table with the
correct data and a suitable rendering? Anyway, if you don’t want to have
double borders between cells, set border-collapse: collapse on the table
element.

> And how do you create the inverted "L" shaped tables that are on PDF
> pages 83 and 117, to name just two?

I’m not sure what the structure there is. Just two different tables
touching each other? But you can make them a single table by using empty
cells (and making sure they don’t show: empty-cells: hide).
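
A sketch of such a single table with hidden empty cells follows. One caveat that is easy to miss: "empty-cells" only applies in the separated borders model, so it cannot be combined with "border-collapse: collapse" on the same table. The class name and the cell contents are placeholders.

```html
<style>
  /* class name "lshape" is hypothetical */
  table.lshape { border-collapse: separate; border-spacing: 0; empty-cells: hide; }
  table.lshape th, table.lshape td { border: 1px solid black; padding: 0.25em; }
</style>
<table class="lshape">
  <tr><th>A</th><th>B</th><th>C</th><th>D</th></tr>
  <tr><td>1</td><td>2</td><td>3</td><td>4</td></tr>
  <!-- the two trailing cells are empty, so no border or background is drawn for them -->
  <tr><td>5</td><td>6</td><td></td><td></td></tr>
</table>
```

With "border-spacing: 0" the borders of adjacent cells sit side by side, so the inner lines appear twice as thick as the declared width; that is a limitation of this sketch.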

> 4) Images? Not yet there, first one on PDF page 71 (Just Cut & Paste) or
> the graphs on PDF page 106, where SVG would seem to the logical option,
> having also converted many of the original PNG's in the Sylvain
> Viard document to that format)

Depends on the images of course, but normally PNG should be sufficient
for tables that you find in administrative documents.

Robert Prins

Jan 19, 2021, 4:52:18 AM
On 2021-01-19 08:43, Jukka K. Korpela wrote:
> Robert Prins wrote:
>
>> The current version can be found at
>> <https://prino.neocities.org/www/Anhalterwesen_und_Anhaltergefahren.html> and
>> it's far from final.
>
> Starting from a PDF document in order to produce an HTML version is rather
> awkward, but in this case perhaps necessary (assuming you really want to create
> an HTML version). Making the officials find you another version might take a
> very long time and may well fail.

They've already told me that there is no Wordstar/WP/Word/etc version available.
There may be one at the university, but as you wrote, it may take ages to find it.

> Doing the conversion basically by hand, just extracting the content as plain
> text and adding markup, is probably what I would do, too. There are various ways
> to try to automate the process, but there are many problems and even if it were
> somehow successful, you would probably still need to do manual fixes (or program
> tuning) a lot.

The PDF text is selectable, albeit with a not insignificant number of errors,
but given that as a Dutchman I had German at school, it's easy to (proof)read
and spot obvious errors.

>> however, there are some items I would like to have suggestions on:
>>
>> 1) Font
>>
>> Do I go for monospace, like the original report, or do I something more(?)
>> friendly on the eyes?
>
> This depends on your goals and limitations. If you are just creating a
> “facsimile” reproduction of the PDF document in HTML, you would try to preserve
> its visual appearance as far as possible. But how far could you go then, and
> what would then be the point of the whole process?

The points of the html conversion are:

1) smaller size (but who cares nowadays, when actual visible page content might
be as little as 1 or 2% of the page size). The 269-page PDF is 12.5 MB; I'm now
at PDF page 63, and my html page is just 135 kB!

2) Access, the PDF is just a big file without any means of moving around to
specific chapters.

> Since the use of a monospace font is probably a tradition from the typewriter era
> and since it makes the text more difficult to read, the simplest approach is to
> omit all font settings, letting each browser use its default font.
> Alternatively, set a reasonable proportional font.

I've removed the monospace, no clue what font it now uses, but it most
definitely looks better.

> Using justified text does not work well unless you use some hyphenation. German
> text has so many long words that especially in a narrow viewing area, the
> appearance becomes poor. You might consider some hyphenation (like manually
> added &shy; in longest compound words) even if you keep using justification.

There are quite a few soft-hyphenated words in the text, I might add some soft
hyphens in the longest words. What's your opinion on the current text-width
(700px)?
<file:///D:/01-lift/02-prino.neocities.org/www/Anhalterwesen_und_Anhaltergefahren.html>
For me it seems to be a reasonable compromise.

>> 2) Footnotes
>>
>> Obviously they don't make sense in html, so I'm thinking about using
>> <details><summary> </summary> </details> tags to place them in-line,
>> probably/possibly underlining (on hover) of the "xx)" text.
>
> You would run into the problem that <details> is a block element.

As an earlier post here has already explained.

> The simplest approach is probably to put the footnotes in a separate file and
> make the footnote references links to elements in that file. You might consider
> embedding that file in the main document with <iframe>, but you can do that
> later. (Well, an even simpler approach is to make the references links to
> elements at the end of the document, where you would put the footnote texts. But
> then you would probably need to have back-references, like on Wikipedia pages,
> so that after following a link to a footnote, the user can easily get back to
> place where the reference is.)

I might go for a conversion to end-notes, with a link back.

And I've come across one site, which I didn't bookmark, that suggested that
footnotes might be the one thing that tooltips are useful for, although
accessibility of tooltips is not their strong point, to put it mildly.

>> 3) Tables
>>
>> Don't cut & paste, so I'll have to convert them and here I've hit a snag, I
>> can code myself around it, but it's ugly.
>>
>> Explanation: If you look at the tables in the PDF, the first is on page 14 (26
>> in the PDF), it has a double outside border and a single inside one, but most
>> cells don't have top or bottom borders.
>
> I’m not sure I see what the problem is. Do you think you need to replicate such
> use of borders, instead of simply having a table with the correct data and a
> suitable rendering? Anyway, if you don’t want to have double borders between
> cells, set border-collapse: collapse on the table element.

I'll probably stick with what I have now, adding border-collapse to individual
cells is just too much work. But how do you completely hide (top/bottom only)
borders on individual cells? My approach of using <br> and putting everything
into one cell works, although it's not very nice.
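
One possible answer to the question above, as a sketch only: "border-collapse" is declared once on the table rather than per cell, and the top and bottom borders of the body cells are switched off with a two-value "border-style". The class name and cell contents are placeholders, and the widths are guesses at the look of the PDF rather than measured values.

```html
<style>
  /* class name "plain-rows" is hypothetical */
  table.plain-rows { border-collapse: collapse; border: 3px double black; }
  /* header cells keep all four borders, which gives one rule under the header row */
  table.plain-rows th { border: 1px solid black; padding: 0.25em; }
  /* two values: first is top/bottom (none), second is left/right (solid) */
  table.plain-rows td { border-style: none solid; border-width: 1px;
                        border-color: black; padding: 0.25em; }
</style>
<table class="plain-rows">
  <thead><tr><th>A</th><th>B</th></tr></thead>
  <tbody>
    <tr><td>1</td><td>2</td></tr>
    <tr><td>3</td><td>4</td></tr>
  </tbody>
</table>
```

With collapsed borders, "none" loses against any other border style on the same edge, so the header's bottom rule and the table's outer double border still show.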

>> And how do you create the inverted "L" shaped tables that are on PDF pages 83
>> and 117, to name just two?
>
> I’m not sure what the structure there is. Just two different tables touching
> each other? But you can make them a single table by using empty cells (and
> making sure they don’t show: empty-cells: hide).

That would do the trick, didn't know about it.

>> 4) Images? Not yet there, first one on PDF page 71 (Just Cut & Paste) or the
>> graphs on PDF page 106, where SVG would seem to the logical option, having
>> also converted many of the original PNG's in the Sylvain Viard document to
>> that format)
>
> Depends on the images of course, but normally PNG should be sufficient for
> tables that find in administrative documents.

Once I get to them, currently only on page 63 of the PDF, I'll make a decision,
I'll probably go for PNG's first, but the plans of lines of public transport on
the final pages should be (fairly) easy to convert to SVG. (Then again the PNG's
might actually be smaller...)

Helmut Richter

Jan 19, 2021, 7:04:37 AM
On Tue, 19 Jan 2021, Robert Prins wrote:

> They've already told me that there is no Wordstar/WP/Word/etc version
> available. There may be one at the university, but as you wrote, it may take
> ages to find it.

Well, for a text from the 1980s, this may not be easy. I wonder whether
the content is still interesting today but that is not my problem.

If the purpose is to get a machine-readable text with fewer errors than
what the OCR'd PDF provides, that would be fine. You may consider an
OCR-only tool like "tesseract" instead; I prefer that but I am not sure
whether this is just snake oil.

If the purpose is to get a formatted text with headlines and paragraphs,
forget it. Whenever I had to translate a Word document into HTML, I have first
extracted the plain text and then added the markup. This is much less work
than removing the Word-specific markup which is to ensure that the outcome
looks exactly like the Word document that was the source. Moreover, you save 80
or 90 % of the markup, and the HTML text is correct HTML and human readable.

> > Doing the conversion basically by hand, just extracting the content as plain
> > text and adding markup, is probably what I would do, too. There are various
> > ways to try to automate the process, but there are many problems and even if
> > it were somehow successful, you would probably still need to do manual fixes
> > (or program tuning) a lot.

There is a handy middle way: Make sure that headlines and paragraphs are at
the right places in the plain-text file, and use a tool like "markdown" to
actually insert HTML tags. This model is how Wikipedia works for the authors.

--
Helmut Richter

Robert Prins

Jan 19, 2021, 11:23:20 AM
On 2021-01-19 12:04, Helmut Richter wrote:
> On Tue, 19 Jan 2021, Robert Prins wrote:
>
>> They've already told me that there is no Wordstar/WP/Word/etc version
>> available. There may be one at the university, but as you wrote, it may take
>> ages to find it.
>
> Well, for a text from the 1980ies, this may not be easy. I wonder whether
> the content is still interesting today but that is not my problem.

It's one of the very few studies about the dangers of hitchhiking, and that
makes it "kind of interesting".

> If the purpose is to get a machine-readable text with less errors than
> what has PDF produced by OCR, that would be fine. You may consider a
> OCR-only tool like "tesseract" instead; I prefer that but I am not sure
> whether this is just snake oil.
>
> If the purpose is to get a formatted text with headlines and paragraphes,
> forget it. Whenever I had to translate a Word document into HTML, I have first
> extracted the plain text and then added the markup. This is much less work
> than removing the Word-specific markup which is to ensure that the outcome
> looks exactly like the Word document that was the source. Moreover, you save
> 80 or 90 % of the markup, and the HTML text is correct HTML and human readable.

I've been given a converted-to-Word version, which is as good as useless, as it
contains all the typos and, worse, the scan artifacts as images. So I just cut &
paste one page of the PDF at a time, remove all the spelling errors (I hope),
add basic html, i.e. <h2/3/4>, <p> and <sup> notes for the footnotes, slap a
16-character <hr> before any footnotes, and a full <hr> at the end of the page.
A hitchhiking friend will proofread the thing again, and having the separators
makes it a bit easier to see where you are. They will obviously be removed from
the final version, and the footnotes will become end-notes, with back-links.

Tables take a bit more time, but all in all I process about 4-8 pages per hour,
which is good enough. I haven't got much else to do right now: it's way too cold
in Vilnius to go onto the balcony and continue my sanding and painting work, and
there are sadly too many police checkpoints to go out and hitchhike. (And yes,
even with Covid-19 on the rampage, people still stop for hitchhikers.)

>>> Doing the conversion basically by hand, just extracting the content as plain
>>> text and adding markup, is probably what I would do, too. There are various
>>> ways to try to automate the process, but there are many problems and even if
>>> it were somehow successful, you would probably still need to do manual fixes
>>> (or program tuning) a lot.

On my PC I have a (Pascal) program that converts the output of my main
statistics processing program into RTF, and on z/OS I've got a 5,000+ line REXX
edit macro to do the same, and keeping those working while adding tables here,
there, and everywhere is more than enough. Writing code for what's in essence a
one-off task makes no sense, just convert it and be done with it!

> There is a handy middle way: Make sure that headlines and paragraphs are at
> the right places in the plain-text file, and use a tool like "markdown" to
> actually insert HTML tags. This model is how Wikipedia works for the authors.

Never heard of it, but even that might be overkill given the simplicity of the html.

For the rest, thank you for your comments!

Robert
--
Robert AH Prins
robert(a)prino(d)org
The hitchhiking grandfather @ https://prino.neocities.org/
Some useful(?) REXX @ https://prino.neocities.org/zOS/zOS-Tools.html

Helmut Richter

Jan 19, 2021, 12:01:50 PM
It does only as much as you would program yourself in some script language
for the same purpose. When I learnt about it, I already had such a script
of my own with much the same simple interface.

https://nl.wikipedia.org/wiki/Markdown

--
Helmut Richter

Robert Prins

Jan 20, 2021, 4:24:34 PM
On 2021-01-19 08:43, Jukka K. Korpela wrote:
> Robert Prins wrote:
>
>> The current version can be found at
>> <https://prino.neocities.org/www/Anhalterwesen_und_Anhaltergefahren.html> and
>> it's far from final.

<snip>

>> 3) Tables
>>
>> Don't cut & paste, so I'll have to convert them and here I've hit a snag, I
>> can code myself around it, but it's ugly.
>>
>> Explanation: If you look at the tables in the PDF, the first is on page 14 (26
>> in the PDF), it has a double outside border and a single inside one, but most
>> cells don't have top or bottom borders.
>
> I’m not sure I see what the problem is. Do you think you need to replicate such
> use of borders, instead of simply having a table with the correct data and a
> suitable rendering? Anyway, if you don’t want to have double borders between
> cells, set border-collapse: collapse on the table element.

Just for "fun", I've been fiddling with the tables to see if I can get the same
format as in the PDF, and while doing so, I found out that my "whole-of-site"
"style.css" is just not very useful, so I cut it down to the basics that I need
for this conversion.

I've managed to get one table to look like it "should" look, but in the process
I've lost the outside border on all 'class="pdftab"' tables, and even
Firebug'ing between the converted PDF and my
<https://prino.neocities.org/sandbox.html> I have been unable to get the border
around the second table, and I would really appreciate it if someone could
explain what I'm missing.

Thanks,

Robert

PS: And yes, the in-line styling on the <tr> tags still needs to go to CSS.

Thomas 'PointedEars' Lahn

Jan 21, 2021, 5:37:13 PM
Jukka K. Korpela wrote:

> Robert Prins wrote:
>> The current version can be found at
>> <https://prino.neocities.org/www/Anhalterwesen_und_Anhaltergefahren.html>
>> and it's far from final.
>
> Starting from a PDF document in order to produce an HTML version is
> rather awkward, but in this case perhaps necessary (assuming you really
> want to create an HTML version).

It is actually a common task in the industry as people (especially public
offices) can easily produce PDF documents by scanning sheets of hardcopies
or with a word processor, but often still do not have the manpower or
technical skills to produce clean HTML (documents) for use on a Web site.
So it is good for an aspiring Web developer to know how to do that.

I for one was tasked about a year ago with converting PDF documents,
produced by the Swiss Federal Office of Public Health, to HTML, so that the
information provided by them would be accessible. It basically still looks
the same as it did when I was finished:

<https://www.priminfo.admin.ch/de/zahlen-und-fakten>

(You can see there that some newer documents have not been converted to HTML
yet.)

F’up2 comp.infosystems.www.authoring.html

--
PointedEars
FAQ: <http://PointedEars.de/faq> | <http://PointedEars.de/es-matrix>

Thomas 'PointedEars' Lahn

Jan 21, 2021, 5:39:25 PM
Robert Prins wrote:

> On 2021-01-19 08:43, Jukka K. Korpela wrote:
>> Robert Prins wrote:
>>> The current version can be found at
>>> <https://prino.neocities.org/www/Anhalterwesen_und_Anhaltergefahren.html>
>>> and it's far from final.
>>
>> Starting from a PDF document in order to produce an HTML version is
>> rather awkward, but in this case perhaps necessary (assuming you really
>> want to create an HTML version). Making the officials find you another
>> version might take a very long time and may well fail.
>
> They've already told me that there is no Wordstar/WP/Word/etc version
> available.

As I told you already, I downloaded it from the very source that you
specified. So it certainly *is* available, even if it is the result of a
conversion.

Robert Prins

Jan 22, 2021, 3:40:50 PM
For some reason, I missed this post a few days ago, not good...

On 2021-01-17 23:37, Thomas 'PointedEars' Lahn wrote:
> Robert Prins wrote:
>
>> At
>> <http://docplayer.org/12394012-An-halterwvesen-und-an-haltergefah-ren.html>
>>
>> It turns out that the option to download the PDF as Word on the above site
>> doesn't work (I gave up after Ms PacMan was still biting after nearly an
>> hour),
>
> WFM.
>
>> but the text in the PDF is selectable, although with plenty of
>> spelling errors, but those are easy to correct when looking at the PDF.
>
> As PDF is based on PostScript, there are tools like ps2txt (alias for
> ps2ascii(1) which is an alias for gs(1), the GhostScript binary) which can
> extract text from PDF documents automatically. It appears to work quite
> well with the downloaded PDF document, in case you are still unable to
> download the Word document.
>
> There are also tools called “pdf2html”. One is an npm package and requires
> a JRE [1], but there are others, both command-line tools and Web sites.
> Just google it.

I've actually used the one on the Adobe site, and it's given me a 13 MB .RTF
file, which is just as useful as the PDF, both allow me to cut&paste text, with
the same errors that need fixing. "Zusammenhang" is a recurrent problem.

>> 1) Font
>>
>> Do I go for monospace, like the original report, or do I go for something
>> more(?) friendly on the eyes?
>
> That depends on to which degree you want to preserve the original document.

Nobody's going to print it, but if they want to, there's the PDF...

> If you are not doing this for archiving purposes, I suggest declaring a
> list of sans-serif variable-width font families instead, with the most
> preferable font family in front and ending the list with the generic
> “sans-serif”. A possible list that can be recommended is
>
> body {
>   font-family: Verdana, Geneva, Arial, Helvetica, sans-serif;
> }
>
> (YMMV. For example, typographers would probably frown at me for including
> “Arial” there, or because I put it before “Helvetica”.)
>
> If you are not into typography, or do not have the time to educate yourself
> about it, simply declare only “sans-serif”.
>
> You might need to set the font-family for some descendant elements as well.
> (Implementations are inconsistent.)

I'm using "Georgia,serif", like everywhere else on my site. Verdana is for me
one of those fonts that immediately provokes a "yuck" reaction. Tables use
"Courier New",monospace.

>> 2) Footnotes
>>
>> Obviously they don't make sense in html,
>
> They do, just not as page-end notes as, contrary to popular belief, there
> are no “_HTML_ pages”. They could be footnotes in the table footer,
> section-end notes, or text-end notes.

Section-end notes would be a nice compromise, section would for me be a Kapitel.

>> so I'm thinking about using
>> <details><summary> </summary> </details> tags to place them in-line,
>> probably/possibly underlining (on hover) of the "xx)" text.
>
> I do not think this is the correct HTML markup for footnotes. See also:
>
> <https://developer.mozilla.org/en-US/docs/Web/HTML/Element/details>
>
> Footnotes as small linked superscript text are working for me. I would
> suggest inspecting Wikipedia for how footnotes should be done (BTDT). You
> can also combine that with my Accessible Pure CSS Tooltips (license is
> GPLv3) that I am using on <http://PointedEars.de/es-matrix>.

Wikipedia footnotes together with the accessible tooltips are magic. I've looked
at them before, but I just couldn't figure out how they work, so, at least for
now, I'll go for a simple link to a "Notes section" at the section-end, with a
link back from there. Any suggestion on what you'd call this section in German?
I could probably create something in REXX: I recently wrote something in Regina
REXX that adds "profiling" code to my Pascal programs, and it takes just 7
seconds from pressing Enter to the final output, processing just under
80,000 lines of Pascal, compiling the modified code, and running the nine programs.
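
The "link to a Notes section, with a link back" scheme can be sketched in a few lines. This is only an illustration under assumptions: the id scheme (ref-…/note-…), the function names and the section key "k2" are hypothetical, and the German heading is left as a parameter because the thread leaves that question open.

```python
from html import escape


def note_ref(section, number):
    """Superscript link from the running text down to the note."""
    return (f'<sup><a id="ref-{section}-{number}" '
            f'href="#note-{section}-{number}">{number})</a></sup>')


def notes_section(section, notes, heading="Notes"):
    """Section-end list of notes, each with a link back to its reference.

    notes is a list of (number, text) pairs; the value attribute keeps the
    original footnote numbers even if they do not start at 1.
    """
    items = [
        f'<li id="note-{section}-{number}" value="{number}">{escape(text)} '
        f'<a href="#ref-{section}-{number}">&uarr;</a></li>'
        for number, text in notes
    ]
    return f"<h4>{escape(heading)}</h4>\n<ol>\n" + "\n".join(items) + "\n</ol>"


if __name__ == "__main__":
    print("Some sentence" + note_ref("k2", 12))
    print(notes_section("k2", [(12, "Text of footnote 12")]))
```

Using the section as part of each id keeps the anchors unique when footnote numbers restart in every Kapitel.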

As someone who's worked on IBM mainframes since 1985, I'm not very much into all
of the Windows/Unix/Linux tools, I know the basics of "grep" and "sed", but
that's about it!

>> Explanation: If you look at the tables in the PDF, the first is on page 14
>> (26 in the PDF), it has a double outside border and a single inside one,
>> but most cells don't have top or bottom borders.
>
> Although it may look old-fashioned, the latter is actually how *simple*
> *data* tables SHOULD be done. For example, it is a standing recommendation
> for LaTeX tables in scientific works: Only draw horizontal lines (“\hline”
> or “\midrule”) to separate *groups* of rows. (In HTML this can be achieved
> with a “thead” and one or more “tbody” elements.)
>
> That the original table style may not be suitable for the Web does not mean
> that copy-and-paste is necessarily a bad idea. In my PDF reader “Okular”
> (version 1.3.2) at least, only the text from that table is copied then.
> Once you have the text in the cells using proper table markup, the borders
> can be easily styled with CSS. For example, something like
>
> table { border-collapse: collapse; border: 2px double black; }
> thead tr { border-bottom: 2px solid black; }
> th, td { padding: 0.25em; border-right: 2px solid black; }
>
> would come closest to the original table style. (Whether you want to do
> that depends on how much you want to preserve the original.)

I've simply created four styles

.b0 {
  border-bottom: 0;
}

.t0 {
  border-top: 0;
}

.l0 {
  border-left: 0;
}

.r0 {
  border-right: 0;
}

to remove borders from <td> elements that I don't want, and might combine them
later into the likes of .bt0/.br0/.brt0/etc, and the tables I now get are,
except for some spacing, carbon copies of the originals. I'm not going to try to
get the exact spacing by adding (even more) &nbsp;'s.
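
The combined classes (.bt0, .br0, .brt0 and so on) do not have to be typed by hand. A minimal sketch that generates all of them, assuming the letters are always put in alphabetical order as in the names above; the function name is hypothetical:

```python
from itertools import combinations

# Letter used in the class name -> CSS side it removes the border from.
SIDES = {"b": "bottom", "l": "left", "r": "right", "t": "top"}


def border_rules():
    """Return one CSS rule per non-empty combination of sides."""
    rules = []
    for size in range(1, len(SIDES) + 1):
        for letters in combinations(sorted(SIDES), size):
            name = "".join(letters) + "0"
            decls = " ".join(f"border-{SIDES[c]}: 0;" for c in letters)
            rules.append(f".{name} {{ {decls} }}")
    return rules


if __name__ == "__main__":
    print("\n".join(border_rules()))
```

This yields 15 rules, from .b0 up to .blrt0. The alternative of putting several of the four basic classes on one element (class="b0 t0") needs no extra CSS at all.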

> I would put the table footnotes in the “tfoot” element (BTDT).

I'm currently only using "tbody", for now I prefer KISS.

>> And how do you create the inverted "L" shaped tables that are on PDF pages
>> 83 and 117, to name just two?
>
> In the case of the table on page 83 of the PDF document, simply omit the
> last 4 table cells in each row, or add empty cells but style them so that
> they are not visible.
>
>> Obviously I will ***not*** rotate any tables!
>
> I do not see the need for any rotation in the first place :)

I know, web-pages have an infinite width. :)
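
For the inverted "L" tables, the second suggestion above (add empty cells and style them away) could be generated along these lines. A sketch only: the class name "empty" is hypothetical and would need a matching rule such as .empty { border: 0; } in the stylesheet.

```python
from html import escape


def l_shaped_rows(rows, full_width):
    """Build <tr> elements; short rows are padded with borderless cells.

    rows is a list of lists of cell text, full_width the number of columns
    in the widest row.
    """
    out = []
    for cells in rows:
        tds = [f"<td>{escape(c)}</td>" for c in cells]
        # Pad with placeholder cells so every row has the same cell count.
        tds += ['<td class="empty"></td>'] * (full_width - len(cells))
        out.append("<tr>" + "".join(tds) + "</tr>")
    return "\n".join(out)


if __name__ == "__main__":
    print(l_shaped_rows([["a", "b", "c"], ["d"]], 3))
```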

>> 4) Images? Not yet there, first one on PDF page 71 (Just Cut & Paste) or
>> the graphs on PDF page 106, where SVG would seem to be the logical option,
>
> Unless you want to do some fancy visualization, if you only want to link to
> further information about the area of the map, a simple image map (“map” and
> “img” element) will suffice (and will be most backwards-compatible). Since
> the map contours only have to be approximate, this will be a lot easier to
> do than to recreate the map exactly with SVG (unless you have an image
> editor that can convert bitmaps to SVG easily – let me know which one,
> then).

There's something that converts BW images to SVG,
<http://potrace.sourceforge.net/>, you've probably heard about it.

At some stage I tried to convert the PNGs used in
<https://prino.neocities.org/mario_rinvolucri/chapter2.html> to SVG (after
rescanning them; I've actually got the book, and the original GIFs @
<http://bernd.wechner.info/Hitchhiking/Mario/chapter2.html> are too low-res), but
I seem to remember that the resulting SVGs were bigger than the "PNGOUT"
compressed PNGs. For what it's worth, Bernd Wechner's webified version of this
book uses

"font-family: Arial, Helvetica, sans-serif;"

whereas I don't specify any font, which seems to result in "Times New Roman"
with Firefox (and Edge), which I find easier on the eyes.

> Otherwise only extract the image using e.g. The GIMP or ImageMagick
> convert(1), and add an “img” element.

I've actually found an SVG image of Saarland on Wikipedia, and after hacking it
into something more compact (Inkscape files contain a hell of a lot of bloat),
it's in the current version @
<https://prino.neocities.org/www/Anhalterwesen_und_Anhaltergefahren.html> (do a
find on "abb"). Not sure if the yellow(ish) is the best colour, and the towns
were added on a "looks OK" basis. (For what it's worth, the current SVG still
contains two groups that are nearly identical, but I've been unable to merge
them; doing so would reduce the size even more!)

Helmut Richter

Jan 23, 2021, 5:38:27 AM
On Fri, 22 Jan 2021, Robert Prins wrote:

> > > <http://docplayer.org/12394012-An-halterwvesen-und-an-haltergefah-ren.html>
> > >
> > > It turns out that the option to download the PDF as Word on the above site
> > > doesn't work (I gave up after Ms PacMan was still biting after nearly an
> > > hour),

Have you ever published a URL of the original PDF version (i.e. the one
that is most original relative to the other versions)?

Without it, one can hardly say anything about the quality of other
versions produced from it.

--
Helmut Richter

Robert Prins

Jan 23, 2021, 7:25:28 AM
There's another copy of the same PDF around on the site of a HH friend @
<http://www.franknature.nl/anhalterwesen.pdf>, but those are as far as I know
the only two, other than that it also shows up on Google on sites with, to say
the least, "dodgy" names that I wouldn't touch with a bargepole.

There also seem to be some paper copies around, just Google the title.

The .RTF version can be (temporarily) found @
<https://prino.neocities.org/www/Anhalterwesen_und_Anhaltergefahren.rtf>. It
renders OK in Word (Word 2002), and very badly in LO Writer, and contains a
horrible amount of "images", scan artifacts.

Helmut Richter

Jan 23, 2021, 11:21:26 AM
On Sat, 23 Jan 2021, Robert Prins wrote:

> On 2021-01-23 10:38, Helmut Richter wrote:
> > On Fri, 22 Jan 2021, Robert Prins wrote:
> >
> > > > > <http://docplayer.org/12394012-An-halterwvesen-und-an-haltergefah-ren.html>
> > > > >
> > > > > It turns out that the option to download the PDF as Word on the above
> > > > > site
> > > > > doesn't work (I gave up after Ms PacMan was still biting after nearly
> > > > > an
> > > > > hour),
> >
> > Have you ever published a URL of the original PDF version (i.e. the one
> > that is most original relative to the other versions)?
> >
> > Without it, one can hardly say anything about the quality of other
> > versions produced from it.
>
> There's another copy of the same PDF around on the site of a HH friend @
> <http://www.franknature.nl/anhalterwesen.pdf>, but those are as far as I know

I tried to read it with tesseract, and the outcome looks good at first
sight. Please tell me whether this is of some help for you. Of course,
tesseract can only read what it recognises as text, no images or the like.

The main problem was to convert the PDF into a graphics file. I ended up
with a 276M TIFF. tesseract took 12 min to make text out of it; the text
in UTF-8 encoding takes 446K and can be found at
https://hhr-m.userweb.mwn.de/weblab/anhalterwesen.txt .

I have learnt a bit of know-how and a lot of know-how-not.
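
The post does not say which tool produced the TIFF, so the following is only one possible route, sketched under assumptions: Ghostscript renders the PDF to a multi-page greyscale TIFF, and tesseract then reads it with the German language data ("deu") installed. File names, the 300 dpi resolution and the function names are hypothetical.

```python
import shutil
import subprocess
import sys


def build_commands(pdf, tiff="pages.tif", out_base="anhalterwesen",
                   lang="deu", dpi=300):
    """Return the two command lines: render the PDF, then OCR the TIFF."""
    render = ["gs", "-q", "-dNOPAUSE", "-dBATCH", "-sDEVICE=tiffgray",
              f"-r{dpi}", f"-sOutputFile={tiff}", pdf]
    # tesseract appends ".txt" to the output base name itself.
    ocr = ["tesseract", tiff, out_base, "-l", lang]
    return [render, ocr]


def run(pdf):
    """Run both steps, stopping with an error if a tool is missing or fails."""
    for cmd in build_commands(pdf):
        if shutil.which(cmd[0]) is None:
            raise RuntimeError(f"{cmd[0]} not found on PATH")
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    if len(sys.argv) == 2:
        run(sys.argv[1])
    else:
        print("usage: python ocr_pdf.py anhalterwesen.pdf")
```

A lower resolution gives a smaller intermediate file at the cost of recognition quality; the right trade-off for this scan would have to be found by trial.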

--
Helmut Richter

😉 Good Guy 😉

Jan 23, 2021, 9:00:39 PM
On 23/01/2021 14:24, Robert Prins wrote:

> There's another copy of the same PDF around on the site of a HH friend @
> <http://www.franknature.nl/anhalterwesen.pdf>, but those are as far as I know
> the only two, other than that it also shows up on Google on sites with, to say
> the least, "dodgy" names that I wouldn't touch with a bargepole.
>
> There also seem to be some paper copies around, just Google the title.
>
> The .RTF version can be (temporarily) found @
> <https://prino.neocities.org/www/Anhalterwesen_und_Anhaltergefahren.rtf>. It
> renders OK in Word (Word 2002), and very badly in LO Writer, and contains a
> horrible amount of "images", scan artifacts.
>
> Robert

Can you not just embed the file in your HTML like so?

<https://technical.mytechsite.gq/docs/test.html>

--

With over 1.2 billion devices now running Windows 10, customer satisfaction is higher than any previous version of windows.

Robert Prins

Jan 24, 2021, 9:13:31 AM

On 2021-01-24 01:58, 😉 Good Guy 😉 wrote:
> On 23/01/2021 14:24, Robert Prins wrote:
>>
>> There's another copy of the same PDF around on the site of a HH friend @
>> <http://www.franknature.nl/anhalterwesen.pdf>, but those are as far as I know
>> the only two, other than that it also shows up on Google on sites with, to say
>> the least, "dodgy" names that I wouldn't touch with a bargepole.
>>
>> There also seem to be some paper copies around, just Google the title.
>>
>> The .RTF version can be (temporarily) found @
>> <https://prino.neocities.org/www/Anhalterwesen_und_Anhaltergefahren.rtf>. It
>> renders OK in Word (Word 2002), and very badly in LO Writer, and contains a
>> horrible amount of "images", scan artifacts.
>>
>> Robert
>
> Can you not just embed the file in your HTML like so?
>
> <https://technical.mytechsite.gq/docs/test.html>

Then I might just as well send people to the original. The reasons for the
conversion to html are, as mentioned before,

1) Size: the PDF is 12.5 MB, the final html is likely to be well under 1 MB.
I'm currently on page 81/93 of the PDF (of the 197/209 pages of actual
contents), and I'm as yet not sure what I'm going to do with the 60 pages
containing appendices with the questionnaires. Size right now is a mere 222 KB...

2) Accessibility. 'nuff said.

Admittedly, your usual website nowadays carries multi-megabyte behind the scenes
CSS and JS, on my mobile I just deleted another 108(!)Mb of "cookies" left there
by the <http://www.independent.co.uk/>, so accessibility is the main issue, and
based on my own experience, quite a bit of the contents of this study is still
reasonably relevant!

And it's Covid-19 time, too cold to be out on the balcony sanding and painting
doors (only two of 16 left anyway), impossible to hitchhike, although I'm still
going to try this week to keep the streak going, so this conversion and
updating the PC based copies of my HH programs (about to hit a snag, as some
input lines now exceed 255 characters) are useful to keep me busy.

Thomas 'PointedEars' Lahn

Jan 27, 2021, 3:29:07 PM
Robert Prins wrote:

> For some reason, I missed this post a few days ago, not good...

You’re welcome :->

> […]
<https://github.com/PointedEars> | <http://PointedEars.de/wsvn/>

Thomas 'PointedEars' Lahn

Jan 29, 2021, 4:29:22 PM
Helmut Richter wrote:

> I tried to read it with tesseract, and the outcome looks good at first
> sight.

I did not know it. Thank you for sharing this :)

<https://github.com/tesseract-ocr/tesseract>
<https://github.com/PointedEars> | <http://PointedEars.de/wsvn/>