Robert Prins wrote:
> At
> <http://docplayer.org/12394012-An-halterwvesen-und-an-haltergefah-ren.html>
>
> It turns out that the option to download the PDF as Word on the above site
> doesn't work (I gave up after Ms PacMan was still biting after nearly an
> hour),
WFM.
> but the text in the PDF is selectable, although with plenty of
> spelling errors, but those are easy to correct when looking at the PDF.
As PDF is based on PostScript, there are tools like ps2txt (an alias for
ps2ascii(1), which is itself a wrapper around gs(1), the Ghostscript
binary) that can extract text from PDF documents automatically. It appears
to work quite well with the downloaded PDF document, in case you are still
unable to download the Word document.
There are also tools called “pdf2html”. One is an npm package and requires
a JRE [1], but there are others, both command-line tools and Web sites.
Just google it.
[1] <https://www.npmjs.com/package/pdf2html>
> 1) Font
>
> Do I go for monospace, like the original report, or do I something more(?)
> friendly on the eyes?
That depends on the degree to which you want to preserve the original
document. If you are not doing this for archiving purposes, I suggest
declaring a list of sans-serif variable-width font families instead, with
the preferred font family first and the generic “sans-serif” at the end of
the list. A possible list that can be recommended is
  body {
    font-family: Verdana, Geneva, Arial, Helvetica, sans-serif;
  }
(YMMV. For example, typographers would probably frown at me for including
“Arial” there, or because I put it before “Helvetica”.)
If you are not into typography, or do not have the time to educate yourself
about it, simply declare only “sans-serif”.
You might need to set the font-family for some descendant elements as well.
(Implementations are inconsistent.)
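One case where this commonly shows up is form controls, which in many
browsers get a default font instead of inheriting the one of their parent.
A minimal sketch, assuming your document contains such elements at all (the
selector list is my choice, not a complete one):

```html
<style>
  body {
    font-family: Verdana, Geneva, Arial, Helvetica, sans-serif;
  }

  /* Form controls often ignore the inherited font family;
     ask for inheritance explicitly. */
  button, input, select, textarea {
    font-family: inherit;
  }
</style>
```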
> 2) Footnotes
>
> Obviously they don't make sense in html,
They do, just not as page-end notes because, contrary to popular belief,
there are no “_HTML_ pages”. They could be footnotes in the table footer,
section-end notes, or text-end notes.
> so I'm thinking about using
> <details><summary> </summary> </details> tags to place them in-line,
> probably/possibly underlining (on hover) of the "xx)" text.
I do not think this is the correct HTML markup for footnotes. See also:
<https://developer.mozilla.org/en-US/docs/Web/HTML/Element/details>
Footnotes as small linked superscript text are working for me. I would
suggest inspecting Wikipedia to see how footnotes can be done (BTDT). You
can also combine that with my Accessible Pure CSS Tooltips (license is
GPLv3) that I am using on <http://PointedEars.de/es-matrix>.
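A minimal sketch of such linked footnotes, written as section-end notes.
The “id” values, the sample sentence and the back-link arrow are
placeholders of mine, not taken from the report or from Wikipedia's markup:

```html
<p>
  Some statement from the report<sup><a href="#fn-1" id="fnref-1">1)</a></sup>
  that continues after the footnote reference.
</p>

<!-- Section-end (or text-end) notes; the back link returns to the reference. -->
<ol>
  <li id="fn-1">
    Text of the first footnote.
    <a href="#fnref-1" title="Back to the text">↑</a>
  </li>
</ol>
```

Because both directions are plain fragment links, this works without any
scripting, and the “xx)” text of the original can be kept as the link text.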
> 3) Tables
>
> Don't cut & paste, so I'll have to convert them and here I've hit a snag,
> I can code myself around it, but it's ugly.
The problem may be solved now that you can download the Word document.
However, if you cannot, then you may be able to make your life a little
easier by changing the text (if still necessary) to the following (CSV)
format (without indentation):
  td_content;td_content;td_content …
  td_content;td_content;td_content …
Then you can first apply the replacement
  ; → </td><td>
and then (e.g. using regular expressions)
  ^ (start of line) → <tr><td>
  $ (end of line)   → </td></tr>
(Use another delimiter if it is obvious that the delimiter occurs in the
data.)
Then surround all rows with
  <table>
and
  </table>
after which you can make adjustments like <td> → <th>, rowspan, colspan and
accessibility attributes.
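To illustrate the steps above with made-up data (the cell contents are
placeholders, not taken from the report), this is roughly what two input
lines turn into before any further adjustments:

```html
<!-- Input (placeholder data):
     Year;Cases
     1990;12
-->
<table>
<tr><td>Year</td><td>Cases</td></tr>
<tr><td>1990</td><td>12</td></tr>
</table>
```

If the data contains “&” or “<”, replace them with “&amp;” and “&lt;”
before the other replacements; afterwards they can no longer be told
apart from the markup that was inserted.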
I also remember having seen a tool that can do this conversion from text
rows to HTML tables automatically, but I remember neither its name nor the
circumstances.
> Explanation: If you look at the tables in the PDF, the first is on page 14
> (26 in the PDF), it has a double outside border and a single inside one,
> but most cells don't have top or bottom borders.
Although it may look old-fashioned, the latter is actually how *simple*
*data* tables SHOULD be done. For example, it is a standing recommendation
for LaTeX tables in scientific works: Only draw horizontal lines (“\hline”
or “\midrule”) to separate *groups* of rows. (In HTML this can be achieved
with a “thead” element and one or more “tbody” elements.)
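A minimal sketch of that grouping, assuming collapsed borders (borders on
row groups are only drawn in that model). The cell contents and the border
width are placeholders:

```html
<style>
  table { border-collapse: collapse; }

  /* Horizontal rules only between row groups, not between single rows. */
  thead, tbody { border-bottom: 1px solid black; }
</style>

<table>
  <thead>
    <tr><th>Group</th><th>Item</th><th>Value</th></tr>
  </thead>
  <tbody>
    <tr><td>A</td><td>first</td><td>1</td></tr>
    <tr><td>A</td><td>second</td><td>2</td></tr>
  </tbody>
  <tbody>
    <tr><td>B</td><td>third</td><td>3</td></tr>
  </tbody>
</table>
```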
That the original table style may not be suitable for the Web does not mean
that copy-and-paste is necessarily a bad idea. In my PDF reader “Okular”
(version 1.3.2) at least, only the text from that table is copied then.
Once you have the text in the cells using proper table markup, the borders
can be easily styled with CSS. For example, something like
  /* a “double” border needs a width of at least 3px to show two lines */
  table { border-collapse: collapse; border: 3px double black; }
  thead tr { border-bottom: 2px solid black; }
  th, td { padding: 0.25em; border-right: 2px solid black; }
would come closest to the original table style. (Whether you want to do
that depends on how much you want to preserve the original.)
I would put the table footnotes in the “tfoot” element (BTDT).
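A sketch of how that could look; the “id” values and cell contents are
placeholders. Note that, as far as I know, current HTML wants “tfoot” after
the “tbody” elements, whereas HTML 4.01 wanted it before them:

```html
<table>
  <thead>
    <tr><th>Item</th><th>Value</th></tr>
  </thead>
  <tbody>
    <tr>
      <td>first</td>
      <td>1<sup><a href="#tfn-a" id="tfnref-a">a)</a></sup></td>
    </tr>
  </tbody>
  <tfoot>
    <tr>
      <!-- One cell spanning all columns holds the footnote text. -->
      <td colspan="2" id="tfn-a">a) Text of the table footnote.</td>
    </tr>
  </tfoot>
</table>
```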
> And how do you create the inverted "L" shaped tables that are on PDF pages
> 83 and 117, to name just two?
In the case of the table on page 83 of the PDF document, simply omit the
last 4 table cells in each row, or add empty cells but style them so that
they are not visible.
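A sketch of the second variant (empty cells that are not rendered),
assuming separated borders, because “empty-cells” has no effect with
collapsed ones. The class name and the 4×3 layout are placeholders, not
the actual table from page 83:

```html
<style>
  table.l-shaped {
    border-collapse: separate;
    border-spacing: 0;
    empty-cells: hide; /* no borders or background for empty cells */
  }

  table.l-shaped th,
  table.l-shaped td {
    border: 1px solid black;
    padding: 0.25em;
  }
</style>

<table class="l-shaped">
  <tr><th>A</th><th>B</th><th>C</th><th>D</th></tr>
  <tr><td>1</td><td>2</td><td></td><td></td></tr>
  <tr><td>3</td><td>4</td><td></td><td></td></tr>
</table>
```

With separated borders and no spacing, adjacent cell borders sit next to
each other, so the inner lines look twice as thick as the outer ones;
whether that matters depends on how closely you want to match the original.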
> Obviously I will ***not*** rotate any tables!
I do not see the need for any rotation in the first place :)
> 4) Images? Not yet there, first one on PDF page 71 (Just Cut & Paste) or
> the graphs on PDF page 106, where SVG would seem to the logical option,
Unless you want to do some fancy visualization, if you only want to link to
further information about the area of the map, a simple image map (“img”,
“map” and “area” elements) will suffice (and will be most
backwards-compatible). Since the map contours only have to be approximate,
this will be a lot easier to do than to recreate the map exactly with SVG
(unless you have an image editor that can convert bitmaps to SVG easily –
let me know which one, then).
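A minimal sketch of such an image map. The file names, the map name, the
dimensions and all coordinates are made up; the real coordinates would
have to be read off the extracted image:

```html
<img src="map.png" alt="Map of the area" usemap="#area-map"
     width="400" height="300">

<map name="area-map">
  <!-- Polygon: x,y pairs tracing an approximate contour. -->
  <area shape="poly" coords="20,30,180,25,190,140,40,150"
        href="north.html" alt="Northern district">
  <!-- Rectangle: left,top,right,bottom. -->
  <area shape="rect" coords="200,30,380,150"
        href="east.html" alt="Eastern district">
</map>
```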
Otherwise, just extract the image using e.g. The GIMP or ImageMagick
convert(1), and add an “img” element.
--
PointedEars
<https://github.com/PointedEars> | <http://PointedEars.de/wsvn/>
Twitter: @PointedEars2
Please do not cc me. /Bitte keine Kopien per E-Mail.