Converting SE to PDFs


Lukas Bystricky

Dec 17, 2023, 1:49:48 PM
to Standard Ebooks
Following up on this discussion I wrote a script to convert SE repos to pdfs. It's available here. The structure of SE makes parsing everything relatively simple. Probably 90% of my time was spent figuring out how to get the front matter and the body to have different counters. (Of the remaining time, 90% of that was spent trying to get the Python packages to install correctly.)

I've attached a couple of examples, one relatively simple and one more complex. I'm sure they're not perfect, but hopefully it's a decent starting point if anyone wants to pursue this further. 
sylvia-townsend-warner_lolly-willowes.pdf
sun-tzu_the-art-of-war_lionel-giles.pdf

David

Dec 18, 2023, 3:45:49 AM
to Standard Ebooks
Looks very good to me, especially as "proof-of-concept". (Art of War certainly stress-tests the endnotes!)

I noticed the odd typographical peculiarity (e.g. Lolly, p. 4, 6 lines up, the stranded DLQUO, i.e. the opening double quotation mark), but are such artefacts inevitable? Great starting point in any case.

David / Fife, UK

Vince

Dec 18, 2023, 4:06:54 AM
to Standard Ebooks
First, it’s pretty impressive as is, so congratulations!

Next, I can’t imagine myself ever using this, so you can absolutely ignore me. :) I also don’t know the goal, i.e. how close should the text formatting of the PDF be to the epub? I would think (and want) it to be the same, but that might not be the goal (or even possible). With that said:

It doesn’t appear that the changes the CSS makes to the text are being applied. For example:
  1. Whether endnotes or footnotes, the references themselves need to be smaller than 1em; in the epub, the CSS handles changing them to .75em.
  2. In the dedication, “Captain Valentine Giles” is in small caps in the source but not in the PDF.
  3. In the preface, the first blockquote is in French, so is entirely in italics in the epub. But the first sentence is emphasized, and so is set in normal text. But some of that is handled through CSS, not markup, and so the PDF only gets the markup part, i.e. it italicizes only the first sentence.

Spacing is also off: there’s a space before the endnote references (e.g. the “1” in the first sentence of the preface), there’s often a space on either side of quotes, sometimes in front of punctuation, and so forth.

Also, do PDFs support links? If so, should endnote references be links to the endnotes? And should the backlinks in the endnotes link back to the references? (Presuming they’re left as endnotes; if they become footnotes, this obviously is moot.)


--
You received this message because you are subscribed to the Google Groups "Standard Ebooks" group.
To unsubscribe from this group and stop receiving emails from it, send an email to standardebook...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/standardebooks/379036b3-14b7-484c-9812-17989d3dfdean%40googlegroups.com.

Lukas Bystricky

Dec 18, 2023, 4:46:36 PM
to Standard Ebooks
Epubs in general work really well, but they tend to have issues with images and tables. My (cheap) e-reader in particular really struggles with them. So I thought it might be nice to have a canonical version of the book that shows images/tables in a consistent way. In the thread I mentioned some people had expressed interest in print copies, so this would be a step in that direction too. But obviously still some work left.  

I do think the formatting should be as close as possible to the epub. Most of those issues were me unintentionally overwriting some existing CSS, but WeasyPrint itself does have some limitations. For one, it doesn't accept CSS namespaces. I'd already got around that by turning all the "epub:type" attributes into classes. I suppose something similar will have to be done with the xml:lang attributes in blockquotes. Another issue, related to what David pointed out, is that WeasyPrint currently doesn't support the hanging-punctuation property. Perhaps there's some hack to prevent hanging punctuation some other way; otherwise a less-than-ideal solution would be to turn off justified lines. The good news is that WeasyPrint is actively developed, and they seem quite responsive to feedback, so maybe I'll send them an email and see if hanging-punctuation or namespaces are in their plans.

As for the endnote links, I'd turned off the styling but they do work in Adobe at least. The page numbers in the table of contents also link to the appropriate chapter. 

Lukas Bystricky

Dec 23, 2023, 3:27:15 AM
to Standard Ebooks
So as I expected, most of the issues that Vince mentioned were me being careless with CSS. I've attached a new copy of the Art of War; I think it looks a lot better.

One thing that was interesting was the issue of small caps. At least it was (mildly) interesting to me; perhaps this is common knowledge, so apologies in advance if that's the case. Apparently not all fonts (in fact very few) come equipped with a true "small caps" variant. If the small caps variant doesn't exist then most browsers will simulate it by converting the text to all caps and then shrinking it. This is not typographically equivalent to small caps and unsurprisingly some people (probably mainly professional typesetters) have strong opinions on this and really dislike "fake" small caps. WeasyPrint apparently sides with the professional typesetters and doesn't simulate small caps. After quite a bit of searching, I was able to find a font (EB Garamond) that I believe is open source and that contains a small caps variant, and by using that I got real small caps in the pdf.

This is actually an issue with e-readers too. At least on my e-reader small caps are ignored entirely. I don't know what font we're using with SE, but I suspect it doesn't have a small caps variant and my e-reader doesn't automatically simulate them. I also don't know how widespread this issue is (I've mentioned before my e-reader is very cheap), but if we decide it's worthwhile to fix this issue, one solution would be to force the small caps to be simulated with some clever CSS. Another (typographically preferable) solution would be to include the EB Garamond font with the epub. I'm not sure how difficult that would be. 
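
To make the "simulated small caps" idea concrete, here is a small sketch of roughly what the fallback described above does: lowercase runs are uppercased and shrunk, while real capitals stay at full size. The function name and the 0.8em scale are assumptions for illustration only; nothing like this is known to be in the script or in the SE toolchain.

```python
import html
from itertools import groupby


def fake_small_caps(text: str, scale: float = 0.8) -> str:
    """Return HTML that imitates small caps without a small caps font.

    Runs of lowercase letters are uppercased and wrapped in a span with a
    reduced font size; everything else is passed through (HTML-escaped).
    """
    parts = []
    for is_lower, run in groupby(text, key=str.islower):
        chunk = "".join(run)
        if is_lower:
            parts.append(
                f'<span style="font-size:{scale}em">'
                f"{html.escape(chunk.upper())}</span>"
            )
        else:
            parts.append(html.escape(chunk))
    return "".join(parts)
```

The result is only an approximation: shrunken capitals have thinner strokes than true small caps, which is exactly the objection typesetters raise.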
sun-tzu_the-art-of-war_lionel-giles.pdf

Lukas Bystricky

Dec 23, 2023, 3:32:59 AM
to Standard Ebooks
Ugh, the previous pdf was an old one without italics. 
sun-tzu_the-art-of-war_lionel-giles.pdf

Vince Rice

Dec 23, 2023, 9:43:14 AM
to standar...@googlegroups.com
Correct, many fonts don’t have “real” small caps. But preferring to not use the proper fallback is just pedantic extremism, IMO. What browsers do is the correct behavior, and WeasyPrint made a bad decision (it should at least be an option).


Vince

Dec 23, 2023, 10:14:32 AM
to Standard Ebooks
The new one does look much better, great improvement!

The spacing can still be weird, though. E.g., the extra space here before the comma, before the noteref, and the lack of space after the noteref.
PastedGraphic-1.png
The noteref issues are consistent; the comma issue is sporadic, but a possibility is that it’s occurring where there’s a tag of some sort on the word before the comma.

Here is what looks to be an extra space after a period (which is after an abbreviation, which would fit with the above theory).
PastedGraphic-2.png

Here are all three issues on the same line:
PastedGraphic-3.png

In this case whatever is happening threw the comma to the start of the next line.
PastedGraphic-4.png

I also happened across a class of link that doesn’t work; the endnotes (and their backlinks) that I tried worked fine. But in endnote 289 is a link labeled “See supra”; the link is to a paragraph id in chapter 5, but the PDF just beeps (I’m viewing it in Preview on the Mac). Here’s what it shows when it’s hovered on:
PastedGraphic-5.png

There are many links like that: e.g. note 301 has its own endnote, 776, that doesn’t work; the “chapter I” link in note 303 doesn’t work, etc. But there are other non-endnote links, e.g. “note 292” in link 301, that do work (when hovered over, that link says “Go to page 160”).
PastedGraphic-6.png
So maybe something that interprets the link and determines the page isn’t happening on the ones that don’t work?
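
One way this kind of failure can arise, sketched under the assumption that the script concatenates the separate XHTML files into a single document before rendering: links that still point at "some-file.xhtml#id" have no target in the merged document and must be rewritten to "#id". The regex and function names below are hypothetical and are not taken from the script.

```python
import re

# Matches href="chapter-5.xhtml#p-12" or href="../text/endnotes.xhtml#note-3",
# but leaves external links (http:, https:, mailto:) alone.
CROSS_FILE_HREF = re.compile(
    r'href="(?!https?:|mailto:)[^"#]*\.xhtml#([^"]+)"'
)


def localize_links(html: str) -> str:
    """Rewrite cross-file links to in-document fragment links."""
    return CROSS_FILE_HREF.sub(r'href="#\1"', html)


def dangling_fragments(html: str) -> list[str]:
    """List fragment targets that no id attribute in the document defines."""
    ids = set(re.findall(r'\bid="([^"]+)"', html))
    targets = re.findall(r'href="#([^"]+)"', html)
    return sorted({t for t in targets if t not in ids})
```

Running a check like `dangling_fragments(localize_links(merged_html))` before rendering would flag links whose destination does not exist in the merged document.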

Lukas Bystricky

Dec 24, 2023, 6:49:27 AM
to Standard Ebooks
Yes, I agree it is strange that WeasyPrint doesn't have the option to use fake small caps, especially since by far most fonts don't have a small caps variant. But it seems like it's not just WeasyPrint; my e-reader also doesn't automatically simulate small caps. I don't know how prevalent that problem is. If it's a common issue, maybe it's worth coming up with a fix for it.

The spacing issues are apparently coming from the HTML prettifier I was using, which adds an extra space after every element (apparently that's by design; why anyone would want that is not clear to me). Turning off the prettifier solves that problem. The invalid links were an issue with the regex I was using. I think that's resolved now too (see attached). It's really good to have a second set of eyes on this, I appreciate it.
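
The effect is easy to reproduce. The thread does not say which prettifier was involved, so the sketch below uses the standard library's `xml.dom.minidom` as a stand-in; it shows how pretty-printing mixed inline content introduces whitespace that a renderer then collapses into visible spaces. The helper `visible_text` is a hypothetical approximation of that whitespace collapsing.

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom


def visible_text(markup: str) -> str:
    """Approximate rendered text: join all text nodes, collapse whitespace."""
    text = "".join(ET.fromstring(markup).itertext())
    return " ".join(text.split())


SOURCE = '<p>the art of war<a href="#note-1">1</a>, he said.</p>'

# Pretty-printing puts every child node of <p> on its own indented line,
# which adds whitespace between the text, the noteref, and the comma.
PRETTY = minidom.parseString(SOURCE).documentElement.toprettyxml(indent="  ")

print(visible_text(SOURCE))  # the art of war1, he said.
print(visible_text(PRETTY))  # the art of war 1 , he said.
```

The second line shows the symptoms reported earlier in the thread: a space before the noteref and a space before the comma that follows an inline element.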



sun-tzu_the-art-of-war_lionel-giles.pdf

Vince

Dec 24, 2023, 9:23:22 AM
to Standard Ebooks
Another great improvement! That does indeed appear to correct all of the spacing issues.

The note reference superscripts appear to be significantly smaller than in the epubs (60% vs 75%); they’re at the edge of readable for those of us with “more mature” eyes, as my optometrist calls them. :)

The links in the ToC (the one the Mac shows in the side panel of Preview; I don’t know what it’s called in PDF-world) work great, and the links in the embedded ToC also work, but the page numbers on the frontmatter show in the embedded ToC as all 0’s, although the page numbers on the pages themselves are correct. So, for example, the link for “Sun Wu and His Book” shows as page 0 in the ToC, but the actual page# shows as vi. As you can see, ditto for all the frontmatter, including the halftitlepage (“The Art of War”).

PastedGraphic-1.png

PastedGraphic-2.png

I tried 50+ links, and all of them worked. I did find a problem with one of the links, but it’s a source file problem (“note 626” in link 51 links to note 628 instead). I’ll submit a PR to fix it. I also found one with a word missing from the link.

On a related note, this is a general PDF problem I’ve long disliked, but the page# shown when hovering on a link is not the page# shown at the bottom of the page, but instead some made-up internal PDF page number. Nor can you “Go to” one of the actual page numbers; you instead have to do some internal math to guess at the PDF hidden page# to go to. I’m assuming there’s nothing you can do about this, just mentioning it because it’s one of the things I don’t like about PDFs. :)

Really nice job, Lukas. This is looking great!


Lukas Bystricky

Dec 27, 2023, 3:26:32 PM
to Standard Ebooks
Yeah that was another regex issue. I fixed the ToC and made the superscripts bigger. I also tried to automate replacing "see here" type notes with page references. This book turned out to be a very good test case for that. 
sun-tzu_the-art-of-war_lionel-giles.pdf

Christopher Hapka

Dec 28, 2023, 8:13:48 AM
to Standard Ebooks
Have you tried running the script on a book with illustrations? (The LoI would need linking like the ToC, presumably).

I'm also curious how it would handle our drama table formatting--this project has both (a play with illustrations) if you'd like to give it a try.




Lukas Bystricky

Dec 28, 2023, 12:19:48 PM
to Standard Ebooks
It works to an extent (for now I've skipped the LoI). There are some issues with page breaks, though. Sometimes a page break is inserted after the character name, like here:
Screenshot 2023-12-28 at 3.46.46 PM.png

I originally thought that maybe there couldn't be a page break inside a <td> element, but then sometimes the page break works properly:
Screenshot 2023-12-28 at 3.47.01 PM.png

I tried experimenting with the break-before and break-inside options on the td, but nothing really worked. I believe that the difference is that in the first case "The Statue" is on the last line of the page, while in the second case "Don Juan" is several lines above the bottom. My guess is that the orphans/widows properties might be useful here, but no luck so far.

Using break-inside:avoid on the tr solves the problem to some extent, in that the characters and their dialogue always appear on the same page (see attached pdf), but this results in occasional long bits of whitespace like this:
Screenshot 2023-12-28 at 3.57.07 PM.png
george-bernard-shaw_man-and-superman_avoid-breaks.pdf
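
For reference, the row-level workaround mentioned above can be sketched as follows. The stylesheet is the part that comes from the discussion (break-inside: avoid on the tr); the `render_pdf` wrapper, its name, and the lazy import are assumptions for illustration, and WeasyPrint must be installed separately for the rendering call to work.

```python
# Stylesheet for the workaround: never split a table row across pages, so a
# speaker's name and the matching speech always stay on the same page.
DRAMA_CSS = """
table tr {
    break-inside: avoid;
}
"""


def render_pdf(html: str, css: str = DRAMA_CSS) -> bytes:
    """Render an HTML string to PDF bytes with the extra stylesheet applied."""
    # Imported lazily so the stylesheet can be inspected without WeasyPrint.
    from weasyprint import CSS, HTML  # third-party package

    return HTML(string=html).write_pdf(stylesheets=[CSS(string=css)])
```

The trade-off is the one already noted: a long speech that does not fit in the remaining space is pushed whole to the next page, leaving whitespace at the bottom of the previous one.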