Status of building asciidoc books

Seth Woodworth

Mar 26, 2015, 1:33:22 PM

to gitenber...@googlegroups.com

I started integrating GITenberg with Travis-CI this weekend. Travis-CI is an open source continuous integration (CI) server. Typically, a CI server watches for changes on GitHub, then checks out the changed code, runs tests, and/or tries to compile the software. Importantly, the trigger is the change itself: any change made on GitHub starts a 'build' via post_commit hooks.

I've taken my asciidoc fork of Rime of the Ancient Mariner, told Travis-CI about my repo, and added this Travis config file.

Now, whenever I make a commit to Rime it triggers a build on Travis that looks like this. In this case, I am installing asciidoctor and using it to transform the Rime asciidoc file into html. But I could just as easily build epubs with a slightly different command.
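
For readers following along, here is a minimal sketch of the HTML step, assuming the `asciidoctor` command is on the PATH. The file names and the helper functions are hypothetical, and this is not the Travis config from the repo; the epub variant is not shown in this thread, so it is left out.

```python
import subprocess


def asciidoctor_command(source, backend="html5", output=None):
    """Build the argument list for an asciidoctor run (nothing is executed here)."""
    cmd = ["asciidoctor", "-b", backend]
    if output is not None:
        cmd += ["-o", output]
    cmd.append(source)
    return cmd


def build(source, **kwargs):
    """Run asciidoctor and raise CalledProcessError if it exits non-zero."""
    subprocess.run(asciidoctor_command(source, **kwargs), check=True)


if __name__ == "__main__":
    # Hypothetical file names; only prints the command that build() would run.
    print(" ".join(asciidoctor_command("rime.asciidoc", output="rime.html")))
```

A CI job would call something like `build("rime.asciidoc", output="rime.html")` after installing asciidoctor, so a failed conversion fails the build.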

Any files generated by the Travis-CI build are automatically uploaded to Amazon's cloud file storage. The Rime html is available here.

This is a preliminary result, and more work needs to be done before it is ready for prime time.

Asciidoc isn't the only type of file we can build this way. I've taken the bilingual book that Tom mentioned, added it to GITenberg, and forked it to my repo: Jesuit Relations. I think the PG html version isn't as awesome as the raw text version, which has English and French side by side. But I attempted to build the html of Jesuit Relations with Project Gutenberg's epubmaker. I have epubmaker and the python requirements installed, but I am having a pathing issue in a bash script that is causing the build to fail.

Nevertheless, this has been an enlightening experiment and I am very hopeful that we can build PG html edition ebooks easily.

This contributes to an overall point I would like to make:

I like asciidoc, and I think we have the best tools for asciidoc.

I want to try everything and compare them.

PG has ~400 books in reStructuredText (a format I have researched thoroughly and consider second only to Asciidoctor). I would love to auto-build these books as part of the GITenberg infrastructure.

Last few points:

There are currently tradeoffs and limitations to using Travis-CI:

+ the GITenberg organization is too large for Travis to list all of our repos (a common issue, but should be fixable)

+ using Travis means including a .travis.yml file in every repo

+ we will have to enable each repo by hand on the Travis-CI site (there may be an API for this)

P.S. I have also formatted this email in asciidoc and posted it to the Documentation repo.

--Seth

Tom Morris

Mar 26, 2015, 6:31:29 PM

to gitenber...@googlegroups.com

Thanks for the update. I don't know enough about the tradeoff between asciidoc and RST to provide useful feedback, but I do know that Distributed Proofreaders generate HTML for pretty much everything that they do. While they're not omniscient or infallible, I think those tens of thousands of texts ought to be given a fair amount of weight. Note also that the HTML that they produce is *not* the same as what PG publishes, and I've heard some uncomplimentary things about PG's "cleanups," so going back a generation might make a difference in quality.

On Thu, Mar 26, 2015 at 1:33 PM, Seth Woodworth <se...@sethish.com> wrote:

> + we will have to enable each repo by hand on the Travis-CI site (there may be an API for this)

http://docs.travis-ci.com/api

Tom

Tom Morris

Mar 26, 2015, 6:47:47 PM

to gitenber...@googlegroups.com

p.s.

On Thu, Mar 26, 2015 at 1:33 PM, Seth Woodworth <se...@sethish.com> wrote:

> I have epubmaker and the python requirements installed, but I am having a pathing issue in a bash script that is causing the build to fail.

I had a quick glance at this and it looks to me like you're missing the dependency tidy (HTML cleaner). It expects a command-line utility by that name that it can run in a subprocess.
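
A minimal sketch of how a build script could catch this kind of missing dependency up front, assuming Python is available on the build machine. The function name `missing_tools` is hypothetical; only `tidy` comes from this thread.

```python
import shutil


def missing_tools(tools):
    """Return the names in `tools` that cannot be found on the PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    # "tidy" is the utility named above; extend the list as needed.
    missing = missing_tools(["tidy"])
    if missing:
        print("missing command-line tools:", ", ".join(missing))
    else:
        print("all required tools found")
```

Failing early with a clear message is easier to debug than a subprocess error surfacing from deep inside a build.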

Tom

Seth Woodworth

Mar 26, 2015, 10:56:51 PM

to gitenber...@googlegroups.com

I have noticed snarky comments in the epubmaker source about DP.

Do we know if it is epubmaker that is converting the html from DP to PG?

You were right about htmltidy. The epub builds! Download here.

--
You received this message because you are subscribed to the Google Groups "GITenberg Project" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gitenberg-proj...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/gitenberg-project/CAE9vqEHA17tpo3gxV4AgBhjNzZ0gofVfVqqZDgNunS2%3Dr9aU%2BA%40mail.gmail.com.

For more options, visit https://groups.google.com/d/optout.

Seth Woodworth

Mar 26, 2015, 11:24:47 PM

to gitenber...@googlegroups.com

Does anyone know of good alternative open source html > epub software for building via Travis?

I feel that our goals should be to produce the best quality epubs for the most users.

Asciidoc is going to give us flexibility, but you're right that the 10s of thousands of books speak volumes.

Jon Hurst

Mar 27, 2015, 4:26:57 AM

to gitenber...@googlegroups.com, se...@sethish.com

On Friday, March 27, 2015 at 3:24:47 AM UTC, Seth Woodworth wrote:

> Does anyone know of good alternative open source html > epub software for building via Travis?
>
> I feel that our goals should be to produce the best quality epubs for the most users.

Hi Seth,

It is simply not possible to reliably take any old html and make an epub out of it. You have to make far too many possibly invalid assumptions. Marcello's epubmaker attempts to do this, but actually ends up working from the opposite direction. DP HTML is now written specifically for epubmaker to turn into an epub. Any HTML that was written prior to epubmaker is a crapshoot at best; worse, there is no guarantee that new versions of epubmaker will successfully build HTML tested against a previous version.

It is not the least bit difficult to write HTML and CSS that can easily be converted to an epub. Epub is really not that tricky. However, the HTML and CSS at PG has not been written in this way. It could be cleaned up to work, but the task would be Herculean and the quality of PG's material does not justify the effort; there are just too many other things wrong with it.

One option you have is to have the epub as the main format. A one time conversion is much easier, and gets you Kindle formats pretty much for free. Converting epub to HTML is pretty straightforward: unzip, concatenate anything in the spine that has the same <head> section, fix up the links.
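
A rough sketch of the first part of that conversion (opening the epub and finding the spine order), using only the Python standard library. It assumes a standard epub layout with META-INF/container.xml pointing at the package file; the function name `spine_documents` is hypothetical, and the concatenation and link fix-up steps are deliberately left out.

```python
import posixpath
import xml.etree.ElementTree as ET
import zipfile

# XML namespaces used by the epub container file and the package (OPF) file.
NS = {
    "c": "urn:oasis:names:tc:opendocument:xmlns:container",
    "opf": "http://www.idpf.org/2007/opf",
}


def spine_documents(epub_path):
    """Return the archive paths of the content documents in spine order."""
    with zipfile.ZipFile(epub_path) as z:
        container = ET.fromstring(z.read("META-INF/container.xml"))
        opf_path = container.find("c:rootfiles/c:rootfile", NS).get("full-path")
        package = ET.fromstring(z.read(opf_path))
    base = posixpath.dirname(opf_path)
    # Map manifest ids to hrefs (hrefs are relative to the package file).
    manifest = {
        item.get("id"): item.get("href")
        for item in package.findall("opf:manifest/opf:item", NS)
    }
    # Simplification: hrefs are used as-is, without percent-decoding.
    return [
        posixpath.normpath(posixpath.join(base, manifest[ref.get("idref")]))
        for ref in package.findall("opf:spine/opf:itemref", NS)
    ]
```

The returned list gives the reading order, which is what you would walk to concatenate the chunks into a single HTML file.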

Another option is to allow users to generate custom PDFs (i.e. sized to the device, choice of fonts etc. at build time) using PrinceXML. This is actually how I read ebooks, as it is the only way to achieve typography above the level of atrocious. It would probably require less fixing up of HTML, as PrinceXML is a far more powerful and flexible engine than your average ereader's epub/mobi layout engine, and it circumvents the difficulties caused by quirks in different epub readers. This would still require some fixing up of the HTML though, and who knows what terrors lie in the PG catalogue.

In short, I would say what you are trying to do is very, very difficult indeed. It's not that it's particularly technically difficult. It's just that PG's largely unmaintained collection is not a good place to start from. It cannot be fixed in software. Any option is going to require enormous amounts of work by someone.

Good luck!

Jon

Jon Hurst

Mar 27, 2015, 8:13:49 AM

to gitenber...@googlegroups.com, se...@sethish.com

On Friday, March 27, 2015 at 8:26:57 AM UTC, Jon Hurst wrote:

> One option you have is to have the epub as the main format. A one time conversion is much easier, and gets you Kindle formats pretty much for free. Converting epub to HTML is pretty straightforward: unzip, concatenate anything in the spine that has the same <head> section, fix up the links.

As an addendum, this suggestion probably represents your best choice of pipeline. Take the epub from PG and drop everything else. There is, after all, always going to be an epub version, even if it's all but unusable. Unzip the epub and put what you get under version control; it will mainly be text so version control will be effective. Producing your epub is then just a case of zipping it up again, and producing mobi is just a case of running kindlegen against the newly recreated epub.

What PG refers to as "the HTML version" is either a single html file, which cannot by definition include images, or a zip file which does include images and can be unzipped and will work locally with a browser using file:// protocol. If you are going with the latter, you could probably get away with adding a bit of JavaScript to give next/previous links between the epub chunks.

For the text, if you care (which I honestly don't believe you should) you could just concatenate w3m/elinks dumps of the html in the spine. If you really, really care about the text, you'll need an additional text file that will need to be separately modified whenever anything is updated; I would recommend against this as it would undoubtedly be a Royal PITA to manage.
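
The "zipping it up again" step can be sketched as follows, under a few assumptions: the unpacked epub lives in a directory that contains a `mimetype` file at its top level, the output file is written outside that directory, and a `.git` directory (if present) should not end up in the book. The function name `zip_epub` is hypothetical. The one detail that differs from a plain zip is that the epub container format expects `mimetype` to be the first entry and stored uncompressed.

```python
import os
import zipfile


def zip_epub(source_dir, epub_path):
    """Zip an unpacked epub directory back into an .epub file."""
    with zipfile.ZipFile(epub_path, "w") as z:
        # "mimetype" goes first and is stored without compression.
        z.write(os.path.join(source_dir, "mimetype"), "mimetype",
                compress_type=zipfile.ZIP_STORED)
        for root, dirs, files in os.walk(source_dir):
            # Skip version-control data and keep the entry order stable.
            dirs[:] = sorted(d for d in dirs if d != ".git")
            for name in sorted(files):
                full = os.path.join(root, name)
                arcname = os.path.relpath(full, source_dir).replace(os.sep, "/")
                if arcname == "mimetype":
                    continue  # already written above
                z.write(full, arcname, compress_type=zipfile.ZIP_DEFLATED)
```

Sorting the entries is a design choice, not a requirement: it makes rebuilds from the same commit produce the same file order, which is convenient when comparing outputs.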

For editing, you can either edit the epub chunks directly or take the created epub and edit it with Sigil. From what little experience I have with Sigil, it seems to round-trip without adding too much junk.

It would still be an epic amount of work to get anything decent, but at least it would be a sane, version controlled epic amount of work.

Jon

Eric Hellman

Mar 27, 2015, 9:52:18 AM

to gitenber...@googlegroups.com, se...@sethish.com

So essentially there are 2 different problems.

1. (epic) mitigating the mess in the older stuff (and identifying the newer, higher quality versions where they exist.)

2. making sure that the solid DP produced texts can be maintained going forward.

Raymond Yee

Mar 27, 2015, 12:05:21 PM

to gitenber...@googlegroups.com

Hi Tom,

Do you know whether the DP HTML for books is publicly available? And if so, how to find that HTML.

-Raymond

Alex Cabal

Mar 27, 2015, 1:50:01 PM

to gitenber...@googlegroups.com

encrypted.asc

Alex Cabal

Mar 27, 2015, 1:52:19 PM

to gitenber...@googlegroups.com

Ugh, I think Enigmail shit the bed and sent a GPG message to the list
for some reason. Sorry! Here's what I wrote:

I'm already working on hand-migrating some stuff--the Standard Ebooks
project I spoke to you about Eric. It's not something that can be
automated because PG is so scattered in terms of quality and standards.
Sometimes the epub/html at PG is good enough, sometimes it's a total
mess. It seems to depend on when the PG text was produced and who
produced it.

If GITenberg decides to go with epubs, note that sometimes the epub
available at PG *isn't the same* as the HTML version of the same book.
Often the epub version is missing things like italics that are included
in the HTML version. I have no idea why that is, but beware--not all of
the epubs are auto-generated apparently.

Epub is great because it's just zipped up HTML files with some extra
metadata. You don't even need a program like Sigil to make one (though
it helps). For SE I've put together a bash script toolchain to compile
a plain directory with epub3 structure into both a pure epub3 file (that
goes in the repo) and a more compatible epub2 file. I'll share that
with you all once I release it, hopefully in the next few months. The
caveat is that it's Linux-only right now.

Personally I chose epub3 as the base format for Standard Ebooks because
it's basically just HTML5, and that gives us the ability to add
interesting semantics that you can't easily do in text-only formats
(<em> vs <i> tags, <abbr>, <section> vs <div>, etc.). Plus the semantic
inflection standard lets us add a lot of really interesting rich
metadata to ebooks (<section epub:type="chapter">). Epub3 can be
decomposed to asciidoc and nearly any other format easily, it's already
the gold standard for ereaders, and for Kindle readers Calibre does a
great job of automatic conversion--better than kindlegen. It was a
no-brainer choice for me.

GITenberg could fairly easily automate generating epubs from the html
source for PG books that don't have an epub download available. Again
the quality would be the main issue.

On 03/27/2015 08:52 AM, Eric Hellman wrote:
> So essentially there are 2 different problems.
> 1. (epic) mitigating the mess in the older stuff (and identifying the
> newer, higher quality versions where they exist.)
> 2. making sure that the solid DP produced texts can be maintained going
> forward.

Tom Morris

Mar 27, 2015, 2:21:28 PM

to gitenber...@googlegroups.com

On Fri, Mar 27, 2015 at 12:05 PM, Raymond Yee <raymo...@gmail.com> wrote:

> Do you know whether the DP HTML for books is publicly available? And if so, how to find that HTML.

You should ask someone with deeper knowledge than I have, but I think there is a gap between the final DP proofed & formatted text and the final PG published versions. Neither the HTML as produced by DP post processors (PPers) nor the HTML as uploaded to the PG site as input for the whitewashers (WWers) is publicly available. I don't know if either is archived at all. I suspect that any DP HTML archives are maintained by individual PPers only (if at all). Don't know about PG.

Tom

Seth Woodworth

Mar 27, 2015, 3:24:56 PM

to gitenber...@googlegroups.com

responses inline

On Fri, Mar 27, 2015 at 1:52 PM, Alex Cabal <al...@alexcabal.com> wrote:

> I'm already working on hand-migrating some stuff--the Standard Ebooks
> project I spoke to you about Eric.

Context for everyone else, Alex's Standard Ebooks project has produced at least one gorgeous ebook.

I'm really looking forward to future releases (on another domain, so ignore the sales aspect of the .com site).

(Alex, plz clarify if not true)

> If GITenberg decides to go with epubs, note that sometimes the epub
> available at PG *isn't the same* as the HTML version of the same book.

I would like to avoid pigeonholing ourselves to _just_ epub if there are ways to avoid it.

If the methodologies for going epub > other formats work, then I am for it.

> Often the epub version is missing things like italics that are included
> in the HTML version. I have no idea why that is, but beware--not all of
> the epubs are auto-generated apparently.

I suspect the difference is made by the PG epubmaker software.

I tried to publish a diff of the html for Jesuit Relations as a pull request, but the diff was too large for GitHub to render.

> [...] For SE I've put together a bash script toolchain to compile
> a plain directory with epub3 structure into both a pure epub3 file [...] I'll share that
> with you all once I release it, hopefully in the next few months.

Great! I look forward to it.

> Personally I chose epub3 as the base format for Standard Ebooks because
> it's basically just HTML5, and that gives us the ability to add
> interesting semantics that you can't easily do in text-only formats
> (<em> vs <i> tags, <abbr>, <section> vs <div>, etc.).

Asciidoc is really good at customizing the html representations of various text-based markup.

It has the added benefit of allowing the markup of a [verse] block to be separated from its presentation.

I feel like this might be a useful way to future-proof these formats.

But there is likely an ideal html+css in an epub that would serve the same purpose.

> Epub3 can be decomposed to asciidoc and nearly any other format easily

Good point. If we had a good epub3, I can't see bothering to decompose to or through asciidoc to reach other formats.

But I would like to have the technique stored somewhere in case we need it.

Seth Woodworth

Mar 27, 2015, 4:44:34 PM

to gitenber...@googlegroups.com

I've created a page for questions we would like to ask of someone from DP.

Please provide PRs for additional questions. Or I will add you to the repo as a collaborator if you want to push directly. (Shoot me an email offlist with your GitHub handle.)

Alex Cabal

Mar 27, 2015, 7:43:07 PM

to gitenber...@googlegroups.com

On 03/27/2015 02:24 PM, Seth Woodworth wrote:
> responses inline
>
> On Fri, Mar 27, 2015 at 1:52 PM, Alex Cabal <al...@alexcabal.com
> <mailto:al...@alexcabal.com>> wrote:
>
>
> I'm already working on hand-migrating some stuff--the Standard Ebooks
> project I spoke to you about Eric.
>
>
> Context for everyone else, Alex's Standard Ebooks project
> <https://standardebooks.com/alices-adventures-in-wonderland/> has
> produced at least one /gorgeous/ ebook.

standardebooks.COM is an old venture and that page will be retired soon.

standardebooks.org is the current project and entirely free, public
domain, and open-source. I've just finished the pre-alpha version of
the site, and you can currently browse through all of the ebooks I've
personally produced. There's lots of broken links and incomplete
content but all of the ebooks should be able to be downloaded. Haven't
told anyone else yet--I guess this is the release announcement!

The base format is the "pure epub3" download. Download those files to
see how I've chosen to structure code, metadata, etc. Maybe some of
those choices might be helpful/enlightening for GITenberg. The epub2
download is auto-generated from the epub3. There's a style guide I
follow so each ebook should be consistent. There's also a git backend
for each ebook and you can see a short history for each one, but you
can't browse the repos themselves yet; coming soon.

If you decide to open up an epub file, I recommend just unzipping it to
a directory and NOT using a program like Sigil. Sigil will add its own
junk to the epub even if you just open it to be viewed.

My toolchain is also not yet on the site, but once it's up I'll let you
all know and you can take a look to see if it would be helpful to
developing GITenberg books. The style guide and typography guides are
also incomplete rough drafts and in no way intended for wide release yet.

And with that, sorry to have hijacked this thread with my own project :)
I'll set up a separate mailing list soon for those of you who are
interested.

Sam Wilson

Mar 28, 2015, 10:25:58 PM

to gitenber...@googlegroups.com

On 28/03/15 07:43, Alex Cabal wrote:

> standardebooks.org is the current project and entirely free, public domain, and open-source. I've just finished the pre-alpha version of the site, and you can currently browse through all of the ebooks I've personally produced. There's lots of broken links and incomplete content but all of the ebooks should be able to be downloaded. Haven't told anyone else yet--I guess this is the release announcement!
>
> The base format is the "pure epub3" download. Download those files to see how I've chosen to structure code, metadata, etc. Maybe some of those choices might be helpful/enlightening for GITenberg. The epub2 download is auto-generated from the epub3. There's a style guide I follow so each ebook should be consistent. There's also a git backend for each ebook and you can see a short history for each one, but you can't browse the repos themselves yet; coming soon.
>
> If you decide to open up an epub file, I recommend just unzipping it to a directory and NOT using a program like Sigil. Sigil will add its own junk to the epub even if you just open it to be viewed.
>
> My toolchain is also not yet on the site, but once it's up I'll let you all know and you can take a look to see if it would be helpful to developing GITenberg books. The style guide and typography guides are also incomplete rough drafts and in no way intended for wide release yet.
>
> And with that, sorry to have hijacked this thread with my own project :) I'll set up a separate mailing list soon for those of you who are interested.

This looks fantastic! :) Very interesting. I like your style guide.

It does seem that HTML is a pretty good origin format for ebooks, if it can be turned into compliant epubs easily and is given semantic tags for things like verse and quotations. But then, docbook can be turned into html, and asciidoc into docbook... but perhaps not completely...

And what about printing to paper? I used to work as a book binder, and toyed with turning PG books into bound volumes — the only satisfactory way I could do it was to convert to LaTeX first! That worked terrifically. Perhaps the HTML printing tools have improved though; that was a few years ago (although, judging by things like Lulu and Gitbook, perhaps not).

I'm not really au fait with all the different digital libraries out there, but it seems that there's mostly a lot of attention to metadata, and far less to typography. And I want to read books! Beautiful ones. :)

I've been contributing to Wikisource for a couple of years, because the proofreading interface is pretty good, the scans are linked forever to the proofread text, and in the wiki world people are free to do what they think is best. (PGDP has always felt a bit like contributing to "someone else's" project, so I've not done much there.) The wikisource origin format is of course wikitext, but really considering the complexity of template substitution etc. I think the main usable output of wikisource is probably the epubs created by wsexport. I wrote a little thing to list all the double-proofread (i.e. 'validated') works in one place, and am now attempting to see what it'd take to turn those into gitenberg-like repositories.

With the advent of books on github, I feel like the way is becoming clearer to a system of having beautiful typography, availability of source scans, the means of contributing improvements, and accurate and detailed metadata, all in one thing. Huzza!

:) Sorry if that was a bit of a ramble... time for a Sunday morning coffee I think....

—Sam.
