Ugh, I think Enigmail shit the bed and sent a GPG message to the list
for some reason. Sorry! Here's what I wrote:
I'm already working on hand-migrating some stuff--the Standard Ebooks
project I spoke to you about, Eric. It's not something that can be
automated because PG is so inconsistent in terms of quality and standards.
Sometimes the epub/html at PG is good enough, sometimes it's a total
mess. It seems to depend on when the PG text was produced and who
produced it.
If GITenberg decides to go with epubs, note that sometimes the epub
available at PG *isn't the same* as the HTML version of the same book.
Often the epub version is missing things like italics that are included
in the HTML version. I have no idea why that is, but beware: apparently
not all of the epubs are auto-generated.
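For what it's worth, a quick way to spot that kind of mismatch might be to
count the italic tags in each version and compare. This is only a rough
sketch using Python's standard library; the names TagCounter and
count_italics are made up for illustration, and a matching count doesn't
prove the two versions are identical.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts start tags whose (lowercased) name is in a given set."""

    def __init__(self, tags):
        super().__init__()
        self.tags = set(tags)
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names already lowercased.
        if tag in self.tags:
            self.count += 1


def count_italics(markup):
    """Return how many <i> or <em> start tags appear in the markup."""
    counter = TagCounter({"i", "em"})
    counter.feed(markup)
    counter.close()
    return counter.count
```

You'd run count_italics() over the PG HTML file and over each content
document pulled out of the epub, then flag books where the HTML total is
noticeably higher.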
Epub is great because it's just zipped up HTML files with some extra
metadata. You don't even need a program like Sigil to make one (though
it helps). For SE I've put together a bash script toolchain to compile
a plain directory with epub3 structure into both a pure epub3 file (that
goes in the repo) and a more compatible epub2 file. I'll share that
with you all once I release it, hopefully in the next few months. The
caveat is that it's Linux-only right now.
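To illustrate the "just zipped up HTML" point, here's a rough Python sketch
of the zipping step. To be clear, this is not my bash toolchain, just a
minimal stand-in; build_epub and its arguments are names I made up. The one
subtlety I'm relying on is that the epub container spec wants the
"mimetype" file to be the first entry in the archive and stored
uncompressed.

```python
import os
import zipfile


def build_epub(src_dir, out_path):
    """Zip an unpacked epub directory (src_dir) into an epub file (out_path)."""
    with zipfile.ZipFile(out_path, "w") as zf:
        # 'mimetype' goes first and is stored without compression.
        zf.write(os.path.join(src_dir, "mimetype"), "mimetype",
                 compress_type=zipfile.ZIP_STORED)
        for root, dirs, files in os.walk(src_dir):
            dirs.sort()  # walk subdirectories in a stable order
            for name in sorted(files):
                full = os.path.join(root, name)
                # Archive names always use forward slashes.
                arc = os.path.relpath(full, src_dir).replace(os.sep, "/")
                if arc == "mimetype":
                    continue  # already written above
                zf.write(full, arc, compress_type=zipfile.ZIP_DEFLATED)
```

It assumes src_dir already has the usual epub layout (mimetype,
META-INF/container.xml, the package file and content documents) and that
out_path is outside src_dir. It doesn't validate anything.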
Personally I chose epub3 as the base format for Standard Ebooks because
it's basically just HTML5, and that gives us the ability to add
interesting semantics that you can't easily do in text-only formats
(<em> vs <i> tags, <abbr>, <section> vs <div>, etc.). Plus the semantic
inflection standard lets us add a lot of really interesting rich
metadata to ebooks (<section epub:type="chapter">). Epub3 can be
decomposed to asciidoc and nearly any other format easily, it's already
the gold standard for ereaders, and for Kindle readers Calibre does a
great job of automatic conversion--better than kindlegen. It was a
no-brainer choice for me.
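As a small example of why that metadata is useful: because the content is
well-formed XHTML, a tool can pull the epub:type values straight out with
an ordinary XML parser. A minimal sketch follows; the sample markup and the
semantic_types name are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace that the epub: prefix is bound to in epub3 content documents.
EPUB_NS = "http://www.idpf.org/2007/ops"

SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:epub="http://www.idpf.org/2007/ops">
<body>
<section epub:type="chapter"><p>It was <em>very</em> late.</p></section>
<section epub:type="appendix"><p>Notes.</p></section>
</body>
</html>"""


def semantic_types(xhtml_text):
    """Return (tag, [types]) for every element carrying an epub:type attribute."""
    root = ET.fromstring(xhtml_text)
    found = []
    for el in root.iter():
        value = el.get("{%s}type" % EPUB_NS)
        if value:
            tag = el.tag.split("}", 1)[-1]  # drop the XHTML namespace prefix
            found.append((tag, value.split()))
    return found
```

Calling semantic_types(SAMPLE) lists the two sections with their types,
which is the kind of structure you can't recover from a plain-text file.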
GITenberg could fairly easily automate generating epubs from the HTML
source for PG books that don't have an epub download available. Again,
the quality would be the main issue.
On 03/27/2015 08:52 AM, Eric Hellman wrote:
> So essentially there are 2 different problems.
> 1. (epic) mitigating the mess in the older stuff (and identifying the
> newer, higher quality versions where they exist.)
> 2. making sure that the solid DP produced texts can be maintained going
> forward.
>
>> On Mar 27, 2015, at 8:13 AM, Jon Hurst <jhurs...@gmail.com> wrote:
>>
>> On Friday, March 27, 2015 at 8:26:57 AM UTC, Jon Hurst wrote:
>>
>> One option you have is to have the epub as the main format. A one
>> time conversion is much easier, and gets you Kindle formats pretty
>> much for free. Converting epub to HTML is pretty straightforward:
>> unzip, concatenate anything in the spine that has the same <head>
>> section, fix up the links.
>>
>>
>> As an addendum, this suggestion probably represents your best choice
>> of pipeline. Take the epub from PG and drop /everything/ else. There
>> is, after all, always going to be an epub version, even if it's all but
>> unusable. Unzip the epub and put what you get under version control;
>> it will mainly be text so version control will be effective. Producing
>> your epub is then just a case of zipping it up again, and producing
>> mobi is just a case of running kindlegen against the newly recreated
>> epub. What PG refers to as "the HTML version" is either a single html
>> file, which cannot by definition include images, or a zip file which
>> does include images and can be unzipped and will work locally with a
>> browser using file:// protocol. If you are going with the latter, you
>> could probably get away with adding a bit of javascript to give
>> next/previous links between the epub chunks. For the text, if you care
>> (which I honestly don't believe you should) you could just concatenate
>> w3m/elinks dumps of the html in the spine. If you really, really care
>> about the text, you'll need an additional text file that will need to
>> be separately modified whenever anything is updated; I would recommend
>> against this as it would undoubtedly be a Royal PITA to manage.
>>
>> For editing, you can either edit the epub chunks directly or take the
>> created epub and edit it with Sigil. From what little experience I
>> have with Sigil, it seems to round-trip without adding too much junk.
>>
>> It would still be an epic amount of work to get anything decent, but
>> at least it would be a sane, version controlled epic amount of work.
>>
>> Jon
>>
>> --
>> You received this message because you are subscribed to the Google
>> Groups "GITenberg Project" group.
>> To unsubscribe from this group and stop receiving emails from it, send
>> an email to gitenberg-proj...@googlegroups.com.
>> <https://groups.google.com/d/msgid/gitenberg-project/9676075e-2da1-4505-8353-c4d9adbc389e%40googlegroups.com?utm_medium=email&utm_source=footer>