Welcome to Project Alexandria


Brian Troutwine

Feb 27, 2012, 3:30:05 PM
to prj-ale...@googlegroups.com
This is a mailing-list for Project Alexandria, possibly _the_ list.
I'd like to hear thoughts on a small question to start this thing off:

What do we want to see? Put another way, what's the _goal_ of this project?

There's been lots of discussion over the years about re-building
Project Gutenberg, a lot of it descending into bike-shedding. I,
personally, would like to see Project Alexandria as a collection of
like-minded folks working toward:

* machine/human-readable book text files
* tools to view and reformat these text files and
* an ecosystem of applications _around_ the files and the
infrastructural toolset

The first point implies the adoption of a markup format; the second,
conversion utilities between the project's adopted format and those
which are popular, along with device applications that give access to
the library produced.

I propose that the best way to meet these aims is to compete solutions
against one another. Bike-shedding debates will kill an otherwise
viable community; debates about functional tools are much more
effective. To that end, I've begun mirroring PG's '.txt' files and
will get them pull-requested into
https://github.com/felix-faber/project-alexandria as soon as the whole
lot is ready.

Thoughts? Hopes? Wishes or dreams?
--
Brian L. Troutwine

Brian Troutwine

Feb 27, 2012, 3:37:19 PM
to prj-ale...@googlegroups.com

Whoops; typed without thinking. I'll produce a torrent for the initial
mirror, but checking the _whole_ thing into git would be a nightmare.


> Thoughts? Hopes? Wishes or dreams?
> --
> Brian L. Troutwine

--
Brian L. Troutwine

Cameron Hill

Feb 27, 2012, 6:04:49 PM
to project-alexandria-books
Right, I think the PG library as a whole is too colossal and
differentiated to be useful to us yet.

As briefly mentioned elsewhere in the HN thread, it would be easiest
to essentially build up from scratch taking a few select titles from
PG and getting them into an appropriate text format -- hone the
process and roll from there.

As far as text formats, I am mostly familiar with Markdown and have
briefly looked at RST. I don't think Markdown could handle these
needs. RST seems to have the immediate advantage of handling tables
out of the box better than Markdown and it seems PG is/was tentatively
leaning towards RST.

RST also seems prepared to handle math equations
http://docutils.sourceforge.net/docs/ref/rst/directives.html#math
The limiting factor is how well the output format can typeset the
mathematical output -- but for math/science books (which is where my
big interest lies) this seems advantageous.
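For reference, the math directive linked above looks like this in an
RST source file (a minimal sketch; how well it renders depends on the
output writer, per the limiting factor noted above):

```rst
The quadratic formula:

.. math::

   x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}
```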




On Feb 27, 2:37 pm, Brian Troutwine <br...@troutwine.us> wrote:
> On Mon, Feb 27, 2012 at 3:30 PM, Brian Troutwine <br...@troutwine.us> wrote:
> > This is a mailing-list for Project Alexandria, possibly _the_ list.
> > I'd like to hear thoughts on a small question to start this thing off:
>
> >    What do we want to see? Put another way, what's the _goal_ of this project?
>
> > There's been lots of discussion over the years about re-building
> > Project Gutenberg, a lot of it descending into bike-shedding. I,
> > personally, would like to see Project Alexandria as a collection of
> > like-minded folks working toward:
>
> >  * machine/human-readable book text files
> >  * tools to view and reformat these text files and
> >  * an ecosystem of applications _around_ the files and the
> > infrastructural toolset
>
> > The first point implies the adoption of a markup format, the second
> > conversion utilities between the project's adopted format and those
> > which are popular and device applications that give access to the
> > library produced.
>
> > I propose that the best way to meet this aims is to compete solutions
> > against one another. Bike-shedding debates will kill an otherwise
> > viable community; debates about functional tools is much more
> > effective. To that end, I've begun mirroring PG's '.txt' files and
> > will get them pull-requested into
> >https://github.com/felix-faber/project-alexandria as soon as the whole

Travis Jensen

Feb 27, 2012, 6:57:44 PM
to project-alexandria-books
I agree on the start small and work up. My advice would be to approach
this like a Lean Startup: Put together an initial concept with minimal
effort, add to the community based on that effort, get feedback from
that community, rinse, repeat.

I'm certainly familiar with some of the problems with PG from an end-
user's perspective, but I'm not familiar with it from a contributor's/
editor's perspective. We should probably identify the personas (yes,
I've done a lot of product management ;) and a couple of goals for
each. At the very least, we have three personas we will want to
cover: author (for people adding newly created content to the public
domain), editor (people adding other people's PD work to the project
and making revisions to it [note: there may be two personas there]),
and reader.

Having worked at a company producing software for libraries, I can
understand how PG could be mired in deep muck around "standards". That
is one of the biggest advantages to a process like Python's BDFL.
Having somebody at the top who can, when all is said and done, say
"this is the right thing for the project" can be a very good thing to
keep people from bike shedding.

tj

Brian Troutwine

Feb 27, 2012, 7:11:34 PM
to prj-ale...@googlegroups.com
On Mon, Feb 27, 2012 at 6:04 PM, Cameron Hill
<camero...@designory.com> wrote:
> Right, I think the entire PG library as a whole is too colosal and
> differentiated to be useful to us yet.

Indeed. It's been five years since I last mirrored PG and their
collection has grown substantially--pure textual duplication seems
very common. Anyway, I'll announce when there's a torrent available.
I've never known people to rue having too much data.

> As briefly mentioned elsewhere in the HN thread, it would be easiest
> to essentially build up from scratch taking a few select titles from
> PG and getting them into an appropriate text format -- hone the
> process and roll from there.
>
> As far as text formats, I am mostly familiar with Markdown and have
> briefly looked at RST. I don't think Markdown could handle these
> needs. RST seems to have the immediate advantage of handling tables
> out of the box better than Markdown and it seems PG is/was tentatively
> leaning towards RST.

Texts I propose:

* Walden -- contains poetry, prose, tables and footnote references
(http://www.gutenberg.org/files/205/205-0.txt)
* Crime and Punishment -- prose, UTF8 and biggish
(http://www.gutenberg.org/files/28054/28054-0.txt)
* Deductive Logic -- rather punishing layout job for some devices
(http://www.gutenberg.org/cache/epub/6560/pg6560.txt)
* Hyperbolic Functions -- tex only (http://www.gutenberg.org/ebooks/13692)

I've gone ahead and created a pull request with these texts scattered
in the repo.

> RST also seems prepared to handle math equations
> http://docutils.sourceforge.net/docs/ref/rst/directives.html#math
> The limiting factor is how well the output format can typeset the
> mathematical output -- but for math/science books (which where my big
> interest is) this seems advantageous.

There's also asciidocs: http://www.methods.co.nz/asciidoc/ I'm all for
minimal modifications to the source texts, especially if PG might pick
up the changes/tools.

--
Brian L. Troutwine

aiscott

Feb 27, 2012, 7:13:57 PM
to prj-ale...@googlegroups.com
I'm having a bad day I guess.  This is attempt #3 at a reply, so it's going to be terse.

1) I like Markdown and/or RST
1a) XML seems like it would be a good choice as far as transformations, but perhaps not friendly for data entry.

2) Infrastructure.  I like the infrastructure of Homebrew for Mac (please google it).  The gist of it is the main repo contains the Homebrew tools and recipes for the products.  Products are in their own repo.  I'd recommend independent github repos for Alexandria (love the name, btw)

2a) Homebrew has concepts such as dependencies.  I could see if we have competing file formats, that a dependency on the build tool could keep things clean.

2b) Dependencies could be used to make "collections," where a collection could be anything from the books of a series to a reading-list that someone wants to publish.
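To make the "collections as dependencies" idea concrete, here's a toy
sketch of how a Homebrew-style resolver might expand a collection into
its constituent books. Everything here -- the function, the recipe
names, the data shape -- is hypothetical illustration, not an existing
tool:

```python
# Hypothetical sketch: resolving a "collection" into its constituent
# books via Homebrew-style dependencies. All names are illustrative.

def resolve(name, recipes, seen=None):
    """Return the list of leaf books a collection ultimately depends on."""
    if seen is None:
        seen = set()
    if name in seen:          # avoid cycles / duplicate visits
        return []
    seen.add(name)
    deps = recipes.get(name, [])
    if not deps:              # a leaf: an actual book, not a collection
        return [name]
    books = []
    for dep in deps:
        books.extend(resolve(dep, recipes, seen))
    return books

# A collection may depend on books or on other collections.
recipes = {
    "sherlock-holmes-series": ["a-study-in-scarlet", "the-sign-of-four"],
    "summer-reading-list": ["walden", "sherlock-holmes-series"],
}

print(resolve("summer-reading-list", recipes))
```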

--Scott

felix faber

Feb 27, 2012, 7:16:24 PM
to project-alexandria-books
Hey there!

On Feb 28, 12:04 am, Cameron Hill <cameron.h...@designory.com> wrote:
> Right, I think the entire PG library as a whole is too colosal and
> differentiated to be useful to us yet.

I agree with Brian that we must not merely bike-shed.
Action is required.

Nevertheless, going directly for the entire library is maybe a bit
soon.

We might even have to throw away what we do in the beginning.
But that will be a good thing. Only by testing ideas and iterating can
we find the perfect solution.
We are at the beginning of a learning process. And books are not
_simple_.

> As far as text formats, I am mostly familiar with Markdown and have
> briefly looked at RST. I don't think Markdown could handle these
> needs. RST seems to have the immediate advantage of handling tables
> out of the box better than Markdown and it seems PG is/was tentatively
> leaning towards RST.
>
> RST also seems prepared to handle math equations: http://docutils.sourceforge.net/docs/ref/rst/directives.html#math
> The limiting factor is how well the output format can typeset the
> mathematical output -- but for math/science books (which where my big
> interest is) this seems advantageous.

Afaik, formulae in RST can be compiled with LaTeX, either directly as
part of a PDF or to images that get included.
That's as good as it gets for scientific content.

Best,
Julius aka Felix Faber

Brian Troutwine

Feb 27, 2012, 7:20:58 PM
to prj-ale...@googlegroups.com
On Mon, Feb 27, 2012 at 6:57 PM, Travis Jensen <travis...@gmail.com> wrote:
> I agree on the start small and work up. My advice would be to approach
> this like a Lean Startup: Put together an initial concept with minimal
> effort, add to the community based on that effort, get feedback from
> that community, rinse, repeat.

+1

> I'm certainly familiar with some of the problems with PG from an end-
> user's perspective, but I'm not familiar with it from a contributor's/
> editor's perspective.  We should probably identify the personas (yes,
> I've done a lot of product management ;) and a couple of goals for
> each.  At the very least, we have three personas we will want to
> cover: author (for people adding newly created content to the public
> domain), editor (people adding other people's PD work to the project
> and making revisions to it [note: there may be two personas there]),
> and reader.
>
> Having worked at a company producing software for libraries, I can
> understand how PG could be mired in deep muck around "standards". That
> is one of the biggest advantages to a process like Python's BDFL.
> Having somebody at the top who can, when all is said and done, say
> "this is the right thing for the project" can be a very good thing to
> keep people from bike shedding.
>
> tj

--
Brian L. Troutwine

Brian Troutwine

Feb 27, 2012, 9:30:57 PM
to prj-ale...@googlegroups.com
On Mon, Feb 27, 2012 at 6:57 PM, Travis Jensen <travis...@gmail.com> wrote:
> I agree on the start small and work up. My advice would be to approach
> this like a Lean Startup: Put together an initial concept with minimal
> effort, add to the community based on that effort, get feedback from
> that community, rinse, repeat.
>
> I'm certainly familiar with some of the problems with PG from an end-
> user's perspective, but I'm not familiar with it from a contributor's/
> editor's perspective.  We should probably identify the personas (yes,
> I've done a lot of product management ;) and a couple of goals for
> each.  At the very least, we have three personas we will want to
> cover: author (for people adding newly created content to the public
> domain), editor (people adding other people's PD work to the project
> and making revisions to it [note: there may be two personas there]),
> and reader.

I'd love to see you elaborate more on this.

> Having worked at a company producing software for libraries, I can
> understand how PG could be mired in deep muck around "standards". That
> is one of the biggest advantages to a process like Python's BDFL.
> Having somebody at the top who can, when all is said and done, say
> "this is the right thing for the project" can be a very good thing to
> keep people from bike shedding.
>
> tj

--
Brian L. Troutwine

cforster

Feb 27, 2012, 10:37:42 PM
to project-alexandria-books
Hey Folks,

I am an English Lit PhD so my perspective may be a little different
from others here; the idea of bringing the virtues of DVCS to public
domain texts is a thought which has occurred to me before and so I
couldn't resist trying to help.

1) Start small.

Yes. The idea of identifying a handful of texts (or maybe even a
single text) sounds very smart. Brian's suggestions (Walden, etc) seem
like good ideas to me.

2) What exactly is the goal?

To my mind the goal is to bring good, readable, public domain texts to
readers. Project Gutenberg is great; but as everyone here recognizes
it has some drawbacks. (I'll elaborate on those criteria, good and
readable, a bit below.) I'd point out PG is not the only source for
such texts; check out, for instance, the Oxford Text Archive:
http://ota.ahds.ac.uk/ (there are others).

If the goal is to increase access to public domain texts, I think we
might imagine our goal as re-mediating print objects. Tools for
creating born-digital etexts seem, to my mind, available right now.

In passing, I'd like to stress that reducing the friction between the
public domain and the reading public is an unalloyed good. Many of
these texts have lives in public schools (the novels of Twain, poems
of Keats, Shelley, and Browning, Shakespeare, etc) and (at least
potentially) have very large audiences. As I have said before to other
audiences, "Imagine if no one ever had to pay for Jane Eyre again."

3) Formats

I think the best possible solution would be a single markup format
which could be processed out to LaTeX (for print/PDF; everything that
Felix is talking about), ePub (mobi, whatever, for e-readers), HTML,
and plain text.

Have a look at the TEI; TEI is a flavor of XML and is considered in
many ways the standard for document encoding for many academic
projects. It is not a format in the strict sense, but a flexible,
customizable standard. Because it is not a single format, it is not as
"standard" as you'd like a standard to be. One of the challenges the
TEI faces is that it tries to do everything: that includes medieval
manuscripts, electronic texts, incunabula, printed books, etc; because
of that it is very complicated and somewhat fragmented. I think most
folks here are interested in ~printed books~. This can simplify our
markup needs significantly. Some form of simplified TEI might be a
good bet.

Check out the TEI Stylesheets for a first step towards moving from TEI
to LaTeX, HTML, etc:
http://www.tei-c.org/Tools/Stylesheets/
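For folks who haven't seen TEI before, a stripped-down document looks
roughly like this (a minimal sketch of the kind of "simplified TEI"
suggested above; the title and source details are just placeholders):

```xml
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Walden</title></titleStmt>
      <publicationStmt><p>Project Alexandria</p></publicationStmt>
      <sourceDesc><p>Transcribed from a clearly identified print edition.</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <div type="chapter">
        <head>Economy</head>
        <p>When I wrote the following pages...</p>
      </div>
    </body>
  </text>
</TEI>
```

Note that even this minimal skeleton carries a slot for provenance
(sourceDesc), which is exactly the metadata question raised below.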

4) Final Thoughts from a Literature Scholar

I mentioned two criteria; readable texts is a function, I think, of
format. The other criterion is "good" texts. By "good" I emphatically
do not mean the quality of the literature; I mean texts whose identity
and provenance is unambiguous.

This can easily seem like a silly, pedantic question: folks who aren't
familiar with critical editing may not realize how many different
editions, states, printings, and versions of a text can exist (i.e.
the very complicated histories books have in the course of their
transmission).
Such matters are very complicated. Establishing what "the text" is, is
a matter of no small complication. Gutenberg texts often frighten
scholars because it is very unclear where they come from.

I know the desires of scholars and those of readers (and say, this
mailing list) are not always the same. In the world I'd love to live
in, there would be texts based on some existing print edition. We
would have consistently marked up electronic texts based on clearly
identified print editions (with GOOD metadata) which would provide the
raw materials for folks to create their own editions. The ability to
easily add stand-off annotations would be icing on the cake--a true
boon for students and readers of all stripes. But here, I know, I'm
moving well beyond the goals of making PG (or perhaps simply public
domain) texts better (i.e. more readable).

I look forward to seeing what folks think.

- Chris

Brian Troutwine

Feb 27, 2012, 11:50:01 PM
to prj-ale...@googlegroups.com
On Mon, Feb 27, 2012 at 10:37 PM, cforster <chris.s...@gmail.com> wrote:
> Hey Folks,
>
> I am an English Lit PhD so my perspective may be a little different
> from others here; the idea of bringing the virtues of DVCS to public
> domain texts is a thought which has occurred to me before and so I
> couldn't resist trying to help.

My wife does Old English studies and translation, so I am not terribly
unfamiliar with your particular discipline's methodology. Hi!

> 1) Start small.
>
> Yes. The idea of identifying a handful of texts (or maybe even a
> single text) sounds very smart. Brian's suggestions (Walden, etc) seem
> like good ideas to me.
>
> 2) What exactly is the goal?
>
> To my mind the goal is to bring good, readable, public domain texts to
> readers. Project Gutenberg is great; but as everyone here recognizes
> it has some drawbacks. (I'll elaborate on those criteria: good and
> readable a bit below). I'd point out PG is not the only source for
> such texts; check out, for instance, the Oxford Text Archive:
> http://ota.ahds.ac.uk/ (there are others).

Indeed, and a good point. I think Project Gutenberg is merely the most
public example of an anarchy of archives, one which might well be the
worst managed, from the point of view of someone with library science
sympathies. Outside of Archive.org's scanned materials, PG has the
most _popular_ works available.

> If the goal is increase access to public domain texts, I think we
> might imagine our goal as re-mediating print objects. Tools for
> creating born-digital etexts seem, to my mind, available right now.

Do elaborate on this thought, please, with special emphasis on the
consequence of it.

> In passing, I'd like to stress that reducing the friction between the
> public domain and the reading public is an unalloyed good. Many of
> these texts have lives in public schools (the novels of Twain, poems
> of Keats, Shelley, and Browning, Shakespeare, etc) and (at least
> potentially) have very large audiences. As I have said before to other
> audiences, "Imagine if no one ever had to pay for Jane Eyre again."

Absolutely agreed. PG falls down here in its lack of machine parsable
texts. That is, while I can easily make a machine lex a PG text, I
can't parse one into a syntax tree. That inhibits conversion into
modern file formats, severely limiting the appeal of PG's services.
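To illustrate the lex-versus-parse point (a toy sketch, not a real PG
tool): splitting a plain '.txt' into tokens is trivial, but recovering
structure means guessing at typographic conventions such as all-caps
chapter headings, which is exactly the fragility that blocks reliable
conversion:

```python
import re

raw = """CHAPTER I

It was a dark and stormy night. The rain
fell in torrents.

CHAPTER II

The next morning dawned bright and clear."""

# Lexing is easy: any machine can split the text into words.
tokens = raw.split()

# "Parsing" is guesswork: we must heuristically decide that a line
# like "CHAPTER I" opens a new structural unit. A marked-up source
# would make this explicit instead of inferred.
chapters = re.split(r"^CHAPTER [IVXLC]+$", raw, flags=re.MULTILINE)
chapters = [c.strip() for c in chapters if c.strip()]
print(len(chapters))  # the two chapter bodies, recovered fragilely
```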

> 3) Formats
>
> I think the best possible solution would be a single markup format
> which could be processed out to LaTeX (for print/PDF; everything that
> Felix is talking about), ePub (mobi, whatever, for e-readers), HTML,
> and plain text.

Agreed.

> Have a look at the TEI; TEI is a flavor of XML and is considered in
> many ways the standard for document encoding for many academic
> projects. It is not a format in the strict sense, but a flexible,
> customizable standard. Because it is not a single format, it is not as
> "standard" as you'd like a standard to be. One of the challenges the
> TEI faces is that it tries to do everything: that includes medieval
> manuscripts, electronic texts, incunabula, printed books, etc; because
> of that it is very complicated and somewhat fragmented. I think most
> folks here are interested in ~printed books~. This can simplify our
> markup needs significantly. Some form of simplified TEI might be a
> good bet.
>
> Check out the TEI Stylesheets for a first step towards moving from TEI
> to LaTeX, HTML, etc:
> http://www.tei-c.org/Tools/Stylesheets/

I've worked with TEI a bit and it's the Humanities answer to the CS
crowd's DocBook: XSLT stylesheets for rigorously defined XML in both
cases, ostensibly the _most_ general format possible for being naught
but XML but suffering for that exact reason. Such a broad solution
violates the Worse is Better observation, tending to drive down
adoption simply for the difficulty of new users learning the tools.
That a parser for TEI can't be hacked together in a language without
pre-existing, simple XML libraries is also a problem: adding
complexity to the parsing of the markup format will tend to cause a
monoculture, rather than the diverse ecosystem of code that will drive
PA on into nifty areas.

I'd rather see the project adopt a markup format that's not generally
applicable but can be quickly understood, rather than the inverse,
even though that will mean, in time, we'll have to hash out extensions
to ReST and process around that.
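As a crude demonstration of that point (a sketch, not a full ReST
parser): enough lightweight structure can be pulled out of ReST-style
markup with a few lines of stdlib code and no XML machinery at all,
which is what keeps the tooling ecosystem open to quick hacks:

```python
import re

rst = """Walden
======

Economy
-------

When I wrote the following pages...
"""

# ReST marks a section title by underlining it with a row of
# punctuation at least as long as the title itself.
titles = []
lines = rst.splitlines()
for i, line in enumerate(lines[:-1]):
    underline = lines[i + 1]
    if line and re.fullmatch(r"[=\-~^]{%d,}" % len(line), underline):
        titles.append(line)
print(titles)
```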

> 4) Final Thoughts from a Literature Scholar
>
> I mentioned two criteria; readable texts is a function, I think, of
> format. The other criterion is "good" texts. By "good" I emphatically
> do not mean the quality of the literature; I mean texts whose identity
> and provenance is unambiguous.
>
> This can easily seem like a silly, pedantic question: folks who aren't
> familiar with critical editing or with the different editions, states,
> printings, and versions of a text can often exist (i.e. the very
> complicated histories books have in the course of their transmission).
> Such matters are very complicated. Establishing what "the text" is, is
> a matter of no small complication. Gutenberg texts often frighten
> scholars because it is very unclear where they come from.

I would love to see PA include rigorous meta-data on each work's
history. It would be helpful if you could choose a work for inclusion
into PA and produce what you'd like to see. Presumably something with
a complicated history, but not absolutely tortuous.

> I know the desires of scholars and those of readers (and say, this
> mailing list) are not always the same. In the world I'd love to live
> in, there would be texts based on some existing print edition. We
> would have consistently marked up electronic texts based on clearly
> identified print editions (with GOOD metadata) which would provide the
> raw materials for folks to create their own editions. The ability to
> easily add stand-off annotations would be icing on the cake--a true
> boon for students and readers of all stripes. But here, I know, I'm
> moving well beyond the goals of making PG (or perhaps simply public
> domain) texts better (i.e. more readable).

I don't necessarily think so. Feature creep is a real concern, but if
you can make happen what you find important and enough of us feel
giddy about it, well, I'm sure we'd all agree the initial goal list
was incomplete. :)

> I look forward to seeing what folks think.

I'd love to see a worked example.

--
Brian L. Troutwine

Alexandre Raymond

Feb 27, 2012, 11:50:13 PM
to prj-ale...@googlegroups.com
Hi everyone,

Speaking of TEI, it looks like PG started working on using a variant
of TEI called PGTEI at one point:
http://www.gutenberg.org/tei/
http://pgtei.pglaf.org/marcello/0.4/doc/20000-h.html

Perhaps this is indeed the correct approach. One drawback, however, is
that this markup language is not simple... far from it.

The way I see it, a DVCS could ease the development process, allowing
a more iterative process, perhaps starting from scanned pages, then
converted to raw text, and finally iteratively refined until it is
correctly encoded.

One nice property of TEI seems to be that it is easily exportable to a
wide variety of formats, including html/txt/latex/etc.

Maybe all that is missing is a nice set of web tools to guide this process.

Alexandre

On Mon, Feb 27, 2012 at 10:37 PM, cforster <chris.s...@gmail.com> wrote:

> --
> You have received this message from the Project Alexandria mailing list.
> The IRC channel is #ProjectAlexandria on irc.freenode.net.
> The project-central repository is here: https://github.com/felix-faber/project-alexandria

cforster

Feb 28, 2012, 12:51:36 AM
to project-alexandria-books

On Feb 27, 11:50 pm, Brian Troutwine <br...@troutwine.us> wrote:
> On Mon, Feb 27, 2012 at 10:37 PM, cforster <chris.s.fors...@gmail.com> wrote:

> > If the goal is increase access to public domain texts, I think we
> > might imagine our goal as re-mediating print objects. Tools for
> > creating born-digital etexts seem, to my mind, available right now.
>
> Do elaborate on this thought, please, with special emphasis on the
> consequence of it.

A comment earlier in this thread (I believe) had mentioned authoring
new texts; rather than allowing authors to produce/share/circulate new
texts, it seems to me our goal here is to increase access to public
domain texts; i.e. ~books~ published before 1923 (at least in the US
context). I think this focus will allow us to make certain assumptions
that we would be less inclined to make if born-digital texts were also
of chief concern rather than just digital versions of existing books.

> I've worked with TEI a bit and it's the Humanities answer to the CS
> crowd's DocBook:

Well, I might say DocBook is the CS answer to the TEI. ;)

>XSLT stylesheets for rigorously defined XML in both
> cases, ostensibly the _most_ general format possible for being naught
> but XML but suffering for that exact reason. Such a broad solution
> violates the Worse is Better observation, tending to drive down
> adoption simply for the difficulty of new users learning the tools.
> That a parser for TEI can't be hacked together in a language without
> pre-existing, simple XML libraries is also a problem: adding
> complexity to the parsing of the markup format will tend to cause a
> monoculture, rather than the diverse ecosystem of code that will drive
> PA on into nifty areas.
>
> It'd rather see the project adopt a markup format that's not generally
> applicable but can be quickly understood over the inverse, even though
> that will mean, in time, we'll have to hash out extensions to ReST and
> process around that.

That sounds like a completely fair assessment; producing good TEI
texts is NOT trivial. To my mind, the question of how to mark up the
texts is _the_ question. ReST may just be good enough; I don't know.

> I would love to see PA include rigorous meta-data on each work's
> history. It would be helpful if you could choose a work for inclusion
> into PA and produce what you'd like to see. Presumably something with
> a complicated history, but not absolutely tortuous.

A commitment to meaningful metadata would be great. But this would
have more fundamental impacts in how folks imagine what they're doing;
to really know where a text comes from would seriously complicate the
"clean up / improve / build on" Project Gutenberg vision of PA. Such
cleaning up / clarification / metadata itself could be added later.

I'll certainly look into putting something together once some sense of
consensus emerges in how folks imagine marking these texts up.

Travis Jensen

Feb 28, 2012, 12:30:09 PM
to prj-ale...@googlegroups.com


On Mon, Feb 27, 2012 at 10:51 PM, cforster <chris.s...@gmail.com> wrote:

> A commitment to meaningful metadata would be great. But this would
> have more fundamental impacts in how folks imagine what they're doing;
> to really know where a text comes from would seriously complicate the
> "clean up / improve / build on" Project Gutenberg vision of PA. Such
> cleaning up / clarification / metadata itself could be added later.

I completely agree that good meta-data is incredibly valuable, but meta-data discussions tend to be those kinds of discussions that lead to the morass. The nice thing about using a good VCS is that meta-data can be associated with the revisions independently, meaning we can probably punt on the meta-data for now. 

If we have a system that we can add the meta-data to later, it also means we don't have to decide all the meta-data to add in one shot, but rather can incrementally add it.

tj
--
Travis Jensen

Read the Software Maven @ http://softwaremaven.innerbrane.com/
Read my LinkedIn profile @ http://www.linkedin.com/in/travisjensen
Read my Twitter mumblings @ http://twitter.com/SoftwareMaven
Send me email @ travis...@gmail.com

*What kind of guy calls himself the Software Maven???*

Cameron Hill

Feb 28, 2012, 1:55:33 PM
to project-alexandria-books
On Feb 27, 11:51 pm, cforster <chris.s.fors...@gmail.com> wrote:
>
> That sounds like a completely fair assessment; producing good TEI
> texts is NOT trivial. To my mind, the question of how to mark up the
> texts is _the_ question. ReST may just be good enough; I don't know.
>

Painlessly creating/modifying the source markups is an important part
of this goal. Part of what I think we want to accomplish is allowing
readers/editors to be able to trivially submit corrections into some
sort of repository. An XML based markup would be too cumbersome for
the average person without additional tools.

In addition to ReST, there is an 'enhanced' Markdown syntax called
MultiMarkdown (https://github.com/fletcher/MultiMarkdown/wiki/MultiMarkdown-Syntax-Guide)
which adds some more valuable syntax. Markdown also has several
extensions which try to tackle some of the limitations of vanilla
Markdown.

Also, I keep getting drawn to Pandoc (http://johnmacfarlane.net/pandoc/)
and think it could be a valuable start for quickly getting
"good enough" output in multiple formats. Are there any other existing
tools whose value could sway our markup choices one way or the other?

felix faber

Feb 28, 2012, 1:58:52 PM
to project-alexandria-books

> Texts I propose:
>
>   * Walden -- contains poetry, prose, tables and footnote references
> (http://www.gutenberg.org/files/205/205-0.txt)
>   * Crime and Punishment -- prose, UTF8 and biggish
> (http://www.gutenberg.org/files/28054/28054-0.txt)
>   * Deductive Logic -- rather punishing layout job for some devices
> (http://www.gutenberg.org/cache/epub/6560/pg6560.txt)
>   * Hyperbolic Functions -- tex only (http://www.gutenberg.org/ebooks/13692)
>
> I've gone ahead and created a pull request with these texts scattered
> in the repo.

Good idea.
I pulled them into the repo.

By the way, does PG contain books that are image-heavy?
e.g. books for children with lots of illustrations..

Brian Troutwine

Feb 28, 2012, 2:00:41 PM
to prj-ale...@googlegroups.com
On Tue, Feb 28, 2012 at 1:58 PM, felix faber
<julius....@googlemail.com> wrote:
> Good idea.
> I pulled them into the repo.

Thanks!

> By the way, does PG contain books that are image-heavy?
> e.g. books for children with lots of illustrations..

Yes it does. Alice in Wonderland, for instance.

--
Brian L. Troutwine

Alexandre Raymond

Feb 28, 2012, 2:18:11 PM
to prj-ale...@googlegroups.com
A hybrid approach might also work:

- users submit a draft version using a simple markup language (ReST/markdown)
- a first proofreading pass is done on this version
- once there are no typos left, an automated conversion can transform
the document in a more complex file format such as TEI or DocBook
(from what I can see, Pandoc could be used for this purpose), on which
more advanced editing can be done
- advanced users can then enter additional information in the
TEI/DocBook master-file, such as author/editor/publisher, page
numbers, quotes in foreign languages, etc
- finally, standard TEI/DocBook transformation templates can be used
to export to text/html/LaTeX/pdf/epub/etc
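
That last export step could be driven by something as small as a Makefile. The sketch below is purely illustrative: all filenames are placeholders, it assumes Pandoc is installed (the PDF target additionally needs a LaTeX install), and DocBook stands in for whichever master format we pick:

```make
# Hypothetical sketch: promote an easy-to-edit Markdown draft to a
# DocBook master with Pandoc, then render the reader-facing formats
# from that master. Filenames are placeholders.

DRAFT  = frankenstein.md
MASTER = frankenstein.xml

all: frankenstein.epub frankenstein.html frankenstein.pdf

# Promote the proofread draft to the DocBook master (one-time step)
$(MASTER): $(DRAFT)
	pandoc -f markdown -t docbook -s -o $@ $<

frankenstein.epub: $(MASTER)
	pandoc -f docbook -o $@ $<

frankenstein.html: $(MASTER)
	pandoc -f docbook -s -o $@ $<

frankenstein.pdf: $(MASTER)
	pandoc -f docbook -o $@ $<
```

The point is less the specific tool than the shape: one hand-edited source, one machine-maintained master, and every output format derived, never edited.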

This decoupled approach might also solve the issue that different
works can be better expressed using different markup languages. For
example, LaTeX is very good with math, ReST is very simple for end
users, etc.

Alexandre

cforster

unread,
Feb 28, 2012, 2:36:09 PM2/28/12
to project-alexandria-books
On Feb 28, 2:18 pm, Alexandre Raymond <cerb...@gmail.com> wrote:
> A hybrid approach might also work:
>
> - users submit a draft version using a simple markup language (ReST/markdown)
> - a first proofreading pass is done on this version

<snip>

A hybrid approach would have the advantage of simplicity for the
producers/editors of texts while maintaining a level of complexity
that benefits the texts produced for readers. Note that we are
already having to invent an entire workflow--teetering, that is to
say, on ballooning the complexity of the project into
unmanageability.

*A Modest Proposal:* As my earlier post suggests, my desires run in
this direction as well; but if our chief conviction here is that DCVS
can improve public domain texts maybe we should stop trying to decide
on / invent a standard (TEI, ReST, Markdown, whatever) and define
instead a handful of criteria which are our goal and a text to work
on. For example, everyone who cares about this project, grab the PG
text of _Frankenstein_ and produce a version which:

- has a source in some modifiable format (points for keeping it easy
to edit and metadata rich!)
- outputs to a variety of formats: ePub / HTML / LaTeX (PDF)

And then we really just watch folks take advantage of DCVS and see
whether something like a best practice emerges organically. It's just
a thought; but recall that this project began from the observation
that DCVS could offer new opportunities for increasing the usability
of public domain texts and now we find ourselves mired (too strong a
word) in all the ugliness of standards and formats.

> This decoupled approach might also solve the issue that different
> works can be better expressed using different markup languages. For
> example, LaTeX is very good with math, ReST is very simple for end
> users, etc.
>

LaTeX, I might note in passing, also has the advantage of actually
typesetting a text. This is crucial if one wants a readable version on
something like paper or PDF. I think one reason PG texts do not have
wider readerships (and I'm relying on anecdotal evidence in this
assessment) is that no one (again, anecdote) likes reading plaintext.
ePub (etc) gives you the advantage of e-readers of various stripes;
but many people still love the look of books in print, and LaTeX
(with a nice selection of open fonts; perhaps Linux Libertine) can
give you nice-looking, book-like text (which you can print out, read
as a PDF, whatever).

Chris

Cameron Hill

unread,
Feb 28, 2012, 2:57:27 PM2/28/12
to project-alexandria-books
There are also at least two different versions/editions of Alice In
Wonderland on PG which might make it a good candidate to determine how
branching might work for titles with multiple versions.

http://www.gutenberg.org/ebooks/11 (The Millennium Fulcrum Edition)
[no images?]
http://www.gutenberg.org/ebooks/19033 (The Storyland Series edition)
[with images]
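
One (entirely hypothetical) way branching might work for multiple editions is one repository per title with one branch per edition; the branch and file names below are invented for illustration:

```shell
# Purely illustrative: one repo per title, one branch per edition.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q

# Base branch: the Millennium Fulcrum text (PG #11)
git checkout -q -b millennium-fulcrum
echo "Alice was beginning to get very tired..." > alice.txt
git add alice.txt
git -c user.name=pa -c user.email=pa@example.org commit -q -m "PG #11 text"

# The illustrated Storyland edition (PG #19033) branches off it,
# adding its images while sharing the common text history
git checkout -q -b storyland
mkdir images
echo "(illustration placeholder)" > images/plate01.txt
git add images
git -c user.name=pa -c user.email=pa@example.org commit -q -m "PG #19033 images"

git branch --list   # both edition branches now exist
```

Shared corrections (typo fixes, say) could then be made on the base branch and merged into each edition branch, which is exactly the kind of thing plain PG files can't do.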



On Feb 28, 1:00 pm, Brian Troutwine <br...@troutwine.us> wrote:
> On Tue, Feb 28, 2012 at 1:58 PM, felix faber
>

felix faber

unread,
Feb 29, 2012, 4:14:18 AM2/29/12
to project-alexandria-books


On Feb 28, 8:57 pm, Cameron Hill <cameron.h...@gmail.com> wrote:
> There are also at least two different versions/editions of Alice In
> Wonderland on PG which might make it a good candidate to determine how
> branching might work for titles with multiple versions.
>
> http://www.gutenberg.org/ebooks/11 (The Millennium Fulcrum Edition)
> [no images?]
> http://www.gutenberg.org/ebooks/19033 (The Storyland Series edition)
> [with images]
I added the Storyland Series edition to the repo.
There are images for the Millennium Fulcrum edition
(http://www.gutenberg.org/ebooks/114).
But they have no clearly defined position in the text.

felix faber

unread,
Feb 29, 2012, 4:31:11 AM2/29/12
to project-alexandria-books
On Feb 28, 8:36 pm, cforster <chris.s.fors...@gmail.com> wrote:
> *A Modest Proposal:* As my earlier post suggests, my desires run in
> this direction as well; but if our chief conviction here is that DCVS
> can improve public domain texts maybe we should stop trying to decide
> on / invent a standard (TEI, ReST, Markdown, whatever) and define
> instead a handful of criteria which are our goal and a text to work
> on. For example, everyone who cares about this project, grab the PG
> text of _Frankenstein_ and produce a version which:
>
> - has a source in some modifiable format (points for keeping it easy
> to edit and metadata rich!)
> - outputs to a variety of formats: ePub / HTML / LaTeX (PDF)
Yes!
We can only decide on a format once we have gained enough experience
using the candidates.
Every serious format proposal should come with a set of examples.
(Prose, verse, technical literature, ..)

> And then we really just watch folks take advantage of DCVS and see
> whether something like a best practice emerges organically. It's just
> a thought; but recall that this project began from the observation
> that DCVS could offer new opportunities for increasing the usability
> of public domain texts and now we find ourselves mired (too strong a
> word) in all the ugliness of standards and formats.
Organic growth is good.
I love the diverse range of ideas that have already popped up.
But be prepared: at some point, decisions will have to be made.
How we make those decisions as a group is still TBD.

> LaTeX, I might note in passing, also has the advantage of actually
> typesetting a text. This is crucial if one wants a readable version on
> something like paper or PDF. I think one reason PG texts do not have
> wider readerships (and I'm relying on anecdotal evidence in this
> assessment) is that no one (again, anecdote) likes reading plaintext.
> ePub (etc) gives you the advantage of e-readers of various stripes;
> but many people still love the look of books on print, LaTeX (with a
> nice selection of open font; perhaps Linux Libertine) can give you
> nice looking, book-like text (which you can print out, read as a PDF,
> whatever).
The problem with LaTeX is that it takes a lot of work to create a
polished final version, even with automatic converters from an
existing master file.
Other than that, it would be _perfect_ for PDF rendering.

I will soon put an example of this into the repo to illustrate my
point.

Best,
Felix Faber aka Julius

Travis Jensen

unread,
Feb 29, 2012, 11:33:53 AM2/29/12
to prj-ale...@googlegroups.com
On Feb 28, 2012, at 12:36 PM, cforster <chris.s...@gmail.com> wrote:

> On Feb 28, 2:18 pm, Alexandre Raymond <cerb...@gmail.com> wrote:
>> A hybrid approach might also work:
>>
>> - users submit a draft version using a simple markup language (ReST/markdown)
>> - a first proofreading pass is done on this version
>
> <snip>
>

> *A Modest Proposal:* As my earlier post suggests, my desires run in
> this direction as well; but if our chief conviction here is that DCVS
> can improve public domain texts maybe we should stop trying to decide
> on / invent a standard (TEI, ReST, Markdown, whatever) and define
> instead a handful of criteria which are our goal and a text to work
> on. For example, everyone who cares about this project, grab the PG
> text of _Frankenstein_ and produce a version which:
>
> - has a source in some modifiable format (points for keeping it easy
> to edit and metadata rich!)
> - outputs to a variety of formats: ePub / HTML / LaTeX (PDF)

DCVS isn't going to solve the "this is a lot of work" problem. I think
it's advantageous to have the discussion of what could work and what
won't work to lay out some direction before everybody runs off to the
corner to work. People will get strongly invested in work done, and we
want that work to be in the general direction of where we want to go.

*A Modest Counter Proposal:* Let's identify three to five questions we
are trying to answer with this work (the number of questions is
deliberately chosen to keep focused) and then decide how to answer
those questions (might be teams, individually attacking a book, or
individually attacking multiple books). The questions I have that I
will lay out for consideration are:

1. Is there a format that allows straightforward editing and patching
that can be turned into a presentable ebook using automated tools? In
other words, what is the effort to go from "I need to figure out how
to fix this typo" through "I have a new book on my device".
2. Is there a single source format that is appropriate for all kinds
of books, or does it make more sense to have an 80/20 rule, with a
simpler source format for 80% of the books, and a more complex one for
the more complex books (mathematical functions, etc).
3. Do we want to focus exclusively on the past or do we want to
provide an avenue for future public domain works? My personal
opinion: I would love to be thinking about the future as well, when an
author may want to start bridging the gap between "book" and "directed
multimedia experience". I also don't think we want to *solve* that
problem today, but it is worth considering if it is a value as it
might impact the approach to #1 and #2.
4. How can we manage meta-data without incurring an associated penalty
on ability to edit the source?

Once we have these questions, I think it would be worth putting them
up on a wiki for easy reference. I think next steps will become much
more clear (and the output much more valuable) once we are all trying
to answer the same questions.

Tj

cforster

unread,
Mar 6, 2012, 1:39:28 AM3/6/12
to project-alexandria-books

I fear the silence on the list after the initial excitement... I
thought I'd quickly respond to some of Travis's points and share a
relevant link:

On Feb 29, 11:33 am, Travis Jensen <travis.jen...@gmail.com> wrote:
>
> 1. Is there a format that allows straightforward editing and patching
> that can be turned into a presentable ebook using automated tools? In
> other words, what is the effort to go from "I need to figure out how
> to fix this typo" through "I have a new book on my device".
> 2. Is there a single source format that is appropriate for all kinds
> of books, or does it make more sense to have an 80/20 rule, with a
> simpler source format for 80% of the books, and a more complex one for
> the more complex books (mathematical functions, etc).

These are _the_ questions. My initial proposal to individuals, to try
out something, while work intensive, was inspired by the experience of
doing some text encoding in TEI XML in the past. Text encoding, if you
are going to try to remain faithful to a text which already exists
(rather than essentially inventing your own text), is inevitably far
more complicated in practice.

If there is any consensus from the discussions over the last couple of
weeks it seems to be around ReST or some sort of markdown format. It
seems to me that the only way now to know what challenges would attend
really trying to use one of those formats would be to try. What we
might learn is that they simply won't work; we might learn they work
perfectly; or we might learn that those formats need to be extended in
X, Y, or Z ways (or in X, Y, and Z... and AA, AB, AC, etc).

> 3. Do we want to focus exclusively on the past or do we want to
> provide an avenue for future public domain works?  My personal
> opinion: I would love to be thinking about the future as well, when an
> author may want to start bridging the gap between "book" and "directed
> multimedia experience". I also don't think we want to *solve* that
> problem today, but it is worth considering if it is a value as it
> might impact the approach to #1 and #2.

While I, of course, am interested in other folks' perspective, I think
I would prefer a focus on print books, and on the public domain in
particular. For me the inspiring motivation here remains the
disconnect between the enormous wealth of the public domain and the
poverty of access and uses for which that material is currently
available.

> 4. How can we manage meta-data without incurring an associated penalty
> on ability to edit the source?

This is not that far from questions 1 & 2. What sort of metadata are
we interested in? Data about the text? Data about the encoding?
Including both of these in a single file has long made the TEI header
the bane of many text encoders' lives ("Wait; is the titleStatement the
title of the text or the title of the encoding?"); and one advantage
of using a DVCS for texts would be that information about the version
of an encoding (_not_ to be confused with the edition/version of the
text itself) would be handled elsewhere.

I would like the richest bibliographical metadata possible; indeed, in
a perfect world (and here I'm referring back to your question 3) I
would want an electronic version of a public domain text to refer to a
specific print edition on which it is based. In an age not only of PG
but of archive.org and Google Books, basing a text on a particular
print edition is not as hard as it used to be. (For that matter,
among the metadata I'd like to see included in texts are page numbers;
yup, I'm old school.)
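
One (entirely hypothetical) way to get rich bibliographic metadata without incurring an editing penalty would be a plain key: value header at the top of the source file, split off before conversion; the field names below are invented for illustration, and the encoding history would live in the DVCS rather than in the file:

```python
# Hypothetical sketch: bibliographic metadata as a plain-text header,
# separated from the body before any format conversion. Field names
# are invented for the example.

def split_metadata(source):
    """Split a text into its header fields and its body."""
    header, _, body = source.partition("\n\n")
    fields = {}
    for line in header.splitlines():
        key, _, value = line.partition(":")
        fields[key.strip().lower()] = value.strip()
    return fields, body

sample = """\
title: Frankenstein; or, The Modern Prometheus
author: Mary Wollstonecraft Shelley
print-edition: London: Lackington, Hughes, 1818

Letter 1

You will rejoice to hear that no disaster has accompanied...
"""

meta, text = split_metadata(sample)
print(meta["author"])        # Mary Wollstonecraft Shelley
print(text.splitlines()[0])  # Letter 1
```

An editor fixing a typo never needs to touch (or even understand) the header, while a converter can carry the fields straight into the TEI/DocBook master, including a pointer to the specific print edition the text is based on.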

I know folks have tended toward some lighter mode of markup than XML;
this page shows taking a text from a raw text format into XML and then
back out to PDF and XHTML; it is quite nice: http://piez.org/wendell/projects/buechlein/

Regardless of the format, I think that demo illustrates nicely what I
imagine the end product of this project to look like (though with ePub
in addition to PDF and XHTML). Is that what other folks are imagining?