What do we want to see? Put another way, what's the _goal_ of this project?
There's been lots of discussion over the years about re-building
Project Gutenberg, a lot of it descending into bike-shedding. I,
personally, would like to see Project Alexandria as a collection of
like-minded folks working toward:
* machine/human-readable book text files
* tools to view and reformat these text files, and
* an ecosystem of applications _around_ the files and the
infrastructural toolset
The first point implies the adoption of a markup format; the second,
conversion utilities between the project's adopted format and the
popular ones; the third, device applications that give access to the
resulting library.
I propose that the best way to meet these aims is to pit solutions
against one another. Bike-shedding debates will kill an otherwise
viable community; debates about working tools are much more
productive. To that end, I've begun mirroring PG's '.txt' files and
will get them pull-requested into
https://github.com/felix-faber/project-alexandria as soon as the whole
lot is ready.
Thoughts? Hopes? Wishes or dreams?
--
Brian L. Troutwine
Whoops; typed without thinking. I'll produce a torrent for the initial
mirror, but checking the _whole_ thing into git would be a nightmare.
> Thoughts? Hopes? Wishes or dreams?
> --
> Brian L. Troutwine
--
Brian L. Troutwine
Indeed. It's been five years since I last mirrored PG and their
collection has grown substantially--pure textual duplication seems
very common. Anyway, I'll announce when there's a torrent available.
I've never known people to rue having too much data.
> As briefly mentioned elsewhere in the HN thread, it would be easiest
> to essentially build up from scratch taking a few select titles from
> PG and getting them into an appropriate text format -- hone the
> process and roll from there.
>
> As far as text formats, I am mostly familiar with Markdown and have
> briefly looked at RST. I don't think Markdown could handle these
> needs. RST seems to have the immediate advantage of handling tables
> out of the box better than Markdown and it seems PG is/was tentatively
> leaning towards RST.
Texts I propose:
* Walden -- contains poetry, prose, tables and footnote references
(http://www.gutenberg.org/files/205/205-0.txt)
* Crime and Punishment -- prose, UTF-8 and biggish
(http://www.gutenberg.org/files/28054/28054-0.txt)
* Deductive Logic -- rather punishing layout job for some devices
(http://www.gutenberg.org/cache/epub/6560/pg6560.txt)
* Hyperbolic Functions -- TeX only (http://www.gutenberg.org/ebooks/13692)
I've gone ahead and created a pull request with these texts scattered
in the repo.
> RST also seems prepared to handle math equations
> http://docutils.sourceforge.net/docs/ref/rst/directives.html#math
> The limiting factor is how well the output format can typeset the
> mathematical output -- but for math/science books (which is where my
> big interest is) this seems advantageous.
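Indeed, docutils implements that directive today. Here's a minimal
sketch of rendering such a fragment to HTML (assuming docutils is
installed; the fragment itself is invented):

    # Render a ReST fragment containing a math directive to HTML.
    from docutils.core import publish_parts

    source = r"""Euler's identity:

    .. math::

       e^{i\pi} + 1 = 0
    """
    print(publish_parts(source=source, writer_name="html")["body"])

How well the equation comes out then depends on the chosen output
writer, exactly as you say.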
There's also AsciiDoc: http://www.methods.co.nz/asciidoc/
I'm all for minimal modifications to the source texts, especially if
PG might pick up the changes/tools.
--
Brian L. Troutwine
+1
> I'm certainly familiar with some of the problems with PG from an end-
> user's perspective, but I'm not familiar with it from a contributor's/
> editor's perspective. We should probably identify the personas (yes,
> I've done a lot of product management ;) and a couple of goals for
> each. At the very least, we have three personas we will want to
> cover: author (for people adding newly created content to the public
> domain), editor (people adding other people's PD work to the project
> and making revisions to it [note: there may be two personas there]),
> and reader.
>
> Having worked at a company producing software for libraries, I can
> understand how PG could be mired in deep muck around "standards". That
> is one of the biggest advantages to a process like Python's BDFL.
> Having somebody at the top who can, when all is said and done, say
> "this is the right thing for the project" can be a very good thing to
> keep people from bike shedding.
>
> tj
--
Brian L. Troutwine
I'd love to see you elaborate more on this.
> Having worked at a company producing software for libraries, I can
> understand how PG could be mired in deep muck around "standards". That
> is one of the biggest advantages to a process like Python's BDFL.
> Having somebody at the top who can, when all is said and done, say
> "this is the right thing for the project" can be a very good thing to
> keep people from bike shedding.
>
> tj
--
Brian L. Troutwine
My wife does Old English studies and translations, so I am not terribly
unfamiliar with your particular discipline's methodology. Hi!
> 1) Start small.
>
> Yes. The idea of identifying a handful of texts (or maybe even a
> single text) sounds very smart. Brian's suggestions (Walden, etc) seem
> like good ideas to me.
>
> 2) What exactly is the goal?
>
> To my mind the goal is to bring good, readable, public domain texts to
> readers. Project Gutenberg is great; but as everyone here recognizes
> it has some drawbacks. (I'll elaborate on those criteria, good and
> readable, a bit below.) I'd point out PG is not the only source for
> such texts; check out, for instance, the Oxford Text Archive:
> http://ota.ahds.ac.uk/ (there are others).
Indeed, and a good point. I think Project Gutenberg is merely the most
public example of an anarchy of archives, and one which might well be
the worst managed, from the point of view of someone with
library-science sympathies. Still, outside of Archive.org's scanned
materials, PG has the most _popular_ works available.
> If the goal is to increase access to public domain texts, I think we
> might imagine our goal as re-mediating print objects. Tools for
> creating born-digital etexts seem, to my mind, available right now.
Do elaborate on this thought, please, with special emphasis on its
consequences.
> In passing, I'd like to stress that reducing the friction between the
> public domain and the reading public is an unalloyed good. Many of
> these texts have lives in public schools (the novels of Twain, poems
> of Keats, Shelley, and Browning, Shakespeare, etc) and (at least
> potentially) have very large audiences. As I have said before to other
> audiences, "Imagine if no one ever had to pay for Jane Eyre again."
Absolutely agreed. PG falls down here in its lack of machine-parsable
texts. That is, while I can easily make a machine lex a PG text, I
can't parse one into a syntax tree. That inhibits conversion into
modern file formats, severely limiting the appeal of PG's services.
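To make that concrete: about the best a program can do against a PG
'.txt' is heuristic lexing. A toy sketch (the pattern below is an
illustrative guess, not any real PG tooling):

    import re

    def guess_chapters(text):
        # Split on lines that look like chapter headings ("CHAPTER I.",
        # "Chapter 2", ...). This is lexing by heuristic: poetry, tables
        # and footnotes all come out as undifferentiated text, which is
        # exactly the structure a real parse would have to recover.
        return re.split(r"(?im)^chapter\s+[ivxlc\d]+.*$", text)

A proper markup format would make that sort of guessing unnecessary.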
> 3) Formats
>
> I think the best possible solution would be a single markup format
> which could be processed out to LaTeX (for print/PDF; everything that
> Felix is talking about), ePub (mobi, whatever, for e-readers), HTML,
> and plain text.
Agreed.
> Have a look at the TEI; TEI is a flavor of XML and is considered in
> many ways the standard for document encoding for many academic
> projects. It is not a format in the strict sense, but a flexible,
> customizable standard. Because it is not a single format, it is not as
> "standard" as you'd like a standard to be. One of the challenges the
> TEI faces is that it tries to do everything: that includes medieval
> manuscripts, electronic texts, incunabula, printed books, etc; because
> of that it is very complicated and somewhat fragmented. I think most
> folks here are interested in ~printed books~. This can simplify our
> markup needs significantly. Some form of simplified TEI might be a
> good bet.
>
> Check out the TEI Stylesheets for a first step towards moving from TEI
> to LaTeX, HTML, etc:
> http://www.tei-c.org/Tools/Stylesheets/
I've worked with TEI a bit, and it's the humanities' answer to the CS
crowd's DocBook: XSLT stylesheets over rigorously defined XML in both
cases, ostensibly the _most_ general format possible, being naught but
XML, yet suffering for that exact reason. Such a broad solution runs
afoul of the Worse-is-Better observation, tending to drive down
adoption simply through the difficulty new users have learning the
tools. That a TEI parser can't be hacked together in a language
without pre-existing, simple XML libraries is also a problem: adding
complexity to the parsing of the markup format will tend to produce a
monoculture, rather than the diverse ecosystem of code that will drive
PA on into nifty areas.
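To make the parsing burden concrete: even a trivial TEI fragment (the
one below is invented) drags in namespace handling before you can pull
out a single paragraph. A minimal sketch using Python's standard
library:

    import xml.etree.ElementTree as ET

    fragment = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
      <text><body><p>Call me Ishmael.</p></body></text>
    </TEI>"""
    ns = {"tei": "http://www.tei-c.org/ns/1.0"}
    for p in ET.fromstring(fragment).findall(".//tei:p", ns):
        print(p.text)

That's tolerable where good XML libraries exist, and a real barrier
everywhere else.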
I'd rather see the project adopt a markup format that can be quickly
understood, even if it's not generally applicable, than the inverse,
even though that will mean, in time, hashing out extensions to ReST
and a process around them.
> 4) Final Thoughts from a Literature Scholar
>
> I mentioned two criteria; readable texts is a function, I think, of
> format. The other criterion is "good" texts. By "good" I emphatically
> do not mean the quality of the literature; I mean texts whose identity
> and provenance is unambiguous.
>
> This can easily seem like a silly, pedantic question: folks who aren't
> familiar with critical editing may not realize that a text can often
> exist in different editions, states, printings, and versions (i.e. the
> very complicated histories books have in the course of their transmission).
> Such matters are very complicated. Establishing what "the text" is, is
> a matter of no small complication. Gutenberg texts often frighten
> scholars because it is very unclear where they come from.
I would love to see PA include rigorous meta-data on each work's
history. It would be helpful if you could choose a work for inclusion
in PA and produce what you'd like to see -- presumably something with
a complicated history, but not an absolutely tortuous one.
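To make the wish concrete, a purely hypothetical sketch of the sort of
provenance record I have in mind (the field names are invented, not
any agreed schema):

    # Hypothetical provenance metadata for one work; illustrative only.
    provenance = {
        "title": "Frankenstein; or, The Modern Prometheus",
        "print_source": "London: Lackington, Hughes, Harding, Mavor & Jones, 1818",
        "transcription": "keyed from a scan of the first edition",
        "revisions": ["corrected volume I transcription typos"],
    }

Even a handful of such fields would put PA well ahead of where PG is
today.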
> I know the desires of scholars and those of readers (and say, this
> mailing list) are not always the same. In the world I'd love to live
> in, there would be texts based on some existing print edition. We
> would have consistently marked up electronic texts based on clearly
> identified print editions (with GOOD metadata) which would provide the
> raw materials for folks to create their own editions. The ability to
> easily add stand-off annotations would be icing on the cake--a true
> boon for students and readers of all stripes. But here, I know, I'm
> moving well beyond the goals of making PG (or perhaps simply public
> domain) texts better (i.e. more readable).
I don't necessarily think so. Feature creep is a real concern, but if
you can make happen what you find important and enough of us feel
giddy about it, well, I'm sure we'd all agree the initial goal list
was incomplete. :)
> I look forward to seeing what folks think.
I'd love to see a worked example.
--
Brian L. Troutwine
Speaking of TEI, it looks like PG started working on using a variant
of TEI called PGTEI at one point:
http://www.gutenberg.org/tei/
http://pgtei.pglaf.org/marcello/0.4/doc/20000-h.html
Perhaps this is indeed the correct approach. One drawback, however, is
that this markup language is not simple... far from it.
The way I see it, a DVCS could ease development by allowing a more
iterative process: perhaps starting from scanned pages, then
converting to raw text, and finally refining iteratively until the
text is correctly encoded.
One nice property of TEI seems to be that it is easily exportable to a
wide variety of formats, including HTML, plain text, LaTeX, etc.
Maybe all that is missing is a nice set of web tools to guide this process.
Alexandre
On Mon, Feb 27, 2012 at 10:37 PM, cforster <chris.s...@gmail.com> wrote:
A commitment to meaningful metadata would be great. But this would
have more fundamental impacts on how folks imagine what they're doing;
to really know where a text comes from would seriously complicate the
"clean up / improve / build on" Project Gutenberg vision of PA. Such
cleaning up / clarification / metadata itself could be added later.
Thanks!
> By the way, does PG contain books that are image-heavy?
> e.g. books for children with lots of illustrations..
Yes it does. Alice in Wonderland, for instance.
--
Brian L. Troutwine
A hybrid approach might also work:
- users submit a draft version using a simple markup language (ReST/markdown)
- a first proofreading pass is done on this version
- once there are no typos left, an automated conversion can transform
the document into a more complex file format such as TEI or DocBook
(from what I can see, Pandoc could be used for this purpose; see the
sketch after this list), on which more advanced editing can be done
- advanced users can then enter additional information in the
TEI/DocBook master-file, such as author/editor/publisher, page
numbers, quotes in foreign languages, etc
- finally, standard TEI/DocBook transformation templates can be used
to export to text/html/LaTeX/pdf/epub/etc
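As a sketch of the automated conversion step above (assuming Pandoc is
installed; the file names are invented):

    import subprocess

    # ReST draft in, DocBook master out. Whether Pandoc can target TEI
    # directly would need checking; DocBook it certainly handles.
    subprocess.run(
        ["pandoc", "--from=rst", "--to=docbook",
         "walden.rst", "--output=walden.xml"],
        check=True,
    )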
This decoupled approach might also solve the issue that different
works can be better expressed using different markup languages. For
example, LaTeX is very good with math, ReST is very simple for end
users, etc.
Alexandre
> On Feb 28, 2:18 pm, Alexandre Raymond <cerb...@gmail.com> wrote:
>> A hybrid approach might also work:
>>
>> - users submit a draft version using a simple markup language (ReST/markdown)
>> - a first proofreading pass is done on this version
>
> <snip>
>
> *A Modest Proposal:* As my earlier post suggests, my desires run in
> this direction as well; but if our chief conviction here is that a DVCS
> can improve public domain texts, maybe we should stop trying to decide
> on / invent a standard (TEI, ReST, Markdown, whatever) and define
> instead a handful of criteria which are our goal and a text to work
> on. For example, everyone who cares about this project, grab the PG
> text of _Frankenstein_ and produce a version which:
>
> - has a source in some modifiable format (points for keeping it easy
> to edit and metadata rich!)
> - outputs to a variety of formats: ePub / HTML / LaTeX (PDF)
A DVCS isn't going to solve the "this is a lot of work" problem. I
think it's advantageous to discuss what could work and what won't, and
to lay out some direction, before everybody runs off to a corner to
work. People will get strongly invested in the work they've done, and
we want that work to be in the general direction of where we want to go.
*A Modest Counter-Proposal:* Let's identify three to five questions we
are trying to answer with this work (the number of questions is
deliberately chosen to keep us focused) and then decide how to answer
those questions (might be teams, individuals attacking a book, or
individuals attacking multiple books). The questions I will lay out
for consideration are:
1. Is there a format that allows straightforward editing and patching
that can be turned into a presentable ebook using automated tools? In
other words, what is the effort to go from "I need to figure out how
to fix this typo" to "I have a new book on my device"?
2. Is there a single source format that is appropriate for all kinds
of books, or does it make more sense to have an 80/20 rule, with a
simpler source format for 80% of the books and a more complex one for
the rest (mathematical functions, etc.)?
3. Do we want to focus exclusively on the past or do we want to
provide an avenue for future public domain works? My personal
opinion: I would love to be thinking about the future as well, when an
author may want to start bridging the gap between "book" and "directed
multimedia experience". I also don't think we want to *solve* that
problem today, but it is worth considering whether it is something we
value, as it might impact the approach to #1 and #2.
4. How can we manage meta-data without incurring an associated penalty
on the ability to edit the source?
Once we have these questions, I think it would be worth putting them
up on a wiki for easy reference. I think the next steps will become
much clearer (and the output much more valuable) once we are all trying
to answer the same questions.
Tj