TEI Simple


Andrew Dunning

Jul 4, 2015, 3:57:45 PM
to gitenber...@googlegroups.com
I was pleased to come across this project; it seems to me that our current methods for putting old books online are incredibly shoddy, and that just a little work could make a great deal of difference.

I've experimented with putting a few works up on Project Gutenberg using their PG-RST format, and have appreciated its general simplicity, but I find that it simply doesn't support a number of things needed for displaying print material: for example, there is no good way of showing sidenotes or other marginal text, or for indicating where I have made corrections to the text. As far as I can tell, the only format that really covers everything is TEI, and they are developing a simplified version of their markup at <https://github.com/TEIC/TEI-Simple/>. This is used for the massive collection of texts at <https://github.com/textcreationpartnership>. It would be really interesting to apply some of the research into this to creating online books that are sustainable, portable, and accurate at the same time.
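To make that concrete, here is a hypothetical fragment showing the kind of markup TEI provides for both cases (the elements, note with a place attribute for marginal text and choice/sic/corr for editorial corrections, are standard TEI; the content is invented):

```xml
<p>A paragraph with a marginal note
  <note place="margin">Sidenote text here.</note>
  and an editorial correction:
  <choice><sic>hte</sic><corr>the</corr></choice> reader
  sees the corrected text, but the original error is preserved.</p>
```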

Eric Hellman

Jul 4, 2015, 4:50:22 PM
to gitenber...@googlegroups.com
Project Gutenberg contains some books with TEI encoding. It should be possible to add support for TEI in the toolchain we are building.

Eric


--
You received this message because you are subscribed to the Google Groups "GITenberg Project" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gitenberg-proj...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/gitenberg-project/66014f4f-dc18-43fb-9f25-4573a05a31e8%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Jonathan Reeve

Jul 6, 2015, 1:39:37 PM
to gitenber...@googlegroups.com
Another nice example of TEI-Simple is Martin Mueller's Shakespeare His Contemporaries corpus. As an experiment, I broke up the corpus into submodule repos, based on a methodology I sketched in this blog post. I think it'd be fun to create a script that can automatically generate TEI-Simple texts from PG HTML or other markup formats. TEI-Simple would also be a good base format from which HTML and plaintext files can be generated dynamically via XSLT.
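For example, applying such a stylesheet from Python might look like this (a minimal sketch using lxml; the TEI fragment and the stylesheet are invented for illustration, not taken from any actual TEI-Simple processing model):

```python
from lxml import etree

# A tiny TEI-flavoured fragment (hypothetical sample, not a full document).
tei = etree.XML(
    b"""<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <head>Chapter I</head>
    <p>It was a dark and stormy night.</p>
  </body></text>
</TEI>"""
)

# A minimal XSLT mapping TEI <head> to HTML <h1> and TEI <p> to HTML <p>.
xslt = etree.XML(
    b"""<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:tei="http://www.tei-c.org/ns/1.0">
  <xsl:template match="/">
    <html><body><xsl:apply-templates/></body></html>
  </xsl:template>
  <xsl:template match="tei:head"><h1><xsl:apply-templates/></h1></xsl:template>
  <xsl:template match="tei:p"><p><xsl:apply-templates/></p></xsl:template>
  <xsl:template match="text()"><xsl:value-of select="."/></xsl:template>
</xsl:stylesheet>"""
)

# Compile the stylesheet and transform the TEI source into HTML.
transform = etree.XSLT(xslt)
html = transform(tei)
print(str(html))
```

The same source could be run through a second stylesheet that strips all tags to produce the plaintext edition, which is the appeal of keeping TEI-Simple as the master format.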

Seth Woodworth

Jul 10, 2015, 12:33:58 PM
to gitenber...@googlegroups.com
I've been really curious to meet folks who have used PG-RST.  I spent a lot of time evaluating RST as a markup format for the project.  I agree that it has a few weak points, but it was a great initiative, and I wish it had worked out.

We've been using Asciidoc (for now at least).  I agree that TEI is an ideal format in a lot of ways.  Were it easier to work with, I would love to use TEI as the format for books.  But for simplicity of editing a text file, I think it misses the mark.

It may be far easier to write a custom backend for Asciidoctor (which would consist of xml templates) to generate TEI-Simple.
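As a toy illustration of the mapping such a backend would encode (this is not Asciidoctor itself; a real backend would register templates with Asciidoctor's converter API, and the element mapping here is an assumption for illustration):

```python
import re

def asciidoc_to_tei(text):
    """Convert a tiny subset of AsciiDoc (== section headings and plain
    paragraphs) into TEI-Simple-style body markup."""
    out = ["<body>"]
    in_div = False
    # AsciiDoc separates blocks with blank lines.
    for block in text.strip().split("\n\n"):
        block = block.strip()
        m = re.match(r"^==\s+(.*)$", block)
        if m:
            # A "== Title" line opens a new section division.
            if in_div:
                out.append("</div>")
            out.append('<div type="section"><head>%s</head>' % m.group(1))
            in_div = True
        else:
            # Anything else is treated as a paragraph.
            out.append("<p>%s</p>" % block)
    if in_div:
        out.append("</div>")
    out.append("</body>")
    return "\n".join(out)

sample = """== Chapter I

It was a dark and stormy night."""

print(asciidoc_to_tei(sample))
```

A real Asciidoctor backend would handle the full node tree (lists, quotes, footnotes, attributes), but the shape of the work is the same: one template per block type, emitting the corresponding TEI element.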

Overall, I think our criteria for formats involve:

0) quantity/quality of tools
1) ease of editing
2) variety of output formats

Are there really good tools for TEI-Simple yet?


Jon Hurst

Jul 11, 2015, 6:42:33 AM
to gitenber...@googlegroups.com, se...@sethish.com
Hi Seth,

This is an official rabbit hole warning. Firstly, you will never get two people to agree on a master format. Secondly, it doesn't really matter. Anything can be converted with relative ease to anything else with a modicum of human intervention and history has shown that the internet is full of people who are willing to do these conversions for the master format that they champion. Pick one, any one, it really truly, honestly makes not a jot of difference which. Just don't expect anyone to agree with your choice.

Your problem isn't and never will be which master format is best. Your problem is the extremely poor quality of your source material. You can create the most technically excellent markup in the world, but if what you are marking up is of poor quality, your book will be of poor quality and your time will have been thoroughly wasted. As a general rule, any book that you might have heard of in Project Gutenberg's collection will have been done early enough in the project's life cycle that it will be full of typos, will be of completely unknown origin and will not have any source page images available. They are of use as reference checks for cross-proofing but have little else going for them.

IMHO it is simply not worth working on any book where you do not have access to the source page images. Git is about making maintenance easy; without these source page images maintenance is not even really possible. It is not the absence of distributed version control that has made Project Gutenberg the unmaintained mess that it is; it is the fact that maintenance has never been baked into its processes and seemingly never will be either. I wish you luck with gitenberg, but you are definitely building on quicksand.

Eric Hellman

Jul 11, 2015, 1:11:46 PM
to gitenber...@googlegroups.com, Seth Woodworth
It turns out that many of the more recent scan files are stowed away in places that we could re-expose. We've been in contact with the Distributed Proofreaders folks. For the older, more well-known books, multiple copies will have been scanned more recently and can be found in HathiTrust.

There may be quicksand, but the global environment is drying it out.


Seth Woodworth

Jul 11, 2015, 1:39:51 PM
to gitenber...@googlegroups.com

> This is an official rabbit hole warning.

:-D
 
> Firstly, you will never get two people to agree on a master format... don't expect anyone to agree with your choice.

Agreed. I do think that longevity, extensibility, and multi-platform, multi-implementation support are important, and I think they have been addressed to a degree.

> Your problem isn't and never will be which master format is best. Your problem is the extremely poor quality of your source material. You can create the most technically excellent markup in the world, but if what you are marking up is of poor quality, your book will be of poor quality and your time will have been thoroughly wasted. As a general rule, any book that you might have heard of in Project Gutenberg's collection will have been done early enough in the project's life cycle that it will be full of typos, will be of completely unknown origin and will not have any source page images available. They are of use as reference checks for cross-proofing but have little else going for them.

This is a very interesting problem.  As Eric says, many of these scans are available through DP (although this has not yet been tested).  Some books have page scans in the repos themselves (although these are rare).

As you likely know, PG will often have multiple editions of the same book.  There hasn't been a consensus reached about which PG edition of a book to reformat into asciidoc.  Your point suggests to me that we could find which books we can get scans of, and focus on those books first.  There are enough books in GITenberg that need converting to asciidoc that we can push off books lacking scans until later, if in fact we convert them at all.
 
> It is not the absence of distributed version control that has made Project Gutenberg the unmaintained mess that it is; it is the fact that maintenance has never been baked into its processes and seemingly never will be either.

Watching the feed of updated books in PG, and speaking to their director, I get the impression that several books are fixed each week.  These might be minor changes, and may not address the overall inconsistency of the collection.  When I spoke with Greg Newby last year, he was very interested in adopting the GITenberg version control and workflow once the issues had been worked out and tested.
 
> I wish you luck with gitenberg, but you are definitely building on quicksand.

There is little reason why this methodology need be limited to the Project Gutenberg collection.  Were there a set of books with a better connection to original scans, a new GitHub org could and should be set up to track them.  Most of the command-line tools I am building (gitberg) are agnostic about which GitHub org they are working against.

Jon Hurst

Jul 14, 2015, 4:31:31 AM
to gitenber...@googlegroups.com, se...@sethish.com
> This is a very interesting problem.  As Eric says, many of these scans are available through DP (although this has not yet been tested).  Some books have page scans in the repos themselves (although these are rare).

> As you likely know, PG will often have multiple editions of the same book.  There hasn't been a consensus reached about which PG edition of a book to reformat into asciidoc.  Your point suggests to me that we could find which books we can get scans of, and focus on those books first.  There are enough books in GITenberg that need converted to asciidoc that we can push off books lacking scans until later, if in fact we convert them at all.

Unfortunately you are running headlong into the Gutenberg paradox. By the time that DP got its act together, any book that was worth doing had already been done, and once a book is done and in the Gutenberg collection it is essentially unassailable, no matter how bad it is. Nearly all of DP's output is therefore fairly niche and likely of less interest to you for your asciidoc conversion; the time spent converting it will exceed the cumulative time spent reading it. You are therefore left with a situation where almost none of the books that are worth doing (because people would read them) are worth doing (because there are no scans).

The solution is, of course, to redo the mainstream books properly from scratch, using the Gutenberg collection as a cross-proofing resource. Proper version control would be an important element in such an undertaking. It would, however, be too much work for any individual or small group to contemplate. DP could do it, but they won't unless Greg Newby undertakes to replace the old crufty versions rather than just add a second version that will never get surfaced. It's an impasse that has been going on for years now.

> Watching the feed of updated books in PG, and speaking to their director, I get the impression that several books are fixed each week.  These might be minor changes, and may not address the overall inconsistency of the collection.  When I spoke with Greg Newby last year, he was very interested in adopting the GITenberg version control and workflow once issues had been worked out and had been tested.

As far as I am aware there is just one "whitewasher" at PG who has any real interest in processing errata, and the errata backlog is immense. If a book has just been "published" on PG and you send an errata report containing a couple of errors to the whitewasher responsible, you may get the book updated. Anything more significant will be ignored. As an experiment, I once produced comprehensive errata for the new version of Huckleberry Finn. I had access to the original scans and a line-synced text from the original producer, and cross-proofed it against the old crufty version. This picked out about 100 errors in the new version. I couldn't, of course, do anything about the old crufty version because I didn't have page scans for it or even, for that matter, the slightest idea which edition it was supposed to represent. I sent my list to the original producer, who confirmed them as errors. This is absolutely a best-case scenario. Known scans. Original producer involved. Mainstream book. The result was zero interest from PG.
 
Sorry to be so negative. I really hope that your project is part of the solution to this mess.

Tom Morris

Jul 14, 2015, 3:25:13 PM
to gitenber...@googlegroups.com, Seth Woodworth
On Tue, Jul 14, 2015 at 4:31 AM, Jon Hurst <jhurs...@gmail.com> wrote:
> Unfortunately you are running headlong into the Gutenberg paradox. By the time that DP got its act together, any book that was worth doing had already been done, and once a book is done and in the Gutenberg collection it is essentially unassailable, no matter how bad it is. Nearly all of DP's output is therefore fairly niche and likely of less interest to you for your asciidoc conversion; the time spent converting it will exceed the cumulative time spent reading it. You are therefore left with a situation where almost none of the books that are worth doing (because people would read them) are worth doing (because there are no scans).

You seem to be ignoring, intentionally or unintentionally, the point about multiple editions of popular books being available in PG.  It is a fact that DP has redone many (most?) of the works that you are complaining about.  The main stumbling block is discovery.  Because the PG search facility prioritizes editions by the number of downloads, there's a vicious circle where the newer, better editions never get discovered and downloaded, so they always have lower download counts.

I've always assumed that Gitenberg was going to have its own distribution mechanism, so it could boost editions in search results by quality.  Of course, the very fact that Gitenberg is not PG means that it will rank lower than the natively hosted material, but that's an issue regardless.

Tom

p.s. I agree that "editionless" early PG works with no provenance or page images aren't worth trying to deal with.

Jon Hurst

Jul 15, 2015, 3:59:38 AM
to gitenber...@googlegroups.com, se...@sethish.com
> You seem to be ignoring, intentionally or unintentionally, the point about multiple editions of popular books being available in PG.  It is a fact that DP has redone many (most?) of the works that you are complaining about.  The main stumbling block is discovery.  Because the PG search facility prioritizes editions by the number of downloads, there's a vicious circle where the newer, better editions never get discovered and downloaded, so they always have lower download counts.

A quick unscientific poll involving looking up the first 20 books on the left-hand side of the bookshelf where I keep my paper classics shows just three with a recent DP version (David Copperfield, Pickwick Papers, Pride and Prejudice), two of which I was intimately involved in producing (when experimenting, I chose my favorite books to work with), so that skews it to the optimistic side. One other (Sense and Sensibility) has an alternate edition done by solo producers, which likely has the same problem of missing source images. Most of these 20 look like they might have an alternate edition in the form of a "World's Greatest Books" or a "Tales from Dickens", but these turn out not to be the actual books. There are also often LibriVox editions, which while laudable aren't really applicable to gitenberg. I think it would be more accurate to say DP has redone a precious few of the major works rather than many or most.

Discovery is indeed the main stumbling block, simply because it makes DP Project Managers feel that redoing major works is not worthwhile. The actual proofers do seem to like doing these reworks (Pride and Prejudice in particular absolutely screamed through the rounds), but it is the PMs that set the agenda. To get the PMs interested, Greg Newby would have to promise to replace the crufty editions outright, and he is adamantly opposed to doing so. The only plausible solution to this problem would be for DP to host a library containing only DP produced books, filling in all the holes left by the absence of non-DP PG works with shiny new digitisations complete with source page images. This would be an absolutely fantastic resource, particularly if you Gitenberg folks were to help them bake in maintainability, but they are very reluctant to take this step for a number of reasons, the foremost being a pathological fear of change. The situation is not completely hopeless, just almost so.

Seth Woodworth

Jul 21, 2015, 4:32:27 PM
to gitenber...@googlegroups.com
> Discovery is indeed the main stumbling block, simply because it makes DP Project Managers feel that redoing major works is not worthwhile. The actual proofers do seem to like doing these reworks (Pride and Prejudice in particular absolutely screamed through the rounds), but it is the PMs that set the agenda. To get the PMs interested, Greg Newby would have to promise to replace the crufty editions outright, and he is adamantly opposed to doing so. The only plausible solution to this problem would be for DP to host a library containing only DP produced books, filling in all the holes left by the absence of non-DP PG works with shiny new digitisations complete with source page images. This would be an absolutely fantastic resource, particularly if you Gitenberg folks were to help them bake in maintainability,

More than help DP bake in maintainability, I would love to see a modern proofreading UI/UX and catalog.  In my imagination, it would look something like this live CSS editor I am working on for HTMLBook, or the GitBook interface. But this would be a very ambitious project that would require a close working relationship with DP volunteers, and would be a significant engineering effort.

HathiTrust has expressed interest in using a GITenberg-like workflow to manage the OCR'd text for the 15 million books in their collection.  Their OCR'd text and scans would make a great starting point for such an initiative.

BUT, getting GITenberg documented and shared is a much higher priority.

--S
