On Jul 4, 2015, at 3:57 PM, Andrew Dunning <andu...@gmail.com> wrote:
I was pleased to come across this project; it seems to me that our current methods for putting old books online are incredibly shoddy, and that just a little work could make a great deal of difference.

I've experimented with putting a few works up on Project Gutenberg using their PG-RST format, and have appreciated its general simplicity, but I find that it simply doesn't support a number of things needed for displaying print material: for example, there is no good way of showing sidenotes or other marginal text, or for indicating where I have made corrections to the text. As far as I can tell, the only format that really covers everything is TEI, and they are developing a simplified version of their markup at <https://github.com/TEIC/TEI-Simple/>. This is used for the massive collection of texts at <https://github.com/textcreationpartnership>. It would be really interesting to apply some of the research into this to creating online books that are sustainable, portable, and accurate at the same time.
--
You received this message because you are subscribed to the Google Groups "GITenberg Project" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gitenberg-proj...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/gitenberg-project/66014f4f-dc18-43fb-9f25-4573a05a31e8%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
This is an official rabbit hole warning.
Firstly, you will never get two people to agree on a master format... don't expect anyone to agree with your choice.
Your problem isn't and never will be which master format is best. Your problem is the extremely poor quality of your source material. You can create the most technically excellent markup in the world, but if what you are marking up is of poor quality, your book will be of poor quality and your time will have been thoroughly wasted. As a general rule, any book that you might have heard of in Project Gutenberg's collection will have been done early enough in the project's life cycle that it will be full of typos, will be of completely unknown origin and will not have any source page images available. They are of use as reference checks for cross-proofing but have little else going for them.
It is not the absence of distributed version control that has made Project Gutenberg the unmaintained mess that it is; it is the fact that maintenance has never been baked into its processes and seemingly never will be either.
I wish you luck with gitenberg, but you are definitely building on quicksand.
This is a very interesting problem. As Eric says, many of these scans are available through DP (although this has not yet been tested). Some books have page scans in the repos themselves (although these are rare). As you likely know, PG will often have multiple editions of the same book. No consensus has been reached about which PG edition of a book to reformat into asciidoc. Your point suggests to me that we could find which books we can get scans of, and focus on those books first. There are enough books in GITenberg that need to be converted to asciidoc that we can push off books lacking scans until later, if in fact we convert them at all.
Watching the feed of updated books in PG, and speaking to their director, I get the impression that several books are fixed each week. These might be minor changes, and may not address the overall inconsistency of the collection. When I spoke with Greg Newby last year, he was very interested in adopting the GITenberg version control and workflow once issues had been worked out and had been tested.
Unfortunately you are running headlong into the Gutenberg paradox. By the time that DP got its act together, any book that was worth doing had already been done, and once a book is done and in the Gutenberg collection it is essentially unassailable, no matter how bad it is. Nearly all of DP's output is therefore fairly niche and likely of less interest to you for your asciidoc conversion; the time spent converting it will exceed the cumulative time spent reading it. You are therefore left with a situation in which almost none of the books that are worth doing (because people would read them) are worth doing (because there are no scans).
You seem to be ignoring, intentionally or unintentionally, the point about multiple editions of popular books being available in PG. It is a fact that DP has redone many (most?) of the works that you are complaining about. The main stumbling block is discovery. Because the PG search facility prioritizes editions by the number of downloads, there's a vicious circle where the newer, better editions never get discovered and downloaded, so they always have lower download counts.
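The vicious circle described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not a description of how PG's search actually works: the edition names, the starting counts, and the idea that a fixed 90% of readers pick the first search result are all hypothetical values chosen for illustration.

```python
def rank_by_downloads(counts):
    """Return edition names in the order a download-sorted search would list them."""
    return sorted(counts, key=counts.get, reverse=True)


def simulate(counts, rounds, readers_per_round=100, top_share=0.9):
    """Deterministic toy model of download-ranked discovery.

    Each round, a fixed share of readers (top_share, an assumed value)
    downloads the first search result and the rest download the second.
    """
    counts = dict(counts)  # work on a copy so the caller's data is untouched
    for _ in range(rounds):
        first, second = rank_by_downloads(counts)[:2]
        to_first = round(readers_per_round * top_share)
        counts[first] += to_first
        counts[second] += readers_per_round - to_first
    return counts


if __name__ == "__main__":
    # Hypothetical starting point: an old edition with a long head start.
    start = {"old_edition": 1000, "new_dp_edition": 10}
    print(simulate(start, rounds=10))
```

Under these assumptions the absolute gap between the two editions widens every round, so the newer edition never reaches the top of the listing no matter how long the model runs.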
Discovery is indeed the main stumbling block, simply because it makes DP Project Managers feel that redoing major works is not worthwhile. The actual proofers do seem to like doing these reworks (Pride and Prejudice in particular absolutely screamed through the rounds), but it is the PMs who set the agenda. To get the PMs interested, Greg Newby would have to promise to replace the crufty editions outright, and he is adamantly opposed to doing so. The only plausible solution to this problem would be for DP to host a library containing only DP-produced books, filling in all the holes left by the absence of non-DP PG works with shiny new digitisations complete with source page images. This would be an absolutely fantastic resource, particularly if you Gitenberg folks were to help them bake in maintainability,