How do DistributedProofreaders, FadedPage, InternetArchive, OpenLibrary, etc. fit with this?


Tom Morris


Mar 25, 2015, 1:43:19 PM

to gitenber...@googlegroups.com

There's been a little bit of discussion about the relationship between PG and Gitenberg because that's the source of all the texts in Gitenberg, but there's really been none, at least publicly, about how Gitenberg fits into the larger public domain eBook ecosystem.

Using Raymond's recent example:

http://www.pgdp.net/c/project.php?id=projectID505c8a74597b3

We have links to:

PG http://www.gutenberg.org/ebooks/48573

DP proofreaders discussion forum http://www.pgdp.net/phpBB2/viewtopic.php?t=52048

Scanned images, OCR & proofread page texts, as well as diffs http://www.pgdp.net/c/tools/project_manager/page_detail.php?project=projectID505c8a74597b3&show_image_size=0

Buried down at the bottom is a link to the original Internet Archive source:

http://archive.org/details/britishforeignar00ashduoft

which, in turn, has links to:

MARC/XML from donor University of Toronto: https://archive.org/download/britishforeignar00ashduoft/britishforeignar00ashduoft_archive_marc.xml

OpenLibrary editable page: https://openlibrary.org/books/OL7096037M/British_and_foreign_arms_armour
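To make the chain of links above concrete, here is a minimal Python sketch that pulls the per-site identifiers out of those URLs and collects them into one cross-reference record. The function name and the record's field names are made up for illustration; only the URLs come from this message, and the URL patterns are inferred from these examples rather than from any site's documentation.

```python
import re
from urllib.parse import urlparse, parse_qs

def extract_ids(urls):
    """Collect the per-site identifiers found in a list of URLs."""
    record = {}
    for url in urls:
        parsed = urlparse(url)
        host = parsed.netloc.lower()
        if host.startswith("www."):
            host = host[4:]
        if host == "gutenberg.org":
            m = re.search(r"/ebooks/(\d+)", parsed.path)
            if m:
                record["pg_ebook"] = int(m.group(1))
        elif host == "pgdp.net":
            qs = parse_qs(parsed.query)
            # In the URLs above, project.php uses "id" and
            # page_detail.php uses "project" for the same value.
            for key in ("id", "project"):
                if key in qs:
                    record["dp_project"] = qs[key][0]
        elif host == "archive.org":
            m = re.match(r"/(?:details|download)/([^/]+)", parsed.path)
            if m:
                record["ia_item"] = m.group(1)
        elif host == "openlibrary.org":
            m = re.match(r"/books/(OL\d+M)", parsed.path)
            if m:
                record["ol_edition"] = m.group(1)
    return record

URLS = [
    "http://www.gutenberg.org/ebooks/48573",
    "http://www.pgdp.net/c/project.php?id=projectID505c8a74597b3",
    "http://archive.org/details/britishforeignar00ashduoft",
    "https://openlibrary.org/books/OL7096037M/British_and_foreign_arms_armour",
]

record = extract_ids(URLS)
print(record)
```

A record like this is the kind of thing that currently exists nowhere: each site knows only its own identifier and, at best, a link to one neighbour.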

OpenLibrary links back to the crappy IA formats which were derived from the raw uncorrected OCR, but not the beautiful DP-provided HTML or text.

The next time Internet Archive scrapes Project Gutenberg, a new file will show up at IA which is totally unrelated to any of the above.

I think that the focus on Project Gutenberg, while understandable, may be a mistake. They've got a brand and a repository, but all the heavy lifting is done by DistributedProofreaders, and there are many more pieces to the ecosystem.

Additionally, there are other DP sites like DP Canada, which self-hosts its output at FadedPage.

Making PG texts editable is a good goal, but just one tiny piece of the puzzle. Has any thought been given to the overall problem?

Tom

J West


Mar 26, 2015, 10:37:49 AM

to gitenber...@googlegroups.com

> OpenLibrary links back to the crappy IA formats which were derived from the raw uncorrected OCR, but not the beautiful DP-provided HTML or text.
>
> The next time Internet Archive scrapes Project Gutenberg, a new file will show up at IA which is totally unrelated to any of the above.

We could (and would be happy to) link to something better and are happy having our editable metadata pages be available for grabbing the MARCs or whatever.

The usual workflow at IA is that uploading a PDF will spur the rendering into the other formats, so there's currently no good way to link to high-quality text files except manually. That said, manual linkage would be just terrific; maybe there's a way to automate it through the API?
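As a rough idea of what an automated linkage step might look like, here is a small Python sketch that pairs an Open Library edition with its Project Gutenberg ebook. It is only a sketch under assumptions: `propose_link` is a hypothetical helper, the `key` and `ocaid` field names are assumed to match what an edition record contains, and the sample record is hand-built from the URLs in this thread rather than fetched from any API. Nothing here writes back to Open Library.

```python
def propose_link(edition, pg_ebook_id):
    """Pair an Open Library edition with a Project Gutenberg ebook.

    `edition` is a dict shaped like an Open Library edition record.
    The "key" and "ocaid" field names are assumptions in this sketch.
    """
    ia_item = edition.get("ocaid")
    if ia_item is None:
        return None  # no Internet Archive scan to anchor the link to
    return {
        "ol_edition": edition["key"].rsplit("/", 1)[-1],
        "ia_item": ia_item,
        "pg_url": f"http://www.gutenberg.org/ebooks/{pg_ebook_id}",
    }

# Hand-built sample using identifiers from this thread (not fetched data).
sample_edition = {
    "key": "/books/OL7096037M",
    "ocaid": "britishforeignar00ashduoft",
}

link = propose_link(sample_edition, 48573)
print(link)
```

The hard part is not shown: knowing that ebook 48573 belongs with this edition in the first place, which today only the DP project page records.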

One of the things on the (long) to-do list is to do something about all the bad scans that IA has, either finding a way to get better quality or at least reporting poor quality in some real way. But there's always been an emphasis on quantity over quality there, and as a corporate culture issue that may not be surmountable.

Jessamyn
Open Library person
