How do DistributedProofreaders, FadedPage, InternetArchive, OpenLibrary, etc fit with this?

Tom Morris

Mar 25, 2015, 1:43:19 PM
to gitenber...@googlegroups.com
There's been a little discussion about the relationship between PG and Gitenberg, since PG is the source of all the texts in Gitenberg, but there's really been none, at least publicly, about how Gitenberg fits into the larger public domain eBook ecosystem.

Using Raymond's recent example:


We have links to:

DP proofreaders discussion forum http://www.pgdp.net/phpBB2/viewtopic.php?t=52048

Buried down at the bottom is a link to the original Internet Archive source:


which, in turn, has links to:


OpenLibrary links back to the crappy IA formats, which were derived from the raw uncorrected OCR, but not to the beautiful DP-provided HTML or text.

The next time Internet Archive scrapes Project Gutenberg, a new file will show up at IA that is totally unrelated to any of the above.

I think that the focus on Project Gutenberg, while understandable, may be a mistake.  They've got a brand and a repository, but all the heavy lifting is done by DistributedProofreaders, and there are many more pieces to the ecosystem.

Additionally, there are other DP sites, like DP Canada, which self-hosts its output at FadedPage.

Making PG texts editable is a good goal, but just one tiny piece of the puzzle.  Has any thought been given to the overall problem?

Tom

J West

Mar 26, 2015, 10:37:49 AM
to gitenber...@googlegroups.com

> OpenLibrary links back to the crappy IA formats, which were derived from the raw uncorrected OCR, but not to the beautiful DP-provided HTML or text.
>
> The next time Internet Archive scrapes Project Gutenberg, a new file will show up at IA that is totally unrelated to any of the above.

We could (and would be happy to) link to something better, and we're happy to have our editable metadata pages available for grabbing the MARCs or whatever.

The usual workflow at IA is that uploading a PDF will spur the rendering into the other formats, so there's currently no good way to link to high-quality text files except manually. That said, manual linkage would be just terrific; maybe there's a way to automate it through the API?

One of the things on the (long) to-do list is to do something about all the bad scans that IA has: finding a way to get better quality, or at least to report poor quality in some real way. But there's always been an emphasis on quantity over quality there, and as a corporate culture issue that may not be surmountable.

Jessamyn
Open Library person