GITenberg metadata, Part 1: boiling down the PG RDF


Eric Hellman

Mar 23, 2015, 8:59:52 AM3/23/15
to gitenber...@googlegroups.com
As part of the process of developing metadata for GITenberg, I've been working on understanding the metadata dumps provided by Project Gutenberg. Here's my first writeup, presented as a Gist. Comments and corrections are welcome, as always.

https://gist.github.com/eshellman/40d85be01acf1172a5c1
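If you want to poke at the files alongside the writeup, here's a minimal sketch of reading one per-book RDF/XML file with rdflib. It assumes the dcterms and 2009 pgterms namespaces used in the current dumps; the filename is just wherever you've unpacked a file.

# Sketch: extract title, authors, and subjects from one PG RDF/XML file.
from rdflib import Graph, Namespace, RDF

DCTERMS = Namespace("http://purl.org/dc/terms/")
PGTERMS = Namespace("http://www.gutenberg.org/2009/pgterms/")

g = Graph()
g.parse("pg1342.rdf", format="xml")  # any per-book file from the dump

for ebook in g.subjects(RDF.type, PGTERMS.ebook):
    print("title:", g.value(ebook, DCTERMS.title))
    for agent in g.objects(ebook, DCTERMS.creator):
        # creators are pgterms:agent nodes embedded in each file
        print("author:", g.value(agent, PGTERMS.name))
    for subj in g.objects(ebook, DCTERMS.subject):
        # subjects are nodes carrying an rdf:value and a scheme (LCSH or LCC)
        print("subject:", g.value(subj, RDF.value))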

Eric

Alexandre Rafalovitch

Mar 23, 2015, 11:38:53 AM3/23/15
to gitenber...@googlegroups.com
When I was looking, I found a couple of issues:
1) Author information is duplicated in each file, so some de-duplication has to happen, fully or semi-automatically (a rough sketch follows below).
2) The subject (LOC?) codes are not actually mapped to anything available in proper electronic taxonomy form, apparently because the full LOC taxonomy is not available for free on the web.
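For (1), here's a rough sketch of where a semi-automatic pass could start: collect the embedded agent records keyed by their PG agent URI, and flag IDs that appear with more than one spelling. The glob path is an assumption about where the dump is unpacked.

# Sketch: find PG agent IDs whose embedded records disagree across files.
import glob
from collections import defaultdict
from rdflib import Graph, Namespace

PGTERMS = Namespace("http://www.gutenberg.org/2009/pgterms/")

names_by_agent = defaultdict(set)
for path in glob.glob("rdf-files/*/pg*.rdf"):  # assumed layout of the dump
    g = Graph()
    g.parse(path, format="xml")
    for agent, name in g.subject_objects(PGTERMS.name):
        names_by_agent[str(agent)].add(str(name))

for agent, names in sorted(names_by_agent.items()):
    if len(names) > 1:  # same canonical ID, conflicting spellings
        print(agent, sorted(names))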

Regards,
   Alex. 

Seth Woodworth

Mar 23, 2015, 11:50:02 AM3/23/15
to gitenber...@googlegroups.com
The Library of Congress Subject Headings (LCSH) are, very unfortunately, not available to us :-(.  Fighting for this information to be free and available might be worthwhile, but I don't understand all of the details and ramifications yet.

The Library of Congress Classification (LCC) codes are available for us to use, but they are a small set of headings, roughly equivalent to the Dewey Decimal system.  I have a JSON file with the LCC data on GitHub, which I had to scrape out of a PDF file because I couldn't find it elsewhere online.
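For what it's worth, here's the kind of longest-prefix lookup I do against that file. The key/value shape shown is hypothetical; the real scraped JSON may be organized differently.

# Sketch: map an LCC call number to a heading by longest-prefix match.
# Assumes a flat JSON object like {"P": "Language and literature",
# "PS": "American literature", ...}; adjust to the real file's shape.
import json

with open("lcc.json") as f:  # the scraped file on GitHub
    lcc = json.load(f)

def lcc_heading(code):
    for end in range(len(code), 0, -1):  # try the longest prefix first
        heading = lcc.get(code[:end].upper())
        if heading:
            return heading
    return None

print(lcc_heading("PS3531"))  # -> "American literature" with the data above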


Seth Woodworth

Mar 23, 2015, 12:02:25 PM3/23/15
to gitenber...@googlegroups.com
On Mon, Mar 23, 2015 at 11:50 AM, Seth Woodworth <se...@sethish.com> wrote:
The Library of Congress Subject Headings (LCSH) are, very unfortunately, not available to us :-(.  Fighting for this information to be free and available might be worthwhile, but I don't understand all of the details and ramifications yet.

I haven't looked into LCSH availability for 2-3 years.  I would _love_ it if someone could show me otherwise! 

Tom Morris

Mar 23, 2015, 1:15:52 PM3/23/15
to gitenber...@googlegroups.com
They've been available for years in N3, Turtle, and RDF/XML formats as bulk downloads from id.loc.gov.

I think before they were available as bulk downloads you could get them by scraping the web site.  I'm pretty sure they're also integrated with Freebase, which would allow for an easy crosswalk to Wikipedia / Wikidata.
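A minimal sketch of indexing one of those bulk files with rdflib: LCSH is modeled in SKOS, so the preferred labels come from skos:prefLabel. The filename is a placeholder for whichever serialization you download, and the full file is big, so a streaming parser may be kinder in practice.

# Sketch: build a label -> URI index from an LCSH bulk download.
from rdflib import Graph
from rdflib.namespace import SKOS

g = Graph()
g.parse("lcsh.nt", format="nt")  # placeholder filename; N-Triples variant

label_to_uri = {
    str(label): str(concept)
    for concept, label in g.subject_objects(SKOS.prefLabel)
}
print(label_to_uri.get("Science fiction"))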

Tom

Tom Morris

Mar 24, 2015, 2:03:43 PM3/24/15
to gitenber...@googlegroups.com
On Mon, Mar 23, 2015 at 8:59 AM, Eric Hellman <er...@hellman.net> wrote:
As part of the process of developing metadata for GITenberg, I've been working on understanding the metadata dumps provided by Project Gutenberg. Here's my first writeup, presented as a Gist. Comments and corrections are welcome, as always.

https://gist.github.com/eshellman/40d85be01acf1172a5c1

I'd probably start with the higher-level conceptual stuff rather than the specific syntax, but, concerning the syntax, I think there's a tension between choosing a format which is user-editable in a text editor without tooling and amenable to line-oriented diffing, and choosing a format which can be interchanged with bibliographic systems, works well with schema.org, and meets any number of other "machine" goals.  As tempting as it might be, inventing your own proprietary bibliographic format should be a last resort.

Other thoughts:

- Converting PG-proprietary URLs to standard URLs that can be dereferenced on id.loc.gov or viaf.org would make the data much more reusable and better connected (see the sketch after this list)

- I'm pretty sure that the last time I investigated, there were a number of duplicate entries for authors, despite the nominally canonical IDs

- There are probably DCMI RDF/XML converters available (http://dublincore.org/tools/) that could be leveraged rather than rolling your own

- PG has multiple editions of many works, and often the later ones are of higher quality than the older "editionless" editions, yet the earlier ones get downloaded far more.  Enhancing the bibliographic information to help with this would be useful to readers.  For example, this early editionless Pride & Prejudice (http://www.gutenberg.org/ebooks/1342#bibrec) is downloaded over 30 times more often than this later, high-quality transcription of the 1932 R.W. Chapman edition (http://www.gutenberg.org/ebooks/42671#bibrec) donated by Distributed Proofreaders. #76 & #32325 are another example.  It would be good to be able to link the various editions together.

- Much better provenance, including links to DP projects, scanned source files, Internet Archive mirrors, etc., would be useful metadata to add

- If the production pipeline was IA->DP->PG, then a library-created MARC file is available at IA which could be exploited
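To illustrate the URL conversion in the first bullet, a sketch. The crosswalk entries are made-up placeholders (a real table would come from reconciling PG names against VIAF), and the AutoSuggest endpoint is my recollection of VIAF's API, so verify it before relying on it.

# Sketch: crosswalk PG-proprietary agent URIs to dereferenceable VIAF URIs.
import requests

PG_TO_VIAF = {
    # placeholder entry; populate by reconciliation, not by hand
    "http://www.gutenberg.org/2009/agents/123": "http://viaf.org/viaf/<viaf-id>",
}

def standardize_agent(pg_uri):
    # fall back to the PG URI until a crosswalk entry exists
    return PG_TO_VIAF.get(pg_uri, pg_uri)

def viaf_candidates(name):
    # VIAF AutoSuggest, as I remember it; returns candidate VIAF IDs
    r = requests.get("https://viaf.org/viaf/AutoSuggest", params={"query": name})
    r.raise_for_status()
    return [hit["viafid"] for hit in (r.json().get("result") or [])]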

The real question, though, is what the metadata will be used for and what we want it to look like.  Knowing the end goal, we can work back to the best way to achieve it.

Tom

Eric Hellman

Mar 24, 2015, 4:01:10 PM3/24/15
to gitenber...@googlegroups.com
I've added this to part 4.

On Mar 24, 2015, at 2:03 PM, Tom Morris <tfmo...@gmail.com> wrote:

- PG has multiple editions of many works, and often the later ones are of higher quality than the older "editionless" editions, yet the earlier ones get downloaded far more.  Enhancing the bibliographic information to help with this would be useful to readers.  For example, this early editionless Pride & Prejudice (http://www.gutenberg.org/ebooks/1342#bibrec) is downloaded over 30 times more often than this later, high-quality transcription of the 1932 R.W. Chapman edition (http://www.gutenberg.org/ebooks/42671#bibrec) donated by Distributed Proofreaders. #76 & #32325 are another example.  It would be good to be able to link the various editions together.

- Much better provenance, including links to DP projects, scanned source files, Internet Archive mirrors, etc., would be useful metadata to add


With respect to Canadiana Online, perhaps you could add it to activerepos.csv so we can try to get the metadata linkages done correctly. Maybe not a good example for the asciidoc process, though.

Tom Morris

Mar 25, 2015, 1:07:33 AM3/25/15
to gitenber...@googlegroups.com
On Tue, Mar 24, 2015 at 2:03 PM, Tom Morris <tfmo...@gmail.com> wrote:

- I'm pretty sure that the last time I investigated, there were a number of duplicate entries for authors, despite the nominally canonical IDs

I did a quick recheck of this, and my memory wasn't faulty.  Here is a sample of some of the corrections after using the *awesome* :-) clustering facility of OpenRefine; the (nominally) corrected version is in the first column.

American Sunday-School Union          | Union, American Sunday-School
Bakunin, Mikhail Aleksandrovic        | Bakunin, Mikhail Aleksandrovich
Barine, Arvède                        | Barine, Arvede
Ditchfield, P. H. (Peter Hampson)     | Ditchfield, P. H. Peter Hampson)
Gerhard, J. W.                        | Gerhard, J.W.
Haapanen-Tallgren, Tyyni              | Haapanen-Tallgren, Tyyne
Knatchbull-Hugesson, Edward Hugessen  | Knatchbull-Hugessen, Edward Hugessen
La Monte, Robert Rives                | Monte, Robert Rives la
Levett-Yeats, S. (Sidney)             | Levett Yeats, S. (Sidney)
Library of Congress. Copyright Office | Copyright Office. Library of Congress.

On the plus side, there are only a couple dozen of these records in the 20k+ authors, so it's a pretty small problem, but it does indicate that the PG author records can't be relied upon to be unique.
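For anyone who wants to reproduce this outside OpenRefine, my paraphrase of its default "fingerprint" keying method is only a few lines: strip accents, lowercase, drop punctuation, then sort the unique tokens. Names that collide on the key are candidate duplicates.

# Sketch: OpenRefine-style fingerprint clustering of author names.
import re
import unicodedata
from collections import defaultdict

def fingerprint(name):
    s = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode()
    tokens = re.sub(r"[^\w\s]", " ", s.lower()).split()
    return " ".join(sorted(set(tokens)))

clusters = defaultdict(list)
for name in ["Barine, Arvède", "Barine, Arvede", "Gerhard, J. W.", "Gerhard, J.W."]:
    clusters[fingerprint(name)].append(name)

for names in clusters.values():
    if len(names) > 1:
        print(names)  # candidate duplicate cluster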

Tom
 

Eric Hellman

Mar 25, 2015, 9:30:54 AM3/25/15
to gitenber...@googlegroups.com
I've copied this into the comments for part 2 of my metadata gists. https://gist.github.com/eshellman/7a6d34c88e797b439938

Tom, do you think it's feasible or wise to delegate all of the GITenberg "agent" metadata to VIAF and/or ORCID and/or Wikipedia?
The other alternatives I can think of are
1. keep the agent metadata associated with book repos, and link to PG agents
2. create a separate repo for PG agent metadata
3. defer the issue till later

Eric


Tom Morris

Mar 25, 2015, 12:54:56 PM3/25/15
to gitenber...@googlegroups.com
On Wed, Mar 25, 2015 at 9:30 AM, Eric Hellman <er...@hellman.net> wrote:
I've copied this into the comments for part 2 of my metadata gists. https://gist.github.com/eshellman/7a6d34c88e797b439938

Tom, do you think it's feasible or wise to delegate all of the GITenberg "agent" metadata to VIAF and/or ORCID and/or Wikipedia?

I thought a lot about this last night, and I'm coming close to suggesting that OpenLibrary is the right place not only for agent metadata but for all bibliographic metadata.  It has its own set of issues, of course, but it's a purpose-built bibliographic data wiki, it's already linked with IA, and it avoids the pain of trying to edit bibliographic metadata with text-editing tools.
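As a taste of what that buys us: OpenLibrary serves any record as JSON if you append .json to its URL, so scripted round trips are easy. The OLID below is a placeholder, not a real Space Viking edition.

# Sketch: fetch an OpenLibrary edition record as JSON.
import requests

def ol_edition(olid):
    r = requests.get("https://openlibrary.org/books/%s.json" % olid)
    r.raise_for_status()
    return r.json()

ed = ol_edition("OL12345M")  # placeholder OLID
print(ed.get("title"), ed.get("publish_date"), ed.get("isbn_10"))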
 
The other alternatives I can think of are
1. keep the agent metadata associated with book repos, and link to PG agents
2. create a separate repo for PG agent metadata
3. defer the issue till later

I think you definitely want to separate out agent data regardless of what direction you take for managing it.  The fact that it's duplicated in the individual pub RDF files is a serialization artifact, not a reflection of the underlying data model at PG (or at least that's my strong suspicion).

Tom

Tom Morris

Mar 25, 2015, 12:58:34 PM3/25/15
to gitenber...@googlegroups.com
On Mon, Mar 23, 2015 at 8:59 AM, Eric Hellman <er...@hellman.net> wrote:
As part of the process of developing metadata for GITenberg, I've been working on understanding the metadata dumps provided by Project Gutenberg. Here's my first writeup, presented as a Gist. Comments and correction are welcome, as always.

https://gist.github.com/eshellman/40d85be01acf1172a5c1

Are the various gists going to be distilled and pulled together someplace eventually?  The piecewise, stream-of-consciousness thing is fine for the construction process, but having a single document, with the ability to comment inline or nearby, would make things a lot easier.

A few more random notes & comments from my explorations:

- The example used (Space Viking) came from Distributed Proofreaders.  The project page is here:
- The OpenLibrary edition is a book edition, while the PG version was serialized in a magazine, but this important piece of metadata is buried in the text.  Similarly, the recent Pride and Prejudice published at PG was extracted from one volume of "The Works of Jane Austen," but standard bibliographic conventions don't have a good way to record this.
- Linking from the gists to various reference sources would make it easier to follow the exposition
- Wikidata is probably a better source of info than Wikipedia infoboxes, although both are usually about works, not editions, despite including such edition info as ISBNs
- Not specifically metadata related, but the Wikipedia article shows a problem that GITenberg will have in getting its editions to show up: it links back to PG as the most authoritative source, so readers won't see the improved version.
- The first few of the supposed LCSH terms that I sampled didn't resolve, so I'd be suspicious of them.  I'm not sure whether they're out of date or just wrong.  One term, Magna Carta, currently appears in the LC name authority file, not LCSH.  Another resolved with its last precoordinated term dropped, but not with all three terms as presented.
- The marc901 fields look like links to cover images (a quick check is sketched after these notes)
- Some of the things included in Freebase and specialty databases like ISFDB would make things richer and better connected (links to series, prequels, sequels, adaptations, etc.)
- Freebase has a list of 11 editions including LCCNs and links to OpenLibrary, ISFDB, university OPACs, Google Books, etc., which could be exploited to enrich things
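For the marc901 point above, a quick check against one parsed file; I'm assuming the predicate lives in the pgterms namespace as the other marc fields appear to.

# Sketch: list pgterms:marc901 values (apparently cover-image links).
from rdflib import Graph, Namespace

PGTERMS = Namespace("http://www.gutenberg.org/2009/pgterms/")

g = Graph()
g.parse("pg1342.rdf", format="xml")  # any per-book file from the dump
for ebook, cover in g.subject_objects(PGTERMS.marc901):
    print(ebook, "->", cover)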

Some averages derived from the predicate analysis by Raymond (a sketch for recomputing them follows the list):
- 48,538 publications
- 1.007 authors on average (seems low, which makes me suspicious that multiple authors aren't well recorded)
- 2.44 subjects
- 13.5 formats (this is most likely inflated by pubs with large numbers of audio files and/or multi-part HTML) 
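The averages above come from Raymond's predicate counts, but they're easy to recompute directly; a sketch under the same dump-layout assumption as before:

# Sketch: recompute per-publication averages from the per-book files.
import glob
from rdflib import Graph, Namespace

DCTERMS = Namespace("http://purl.org/dc/terms/")

pubs = authors = subjects = formats = 0
for path in glob.glob("rdf-files/*/pg*.rdf"):  # assumed layout
    g = Graph()
    g.parse(path, format="xml")
    pubs += 1
    authors += sum(1 for _ in g.subject_objects(DCTERMS.creator))
    subjects += sum(1 for _ in g.subject_objects(DCTERMS.subject))
    formats += sum(1 for _ in g.subject_objects(DCTERMS.hasFormat))

print(authors / pubs, subjects / pubs, formats / pubs)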

I'm running a more comprehensive analysis of languages, formats, etc. now.

Tom

Eric Hellman

Mar 26, 2015, 11:38:12 AM3/26/15
to gitenber...@googlegroups.com
Comments inline

On Mar 25, 2015, at 12:58 PM, Tom Morris <tfmo...@gmail.com> wrote:

On Mon, Mar 23, 2015 at 8:59 AM, Eric Hellman <er...@hellman.net> wrote:
As part of the process of developing metadata for GITenberg, I've been working on understanding the metadata dumps provided by Project Gutenberg. Here's my first writeup, presented as a Gist. Comments and corrections are welcome, as always.

https://gist.github.com/eshellman/40d85be01acf1172a5c1

Are the various gists going to be distilled and pulled together someplace eventually?  The piecewise, stream-of-consciousness thing is fine for the construction process, but having a single document, with the ability to comment inline or nearby, would make things a lot easier.

I was thinking (midstream) that it should become its own repo, and then maybe an ebook. I'm busy working on some non-GITenberg stuff till next week, but next section, I'll do that.

- Wikidata is probably a better source of info than Wikipedia infoboxes, although both are usually about works, not editions, despite including such edition info as ISBNs

Maybe you can help here. Wikidata says it gets the ISBN and OCLC number from English Wikipedia, but those things aren't there. Wikipedia has series information, but that isn't in Wikidata. So what's the relationship?

- Not specifically metadata related, but the Wikipedia article shows a problem that GITenberg will have in getting its editions to show up: it links back to PG as the most authoritative source, so readers won't see the improved version.

In principle, we can fix that.

- The first few of the supposed LCSH terms that I sampled didn't resolve, so I'd be suspicious of them.  I'm not sure whether they're out of date or just wrong.  One term, Magna Carta, currently appears in the LC name authority file, not LCSH.  Another resolved with its last precoordinated term dropped, but not with all three terms as presented.

Could you explain how and what you resolved? Magna Carta???

- The marc901 fields look like links to cover images

Does anyone know if this is customary in library cataloging?

- Some of the things included in Freebase and specialty databases like ISFDB would make things richer and better connected (links to series, prequels, sequels, adaptations, etc.)

YES!

- Freebase has a list of 11 editions including LCCNs and links to OpenLibrary, ISFDB, university OPACs, Google Books, etc., which could be exploited to enrich things

LibraryThing has 28 editions.
Unglue.it has 20 editions (we use part of LT).
xISBN has 15 English editions and 1 French edition.



Tom Morris

Mar 26, 2015, 5:18:26 PM3/26/15
to gitenber...@googlegroups.com
On Thu, Mar 26, 2015 at 11:37 AM, Eric Hellman <er...@hellman.net> wrote:
Comments inline

On Mar 25, 2015, at 12:58 PM, Tom Morris <tfmo...@gmail.com> wrote:

Are the various gists going to be distilled and pulled together someplace eventually?  The piecewise, stream-of-consciousness thing is fine for the construction process, but having a single document, with the ability to comment inline or nearby, would make things a lot easier.
I was thinking (midstream) that it should become its own repo, and then maybe an ebook. I'm busy working on some non-GITenberg stuff till next week, but next section, I'll do that.

I think even just a wiki page that gets added to and revised (or perhaps a couple of wiki pages) would be fine for this.
 
- Wikidata is probably a better source of info than Wikipedia infoboxes, although both are usually about works, not editions, despite including such edition info as ISBNs

Maybe you can help here. Wikidata says it gets the ISBN and OCLC number from English Wikipedia, but those things aren't there. Wikipedia has series information, but that isn't in Wikidata. So what's the relationship?

Wikipedia does have those two items if people fill them in.  See http://en.wikipedia.org/wiki/Template:Infobox_book. Your Space ... example has an empty placeholder for ISBN, and an OCLC field is available in the template if you wanted to make use of it.

Infoboxes are a subset of Templates, not all of which get rendered as infoboxes (or even rendered visibly at all).  For example, Persondata is not rendered, but the information is available for use internally.  Similarly, the AuthorityControl template gets rendered as a tiny line at the bottom of the article, but it's got all the good stuff (ISNI, VIAF, LCCN, ORCID, GND, etc.) in it.
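If you'd rather pull that from the structured side, here's a sketch against the Wikidata API. The property IDs are from memory (P212 ISBN-13, P957 ISBN-10, P243 OCLC control number), as is the QID, so verify both.

# Sketch: read ISBN/OCLC claims for a Wikidata item via wbgetentities.
import requests

def wikidata_claims(qid, props=("P212", "P957", "P243")):
    r = requests.get(
        "https://www.wikidata.org/w/api.php",
        params={"action": "wbgetentities", "ids": qid,
                "props": "claims", "format": "json"},
    )
    r.raise_for_status()
    claims = r.json()["entities"][qid]["claims"]
    return {p: [c["mainsnak"]["datavalue"]["value"]
                for c in claims.get(p, []) if "datavalue" in c["mainsnak"]]
            for p in props}

print(wikidata_claims("Q170583"))  # Pride and Prejudice's QID, if I recall right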
 
- The first few of the supposed LCSH terms that I sampled didn't resolve, so I'd be suspicious of them.  I'm not sure whether they're out of date or just wrong.  One term, Magna Carta, currently appears in the LC name authority file, not LCSH.  Another resolved with its last precoordinated term dropped, but not with all three terms as presented.
Could you explain how and what you resolved? Magna Carta???

I was simply using search, e.g. http://id.loc.gov/search/?q=Magna+Carta&q= and the only LCSH entry there is for a place in Australia.  I suspect the PG term was referencing the LCNA entry http://id.loc.gov/authorities/names/n50064742.html (not an LCSH entry).  The equivalent PG subject is http://www.gutenberg.org/ebooks/subject/2289?sort_order=title

Note, I'm not arguing that the LC has the faintest clue about what they're doing in how they've divided things up between LCNA and LCSH; I'm just pointing out that the "Subject" terms in PG that reference LCSH as their classification scheme don't appear in present-day LCSH.

The other "subject" was "Constitutional history -- England -- Sources" http://www.gutenberg.org/ebooks/subject/7078  If you drop the last term and remove the spaces, you can find http://id.loc.gov/search/?q=Constitutional+history--England&q= this http://id.loc.gov/authorities/subjects/sh2009121486.html but you can't find the subject in its original form.  Of course, I find this classification scheme totally opaque, so some trained MLIS type could say "Oh, everyone KNOWS that -- Sources means frazzle pop" in which case you'd just need to apply all those "well known" transformations before doing the lookups.

 
- The marc901 fields look like links to cover images
Does anyone know if this is customary in library cataloging?

Since 901-907 are local data elements, I imagine people use them in a vast array of ways.  This is just PG's usage.
  
Freebase has a list of 11 editions including LCCNs and links to OpenLibrary, ISFDB, university OPACs, Google Books, etc., which could be exploited to enrich things
LibraryThing has 28 editions

LibraryThing might be good for reviews, but I'm not sure I'd use it for metadata.  I see a bunch of duplication if you're talking about this list.

To my mind, the miscapitalized VIking doesn't make these two separate editions:

 Space Viking (VIntage Ace SF, F-225) / Piper, H. Beam (ISBN 0441602258) (8 copies; 1.1.1 separate)
 Space Viking (Vintage Ace SF, F-225) / Piper, H. Beam (ISBN 0441602258) (5 copies; 1.1.1 separate)

Ditto for the various others on the list with no ISBN or with other minor variations.
 
I wasn't singling out Freebase so much for the completeness of its editions as for the onward links to Google Books, OpenLibrary, OPACs at Cornell, Stanford, etc.

For work and edition finding, OpenLibrary, OCLC, and Freebase all have solutions which could be explored and compared.  For SF works, ISFDB includes not only novel editions but also serializations, audio versions, and PG etexts, as well as links to reviews.

Tom