That sounds interesting and I'd be willing to help out. I had a quick look and came up with a few questions/comments:
- Are all 50,000 volumes from booklist.tsv included in the corpus?
- Is ALTO XML limited to a single page per file?
- Regardless of the answer to the above, having at least the text format available as a single file per book would, I suspect, be much easier for most users to work with.
- pandas seems like a heavyweight dependency considering how little it's used (see the sketch after this list).
- Links back to the original source of the texts and their online home at the BL (if they have one) would seem appropriate.
- The README mentions known issues in the issue tracker, but I don't see any there.
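
On the pandas point, here's a minimal sketch of what I mean using only the standard library. It assumes booklist.tsv is tab-separated with a header row; I haven't checked the actual column names, so the example just inspects whatever is there:

```python
import csv

# Read booklist.tsv with the stdlib csv module instead of pandas.
# Assumes a header row; column names are whatever the file declares.
with open("booklist.tsv", newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f, delimiter="\t"))

print(f"{len(rows)} volumes listed")
print(rows[0])  # first record, keyed by the header's column names
```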
I like that the ALTO files preserve the original word and character confidence scores from the ABBYY OCR, since those could potentially be exploited for automated quality checks (sketch below). That said, I think it's worth thinking through the various source, intermediary, and target file formats before going too far, given the quantity of data involved. Iterating on formats, processing pipeline, etc. with a small number of volumes/repos and then scaling up is likely to be much less unwieldy than trying to iterate with thousands of volumes in play.
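As a rough illustration of the kind of automated quality check I have in mind (a sketch only; it assumes the usual ALTO layout where each recognized word is a `<String>` element carrying a `WC` word-confidence attribute, and the namespace handling may need adjusting to match these particular files):

```python
import statistics
import xml.etree.ElementTree as ET

def mean_word_confidence(alto_path):
    """Average the WC (word confidence) values in one ALTO page.

    Sketch only: assumes each word is a <String> element with a WC
    attribute in the 0.0-1.0 range, as ABBYY exports typically have.
    """
    tree = ET.parse(alto_path)
    scores = [
        float(el.get("WC"))
        for el in tree.iter()
        if el.tag.rsplit("}", 1)[-1] == "String" and el.get("WC") is not None
    ]
    return statistics.mean(scores) if scores else None
```

Flagging pages (or whole volumes) whose mean confidence falls below some threshold would be a cheap first pass before anyone looks at the text by hand.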
One of the things that I'm interested in is trying to bring some order and rationality to the multiple copies of metadata and texts floating around, hopefully automatically.
For example, how do we choose the best of (or create a better union copy from):
and recognize that these metadata records all describe the same volume (and are perhaps derived from each other in some non-independent way):
If nothing else, knowledge about availability from other sources might be used to prioritize the texts which are not available elsewhere.
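
To make the "same volume" idea concrete, this is the sort of thing I'd try first. It's only a sketch: the field names "title", "author", and "date" are invented, and real matching would need to handle multi-volume works, date ranges, spelling variants, and so on:

```python
import re
from collections import defaultdict

def match_key(record):
    """Crude normalization key for spotting copies of the same volume.

    Sketch only: "title", "author", and "date" are hypothetical field
    names standing in for whatever the real metadata records use.
    """
    def norm(value):
        return " ".join(re.sub(r"[^a-z0-9 ]+", " ", (value or "").lower()).split())

    return (norm(record.get("title")),
            norm(record.get("author")),
            norm(record.get("date")))

def group_candidates(records):
    """Bucket records by normalized key; any bucket with more than one
    record is a candidate set of duplicates to reconcile or merge."""
    buckets = defaultdict(list)
    for rec in records:
        buckets[match_key(rec)].append(rec)
    return {key: recs for key, recs in buckets.items() if len(recs) > 1}
```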
Tom