That sounds interesting and I'd be willing to help out. I had a quick look and came up with a few questions/comments:
- Are all 50,000 volumes from booklist.tsv included in the corpus?
- Is ALTO XML limited to a single page per file?
- Regardless of the answer to the above, having at least the text format available as a single file per book would, I suspect, be much easier for most users to work with.
- pandas seems like a heavyweight dependency considering how little it's used (see the sketch after this list).
- Links back to the original source of the texts and their online home at the BL (if they have one) would seem appropriate.
- The README mentions known issues in the issue tracker, but I don't see any there.
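
On the pandas point, here's a minimal sketch of what I mean using only the standard library. It assumes booklist.tsv is tab-separated with a header row; I haven't checked the actual column names, so the example just inspects whatever is there:

```python
import csv

# Read booklist.tsv with the stdlib csv module instead of pandas.
# Assumes a header row; column names are whatever the file declares.
with open("booklist.tsv", newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f, delimiter="\t"))

print(f"{len(rows)} volumes listed")
print(rows[0])  # first record, keyed by the header's column names
```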
I like that the ALTO files preserve the original word and character confidence scores from the ABBYY OCR, since those could potentially be exploited for automated quality checks (sketch below). That said, I think it's worth thinking through the various source, intermediary, and target file formats before going too far, given the quantity of data involved. Iterating on formats, processing pipeline, etc. with a small number of volumes/repos and then scaling up is likely to be much less unwieldy than trying to iterate with thousands of volumes in play.
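As a rough illustration of the kind of automated quality check I have in mind (a sketch only; it assumes the usual ALTO layout where each recognized word is a `<String>` element carrying a `WC` word-confidence attribute, and the namespace handling may need adjusting to match these particular files):

```python
import statistics
import xml.etree.ElementTree as ET

def mean_word_confidence(alto_path):
    """Average the WC (word confidence) values in one ALTO page.

    Sketch only: assumes each word is a <String> element with a WC
    attribute in the 0.0-1.0 range, as ABBYY exports typically have.
    """
    tree = ET.parse(alto_path)
    scores = [
        float(el.get("WC"))
        for el in tree.iter()
        if el.tag.rsplit("}", 1)[-1] == "String" and el.get("WC") is not None
    ]
    return statistics.mean(scores) if scores else None
```

Flagging pages (or whole volumes) whose mean confidence falls below some threshold would be a cheap first pass before anyone looks at the text by hand.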
One of the things that I'm interested in is trying to bring some order and rationality to the multiple copies of metadata and texts floating around, hopefully automatically.
For example, how do we choose the best of (or create a better union copy from):
and recognize that these metadata records all describe the same volume (and are perhaps derived from each other in some non-independent way):
If nothing else, knowledge about availability from other sources might be used to prioritize the texts which are not available elsewhere.
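
To make the "same volume" idea concrete, this is the sort of thing I'd try first. It's only a sketch: the field names "title", "author", and "date" are invented, and real matching would need to handle multi-volume works, date ranges, spelling variants, and so on:

```python
import re
from collections import defaultdict

def match_key(record):
    """Crude normalization key for spotting copies of the same volume.

    Sketch only: "title", "author", and "date" are hypothetical field
    names standing in for whatever the real metadata records use.
    """
    def norm(value):
        return " ".join(re.sub(r"[^a-z0-9 ]+", " ", (value or "").lower()).split())

    return (norm(record.get("title")),
            norm(record.get("author")),
            norm(record.get("date")))

def group_candidates(records):
    """Bucket records by normalized key; any bucket with more than one
    record is a candidate set of duplicates to reconcile or merge."""
    buckets = defaultdict(list)
    for rec in records:
        buckets[match_key(rec)].append(rec)
    return {key: recs for key, recs in buckets.items() if len(recs) > 1}
```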
Tom