Hi Folks,
For a few years now I've been interested in improving Project Gutenberg texts and making them more widely available. I read the article and HN thread that started this community with great interest, but had little free time. Last week I got frustrated with PG for (a) not having an epub edition of a particular book and (b) blocking my access to the website because I was using a VPN service. My response might be overkill, but this is what I have done thus far, and what my next steps are:
* rsynced a complete mirror of the archive to a new server
* built a Python-readable index of the file hierarchy
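For anyone curious what a Python-readable index of a mirror might look like, here is a minimal sketch. It is not my actual script: the function name build_index and the dict layout (relative directory mapped to a sorted list of file names) are assumptions made for illustration.

```python
import os


def build_index(mirror_root):
    """Map each directory under mirror_root (as a relative, '/'-separated
    path) to the sorted list of file names it contains.

    Hypothetical helper: the name and the index layout are illustrative.
    """
    index = {}
    for dirpath, dirnames, filenames in os.walk(mirror_root):
        dirnames.sort()  # make the traversal order stable between runs
        if filenames:
            rel = os.path.relpath(dirpath, mirror_root).replace(os.sep, "/")
            index[rel] = sorted(filenames)
    return index
```

Directories that hold no files are skipped, so the index only lists places where there is something to read.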
My next steps are to:
* add a metadata file to each repo using data from my index
* push all ~40k repos to GitHub
* create a separate repo with a machine-readable index of the books on GitHub
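The machine-readable index could be as simple as one JSON document with an entry per book. The sketch below is only one possible shape: the field names (id, title, repo, files) and the "book-<id>" repo naming are hypothetical, and JSON is used here just because it ships with Python.

```python
import json


def make_index_entry(book_id, title, files):
    """Build one index record. All field names are hypothetical."""
    return {
        "id": book_id,
        "title": title,
        "repo": f"book-{book_id}",  # assumed naming scheme, not a decision
        "files": sorted(files),
    }


def dump_index(entries):
    """Serialize the records as JSON, ordered by book id so that
    regenerating the index produces small, readable diffs."""
    ordered = sorted(entries, key=lambda entry: entry["id"])
    return json.dumps(ordered, indent=2, sort_keys=True)
```

A consumer would then json.loads the file and loop over the entries to find each book's repo and master files.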
My goal is to provide a repository that I, as a developer, could use to interact with PG programmatically without re-downloading a mirror each time or going through PG's not-really-an-api. I also wanted an index that I could use to loop over each book and try to render my own web or epub versions of each text. Lastly, I wanted other people to be able to fork the PG books, make changes as desired, and offer their edits back to the primary repo of that book using GitHub's pull requests.
I've read over this mailing list's archives, and I realize that this group has different goals and has been approaching the problem with a few key examples. I felt the easiest way for me to go forward was to apply what small improvements I can, as broadly as possible.
The steps listed above are where my current project scope ends for now. There are some changes that could be made to these repos, and additional tools that could be created, that seem like easy next steps. For example:
* using a continuous integration server to auto-build epub/pdf/etc files using the PG masters [txt, tei, rst, tex], and getting a count of the success/fail rate
* removing extraneous files from the repos, such as emacs temp files (files that end in ~, etc.)
* splitting the PG preamble out of the master file and into a dedicated file
* creating a readme for each book with a book description, project description, link to a toolchain to compile master files into epub/pdf/etc, link to the epub/pdf versions on another server
* creating a toolchain to read/write metadata files in each repo, probably using YAML (as was discussed on this list earlier)
* converting all file encodings to UTF-8 (also a suggestion from this list)
* finding older revisions of PG books and adding them as historical commits to the repo, which would enable an analysis of diffs over time
* lastly, writing good documentation on how to submit errors to the GitHub issue tracker for a given book repo, and/or how to fork, edit and pull request changes (advanced)
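Two of the cleanup items above (editor temp files and UTF-8 conversion) are small enough to sketch. This is a rough illustration, not a finished tool: the function names are made up, and the default source encoding of latin-1 is an assumption that would need to be checked per file before running anything like this over the whole archive.

```python
from pathlib import Path


def find_backup_files(repo_root):
    """Return editor backup files (names ending in '~') under repo_root."""
    return sorted(p for p in Path(repo_root).rglob("*~") if p.is_file())


def reencode_to_utf8(path, source_encoding="latin-1"):
    """Rewrite a text file in place as UTF-8.

    Assumes the file is currently encoded as source_encoding; a wrong
    guess will silently produce mojibake, so verify the encoding first.
    Works on bytes so that line endings are left exactly as they were.
    """
    path = Path(path)
    text = path.read_bytes().decode(source_encoding)
    path.write_bytes(text.encode("utf-8"))
```

Listing the backup files first, rather than deleting them in the same pass, leaves room to review the list before anything is removed from a repo.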
I am hesitant to set a date for when I will have the books on GitHub, but keeping the repository on my server costs me money every day, so I would like to finish a draft of this project in the next couple of weeks.
I would love to hear your feedback on this project, and suggestions for future tasks and goals.
All the best,
Seth Woodworth