updated and added gitenberg crawling resources

Eric Hellman

Feb 28, 2016, 11:38:12 PM
to GITenberg Project
There's a PR https://github.com/gitenberg-dev/gitberg/pull/94 that adds some resources that will be useful for people wanting to crawl the repos. It is the result of some work closing loose ends: adding missing tests and bringing the sync up to date.

We've now synced up to Gutenberg #51117.
There are a total of 50,945 repos, listed in assets/GITenberg_repo_list.tsv in the pull request.
There are 188 Gutenberg ids without GITenberg repos. These are listed in assets/missing.tsv in the gitberg dev repo. 44 of them are not in the file mirrors for unknown reasons; the others are audio, image, or datafile repos, or they are withdrawn from Gutenberg.
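
For anyone who wants to crawl from the repo list, here is a rough sketch of reading a tab-separated list and turning it into clone URLs. The column layout (Gutenberg id first, repo name second), the GITenberg GitHub organization in the URL, and the function names are all assumptions for illustration, so check them against the actual file before relying on this. The sample rows are made-up placeholders, not real entries.

```python
import csv
import io


def read_tsv_rows(text):
    """Parse tab-separated text into a list of rows, skipping blank lines."""
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    return [row for row in reader if any(cell.strip() for cell in row)]


def clone_urls(rows, repo_column=1, org="GITenberg"):
    """Build clone URLs from the repo-name column.

    repo_column and org are assumptions; adjust them to match the real file.
    """
    return [f"https://github.com/{org}/{row[repo_column]}.git" for row in rows]


# Placeholder data standing in for the contents of the repo list file.
sample = "1\tExample-Book_1\n\n2\tExample-Book_2\n"
rows = read_tsv_rows(sample)
urls = clone_urls(rows)
for url in urls:
    print(url)
```

To use it on the real list, read the file contents with open(path, encoding="utf-8").read() and pass the text to read_tsv_rows.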

Many of the missing repos occurred due to evolution of the repo naming procedure. This procedure is now robust.

I'll merge the PR in a week if there are no comments.

