One idea I was kicking around for discovery was to make use of existing web indexes (i.e., Google), since they are already crawling and indexing the web. A couple of issues:
- how can we positively identify an SVN/CVS repository from indexed metadata?
= what kinds of false positives can we expect, and how can we mitigate them?
- what constraints do we have to consider?
= web crawlers abide by robots.txt files, so entire domains (or parts of them) aren't searched
For example, we may be able to cheaply retrieve a fairly good base set of repositories by tuning a Google search with their search operators [1] and having the results returned as JSON or XML (Atom), which is ideal for follow-on analysis [2].
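As a rough sketch of what such tuned queries might look like (the specific operator strings and repository "fingerprints" below are illustrative assumptions based on common repository front-ends, not verified signatures):

```python
# Hypothetical sketch: build Google queries that fingerprint common
# web front-ends for SVN/CVS repositories. The fingerprint strings
# are assumptions about typical page titles/paths, not verified.
from urllib.parse import urlencode

# Assumed fingerprints for repository browsers and raw listings.
FINGERPRINTS = [
    'intitle:"Revision" "Powered by Subversion"',  # raw mod_dav_svn listing?
    'inurl:viewvc',                                # ViewVC front-end
    'inurl:cvsweb',                                # CVSweb front-end
]

def build_queries(extra_terms=""):
    """Return (query, search URL) pairs, one per fingerprint."""
    results = []
    for fp in FINGERPRINTS:
        q = f"{fp} {extra_terms}".strip()
        url = "https://www.google.com/search?" + urlencode({"q": q})
        results.append((q, url))
    return results

for q, url in build_queries("videolan"):
    print(q)
```

Each query could then be issued through whatever search API is available and the result URLs fed into the follow-on analysis.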
For example, a query can be made that addresses the robots.txt concern directly [3]. It exploits the fact that while web crawlers do abide by robots.txt and don't crawl the directories/pages listed in it, they do index the robots.txt file itself (so they can cache it rather than re-fetch it each time). Since it's indexed by Google, we can use it as a source of information about a site. This helped (manually) identify a repository for videolan.org [4], for example, which uses non-standard names, but the "developer" path was a tip-off.
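The robots.txt trick could be automated as a small scan of Disallow paths for repository-looking names. A minimal sketch (the hint keywords are assumptions based on common site layouts, not a definitive list):

```python
# Sketch: scan a robots.txt body (indexed by Google, or fetched
# directly) for Disallow paths that hint at a source repository.
# REPO_HINTS is an assumed keyword list, not exhaustive.
REPO_HINTS = ("svn", "cvs", "repos", "viewvc", "cvsweb", "developer")

def repo_hints_from_robots(robots_txt):
    """Return Disallow paths that contain repository-like names."""
    hits = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()       # drop comments
        if not line.lower().startswith("disallow:"):
            continue
        path = line.split(":", 1)[1].strip()
        if any(hint in path.lower() for hint in REPO_HINTS):
            hits.append(path)
    return hits

sample = """\
User-agent: *
Disallow: /tmp/
Disallow: /developer/
Disallow: /stats/
"""
print(repo_hints_from_robots(sample))   # -> ['/developer/']
```

On a robots.txt like videolan.org's, a non-obvious path such as "/developer/" would surface as a candidate even though the site uses non-standard repository names.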