discovery

Steven Siebert

Oct 22, 2014, 10:48:05 PM
to mammoth-h...@googlegroups.com
I think one of the parallel issues the community can investigate early on is discovery.

Initial questions we should want to answer:

How many CVS/SVN repos for OSS software are out there?
Which of these meet the "criteria" for conversion? (What are the criteria...and at what point can we say "Done!")


WRT infrastructure for this task:
What approaches do we want to take in harvesting this information (focus on automation and divide/conquer)?
What kind of meta information is necessary about these repos that would enable follow-on analysis?
How do we want to store these discovered repos (focus on transparency and community involvement)?
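
To make the metadata and storage questions concrete, here is a minimal sketch of what one discovered-repo record could look like, serialized as one JSON object per line so the list can live in a plain text file that anyone can diff and review. The field names (`url`, `vcs`, `source`, `project`, `notes`) are assumptions for illustration only, not an agreed schema.

```python
import json
from dataclasses import dataclass, asdict, field
from typing import List


@dataclass
class RepoRecord:
    """One discovered repository. All field names are hypothetical."""
    url: str                  # where the repository was found
    vcs: str                  # "cvs" or "svn"
    source: str               # how it was discovered, e.g. "web-search"
    project: str = ""         # project name, if known
    notes: List[str] = field(default_factory=list)  # free-form remarks


def to_json_line(record: RepoRecord) -> str:
    """Serialize a record as a single JSON line with stable key order."""
    return json.dumps(asdict(record), sort_keys=True)


def from_json_line(line: str) -> RepoRecord:
    """Parse a JSON line back into a record."""
    return RepoRecord(**json.loads(line))
```

Sorted keys keep the output stable between runs, which makes a shared file of records easier to merge when several people harvest in parallel.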


Steven Siebert

Oct 22, 2014, 11:37:57 PM
to mammoth-h...@googlegroups.com
One idea I was kicking around for discovery was to make use of existing web indexes (e.g. Google), since they are already crawling and indexing. A couple of issues:

- how can we positively identify an SVN/CVS repository from indexed metadata?
  = what kind of false positives can we expect, and how can we mitigate them?
- what constraints do we have to consider?
  = web crawlers abide by robots.txt files, so parts of a site, or even entire domains, aren't searched

For example, we may be able to cheaply retrieve a fairly good base set of repositories by tuning a Google search using their search operators [1] and having the results returned as JSON or XML (Atom), which is ideal for follow-on analysis [2].
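
As a rough sketch of what that tuning could look like, the snippet below only builds a request URL for the Custom Search JSON API from [2]; it makes no network call. The `key`, `cx`, `q`, and `start` parameters are the ones that API documents, while the function name, the placeholder credentials, and the particular operator combination are assumptions for illustration.

```python
from urllib.parse import urlencode

# Endpoint of the Custom Search JSON API referenced in [2].
SEARCH_ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def build_search_url(terms, api_key, engine_id, start=1):
    """Build a search request URL from a list of operator terms.

    `terms` is a list such as ["inurl:robots.txt", "intext:cvs"].
    `api_key` and `engine_id` are credentials the caller must supply.
    `start` is the 1-based index of the first result to return.
    """
    params = {
        "key": api_key,
        "cx": engine_id,
        "q": " ".join(terms),
        "start": start,
    }
    return SEARCH_ENDPOINT + "?" + urlencode(params)


# Placeholder credentials; real ones come from the API console.
url = build_search_url(["inurl:robots.txt", "intext:cvs"],
                       api_key="YOUR_API_KEY", engine_id="YOUR_ENGINE_ID")
```

Keeping URL construction separate from fetching makes it easy to split a long list of queries among volunteers and to log exactly which queries produced which results.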

For example, a query can be made that focuses on the robots.txt concern [3]. It exploits the fact that, while web crawlers do abide by robots.txt and don't search the directories/pages defined in it, they do actually index the robots.txt file itself so they can cache it and don't need to process it each time. Since it's indexed by Google, we can use it as a source of information about a site. This helped (manually) identify a repo for videolan.org [4], for example, which uses non-standard names, but the "developer" path was a tip-off.
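
The manual check described above could be partly automated. The following is a minimal sketch that scans the text of a robots.txt file for Allow/Disallow paths containing repository-looking keywords. The keyword list and the sample file are assumptions made up for illustration (the sample is not the actual videolan.org file), and a match is only a hint that still needs human or follow-up verification.

```python
# Hypothetical keyword list; tune it to reduce false positives.
HINT_KEYWORDS = ("cvs", "svn", "viewvc", "websvn", "developer")


def find_repo_hints(robots_text, keywords=HINT_KEYWORDS):
    """Return Allow/Disallow paths that contain any of the keywords."""
    hints = []
    for raw_line in robots_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        directive, sep, value = line.partition(":")
        if not sep:
            continue  # not a "Directive: value" line
        if directive.strip().lower() not in ("allow", "disallow"):
            continue
        path = value.strip()
        if any(keyword in path.lower() for keyword in keywords):
            hints.append(path)
    return hints


# Made-up sample input, for illustration only.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /developers/svn/   # hypothetical path
Disallow: /cvsweb/
Allow: /images/
"""

hints = find_repo_hints(SAMPLE_ROBOTS)
```

Plain substring matching is deliberately crude: it will flag unrelated paths that happen to contain a keyword, which is one concrete source of the false positives mentioned earlier.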



[1] Google search operator reference: http://www.googleguide.com/advanced_operators_reference.html
[2] Google web search API: https://developers.google.com/custom-search/json-api/v1/overview
[3] Example Google search returning repos: https://www.google.com/?gws_rd=ssl#q=inurl:robots.txt+intext:cvs&start=10
[4] Example robots.txt file showing repo information: http://www.videolan.org/robots.txt

