Projects containing files with a given file extension

Valerio Cosentino

Sep 14, 2015, 12:42:14 PM
to Boa Language and Infrastructure User Forum

Hello, 

I wrote a script (below) to collect all GitHub projects that contain a specific file extension. 
I ran it on the "2015 August/GitHub" dataset; however, when I compare the results with GitHub's search feature, I see that some projects are missing from the Boa output.
I'd like to know whether the dataset covers only a portion of GitHub or whether Boa limits the number of projects that can be retrieved.

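p: Project = input;
# one set of project URLs per project id
o: output set[string] of string;

repos := p.code_repositories;
for (i := 0; i < len(repos); i++) {
    repo := repos[i];
    for (j := 0; j < len(repo.revisions); j++) {
        revision := repo.revisions[j];
        # emit the URL for every revision that has an .atl file;
        # the set aggregator drops the repeated values
        if (hasfiletype(revision, `atl`))
            o[p.id] << p.project_url;
    }
}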

cheers
Valerio
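
For readers who want to repeat this comparison, here is a minimal Python sketch of the diffing step. It is not part of the original script. It assumes each Boa output line looks like `o[<project id>] = <project url>` and that the GitHub search hits were saved separately as a plain list of repository URLs; the helper names (`normalize`, `boa_projects`, `missing_from_boa`) and the sample URLs are made up for illustration.

```python
import re

# Assumed shape of a Boa output line: o[<project id>] = <project url>
BOA_LINE = re.compile(r"^o\[(?P<id>[^\]]*)\] = (?P<url>\S+)$")


def normalize(url):
    """Reduce a GitHub URL to a lowercase 'owner/repo' key."""
    path = url.strip().lower()
    path = re.sub(r"^https?://(www\.)?github\.com/", "", path)
    path = path.strip("/")
    if path.endswith(".git"):
        path = path[: -len(".git")]
    return path


def boa_projects(lines):
    """Collect the normalized project keys found in Boa output lines."""
    found = set()
    for line in lines:
        match = BOA_LINE.match(line.strip())
        if match:  # skip lines that do not follow the assumed format
            found.add(normalize(match.group("url")))
    return found


def missing_from_boa(boa_lines, github_urls):
    """Return projects present in the GitHub list but absent from Boa."""
    github = {normalize(u) for u in github_urls}
    return sorted(github - boa_projects(boa_lines))


if __name__ == "__main__":
    # Hypothetical sample data, not real query results.
    boa_output = [
        "o[101] = https://github.com/example/alpha",
        "o[102] = https://github.com/example/beta",
    ]
    github_hits = [
        "https://github.com/example/alpha",
        "https://github.com/Example/Gamma.git",
    ]
    print(missing_from_boa(boa_output, github_hits))
```

URLs are lowercased before comparison because GitHub treats owner and repository names case-insensitively, so differences in capitalization alone are not reported as missing projects.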

Robert E Dyer

Sep 14, 2015, 1:10:59 PM
to boa-...@googlegroups.com
Hi Valerio,

Boa’s dataset is a subset of GitHub’s actual data.  Unlike GitHub’s built-in search, which queries ‘live’ data, Boa’s data lags behind because we have to collect it through the GH API, which limits how fast we can retrieve it.

We also (initially) prioritized Java projects, as we currently only support processing Java source files and wanted as much source AST data as possible in Boa’s dataset.

Hope that answers your question!

- Robert

PS - take a look at Boa’s quantifiers (http://boa.cs.iastate.edu/docs/quantifiers.php).  They can make your queries much more compact!

foreach (i: int; def(p.code_repositories[i]))
  foreach (j: int; hasfiletype(p.code_repositories[i].revisions[j], `atl`))
    o[p.id] << p.project_url;


________________________________________________
Robert Dyer | Assistant Professor | Department of Computer Science
BGSU | rd...@bgsu.edu | 419.372.3469 | 244 Hayes | Bowling Green, OH

Want to mine ultra-large-scale software repositories with minimal initial
investment? Check out Boa! http://boa.cs.iastate.edu/

Valerio Cosentino

Sep 14, 2015, 2:30:03 PM
to boa-...@googlegroups.com
Hi Robert,

Thanks for the information about the dataset. You totally answered my question.

PS I'll have a look at the documentation! :)

Andre Hora

Sep 14, 2015, 3:44:42 PM
to boa-...@googlegroups.com
Dear Robert,

I still have some doubts about the dataset selection.
The GitHub dataset is from August 2015, so does that mean it contains all projects created up to August? 
Or did you filter out some projects (for example, forks or projects with zero stars)?

Best regards and thanks again for this great work!
--
Andre Hora

Robert E Dyer

Sep 14, 2015, 4:26:43 PM
to boa-...@googlegroups.com
Hi Andre,

The dates on a Boa dataset represent when we released it, not the date the data was collected.

For GitHub specifically, a lot of our data is much older than August 2015, so many projects will be missing or outdated.  We also had to filter out some of the largest projects, as our system was having difficulty processing them in the memory we allow it.

I believe we also focused just on non-forks, so there are probably no or very few forks in the dataset.

- Robert

Hoan Nguyen

Sep 14, 2015, 4:44:21 PM
to boa-...@googlegroups.com
Hi,

I want to add that in this August 2015/GitHub dataset, 

1. we keep both non-forked and forked projects, and

2. we have full development histories (commits) only for projects with Java code.

Hoan

Andre Hora

Sep 14, 2015, 5:48:28 PM
to boa-...@googlegroups.com
On Mon, Sep 14, 2015 at 5:44 PM, Hoan Nguyen <nguyen...@gmail.com> wrote:
Hi,

I want to add that in this August 2015/GitHub dataset, 

1. we keep both non-forked and forked projects, and

Thanks Hoan.

So, counting both non-forked and forked projects, GitHub reports more than 1.5M Java projects [1].
With the query in [2], Boa reports 554,864 Java projects.

Even though the Boa dataset is not up to date, that is a big difference from GitHub.
Note that I'm not criticising at all; I'm just wondering how the 554K Boa Java projects were selected (OK, Robert said you filtered out large projects, but I don't think that explains the ~1M difference from GitHub).

[2]
p: Project = input;
counts: output sum of int;
foreach (i: int; match(`^java$`, lowercase(p.programming_languages[i])))
    counts << 1;



--
Andre Hora

Robert E Dyer

Sep 14, 2015, 5:58:42 PM
to boa-...@googlegroups.com
Hi Andre,

Our data collection for GH started back in 2013, when we focused on retrieving the project metadata.  This year we updated a large portion of that data (without adding new projects) and updated the commit histories of mostly Java projects.

So the number of projects you see is probably pretty close to the 2013 numbers.

Again, I think the confusion comes from the naming of our dataset, which reflects when we released the data but gives no insight into when we actually collected it.

- Robert

Andre Hora

Sep 14, 2015, 6:25:31 PM
to boa-...@googlegroups.com
Hello Robert,

OK, now it is clearer to me (you started back in 2013).
That answered my question.

Thanks for your attention.