Forked projects in GitHub Dataset?

Eduardo Campos

Sep 27, 2018, 5:08:58 PM
to Boa Language and Infrastructure User Forum
I have a question about how the GitHub dataset was built...

Does the "2015 September/GitHub" dataset contain forked projects
written in Java? Or were forked projects excluded
during the construction of the dataset?

How was this dataset extracted? If it
contains forked projects, how do I filter for only Java projects that are not forks?

Many thanks and best regards,
Eduardo Cunha Campos
Software Engineering Ph.D. Student at Federal University of Uberlândia, Brazil

Robert Dyer

Sep 27, 2018, 5:15:32 PM
to boa-...@googlegroups.com
Hi Eduardo,

The projects in the dataset should all be non-forks.  We filtered all forks, Java or not.

So the only thing you need to filter for is Java projects.  Depending on your definition, you can go with the project metadata indicating a programming_language of “Java” or you can look for files that end in “.java”.

Since we only parse Java source, you could also easily just keep projects with at least one ChangedFile of kind SOURCE_*.
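
As a rough illustration of the first two options, a Boa query along these lines might work. It is only a sketch: the output variable names (by_metadata, by_extension) and the choice to emit project_url are assumptions made for the example, and the two definitions can return different sets of projects.

```
# Sketch only: two possible definitions of "Java project".
# Output variable names are illustrative, not part of the dataset.
p: Project = input;
by_metadata: output collection of string;
by_extension: output collection of string;

# Option 1: the project metadata lists Java as a programming language.
exists (i: int; match(`^java$`, lowercase(p.programming_languages[i])))
    by_metadata << p.project_url;

# Option 2: at least one changed file has a name ending in ".java".
has_java_file := false;
visit(p, visitor {
    before node: ChangedFile -> {
        if (match(`\.java$`, node.name))
            has_java_file = true;
    }
});
if (has_java_file)
    by_extension << p.project_url;
```

The third option (keeping projects with a ChangedFile of kind SOURCE_*) would follow the same visitor pattern, testing the file's kind instead of its name.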

Hope that helps!

- Robert

--
More information about Boa: http://boa.cs.iastate.edu/