We, at UFMG, have been building a large collection of compilable
benchmarks. Today, we have one million C files, mined from open-source
repositories, that compile into LLVM bitcode (and from there to
object files). To ensure compilation, we perform type inference on the
C programs: type inference lets us reconstruct missing dependencies.
The benchmarks are available at: http://cuda.dcc.ufmg.br/angha/
We have a technical report describing the construction of this
collection: http://lac.dcc.ufmg.br/pubs/TechReports/LaC_TechReport012020.pdf
Many things can be done with so many LLVM bitcode files. A few
examples follow below:
* We can autotune compilers. We have trained YaCoS, a tool that
searches for good optimization sequences, using code size as the
objective function. We find the best optimization sequence for each
program in the database. To compile an unknown program, we take the
program in the database that is closest to it and apply the same
optimization sequence. Results are good: we can improve on clang -Oz
by almost 10% on MiBench, for instance.
* We can perform many types of explorations on real-world code. For
instance, we have found that 95.4% of all the interference graphs of
these programs, even in machine code (no phi-functions and lots of
pre-colored registers), are chordal.
* We can check how well different tools are doing on real-world code.
For instance, we can use these benchmarks to see how many programs
can be analyzed by Ultimate Buchi Automizer
(https://ultimate.informatik.uni-freiburg.de/downloads/BuchiAutomizer/),
a tool that tries to prove either termination or non-termination of a
program.
* We can check how many programs different high-level synthesis tools
can compile to FPGAs. We have tried LegUp and Vivado, for instance.
* Our webpage contains a search box, so that you can get the programs
closest to a given input program. Currently, we measure program
distance as the Euclidean distance between Namolaru feature vectors
(see the sketch right after this list).
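To make that nearest-neighbor selection (used both by YaCoS and by the
search box) concrete, here is a rough C sketch; the feature
dimensionality and the array layout are placeholders, not the actual
implementation:

    #include <math.h>
    #include <stddef.h>

    #define NUM_FEATURES 64  /* placeholder: length of a Namolaru-style vector */

    /* Euclidean distance between two feature vectors of length n. */
    double euclidean_distance(const double *a, const double *b, size_t n) {
      double sum = 0.0;
      for (size_t i = 0; i < n; i++) {
        double d = a[i] - b[i];
        sum += d * d;
      }
      return sqrt(sum);
    }

    /* Index of the database program whose feature vector is closest to
       query; its best optimization sequence would then be reused for
       the unknown program. */
    size_t closest_program(const double db[][NUM_FEATURES], size_t db_size,
                           const double query[NUM_FEATURES]) {
      size_t best = 0;
      double best_dist = INFINITY;
      for (size_t i = 0; i < db_size; i++) {
        double d = euclidean_distance(db[i], query, NUM_FEATURES);
        if (d < best_dist) { best_dist = d; best = i; }
      }
      return best;
    }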
We do not currently provide inputs for the programs in the collection.
It is possible to execute the so-called "leaf functions", i.e.,
functions that do not call other routines; we have thousands of them.
However, we do not guarantee the absence of undefined behavior during
execution.
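For illustration, running one of those leaf functions only requires a
driver that fabricates inputs. A hypothetical sketch (the clamp
function below is invented, not one of the mined functions):

    #include <stdio.h>

    /* A leaf function: it calls no other routine. */
    int clamp(int x, int lo, int hi) {
      if (x < lo) return lo;
      if (x > hi) return hi;
      return x;
    }

    /* A driver that exercises the leaf function with synthetic inputs. */
    int main(void) {
      for (int i = -5; i <= 15; i++)
        printf("clamp(%d, 0, 10) = %d\n", i, clamp(i, 0, 10));
      return 0;
    }

As noted above, the absence of undefined behavior on such fabricated
inputs is not guaranteed.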
Regards,
Fernando
That sounds like a very useful resource to improve testing and also to get easier access to good stress tests (e.g., quite a few very large functions have proven to surface compile-time problems in some backend passes).
From a quick look at the website, I couldn't find under which license the code is published. That may be a problem for some users.
Have you thought about integrating the benchmarks as external tests into LLVM’s test-suite? That would make it very easy to play around with.
Cheers,
Florian
> On 22 Feb 2020, at 14:56, Fernando Magno Quintao Pereira via llvm-dev <llvm...@lists.llvm.org> wrote:
>
> Dear LLVMers,
we thought about using the UIUC license, as in LLVM. Do you guys know
if that could be a problem, given that we are mining the functions
from GitHub?
> Have you thought about integrating the benchmarks as external tests into LLVM’s test-suite? That would make it very easy to play around with.
We had not thought about it, actually, but we would be happy to do it
if the community accepts it.
Regards,
Fernando
> On 22 Feb 2020, at 20:30, Fernando Magno Quintao Pereira <pron...@gmail.com> wrote:
>
> Hi Florian,
>
> we thought about using the UIUC license, as in LLVM. Do you guys know
> if that could be a problem, given that we are mining the functions
> from GitHub?
If I understand your approach correctly, I think the question will be quite tricky to answer. I am not a lawyer and cannot help there, sorry!
>
>> Have you thought about integrating the benchmarks as external tests into LLVM’s test-suite? That would make it very easy to play around with.
>
> We had not thought about it, actually, but we would be happy to do it
> if the community accepts it.
IIUC the mined benchmarks would fit quite well and should not be too hard to integrate (as external). But it would probably be good to have the license question answered, otherwise that might limit their practical usefulness.
Cheers,
Florian
My understanding is that LLVM’s test-suite is under a weird mix of different licenses. So long as you preserve the original licenses (and only include ones with reasonable licenses), it should be possible I think.
-Chris
We have selected a subset of 128,411 files from the following repositories:
FFmpeg, DeepMind, openssl, SoftEtherVPN, libgit2, php-src, radare2,
darwin-xnu, mongoose, reactos, git, nodemcu-firmware, redis, h2o,
obs-studio.
I will leave them here:
http://www.dcc.ufmg.br/~fernando/coisas/c_files_with_licenses.tar.gz
(38.9M), before moving them into the AnghaBench page
(cuda.dcc.ufmg.br/angha). Each one of these files contains a function
taken from a given repository, together with all the dependencies that
ensure that it compiles into LLVM bitcode. Files are organized in
folders following the directory
hierarchy in the original repository. Each file contains a header
mentioning the license that is used in the original repository. The
header reads as follows:
* This file is licensed under the (____LICENSE_NAME____)
* Its contents are the result of reconstructing functions
* from the code extracted from the original project (____PROJECT_NAME____)
* Execution is not the main goal, therefore not guaranteed.
* Read the file called COPYING.txt provided with this code,
* at the root of the project for details on its license.
*
* All the contributors of the original files are listed in CREDITS.txt
* and/or MAINTAINERS, provided at the root of this project.
Would you know if this header, plus the accompanying COPYING.txt and
CREDITS.txt files, are enough to let us start integrating the
benchmarks into the LLVM test-suite as external benchmarks?
Notice that although the process of building the compilable benchmarks
out of open-source repositories is totally automatic, adding the
licenses to the files is not, for we need to find the license used in
the repository. There is another caveat: some individual developers
add their names to the files, as comments. We do not keep comments
originally written outside the reconstructed functions, so when that
happened, those names are not preserved.
Finally, notice that we only preserve the body of a function. All the
types and declarations necessary for the compilation of that function
are reconstructed via type inference. So, the code that we distribute
is different from the code that was originally extracted from an
open-source repository. To see how different, check Figure 6 of the
technical report:
http://lac.dcc.ufmg.br/pubs/TechReports/LaC_TechReport012020.pdf.