>I'm glad to hear you like the alignments and trees!
- Is there a standard benchmark somewhere that people have been using?
>There are some sequence data sets in the examples/sequences/ directory. You might try examples/sequences/Enolase/enolase-38-trimmed.fasta . If you want to extract a 25-sequence data set, you can run
>
>    alignment-thin --down-to=25 enolase-38-trimmed.fasta > enolase-25-trimmed.fasta
Enolase and globins are both a bit short and lacking diversity for what I want to do later, but I'd accept them if there were a table of runtimes somewhere. If not, I'll submit a pull request.
- Is there a way to turn off random seeding for comparability?
Ah, I felt sheepish about having missed that, until I realized the only place it seems to be documented is the man page. There's no mention of it in the user manual or the --help message.
- Any experience with different compilers or compile options to suggest? My default has been gcc 8.2 with -O3, but I'll probably try clang 7. I also plan on trying FDO, LTO, and maybe BOLT.
-march=native should be standard for anybody doing computational work; -mtune is for packagers. I'll benchmark versus the binaries distributed.
- Does anybody have some profiling guidance?
> Do you mean what data sets to run while generating profiles for profile-guided optimization? My guess is that you should run a DNA data set, a protein data set, and a codon data set, so that you get info for different alphabet sizes.
>Or do you mean profiling the code (perf record) to what takes the most time?
I meant the latter.
- The dependencies that might be relevant to performance seem to be boost and eigen. Any words of wisdom regarding these?
> Sorry, not really. In theory the matrix multiplication could be improved by using hand-written avx instructions, or by actually using eigen to do matrix multiplications, probably. In practice, there is a lot of overhead in other areas, so the effect of inner loops is not so high.
>I'd be curious to hear what you find out.
Nobody should be hard-coding linear algebra; that's what different implementations of BLAS and LAPACK are for. But that's a step after profiling.
Thanks, I will look at these and try them out.
I just submitted a pull request with the following message:
Added a bash script and some data for benchmarking. Used benchmarking to see what worked; results in benchmarks.tsv. Summary, relative to gcc-8 -O3:

+5%, get rid of -funroll-loops
+0%, LTO
+10%, no -funroll-loops, LTO+FDO on base code
+15%, LTO+FDO using internal boost and eigen
-25%, clang-7
I would like to see how these numbers change based on whether we are looking at DNA, amino acids, or codons.
I tried removing -funroll-loops and it seems to consistently make things slower; in the case of proteins, about 7% slower. So we might have to investigate a bit. Also, in my experience (mostly from reading gcc-patches), +5% means that the time got 5% longer.
AutoFDO did not work for me.
BOLT did not work for me.
valgrind does not work because of range errors in boost inv_erfc that need to be fixed first.
gperftools shows the top calls are:

    460  32.6%  32.6%  543  38.4%  substitution::peel_internal_branch
    210  14.9%  47.4%  245  17.3%  DPmatrixConstrained::forward_cell
     38   2.7%  50.1%   47   3.3%  substitution::peel_internal_branch (inline)
The run_benchmark.sh script lets one reproduce all of these results and more.
Great! Thank you for the script.
1. The examples/sequences/EF-Tu directory has protein sequences of about 400 amino acids. I added one with 25 sequences for you a few days ago. I'll take a look at the sequences you added in your pull request as well.
2. In version 3.3, you need to run `bali-phy help advanced` in order to see the --seed option. Clearly this wasn't discoverable enough. In version 3.4 (just released) running `bali-phy --help` yields an explicit warning that not all options are shown, and a suggestion to run `bali-phy help advanced` to see additional options. Do you have any more suggestions for how to improve the discoverability?
>> It would be interesting to see the effect of profile-guided optimization or FDO. I'm also curious about the effect of using -march=native -mtune=native . gcc-8 seems pretty good.
>
> -march=native should be standard for anybody doing computational work; -mtune is for packagers. I'll benchmark versus the binaries distributed.
"-mtune is for packagers" seems wrong, or at least at the wrong level of generality. More precisely, -march specifies which machine instructions can be generated, while -mtune specifies what their perceived costs are, and therefore which of the available instructions are chosen.
Now, it is apparently the case that -march=native implies -mtune=native (a learning opportunity for me!), so that you don't actually need to write -mtune=native. Nevertheless the tuning is necessary to generate a binary that is optimized for the local architecture.