>I'm glad to hear you like the alignments and trees!
- Is there a standard benchmark somewhere that people have been using?
>There are some sequence data sets in the examples/sequences/ directory. You might try examples/sequences/Enolase/enolase-38-trimmed.fasta . If you want to extract a 25-sequence data set, you can run
>
>    alignment-thin --down-to=25 enolase-38-trimmed.fasta > enolase-25-trimmed.fasta
Enolase and globins are both a bit short and lacking diversity for what I want to do later, but I'd accept them if there were a table of runtimes somewhere. If not, I'll submit a pull request.
- Is there a way to turn off random seeding for comparability?
Ah, I felt sheepish about having missed that, until I realized the only place it seems to be documented is the man page. There's no mention of it in the user manual or the --help message.
- Any experience with different compilers or compile options to suggest? My default has been gcc 8.2 with -O3, but I'll probably try clang 7. I also plan on trying FDO, LTO, and maybe BOLT.
-march=native should be standard for anybody doing computational work; -mtune is for packagers. I'll benchmark versus the binaries distributed.
- Does anybody have some profiling guidance?
> Do you mean what data sets to run while generating profiles for profile-guided optimization? My guess is that you should run a DNA data set, a protein data set, and a codon data set, so that you get info for different alphabet sizes.
>Or do you mean profiling the code (perf record) to what takes the most time?
I meant the latter.
- The dependencies that might be relevant to performance seem to be boost and eigen. Any words of wisdom regarding these?
> Sorry, not really. In theory the matrix multiplication could be improved by using hand-written avx instructions, or by actually using eigen to do matrix multiplications, probably. In practice, there is a lot of overhead in other areas, so the effect of inner loops is not so high.
>I'd be curious to hear what you find out.
Nobody should be hard-coding linear algebra; that's what different implementations of BLAS and LAPACK are for. But that's a step after profiling.
Thanks, I will look at these and try them out.
I just submitted a pull request with the following message:
Added a bash script and some data for benchmarking. Used benchmarking to see what worked; results in benchmarks.tsv. Summary, relative to gcc-8 -O3:

+5%, get rid of -funroll-loops
+0%, LTO
+10%, no -funroll-loops, LTO+FDO on base code
+15%, LTO+FDO using internal boost and eigen
-25%, clang-7
I would like to see how these numbers change based on whether we are looking at DNA, amino acids, or codons.
I tried removing -funroll-loops and it seems to consistently make things slower; in the case of proteins, about 7% slower. So we might have to investigate a bit. Also, in my experience (mostly from reading gcc-patches), +5% means that the time got 5% longer.
AutoFDO did not work for me.
BOLT did not work for me.
valgrind does not work because of range errors in boost inv_erfc that need to be fixed first.
gperftools shows the top calls are:

    460  32.6%  32.6%  543  38.4%  substitution::peel_internal_branch
    210  14.9%  47.4%  245  17.3%  DPmatrixConstrained::forward_cell
     38   2.7%  50.1%   47   3.3%  substitution::peel_internal_branch (inline)
The run_benchmark.sh script lets one reproduce all of these results and more.
Great! Thank you for the script.
1. The examples/sequences/EF-Tu directory has protein sequences of about 400 amino acids. I added one with 25 sequences for you a few days ago. I'll take a look at the sequences you added in your pull request as well.
2. In version 3.3, you need to run `bali-phy help advanced` in order to see the --seed option. Clearly this wasn't discoverable enough. In version 3.4 (just released) running `bali-phy --help` yields an explicit warning that not all options are shown, and a suggestion to run `bali-phy help advanced` to see additional options. Do you have any more suggestions for how to improve the discoverability?
>> It would be interesting to see the effect of profile-guided optimization or FDO. I'm also curious about the effect of using -march=native -mtune=native . gcc-8 seems pretty good.
>
> -march=native should be standard for anybody doing computational work; -mtune is for packagers. I'll benchmark versus the binaries distributed.
"-mtune is for packagers" seems wrong, or at least at the wrong level of generality. More precisely, -march specifies which machine instructions can be generated, while -mtune specifies what their perceived costs are, and therefore which of the available instructions are chosen.
Now, it is apparently the case that -march=native implies -mtune=native (a learning opportunity for me!), so that you don't actually need to write -mtune=native. Nevertheless the tuning is necessary to generate a binary that is optimized for the local architecture.