Optimizing EPA placement speed

Benjamin lin

Oct 3, 2017, 7:06:06 AM

to raxml

Dear Prof. Stamatakis,

I'm currently running tests related to phylogenetic placement of short metagenomic reads.

I'm struggling to get fast EPA (RAxML) placements (compared to pplacer).
The only option that looked useful for increasing EPA speed was to reduce the proportion of optimizations to the minimum (-G 0.00000000001).
Here is the command line I typically use:

RAxMLBinary -f v -G 0.1 -m GTRCAT -n EPA_test -s ../basic_queries.fasta -t ../RAxML_bestTree.basic_queries

Nevertheless, I noticed that this paper (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4315155/) mentions a "fast" and a "slow" way of using EPA in its own speed tests.
Unfortunately, the authors don't describe what is behind these words. I sent them an email, but have had no answer so far.
I cannot find anything resembling these options in the RAxML documentation.

Could I have missed an option that would increase the speed of EPA?

Thank you for your advice.
Best regards,

Alexandros Stamatakis

Oct 3, 2017, 7:34:26 AM

to ra...@googlegroups.com

Dear Benjamin,

> I'm currently running tests related to phylogenetic placement of short
> metagenomic reads.
>
> I'm struggling to run fast EPA (RAxML) placements (when compared to
> pplacer).
> The only option that looked useful to increase EPA speed was to reduce
> the proportion of optimizations to the minimum (-G 0.00000000001).
> Here is typically the command-line i use:
>
> RAxMLBinary -f v -G 0.1 -m GTRCAT -n EPA_test -s ../basic_queries.fasta
> -t ../RAxML_bestTree.basic_queries
>
> Nevertheless, i noticed that this paper
> (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4315155/) mentions a
> "fast" and "slow" way of using EPA, when they did their own speed tests.
> Unfortunately they don't describe what is behind these words. I sent
> them a mail, but no answer so far.

I assume it is with and without the heuristics you use (with and without
-G). It might also be that they used an older version of EPA where the
pre-placement is conducted under parsimony and then only the thorough
placement on some candidate branches under ML, but that I don't know.

> I cannot find anything looking like these options in the RAxML
> documentation.
>
> May i have missed an option that could increase the speed of EPA ?

No that sounds all right. Please stay tuned here as we will announce the
release of a completely re-designed version of EPA, including better
heuristics via the raxml google group this week.

Alexis

>
> Thank you for your advices.
> Best regards,
>

> --
> You received this message because you are subscribed to the Google
> Groups "raxml" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to raxml+un...@googlegroups.com
> <mailto:raxml+un...@googlegroups.com>.
> For more options, visit https://groups.google.com/d/optout.

--
Alexandros (Alexis) Stamatakis

Research Group Leader, Heidelberg Institute for Theoretical Studies
Full Professor, Dept. of Informatics, Karlsruhe Institute of Technology

www.exelixis-lab.org

Benjamin lin

Oct 3, 2017, 10:39:03 AM

to raxml

Thank you for the fast reply.

I hope we can use this update in our comparisons right away. I'll stay tuned.

benjamin

Pierre Barbera

Oct 4, 2017, 7:58:25 AM

to raxml

Dear Benjamin,

the reimplementation of EPA is released! You can find it here: https://github.com/Pbdas/epa-ng

Please note that it does not currently support the GTRCAT model. In fact, it currently supports only the GTRGAMMA model. Other models are definitely planned for the future; however, at this moment we are unsure whether we will support the CAT model.

As for speed, I think you will be pleasantly surprised: in my most recent tests, with default settings, epa-ng outperforms pplacer in speed by a factor of six! There is also a "new" heuristic mode (enabled by default) that, instead of preselecting a static number of branches like RAxML did with -G, selects a number of branches that adapts dynamically to the distribution of LWR (likelihood weight ratio) values on the tree. For example, if prescoring finds that the best two branches together cover practically all of the likelihood weight (together greater than 0.99), then only those two branches will be evaluated in detail. If the initial distribution of likelihood weight is flatter, then more branches are evaluated in detail. The relevant command-line option is "-g".
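
To make the adaptive preselection described above concrete, here is a minimal Python sketch. It is not taken from the EPA-ng source: the function name `select_candidates`, the dictionary input, and the branch labels are assumptions made for illustration. It keeps the best-scoring branches until their accumulated LWR reaches the threshold (0.99, as in the example above).

```python
def select_candidates(prescore_lwr, threshold=0.99):
    """Return branch ids, best first, whose accumulated LWR reaches `threshold`.

    `prescore_lwr` maps a (hypothetical) branch id to its prescoring LWR.
    """
    ranked = sorted(prescore_lwr.items(), key=lambda item: item[1], reverse=True)
    selected, accumulated = [], 0.0
    for branch, lwr in ranked:
        selected.append(branch)
        accumulated += lwr
        if accumulated >= threshold:
            break  # enough likelihood weight is covered
    return selected


# Peaked distribution: two branches carry almost all of the weight.
peaked = {"b1": 0.60, "b2": 0.395, "b3": 0.004, "b4": 0.001}
# Flat distribution: every branch is needed to reach the threshold.
flat = {"b1": 0.25, "b2": 0.25, "b3": 0.25, "b4": 0.25}

print(select_candidates(peaked))
print(select_candidates(flat))
```

With a peaked distribution only a couple of branches go on to thorough evaluation, while a flat distribution forwards many more, which is where the speed difference to a fixed fraction would come from.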

Additionally, this new version can be run on the cluster with many compute nodes with good parallel scalability.

Out of curiosity, are you struggling with speed due to a large reference tree/alignment?

All the best,

Pierre

Benjamin lin

Oct 5, 2017, 3:48:55 AM

to raxml

Hi,

Thanks! I will try that immediately!

> Out of curiosity, are you struggling with speed due to a large reference tree/alignment?

I'm testing alignments ranging from 2 to 15 kb and hundreds to thousands of leaves.
My main issue was not so much running it; I was surprised by how slow it was compared to pplacer.
As I will compare both, I didn't want to report that RAxML is super slow when it was just my mistake.

Now, a last question: am I allowed to use this latest version of EPA for results that will be published in the coming months?

best,

benjamin

Alexandros Stamatakis

Oct 5, 2017, 3:52:17 AM

to ra...@googlegroups.com

Hi Benjamin,

> Out of curiosity, are you struggling with speed due to a large
> reference tree/alignment?
>
>
> I'm testing alignments ranging from 2 to 15kb and hundreds to thousands
> of leaves.

100s-1000s of leaves in the reference tree?

> My main issue was not so much to run it, but i was surprised by the slow
> speed compared to pplacer.

Maybe because RAxML first re-optimizes all model parameters on the
reference tree each time you invoke it (unlike pplacer), although there
is a way to circumvent it. So the key question really is how large your
reference trees are.

> As I will compare both, i didn't wanted to say that raxml is super slow,
> while it was just my mistakes.
>
> Now, a last question, am I allowed to use this last version of EPA for
> results that will be published in the coming months ?

Sure :-)

Alexis

>
> best,
> benjamin

Benjamin lin

Oct 5, 2017, 5:04:37 AM

to raxml

Hi,

> 100s-1000s of leaves in the reference tree?

Yes.

benjamin

Benjamin lin

Oct 5, 2017, 5:56:41 AM

to raxml

Hi,

I'm having a hard time compiling EPA-ng, and I think this is related to the icc compiler requirement.

See the output below.

If I'm right, is there a way to compile EPA-ng without this requirement? Is there an equivalent open-source implementation?

I searched for this online, without success.

On the Intel website I see that academics get free access to some particular libraries (like pthread).

But the compiler itself is part of the following Intel products, which are all locked behind a 30-day trial period:

- Intel Parallel Studio XE

- Intel System Studio

- Intel Bi-Endian C++ Compiler

Thanks,

benjamin

---------------------------------------

ben@3420:/media/ben/STOCK/SOFTWARE/epa-ng/epa$ make
Running cmake
-- The CXX compiler identification is GNU 4.9.4
-- Check for working CXX compiler: /usr/bin/g++-4.9
-- Check for working CXX compiler: /usr/bin/g++-4.9 -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- EPA-ng version: 0.1.0
-- Building RELEASE
-- Looking for include file pthread.h
CMake Error at /usr/share/cmake-2.8/Modules/CheckIncludeFiles.cmake:58 (try_compile):
  Unknown extension ".c" for file
  /media/ben/STOCK/SOFTWARE/epa-ng/epa/build/CMakeFiles/CMakeTmp/CheckIncludeFiles.c
  try_compile() works only for enabled languages. Currently these are:
  CXX
  See project() command to enable other languages.
Call Stack (most recent call first):
  /usr/share/cmake-2.8/Modules/FindThreads.cmake:39 (CHECK_INCLUDE_FILES)
  CMakeLists.txt:67 (find_package)
-- Looking for include file pthread.h - not found
-- Could NOT find Threads (missing: Threads_FOUND)
-- Enabling Prefetching
-- Checking for OpenMP
-- Try OpenMP CXX flag = [-fopenmp]
-- Performing Test OpenMP_FLAG_DETECTED
-- Performing Test OpenMP_FLAG_DETECTED - Success
-- Checking for OpenMP -- found
-- Performing Test HAS_AVX
-- Performing Test HAS_AVX - Success
-- Performing Test HAS_SSE3
-- Performing Test HAS_SSE3 - Success
-- CMake version 2.8.12.2
-- The C compiler identification is GNU 4.9.4
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Genesis version: v0.16.0
-- Scope: library
-- Build type: RELEASE
-- Unity build: FULL
-- C++ compiler: GNU 4.9.4 at /usr/bin/g++-4.9
-- C compiler : GNU 4.9.4 at /usr/bin/cc
-- Check if the system is big endian
-- Searching 16 bit integer
-- Looking for sys/types.h
-- Looking for sys/types.h - found
-- Looking for stdint.h
-- Looking for stdint.h - found
-- Looking for stddef.h
-- Looking for stddef.h - found
-- Check size of unsigned short
-- Check size of unsigned short - done
-- Using unsigned short
-- Check if the system is big endian - little endian
-- Looking for Threads
-- Could NOT find Threads (missing: Threads_FOUND)
-- Threads not found
-- Building static lib
-- Using flags: -std=c++14 -Wall -Wextra -D__PREFETCH -fopenmp -D__OMP -mavx -D__AVX
-- Could NOT find GTest (missing: GTEST_LIBRARY GTEST_INCLUDE_DIR GTEST_MAIN_LIBRARY)
-- GTest not found
CMake Warning at test/src/CMakeLists.txt:5 (message):
  Skipping building tests.
-- Configuring incomplete, errors occurred!
See also "/media/ben/STOCK/SOFTWARE/epa-ng/epa/build/CMakeFiles/CMakeOutput.log".
See also "/media/ben/STOCK/SOFTWARE/epa-ng/epa/build/CMakeFiles/CMakeError.log".
make: *** [build/CMakeCache.txt] Error 1

Alexandros Stamatakis

Oct 5, 2017, 6:18:28 AM

to ra...@googlegroups.com

Okay, in that case RAxML should be slower than pplacer for the larger
trees, as it re-optimizes the model parameters of the reference tree by
default each time you invoke it. Regarding your other question, Pierre
will help you with that.

Alexis

Pierre Barbera

Oct 5, 2017, 6:23:27 AM

to raxml

(sent again as the last message was a private message by accident)

Hi Benjamin,

the problem was not with your compiler, but with the cmake version requirement which I had set incorrectly.

The latest commit on the master branch should circumvent this; I tested it with a very similar cmake version and it should work now.

Please do a `git pull`, `make clean` and `make` and let me know if that fixes the problem!

Regards,

Pierre

Lucas Czech

Oct 5, 2017, 6:28:44 AM

to ra...@googlegroups.com

Hi Benjamin,

also note that you don't need icc to compile EPA-ng. Any of gcc, clang or icc should work - having one of those is enough ;-)

According to your cmake output, you are using gcc, so that's fine.

Lucas

Benjamin lin

Oct 5, 2017, 9:53:42 AM

to raxml

Dear all,

I compiled it successfully and ran some quick tests.
I used the dataset provided in the pplacer tutorial (http://fhcrc.github.io/microbiome-demo/):
- Reference tree: 652 leaves.
- Queries: 2507 short reads.
- Redundancy: 2281 queries are identical.

Here are my results using the Unix "time" command on a single 3.7 GHz core:

PPlacer:

real 0m6.832s

user 0m1.260s

sys 0m0.052s

EPA_ng, dynamic heuristics:

real 0m37.929s

user 0m37.728s

sys 0m0.196s

EPA_ng, fixed heuristics:

real 5m36.602s

user 5m36.368s

sys 0m0.236s

Older EPA (RAxML 8.2.9):

real 2m41.891s

user 5m22.524s

sys 0m0.228s

I have some questions:

- Is the EPA_ng "fixed heuristic" option (-G) equivalent to the old EPA algorithm?

- EPA_ng with dynamic heuristics appears way faster, but remains slower than pplacer (default parameters) on this particular example.

I was wondering about the very high redundancy of these query reads (quite common in metabarcoding experiments): is there some check of the query file to identify identical sequences?

Thanks,

benjamin

Pierre Barbera

Oct 5, 2017, 10:30:05 AM

to ra...@googlegroups.com

Dear Benjamin,

yes, I believe the discrepancy is due to there being so many redundancies. I haven't explicitly mentioned it in the documentation yet, but EPA-ng does not (currently) dereplicate the input queries, as I see that as a pre-processing step.

That being said, I realize that the FASTA input format for the queries has no clean way of recording the multiplicity of a sequence... So keeping track of that is not straightforward. I want to find a clean/workable solution for this in the future.

The data I usually test with, and which I used to arrive at the `6 times faster` figure was a chunk of data used in a recent popular study using placement, where the fraction of redundant sequences was much lower (around 10%). It is entirely possible that they performed some dereplication however.

Just for some insight: the idea during development was to stream through the query sequences to allow for massive files (billions of queries) while keeping the memory footprint fixed. It also improves runtime, as reading and processing can be overlapped.
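
The fixed-memory streaming idea can be sketched in Python as a generator that hands out the queries in bounded chunks. This is only an illustration of the principle, not EPA-ng's actual reader: the name `stream_fasta_chunks` and the chunk size are assumptions, and the overlap of reading and processing mentioned above is not shown.

```python
import io


def stream_fasta_chunks(handle, chunk_size):
    """Yield lists of at most `chunk_size` (header, sequence) records.

    Only one chunk is held in memory at a time, so the footprint does not
    grow with the size of the query file.
    """
    chunk, header, parts = [], None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                chunk.append((header, "".join(parts)))
                if len(chunk) == chunk_size:
                    yield chunk
                    chunk = []
            header, parts = line[1:], []
        else:
            parts.append(line)  # sequences may span several lines
    if header is not None:
        chunk.append((header, "".join(parts)))
    if chunk:
        yield chunk


# Tiny in-memory example; a real run would pass an open file instead.
demo = io.StringIO(">q1\nACGT\n>q2\nAC\nGT\n>q3\nTTTT\n")
for chunk in stream_fasta_chunks(demo, chunk_size=2):
    print(chunk)
```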

In the meantime, I guess a fairer comparison would be by first dereplicating the queries, possibly like this if you have vsearch:

vsearch \
    --quiet \
    --derep_fulllength input.fasta \
    --sizein \
    --sizeout \
    --fasta_width 0 \
    --relabel_sha1 \
    --output dereplicated.fasta 2>>./log.txt

However, combining this with the current version of EPA-ng, the multiplicity information will get lost. vsearch will add it to the sequence name, so it can be recovered with additional effort, but EPA-ng does not add it to the jplace multiplicity field. That last part could be implemented relatively easily.
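
As a sketch of the "additional effort" mentioned above: with --sizeout, vsearch appends an abundance annotation of the form ";size=N" to each header, which can be parsed back out. The Python below is an illustration only; the helper names are made up, and writing the counts into a jplace file is left out.

```python
import re

# Matches a trailing ";size=N" (optionally followed by ";") in a FASTA header.
SIZE_RE = re.compile(r";size=(\d+);?$")


def split_abundance(header):
    """Split 'label;size=N' into (label, N); headers without a size count as 1."""
    match = SIZE_RE.search(header)
    if match is None:
        return header, 1
    return header[:match.start()], int(match.group(1))


def read_multiplicities(fasta_lines):
    """Map each sequence label to the multiplicity encoded in its header."""
    counts = {}
    for line in fasta_lines:
        if line.startswith(">"):
            label, size = split_abundance(line[1:].strip())
            counts[label] = size
    return counts


print(read_multiplicities([">3f2a;size=2281\n", "ACGT\n", ">9c1b;size=1\n", "TTGA\n"]))
```

The resulting label-to-count mapping could then be joined with the placement results by sequence name.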

As for your other question: yes, -G is identical in RAxML-EPA and EPA-ng.

We've added dereplication as an issue to the repository, and I will try to settle on a solution within the next release cycle (shouldn't take too long).

Thank you for your input!
Pierre

-- 
MSc Pierre Barbera

Phone: +49 6221 533 258
Fax: +49 6221 533 298
E-Mail: pierre....@h-its.org

HITS gGmbH
Schloss-Wolfsbrunnenweg 35
D-69118 Heidelberg
Amtsgericht Mannheim / HRB 337446
Managing Director: Dr. Gesa Schönberger
Scientific Director: Prof. Dr. Michael Strube

Pierre Barbera

Oct 5, 2017, 11:03:48 AM

to raxml

Dear Benjamin,

let me also add, as precautionary advice: 2500 sequences is a very small number for testing speed. At least for EPA-ng, I know that in some cases the time spent per sequence may be significantly higher for small inputs than for large inputs, due to static overheads and optimizations that trade off a small amount of time in the beginning to greatly accelerate the runtime overall. This small time investment may be a sizeable portion of a small input run, but vanishes into total insignificance for any actual data.
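
The amortization argument above can be written as a one-line model: average time per query = static overhead / number of queries + marginal time per query. The Python sketch below uses invented numbers (a 20 s one-off cost and 5 ms per query) purely for illustration; they are not measurements of EPA-ng.

```python
def time_per_query(static_overhead_s, per_query_s, n_queries):
    """Average wall time per query when a one-off cost is spread over all queries."""
    return static_overhead_s / n_queries + per_query_s


# Hypothetical costs, chosen only to show the trend.
for n in (2_500, 250_000, 25_000_000):
    print(n, time_per_query(20.0, 0.005, n))
```

Under these assumed costs the overhead dominates the small run and becomes negligible for the large ones, which is why a per-sequence speed measured on a few thousand reads may not extrapolate.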

Let me know if there is any more help I can provide.

All the best,

Pierre

Benjamin lin

Oct 5, 2017, 1:33:06 PM

to raxml

Hi Pierre,

> The data I usually test with, and which I used to arrive at the `6 times faster` figure was a chunk of data used in a recent popular study using placement, where the fraction of redundant sequences was much lower (around 10%). It is entirely possible that they performed some dereplication however.

OK, and after another check, pplacer is indeed doing dereplication; in fact I think only 250 real placements are done for this 2500-read dataset, which is precisely why I was asking whether EPA-ng had any.
You are confirming that, for a real comparison between EPA-ng and pplacer, I will indeed need to divide its execution time by the amount of metagenome redundancy (or simply discard redundancy, as you proposed).
Your reply is very useful; now I can think about a fair comparison.

In my previous developments, I generally computed a fast 32-bit checksum of the input sequences and stored the corresponding checksums in a small hash map (or equivalent).
In my experience, this takes only a very small fraction of the execution time when identifying duplicates among millions of short reads, with a small memory footprint (these are just bit shifts).

I don't know if that's adapted to your implementation or it's of any help.
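
A minimal Python sketch of the checksum approach described above, assuming CRC-32 (via `zlib.crc32`) as the 32-bit checksum; the original post does not say which checksum was used. Because 32-bit checksums can collide, the sketch confirms every checksum hit with a full sequence comparison. The function name and record layout are illustrative.

```python
import zlib


def dereplicate(records):
    """Group identical sequences.

    `records` is an iterable of (name, sequence) pairs. Returns a list of
    (sequence, [names]) in order of first appearance.
    """
    buckets = {}  # checksum -> indices into `unique`
    unique = []   # [sequence, [names]] entries
    for name, seq in records:
        key = zlib.crc32(seq.encode("ascii"))
        for idx in buckets.get(key, []):
            if unique[idx][0] == seq:  # guard against checksum collisions
                unique[idx][1].append(name)
                break
        else:
            # No identical sequence seen yet under this checksum.
            buckets.setdefault(key, []).append(len(unique))
            unique.append([seq, [name]])
    return [(seq, names) for seq, names in unique]


reads = [("r1", "ACGT"), ("r2", "ACGT"), ("r3", "TTGA"), ("r4", "ACGT")]
for seq, names in dereplicate(reads):
    print(seq, len(names), names)
```

In Python a plain dictionary keyed by the sequence would do the same job; the explicit checksum is kept here only to mirror the description.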

> At least for EPA-ng I know that in some cases, the time spent per sequence may be significantly higher for small inputs as compared to large inputs due to static overheads, and optimizations that trade off a small amount of time in the beginning to greatly accelerate the runtime overall

Thanks, I will scale up my future tests to take that into account.

In any case, thank you for all your advice.

benjamin

Benjamin lin

Oct 13, 2017, 9:15:09 AM

to raxml

Hi,

I ran into another issue.

While epa-ng works fine on my desktop computer, it fails to run on our cluster computers.

Both run Ubuntu, and I installed the exact same version of GCC (4.9).
The compilation on the server succeeds:

linard@flip:~/epa$ make
Running cmake
-- The CXX compiler identification is GNU 4.9.3
-- The C compiler identification is GNU 5.4.0

-- Check for working CXX compiler: /usr/bin/g++-4.9
-- Check for working CXX compiler: /usr/bin/g++-4.9 -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done

-- Detecting CXX compile features
-- Detecting CXX compile features - done

-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done

-- Detecting C compile features
-- Detecting C compile features - done

-- EPA-ng version: 0.1.0
-- Building RELEASE

-- Looking for pthread.h
-- Looking for pthread.h - found
-- Looking for pthread_create
-- Looking for pthread_create - not found
-- Check if compiler accepts -pthread
-- Check if compiler accepts -pthread - yes
-- Found Threads: TRUE

-- Enabling Prefetching
-- Checking for OpenMP

-- Checking for OpenMP -- found
-- Performing Test HAS_AVX
-- Performing Test HAS_AVX - Success
-- Performing Test HAS_SSE3
-- Performing Test HAS_SSE3 - Success

-- CMake version 3.5.1

-- Genesis version: v0.16.0
-- Scope: library
-- Build type: RELEASE
-- Unity build: FULL

-- C++ compiler: GNU 4.9.3 at /usr/bin/g++-4.9
-- C compiler  : GNU 5.4.0 at /usr/bin/cc

-- Check if the system is big endian
-- Searching 16 bit integer
-- Looking for sys/types.h
-- Looking for sys/types.h - found
-- Looking for stdint.h
-- Looking for stdint.h - found
-- Looking for stddef.h
-- Looking for stddef.h - found
-- Check size of unsigned short
-- Check size of unsigned short - done
-- Using unsigned short
-- Check if the system is big endian - little endian
-- Looking for Threads

-- Found Threads: -pthread
-- Using Threads

-- Building static lib
-- Using flags: -std=c++14 -Wall -Wextra -D__PREFETCH -fopenmp -D__OMP -mavx -D__AVX 
-- Could NOT find GTest (missing:  GTEST_LIBRARY GTEST_INCLUDE_DIR GTEST_MAIN_LIBRARY) 
-- GTest not found
CMake Warning at test/src/CMakeLists.txt:5 (message):
  Skipping building tests.

-- Configuring done
-- Generating done
-- Build files have been written to: /home/linard/tests_speed/epa/build
Running make
make -C build 
make[1]: Entering directory '/home/linard/tests_speed/epa/build'
make[2]: Entering directory '/home/linard/tests_speed/epa/build'
make[3]: Entering directory '/home/linard/tests_speed/epa/build'
Scanning dependencies of target genesis_lib_static
make[3]: Leaving directory '/home/linard/tests_speed/epa/build'
make[3]: Entering directory '/home/linard/tests_speed/epa/build'
[  4%] Building CXX object libs/genesis/lib/genesis/CMakeFiles/genesis_lib_static.dir/__/__/__/__/genesis_unity_sources/lib/all.cpp.o
[  9%] Linking CXX static library ../../../../../libs/genesis/bin/libgenesis.a
make[3]: Leaving directory '/home/linard/tests_speed/epa/build'
[  9%] Built target genesis_lib_static
make[3]: Entering directory '/home/linard/tests_speed/epa/build'
Scanning dependencies of target epa_module
make[3]: Leaving directory '/home/linard/tests_speed/epa/build'
make[3]: Entering directory '/home/linard/tests_speed/epa/build'
[ 14%] Building CXX object src/CMakeFiles/epa_module.dir/sample/Placement.cpp.o
[ 19%] Building CXX object src/CMakeFiles/epa_module.dir/tree/Tiny_Tree.cpp.o
[ 23%] Building CXX object src/CMakeFiles/epa_module.dir/tree/Tree.cpp.o
[ 28%] Building CXX object src/CMakeFiles/epa_module.dir/tree/tiny_util.cpp.o
[ 33%] Building CXX object src/CMakeFiles/epa_module.dir/core/pll/epa_pll_util.cpp.o
[ 38%] Building CXX object src/CMakeFiles/epa_module.dir/core/pll/optimize.cpp.o
[ 42%] Building CXX object src/CMakeFiles/epa_module.dir/core/pll/pll_util.cpp.o
[ 47%] Building CXX object src/CMakeFiles/epa_module.dir/core/raxml/Model.cpp.o
[ 52%] Building CXX object src/CMakeFiles/epa_module.dir/core/place.cpp.o
[ 57%] Building CXX object src/CMakeFiles/epa_module.dir/pipeline/schedule.cpp.o
[ 61%] Building CXX object src/CMakeFiles/epa_module.dir/seq/MSA.cpp.o
[ 66%] Building CXX object src/CMakeFiles/epa_module.dir/seq/MSA_Stream.cpp.o
[ 71%] Building CXX object src/CMakeFiles/epa_module.dir/io/jplace_util.cpp.o
[ 76%] Building CXX object src/CMakeFiles/epa_module.dir/io/file_io.cpp.o
[ 80%] Building CXX object src/CMakeFiles/epa_module.dir/io/Binary.cpp.o
[ 85%] Building CXX object src/CMakeFiles/epa_module.dir/util/stringify.cpp.o
[ 90%] Building CXX object src/CMakeFiles/epa_module.dir/set_manipulators.cpp.o
[ 95%] Building CXX object src/CMakeFiles/epa_module.dir/main.cpp.o
[100%] Linking CXX executable ../../bin/epa-ng
make[3]: Leaving directory '/home/linard/tests_speed/epa/build'
[100%] Built target epa_module
make[2]: Leaving directory '/home/linard/tests_speed/epa/build'
make[1]: Leaving directory '/home/linard/tests_speed/epa/build'

However, the produced binary immediately fails with a core dump:

linard@flip:~/tests_speed/epa$ ./bin/epa-ng 

Illegal instruction (core dumped)

Valgrind tells me that the program executes unrecognised instructions, so I checked how the CPUs differ.

I noticed that sse3 and avx instructions are absent from the cluster CPU flags (see below, but sse4a is present).

Desktop computer ( epa-ng works):

vendor_id : GenuineIntel
cpu family : 6
model : 94
model name : Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz
microcode : 0x74
cpu MHz : 2776.046
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp

Cluster ( epa-ng produces immediate core dump):

vendor_id : AuthenticAMD
cpu family : 16
model : 9
model name : AMD Opteron(tm) Processor 6172
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nopl nonstop_tsc extd_apicid amd_dcm pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt nodeid_msr hw_pstate vmmcall npt lbrv svm_lock nrip_save pausefilter

I will switch to other machines.
But would it be possible to know the minimum requirements, in terms of CPU instructions, for running EPA-ng?
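
For comparing machines like this, a small helper that reads the SIMD-related entries out of a /proc/cpuinfo "flags" line can help. One detail worth knowing when reading such dumps: Linux lists SSE3 under the name "pni", so there is never a literal "sse3" entry. The Python sketch below is an illustration; the function names are made up and the sample flag string is an abbreviated excerpt of the Opteron flags posted above.

```python
def simd_support(flags_line):
    """Report SIMD support from a /proc/cpuinfo 'flags' line (prefix optional)."""
    flags = set(flags_line.split(":", 1)[-1].split())
    return {
        "sse3": "pni" in flags,   # Linux reports SSE3 as "pni"
        "ssse3": "ssse3" in flags,
        "avx": "avx" in flags,
        "avx2": "avx2" in flags,
    }


def read_flags(path="/proc/cpuinfo"):
    """Return the first 'flags' line of the given cpuinfo file (Linux only)."""
    with open(path) as handle:
        for line in handle:
            if line.startswith("flags"):
                return line
    return ""


# Abbreviated excerpt of the cluster CPU flags quoted above.
opteron = "flags : fpu sse sse2 amd_dcm pni monitor cx16 popcnt abm sse4a"
print(simd_support(opteron))
```

On a Linux host, `simd_support(read_flags())` gives the same report for the local CPU.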

Thank you for your help,

benjamin

Lucas Czech

Oct 13, 2017, 9:20:29 AM

to ra...@googlegroups.com

Hi Benjamin,

is it possible that you copied the epa directory from your desktop machine to the cluster, and then ran `make pll` and `make` directly? In that case, the configuration is still made for your desktop, that is, it uses all those special instructions during compilation. Try to run `make clean` first, and then `make pll` and `make` again.

If this still does not work, we will have to dig deeper.

Best
Lucas

Benjamin lin

Oct 13, 2017, 10:02:36 AM

to raxml

> is it possible that you copied the epa directory from your desktop machine to the cluster

I started from scratch, starting from a git clone.

Then did "make pll" successfully and reran the compilation "make" (also successful).

Any execution of the resulting binary immediately throws the core dump (even without parameters, where it should print the help).

I attach here the complete output of all compilation steps and the debug log from valgrind, if this may be of any help.

Best,

error_epa_compile

Pierre Barbera

Oct 13, 2017, 11:17:40 AM

to ra...@googlegroups.com

Hello Benjamin,

this is strange indeed. EPA-ng doesn't have a minimum requirement on vector instruction sets.

I can see that cmake successfully detects AVX, even though, as you write, the target processor does not support it.

Did you run the executable on the login node of your cluster? Or did you submit a job script? Perhaps your login node's compilers are configured to compile for the actual compute nodes, while the login node has a different architecture.

Barring that, it might be an issue with cmake, and/or how I use it to detect the intrinsics, but it hasn't failed me yet. The only other odd thing I can see is that your C compiler is set differently than your CXX compiler:

-- The CXX compiler identification is GNU 4.9.3
-- The C compiler identification is GNU 5.4.0

You can try to resolve that and see how it affects it.

Regards,

Pierre

Alexey Kozlov

Oct 14, 2017, 10:31:48 AM

to chriswymant via raxml

Hi Benjamin,

I'd strongly recommend *against* running epa-ng on any CPU without SSE3, as the performance will be *extremely* poor. (And if you have a choice, please use an AVX/AVX2-enabled CPU.)

Best,

Alexey

Benjamin lin

Oct 16, 2017, 7:36:37 AM

to raxml

Hi,

> I'd strongly recommend *against* running epa-ng on any CPU without SSE3, as the performance will be *extremely* poor. (and if you have a choice, please use AVX/AVX2-enabled CPU)

I will. I was also surprised that these CPUs have no SSE3 support...

> Perhaps your login nodes compilers are configured to compile for the actual compute nodes, while the login node has a different architecture.

That is precisely what I wanted to avoid by running EPA-ng on a server to which we have direct access, without going through the SGE grid.
Compilation and execution will all be done on the same machine.

> Barring that, it might be an issue with cmake, and/or how I use it to detect the intrinsics, but it hasn't failed me yet. The only other odd thing I can see is that your C compiler is set differently than your CXX compiler:
>
> -- The CXX compiler identification is GNU 4.9.3
> -- The C compiler identification is GNU 5.4.0
>
> You can try to resolve that and see how it affects it.

I have set CXX as suggested.

The compilation logs:

-- The CXX compiler identification is GNU 5.4.0

-- The C compiler identification is GNU 5.4.0

Compilation is successful.
Unfortunately, the core dump problem remains.

I will look for a way to run the program on machines with different CPU architectures.
If the issue occurs on other architectures, I will let you know.

Best regards,

benjamin

Benjamin lin

Oct 26, 2017, 10:51:25 AM

to raxml

Hi,

Some feedback from my compilation tests that may be helpful.
Situation: I could update to GCC 4.9 on only one server; our HPC cluster is on GCC 4.8.5 (and cannot be updated immediately).

What I did:

1. I added the flags "-static -static-libgcc -static-libstdc++" to "set CMAKE_CXX_FLAGS" in the top-level CMakeLists.txt (in the /epa directory).

2. I compiled successfully on cluster A, with GCC 5.4.0. This is the machine with AMD CPUs (no sse3 and no avx), where the binary throws a core dump at launch.

3. I transferred this compiled binary to a first cluster B, with GCC 4.8.5 and CPU architecture B (with sse3 but no avx, see below). It throws a core dump at start.

4. I transferred this binary to a second cluster C, with GCC 4.8.5 and CPU architecture C (with both sse3 and avx, see below). There it works perfectly!

Is it currently compatible only with AVX CPUs?

Anyway, I hope this is helpful to you.

benjamin

CLUSTER A CPUs:

processor : 0

vendor_id : AuthenticAMD

cpu family : 16

model : 9

model name : AMD Opteron(tm) Processor 6172

stepping : 1

microcode : 0x10000c4

cpu MHz : 2100.113

cache size : 512 KB

physical id : 0

siblings : 12

core id : 0

cpu cores : 12

apicid : 0

initial apicid : 0

fpu : yes

fpu_exception : yes

cpuid level : 5

wp : yes

flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nopl nonstop_tsc extd_apicid amd_dcm pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt nodeid_msr hw_pstate vmmcall npt lbrv svm_lock nrip_save pausefilter

bugs : tlb_mmatch fxsave_leak sysret_ss_attrs

bogomips : 4200.22

TLB size : 1024 4K pages

clflush size : 64

cache_alignment : 64

address sizes : 48 bits physical, 48 bits virtual

power management: ts ttp tm stc 100mhzsteps hwpstate

CLUSTER B CPUs:

processor : 0

vendor_id : GenuineIntel

cpu family : 6

model : 44

model name : Westmere E56xx/L56xx/X56xx (Nehalem-C)

stepping : 1

microcode : 0x1

cpu MHz : 2659.996

cache size : 4096 KB

physical id : 0

siblings : 1

core id : 0

cpu cores : 1

apicid : 0

initial apicid : 0

fpu : yes

fpu_exception : yes

cpuid level : 11

wp : yes

flags : fpu de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx lm constant_tsc rep_good nopl pni pclmulqdq ssse3 cx16 sse4_1 sse4_2 x2apic popcnt aes hypervisor lahf_lm

bogomips : 5319.99

clflush size : 64

cache_alignment : 64

address sizes : 40 bits physical, 48 bits virtual

power management:

CLUSTER C CPUs:

processor : 1

vendor_id : GenuineIntel

cpu family : 6

model : 79

model name : Intel(R) Xeon(R) CPU E5-2640 v4 @ 2.40GHz

stepping : 1

microcode : 0xb00001a

cpu MHz : 2400.020

cache size : 25600 KB

physical id : 1

siblings : 10

core id : 0

cpu cores : 10

apicid : 32

initial apicid : 32

fpu : yes

fpu_exception : yes

cpuid level : 20

wp : yes

flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch ida arat epb pln pts dtherm intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdseed adx smap xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local

bogomips : 4805.32

clflush size : 64

cache_alignment : 64

address sizes : 46 bits physical, 48 bits virtual

Benjamin lin

Oct 26, 2017, 10:59:23 AM

to raxml

I should add that I realized there are SSE3 / AVX branches that should be selected at compile time (in the top CMakeLists.txt), but it seems they are not correctly taken into account.

Alexey Kozlov

Oct 26, 2017, 11:13:58 AM

to ra...@googlegroups.com

@Benjamin: thanks for testing. Why don't you send us this static binary (preferably to Pierre, but feel free to put me on
cc)? We have an AMD non-AVX system here, so we can test.

@Pierre: perhaps you should consider runtime detection of CPU features; it saves a lot of hassle. There is even a
function in libpll for that, so it is really easy to implement; you could check the RAxML-NG code for an example. Of course, this
might fail as well under some circumstances, but it is still better than relying on CMake AND hoping that the binary will be used
on the same machine / same CPU where it was compiled...
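
For anyone who wants to check by hand what a given machine supports before running a binary, here is a rough Python sketch that reads the "flags" line of /proc/cpuinfo. It is not the libpll function mentioned above, just a stand-in for illustration. The function names and the sample flags line are made up for this example; the only outside fact it relies on is that Linux lists SSE3 under the name "pni".

```python
# Sketch only: the values below are the Linux /proc/cpuinfo spellings.
SIMD_FLAGS = {
    "sse3": "pni",   # Linux reports SSE3 as "pni" (Prescott New Instructions)
    "ssse3": "ssse3",
    "sse4_1": "sse4_1",
    "avx": "avx",
    "avx2": "avx2",
}


def simd_support(flags_line):
    """Map each SIMD feature name to True/False for one cpuinfo 'flags' line."""
    # Drop the "flags :" key if present, then split into individual tokens.
    _, _, values = flags_line.rpartition(":")
    present = set(values.split())
    return {name: flag in present for name, flag in SIMD_FLAGS.items()}


def read_cpuinfo_flags(path="/proc/cpuinfo"):
    """Return the first 'flags' line from /proc/cpuinfo (Linux only)."""
    with open(path) as handle:
        for line in handle:
            if line.startswith("flags"):
                return line
    raise RuntimeError("no 'flags' line found in " + path)


# Hypothetical excerpt in the style of the listings in this thread.
sample = "flags : fpu sse sse2 pni ssse3 sse4_1 sse4_2 popcnt"
print(simd_support(sample))
```

On a Linux machine, simd_support(read_cpuinfo_flags()) would report the same features for the local CPU. A binary that was built with AVX instructions can be expected to crash on a CPU where "avx" comes back False.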

On 26.10.2017 16:51, Benjamin lin wrote:
> Hi,
>
> Some feedback on my compilation tests that may be helpful.
> Situation: I could update to GCC 4.9 on only one server; our HPC is on GCC 4.8.5 (and cannot be updated immediately).
>
> What I did:
>
> 1. I added the flags "-static -static-libgcc -static-libstdc++" to "set CMAKE_CXX_FLAGS" in the top CMakeLists.txt (in the /epa
> directory).
> 2. I compiled successfully on cluster A, with GCC 5.4.0. It's the machine with AMD CPUs (no sse3 and no avx) where
> it throws a core dump at launch.
> 3. I transferred this compiled binary to a first cluster B, with GCC 4.8.5 and CPU architecture B (with sse3 but no
> avx, see below). It throws a core dump at start.
> 4. I transferred this binary to a second cluster C, with GCC 4.8.5 and CPU architecture C (with both sse3 and avx, see
> below). There it works perfectly!
>
> Is it currently compatible only with AVX CPUs?
> Anyway, I hope this is helpful to you.
>
> benjamin
>
> CLUSTER A CPUs:
> CLUSTER B CPUs:
> CLUSTER C CPUs:
>
> --
> You received this message because you are subscribed to the Google Groups "raxml" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to raxml+un...@googlegroups.com.

Pierre Barbera

Oct 26, 2017, 11:18:52 AM

to ra...@googlegroups.com

Hello Benjamin,

first of all, thank you for the in-depth investigation! And apologies about the late reply.

I think you are correct, the problem is with CMake detecting AVX compatibility where none exists. I just replicated the problem locally.

I will replace the code with something that works and let you know when it hits the master branch!

Thanks again,
Pierre

Pierre Barbera

Oct 26, 2017, 11:59:47 AM

to ra...@googlegroups.com

Hello Benjamin,

if you pull the latest commit on the master branch, your issue should (hopefully) be resolved! Please let me know about any further issues, I really appreciate it!

Pierre

Benjamin lin

Oct 27, 2017, 5:10:57 AM

to raxml

Excellent, from my quick tests this resolved the issue.

Thank you for your responsiveness!

benjamin

Benjamin lin

Nov 17, 2017, 5:07:46 AM

to raxml

Hi,

For information, I found another issue related to the size of the input file.

When using an input alignment of 1,000,000 reads (or fewer) aligned with hmmalign to a 650-leaf reference tree, EPA-ng works fine.

When using an alignment of 10,000,000 reads, I get a core dump just before the placement process starts (see output below).

The conversion to a binary file works fine.

Should I change some specific values in the source code to allow the analysis of larger datasets?

Thanks !

INFO Selected: verbose (debug) output
INFO Selected: Output dir: ./
INFO Selected: Query file: ../EMP_92_studies_10000000.fas_pplacer16S.aln.fas
INFO This appears to be a non-binary fasta file. Converting!
INFO Updated Query file: ./EMP_92_studies_10000000.fas_pplacer16S.aln.fas.bin
INFO Selected: Tree file: ../RAxML_result.bv_refs_aln
INFO Selected: Reference MSA: ../bv_refs_aln_stripped_99.5.fasta
INFO Selected: Using threads: 1
INFO     ______ ____   ___           _   __ ______
        / ____// __ \ /   |         / | / // ____/
       / __/  / /_/ // /| | ______ /  |/ // / __  
      / /___ / ____// ___ |/_____// /|  // /_/ /  
     /_____//_/    /_/  |_|      /_/ |_/ \____/
DBG     Rate heterogeneity: GAMMA (4 cats, mean),  alpha: 1 (ML),  weights&rates: (0.25,0.136954) (0.25,0.476752) (0.25,1) (0.25,2.38629) 
        Base frequencies (ML): 0.25 0.25 0.25 0.25 
        Substitution rates (ML): 0.5 0.5 0.5 0.5 0.5 1
DBG  Tree length: 56.5177
DBG  Post-optimization reference tree log-likelihood: -92080.956637
INFO Number of ranks: 1
INFO Number of sequences per rank: 10000652
DBG  num_sequences: 5000
DBG  Preplacement.
DBG  Using threads: 1
DBG  Max threads: 1
Segmentation fault (core dumped)

Pierre Barbera

Nov 17, 2017, 6:10:10 AM

to ra...@googlegroups.com

Hi Benjamin,

would it be possible for you to send me your input files in a private email?

I've tested with 100M input sequences before and never had issues at that stage of the algorithm, so I would be very interested in reproducing this.

Alternatively you can try to recompile with:

make clean && make EPA_DEBUG=1

and then re-run the program. If that yields no concrete output, you can run the program with gdb by prepending this to your command:

gdb --args

then "r" from within the gdb shell to run the actual program. Once it's in a failure state, you can use the command "bt" to produce a backtrace.
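
If the run happens on a cluster node where an interactive gdb session is awkward, the same steps can probably be scripted. The sketch below is only an illustration: it builds a gdb command line using gdb's --batch, -ex and --args options, so that "run" and "bt" are issued automatically. The helper names, the program name and its arguments are all placeholders, not the actual EPA-ng command line.

```python
import shlex
import subprocess


def gdb_backtrace_command(program, program_args):
    """Build a gdb command line that runs the program and prints a backtrace."""
    return [
        "gdb",
        "--batch",          # exit gdb once the commands below have run
        "-ex", "run",       # same as typing "r" in the gdb shell
        "-ex", "bt",        # same as typing "bt" after the crash
        "--args", program, *program_args,
    ]


def run_with_backtrace(program, program_args):
    """Run the program under gdb and return gdb's combined output as text."""
    result = subprocess.run(
        gdb_backtrace_command(program, program_args),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return result.stdout


# Hypothetical program name and arguments, for illustration only.
command = gdb_backtrace_command("./my_program", ["--input", "data.txt"])
print(shlex.join(command))
```

The printed line can be pasted into a job script as-is, or run_with_backtrace can be called to capture the output in a file to attach to a reply. If the program exits normally, the "bt" step simply reports that there is no stack.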

Thanks,

Pierre
