Any suggestions about handling a huge dataset with parse-examl?


lei chen

22.06.2017, 05:14:39
to raxml
Hi all,

  I am currently trying to parse a huge dataset for the ExaML software. The dataset has reached 24 GB. I run parse-examl with the following command:

parse-examl -s all.merge.fasta.phy -m DNA -n all.merge.fasta >parse.log

However, I haven't gotten any result, even though three days have passed. parse-examl uses only a single core. Is there a pthread version of this software, or is it possible to accelerate this process? Any advice would be greatly appreciated.


Best,
Lei

Alexey Kozlov

22.06.2017, 13:31:40
to ra...@googlegroups.com
Hi Lei,

unfortunately, the parser code is not parallelized, since it has never been a bottleneck.

However, your dataset size is really impressive, so I can imagine parse-examl spending some time on it. Still, 3 days
sounds like too much.

I just tested a 15GB alignment and it took only ~1.5 hours to parse it.

May I ask what dimensions your dataset has (# taxa x # sites)?

Best,
Alexey

On 22.06.2017 11:14, lei chen wrote:
> [quoted message and list footer snipped]
This message was deleted

lei chen

23.06.2017, 02:05:03
to raxml
Hi Alexey,

  Thanks very much for getting back to me.
  My dataset has 50 taxa x 300 Mb of sites. I have emailed you and will repost my questions here, as you suggested. Actually, my full dataset is 50 taxa x 1.5 Gb of sites; I chose only 300 Mb of sites per taxon for a preliminary study. It was beyond my expectations that just parsing this subset would take so long. I am wondering:
1. How much time would it take to parse my whole dataset?
2. How much memory would ML inference in ExaML need for the whole dataset? We have a computer cluster with 64 Gb of memory per node. Is it possible to run it on this cluster?
Could you please give me some suggestions? I would appreciate any advice.

Best,
Lei

On Friday, June 23, 2017 at 1:31:40 AM UTC+8, Alexey Kozlov wrote:

Alexey Kozlov

23.06.2017, 10:11:46
to ra...@googlegroups.com
Hi Lei,

> Thanks very much for getting back to me.
> My dataset has 50 taxa x 300 Mb of sites. I have emailed you and will repost my questions here, as you suggested.
> Actually, my full dataset is 50 taxa x 1.5 Gb of sites; I chose only 300 Mb of sites per taxon for a preliminary
> study. It was beyond my expectations that just parsing this subset would take so long. I am wondering:

> 1. How much time would it take to parse my whole dataset?

Hard to tell... I just checked the old logs, and for a dataset very similar to your reduced one (~50 taxa x 320 Mb),
the parser took ~10 hours to complete. The only thing I can suggest here is to double-check that nothing nasty is going
on (e.g., that your job submission system isn't starting multiple parser instances in parallel), and to use a CPU with
higher single-thread performance, if possible.
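For example, a quick way to check for duplicate instances on a node (just a sketch, assuming GNU pgrep is available):

```shell
# Sketch: count running parse-examl processes on this node.
# The [-] in the pattern keeps pgrep from matching this command line itself.
n=$(pgrep -cf 'parse[-]examl' || true)
if [ "$n" -gt 1 ]; then
  echo "warning: $n parser instances running"
else
  echo "ok: $n parser instance(s) running"
fi
```

Seeing more than one instance would suggest the scheduler launched the parser several times against the same files.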

> 2. How much memory would it cost to do ML inference in ExaML using the whole dataset? We have a computer cluster
> with 64 Gb memory per node. Is it possible to run it on this cluster?

Yes, if you allocate enough nodes for the job (probably 30-40 nodes for the 300Mb dataset).

You can estimate the overall memory requirements using this online calculator here:
http://exelixis-lab.org/web/software/raxml/index.html#memcalc
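As a sanity check on the calculator's output, here is a rough back-of-the-envelope (my own approximation, not the calculator's exact formula): under GAMMA, each of the ~(#taxa - 2) inner tree nodes stores a conditional likelihood vector of 4 states x 4 rate categories x 8 bytes per site.

```shell
# Rough memory estimate for a DNA alignment under GAMMA:
# (taxa - 2) inner nodes x sites x 4 states x 4 rate categories x 8 bytes.
taxa=50
sites=1500000000   # 1.5 Gb sites
awk -v t="$taxa" -v s="$sites" \
    'BEGIN { printf "%.1f TB\n", (t - 2) * s * 4 * 4 * 8 / 1e12 }'
```

This prints 9.2 TB, in the same ballpark as the calculator's figure for this dataset (the calculator accounts for additional details, hence small differences).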

Best,
Alexey

lei chen

23.06.2017, 22:58:11
to raxml
Hi Alexey,
 Grateful for your suggestions. The calculator shows that 50 taxa x 1.5 Gb of sites would consume 8.7 Tb of memory. Does that mean that, if I allocate 200 nodes, ~40 Gb of memory is required on each node? Also, I am wondering whether multiple threads would consume more memory than expected. How much more memory would be needed if I use 24 threads on each node?

Best,
Lei

On Friday, June 23, 2017 at 10:11:46 PM UTC+8, Alexey Kozlov wrote:

Alexey Kozlov

25.06.2017, 09:14:50
to ra...@googlegroups.com
Hi Lei,

> Grateful for your suggestions. The calculation shows that 50 taxa X 1.5 Gb sites would consume 8.7 Tb memory. Does
> that mean, if I allocate 200 nodes, ~40Gb memory is required for each node? And, I am wondering if multi threads would
> consume more memory than expected. How much memory will be increased if I use 24 threads on each node?

Not much. Also, please note that by default ExaML uses MPI for intra-node parallelization as well, i.e., you would
start 24 MPI ranks per node and 200 x 24 = 4800 MPI ranks in total.
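Spelling out the arithmetic with the numbers from this thread (my own back-of-the-envelope, not official guidance):

```shell
# Total MPI ranks and the per-node memory share, using the thread's numbers.
nodes=200
ranks_per_node=24
total_memory_tb=8.7   # calculator's estimate for 50 taxa x 1.5 Gb sites

echo "total ranks: $((nodes * ranks_per_node))"
awk -v tb="$total_memory_tb" -v n="$nodes" \
    'BEGIN { printf "per-node memory: %.1f GB\n", tb * 1000 / n }'
```

This prints 4800 total ranks and ~43.5 GB per node, which fits within 64 GB nodes with some headroom for the ranks' overhead.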

Best,
Alexey
This message was deleted

lei chen

27.06.2017, 11:28:55
to raxml
Hi Alexey,
  Thank you very much for your advice. It has been a great help.

On Sunday, June 25, 2017 at 9:14:50 PM UTC+8, Alexey Kozlov wrote:

Alexey Kozlov

27.06.2017, 11:36:42
to ra...@googlegroups.com
you're welcome :)
