Hi Mahzad,

That's quite interesting work. Thank you!

One thing I noticed in my usage (with mdreader, but I guess the conclusion extends to dask as well) is that load balancing depends a lot on per-frame analysis time: for heavy analyses that take multiple seconds per frame, trajectory data access happens less often and access times gradually become staggered across CPUs (i.e., all CPUs start out attempting to access data almost simultaneously, but because they get served and start working at different times, subsequent data requests have a smaller chance of overlapping).

Cheers,
Manel
On Feb 25, 2017 4:27 AM, "Hai Nguyen" <nha...@gmail.com> wrote:
Nice work, thanks.

Hai
On Fri, Feb 24, 2017 at 9:31 PM, Mahzad Khoshlessan <mahzadkh...@gmail.com> wrote:
Hello all,
I am a PhD student at ASU working in Oliver Beckstein's group. I have been doing parallel analysis of MD trajectories using the Dask parallel library.
We have performed benchmarks on a range of commonly used MD file formats (CHARMM/NAMD DCD, Gromacs XTC, Amber NetCDF) and different trajectory sizes on different high-performance computing (HPC) resources. Benchmarks are performed both on a single node and across multiple nodes.
Our results show a strong dependence on file format and hardware, and several bottlenecks are identified in our study.
We would like to share our results with you and would appreciate any comments and feedback on them.
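For context, the core pattern we benchmark is the usual split-apply-combine over contiguous trajectory blocks. A simplified sketch of the idea (illustrative only, not our exact benchmark code; the file names, atom selection, and dask scheduler choice are placeholders):

import numpy as np
import MDAnalysis as mda
from MDAnalysis.analysis.rms import rmsd
from dask import delayed
import dask

def block_rmsd(topology, trajectory, start, stop):
    # each task opens its own Universe, so nothing has to be pickled
    u = mda.Universe(topology, trajectory)
    protein = u.select_atoms('protein')
    u.trajectory[0]
    ref = protein.positions.copy()        # reference coordinates (frame 0)
    out = np.zeros(stop - start)
    for i, ts in enumerate(u.trajectory[start:stop]):
        out[i] = rmsd(protein.positions, ref, superposition=True)
    return out

u = mda.Universe('top.pdb', 'traj.xtc')   # placeholder file names
bounds = np.linspace(0, len(u.trajectory), 4 + 1, dtype=int)  # n_blocks = 4
blocks = [delayed(block_rmsd)('top.pdb', 'traj.xtc', b, e)
          for b, e in zip(bounds[:-1], bounds[1:])]
result = np.concatenate(dask.compute(*blocks, scheduler='processes'))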
Here are the citation and link to the draft of our report (the report is a bit long, but it contains all of our data):
Khoshlessan, Mahzad; Beckstein, Oliver (2017): Parallel analysis in the MDAnalysis Library: Benchmark of Trajectory File Formats. figshare.
https://doi.org/10.6084/m9.figshare.4695742
Thanks in advance,
Mahzad
Hi Mahzad,
The report is nice and gives a good overview. I have some additional
questions, though, that I couldn't find answered in the paper.
# I/O performance in MB/s read for formats
You show a lot of data for the frames/s in I/O, but I would like to know
what that corresponds to in MB/s, to see how far from ideal conditions
the trajectories are for each file system. Here XTC will likely get
boosted numbers due to its compression. Also, what does the raw frames/s
performance look like for the different formats when no computations
are done?
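For a back-of-the-envelope conversion (my own sketch, assuming
uncompressed single-precision coordinates, which XTC of course does not
use):

# rough frames/s -> MB/s for an uncompressed single-precision format;
# all numbers here are illustrative
n_atoms = 100000
frame_bytes = n_atoms * 3 * 4            # x, y, z stored as 4-byte floats
frames_per_s = 500.0                     # example measured I/O rate
print(frames_per_s * frame_bytes / 1e6)  # -> 600.0 MB/s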
# Why isn't t_compute constant across formats?
As I understand your definition of t_compute in the text, it doesn't
include any work that does I/O. So why do I see different performance
for it on the same machine for different file formats? See fig. 5 for
example, where XTC performs twice as well on the local-remote-HDD machine.
# How many cores / nodes have been used for which figure?
I'm not sure on how many nodes/machines/cores the test was run for each
benchmark result that is shown. For example, I would expect the
efficiency graphs to start leveling off once n_blocks >= n_cores. I
couldn't find that information in the text, and it's also not clear to
me when I look at the figures.
# How would this improve if we use 10/100 frame chunks?
Could the performance of the different formats be improved if we read
trajectory data in chunks of 10-100 frames before doing any
calculations? That way the randomization of the calculation lengths
might spread the I/O load more favorably. But this might be done in
later tests.
# Implement new formats like H5MD/TNG
H5MD is an HDF5-based trajectory format that supports true parallel
access. I'm not sure about the Gromacs TNG format with respect to
parallel performance. Both might be nice to mention in the outlook
as alternative formats, since these results will be interesting for
other projects as well.
# Did you do any measurements with larger system sizes?
I would be interested to know whether the performance behaves
differently as the number of atoms grows. This would basically increase
the per-frame computation time for the RMSD alignment.
# Is the code available on GitHub?
I've seen you attached it at the end of the paper. GitHub would be more
convenient.
# Style comments
## Given precisions
The given precision isn't actually correct for XTC. The 1e-2 ångström
precision only holds for common box sizes; for large boxes the
precision can be worse due to floating-point errors. The same is true
for the other formats. Also, giving the XTC precision as an int is odd,
since the API accepts floats/doubles.
## Low quality of figures
I noticed for a number of figures that the labels and ticks are too
small. For example, in figure 40 it is almost impossible for me to read
the y-scale without magnifying the PDF.
> # Why isn't t_compute constant across formats?
>
> As I understand your definition of t_compute in the text, it doesn't
> include any work that does I/O. So why do I see different performance
> for it on the same machine for different file formats? See fig. 5 for
> example, where XTC performs twice as well on the local-remote-HDD
> machine.
>
>
> Yes, t_compute does not include work from I/O. It seems that the results
> are strongly dependent on file format and hardware. You are right that
> ideally we should not see any difference across file formats on the
> same machine. Looking at figure 5, this is the case for some machines,
> like Comet (both Lustre and SSD), Stampede, and local SSD, but not for
> others. This means the effect of hardware is strong, but why this
> happens in some cases is not entirely clear to me.
Do the results differ with only 1-2 blocks per core? Where in the code
is that time measured?
> # How many cores / nodes have been used for which figure?
>
> I'm not sure on how many nodes/machines/cores the test was run for each
> benchmark result that is shown. For example, I would expect the
> efficiency graphs to start leveling off once n_blocks >= n_cores. I
> couldn't find that information in the text, and it's also not clear to
> me when I look at the figures.
>
>
> The number of cores per node is not the same across the different
> machines; this information is given in Table 4. For Comet we were able
> to extend n_blocks to 24, whereas it was 16 for Stampede, for example.
> Also, we did not try n_blocks > n_cores; in our benchmarks,
> n_blocks <= n_cores.
There have been runs with >50 cores. I assume, then, that they were on
different nodes. Is it visible in the performance when you spread to an
additional node? I would also be very interested in runs with
n_blocks >= n_cores on a single machine, since I assume that is the
default most people will use. Being able to spread this later onto
several nodes is just a bonus, in my opinion.
> # How would this improve if we use 10/100 frame chunks?
>
> Could the performance of the different formats be improved if we read
> trajectory data in chunks of 10-100 frames before doing any
> calculations? That way the randomization of the calculation lengths
> might spread the I/O load more favorably. But this might be done in
> later tests.
>
> Yes, this is an interesting idea.
Should be fairly easy to do as well; see the sketch below. You would
only need small changes to your code, since you mostly work on the raw
coordinates anyway.
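A rough sketch of what I have in mind (a hypothetical rewrite of the
block function; the chunk size of 50 is arbitrary):

import numpy as np
import MDAnalysis as mda
from MDAnalysis.analysis.rms import rmsd

def block_rmsd_chunked(topology, trajectory, start, stop, chunk=50):
    u = mda.Universe(topology, trajectory)
    protein = u.select_atoms('protein')
    u.trajectory[0]
    ref = protein.positions.copy()
    out = np.zeros(stop - start)
    for c0 in range(start, stop, chunk):
        c1 = min(c0 + chunk, stop)
        # one I/O burst: pull `chunk` frames of raw coordinates into memory
        coords = np.array([protein.positions.copy()
                           for ts in u.trajectory[c0:c1]])
        # pure compute on the in-memory chunk, no trajectory access here
        for i, xyz in enumerate(coords):
            out[c0 - start + i] = rmsd(xyz, ref, superposition=True)
    return out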
> # Implement new formats like H5MD/TNG
>
> H5MD is an HDF5-based trajectory format that supports true parallel
> access. I'm not sure about the Gromacs TNG format with respect to
> parallel performance. Both might be nice to mention in the outlook
> as alternative formats, since these results will be interesting for
> other projects as well.
>
> I assume there isn't any reader for this file format in the current
> version of MDAnalysis. This means that if this file format supports
> parallel I/O, we first need to write a new reader that allows parallel
> access to the file. This was one reason we tested NetCDF: we were
> thinking we could take advantage of the parallel I/O in NetCDF-4. But
> the point is that this will only be applicable to the special file
> formats that support parallel I/O. What about the other types of file
> formats? If we could implement a parallel I/O library that allows
> parallel access to files of all formats, that might be very helpful.
H5MD might be the easiest way to test this: we just need to write a new
reader, and HDF5 takes care of the parallel I/O implementation.
I'm not sure about writing our own parallel I/O library. I have never
done that myself, and it sounds like it takes great care and time to get
it bug-free.
Sorry for my late reply. I was preoccupied with some other things lately.
> I moved the code to a shared GitHub repository:
> https://github.com/Becksteinlab/Parallel-analysis-in-the-MDAnalysis-Library
Thanks.
> The RMSD compute time per frame is measured inside the block-RMSD
> function.
> I do not understand your question "Do the results differ with only 1-2
> blocks per core?".
Do you see different values for t_compute when you run the benchmark
serially? I see that my "blocks per core" phrasing was confusing.
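Concretely, I mean something like this sketch (placeholder file names),
which times only the RMSD call so that t_compute contains no trajectory
access at all:

import time
import numpy as np
import MDAnalysis as mda
from MDAnalysis.analysis.rms import rmsd

u = mda.Universe('top.pdb', 'traj.xtc')   # placeholder file names
protein = u.select_atoms('protein')
u.trajectory[0]
ref = protein.positions.copy()

t_compute = []
for ts in u.trajectory:
    xyz = protein.positions.copy()        # I/O happens during iteration
    t0 = time.time()
    rmsd(xyz, ref, superposition=True)    # the pure compute part
    t_compute.append(time.time() - t0)
print(np.mean(t_compute), np.std(t_compute))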
> Yes, our results are presented in two parts: the parts that are
> calculated on a single node (the max number of cores per node is 24)
> and the parts that are calculated on multiple nodes (up to 72 cores in
> total). For the distributed case, we set up a network by defining a
> scheduler node and several worker nodes.
I'm not sure you specify the network that was used for data exchange. In
the benchmark environment section I only see the network specified for
the "SDSC Comet" cluster. It would be interesting to know this for the
other clusters on which you ran the multi-node benchmarks.
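For reference, my understanding of such a setup with dask.distributed is
roughly the following (hypothetical address; the scheduler and workers
are started with the dask-scheduler and dask-worker command-line tools):

# on the scheduler node:   dask-scheduler
# on each worker node:     dask-worker tcp://scheduler-node:8786
from distributed import Client

client = Client('tcp://scheduler-node:8786')  # placeholder address
# delayed tasks submitted from this process now run across the workers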
> Yes, you are right, but we need to decide on one approach to follow.
> Before implementing a parallel I/O library, it may be a good idea to
> test this with H5MD or NetCDF, which already support parallel I/O. This
> will be the cheapest and easiest way to test whether parallel I/O is
> helpful.
Yup, there is a Python library for H5MD files:
https://github.com/pdebuyl/pyh5md
That might be a good start. We also have a GSoC student who is writing a
proposal to add this format to MDAnalysis.
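Just to illustrate how little is needed to get at the coordinates with
plain h5py (a sketch; the particles group name 'all' is a placeholder):

import h5py

# H5MD stores time-dependent positions under
# /particles/<group>/position/value with shape (n_frames, n_atoms, 3)
with h5py.File('traj.h5md', 'r') as f:
    pos = f['particles/all/position/value']   # lazy HDF5 dataset
    n_frames, n_atoms, _ = pos.shape
    frame = pos[10]                           # reads a single frame from disk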