Hi Mahzad,

That's quite interesting work. Thank you!

One thing I noticed in my usage (with mdreader, but I guess the conclusion extends to dask as well) is that load balancing depends a lot on per-frame analysis time: for heavy analyses that take multiple seconds per frame, trajectory data access happens less often and access times gradually become staggered across CPUs (i.e., all CPUs start out attempting to access data almost simultaneously, but because they get served and start working at different times, subsequent data requests have a smaller chance of overlapping).

Cheers,
Manel
On Feb 25, 2017 4:27 AM, "Hai Nguyen" <nha...@gmail.com> wrote:
Nice work, thanks.

Hai
On Fri, Feb 24, 2017 at 9:31 PM, Mahzad Khoshlessan <mahzadkh...@gmail.com> wrote:
Hello all,
I am a PhD student at ASU working in Oliver Beckstein's group. I have been doing parallel analysis of MD trajectories using the Dask parallel library.
We have performed benchmarks on a range of commonly used MD file formats (CHARMM/NAMD DCD, Gromacs XTC, Amber NetCDF) and different trajectory sizes on different high-performance computing (HPC) resources. Benchmarks are performed both on a single node and across multiple nodes.
Our results show a strong dependence on file format and hardware, and several bottlenecks are identified in our study.
We would like to share our results with you and would appreciate any comments and feedback on them.
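For context, the core pattern we benchmark is the usual split-apply-combine over contiguous trajectory blocks. A simplified sketch of the idea (illustrative only, not our exact benchmark code; the file names, atom selection, and dask scheduler choice are placeholders):

import numpy as np
import MDAnalysis as mda
from MDAnalysis.analysis.rms import rmsd
from dask import delayed
import dask

def block_rmsd(topology, trajectory, start, stop):
    # each task opens its own Universe, so nothing has to be pickled
    u = mda.Universe(topology, trajectory)
    protein = u.select_atoms('protein')
    u.trajectory[0]
    ref = protein.positions.copy()        # reference coordinates (frame 0)
    out = np.zeros(stop - start)
    for i, ts in enumerate(u.trajectory[start:stop]):
        out[i] = rmsd(protein.positions, ref, superposition=True)
    return out

u = mda.Universe('top.pdb', 'traj.xtc')   # placeholder file names
bounds = np.linspace(0, len(u.trajectory), 4 + 1, dtype=int)  # n_blocks = 4
blocks = [delayed(block_rmsd)('top.pdb', 'traj.xtc', b, e)
          for b, e in zip(bounds[:-1], bounds[1:])]
result = np.concatenate(dask.compute(*blocks, scheduler='processes'))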
Here are the citation and link to the draft of our report (the report is a bit long, but it contains all of our data):
Khoshlessan, Mahzad; Beckstein, Oliver (2017): Parallel analysis in the MDAnalysis Library: Benchmark of Trajectory File Formats. figshare.
https://doi.org/10.6084/m9.figshare.4695742
Thanks in advance,
Mahzad
Hi Mahzad,
The report is nice and gives a good overview. I have some additional
questions, though, that I couldn't find answered in the paper.
# I/O performance in MB/s read for formats
You show a lot of data for the frames/s in I/O, but I would like to know
what that corresponds to in MB/s, to see how far from ideal conditions
the trajectories are for each file system. Here XTC will likely get
boosted numbers due to its compression. Also, what does the raw frames/s
performance look like for the different formats when no computations
are done?
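For a back-of-the-envelope conversion (my own sketch, assuming
uncompressed single-precision coordinates, which XTC of course does not
use):

# rough frames/s -> MB/s for an uncompressed single-precision format;
# all numbers here are illustrative
n_atoms = 100000
frame_bytes = n_atoms * 3 * 4            # x, y, z stored as 4-byte floats
frames_per_s = 500.0                     # example measured I/O rate
print(frames_per_s * frame_bytes / 1e6)  # -> 600.0 MB/s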
# Why isn't t_compute constant across formats?
As I understand your definition of t_compute in the text, it doesn't
include any work that does I/O. So why do I see different performance
for it on the same machine for different file formats? See fig. 5 for
example, where XTC performs twice as well on the local-remote-HDD machine.
# How many cores / nodes have been used for which figure?
I'm not sure on how many nodes/machines/cores the test was run for each
benchmark result that is shown. For example, I would expect the
efficiency graphs to start leveling off once n_blocks >= n_cores. I
couldn't find that information in the text, and it's also not clear to
me when I look at the figures.
# How would this improve if we use 10/100 frame chunks?
Could the performance of the different formats be improved if we read
trajectory data in chunks of 10-100 frames before doing any
calculations? That way the randomization of the calculation lengths
might spread the I/O load more favorably. But this might be done in
later tests.
# Implement new formats like H5MD/TNG
H5MD is an HDF5-based trajectory format that supports true parallel
access. I'm not sure about the Gromacs TNG format with respect to
parallel performance. Both might be nice to mention in the outlook
as alternative formats, since these results will be interesting for
other projects as well.
# Did you do any measurements with larger system sizes?
I would be interested to know whether the performance behaves
differently as the number of atoms grows. This would basically increase
the per-frame computation time for the RMSD alignment.
# Is the code available on GitHub?
I've seen you attached it at the end of the paper. GitHub would be more
convenient.
# Style comments
## Given precisions
The given precision isn't actually correct for XTC. The 1e-2 ångström
precision only holds for common box sizes; for large boxes the
precision can be worse due to floating-point errors. The same is true
for the other formats. Also, giving the XTC precision as an int is odd,
since the API accepts floats/doubles.
## Low quality of figures
I noticed for a number of figures that the labels and ticks are too
small. For example, in figure 40 it is almost impossible for me to read
the y-scale without magnifying the PDF.
> # Why isn't t_compute constant across formats?
>
> As I understand your definition of t_compute in the text, it doesn't
> include any work that does I/O. So why do I see different performance
> for it on the same machine for different file formats? See fig. 5 for
> example, where XTC performs twice as well on the local-remote-HDD
> machine.
>
>
> Yes, t_compute does not include work from I/O. It seems that the results
> are strongly dependent on file format and hardware. You are right that
> ideally we should not see any difference across file formats on the
> same machine. Looking at figure 5, this is the case for some machines,
> like Comet (both Lustre and SSD), Stampede, and local SSD, but not for
> others. This means the effect of hardware is strong, but why this
> happens in some cases is not entirely clear to me.
Do the results differ with only 1-2 blocks per core? Where in the code
is that time measured?
> # How many cores / nodes have been used for which figure?
>
> I'm not sure on how many nodes/machines/cores the test was run for each
> benchmark result that is shown. For example, I would expect the
> efficiency graphs to start leveling off once n_blocks >= n_cores. I
> couldn't find that information in the text, and it's also not clear to
> me when I look at the figures.
>
>
> The number of cores per node is not the same across the different
> machines; this information is given in Table 4. For Comet we were able
> to extend n_blocks to 24, whereas it was 16 for Stampede, for example.
> Also, we did not try n_blocks > n_cores; in our benchmarks,
> n_blocks <= n_cores.
There have been runs with >50 cores. I assume, then, that they were on
different nodes. Is it visible in the performance when you spread to an
additional node? I would also be very interested in runs with
n_blocks >= n_cores on a single machine, since I assume that is the
default most people will use. Being able to spread this later onto
several nodes is just a bonus, in my opinion.
> # How would this improve if we use 10/100 frame chunks?
>
> Could the performance of the different formats be improved if we read
> trajectory data in chunks of 10-100 frames before doing any
> calculations? That way the randomization of the calculation lengths
> might spread the I/O load more favorably. But this might be done in
> later tests.
>
> Yes, this is an interesting idea.
Should be fairly easy to do as well; see the sketch below. You would
only need small changes to your code, since you mostly work on the raw
coordinates anyway.
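A rough sketch of what I have in mind (a hypothetical rewrite of the
block function; the chunk size of 50 is arbitrary):

import numpy as np
import MDAnalysis as mda
from MDAnalysis.analysis.rms import rmsd

def block_rmsd_chunked(topology, trajectory, start, stop, chunk=50):
    u = mda.Universe(topology, trajectory)
    protein = u.select_atoms('protein')
    u.trajectory[0]
    ref = protein.positions.copy()
    out = np.zeros(stop - start)
    for c0 in range(start, stop, chunk):
        c1 = min(c0 + chunk, stop)
        # one I/O burst: pull `chunk` frames of raw coordinates into memory
        coords = np.array([protein.positions.copy()
                           for ts in u.trajectory[c0:c1]])
        # pure compute on the in-memory chunk, no trajectory access here
        for i, xyz in enumerate(coords):
            out[c0 - start + i] = rmsd(xyz, ref, superposition=True)
    return out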
> # Implement new formats like H5MD/TNG
>
> H5MD is an HDF5-based trajectory format that supports true parallel
> access. I'm not sure about the Gromacs TNG format with respect to
> parallel performance. Both might be nice to mention in the outlook
> as alternative formats, since these results will be interesting for
> other projects as well.
>
> I assume there isn't any reader for this file format in the current
> version of MDAnalysis. This means that if this file format supports
> parallel I/O, we first need to write a new reader that allows parallel
> access to the file. This was one reason we tested NetCDF: we were
> thinking we could take advantage of the parallel I/O in NetCDF-4. But
> the point is that this will only be applicable to the special file
> formats that support parallel I/O. What about the other types of file
> formats? If we could implement a parallel I/O library that allows
> parallel access to files of all formats, that might be very helpful.
H5MD might be the easiest way to test this: we just need to write a new
reader, and HDF5 takes care of the parallel I/O implementation.
I'm not sure about writing our own parallel I/O library. I have never
done that myself, and it sounds like it takes great care and time to get
it bug-free.
Sorry for my late reply. I was preoccupied with some other things lately.
> I moved the code to a shared GitHub repository:
> https://github.com/Becksteinlab/Parallel-analysis-in-the-MDAnalysis-Library
Thanks.
> The RMSD compute time per frame is measured inside the block-RMSD
> function.
> I do not understand your question "Do the results differ with only 1-2
> blocks per core?".
Do you see different values for t_compute when you run the benchmark
serially? I see that my "blocks per core" phrasing was confusing.
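Concretely, I mean something like this sketch (placeholder file names),
which times only the RMSD call so that t_compute contains no trajectory
access at all:

import time
import numpy as np
import MDAnalysis as mda
from MDAnalysis.analysis.rms import rmsd

u = mda.Universe('top.pdb', 'traj.xtc')   # placeholder file names
protein = u.select_atoms('protein')
u.trajectory[0]
ref = protein.positions.copy()

t_compute = []
for ts in u.trajectory:
    xyz = protein.positions.copy()        # I/O happens during iteration
    t0 = time.time()
    rmsd(xyz, ref, superposition=True)    # the pure compute part
    t_compute.append(time.time() - t0)
print(np.mean(t_compute), np.std(t_compute))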
> Yes, our results are presented in two parts: the parts that are
> calculated on a single node (the max number of cores per node is 24)
> and the parts that are calculated on multiple nodes (up to 72 cores in
> total). For the distributed case, we set up a network by defining a
> scheduler node and several worker nodes.
I'm not sure you specify the network that was used for data exchange. In
the benchmark environment section I only see the network specified for
the "SDSC Comet" cluster. It would be interesting to know this for the
other clusters on which you ran the multi-node benchmarks.
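For reference, my understanding of such a setup with dask.distributed is
roughly the following (hypothetical address; the scheduler and workers
are started with the dask-scheduler and dask-worker command-line tools):

# on the scheduler node:   dask-scheduler
# on each worker node:     dask-worker tcp://scheduler-node:8786
from distributed import Client

client = Client('tcp://scheduler-node:8786')  # placeholder address
# delayed tasks submitted from this process now run across the workers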
> Yes, you are right, but we need to decide on one approach to follow.
> Before implementing a parallel I/O library, it may be a good idea to
> test this with H5MD or NetCDF, which already support parallel I/O. This
> will be the cheapest and easiest way to test whether parallel I/O is
> helpful.
Yup, there is a Python library for H5MD files:
https://github.com/pdebuyl/pyh5md
That might be a good start. We also have a GSoC student who is writing a
proposal to add this format to MDAnalysis.
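Just to illustrate how little is needed to get at the coordinates with
plain h5py (a sketch; the particles group name 'all' is a placeholder):

import h5py

# H5MD stores time-dependent positions under
# /particles/<group>/position/value with shape (n_frames, n_atoms, 3)
with h5py.File('traj.h5md', 'r') as f:
    pos = f['particles/all/position/value']   # lazy HDF5 dataset
    n_frames, n_atoms, _ = pos.shape
    frame = pos[10]                           # reads a single frame from disk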