reading dcd files

Steve Schmerler

unread,

May 20, 2015, 4:34:18 AM5/20/15

to cp2k-users

Hello

I'm trying read dcd files written by cp2k. I have a reader which works
for dcd files written by lammps, but now I'm struggling with the cp2k
dcd header. In particular, I don't seem to be able to read the number of
steps from the header. The file src/motion_utils.F in version 2.6.1 has

WRITE (UNIT=traj_unit) id_dcd,0,it,iskip,0,0,0,0,0,0,REAL(dtime,KIND=sp),&
1,0,0,0,0,0,0,0,0,24
[....]
WRITE (UNIT=traj_unit) 2,remark1,remark2
WRITE (UNIT=traj_unit) nat

I assume from the surrounding code that `it` is the number of steps.
So, with something like

integer :: it, natoms, dummyi, ntitle, iskip, i, unt
real :: timestep
character(len=4) :: id_dcd
character(len=80) :: remark1, remark2
open(unt, ...)
read(unt) id_dcd, dummyi, it, iskip, (dummyi, i=1,6), &
timestep, (dummyi, i=1,10)
read(unt) ntitle, remark1, remark2
read(unt) natoms

the number of steps `it` is always 0, while `natoms` etc are read
correctly.

Also, one of the dcd format specifications found in the documentation of
VMD's dcdplugin [1] suggests that the step variable is the first integer
written after `id_dcd`, whereas cp2k writes a zero first, and then the
steps. Is this another flavor of the dcd format? If so, is there a
specification? My Fortran is somewhat rusty so there may be something
obvious which I have overlooked.

Another related question. Lammps writes the cell information as double
precision, but the coordinates as real. How does cp2k do that? If you
have pointers to the relevant documentation, that would help me a lot.
Thank you very much.

best,
Steve

[1] http://www.ks.uiuc.edu/Research/vmd/plugins/molfile/dcdplugin.html

--
Steve Schmerler
Institut für Theoretische Physik
TU Freiberg, Germany

Matthias Krack

unread,

May 20, 2015, 5:37:43 AM5/20/15

to cp...@googlegroups.com

Hi Steve

I have just checked in a tool called dumpdcd.f90 which you can find now in the cp2k/tools folder of the current CP2K trunk version. It should allow the dump of DCD files generated by CP2K. It provides also some other useful functionality. After compiling it using e.g.

gfortran -o dumdcd.exe dumpdcd.f90

dumpdcd.exe -h

prints a help menu. The flags -i and -d provide a more detail dump. For atomic labels you have to supply an XYZ template file.

Matthias

Steve Schmerler

unread,

May 20, 2015, 7:36:22 AM5/20/15

to cp...@googlegroups.com

On May 20 02:37 -0700, Matthias Krack wrote:
> Hi Steve
>
> I have just checked in a tool called dumpdcd.f90 which you can find now in
> the cp2k/tools folder of the current CP2K trunk version. It should allow
> the dump of DCD files generated by CP2K. It provides also some other useful
> functionality. After compiling it using e.g.
>
> gfortran -o dumdcd.exe dumpdcd.f90
>
> dumpdcd.exe -h
>
> prints a help menu. The flags -i and -d provide a more detail dump. For
> atomic labels you have to supply an XYZ template file.

OK thanks, that was helpful. So you just read until the end of the file
and cp2k doesn't store the number of steps (frames) in the header. And
cp2k seems to store the angles directly [a,gamma,b,beta,alpha,c], while
lammps stores their cosines [a,cos(gamma),b,cos(beta),cos(alpha),c] and
both do this in the same somewhat unusual ordering. Well, this dcd
business is really something. Is there an official specification of the
format(s)? Anyway, thank you very much for this tool and the fast
response.

best,
Steve

Ole Schütt

unread,

May 20, 2015, 8:40:35 AM5/20/15

to cp...@googlegroups.com

Hi Steve,

if I recall correctly you are not the first to struggle with this issue. Maybe you could write a short wiki-page, once you have it all figured out?

-Ole

Matthias Krack

unread,

May 20, 2015, 11:13:05 AM5/20/15

to cp...@googlegroups.com

Hi Steve

the DCD file header is only written once at the beginning of the simulation run. Thus no way to guess how many frames will be dumped during the run eventually. The time for a frame could be retrieved from the print interval and the initial time step which are both dumped, but this works only for a constant time step interval (no variable time step). I am afraid there is no official "DCD standard", but.the CP2K implementation should be quite close to the CHARMM/NAMD definition. It is most important that common tools like VMD are able to digest the DCD files dumped by CP2K even if one could certainly devise a better binary format.

Note, that dumpdcd.f90 has also some nice extra feature like

definition of first and last frame dumped
stride (i.e. skipping frames regularily)
applying PBC (at box centre or at 0,0,0), e.g. useful for movies
tracing of atoms which leave the original cell

Matthias

Steve Schmerler

unread,

May 21, 2015, 10:42:18 AM5/21/15

to cp...@googlegroups.com

On May 20 05:40 -0700, Ole Schütt wrote:
> Hi Steve,
>
> if I recall correctly you are not the first to struggle with this issue.
> Maybe you could write a short wiki-page, once you have it all figured out?

Well what I wrote is what I found out. I'm not sure if that qualifies as
sufficient wiki material. Do you mean cp2k.org or somewhere else? If I
get around to implement something which works for lammps and cp2k
without hassle, then yes I could certainly do that.

Steve Schmerler

unread,

May 21, 2015, 10:56:54 AM5/21/15

to cp...@googlegroups.com

On May 20 08:13 -0700, Matthias Krack wrote:
> Hi Steve
>
> the DCD file header is only written once at the beginning of the simulation
> run. Thus no way to guess how many frames will be dumped during the run
> eventually.

Yes that makes sense, I don't know how the lammps people do that. I
didn't look at their code -- I assume they just write `nstep` from the
input file into the header and assume that the run always finishes as
advertised.

[...]

> It is most important that common tools like VMD are able to digest the
> DCD files dumped by CP2K even if one could certainly devise a better
> binary format.

Of course, compatibility counts.

> Note, that dumpdcd.f90 has also some nice extra feature like
>

> - definition of first and last frame dumped
> - stride (i.e. skipping frames regularily)
> - applying PBC (at box centre or at 0,0,0), e.g. useful for movies
> - tracing of atoms which leave the original cell

This is a nice tool indeed. I'm planning to read the coordinates into
numpy arrays for post-processing and storage and wanted to avoid
unfolding the data to a text-based format if possible. At the moment we
are still working with xyz output, which is OK for medium-sizes systems
and time scales but may become problematic in the future. So, thanks for
the tool. I think I know how to proceed here, based on how dumpdcd does
things.

Ole Schütt

unread,

May 22, 2015, 4:54:35 AM5/22/15

to cp...@googlegroups.com

Hi Steve,

> Do you mean cp2k.org or somewhere else?

Yes, exactly. Maybe as http://cp2k.org/tools:dumpdcd or http://cp2k.org/howto:read_dcd_files .

For what solution did you settle now? Are you using a combination of seek() and numpy.fromfile() to parse the DCD file directly, or are you calling dumpdcd in a pipe?

In any case, I think this is a pretty common post-processing step that is worth documenting for the community.

Cheers,
Ole

Steve Schmerler

unread,

Jun 1, 2015, 2:35:14 PM6/1/15

to cp...@googlegroups.com

On May 22 01:54 -0700, Ole Schütt wrote:
> Yes, exactly. Maybe as http://cp2k.org/tools:dumpdcd or
> http://cp2k.org/howto:read_dcd_files .
>
> For what solution did you settle now? Are you using a combination of seek()
> and numpy.fromfile() to parse the DCD file directly, or are you calling
> dumpdcd in a pipe?

The numpy.fromfile() idea didn't actually occur to me even though it is pretty
obvious, thanks. Based on this I wrote a simple numpy-only version. It uses a
rather simple approach, is not super-fast and can certainly be optimized. It
took me a while to read the header correctly since it has "marker" integers
(84, 164, ...) around each block of data, which are the size (in bytes) of the
block they enclose. In Fortran, one can apparently just read the data without
reading the markers explicitly. Again, VMDs dcdplugin.c saved the day.

The dcd reader is part of my "pwtools" project [1], but the dcd part is (almost)
independent from the rest - you only need the file "dcp.py" [2]. The Python
version is called "read_dcd_data_py":

$ python
>>> import dcd
>>> cc,co = dcd.read_dcd_data_py('cp2k.dcd')

where "cc" is a (nstep,6) array (float64) of unit cell parameters
(a,b,c,alpha,beta,gamma) -- make sure you use dcd_aligned_cell in cp2k. "co" is
a (nstep,natoms,3) array (float32) of cartesian coordinates in Angstrom. Have a
look at the doc string for how to read lammps-style files (keyword
convang=True).

I also adapted an earlier Fortran + f2py version to cp2k files. It is
much faster than the simple Python version, even though it walks the
file twice (once to find nstep, rather brute-force, might change later).
To use this stand-alone, copy dcd.f90 somewhere and run f2py (part of
numpy) to compile the extension module into "_dcd.so", which you import
with "import _dcd":

$ f2py -c dcd.f90 -m _dcd
$ python
>>> import _dcd
>>> nstep, natoms, timestep = _dcd.get_dcd_file_info('cp2k.dcd', nstephdr=False)
>>> cc,co = _dcd.read_dcd_data('cp2k.dcd', nstep, natoms, convang=False)

There is also a wrapper function in dcd.py called "read_dcd_data_f", which does
exactly this. Again, please have a look at the doc string for more examples.

I don't use this regularly on large files at the moment, so there may be issues
that I'm not aware of. If you find bugs, have better / faster versions etc,
please let me know.

best,
Steve

[1] https://bitbucket.org/elcorto/pwtools
Get it:
hg clone https://bitbucket.org/elcorto/pwtools
or
wget https://bitbucket.org/elcorto/pwtools/get/e591501fa367.zip

[2] https://bitbucket.org/elcorto/pwtools/src/e591501fa3671cab7f7583ad747541cb14512d55/dcd.py?at=default

Ole Schütt

unread,

Jun 2, 2015, 8:51:25 AM6/2/15

to cp...@googlegroups.com

Hi Steve,

I did not realize that you had such an elaborate library project going on, pretty cool :-)
I added your project our collections of tools: http://cp2k.org/tools:pwtools
Feel free to elaborate on the page.

> The numpy.fromfile() idea didn't actually occur to me even though it is pretty obvious, thanks.

You're welcome. I came up with that trick when I wrote a parser for binary gromacs trajectories, a couple years ago.

> I also adapted an earlier Fortran + f2py version to cp2k files. It is
> much faster than the simple Python version, even though it walks the
> file twice (once to find nstep, rather brute-force, might change later).

Hmm, looks like your are still doing quite a bit of processing in python. In principle one should be able to get pretty close to the Fortran performance by just using numpy routines. But I guess you are aware of that ;-)

-Ole

Steve Schmerler

unread,

Jun 2, 2015, 4:23:22 PM6/2/15

to cp...@googlegroups.com

On Jun 02 05:51 -0700, Ole Schütt wrote:
> I did not realize that you had such an elaborate library project going on,
> pretty cool :-)
> I added your project our collections of tools: http://cp2k.org/tools:pwtools
> Feel free to elaborate on the page.

Thanks! Yes, I'll add some more content.

> Hmm, looks like your are still doing quite a bit of processing in python.
> In principle one should be able to get pretty close to the Fortran
> performance by just using numpy routines. But I guess you are aware of that
> ;-)

Yes as I said, this is a pretty naive straight-forward implementation,
which I used primarily to understand the dcd format. For a fast version
(about 10x) on should use the Fortran variant. Optimizing this is kind
of fun and probably rather trivial (hint: use line_profiler [1]), so if
someone with a PhD in numpyology beats me to the punch, you're welcome!

best,
Steve

[1] https://github.com/rkern/line_profiler

Steve Schmerler

unread,

Jun 5, 2015, 3:41:26 AM6/5/15

to cp...@googlegroups.com

On Jun 02 22:23 +0200, Steve Schmerler wrote:
> Yes as I said, this is a pretty naive straight-forward implementation,
> which I used primarily to understand the dcd format. For a fast version
> (about 10x) on should use the Fortran variant. Optimizing this is kind
> of fun and probably rather trivial (hint: use line_profiler [1]), so if
> someone with a PhD in numpyology beats me to the punch, you're welcome!

Well, using only numpy.fromfile and file.seek really makes a difference,
so Ole was right! I have a pure numpy implementation now [1] which is
even faster then my f2py'ed Fortran version, named
pwtools.dcd.read_dcd_data(). Make sure to checkout the latest code
version.

best,
Steve

[1] http://elcorto.bitbucket.org/pwtools/generated/api/dcd.html

Reply all

Reply to author

Forward