Loading XTC file into universe


William Brown

Apr 20, 2022, 4:14:12 PM
to MDnalysis discussion
Hello, I currently have a 13G trajectory in .xtc file format, and when I attempt to load it into a universe with the .gro file as the topology, it takes multiple hours. When I load .dcd files (which happen to be bigger than 13G) into the universe using the same topology file, they load very quickly.

Is there something that I can do differently to speed up the process of loading in an xtc file?

Thank you

Oliver Beckstein

Apr 20, 2022, 5:49:02 PM
to mdnalysis-discussion
Hello William,

welcome to the MDAnalysis list!

On Apr 20, 2022, at 1:06 PM, William Brown <william...@gmail.com> wrote:

Hello, I currently have a 13G trajectory in .xtc file format, and when I attempt to load it into a universe with the .gro file as the topology, it takes multiple hours.

It is normal that loading an XTC takes some time the first time because MDAnalysis is building a map of the file (called the “offsets”) for fast random-access frame seeking. The offsets are stored as a hidden file .trajname.xtc.offsets in the same directory. The offsets are needed because otherwise it would not be possible (due to the design of the XTC format) to jump to arbitrary frames in the file or implement slicing efficiently. The next time you load the XTC, the offset file is read instead. (Some nitty-gritty detail at https://docs.mdanalysis.org/stable/documentation_pages/lib/formats/libmdaxdr.html )
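For example, the location of that hidden offsets file can be derived from the trajectory name (a sketch based on the naming scheme described above; the exact suffix is an implementation detail that may vary between MDAnalysis versions):

```python
import os

def offsets_path(trajname):
    """Construct the hidden offsets filename that MDAnalysis writes next
    to an XTC trajectory: /dir/traj.xtc -> /dir/.traj.xtc.offsets
    (suffix as described above; treat it as an implementation detail)."""
    dirname, basename = os.path.split(os.path.abspath(trajname))
    return os.path.join(dirname, ".{}.offsets".format(basename))

print(offsets_path("/data/traj.xtc"))  # -> /data/.traj.xtc.offsets
```

Deleting that hidden file forces MDAnalysis to rebuild the offsets on the next load, which is occasionally useful after a trajectory has been rewritten in place.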

Did you let the loading finish? Did you try loading a second time?

“Multiple hours” seems excessive for 13 GiB unless you’re on a slow disk or network connection. For reference: a 606M XTC (in an NFS-mounted directory) takes me 7 s the first time and <100 ms subsequently. Extrapolating to 13 GiB, I would expect about 150 s to scan the trajectory for the first time (the time cost is fairly linear). Can you try, in IPython,

import MDAnalysis as mda
%time u = mda.Universe("traj.xtc")

and report the file size and wall time? Use a smaller trajectory, not the one that takes hours. Note that it can be tricky to do good benchmarks on file systems because the OS caches files, so you will probably only get reliable numbers the first time you run the test.
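A minimal way to capture both numbers in one script (a sketch using a plain timer instead of the %time magic; MDAnalysis and the trajectory path are assumptions on your side):

```python
import time

def time_call(load_fn, label):
    """Run load_fn once and report its wall time; call this twice in a
    row to compare the cold first load against the cached second load."""
    t0 = time.perf_counter()
    result = load_fn()
    elapsed = time.perf_counter() - t0
    print(f"{label}: {elapsed:.3f} s")
    return result, elapsed

# Usage (assumes MDAnalysis is installed; "traj.xtc" is a placeholder):
# import MDAnalysis as mda
# _, cold = time_call(lambda: mda.Universe("traj.xtc"), "first load")
# _, warm = time_call(lambda: mda.Universe("traj.xtc"), "second load")
```

Run it in a fresh process each time you want a cold-load number, since the offsets file (and the OS page cache) will make every subsequent load fast.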

When I load .dcd files (which happen to be bigger than 13G) into the universe using the same topology file it loads very quickly. 

DCDs do not require offsets (but they require more disk space).

Is there something that I can do differently to speed up the process of loading in an xtc file?

It mostly depends on how fast your I/O is, so doing the first load on a machine that has the disk directly attached helps.

Oliver


--
Oliver Beckstein (he/his/him)

GitHub: @orbeckst

MDAnalysis – a NumFOCUS fiscally sponsored project





Oliver Beckstein

Apr 21, 2022, 1:17:06 PM
to mdnalysis-discussion
Hi William,

I did a more thorough test with a 591G trajectory center.xtc:


In [2]: import MDAnalysis

In [3]: %time u = MDAnalysis.Universe("center.xtc")
CPU times: user 7.88 s, sys: 40.1 s, total: 48 s
Wall time: 1h 2min 28s

In [3]: %time u = MDAnalysis.Universe("center.xtc")
CPU times: user 15.6 ms, sys: 449 ms, total: 465 ms
Wall time: 464 ms

so I get about 9.5G/min when reading the first time and after that it’s < 1s.
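The throughput figure follows directly from the timings above:

```python
# Sanity-check the rate quoted above: 591 G scanned in 1 h 2 min 28 s.
size_g = 591.0
wall_min = 62 + 28 / 60          # wall time in minutes
rate = size_g / wall_min
print(f"{rate:.1f} G/min")       # about 9.5 G/min
```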

(For some odd reason I ran this example in an old Python 2.7 environment with MDAnalysis 1.1.1 but I doubt that things will look worse with Python 3.8+ and MDAnalysis 2.1.0 — I’ll check and post if there are any noticeable differences.)

Oliver

William Brown

Apr 21, 2022, 1:31:47 PM
to mdnalysis-...@googlegroups.com
Thank you for all of the information. I believe that there is an issue with the xtc file itself, because I used gmx trjcat to concatenate 3 files and then used gmx trjconv to recenter the clustered residues.

When I attempted to load one of the original trajectories (~18G) it behaved as you described, but the “altered” trajectory files still don’t load, so the problem must have been created in the concatenation or recentering.

Thank you again for your help.


Oliver Beckstein

Apr 21, 2022, 1:35:56 PM
to mdnalysis-discussion

On Apr 21, 2022, at 10:23 AM, William Brown <william...@gmail.com> wrote:

Thank you for all of the information. I believe that there is an issue with the xtc file itself, because I used gmx trjcat to concatenate 3 files and then used gmx trjconv to recenter the clustered residues.

Sounds sensible, that’s what I do ;-)


When I attempted to load one of the original trajectories (~18G) it behaved as you described, but the “altered” trajectory files still don’t load, so the problem must have been created in the concatenation or recentering.

Did you run gmx check on it? It will detect corrupted trajectories (which happens relatively frequently, especially when you write through the network). 

As long as your originals are still ok (run gmx check or load them in MDAnalysis), you can just rerun your workflow. If one of your originals is damaged and you want to rescue the data, you can use the gmx_rescue64 tool (a very old piece of code).
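If you want to check a batch of trajectories from a script, wrapping gmx check might look like this (a sketch; it assumes GROMACS’s gmx binary is on your PATH and uses only the standard `gmx check -f` invocation):

```python
import shutil
import subprocess

def check_trajectory(path):
    """Run 'gmx check -f <path>' to detect trajectory corruption.
    Returns the gmx exit code (0 means gmx read the file without
    erroring out), or None if gmx is not on PATH."""
    gmx = shutil.which("gmx")
    if gmx is None:
        return None  # GROMACS not installed; nothing we can do
    proc = subprocess.run([gmx, "check", "-f", path],
                          capture_output=True, text=True)
    return proc.returncode

print(check_trajectory("center.xtc"))
```

Note that gmx check prints its frame-by-frame report to stderr, so inspect `proc.stderr` if you want the details rather than just the exit code.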


Thank you again for your help.

You’re welcome!