Fwd: [GROMACS forums] [Developers discussions] Support of PDBx/mmCIF format


Oliver Beckstein

Mar 23, 2023, 1:33:47 PM
to mdnalysis-devel
Hi devs,

Sorry for the cross-post from the GROMACS mailing list, but it seems relevant for MDAnalysis as well: we should make an effort to natively support the mmCIF format, as tracked in issue https://github.com/MDAnalysis/mdanalysis/issues/2367

There are a few related issues at https://github.com/MDAnalysis/mdanalysis/issues?q=is%3Aissue+is%3Aopen+mmcif that would also be worth looking at.

Oliver

Begin forwarded message:

From: aalhossary via GROMACS forums <notifi...@bioexcel1.discoursemail.com>
Subject: [GROMACS forums] [Developers discussions] Support of PDBx/mmCIF format
Date: March 23, 2023 at 9:42:34 AM MST

aalhossary 
March 23

The PDB (Protein Data Bank) was established in 1971. Since then, it has been growing at an increasing rate. As of December 2021, there are 202,467 structures in the PDB archive.

PDB IDs are composed of 4 alphanumeric characters in the format [1-9]([0-9A-Z]){3}, e.g. 3HHB and 4HHB are different deposited entries for Hemoglobin.
The PDB file format is composed of fixed-column-width record lines.
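
As a rough illustration of the ID format quoted above, here is a minimal Python sketch that checks a string against that pattern and counts how many IDs the pattern allows. The names is_classic_pdb_id and ID_SPACE are made up for this example and do not come from GROMACS, MDAnalysis, or wwPDB tooling.

```python
import re

# Pattern taken from the message above: one digit 1-9 followed by
# three characters drawn from 0-9 or A-Z.
PDB_ID_PATTERN = re.compile(r"[1-9][0-9A-Z]{3}")


def is_classic_pdb_id(text: str) -> bool:
    """Return True if text is a 4-character PDB ID (case-insensitive)."""
    return PDB_ID_PATTERN.fullmatch(text.upper()) is not None


# Number of distinct IDs the pattern can express: 9 choices for the
# first character times 36 choices for each of the remaining three.
ID_SPACE = 9 * 36 ** 3
```

The count follows only from the pattern itself; it says nothing about how many of those IDs are actually still unassigned.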

There are 2 problems with currently deposited structures in the PDB:

  1. The 4-character alphanumeric PDB IDs are running out.
  2. The PDB format has a lot of limitations:
  • Atom numbers are 5 digits only (systems with > 99,999 atoms cannot be supported)
  • Chain identifiers are 1 character only
  • Coordinates are in (8.3) format, i.e. 8 columns with 3 decimals (no positions below -999.999 or above 9999.999)
  • The 3-letter Chemical Component Dictionary (CCD) names are running out as well.
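
The column limits listed above can be checked with a few lines of Python. This is only a sketch of the arithmetic: the helper names (fits_fixed_width, atom_serial_fits, coordinate_fits) are hypothetical, and the widths are the ones quoted in the list (5 columns for atom numbers, 8.3 for coordinates).

```python
def fits_fixed_width(value, fmt: str, width: int) -> bool:
    """Return True if value, formatted with fmt, fits in width columns."""
    return len(format(value, fmt)) <= width


def atom_serial_fits(serial: int) -> bool:
    # Atom numbers get 5 columns, so 99,999 is the largest that fits.
    return fits_fixed_width(serial, "d", 5)


def coordinate_fits(x: float) -> bool:
    # Coordinates get 8 columns with 3 decimals; the minus sign and the
    # decimal point each use up one of those columns.
    return fits_fixed_width(x, ".3f", 8)
```

For example, atom_serial_fits(100000) and coordinate_fits(-1000.0) both return False, because the formatted text needs 6 and 9 columns respectively.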

Therefore, the wwPDB adopted this plan:

  1. Designing the extensible mmCIF format, which is based on the concept of a dictionary, to overcome the limitations of the now-legacy PDB format. More details on the format can be found here and here.
  2. Releasing the structures that violate any of the PDB limitations as mmCIF files only.
  3. After exhausting all PDB IDs, new extended 8 character IDs will be used.
  4. All new structures afterwards will be released in mmCIF format only.

Please refer to this announcement for more details. This exhaustion is expected to occur very soon in 2023.

Implications for Gromacs
Gromacs does not currently support the mmCIF format. Therefore, it will have:

  • Limited ability to read newly released PDB entries
  • Limited ability to write simulation output correctly, or to convert .XTC files into formats that other applications support, when the output violates one of the PDB format limitations. For example, gmx trjconv -drop will not be able to generate proper PDB files with more than 99,999 atoms. I believe this limitation has always been there, but it was overlooked.

Course of action

  • We need to support the mmCIF format soon. It might be wise to create a mmcif2gmx module instead of modifying the current pdb2gmx module.
  • However, we need to start by analyzing whether supporting the new format with relaxed limitations will affect other locations in Gromacs.

I have experience enabling BioJava to support extended PDB IDs using the ciftools-java library from RCSB, and I believe I can extend that to Gromacs.

In a preliminary search, I found some mmCIF support libraries for C++ (cpp-cif-file, cpp-cif-file-util, cpp-dict-pack) on their GitHub repo.

However, a change analysis should start before any implementation, and a schedule of support is important for proper delivery.




--
Oliver Beckstein, DPhil * oliver.b...@asu.edu
https://becksteinlab.physics.asu.edu/

pronouns: he/his/him

Associate Professor of Physics
Arizona State University
Center for Biological Physics and Department of Physics
Tempe, AZ 85287-1504
USA


Richard Gowers

Mar 23, 2023, 1:42:32 PM
to mdnalys...@googlegroups.com
I'm going to shamelessly plug what I wrote to convert PDBx/mmCIF to RDKit molecules: https://github.com/OpenFreeEnergy/pdbinf

This also does the "table join" of applying bonds from mmCIF templates onto the structure in the input file.  We've already got the machinery to convert from RDKit, so this could work.
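
To make the "table join" idea concrete, here is a minimal plain-Python sketch. It is not pdbinf's actual API or implementation: the function name apply_template_bonds and the tuple layouts are assumptions made up for illustration, and it only covers bonds within a residue (no inter-residue bonds such as peptide links).

```python
from collections import defaultdict


def apply_template_bonds(atoms, templates):
    """Join per-residue template bonds onto the atoms of a structure.

    atoms: list of (chain, resid, resname, atom_name) tuples; the
        position in the list is used as the atom index.
    templates: dict mapping resname to a list of
        (atom_name_a, atom_name_b, bond_order) tuples.
    Returns a list of (index_a, index_b, bond_order) tuples.
    """
    # Group atoms by residue, mapping each atom name to its index.
    residues = defaultdict(dict)
    for index, (chain, resid, resname, name) in enumerate(atoms):
        residues[(chain, resid, resname)][name] = index

    bonds = []
    for (_chain, _resid, resname), name_to_index in residues.items():
        for name_a, name_b, order in templates.get(resname, []):
            # Skip template bonds whose atoms are absent from the
            # structure (for example, unresolved hydrogens).
            if name_a in name_to_index and name_b in name_to_index:
                bonds.append(
                    (name_to_index[name_a], name_to_index[name_b], order)
                )
    return bonds
```

The join key is the residue name: each residue in the structure looks up its template, and atom names are then matched to atom indices within that residue.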


ialibay

Mar 23, 2023, 1:43:01 PM
to MDnalysis-devel
I know Richard has a few things cooking in that space. Maybe we should have an open call at some point to discuss plans and the future of PDB readers in MDAnalysis?

ialibay

Mar 23, 2023, 1:45:22 PM
to MDnalysis-devel
Looks like Richard can email faster (by 1 minute in my defense) than I can :P 

The call for an open discussion on this still stands though.
