Multiple models in PDB?


rando...@protonmail.com

Apr 3, 2018, 7:03:18 AM
to BioPandas-Users
Hi Sebastian,
thanks for developing BioPandas.
It is simply an amazing tool for people like me (and you, as far as I can tell from the GitHub page) who apply machine learning and data science techniques to molecular biology.

Many PDB files (e.g. 1BQ0) are provided with more than one model for the same protein. Models start with
MODEL
and end with
ENDMDL

As far as I can tell, BioPandas ignores the MODEL/ENDMDL tags in its ATOM DataFrame, which produces one huge DataFrame with several models concatenated.
Any chance for an enhancement on that?
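
To illustrate the concatenation issue, here is a minimal sketch that reads ATOM records while ignoring MODEL/ENDMDL, the way a naive parser would. The two-model text and the helper name atom_serial_counts are made up for illustration and are not part of BioPandas; the serial number is read from columns 7-11 of each ATOM record.

```python
from collections import Counter

# Hypothetical two-model file; coordinates are invented for illustration.
example = """\
MODEL        1
ATOM      1  N   MET A   1      10.000  10.000  10.000  1.00  0.00           N
ENDMDL
MODEL        2
ATOM      1  N   MET A   1      11.000  10.500   9.500  1.00  0.00           N
ENDMDL
END
"""


def atom_serial_counts(pdb_text):
    """Count ATOM serial numbers, ignoring MODEL/ENDMDL boundaries."""
    serials = [
        int(line[6:11])  # atom serial number, columns 7-11
        for line in pdb_text.splitlines()
        if line.startswith("ATOM")
    ]
    return Counter(serials)


counts = atom_serial_counts(example)
print(counts[1])  # prints 2: the same atom serial shows up once per model
```

With the models concatenated, every atom serial appears once per model, so anything keyed on atom number or residue number silently mixes conformers.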

Best Regards!

Sebastian Raschka

Apr 3, 2018, 11:32:40 PM
to rando...@protonmail.com, BioPandas-Users
Thanks for the note!

I have actually never worked with a PDB file that contained multiple structures and didn't know that this was possible in the official PDB spec.

In any case, it should definitely be handled one way or another. Currently, I don't have a clear idea of how best to handle it and would welcome any thoughts and feedback (let me cross-post this on the GitHub issue tracker -- it may be better to continue the discussion about potential implementations there).

I think one problem with the DataFrame format is that having all models in one DataFrame would probably produce a lot of weird -- or unexpected -- results, so it would probably be best to separate the structures one way or another ...

1) One option would be to provide a utility function (analogous to the split_multimol2 function, http://rasbt.github.io/biopandas/tutorials/Working_with_MOL2_Structures_in_DataFrames/#parsing-multi-mol2-files) that generates multiple PandasPdb objects from such a file. I.e., it would simply be a list

pdbs = [pdb_1, pdb_2, ..., pdb_n]

which would preserve the current functionality of the library without any backwards-incompatible changes. It would also make it easier to use the multiprocessing library to analyze multiple PandasPdb objects in parallel.
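
A minimal sketch of what the splitting step in option 1 could look like, assuming plain-text input. The function name split_pdb_models and the toy records are hypothetical and not part of BioPandas; the sketch only returns one text block per model, and each block would still have to be handed to whatever builds a PandasPdb object.

```python
# Hypothetical two-model file; coordinates are invented for illustration.
example = """\
HEADER    HYPOTHETICAL EXAMPLE
MODEL        1
ATOM      1  N   MET A   1      10.000  10.000  10.000  1.00  0.00           N
ENDMDL
MODEL        2
ATOM      1  N   MET A   1      11.000  10.500   9.500  1.00  0.00           N
ENDMDL
END
"""


def split_pdb_models(pdb_text):
    """Return a list with one string of records per MODEL ... ENDMDL block."""
    models = []
    current = None  # None means we are outside any MODEL block
    for line in pdb_text.splitlines():
        record = line[:6].strip()  # record name, columns 1-6
        if record == "MODEL":
            current = []  # start collecting a new model
        elif record == "ENDMDL":
            if current is not None:
                models.append("\n".join(current))
            current = None
        elif current is not None:
            current.append(line)
    return models


models = split_pdb_models(example)
print(len(models))  # prints 2
```

Records outside the MODEL blocks (such as HEADER) are dropped in this sketch; a real implementation would probably want to copy them into every resulting object.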

2) Right now, the PandasPdb objects have a dictionary containing multiple DataFrames
dict_keys(['ATOM', 'HETATM', 'ANISOU', 'OTHERS'])

For multi-PDB files, the dictionary could be expanded to

dict_keys(['ATOM_1', 'HETATM_1', 'ANISOU_1', 'OTHERS_1', 'ATOM_2', 'HETATM_2', 'ANISOU_2', 'OTHERS_2', ...])
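
For comparison, a rough sketch of how the suffixed keys in option 2 could be built. The helper group_records_by_model and the toy records are hypothetical; the sketch stores raw lines instead of DataFrames and numbers the models with a running counter rather than the serial number given in the MODEL record.

```python
# Hypothetical two-model file; coordinates are invented for illustration.
example = """\
MODEL        1
ATOM      1  N   MET A   1      10.000  10.000  10.000  1.00  0.00           N
TER       2      MET A   1
ENDMDL
MODEL        2
ATOM      1  N   MET A   1      11.000  10.500   9.500  1.00  0.00           N
ENDMDL
END
"""


def group_records_by_model(pdb_text):
    """Map keys such as 'ATOM_1' or 'OTHERS_2' to the raw lines of that model."""
    groups = {}
    model_idx = 0
    in_model = False
    for line in pdb_text.splitlines():
        record = line[:6].strip()  # record name, columns 1-6
        if record == "MODEL":
            model_idx += 1
            in_model = True
        elif record == "ENDMDL":
            in_model = False
        elif in_model:
            # Anything that is not a coordinate-type record goes to OTHERS.
            kind = record if record in ("ATOM", "HETATM", "ANISOU") else "OTHERS"
            groups.setdefault(f"{kind}_{model_idx}", []).append(line)
    return groups


groups = group_records_by_model(example)
print(sorted(groups))  # prints ['ATOM_1', 'ATOM_2', 'OTHERS_1']
```

Note that code consuming this dictionary has to parse the model number back out of each key, which is one reason a plain list of objects may be easier to work with.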

I strongly favor scenario 1), though I would love to hear feedback on this and am open to other suggestions!

In any case, an error (or at least a warning) should also be raised if MODEL and ENDMDL tags are found in a PDB file when the current read_pdb method is used, so that this doesn't lead to unexpected behavior.
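
Such a check could be sketched roughly as follows, using the standard warnings module. The function name warn_if_multimodel and the message text are assumptions for illustration; this is not how read_pdb currently behaves.

```python
import warnings


def warn_if_multimodel(pdb_text):
    """Emit a UserWarning and return True if MODEL or ENDMDL records are present."""
    for line in pdb_text.splitlines():
        if line[:6].strip() in ("MODEL", "ENDMDL"):  # record name, columns 1-6
            warnings.warn(
                "MODEL/ENDMDL records found; the file appears to hold multiple models.",
                UserWarning,
                stacklevel=2,
            )
            return True
    return False
```

A reader function could call this once on the raw file contents before parsing, and a stricter variant could raise an exception instead of warning.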

Best,
Sebastian
> --
> You received this message because you are subscribed to the Google Groups "BioPandas-Users" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to biopandas-use...@googlegroups.com.
> To post to this group, send email to biopand...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/biopandas-users/742073b9-67de-46de-8e17-17c133141e27%40googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

Orly Avraham

Dec 26, 2019, 9:58:29 AM
to BioPandas-Users
Hi,

I'm relatively new to Python and pandas; I am an experimental structural biologist by training.
I just wanted to point out that multiple structures in one file often (usually?) come from an NMR structure determination (as is the case in the given example as well).
Thank you for this great tool! I started using pandas for PDB files myself and stumbled upon this :)

Best,
Orly

