Hi everyone,
Now that the GSoC application window is about to open, I’m hoping to discuss project ideas before I start working on my application. In addition to being interested in the suggested project on context-aware guessers, I have come up with my own idea for a project and would much appreciate feedback on feasibility and usefulness to the MDAnalysis community.
With about one year of experience in MD simulations, and only having used GROMACS in that time, I am still very much new to the field. Still, this was long enough to become annoyed with calling gmx energy over and over again - be it from within Python or from the terminal - to create .xvg files that NumPy can then parse. Because of this, I thought writing an .edr parser for MDAnalysis would be a cool idea for a project. While looking into this, I of course then quickly came across @JBarnoud’s panedr (https://github.com/jbarnoud/panedr) which already does a lot of what I was hoping to do. Still, having thought about it, I think there is a lot that I could still do along these lines. Possible directions for this project that I see are:
Now, I should state that I have no prior experience in open-source software development, so I am not sure if and how panedr could be used within MDAnalysis in terms of permissions, licensing, etc.
What I like about this project is that I could fix something that annoys me and (I’m assuming) many others frequently and that I would use every day, while also getting a better understanding of both MDAnalysis and other MD software like Amber, NAMD, or CHARMM.
Any feedback is much appreciated, thank you! I would love to hear if this could be useful for others, if scale and difficulty are appropriate for a GSoC project, if there are any other directions this project could be taken, or if there are problems that I have not considered.
As stated above, I’m also interested in the project to improve the guessers. Changing the way a Universe is created would mean I’d get to work at the very core of MDAnalysis. Improving support for coarse-grained systems would be helpful to many, and since I haven’t actually used the Martini force field myself yet this would be a great way to combine learning to use this CG model and contributing to MDAnalysis.
I’m looking forward to start working on my application(s) soon! In the meantime, I’ll continue working on my open PR to get it ready to be merged. Thanks to all who offered advice and suggestions there!
Best wishes
BjarneHi Oliver,
Sorry for the delayed response, and thanks very much for your input! As I mentioned, I am very new to open-source software development, so thinking about every bit of added code in terms of a cost-benefit trade-off that also takes indefinite need of maintenance into account is an important perspective to now have.
> So my first question is what your energy integration will do for users. What will it allow you to do that you couldn't do before or do it in a much more efficient manner?
As I see it, having energy readers included in MDAnalysis would make analysis and quality control of MD simulations much more convenient. As an example, a GROMACS user who wants to check if their system is properly equilibrated at a desired temperature needs to call GROMACS’ energy program to extract the temperature data from the .edr file, either through the terminal with
$ gmx energy –f ener.edr -o temp.xvg,
followed by going through the interactive prompt, or by calling os.system as such:
os.system(“gmx energy –f ener.edr -o temp.xvg << EOF \n 18”),
where 18 is the entry index for temperature in this particular energy file. The latter approach requires knowledge of index assignment in the .edr file beforehand, and this assignment can differ between different GROMACS versions and MD input parameters. Either approach creates a .xvg file which can then readily be plotted. However, in a typical simulation in the NPT ensemble, a user might be interested in temperature, pressure, density, box vectors, and/or multiple energy components, meaning that this approach of creating intermediate files can be quite cumbersome. Being able to instead read in the .edr file and read/plot data from it directly would be more convenient and less error-prone.
This is already possible through use of panedr, but I believe integration into MDAnalysis as part of the auxiliary module would still be beneficial. By having the energy data directly associated with the time steps, selecting frames of the trajectory based on more specific criteria would be simplified. I’m imagining, for example, the case of a protein undergoing a number of conformational changes, where the user could then more easily select frames where, say, the potential energy of the system is below a certain threshold.
> How will your contribution be sufficiently general or extensible so that users of other packages can also benefit?
I propose to write auxiliary readers for energy output files for other packages as well.
For example, both NAMD and Amber write energy information to ASCII files (.out or to stdout, which can be captured to a log file) with slightly different formatting. For plotting, the desired energy terms have to be extracted and written to intermediate files, for example by using a combination of grep and awk.
In both cases (and similarly to GROMACS), users would benefit from being able to read and plot energy-related terms from within Python, without the need to create intermediate files, and without having to call command line programs from the terminal or with os.system(). Implementation of these readers should be reasonably straightforward (has anyone ever not regretted saying that, by the way?) since these are just ASCII files. Once read, MDAnalysis would no longer need to care about the origin of the data.
> Finally, panedr has pandas as dependency. That's not a problem per se but it's probably not something we want to make a required dependency so you have to think about how to deal with the situation when someone doesn't have pandas installed.
If pandas should not be a required dependency in MDAnalysis, maybe part of the scope of this project could be to modify/re-implement panedr / a new EDR reader for MDA without pandas utilisation? For the uses I’m imagining, the data can likely be stored in other forms such as numpy arrays as well.
> Can you mock up what you would *like* to write as python commands *if* you had completed your project successfully?
I’m imagining the new energy readers as part of MDAnalysis.auxiliary. Here is a new .edr reader as an example:
aux = MDAnalysis.auxiliary.edr.EDRReader(‘ener.edr’)
These readers would need to work slightly differently from XVGReader, because more than one data point is contained in the energy files. On initialisation, the EDRReader object checks which terms are present in the file and populates an attribute with that information. I’m not sure which data type would be best, let’s say for now it is a list.
In: aux.terms
Out: [“Time”, “Bond”, “Angle”, “Potential” and so on]
Alternatively, this information could be part of the return value of the reader’s __str__() method.
The readers would then have methods for extracting data from the files. One desired outcome is attaching the data as auxiliary information to the timesteps of a trajectory. For that, MDAnalysis.coordinates.base.ProtoReader.add_auxiliary() would need to be modified. Currently, it takes the following parameters:
add_auxiliary(self, auxname, auxdata, format=None, **kwargs).
In addition to the name for the auxiliary data attribute (auxname) and the data source (file or aux reader, auxdata), a new parameter is needed for energy files to specify which data to include from auxdata (one of aux.terms). Adding auxiliary data to a trajectory would then work as such:
u = mda.Universe(foo, bar)
u.trajectory.add_auxiliary(auxname='epot', auxterm=‘Potential’, auxdata=aux)
Having loaded the auxiliary data, it can for example be used in selecting certain time steps. (In my test, creating a copy was necessary at this stage to keep time information. I’ll look into this further)
selected_steps = [ts.copy() for ts in u.trajectory if ts.aux.epot < some_threshold]
Alternatively, if the user is merely interested in extracting plottable data, the energy readers can be used for “unpacking” the selected data into numpy arrays.
epot = aux.unpack(‘Potential’)
bond_terms, angle_terms = aux.unpack([‘Bond’, ‘Angle’])
The readers will also have a method to provide basic statistical information to the user:
In: aux.statistics(‘Temperature’)
Out: Temperature: Average: value, Error estimate: value, RMSD: value, total drift: value
So this is what I’m imagining new energy readers would do and how they would work. Connecting back to the first point in this email, I would be sure to use this frequently, and I think it could be useful for the community. I am very curious to hear how the community/the core devs feel about the cost-benefit trade-off of these inclusions. If the overall feedback is positive, great, I’ll put together the actual proposal for GSoC. If it is negative, then I’m still glad to have invested the time to think this through a bit more based on Oliver’s input as I have already learned quite a bit more now, and I will instead start working on a proposal for the context-aware guessers.
Thanks!
Best wishes
Bjarne
--
You received this message because you are subscribed to the Google Groups "MDnalysis-devel" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mdnalysis-dev...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/mdnalysis-devel/c4c08b56-9ba0-4aec-8a9e-0a6f2ef8cad4n%40googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/mdnalysis-devel/10d555fc-813d-44fb-bf52-a3078eade870n%40googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/mdnalysis-devel/bc7c6e60-9712-4310-9f46-bfb0bdc37681n%40googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/mdnalysis-devel/331930bb-319c-4003-8090-fe759d1c59e1n%40googlegroups.com.