GSoC Project Idea


BFedder

Apr 3, 2022, 4:40:41 PM
to MDnalysis-devel

Hi everyone,

Now that the GSoC application window is about to open, I’m hoping to discuss project ideas before I start working on my application. In addition to being interested in the suggested project on context-aware guessers, I have come up with my own idea for a project and would much appreciate feedback on feasibility and usefulness to the MDAnalysis community.

With about one year of experience in MD simulations, and only having used GROMACS in that time, I am still very much new to the field. Still, this was long enough to become annoyed with calling gmx energy over and over again, be it from within Python or from the terminal, to create .xvg files that NumPy can then parse. Because of this, I thought writing an .edr parser for MDAnalysis would be a cool idea for a project. While looking into this, I quickly came across @JBarnoud’s panedr (https://github.com/jbarnoud/panedr), which already does a lot of what I was hoping to do. Still, having thought about it, I think there is a lot that I could do along these lines. The possible directions I see for this project are:

  •  Include panedr in MDAnalysis
  •  Add parsers for “energy” output files of other MD software
  •  Integrate these into @fiona-naughton’s auxiliary data structures

Now, I should state that I have no prior experience in open-source software development, so I am not sure if and how panedr could be used within MDAnalysis in terms of permissions, licensing, etc.

What I like about this project is that I could fix something that annoys me and (I’m assuming) many others frequently and that I would use every day, while also getting a better understanding of both MDAnalysis and other MD software like Amber, NAMD, or CHARMM.

Any feedback is much appreciated, thank you! I would love to hear if this could be useful for others, if scale and difficulty are appropriate for a GSoC project, if there are any other directions this project could be taken, or if there are problems that I have not considered.

As stated above, I’m also interested in the project to improve the guessers. Changing the way a Universe is created would mean I’d get to work at the very core of MDAnalysis. Improving support for coarse-grained systems would be helpful to many, and since I haven’t actually used the Martini force field myself yet this would be a great way to combine learning to use this CG model and contributing to MDAnalysis.

I’m looking forward to starting work on my application(s) soon! In the meantime, I’ll continue working on my open PR to get it ready to be merged. Thanks to all who offered advice and suggestions there!

 

Best wishes

Bjarne

orbeckst

Apr 3, 2022, 9:25:56 PM
to MDnalysis-devel
Hi Bjarne,

some comments on the energy reader idea:

@jbarnoud's panedr is LGPL so license-wise it can be used.

I think it's a good idea to think about projects that solve a problem that you have. One question is if this is a problem that is general enough so that it warrants inclusion in MDAnalysis. We always think about added code as code that needs to be maintained indefinitely, i.e., there's a cost associated with new code. It needs to be worth it, especially as developers maintain code that they didn't write themselves.

So my first question is what your energy integration will do for users. What will it allow you to do that you couldn't do before or do it in a much more efficient manner? Can you mock up what you would *like* to write as python commands *if* you had completed your project successfully?

MDAnalysis is a package that aims to be MD engine agnostic and support a wide range of codes, not just GROMACS. How will your contribution be sufficiently general or extensible so that users of other packages can also benefit? 

Finally, panedr has pandas as dependency. That's not a problem per se but it's probably not something we want to make a required dependency so you have to think about how to deal with the situation when someone doesn't have pandas installed. There are examples in the code base for similar situations (such as the hdf5 library for H5MD), which are worth looking at.
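One common way to handle an optional dependency is to attempt the import once and fail with a clear message only when the optional feature is actually used. The sketch below illustrates that pattern with the standard library only; `optional_import`, `HAS_PANDAS`, and `to_dataframe` are hypothetical names chosen for illustration and are not taken from the MDAnalysis code base.

```python
import importlib


def optional_import(module_name):
    """Return the imported module, or None if it is not installed."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None


# Try the import once at load time; pandas may or may not be present.
pd = optional_import("pandas")
HAS_PANDAS = pd is not None


def to_dataframe(columns):
    """Convert a dict of equal-length sequences to a DataFrame.

    Hypothetical helper: only this function needs pandas, so the rest
    of the module keeps working when pandas is missing.
    """
    if not HAS_PANDAS:
        raise ImportError(
            "pandas is required for to_dataframe(); install pandas "
            "or work with the plain dict of columns instead."
        )
    return pd.DataFrame(columns)
```

With this layout, importing the module never fails because of pandas; only the call that needs it raises an error.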

The aux system was born from the idea of a general framework to associate external data with trajectories. It seems somewhat under-utilized so personally I'd be happy to see some applications.

Oliver

BFedder

Apr 6, 2022, 7:11:15 PM
to MDnalysis-devel

Hi Oliver,  

 

Sorry for the delayed response, and thanks very much for your input! As I mentioned, I am very new to open-source software development, so thinking about every bit of added code in terms of a cost-benefit trade-off that also takes the need for indefinite maintenance into account is an important perspective to now have.

> So my first question is what your energy integration will do for users. What will it allow you to do that you couldn't do before or do it in a much more efficient manner? 

As I see it, having energy readers included in MDAnalysis would make analysis and quality control of MD simulations much more convenient. As an example, a GROMACS user who wants to check if their system is properly equilibrated at a desired temperature needs to call GROMACS’ energy program to extract the temperature data from the .edr file, either through the terminal with 

$ gmx energy -f ener.edr -o temp.xvg,

followed by going through the interactive prompt, or by calling os.system as such: 

 os.system("gmx energy -f ener.edr -o temp.xvg << EOF\n18\nEOF"),

where 18 is the entry index for temperature in this particular energy file. The latter approach requires knowledge of index assignment in the .edr file beforehand, and this assignment can differ between different GROMACS versions and MD input parameters. Either approach creates a .xvg file which can then readily be plotted. However, in a typical simulation in the NPT ensemble, a user might be interested in temperature, pressure, density, box vectors, and/or multiple energy components, meaning that this approach of creating intermediate files can be quite cumbersome. Being able to instead read in the .edr file and read/plot data from it directly would be more convenient and less error-prone.  
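For comparison, reading the intermediate .xvg file back into Python can be sketched as follows. This is a minimal standard-library sketch, assuming the usual .xvg layout in which `#` lines are comments, `@` lines are plot directives (series names appear as `@ sN legend "..."`), and the first data column is time; `parse_xvg` is a hypothetical helper name, and the sample text is made up.

```python
import re

# Matches directives such as: @ s0 legend "Temperature"
LEGEND_PATTERN = re.compile(r'@\s*s(\d+)\s+legend\s+"(.*)"')


def parse_xvg(text):
    """Parse .xvg text into a dict mapping column name to list of floats."""
    legends = {}
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith("@"):
            match = LEGEND_PATTERN.match(line)
            if match:
                legends[int(match.group(1))] = match.group(2)
            continue  # other directives only affect plotting
        rows.append([float(value) for value in line.split()])
    if not rows:
        return {}
    columns = list(zip(*rows))  # transpose rows into columns
    data = {"Time": list(columns[0])}  # assumption: first column is time
    for index, column in enumerate(columns[1:]):
        name = legends.get(index, f"column_{index + 1}")
        data[name] = list(column)
    return data


sample = """\
# This file was created by gmx energy
@    title "GROMACS Energies"
@    xaxis  label "Time (ps)"
@ s0 legend "Temperature"
@ s1 legend "Pressure"
    0.000000  299.5  1.2
    1.000000  300.5  0.8
"""
energies = parse_xvg(sample)
```

If only the numbers are needed, `numpy.loadtxt` with `comments=["#", "@"]` reads the same data into an array, at the cost of losing the series names.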

This is already possible through use of panedr, but I believe integration into MDAnalysis as part of the auxiliary module would still be beneficial. By having the energy data directly associated with the time steps, selecting frames of the trajectory based on more specific criteria would be simplified. I’m imagining, for example, the case of a protein undergoing a number of conformational changes, where the user could then more easily select frames where, say, the potential energy of the system is below a certain threshold.  

> How will your contribution be sufficiently general or extensible so that users of other packages can also benefit? 

I propose to write auxiliary readers for energy output files for other packages as well.  

For example, both NAMD and Amber write energy information to ASCII files (.out or to stdout, which can be captured to a log file) with slightly different formatting. For plotting, the desired energy terms have to be extracted and written to intermediate files, for example by using a combination of grep and awk.  

In both cases (and similarly to GROMACS), users would benefit from being able to read and plot energy-related terms from within Python, without the need to create intermediate files, and without having to call command line programs from the terminal or with os.system(). Implementation of these readers should be reasonably straightforward (has anyone ever not regretted saying that, by the way?) since these are just ASCII files. Once read, MDAnalysis would no longer need to care about the origin of the data.  
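As an illustration of how such an ASCII reader might look, here is a minimal sketch for a NAMD log. It assumes the log layout in which an `ETITLE:` line names the columns and each `ENERGY:` line holds one row of values; `parse_namd_log` is a hypothetical name, and the sample log is made up and abbreviated to a few columns.

```python
def parse_namd_log(text):
    """Collect ENERGY: rows from NAMD log text into a dict of lists."""
    titles = None
    data = {}
    for line in text.splitlines():
        if line.startswith("ETITLE:"):
            if titles is None:
                # The title line repeats throughout the log; use the first.
                titles = line.split()[1:]
                data = {title: [] for title in titles}
        elif line.startswith("ENERGY:") and titles is not None:
            values = line.split()[1:]
            if len(values) != len(titles):
                continue  # skip truncated or malformed rows
            for title, value in zip(titles, values):
                data[title].append(float(value))
    return data


sample_log = """\
Info: startup lines are ignored
ETITLE:      TS           BOND          ANGLE          TOTAL           TEMP
ENERGY:       0       120.5000       340.2500     -5000.0000       298.5000
ENERGY:     100       118.0000       338.7500     -5010.5000       300.2500
"""
namd_energies = parse_namd_log(sample_log)
```

The result has the same dict-of-columns shape regardless of which engine wrote the file, which is what would let downstream code ignore the origin of the data.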

 

> Finally, panedr has pandas as dependency. That's not a problem per se but it's probably not something we want to make a required dependency so you have to think about how to deal with the situation when someone doesn't have pandas installed. 

If pandas should not be a required dependency in MDAnalysis, maybe part of the scope of this project could be to modify panedr, or to write a new EDR reader for MDA, so that pandas is not needed? For the uses I’m imagining, the data can likely be stored in other forms such as numpy arrays as well.

 

> Can you mock up what you would *like* to write as python commands *if* you had completed your project successfully? 

I’m imagining the new energy readers as part of MDAnalysis.auxiliary. Here is a new .edr reader as an example: 

aux = MDAnalysis.auxiliary.edr.EDRReader('ener.edr')

These readers would need to work slightly differently from XVGReader, because the energy files contain more than one quantity per time step. On initialisation, the EDRReader object checks which terms are present in the file and populates an attribute with that information. I’m not sure which data type would be best; let’s say for now it is a list.

In: aux.terms 

Out: ["Time", "Bond", "Angle", "Potential", ...]

Alternatively, this information could be part of the return value of the reader’s __str__() method. 

The readers would then have methods for extracting data from the files. One desired outcome is attaching the data as auxiliary information to the timesteps of a trajectory. For that,  MDAnalysis.coordinates.base.ProtoReader.add_auxiliary() would need to be modified. Currently, it takes the following parameters:  

add_auxiliary(self, auxname, auxdata, format=None, **kwargs).  

In addition to the name for the auxiliary data attribute (auxname) and the data source (file or aux reader, auxdata), a new parameter is needed for energy files to specify which data to include from auxdata (one of aux.terms). Adding auxiliary data to a trajectory would then work as such: 

u = mda.Universe(foo, bar) 

u.trajectory.add_auxiliary(auxname='epot', auxterm='Potential', auxdata=aux)

Having loaded the auxiliary data, it can for example be used in selecting certain time steps. (In my test, creating a copy was necessary at this stage to keep time information. I’ll look into this further) 

selected_steps = [ts.copy() for ts in u.trajectory if ts.aux.epot < some_threshold] 
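Energy files are often written at a different interval than trajectory frames, so attaching energies to time steps involves matching times. The following is a minimal standard-library sketch of that matching step, independent of MDAnalysis; `nearest_index`, `select_frames`, and the `tolerance` parameter are hypothetical names, and the numbers are made up.

```python
from bisect import bisect_left


def nearest_index(sorted_times, time):
    """Index of the entry in sorted_times closest to time (non-empty list)."""
    position = bisect_left(sorted_times, time)
    if position == 0:
        return 0
    if position == len(sorted_times):
        return len(sorted_times) - 1
    before, after = sorted_times[position - 1], sorted_times[position]
    return position if (after - time) < (time - before) else position - 1


def select_frames(frame_times, energy_times, energies, threshold,
                  tolerance=1e-3):
    """Return indices of frames whose matched energy is below threshold.

    Frames without an energy entry within `tolerance` of their time
    are skipped rather than matched to a distant entry.
    """
    if not energy_times:
        return []
    selected = []
    for frame_index, time in enumerate(frame_times):
        i = nearest_index(energy_times, time)
        if abs(energy_times[i] - time) > tolerance:
            continue
        if energies[i] < threshold:
            selected.append(frame_index)
    return selected


frame_times = [0.0, 10.0, 20.0]                    # ps, one entry per frame
energy_times = [2.0 * step for step in range(11)]  # ps, written more often
potential = [-100.0, -90.0, -95.0, -120.0, -80.0, -130.0,
             -70.0, -60.0, -50.0, -40.0, -140.0]
low_energy_frames = select_frames(frame_times, energy_times, potential,
                                  threshold=-110.0)
```

Returning frame indices rather than copies of the time steps keeps the selection cheap; the indices can then be used to revisit the frames of interest.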

 

Alternatively, if the user is merely interested in extracting plottable data, the energy readers can be used for “unpacking” the selected data into numpy arrays.  

epot = aux.unpack('Potential')

bond_terms, angle_terms = aux.unpack(['Bond', 'Angle'])
 

The readers will also have a method to provide basic statistical information to the user: 

In: aux.statistics('Temperature')

Out: Temperature: Average: value, Error estimate: value, RMSD: value, Total drift: value
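To make the mock-up above concrete, here is a minimal sketch of a container offering `terms`, `unpack`, and `statistics` on top of a plain dict of columns. The class name `EnergyData` is hypothetical and this is not an existing MDAnalysis API. The total drift is taken here as the slope of a least-squares line multiplied by the time span, which is an assumption about what that number should mean; the error estimate is left out because it depends on a choice of estimator.

```python
import math


class EnergyData:
    """Hypothetical container for energy terms read from a file."""

    def __init__(self, data):
        # data maps term name to a sequence of values; "Time" is required.
        self._data = {name: list(values) for name, values in data.items()}
        self.terms = list(self._data)

    def unpack(self, terms):
        """Return one column for a string, or a list of columns for a list."""
        if isinstance(terms, str):
            return self._data[terms]
        return [self._data[term] for term in terms]

    def statistics(self, term):
        """Average, RMSD about the average, and total drift of one term."""
        times = self._data["Time"]
        values = self._data[term]
        count = len(values)
        mean = sum(values) / count
        rmsd = math.sqrt(sum((v - mean) ** 2 for v in values) / count)
        # Least-squares slope of value against time.
        mean_time = sum(times) / count
        covariance = sum((t - mean_time) * (v - mean)
                         for t, v in zip(times, values))
        variance = sum((t - mean_time) ** 2 for t in times)
        slope = covariance / variance if variance else 0.0
        drift = slope * (times[-1] - times[0])
        return {"average": mean, "rmsd": rmsd, "total_drift": drift}


example = EnergyData({
    "Time": [0.0, 1.0, 2.0, 3.0],
    "Temperature": [298.0, 300.0, 302.0, 304.0],
    "Bond": [10.0, 11.0, 9.0, 10.0],
})
temperature_stats = example.statistics("Temperature")
temperature, bond = example.unpack(["Temperature", "Bond"])
```

Returning a dict from `statistics` leaves formatting to the caller, so the same numbers can be printed, logged, or compared programmatically.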

 

So this is what I’m imagining new energy readers would do and how they would work. Connecting back to the first point in this email, I would be sure to use this frequently, and I think it could be useful for the community. I am very curious to hear how the community/the core devs feel about the cost-benefit trade-off of these inclusions. If the overall feedback is positive, great, I’ll put together the actual proposal for GSoC. If it is negative, then I’m still glad to have invested the time to think this through a bit more based on Oliver’s input as I have already learned quite a bit more now, and I will instead start working on a proposal for the context-aware guessers.  

Thanks! 

Best wishes 

Bjarne

Jonathan Barnoud

Apr 11, 2022, 9:48:48 AM
to mdnalys...@googlegroups.com
Hi Bjarne,

I wrote panedr and was involved in adding the aux readers to MDAnalysis. So naturally, I am curious to see them combined.

There are two things that concern me for that project in the specific context of GSoC, though:
* How much use will the feature get? It is already possible to do what you suggest using xvg files produced by gmx energy. Such xvg files can be read with the existing aux reader to select frames based on energy information. Yet, I have never seen it done. As you mentioned, it is not as convenient as reading the edr file directly (otherwise panedr would not exist), so I may be wrong. It is also possible that the issue is linked to the aux reader not being advertised much.
* How long will it take to implement the feature? I am afraid just writing an aux reader for panedr would be much too light of a project for GSoC.

I do think this is a task worth doing. I think we even discussed it with Oliver some time ago. But I am not convinced it is a good fit for GSoC. I might be convinced otherwise, though.

Cheers,
Jonathan


BFedder

Apr 11, 2022, 1:54:31 PM
to MDnalysis-devel
Hi Jonathan, 

Thanks very much for your feedback! Both of your concerns are very good points. 

> How long will it take to implement the feature? I am afraid just writing an aux reader for panedr would be much too light of a project for GSoC.
It is true that writing an aux reader for panedr should not take too much time, even though I would do some minor rework of it (with your blessing) to avoid bringing pandas as a dependency into MDAnalysis. So in addition, in this project, I would hope to implement aux readers for energy-type files of other MD engines as well, thus helping to keep MDAnalysis MD engine-agnostic. This would increase the scope of the project and give me more to do: If I am finished with one reader and there is time left, I would start work on the next reader. I have thus far looked at Amber, NAMD, and OpenMM as possible candidates. 

> How much use will the feature get? 
It is, admittedly, hard for me to say how much use this feature would get. One possible outcome of including more energy/aux readers, also supporting other MD engines, is that the aux framework as a whole might see more use. I would be sure to make frequent use of it, but yes, implementing this would be a matter of making something that is already possible more convenient, rather than breaking new ground.

If you think this is worth doing but not a good fit for GSoC, then I think a new issue/feature request on GitHub is a good outcome of this discussion here. I would be happy to work on it when I find the time or have someone else from the community take over. Also, I would then focus on writing a proposal for a project proposed by MDAnalysis. 

Best wishes 
Bjarne

Jonathan Barnoud

Apr 11, 2022, 3:12:29 PM
to mdnalys...@googlegroups.com
Hi Bjarne,

The increased scope makes the project more appealing. Though, while I keep thinking it is worth doing, I still doubt it is a good match for GSoC. If there are other projects that can interest you for GSoC, I would encourage you to open an issue about the aux reader and write your proposal on the other projects. Again, my concerns are only within the scope of GSoC!

If you open an issue, you can also open one on the panedr repo about returning the result as something other than a pandas DataFrame. It should be pretty easy to return the result as a dict of arrays, as that is how the pandas DataFrame is built.

Cheers,
Jonathan

BFedder

Apr 12, 2022, 11:53:47 AM
to MDnalysis-devel
Hi Jonathan, 

Thanks very much! 
I will open the issues as discussed and sink my teeth into a different proposal. 

Best wishes
Bjarne

Richard Gowers

Apr 12, 2022, 1:04:57 PM
to Mdnalysis-Devel
Hi Bjarne

I'm going to disagree with Jon, I think this could be viable as a project.  I think it's a little "light", so we'd want to see that it is well documented with something like a comprehensive tutorial.  I think in general auxreader is a very cool and underutilised feature, and an EDR link up would be a great way to popularise the feature.  N.b. Writing a more general guide on writing auxreaders would be a good way to bulk out the project proposal if it looks a little light.

Richard

orbeckst

Apr 13, 2022, 5:06:39 PM
to MDnalysis-devel
Hi Bjarne,

Don't feel discouraged by getting different kinds of feedback. The general consensus that I see is that we all would like aux to be used more. We also know from experience that the most successful projects are the ones that applicants propose themselves. At the end of the day, you have to be your own strongest advocate. Your proposal should convince us that it's worthwhile doing. From us you're hearing various boundary conditions and opinions; it's then up to you to address them. This is at the heart of proposal writing: make the strongest case possible for something that YOU believe in. If you think it's a good idea then it probably is; you just have to make other people see it the way you do while taking into account the concerns and interests of your audience.

Finally: Under the current rules of GSoC you may submit up to three proposals. They can all go to the same Org. In the past we had people who submitted two proposals (both were good) and we then selected the one we felt was the most impactful. So if you have enough time to write two proposals, just hedge your bets.

Oliver

--
Oliver Beckstein (he/his/him)

email: orbe...@mdanalysis.org
twitter: @orbeckst
GitHub: @orbeckst

MDAnalysis – a NumFOCUS fiscally sponsored project



BFedder

Apr 18, 2022, 2:33:40 PM
to MDnalysis-devel
Hi Jonathan, Richard, and Oliver, 

I responded to your messages a few days ago, but now I can't see my reply in this thread... Apologies if it has not reached you, and thanks again for your feedback thus far.

I have now submitted proposals for the energy reader project as well as the context-aware guessers project. I am aware that this is very late, sorry... But if any of you have some time to look over them and give me feedback on what could still be improved, that would be greatly appreciated! Links to Google docs are included and hopefully work on your end as well.

I would have liked to have the proposals fleshed out a bit more, but this is unfortunately not possible now due to how close the deadline is. 

Best wishes
Bjarne

Jonathan Barnoud

Apr 18, 2022, 4:49:11 PM
to mdnalys...@googlegroups.com
Hi Bjarne,

I left a couple of comments on the Google doc. I would like to see more emphasis on documentation. A tutorial in the user guide about using the edr aux reader with an application example would be valuable. In general, I care more about getting better docs than having more readers.

Cheers, 
Jonathan 

BFedder

Apr 19, 2022, 7:38:43 AM
to MDnalysis-devel
Hi Jonathan and Oliver, 

Thanks very much to you both for your comments on the proposal! I have now uploaded an updated version. 

Best wishes
Bjarne
