Troubleshooting / Suggestions for current standalone development version

Mario Picciani

May 25, 2023, 5:56:03 AM

to MS Amanda

Dear MSAmanda developers,

Thanks to one of your collaborators, who recently visited our research facility, I had the chance to test the current standalone development version on our HPC system, including the percolator step that runs at the end.

As an admin, I appreciate your efforts in getting this to run on Linux, as this will enable large-scale use with many users. I am very happy with it so far, especially with your annotations in settings.xml, which are quite helpful.

I encountered the following inconveniences that need more work:

1. The version of percolator that is used is 3.06. This is problematic because this version introduced a breaking change: it expects a column "Proteins" instead of "Protein". This is tricky to detect because the error output of percolator was not changed, so it still reports "Couldn't find Protein header in tab-file" even though a Protein column is present.

I added a bit of additional info about this here: https://github.com/percolator/percolator/issues/329#issuecomment-1529038151

You should check which version of percolator is used and name the column accordingly, i.e. "Protein" for percolator <=3.05 and "Proteins" for percolator >=3.06.

My current workaround is to replace the binary in the percolator subdirectory with a symlink to a separate install of percolator 3.05.

Suggestion: Maybe blacklist percolator 3.06 for now and use 3.05 until they release the next version, or change your input column to "Proteins".
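
The version-dependent column name from point 1 could be selected along these lines. This is only a minimal Python sketch of the rule stated above, not MS Amanda code: the function names `parse_version` and `protein_column_name` are hypothetical, and how the version string is obtained from the percolator binary is left open.

```python
import re


def parse_version(text):
    """Extract (major, minor) from a string such as 'percolator version 3.06.0'."""
    match = re.search(r"(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return int(match.group(1)), int(match.group(2))


def protein_column_name(version_text):
    """Return the protein column header that this percolator version expects."""
    # Rule from the post: "Protein" up to 3.05, "Proteins" from 3.06 onward.
    if parse_version(version_text) >= (3, 6):
        return "Proteins"
    return "Protein"
```

Comparing the parsed numbers as a tuple avoids string comparison pitfalls, for example "3.10" sorting before "3.6" as text.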

2. MSAmanda is not parallelising on input files. This is problematic in the following scenario: I have 35 mzml files that could be processed in parallel on my machine (120 cores, 1TB memory). MSAmanda really utilizes parallelism only for a very few seconds at a time, as seen in the attached image. Between these peaks, a lot of CPU time goes to waste, because these CPUs are allocated and thus reserved when running through SLURM. This is made worse by the fact that one cannot process more than 50000 spectra at once and cannot load more than 500000 proteins at once, even if the hardware would allow it.

Suggestion: Please do not add a limit here, I see no reason for it. Rather, have some general memory estimations for given numbers here that more inexperienced users could use for reference. Also please consider parallelising on the input files.

3. Despite generatePInFile = true, the _pin.tsv is not created, or at least I cannot tell in which directory. It isn't visible in the directory specified by -o.

Suggestion: Please clarify whether and where it is stored. It would be quite important to see what inputs are actually used for percolator in order to compare fairly against other tools from a research perspective. This seems to be a bug?

4. Log output: Please add more log output with timestamps, so that one can estimate how long each individual step takes and see at which step something fails. For example "Writing _pin.tsv to /path/to/folder", "failed writing because of <reason>". This helps immensely in troubleshooting.

5. Rerunning if it fails: Please add an option to rerun only percolator if it failed, or tell me how to do that. Currently, the entire search is rerun.
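
To illustrate the kind of output requested in point 4, here is a small Python sketch of timestamped, per-step logging. It is an assumption-laden illustration rather than anything taken from MS Amanda: the logger name "search" and the helper `timed_step` are hypothetical.

```python
import logging
import time
from contextlib import contextmanager

# Every record starts with a timestamp, so step durations can be read off the log.
logging.basicConfig(format="%(asctime)s %(levelname)s %(message)s", level=logging.INFO)
log = logging.getLogger("search")  # hypothetical logger name


@contextmanager
def timed_step(name):
    """Log start, duration and failure reason of one processing step."""
    start = time.perf_counter()
    log.info("started: %s", name)
    try:
        yield
    except Exception as exc:
        # Report the reason, then re-raise so the caller still sees the error.
        log.error("failed: %s because of %s", name, exc)
        raise
    else:
        log.info("finished: %s in %.2f s", name, time.perf_counter() - start)
```

A step would then be wrapped as `with timed_step("Writing _pin.tsv to /path/to/folder"): ...`, which produces one line when it starts and one when it finishes or fails.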

6. When installing MSAmanda on a multi-user HPC system, this is typically done systemwide, so that every user has access to it, for example /opt/msamanda/v3.0.0. This means that the directory where msamanda is located is not the same as the one from which it is executed. In my experience, not only enzymes.xml and settings.xml (i.e. the files that one needs to change) have to be in the directory from which it is executed; Instruments.xml, unimod.obo and unimod.xml are also required to reside there.

Suggestion: Please make sure the executable searches the directory of execution first and, if it doesn't find the files there, falls back to the default files in the install directory. This would be quite an important step from an admin perspective. Right now, I need to manipulate system paths inside a module to load it correctly.

That's all for now. Thanks for your awesome work in making this standalone version. It is fast and works out of the box, at least on our system (and after fixing the percolator issue). I hope these suggestions make it even more scalable and easy to use.
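
The lookup order suggested in point 6 can be sketched in a few lines of Python. This is an illustration of the proposed behaviour under stated assumptions, not the current implementation: the function name `resolve_config` and its parameters are hypothetical.

```python
from pathlib import Path


def resolve_config(name, work_dir, install_dir):
    """Return the path of a config file, preferring the directory of execution."""
    # Search order: directory of execution first, install directory as fallback.
    for directory in (work_dir, install_dir):
        candidate = Path(directory) / name
        if candidate.is_file():
            return candidate
    raise FileNotFoundError(f"{name} found neither in {work_dir} nor in {install_dir}")
```

With this order, a user-edited settings.xml in the working directory wins, while unchanged files such as unimod.xml are picked up from the systemwide install.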

Bildschirmfoto vom 2023-05-18 15-28-16.png

Mario Picciani

May 25, 2023, 11:44:29 AM

to MS Amanda

I just realised that percolator is also only run on the last mzml file in processing order, i.e. q values are only written to the last output.csv file. How can I fix this?

Viktoria Dorfer

May 26, 2023, 8:58:25 AM

to MS Amanda

Hi Mario,

Thanks a lot for the detailed feedback and the suggestions, we highly appreciate that! The percolator version is still in beta, so thanks for that!

Some quick answers to some easy fixes:

* In settings.xml you can specify the locations of the required files (Instruments.xml, unimod, enzymes and so on), so you can point to a different path there if MS Amanda is run from somewhere else.

* When specifying several input files, MS Amanda assumes that these are not replicates but should be analyzed together and will therefore only generate a single output file containing all results. It might still be that there is a bug for the percolator version, such that only the last file is analyzed, but the idea would be that all those files belong together.

We will look into an option to specify that the files are independent searches, but that will definitely require some more work. In the meantime you could start a separate MS Amanda run for each mzml file (these would then also run in parallel). They would utilize the same digested database (as long as the parameters are the same) when specifying the same path (settings: <data_folder>DEFAULT</data_folder>).
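
The per-file workaround described above could be driven by a small Python script such as the following sketch. The MS Amanda command line is not shown in this thread, so it is deliberately left to a caller-supplied function: `build_command` is a hypothetical callback that returns the full argument list for one input file, and `run_per_file` is likewise a name made up for this illustration.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_per_file(input_files, build_command, max_workers=4):
    """Start one external process per input file and return {file: exit code}."""

    def run_one(path):
        # build_command(path) must return the argument list for this file.
        return path, subprocess.run(build_command(path)).returncode

    # Threads are sufficient here because each worker only waits for its process.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(run_one, input_files))
```

A non-zero exit code in the returned mapping identifies the files whose search needs to be rerun. On a SLURM cluster, `max_workers` should not exceed what the job's allocation can serve.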