Hi Martin,
We are glad to hear from our users, and we are sure that these interactions will improve both the Macrel and AMPsphere projects.
First of all, thank you for suggesting a flag that redirects output to a log file; we should implement this. Just to let you know, we also intend to implement a better --help message soon, one that shows all command-line options.
It is quite interesting that you did not detect any peptide longer than 50 residues. It is true that the model has a size bias, where larger peptides tend to be considered negative examples (80% of the cases in the training set). However, in our tests during the Global AMP Survey, in which we screened thousands of genomes and metagenome assemblies, about 4.6% of the predicted AMPs were longer than 50 residues (52,977 / 1,151,506). I think that in AMPsphere the sample is so large that this imbalance becomes insignificant, whereas in your data the effect is strong. I would suggest testing a larger amount of data, which will probably give somewhat different results, or checking the longer peptides you got with ampir. If those longer AMPs carry a signal peptide or extension, you could try removing it and submitting the trimmed sequence directly to macrel [we have heard that this worked, although we did not test it ourselves].
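If it helps, the same proportion can be computed on your own predictions with a few lines of Python. This is only a minimal sketch: it assumes you have the predicted peptides in a plain FASTA file, and the function names (`read_fasta`, `fraction_longer_than`) are made up for this example; they are not part of macrel.

```python
def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def fraction_longer_than(handle, cutoff=50):
    """Return (n_long, n_total, fraction) for peptides longer than `cutoff` residues."""
    n_long = n_total = 0
    for _, seq in read_fasta(handle):
        n_total += 1
        # Ignore a trailing '*' (stop symbol) if the gene predictor wrote one.
        if len(seq.rstrip("*")) > cutoff:
            n_long += 1
    fraction = n_long / n_total if n_total else 0.0
    return n_long, n_total, fraction


# The AMPsphere figure quoted above, for reference:
print(f"{52977 / 1151506:.1%}")  # 4.6%
```

You would call it as `fraction_longer_than(open("predicted.faa"))`, where the file name is a placeholder for your own output. With only a few hundred predictions, a 4.6% rate corresponds to a handful of long peptides, so finding none is not very surprising.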
Make sure that you are using the latest version of Macrel; some of the very early versions were unstable during training and were also biased towards the presence of methionine. Please read more in our blogpost at:
About the comparison with ampir, we need to keep in mind 2 main points:
1. gene prediction - ampir does not use a gene prediction system, but focuses on examples from eukaryotes (frog, centipede, arabidopsis, human...). Their genes are known to be much larger, and their AMPs usually tend to be more complex than those from prokaryotes. The gene prediction system implemented in macrel is prodigal-based, which means it does not work on eukaryotes at all, as it does not account for introns. It also means that the minimum and maximum gene lengths returned are 33 and 303 bp, respectively, because genes outside this range are filtered out during macrel processing.
2. training set - ampir's training set, unlike macrel's, was built from example sequences of 50 to 500 residues (default condition or "precursor" mode), which also biases their model towards longer proteins. I do not know whether you have tested ampir's "mature" mode, trained with peptides of 10-60 residues; in this mode your results should be relatively comparable to those obtained with macrel. The core difference is that Legana divided ampir's training set by length (mature and precursor) and accepted much larger AMPs than macrel did.
These two points suggest that ampir is better at predicting eukaryotic AMPs, while macrel was designed to predict prokaryotic ones. Thus, the purpose of your study determines which software you need.
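To make the length ranges above easier to compare, here is a small Python sketch that reports which settings cover a given peptide length. The ranges are the ones quoted in this message. The conversion from base pairs to residues assumes that the 33 - 303 bp gene length includes the stop codon; that is an assumption of this example, and all function names are hypothetical.

```python
# Length windows quoted above: macrel in base pairs, ampir in residues.
MACREL_GENE_BP = (33, 303)
AMPIR_PRECURSOR_AA = (50, 500)
AMPIR_MATURE_AA = (10, 60)


def bp_to_residues(gene_bp, includes_stop=True):
    """Convert a gene length in bp to a peptide length in residues.

    Assumption: the gene length counts the stop codon, so one codon
    is subtracted. Pass includes_stop=False if that is not the case.
    """
    codons = gene_bp // 3
    return codons - 1 if includes_stop else codons


def comparable_settings(peptide_len):
    """List the settings whose length window contains `peptide_len` residues."""
    windows = {
        "macrel": tuple(bp_to_residues(bp) for bp in MACREL_GENE_BP),
        "ampir_precursor": AMPIR_PRECURSOR_AA,
        "ampir_mature": AMPIR_MATURE_AA,
    }
    return [name for name, (low, high) in windows.items()
            if low <= peptide_len <= high]


print(comparable_settings(30))   # ['macrel', 'ampir_mature']
print(comparable_settings(150))  # ['ampir_precursor']
```

Under that assumption the macrel window works out to 10 to 100 residues, which overlaps ampir's "mature" mode far more than its "precursor" mode.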
Cheers,
Celio Dias Santos Junior.