Okay, I did some checking and I think I found the bug.
The issue has to do with the length of FASTA header lines. Protein identifiers that are 99 or 100 characters long are not getting recorded by ABACUS.
By default ABACUS reads a FASTA header line up to the first blank space. If the resulting substring is longer than 100 characters, it is truncated to 99 characters.
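To make the intended rule concrete, here is a rough Python sketch of it. This is only an illustration of the behavior described above, not the actual ABACUS source; the function name and the two parameters are made up for the example.

```python
def extract_protein_id(header_line, max_len=100, truncated_len=99):
    """Hypothetical sketch of the described rule (not the ABACUS code)."""
    # Drop the leading '>' and surrounding whitespace.
    text = header_line.lstrip(">").strip()
    # Read up to the first blank space.
    protein_id = text.split(" ", 1)[0]
    # Identifiers longer than max_len are cut down to truncated_len characters.
    if len(protein_id) > max_len:
        protein_id = protein_id[:truncated_len]
    return protein_id
```

With this rule, identifiers of 99 or 100 characters should pass through unchanged, which is the boundary case to keep an eye on.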
There is an issue with the conditional that checks for long protein ID lines.
Here is an example of a protein that appears in the XML file but that ABACUS thinks has a length of zero. Its identifier is exactly 99 characters long:
pleuromamma_t_h10_37182_comp204268_c0::pleuromamma_t_h10_37182_comp204268_c0_seq1::g.87736::m.87736 |
I will fix the code and push the update sometime next week. In the meantime, if you are still working on this issue, lengthen the protein identifiers so that they are longer than 100 characters before the first space. Alternatively, shorten them so that they are shorter than 99 characters before the first space.
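If it helps to see which entries need renaming, a small script along these lines could list the identifiers that are 99 or 100 characters long. It is a minimal sketch in Python, separate from ABACUS; the function name is hypothetical and the header is assumed to be split at the first blank space, as described above.

```python
def find_affected_ids(fasta_lines, low=99, high=100):
    """Return protein IDs whose length is between low and high, inclusive."""
    affected = []
    for line in fasta_lines:
        # Only header lines start with '>'; sequence lines are skipped.
        if not line.startswith(">"):
            continue
        # Take everything after '>' up to the first blank space.
        protein_id = line[1:].strip().split(" ", 1)[0]
        if low <= len(protein_id) <= high:
            affected.append(protein_id)
    return affected


# Usage with the example header from this post (sequence line is a placeholder).
example = [
    ">pleuromamma_t_h10_37182_comp204268_c0::pleuromamma_t_h10_37182_comp204268_c0_seq1::g.87736::m.87736 |",
    "MKV",
]
for protein_id in find_affected_ids(example):
    print(len(protein_id), protein_id)
```

To run it on a real file, pass an open file handle instead of the list, for example `find_affected_ids(open(path))` with your own path.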
This only happens with this FASTA file because I have specific parsing rules for UniProt, RefSeq and IPI.
Hope this helps and I'm sorry for taking so long. This one was a hard one to pin down.
Damian