Hi Sam,
I have been using samsa2 for analyzing my data and I find it very nice and easy to use. Thanks for that!
For the project I am working on now, I made a customized database for annotating the reads where the first line of some proteins looks like this:
line = '>MGYG000290000_00034 1-(5-phosphoribosyl)-5-[(5-phosphoribosylamino)methylideneamino] imidazole-4-carboxamide isomerase [Bacteroidaceae_Prevotella]'
This line contains two pairs of "[]": one inside the function name and one around the organism name. When I ran the aggregation step with DIAMOND_analysis_counter.py, I noticed that some function names were truncated in the tsv file. I went back to the Python script (L139-L142) and added a few lines that split the name correctly when there is more than one "[".
# name and functional description
if line.count("[") != 1:
    db_entry = line.rsplit("[", 1)  ## splits the line at the last "[" (the first one counting from the end)
    db_entry = db_entry[0].split(" ", 1)  ## splits the first part into two
    db_entry = db_entry[1][:-1]  ## keeps the second part
else:
    db_entry = line.split("[", 1)  ## splits the line into two parts, one before the first occurrence of "[" and the other after
    db_entry = db_entry[0].split(" ", 1)  ## splits the first part into two
    db_entry = db_entry[1][:-1]  ## keeps the second part
It seemed to work, and I wanted to share it with you in case you want to add it to the original script.
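For anyone who wants to try the idea outside the script, here is a minimal standalone sketch. The helper name extract_function_name is hypothetical and is not part of samsa2. It assumes every header ends with an "[organism]" tag, and it uses strip() rather than the [:-1] slice to drop the trailing space. Since rsplit("[", 1) and split("[", 1) give the same result when there is only one "[", the sketch always splits from the right.

```python
def extract_function_name(line):
    """Return the functional description from a database header line.

    Hypothetical helper for illustration only; not part of samsa2.
    Assumes the header ends with an "[organism]" tag.
    """
    # Cut at the LAST "[" so brackets inside the function name are kept.
    before_organism = line.rsplit("[", 1)[0]
    # Separate the leading ">ID" token from the description.
    parts = before_organism.split(" ", 1)
    if len(parts) < 2:
        return ""  # header has an ID but no description
    return parts[1].strip()


header = (
    ">MGYG000290000_00034 1-(5-phosphoribosyl)-5-"
    "[(5-phosphoribosylamino)methylideneamino] "
    "imidazole-4-carboxamide isomerase [Bacteroidaceae_Prevotella]"
)
print(extract_function_name(header))
```

One caveat: a header with brackets in the function name but no organism tag would still be cut at the last "[", so this only helps when the organism tag is always present.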
Cheers!
Menia
-- Asimenia Gavriilidou | Postdoctoral researcher
Wageningen University & Research
Laboratory of Microbiology
Campus Helix building 124 – Room 5033
Polka dot 4
6708 WE Wageningen
Netherlands