Parsing function names with more than one "[]"

Menia Gavriilidou

Jan 25, 2024, 7:04:23 AM

to swest...@gmail.com, samsa-bioinfo...@googlegroups.com

Hi Sam,

I have been using samsa2 for analyzing my data and I find it very nice and easy to use. Thanks for that!

For the project I am working on now, I made a customized database for annotating the reads where the first line of some proteins looks like this:

line = '>MGYG000290000_00034 1-(5-phosphoribosyl)-5-[(5-phosphoribosylamino)methylideneamino] imidazole-4-carboxamide isomerase [Bacteroidaceae_Prevotella]'

This line contains two pairs of "[]": one inside the function name and one around the taxon at the end. When I ran the aggregation step with DIAMOND_analysis_counter.py, I noticed that I get truncated names for some functions in the TSV file. I went back to the Python script (L139-L142) and added a few lines that split the name correctly when there is more than one "[]".

# name and functional description
if line.count("[") != 1:
    db_entry = line.rsplit("[", 1)         ## splits the line at the last "[" (the one that opens the taxon)
    db_entry = db_entry[0].split(" ", 1)   ## splits the first part into ID and function name
    db_entry = db_entry[1][:-1]            ## keeps the function name, minus the trailing space
else:
    db_entry = line.split("[", 1)          ## splits the line into two parts, one before the first occurrence of "[" and the other after
    db_entry = db_entry[0].split(" ", 1)   ## splits the first part into two
    db_entry = db_entry[1][:-1]            ## keeps the second part, minus the trailing space
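Since rsplit("[", 1) also gives the right result when there is only one "[", the two branches could probably be folded into one. Below is a minimal, standalone sketch of that idea. The helper name extract_function_name is hypothetical and not part of samsa2, and it uses strip() instead of [:-1] to drop the trailing space, which is my assumption about what the slice is meant to do.

```python
def extract_function_name(header: str) -> str:
    """Return the functional description from a database header line.

    Hypothetical helper, not part of samsa2. Assumes headers look like
    '>ID function name [Taxon]', where the function name may itself
    contain square brackets.
    """
    # Split at the LAST "[", which is assumed to open the taxon field.
    before_taxon = header.rsplit("[", 1)[0]
    # Drop the sequence ID (everything up to the first space).
    parts = before_taxon.split(" ", 1)
    if len(parts) < 2:
        return ""  # header has an ID but no description
    # strip() removes the space left in front of the taxon bracket.
    return parts[1].strip()


line = ('>MGYG000290000_00034 1-(5-phosphoribosyl)-5-'
        '[(5-phosphoribosylamino)methylideneamino] '
        'imidazole-4-carboxamide isomerase [Bacteroidaceae_Prevotella]')

# Splitting at the first "[" (the original behaviour) truncates the name.
truncated = line.split("[", 1)[0].split(" ", 1)[1][:-1]

# Splitting at the last "[" keeps the whole function name.
full = extract_function_name(line)

print(truncated)
print(full)
```

For the example header, the first print shows the truncated name and the second shows the full one, ending in "isomerase".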

It seems to work, and I wanted to share it with you in case you want to add it to the original script.

Cheers!

Menia

Asimenia Gavriilidou | Postdoctoral researcher

Wageningen University & Research

Laboratory of Microbiology

Campus Helix building 124 – Room 5033

Stippeneng 4

6708 WE Wageningen

Netherlands

mobile phone: +31620607736

e-mail: asimenia.g...@wur.nl
