error with standardized_DIAMOND_analysis_counter.py

Arian Lundberg

Dec 18, 2022, 11:38:40 PM
to SAMSA bioinformatics group

Hello, I am having issues with the standardized_DIAMOND_analysis_counter.py script.
I am getting an IndexError at line 138.

I've modified the SAMSA2 master_script_for_sample_files.bash file, and the error comes after STEP 4 is done.

Command from the SAMSA2 pipeline script:

STEP 5: AGGREGATING WITH ANALYSIS_COUNTER

for file in $starting_files_location/step_4_output/*RefSeq_annotated*
do
    python $python_programs/standardized_DIAMOND_analysis_counter.py -I $file -D $RefSeq_db -O
    python $python_programs/standardized_DIAMOND_analysis_counter.py -I $file -D $RefSeq_db -F
done

Error:

Now reading through the m8 results infile.

Analysis of /projects/bact.fun.unmapped.RefSeq_annotated complete.
Number of total lines: 574668
Number of unique sequences: 574668
Time elapsed: 1.8101940155 seconds.

Then the "Starting database analysis now." message appears, and processing continues until:

198M lines processed so far in 2025.08801007 seconds.

Then I get this error:

Traceback (most recent call last):
  File "/projects/tools/samsa2/python_scripts/standardized_DIAMOND_analysis_counter.py", line 138, in <module>
    db_entry = db_entry[1][:-1]
IndexError: list index out of range

Here is a snapshot of your script, from line 127 to 138:

for line in db:
    if line.startswith(">") == True:
        db_line_counter += 1
        splitline = line.split("[",1)
        # ID, the hit returned in DIAMOND results
        db_id = str(splitline[0].split()[0])[1:]
        # name and functional description
        db_entry = line.split("[", 1)
        db_entry = db_entry[0].split(" ", 1)
        db_entry = db_entry[1][:-1]


I generated a database containing viral, fungal, and bacterial sequences.

The bacterial and viral sequences were downloaded from NCBI, but the fungal sequences were downloaded from Zenodo.org, where the SAMSA2 creators uploaded their data.

https://zenodo.org/record/3737678#.Y5uzSS-B2_c

I've checked whether there might be an issue with the sequence names from each database, and I couldn't find any.

Here are example entries from each source.

Bacterial:

>WP_206150240.1 LysE family translocator [Burkholderia sp. Tr-20390]

MSLSALLAFALILSVGVATPGPTVLLAMSNGSRYGLRHAMVGMLGAVTADVVLVALVGCGLGMLLDASETAFVTLKLAGAAWLAYVGVRMLLSSGGSAAAQALDHATPDHRTAFLKSFFVAMSNPKYYLFMSALLPQFVDRSHAIAPQYAILAATIVAIDVIGMTGYALLGVHSVRVWKAAGEKWLNRVSGSLLLMLAGYVALYRKAAN

Viral:

>YP_009137152.1 envelope glycoprotein L [Human alphaherpesvirus 2]

MGFVCLFGLVVMGAWGAWGGSQATEYVLRSVIAKEVGDILRVPCMRTPADDVSWRYEAPSVIDYARIDGIFLRYHCPGLDTFLWDRHAQRAYLVNPFLFAAGFLEDLSHSVFPADTQETTTRRALYKEIRDALGSRKQAVSHAPVRAGCVNFDYSRTRRCVGRRDLRPANTTSTWEPPVSSDDEASSQSKPLATQPPVLALSNAPPRRVSPTRGRRRHTRLRRN

and Fungi:

>MT1.1

SSIYTITCYPRRTFLPLYVYGTLSHRSYKFILFSNLSNIKAHLVSYPALTSLYGTSLKYFSVGILFTFNPIILLIFVYSIRESFYSVFSSLTSGMLSIIISEALLFFTYFWGILHFSLSPYPLSNEGIIITSSRMLILTITFILASASCMTACLQVFIEKGMSFEISSIICIIYLLGECFASLQTTEYLHLSYHINDTVYTTLFYCVTGLHFSHVVIGLLLLIIYFIRIIEIYDTSTEWFINSFGISYIVIPHTDQITILYWHFVEIVWLFIEFLFYSE
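
In case it helps narrow things down, here is a minimal sketch that replays the split logic from the snapshot above on the three example headers. The helper name `parse_header` is hypothetical (it is not part of SAMSA2); it only mirrors the quoted lines, and the headers are the ones pasted above.

```python
def parse_header(line):
    """Hypothetical helper mirroring the split logic in the quoted snapshot."""
    splitline = line.split("[", 1)
    # ID: first whitespace-separated token, minus the leading ">"
    db_id = str(splitline[0].split()[0])[1:]
    # Name and functional description: text between the ID and the "["
    db_entry = line.split("[", 1)
    db_entry = db_entry[0].split(" ", 1)
    db_entry = db_entry[1][:-1]  # IndexError if the header has no space after the ID
    return db_id, db_entry


headers = [
    ">WP_206150240.1 LysE family translocator [Burkholderia sp. Tr-20390]\n",
    ">YP_009137152.1 envelope glycoprotein L [Human alphaherpesvirus 2]\n",
    ">MT1.1\n",
]

results = {}
for header in headers:
    key = header.strip()
    try:
        results[key] = parse_header(header)
    except IndexError:
        # Record headers that the split logic cannot handle
        results[key] = None
        print("cannot parse:", key)
```

In this sketch, a header with no space after the ID (such as the fungal example) leaves a one-element list after the split on " ", so indexing `[1]` raises the same IndexError.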


I look forward to hearing from you. Thanks in advance.
