index out of range with standardized_DIAMOND_analysis_counter.py

fio...@appstate.edu

Sep 24, 2021, 10:10:57 AM

to SAMSA bioinformatics group

Hello,

I came across a similar post about an issue I am having, but it is not quite the same (as far as I can tell). I have been using SAMSA2 with a custom database, and I am currently trying to run standardized_DIAMOND_analysis_counter.py on the annotated files from the DIAMOND BLAST against that custom database. I have 8 sample files to run; oddly, 5 of them ran just fine, while 3 keep giving me an index out of range error. As far as I can tell, all 8 files have exactly the same format.

Here is the code I am running:

python standardized_DIAMOND_analysis_counter.py -I Pre_NR50_BLAST_AQ_m8_results -D AQprotein.faa -F

Here is the error:

....

25M lines processed so far in 38.51 seconds.

Traceback (most recent call last):

File "standardized_DIAMOND_analysis_counter.py", line 93, in <module>

RefSeq_hit_count_db[splitline[1]] += 1

IndexError: list index out of range

I'll attach a screenshot of the top portion of a file that ran just fine ("Pre_NR17...") and one that is giving the index out of range error ("Pre_NR50..."). They look identical to me, but maybe I am missing something.

Thank you so much!

Cara

Sam Westreich

Sep 24, 2021, 3:17:43 PM

to fio...@appstate.edu, SAMSA bioinformatics group

Hi Cara,

Can you resend the screenshots? I don't see them; they may not have come through on the Google Group message.

My guess is that there's a strange entry in the reference that doesn't have the right number of components to be parsed by the analysis script. How big is your DIAMOND output file that you're looking to feed into the Python analysis script? Are you annotating this against the RefSeq reference database that was distributed with SAMSA2, or against a custom database you've created?
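One quick way to look for an entry like that is to scan the m8 file for lines that do not split into the expected number of columns. The following is only a sketch in Python 3, not part of SAMSA2: it assumes the m8 file is tab-separated with the default 12 columns of DIAMOND/BLAST tabular output, and `find_malformed_lines` is a hypothetical helper name. A blank or truncated line would show up with a single field, which is the kind of line that would make `splitline[1]` fail.

```python
import sys

EXPECTED_COLUMNS = 12  # assumption: default DIAMOND/BLAST tabular (m8) layout


def find_malformed_lines(path, expected=EXPECTED_COLUMNS, limit=20):
    """Return up to `limit` (line_number, field_count, text) tuples for odd lines."""
    problems = []
    with open(path, "r", errors="replace") as infile:
        for number, line in enumerate(infile, start=1):
            fields = line.rstrip("\n").split("\t")
            if len(fields) != expected:
                # Keep only the start of the line so the report stays readable.
                problems.append((number, len(fields), line[:80].rstrip("\n")))
                if len(problems) >= limit:
                    break
    return problems


if __name__ == "__main__" and len(sys.argv) > 1:
    for number, count, text in find_malformed_lines(sys.argv[1]):
        print(f"line {number}: {count} field(s): {text!r}")
```

Run it with the m8 file as the only argument. If it prints nothing, every line has 12 tab-separated fields and the problem is more likely on the database side.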

Best,

Sam

--

Sam Westreich, PMP, PhD

Microbiome Scientist, DNAnexus,

http://www.mosaicbiome.com

Cara Fiore

Sep 27, 2021, 3:51:18 PM

to Sam Westreich, SAMSA bioinformatics group

Hi Sam

Thanks for the quick response. It took me a couple of days because the image file was not on my laptop. I have attached it here; hopefully it comes through OK this time.

Best

Cara

Screen Shot 2021-09-24 at 9.59.30 AM.png

Sam Westreich

Sep 28, 2021, 11:51:51 AM

to Cara Fiore, SAMSA bioinformatics group

Thanks Cara, I see the screenshot now.

Can you answer the other questions:

How big (file size) is the DIAMOND analysis file that is serving as the input to the Python script?
Are you annotating this against the standard RefSeq database that gets downloaded by the full_database_download.bash script, or a custom database you set up?
When did you pull this version of SAMSA2 from GitHub?

As I mentioned above, my guess is that one of the entries has some strange formatting that is breaking the parsing script. I can likely work to diagnose it, but I'd need access to the DIAMOND m8 results file, which may be tricky if it's large. Possibly, we can use Dropbox or Google Drive to share it.

Best,

Sam

Cara Fiore

Sep 29, 2021, 8:28:36 AM

to Sam Westreich, SAMSA bioinformatics group

Hi Sam

The files are quite large, ~4 GB. I'll try compressing one and attaching it through Drive to see if that works (this file, "NR50", is one that was not working).

I used a custom database. I ran DIAMOND_example_script.bash to build the database and then to produce the "m8_results" file that I fed to the Python script.

I am not sure when I downloaded the script from GitHub, but it was probably about a year ago or a little more.

Thanks

Cara

Pre_NR50_BLAST_AQ_m8_results.zip

Sam Westreich

Sep 29, 2021, 3:29:48 PM

to Cara Fiore, SAMSA bioinformatics group

Hi Cara,

Okay, can you quickly try pulling the latest SAMSA2 repo and run the updated parsing script? I know I've made a couple of tweaks in the last year, and that could be an easy fix.

If you generated your own database, my guess is something in that database isn't following the same structure as the default ones, and that's where the parsing is failing. Can you share how you made this database (which initial files you used)? Was it files pulled from RefSeq, or another location?

I suspect that the problem will lie in the custom database you're using having a different entry layout/separators than the default RefSeq one that SAMSA2 downloads.
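A rough way to test that suspicion is to list the headers in the custom .faa file that do not look like RefSeq entries. This is a sketch under an assumption, not the check SAMSA2 itself performs: it treats a "RefSeq-style" header as `>ID description [Organism]`, and both the pattern and the `unusual_headers` name are made up for illustration. The pattern is deliberately loose, so it only catches headers that lack an ID, a description, or a bracketed organism.

```python
import re
import sys

# Assumed RefSeq-style header: ">ID free-text description [Organism name]"
REFSEQ_STYLE = re.compile(r"^>\S+\s+.+\[.+\]\s*$")


def unusual_headers(path, limit=20):
    """Return up to `limit` (line_number, header) pairs that do not match the pattern."""
    found = []
    with open(path, "r", errors="replace") as fasta:
        for number, line in enumerate(fasta, start=1):
            # Only header lines are checked; sequence lines are ignored.
            if line.startswith(">") and not REFSEQ_STYLE.match(line):
                found.append((number, line.rstrip("\n")))
                if len(found) >= limit:
                    break
    return found


if __name__ == "__main__" and len(sys.argv) > 1:
    for number, header in unusual_headers(sys.argv[1]):
        print(f"line {number}: {header}")
```

Any header it prints is worth comparing by eye against an entry from the default RefSeq download.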

Best,

Sam

Cara Fiore

Sep 30, 2021, 1:48:50 PM

to Sam Westreich, SAMSA bioinformatics group

Hi Sam,

I did, and it seems to have fixed the problem for two of the three files that did not work before, so I am down to just one file that is not working. I am not sure why it still fails, though. I'll attach the "m8" file through Drive.

Thank you,
Cara

Here is the code and the message:

Post_NR25_BLAST_AQ_m8_results.zip

python standardized_DIAMOND_analysis_counter.py -I Post_NR25_BLAST_AQ_m8_results -D AQprotein.faa -F

Now reading through the m8 results infile.

1M lines processed so far in 1.55984115601 seconds.

2M lines processed so far in 3.12057709694 seconds.

3M lines processed so far in 4.67945504189 seconds.

4M lines processed so far in 6.19663715363 seconds.

5M lines processed so far in 7.71381211281 seconds.

6M lines processed so far in 9.30063414574 seconds.

7M lines processed so far in 10.8132519722 seconds.

Traceback (most recent call last):

File "standardized_DIAMOND_analysis_counter.py", line 93, in <module>

RefSeq_hit_count_db[splitline[1]] += 1

IndexError: list index out of range
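Because the traceback stops at the first line that cannot be indexed, it does not show how many such lines the file contains. A tolerant counting loop can tally the hits and record the line numbers it has to skip. This is a minimal Python 3 sketch, not the SAMSA2 script: it assumes a tab-separated m8 file with the subject ID in the second column, and `count_hits` is a hypothetical name.

```python
import sys
from collections import Counter


def count_hits(path):
    """Count subject IDs (column 2) and collect line numbers that lack that column."""
    hit_counts = Counter()
    skipped = []
    with open(path, "r", errors="replace") as infile:
        for number, line in enumerate(infile, start=1):
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 2:
                # This is the case that raises IndexError on splitline[1].
                skipped.append(number)
                continue
            hit_counts[fields[1]] += 1
    return hit_counts, skipped


if __name__ == "__main__" and len(sys.argv) > 1:
    counts, skipped_lines = count_hits(sys.argv[1])
    print(f"{sum(counts.values())} hits counted, {len(skipped_lines)} line(s) skipped")
    print("first skipped line numbers:", skipped_lines[:10])
```

If only one line is skipped and it sits near the 7M mark, a stray blank or truncated line in the DIAMOND output would be a plausible cause.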

fio...@appstate.edu

Dec 6, 2021, 11:47:15 AM

to SAMSA bioinformatics group

Hi Sam,

I am responding to this thread since it is a continuation of this issue; I am hoping you'll see it. I have two questions for you:

1) I am stuck on the last file that I need to run with the standardized_DIAMOND_analysis_counter.py script. The results file is the m8 file generated in the previous step, and the database is from EMBL. However, I looked at the .faa file and it is formatted the same as RefSeq, and it has worked with all of my other m8 files. If you have time to take a look at these, the link to the files is in the responses above. Thanks so much.

2) Also, if I can pester you about one more thing: I tried to run the filter script with the -SO flag, as below:

python DIAMOND_results_filter.py -I Pre_NR_17_RefSeq_annot_Flavobacterium_function.tsv -SO fibronectin -D /home/slowshare/fiorec/fiorec_share/samsa2/full_databases/RefSeq_bac.dmnd

I have tried both the m8 file and the tsv file as the input, but nothing seems to work, even though I know that "fibronectin" is in the annotated results. I am trying to get the sequences that correspond to this annotation. The output of the above command says there are zero matches to 'fibronectin'. Am I using the script incorrectly? I also tried differently formatted databases, because I did not see anything in the script about what format the database should be in (.dmnd, .faa, etc.).

I appreciate your time.

Cara

Sam Westreich

Dec 6, 2021, 10:12:57 PM

to fio...@appstate.edu, SAMSA bioinformatics group

Hi Cara,

For question 1, can you share the error that the Python script is reporting? If you could paste the last ~20 lines of the STDOUT log into the email, that would be great.

For question 2, the first thing to try is running it with the .faa file, not the .dmnd file. (The .dmnd file is a binary DIAMOND database, not plain text, so it can't be read directly by the Python script.) If it's still not working, would it be possible for you to upload the reference .faa and the m8 file to a folder on Google Drive to share with me? I could probably do a bit of troubleshooting if it's still reporting 0 matches.
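As a cross-check that is independent of the filter script, the plain-text .faa can be searched for the term directly, and the matching IDs can then be looked up in the m8 file. The sketch below is Python 3 and is not how DIAMOND_results_filter.py is implemented. It assumes the subject ID in column 2 of the m8 file equals the first whitespace-delimited token of the FASTA header, and the function names `ids_matching` and `m8_lines_for` are hypothetical.

```python
import sys


def ids_matching(faa_path, term):
    """Return the set of sequence IDs whose FASTA header contains `term` (case-insensitive)."""
    term = term.lower()
    ids = set()
    with open(faa_path, "r", errors="replace") as fasta:
        for line in fasta:
            if line.startswith(">") and term in line.lower():
                parts = line[1:].split()
                if parts:  # skip a bare ">" with no ID
                    ids.add(parts[0])
    return ids


def m8_lines_for(m8_path, ids):
    """Yield m8 lines whose subject ID (column 2) is in `ids`."""
    with open(m8_path, "r", errors="replace") as infile:
        for line in infile:
            fields = line.rstrip("\n").split("\t")
            if len(fields) > 1 and fields[1] in ids:
                yield line.rstrip("\n")


if __name__ == "__main__" and len(sys.argv) > 3:
    # Arguments: reference .faa, m8 results file, search term
    wanted = ids_matching(sys.argv[1], sys.argv[3])
    print(f"{len(wanted)} database entries mention {sys.argv[3]!r}", file=sys.stderr)
    for hit in m8_lines_for(sys.argv[2], wanted):
        print(hit)
```

If the first step already finds no database entries, the term is missing from the reference headers. If it finds entries but no m8 lines match, the IDs in the two files are probably formatted differently.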

Best,

Sam
