Duplicate tip labels causing error in hsp.py


Josh Buonpane

May 9, 2023, 12:53:11 PM
to picrust-users
Hello,

I was successfully able to run the first python script (place_seqs.py), but am running into an error while running the second (hsp.py). This is not the tutorial data but my own 16S data. The error message that I am receiving states the following:

Error running this command:

Rscript /mnt/lz01/wollheim/jmn686/picrust2-2.5.2/picrust2/Rscripts/castor_nsti.R out.tre /tmp/tmps73clarm/known_tips.txt /tmp/tmps73clarm/nsti_out.txt

Standard error of the above failed command:
Error in read_tree(file = Args[1], check_label_uniqueness = TRUE) : 
ERROR: Duplicate tip labels (e.g. 'Subgroup;1') found in input tree
Execution halted

As far as I can tell, the tree (out.tre) built by the first script (place_seqs.py) is causing problems in an R script called by the second script (hsp.py). However, even after a good deal of searching online, I am unsure what exactly the duplicate tip labels refer to and how to fix the issue.

Any assistance is greatly appreciated,
Josh Buonpane

Gavin Douglas

May 9, 2023, 5:50:35 PM
to picrus...@googlegroups.com
Hi Josh,

This error means that multiple query sequences are producing tree tip labels called 'Subgroup;1'. Is this string found in any of your input sequence IDs? The first thing to check is whether multiple sequences have this exact name.
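The check described above can be sketched in a few lines of Python. This is only an illustration, not part of PICRUSt2: the names `fasta_ids`, `check_ids`, and `RESERVED` are hypothetical, the example records are made up, and the ID is assumed to be the header text up to the first whitespace.

```python
import re
from collections import Counter

# Characters that carry structural meaning in Newick trees.
RESERVED = re.compile(r"[;:,()\[\]']")


def fasta_ids(lines):
    """Yield the ID of each FASTA record (header text up to the first whitespace)."""
    for line in lines:
        if line.startswith(">"):
            header = line[1:].strip()
            yield header.split()[0] if header else ""


def check_ids(lines):
    """Return (IDs that occur more than once, IDs containing reserved characters)."""
    ids = list(fasta_ids(lines))
    duplicates = sorted(i for i, n in Counter(ids).items() if n > 1)
    risky = sorted({i for i in ids if RESERVED.search(i)})
    return duplicates, risky


# Made-up records, for illustration only.
example = """>ASV_1;Bacteria;Proteobacteria;ASV_1
ACGT
>ASV_2
ACGT
>ASV_2
TTGA
""".splitlines()

duplicates, risky = check_ids(example)
print("duplicate ids:", duplicates)
print("ids with reserved characters:", risky)
```

To run it on a real file, pass an open file handle to `check_ids` in place of `example`.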

Cheers,

Gavin


Josh Buonpane

May 10, 2023, 11:13:52 AM
to picrust-users
Hi Gavin,

First off, thanks for the quick response. I'm wondering if my sequence naming convention is what is causing the issue. The sequences are named as such: 

eg.
ASV_13427;Bacteria;Proteobacteria;Alphaproteobacteria;NA;NA;NA;ASV_13427

It seems that the name is being split into subgroups, and perhaps the fact that ASV_##### occurs twice is causing the issue. If this seems reasonable, I will try to use sed to rename all of the sequences in the FASTA file and the BIOM table, then run everything from the top again. I think they should be renamed like this:

eg. 
Bacteria;Proteobacteria;Alphaproteobacteria;NA;NA;NA;ASV_13427

Thanks again and I will report back,

Josh 

Gavin Douglas

May 10, 2023, 11:16:23 AM
to picrus...@googlegroups.com
Hi Josh,

The issue may actually be the ‘;’ characters, as those are reserved in Newick trees (a semicolon marks the end of a tree). I would try simplifying the IDs (and maybe just inputting them as ‘ASV_13427’) before re-running.
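A minimal sketch of that simplification in Python, as an alternative to sed. It assumes every header contains an ID of the form ASV_<digits>; the names `ASV_ID`, `simplify_header`, and `simplify_fasta` are hypothetical, and the sequence in the example is a placeholder.

```python
import re

# Assumption: short IDs look like ASV_13427.
ASV_ID = re.compile(r"ASV_\d+")


def simplify_header(header):
    """Return the first ASV_<digits> token in a header, or the header unchanged."""
    match = ASV_ID.search(header)
    return match.group(0) if match else header


def simplify_fasta(lines):
    """Yield FASTA lines with each '>' header reduced to its short ID."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            yield ">" + simplify_header(line[1:])
        else:
            yield line


# Header taken from the thread; the sequence is a placeholder.
example = [
    ">ASV_13427;Bacteria;Proteobacteria;Alphaproteobacteria;NA;NA;NA;ASV_13427",
    "ACGTACGT",
]
renamed = list(simplify_fasta(example))
print("\n".join(renamed))
```

Headers without a matching token are left unchanged, so it is worth checking the output for any remaining semicolons before re-running place_seqs.py.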


Cheers,

Gavin

Josh Buonpane

May 10, 2023, 12:17:04 PM
to picrust-users
Hi Gavin,

Great feedback. I am currently running it both ways (with IDs like Bacteria;Proteobacteria;Alphaproteobacteria;NA;NA;NA;ASV_13427 and with IDs like ASV_13427) and will report back.

Thanks again,
Josh 

Josh Buonpane

May 10, 2023, 2:06:04 PM
to picrust-users
Hi Gavin,

It appears your suggestion was correct! Running it with sequences named simply ASV_##### worked, whereas including the taxonomy (separated by semicolons) in the sequence name tripped the same error as before. Thanks again for the assistance!!

Josh
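
For anyone applying the same fix: the IDs in the feature table have to be renamed to match the FASTA file. The sketch below assumes the BIOM table has first been exported to tab-separated text with the feature ID in the first column and header lines starting with '#'. The function name `rename_table_ids`, the second ASV row, and the counts are made up for illustration.

```python
import csv
import io
import re

# Assumption: short IDs look like ASV_13427.
ASV_ID = re.compile(r"ASV_\d+")


def rename_table_ids(tsv_text):
    """Shorten the first column of a tab-separated table to its ASV_<digits> ID."""
    seen = set()
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        # Leave blank lines and '#' header lines untouched.
        if row and not row[0].startswith("#"):
            match = ASV_ID.search(row[0])
            if match:
                row[0] = match.group(0)
            # Renaming must not collapse two features into one ID.
            if row[0] in seen:
                raise ValueError(f"duplicate ID after renaming: {row[0]}")
            seen.add(row[0])
        writer.writerow(row)
    return out.getvalue()


# Illustrative table; counts are placeholders.
example = (
    "#OTU ID\tsample1\tsample2\n"
    "ASV_13427;Bacteria;Proteobacteria;Alphaproteobacteria;NA;NA;NA;ASV_13427\t10\t0\n"
    "ASV_2;Bacteria;NA;NA;NA;NA;NA;ASV_2\t3\t7\n"
)
renamed_table = rename_table_ids(example)
print(renamed_table)
```

The duplicate check raises an error instead of silently merging rows, since duplicate IDs are what triggered the original problem.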