Extracting loci from multiple reference-based assemblies

Mattia De Vivo

unread,

Jan 9, 2025, 4:49:53 AM1/9/25

to Stacks

Hello everybody,
I hope my question will not be too confusing.

From what is my understanding, Stacks accept at most one potential reference-based alignment for the ref_map.pl. So it would become something like:

>ref1_gene1
(seq)

>spec1_gene1
(seq)

>spec2_gene1
(seq)

And then you may go with a second reference like:

>ref2_gene1
(seq)

>spec1_gene1
(seq)

>spec2_gene1
(seq)

Now, would it be possible to run an alignment with multiple references? Something like:

>ref1_gene1
(seq)

>ref2_gene1
(seq)

>spec1_gene1
(seq)

>spec2_gene1
(seq)

Or is it any way to do it with an external tool (e.g., bowtie or similar)?

Thanks in advance for the answer! Let me know if my question is not written clearly.

Best wishes,
Mattia

Catchen, Julian

unread,

Jan 9, 2025, 12:20:38 PM1/9/25

to stacks...@googlegroups.com

Hi Mattia,

Stacks is not aware of what the collection of sequences are that you want to align against. If you want to combine components of different references together into a FASTA file you can. Stacks will take the output of the aligner (e.g. BWA) and assemble the RAD loci based on the alignment positions the aligner specified. I’m not sure why you specify different ‘genes’ in your example, unless you are trying to only align against the protein coding genes… however, if you combine a bunch of similar sequences together, from different related species or assemblies, you are not necessarily going to get a better result, since the aligner will likely find multiple alignments for different reads and the software has no way to sort this out. Typically, if you want to do a multi-species comparison, you need to pick one reference to use as the base and adjust your alignment parameters to make sure a sufficient number of reads from the non-native species are aligning properly. Alternatively, you can do a de novo assembly first and align the assembled loci to the different references afterwards – it depends on what your analysis goals are.

Best,

Julian

Mattia De Vivo

unread,

Jan 10, 2025, 5:19:09 AM1/10/25

to Stacks

Dear Prof. Catchen,

thank you very much for your answer! I noticed I omited way too much details in my question, so I will try to explain more.

I was extracting mitochondrial genes from public RNA SRAs; such SRAs were labeled as belonging to a single animal species. I noticed, by using BLAST, that the COXIs from such SRAs actually match two different species from the same genus, with e-values of 0 and similarities ranging from 98 to 100%. I also run a phylogenetic tree with IQTree of such sequences (the GenBank COXIs of the two species and the ones I got from public SRAs; I repeat that the SRAs one should belong to something different, albeit from the same genus of the other two) and the tree clearly shows me that I have two species instead of three; however, this is only with a single gene and something like around 350 bps.

I have ddRAD data from one of the two species with GenBank data and I was thinking about assembling the ddRAD data together with the transcriptomic ones and check if the COXI results matches the ones with more genes (how many species will I get? That's basically what I want to see). I admit I do not have a good idea on how to do such thing (should I do ref_map with all the references separately and then try merge the results together? If yes, how to merge them? Or should I just map the ddRAD data to the transcriptomes and then make a big alignment to use for other softwares, e.g. STRUCTURE or adegenet?) and that's why I asked. The gene thing was just an example, albeit I will get exclusively protein-coding genes, since I have transcriptomic data.

I hope I was clear enough and I am sorry if there is still something unclear.

Best wishes,

Mattia

Catchen, Julian

unread,

Jan 10, 2025, 2:49:35 PM1/10/25

to stacks...@googlegroups.com

Hi Mattia,

I would take the GenBank sequences and place them all in a single FASTA file and make a reference index out of them, then align my ddRAD data to that with BWA. You will likely have your ddRAD reads aligning to multiple loci in the ‘reference’ but it should mark the best alignment as ‘primary’ and the other alignments as ‘secondary’. Stacks would then proceed with the primary alignments for assembling your raw ddRAD data into loci. After that, you could feed them into structure or PCA, or similar. If genetic variation is very low, the aligner may not be able to distinguish between mt genes of different, but closely related species. You might play around with samtools after the alignment is done to see for the other mt genes, based on the RNAseq data, if your ddRAD data for that gene goes to more than one species (indicating the aligner can’t distinguish) or if you always get a good species delineation from the ddRAD data.

Cheers,

Julian

--
Stacks website: http://catchenlab.life.illinois.edu/stacks/
---
You received this message because you are subscribed to the Google Groups "Stacks" group.
To unsubscribe from this group and stop receiving emails from it, send an email to stacks-users...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/stacks-users/b9aecabe-aa24-4f02-a97f-9aa67dd10eeen%40googlegroups.com.

Reply all

Reply to author

Forward