STAR usage with metagenomes

612 views
Skip to first unread message

S. Ghignone

unread,
Apr 19, 2016, 2:32:24 PM4/19/16
to rna-star
Hi all,
I'm try to 'improperly' use STAR for metatranscriptomic studies.
The case study is the metatranscriptome of the rhizophere of Tomato roots: the main aim is to isolate those sequences not belonging to the Tomato genome: bacteria, fungi, other eukaryots....

To tweak the STAR optimal settings, I've prepared an artificial RNAseq subset of a million reads, of which 750k belong to Tomato and 250k belong to a fungus, that may occur in the soil.

I've used these commands:

#indexing
STAR --runMode genomeGenerate --runThreadN 50 --genomeDir $HOME/DB/Tomato_genome/STAR/ --genomeFastaFiles $HOME/DB/Tomato_genome/S_lycopersicum_chromosomes.2.50.fa --sjdbOverhang 100 --sjdbGTFfile $HOME/DB/Tomato_genome/ITAG2.4_gene_models.gff3 --sjdbGTFtagExonParentTranscript Parent

#mapping.
STAR --runMode alignReads --runThreadN 50 --readFilesIn $HOME/Data/Aligners_test/Tomato_Gigaspora.fastq --genomeDir $HOME/DB/Tomato_genome/STAR/ --outSAMtype BAM Unsorted --outFilterMultimapNmax 1000 --alignEndsType EndToEnd --outFileNamePrefix $HOME/Data/Aligners_test/Outputs/STAR/Tomato_Gigaspora_1_ --alignIntronMin 40 --alignIntronMax 23000 --outReadsUnmapped Fastx --outFilterScoreMinOverLread 0 --outFilterMatchNminOverLread 0


With these settings, I've found that many fungal reads have been mapped on the Tomato genome, even if with rather high mismatch rate, in comparison with those supposed to belong to Tomato, even if the cultivar/variety I use is not the same that has been sequenced (so I have to accept somewhat minimal mismatches).

Please give a look at pictures: samtools tview of some examples, showing the fungal reads mapped against Tomato genome. They are named with NS500140.
Fig1 - alignment with sequence reads names
Fig2 - alignment with nt view. A) accepted mismatches; B) fungal reads.

The question is: which other parameters (with optimized values) I have to consider to get rid of this situation and collect the unmapped reads in the --outReadsUnmapped Fastx file?

Thanks,

Regards
S.










Fig1.png
Fig2.png

Alexander Dobin

unread,
Apr 20, 2016, 3:48:23 PM4/20/16
to rna-star
Hi Stefano,

if you knew the genomes of your extra species, the best solution would be to include them into the reference. There are a number of posts that discuss it.
Since you talking about meta-transcriptomics, I guess in your case the species and their genomes are unknown, and you would need to increase the stringency of the mapping filters.
In particular, you may want to reduce the max number of mismatches 
  outFilterMismatchNmax [10]
  outFilterMismatchNoverLmax [0.3]
  outFilterMismatchNoverReadLmax [1.0]
and the minimum mapped length allowed:
  outFilterScoreMin [0]
  outFilterScoreMinOverLread [0.66]
  outFilterMatchNmin [0]
  outFilterMatchNminOverLread [0.66]
All these filters are applied in the AND fashion, i.e. the most stringent filters wins.
By default, <=10 mismatches are allowed, and ~2/3 of the read length has to be mapped.

For instance, you can use --outFilterMismatchNmax 3 --outFilterScoreMinOverLread 0.9 --outFilterMatchNminOverLread 0
to allow <=3 mismatches and 90% of the read score( ~length) mapped.
The real question is how to choose these filtering parameters - which reads are mapped (not) good enough to the host genome?
It depends on the sequencing quality and sequence similarity between the species.
It's hard to imagine a very rigorous procedure for that. 
I guess you could explore the tails of the distribution of mismatches and mapped score/length, and set the thresholds at certain percentiles of reads mapped.
Or maybe even fit the distribution into a certain statistical model and see at which point the model deviates from the data.

Cheers
Alex

mchi...@unito.it

unread,
Apr 21, 2016, 12:41:23 PM4/21/16
to rna-star
Dear All,
I'm really interested in this topic since I'm experiencing that bowtie-based tools have problems to map a decent number of metatranscriptome reads (I'm working with 50 bp SE Illumina) to NCBI Nt-based reference databases. Do you think it would be reasonable in term of time and performance to build Nt STAR index to map reads using the above mentioned settings?
Regards,
Matteo

Alexander Dobin

unread,
Apr 22, 2016, 4:42:13 PM4/22/16
to rna-star
Hi Matteo,

mapping to the entire nt reference (>40GB last time I checked) is a serious undertaking, but it's doable - I have done it a few years ago.

First of all, it would require quite a lot of RAM: (1+9/genomeSAsparseD)*GenomeSize. I used 220GB of RAM for 40GB of sequence and --genomeSAsparseD 2 .
You can further increase --genomeSAsparseD to reduce RAM to <100GB at some trade-off with speed and accuracy.

Second, you would need to concatenate all sequence into one huge super-reference, and after mapping convert the coordinates back to separate short  references.

If you are willing to try this, I can send you the parameters I used.

Cheers
Alex

Matteo Chialva

unread,
Apr 26, 2016, 5:59:56 AM4/26/16
to rna-star
Dear Alexander,
thanks a lot for your reply. I think I'll try this task with the server I'm using, 128GB RAM 64 cores. I already have the super reference and the splitted ones since I tried to use a bowtie2-based tool. Even if it will be difficult to index the whole reference I will try on the splitted .fasta which are about 4 Gb of disk space also because I only need .sam alignments and not counts.
Which other parameters would you suggest me to set during the mapping step?
Thanks again!
Matteo

Alexander Dobin

unread,
Apr 27, 2016, 3:30:24 PM4/27/16
to rna-star
Hi Matteo,

with 128GB, you should be able to generate the index for the whole nt sequence with --genomeSAsparseD 8 --genomeChrBinNbits 22 .
You would need to concatenate all the short nt references into one super-sequence.

For mapping, I would recommend:
--outFilterMultimapNmax 100000 --winAnchorMultimapNmax 2000 --alignTranscriptsPerReadNmax 100000 --seedPerWindowNmax 1000  --seedNoneLociPerWindow 100 --alignWindowsPerReadNmax 20000 --alignTranscriptsPerWindowNmax 1000 

Cheers
Alex

Matteo Chialva

unread,
Apr 29, 2016, 9:19:34 AM4/29/16
to rna-star
Dear Alexander,

thanks again for your advices. Anyway, I can't understand why I need to concatenate my database in a single super-sequence and converting after mapping coordinates back to the short references: is this a problem of sequence index generation? Actually, I'm not interested in STAR reads counting mode since I'm doing taxonomic assignments using Taxoner program which performs nearest neighbor analysis computing a local common ancestor for each read. This program only need .sam files with gi reference number.

Furthermore, I'm not interested to search for splice junction since Nt already contain transcript sequences. For these reason I would set --alignIntronMax 1

As a second point I also found this parameter '--limitGenomeGenerateRAM': can I set a maximum RAM usage suitable for my system, keeping --genomeSAsparseD to default value.

Do you think this is a doable way?
Thanks a lot for your support!

Best,
Matteo

Alexander Dobin

unread,
Apr 29, 2016, 3:41:00 PM4/29/16
to rna-star
Hi Matteo,

the full nt database contains many millions of references. Presently, STAR algorithm cannot deal efficiently with such a large number of reference, you will, most likely, see significant decrease in the speed when the number of references is > 100,000. If you want to try it without concatenation, you would have to reduce --genomeChrBinNbits to log2(GenomeLength/NumberOfReferences) at the genome generation stage.

I just had an idea of a hack that you can use to avoid writing your own code to convert coordinates from concatenated reference to the original short references.
While you are concatenating the short references, you can also create a GTF file with "exons" corresponding to each short reference (i.e. the short refernece strart/end coordinates in the concatenated reference), and transcript_id being the "giXXX" name of the reference. You will use this GTF file at the genome generation step (with --sjdbGTFfile), and at the mapping stage use --quantMode TranscriptomeSAM --outSAMtype None .
This will output Aligned.toTranscriptome.out.bam file with the alignment coordinates transformed to the original short reference coordinates.

--alignIntronMax 1 is indeed needed to prevent splicing


--limitGenomeGenerateRAM is an advisory parameter telling STAR how much RAM it could use, it will not enforce the size of the genome index, you have to set --genomeSAsparseD .

Cheers
Alex

Ivan Molineris

unread,
Feb 17, 2024, 4:35:36 AMFeb 17
to rna-star

Dear Alexander and Matteo,

I'm resuming this old thread because I was wondering if it is feasible to use STAR to align on the full refseq (if not nr) database.

Have you used STAR on such huge databases to do metatranscriptomics? I'm primarily interested in taxonomic identification, rather than precise gene expression estimation. Do you think that is a useful approach or is it better to use other tools?

If you think that is useful I kindly ask you some advice about parameter setting, both in index generation and alignment.

E.g. how to deal with low complexity/repeated sequences: need to be masked?

Thanks for any advices

Alexander Dobin

unread,
Feb 23, 2024, 3:37:59 PMFeb 23
to rna-star
Hi Ivan,

I would recommend using STAR for huge databases; there are metagenomic tools out there that should be more efficient.

Ivan Molineris

unread,
Feb 25, 2024, 10:53:30 AMFeb 25
to Alexander Dobin, rna-star
Thanks Alex for the reply.
What intrigues me is that, often, authors of metagenomic and metatranscriptomic tools say in their papers that bowtie or other aligners are more accurate than their tools (even if much less efficient), for this reason I was reasoning about the application of STAR for such tasks in delicate situations.

Anyhow, is there any tool you will particularly suggest using for such kinds of applications?

Thank you again.


--
You received this message because you are subscribed to a topic in the Google Groups "rna-star" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/rna-star/R2A352vfhlI/unsubscribe.
To unsubscribe from this group and all its topics, send an email to rna-star+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/rna-star/3e13d1d9-d089-4f9f-93fd-4abd93942bfen%40googlegroups.com.
Reply all
Reply to author
Forward
0 new messages