Re: How would RNA-star deal with mixed samples? i.e. a sample that has both human and bacterial reads?

1,704 views
Skip to first unread message

Alexander Dobin

unread,
Feb 20, 2013, 1:30:28 AM2/20/13
to rna-...@googlegroups.com
Hi @bob-loblaw,

I am re-posting my SEQanswers answer for the benefit of our group. We can continue this discussion on either forum. If you can tell us what is your the number of bacterial genomes you are using and the total size, we can estimate required RAM and the the best --genomeChrBinNbits parameter for your runs.

Cheers
Alex

We used STAR in two mixed-species settings: human + viruses (~4,500 viruses from NCBI) and human + mouse, and in both cases it worked well as far as we could tell. You have to keep an eye on RAM, since STAR would need ~10*TotalGenomeSize bytes of RAM (~50GB for human+mouse), and if you have a large number of small chromosomes/scaffolds/contigs, you would need to reduce --genomeChrBinNbits as I explained in the previous post.

For mixed-species mapping I strongly agree with other users who recommended mapping to a combined genome rather than a 2-step mapping. When a mapper aligns a read to combined genome it considers various possible alignments and picks the best one. This reduces the false positives and negatives for unique calls to each species. On the other hand, in the 2-step method you force alignments to one of the species in the 1st step, so you results will be strongly biased towards the 1st species.



On Tuesday, February 19, 2013 9:13:00 AM UTC-5, bob-loblaw wrote:
I have 100bp paired end RNA-Seq data from a mixed human/bacterial sample, I was wondering how RNA-Star would deal with this? Can it identify human reads with a high degree of accuracy from a mixed sample such as mine?

I have been using tophat and bowtie2 to map to human and bacterial database respectively, but there are millions of sequences that are identified as being human if tophat is run first, and bacterial if bowtie2 is run first and given how tophat works, running tophat with one large database of human and bacterial would return the same result as doing bowtie first and then tophat. 

Shawn Driscoll

unread,
May 26, 2013, 1:08:09 AM5/26/13
to rna-...@googlegroups.com
Just curious...what sort of false negative/positive rates do you get with the human/mouse mixed samples?

Alexander Dobin

unread,
May 29, 2013, 10:21:11 AM5/29/13
to rna-...@googlegroups.com
Hi Shawn,

the mouse/human mixed sample we did to study "short" RNA transfer (~20-100b). I am not sure what you would call "false negative" - I could imagine this being the number of reads that map equally well to mouse and human genome, so we cannot make a call which species the RNA came from. Since we studied transfer of small amounts of RNA from human to mouse, I cannot give you a good estimate of that number. In any case, for such short RNA reads we expect this number to be large. As to "false positives", in my definition those would be those reads that originate in one species but are mapped to the other. To control for that we mapped pure mouse RNA to the combined mouse/human genome, and surprisingly a large % of reads mapped uniquely to human genome, ~5-10%. Most of them were coming from a few highly expressed loci, and I think they may indicate some important sequences missing from the mouse genome assembly.

Cheers
Alex

Shawn Driscoll

unread,
May 29, 2013, 5:30:57 PM5/29/13
to rna-...@googlegroups.com
Interesting...sounds similar to what I've observed as well.  I've spent a bit of time with mixed samples.  I've found it useful to define the two species as s1 and s2. Then positives are alignments to s1 and negatives are alignments to s2. This gives me true positives as s1 reads aligned to s1 and true negatives as s2 reads aligned to s2. False negatives and positives are then reads from s1 or s2 aligned to the wrong species.  In addition I define two ambiguous rates (ratio of ambiguous reads from s1 to all s1 reads aligned and the same for s2 reads). Then there's the unaligned category which in a real experiment is mostly ignorable and in a simulation really has more to do with aligner benchmarking.

It's simpler to me to align to the two genomes separately and then use a post processing comparison between the two BAMs excluding all secondary alignments. This way you don't miss those ambiguous alignments (those that score equally well to either species). It seems to me that if you align to a merged reference (I think I tried that method) the aligner will end up randomly assigning those reads to the two species when those reads should actually go into the ambiguous pile.  I'm pretty sure I even tried using eXpress or RSEM to see if their algorithm could sort the reads but it didn't work nearly as well as the simple bam vs bam comparison.

Alexander Dobin

unread,
Jun 3, 2013, 4:15:39 PM6/3/13
to rna-...@googlegroups.com
Hi Shawn,

if  you align to the merged reference, STAR will report all alignment of the similar quality - so the reads mapping to both references equally well will be reported as multi-mappers.

Cheers
Alex

Shawn Driscoll

unread,
Jun 4, 2013, 3:43:54 PM6/4/13
to rna-...@googlegroups.com
Right - but that can happen even with single species samples so wouldn't those mixed multi-mappers be hard to tell apart from the single species multi-mappers?  I'm maybe a bit of an oddball and I like to retain even the multi-mapping reads.  Maybe it's because of my background in dealing with repeat elements.

Alexander Dobin

unread,
Jun 12, 2013, 5:43:43 PM6/12/13
to rna-...@googlegroups.com
For the multi-mapping reads, you could of course go through all their alignments, and if all of them belong to one species, you retain them as single-species multi-mappers. Since STAR outputs all alignments of a read in consecutive lines, it should be very easy to do.

Shawn Driscoll

unread,
Jun 21, 2013, 12:47:08 AM6/21/13
to rna-...@googlegroups.com
Ah! good point.  I usually don't work with outputs that include all alignments so I didn't think about that.

martin....@gmail.com

unread,
Feb 3, 2015, 9:39:18 AM2/3/15
to rna-...@googlegroups.com
Hi Alex,

Any pointers on how one should parameterize STAR for mapping mixed mouse+bacteria (the bacteria is known) samples to a combined genome given the difference in splicing?

Thanks in advance!

Alexander Dobin

unread,
Feb 5, 2015, 5:57:44 PM2/5/15
to rna-...@googlegroups.com
Hi Martin,

I would use the same parameters you are using for the normal mouse mapping, since you cannot separately control the parameters for different "chromosomes". Any splice junctions you will see in the bacterial genome will be either mapping or library prep artifacts.

Cheers
Alex

Jing Yang

unread,
Apr 13, 2015, 2:19:38 PM4/13/15
to rna-...@googlegroups.com
Hi Alex,

If the reference genome is built by combing human and mouse genome. How should I create the index file with the annotation files? Should I combine two fasta files directly and two gtf files directly? Or I need two build a total new gtf file? 

Alexander Dobin

unread,
Apr 14, 2015, 6:18:27 PM4/14/15
to rna-...@googlegroups.com
Hi Jing,

you need to make sure that chromosome names are different for the two species in the FASTA files, and the transcript names are different in the GTF files. The chromosome naming should be consistent between FASTA and GTF files for each species.
After that, the FASTA files can be concatenated or simply listed one after another (space-separated) in the --genomeFastaFiles
The GTF files have to be concatenated into one GTF file.

Cheers
Alex

jing.y...@gmail.com

unread,
Apr 25, 2015, 2:54:36 PM4/25/15
to rna-...@googlegroups.com
Hi Alex,

Thank you very much! Now the work is running!

Jing 

Sent from my iPhone
--
You received this message because you are subscribed to a topic in the Google Groups "rna-star" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/rna-star/TqOdXiEFYrI/unsubscribe.
To unsubscribe from this group and all its topics, send an email to rna-star+u...@googlegroups.com.
Visit this group at http://groups.google.com/group/rna-star.

Zhuofei Xu

unread,
Jul 29, 2015, 4:40:07 PM7/29/15
to rna-star, ado...@gmail.com
Hi Alex,

I have a similar question. My samples are mixed human + a virus species (the complete genome is available, ~200 kb). I need to get read counts per gene for human and virus, respectively. I think I could use normal approach for read counting of human genes. For read counting of virus genes, I'm wondering how to remove the spliced reads in the viral genome in the bam file before counting reads?

Thank you in advance!

Zhuofei

LinkedIn
KU Leuven



On Thursday, February 5, 2015 at 11:57:44 PM UTC+1, Alexander Dobin wrote:

Alexander Dobin

unread,
Jul 30, 2015, 9:01:47 AM7/30/15
to rna-star, zhuof...@gmail.com, zhuof...@gmail.com
Hi Zhuofei,

to remove the spliced alignments only from certain chromosomes, you would need to check the SAM's CIGAR for the N symbol which denotes the splice junctions.
Note, that because of this filtering the multi-mapping status of some reads may change, and you would need to account for that.

Cheers
Alex

Zhuofei Xu

unread,
Jul 30, 2015, 9:23:30 AM7/30/15
to Alexander Dobin, rna-star
Thanks a lot, Alex! I got it.

Best,
Zhuofei

Salva Casaní

unread,
Nov 23, 2015, 10:26:24 AM11/23/15
to rna-star
Hi, 

I have a soil metagenomics and metatranscriptomics sample. As I built the metagenomics reference genome and want to map the metatranscriptomics reads to it to get the annotation of the transcriptomics reads.

I thought to do it with star, but as the reference contigs are so variable in length (100pb - 153,379pb) maybe it is not the best idea (I have a total of 973,989 contigs). I have tried to generate the genome index with a range of --genomeChrBinNbits, from 4 to 15, but I always w¡get the core dumped error when the mapping starts. 

Thanks

Salva

Alexander Dobin

unread,
Nov 23, 2015, 3:10:20 PM11/23/15
to rna-star
Hi Salva,

please also try to reduce --genomeSAindexNbases.
If this does not work, please send me the Log.out file of the genome generation step and failed mapping run.

Cheers
Alex

Salva Casaní

unread,
Nov 24, 2015, 6:59:46 AM11/24/15
to rna-star
Hi Alex, 

Thanks for your attention. I fixed --genomeChrBinNbits to 6 and tried a range of --genomeSAindexNbases from 10 to 6. It gave me a Segmentation fault (core dump) error in every case. Please, find attached the corresponding log files for the genome generation and mapping with the two above mentioned options set to 6. 

Regards, 

Salva

https://mega.nz/#!NMcEzQbK!e0Kxu5OPOZgmVtv4cKED_cZbd5v6DrGffZfMUhvX5CE
https://mega.nz/#!MFNixACL!nEFJVF84VkOKb9EJwNcoKI--HZMTgVom99qebuowhfw

Alexander Dobin

unread,
Nov 24, 2015, 1:17:18 PM11/24/15
to rna-star
Hi Salva,

could you please try the latest patch from GitHub master:
both for genome generation and mapping. I have fixed a bug there which might be related to the one you are seeing.

Cheers
Alex

Salva Casaní

unread,
Nov 26, 2015, 7:37:52 AM11/26/15
to rna-star
Thanks a lot, it looks like it is running, but as the SA preffix is so small and my genome so big it may take a while.

Regards, 
Salva

Alexander Dobin

unread,
Nov 30, 2015, 11:57:17 AM11/30/15
to rna-star
Hi Salva,

the main reason it runs slow is the large number of reference contigs.
One way to speed it up is to make one (or a few, <50,000) super-contigs by concatenating the short contigs with some N-padding between them.
After the mapping to the supercontig is done, you would have to transform the super-contig loci into local conting coordinates (and discard reads that were spliced between contigs, if any).

Cheers
Alex

George Tollefson

unread,
Jun 22, 2018, 4:49:15 PM6/22/18
to rna-...@googlegroups.com
Hi Alex,

I see this is an old thread but don't see a newer one anywhere. I have a follow up question. I am following your instructions to generate a combined two species genome index file. My question is: once I have mapped my mixed sample rna reads to this combined genome, what is the best way to separate reads by species into new BAM files for future analysis. Does a tool exist? Will I have to write a script to split and save a new BAM file where the mapped reads for the second species begin?

I should mention that I intend to analyze these resulting separated BAM files in HTSeq then DeSeq2.

Thanks,
George 
Message has been deleted

Amit Indap

unread,
Jun 22, 2018, 5:02:27 PM6/22/18
to George Tollefson, rna-star
You can use samtools view to separate out by listing the chromosome/contig names you want to write out to file

See this thread

samtools view -b in.bam chr1  > in_chr1.bam

On Fri, Jun 22, 2018 at 1:52 PM, George Tollefson <gatol...@gmail.com> wrote:
I should mention that I intend to analyze these resulting separated BAM files in HTSeq then DeSeq2.


On Friday, June 22, 2018 at 4:49:15 PM UTC-4, George Tollefson wrote:
Hi Alex,

I see this is an old thread but don't see a newer one anywhere. I have a follow up question. I am following your instructions to generate a combined two species genome index file. My question is: once I have mapped my mixed sample rna reads to this combined genome, what is the best way to separate reads by species into new BAM files for future analysis. Does a tool exist? Will I have to write a script to split and save a new BAM file where the mapped reads for the second species begin?

Thanks,
George 

On Tuesday, April 14, 2015 at 6:18:27 PM UTC-4, Alexander Dobin wrote:
Hi Jing,

you need to make sure that chromosome names are different for the two species in the FASTA files, and the transcript names are different in the GTF files. The chromosome naming should be consistent between FASTA and GTF files for each species.
After that, the FASTA files can be concatenated or simply listed one after another (space-separated) in the --genomeFastaFiles
The GTF files have to be concatenated into one GTF file.

Cheers
Alex

On Monday, April 13, 2015 at 2:19:38 PM UTC-4, Jing Yang wrote:
Hi Alex,

If the reference genome is built by combing human and mouse genome. How should I create the index file with the annotation files? Should I combine two fasta files directly and two gtf files directly? Or I need two build a total new gtf file? 

On Thursday, February 5, 2015 at 11:57:44 PM UTC+1, Alexander Dobin wrote:
Hi Martin,

I would use the same parameters you are using for the normal mouse mapping, since you cannot separately control the parameters for different "chromosomes". Any splice junctions you will see in the bacterial genome will be either mapping or library prep artifacts.

Cheers
Alex


On Tuesday, February 3, 2015 at 9:39:18 AM UTC-5, martin....@gmail.com wrote:
Hi Alex,

Any pointers on how one should parameterize STAR for mapping mixed mouse+bacteria (the bacteria is known) samples to a combined genome given the difference in splicing?

Thanks in advance!

--
You received this message because you are subscribed to the Google Groups "rna-star" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rna-star+unsubscribe@googlegroups.com.
Visit this group at https://groups.google.com/group/rna-star.

Reply all
Reply to author
Forward
0 new messages