Mapping single ends from paired end data

403 views
Skip to first unread message

Anthony Orvedahl

unread,
Jul 6, 2016, 12:10:54 PM7/6/16
to rna-star, chand...@gmail.com
Hi Alex (and anyone else),

I have a couple of questions about mapping a single reads from paired end data (for example if one read maps more poorly to different locus, is too short after QC trimming, etc). I've found three useful posts in this group here, with slightly different options that I'm trying to reconcile (copied your responses in quotes out of context below for others' reference):

"you could get single-end alignments by reducing both --outFilterScoreMinOverLread  --outFilterMatchNminOverLread to below 0.5, say to 0.4.
These parameters control the minimum read length normalized to the total read length (sum of the mates).
To output the unmapped read, you will need to use --outSAMunmapped Within , the unmapped mate will have the coordinate of the mapped one according to the standard SAM rules."

"By default, at least ~2/3 of the read length (sum of both mates) has to be mapped, which for untrimmed reads of equal length requires paired-end alignments.
However, if one of the reads is trimmed significantly before mapping, a single-end alignment may be allowed.
For instance, if you have 2x100 and the 2nd mate was trimmed to 20 bases, than an alignment longer than ~2/3*(100+20)=80 will be accepted.
To ensure that you only get paired alignments you can specify --outFilterMatchNmin <minMateLength+1> (e.g. 101 for 2x100)."

"you could get single-end alignments by reducing both --outFilterScoreMinOverLread  --outFilterMatchNminOverLread to below 0.5, say to 0.4.  This will work if your mates have the same length, if not, you need to set
--outFilterScoreMinOverLread  0 --outFilterMatchNminOverLread 0 --outFilterScoreMin <minMateLength*0.8>"

My experiment/processing steps: I have 2x101 HiSeq 2500 data with a polyA-enriched library using custom inline barcodes/oligo-dT adapters. R2 has the barcode and poly-T tract, which I force trim from the left (using bbduk, to 78bp). R2 is in general, of poorer quality than R1 (presumably due to polyT tract interference on the sequencer), so I then QC trim the right to varying lengths (with bbduk Q10). I trim R1 for read-through barcodes and poly-A tracts. This results in read pairs of different lengths, and some R2 of less than 25bp which I would like to discard (ignore). However, I keep the read files the same number of lines and ordered for paired end analysis in STAR. Is there a way to ignore/filter the a mate that might have low mapping scores and/or is too short?

In my case, would I want to use: 

--outFilterScoreMinOverLread and --outFilterMatchNminOverLread ~0.4 
--outFilterMatchNmin <minMateLength+1> (e.g. 79 for my trimmed R2) (in which case should --outFilterScoreMinOverLread and --outFilterMatchNminOverLread be 0 or 0.4?)

or 
--outFilterScoreMinOverLread  0 --outFilterMatchNminOverLread 0 --outFilterMatchNmin <minMateLength*0.8> (e.g. 78*0.8=62)?

or
--<none of the above, I don't understand what these mean>

Thanks,

Anthony

Alexander Dobin

unread,
Jul 6, 2016, 5:37:43 PM7/6/16
to rna-star, chand...@gmail.com
Hi Anthony,

if you want to discard reads with R2<25b, it's probably best to do it before mapping. I am not sure if bbduk has this option, however, it should not be difficult to write a script that will exclude short reads from one fastq file and simultaneously the corresponding read from the other file. Of course, you can also do it after mapping processing the BAM file - it's easier to do if the BAM is unsorted.
At the moment, STAR does not have options to control the read length of each mate - it only controls the total length.

In your case, I believe going with the default options --outFilterScoreMinOverLread 0.66  --outFilterMatchNminOverLread 0.66 .
It will not enforce the PE alignments and will allow some single-end alignments that are >66% long of the total read length.
For instance, if you have R1=101 and R2=29, the min mapped length will be ~0.66*(101+29)=85.8, so an alignment of R1=86 and R2=0 will be allowed,
as well as R1=56 and R2=30.

If you set these parameters to <0.5, you would be allowing even shorter alignments which are more likely to be mis-aligned.

--outFilterScoreMinOverLread 0.66  --outFilterMatchNminOverLread 0.66 --outFilterMatchNmin 102 will enforce that at least some portions of both mates are mapped AND 66% of the total read length is mapped.

--outFilterMatchNmin 0 --outFilterMatchNminOverLread 0 --outFilterMatchNmin 80 will allow short alignments even for long reads, e.g. R1=101 R2=101, and alignment with R1=80 R2=0 is allowed.

Cheers
Alex

Anthony Orvedahl

unread,
Jul 7, 2016, 11:31:42 PM7/7/16
to rna-star, chand...@gmail.com
Alex,

Thanks very much for your prompt and detailed response. I think I understand but wanted to clarify a couple of points. 

When I changed parameters from default (0.66) to  --outFilterScoreMinOverLread 0.4 –outFilterMatchNminOverLread 0.4, my %  unique reads went from ~68% to ~83%. This was without any trimming at all, with average input read length of 202. This suggests to me that there is a discrepancy in mapping between ~15% of my read pairs. Is this an accurate interpretation? After trimming, my average reads are about 160 (I might have overemphasized the concern that one read would be <25bp), so it seems like keeping the MinOverLread at 0.4 might not be problematic (even if both reads were as short as ~80bp, the allowable length would still be 64bp per read).

Can you please explain what you mean by:
"It will not enforce the PE alignments and will allow some single-end alignments that are >66% long of the total read length".
Does this apply even to read pairs whose sum is >0.66 to total length (202*0.66=133)? In this case it seems that it would enforce PE mapping. 

My main concern is that using --outFilterScoreMinOverLread 0.4 –outFilterMatchNminOverLread 0.4 increased the # of uniquely mapping reads even before trimming. It seems like enforcing paired end mapping would be throwing out a significant proportion of usable single end reads in my dataset.
If this is the case, wouldn't --outFilterMatchNmin 0 --outFilterMatchNminOverLread 0 --outFilterMatchNmin 80 be a good option?

Thanks,

Anthony

Alexander Dobin

unread,
Jul 11, 2016, 6:06:24 PM7/11/16
to rna-star, chand...@gmail.com
Hi Anthony,


When I changed parameters from default (0.66) to  --outFilterScoreMinOverLread 0.4 –outFilterMatchNminOverLread 0.4, my %  unique reads went from ~68% to ~83%. This was without any trimming at all, with average input read length of 202. This suggests to me that there is a discrepancy in mapping between ~15% of my read pairs. Is this an accurate interpretation? After trimming, my average reads are about 160 (I might have overemphasized the concern that one read would be <25bp), so it seems like keeping the MinOverLread at 0.4 might not be problematic (even if both reads were as short as ~80bp, the allowable length would still be 64bp per read).


This difference is not just SE vs PE, but, in general, is caused by shorter allowed mapped length. For instance, for 2x101 reads, r1=41 and r2=41 alignment would be allowed.
The main questions is - after trimming of the adapter, what are the unmapped portions of the read? Allowing 60% of the sequence to be unmapped will result  in increased mis-mapping rate.
 
Can you please explain what you mean by:
"It will not enforce the PE alignments and will allow some single-end alignments that are >66% long of the total read length".
Does this apply even to read pairs whose sum is >0.66 to total length (202*0.66=133)? In this case it seems that it would enforce PE mapping. 


If the reads have the same length (no trimming), then the default --outFilterScoreMinOverLread 0.66  --outFilterMatchNminOverLread 0.66 parameters will enforce PE alignments.
However, if after trimming one of the reads gets too short, the SE alignment my become possible: for instance, if you have R1=101 and R2=29, the min mapped length will be ~0.66*(101+29)=85.8, so a SE alignment of R1=86 and R2=0 will be allowed.
 
My main concern is that using --outFilterScoreMinOverLread 0.4 –outFilterMatchNminOverLread 0.4 increased the # of uniquely mapping reads even before trimming.

This is normal, since STAR effectively does the trimming for you.
 
It seems like enforcing paired end mapping would be throwing out a significant proportion of usable single end reads in my dataset.
If this is the case, wouldn't --outFilterMatchNmin 0 --outFilterMatchNminOverLread 0 --outFilterMatchNmin 80 be a good option?


Only if you trim aggessively (i.e. remove any expcted non-genomic sequences from the reads), I would recommend enforcing long enough mapped length (e.g. --outFilterScoreMinOverLread 0.66  --outFilterMatchNminOverLread 0.66).
If you do not trim, it's probably better to go with very relaxed --outFilterMatchNmin --outFilterMatchNminOverLread  --outFilterMatchNmin filters, and then filter the results according to your needs.
Reply all
Reply to author
Forward
0 new messages