Aligning mate pairs where one read is very short (10bp)

140 views
Skip to first unread message
Assigned to ado...@gmail.com by me

bj...@cornell.edu

unread,
May 18, 2017, 4:00:26 PM5/18/17
to rna-star
I am having trouble getting some paired end reads to align properly. The reads are 58b (read1) + 10bp (read2). I realize that a 10bp read would be difficult to map, especially if there is a large gap between read1 and read2 or if read2 contains a splice junction. However, since I know that read pairs should map concordantly with a gap between reads no larger than 150b and it is OK if read2 doesn't align if it is spliced, I was guessing that the 10bp read should still be align-able with the right parameters. So far, I have not been able to get the 10bp reads to align, though the 58bp reads do align (the 58bp reads are really 65 but I am using the parameter to --clip5pNbases 7 0 ). Here is the command I have been using for alignment:

STAR --genomeDir ../STAR_GenomeDirectory --readFilesIn ./65bpReads1.fastq ./10bpReads2.fastq --alignIntronMin 10 --alignIntronMax 1000 --alignMatesGapMax 200 --alignEndsType EndToEnd --clip5pNbases 7 0 --outSAMattributes All --runThreadN 8 --outSAMunmapped Within

Any suggestions?
Thank you so much!

bj...@cornell.edu

unread,
May 18, 2017, 4:31:37 PM5/18/17
to rna-star
Update:
I have found that mapping 58b + 12b PE reads map pretty well for most all reads, but it is 58b + 11b do not, with the settings I posted

Alexander Dobin

unread,
May 18, 2017, 4:42:29 PM5/18/17
to rna-star
Hi @bjf79

you can try to play with the parameter --seedMultimapNmax . It determines the maximum number of loci the seed can map to.
By default it's 10000. My guess is that most 12-mers in your genome map to <=10000 loci, while 11-mers and 10-mers map to >10000.
As you reduce the length of the seed (i.e. the short mate), you need to increase this parameter by a factor of 4.
So for 10-mers, you would need --seedMultimapNmax 160000 or more. As you keep increasing this parameter, the mapping speed will suffer.

This is an interesting experiment, please let me know how it goes.

Cheers
Alex

bj...@cornell.edu

unread,
May 19, 2017, 1:29:46 PM5/19/17
to rna-star
Thanks for the swift reply. I had no luck with --seedMultimapNmax 160000. When I tried that parameter, it was still the case that 58b + 12b PE reads mapped with nearly 100% mapping rate for both read pairs, but 58b + 11b PE reads maps only the 58b read and never (or almost never, I didn't check thoroughly) the 11b read.

I tried some increasing some other seed as well settings (though admittedly I don't know have much understanding of what exactly these settings do) and got the same result: 58b + 12b PE reads map fine but not 58b + 11b. These are the settings I've tried most recently:

STAR --genomeDir ../STAR_GenomeDirectory --readFilesIn ./65bpReads1.fastq ./11bpReads2.fastq --alignIntronMin 10 --alignIntronMax 1000 --alignMatesGapMax 150 --alignEndsType EndToEnd --clip5pNbases 7 0 --outSAMattributes All --runThreadN 8 --outSAMunmapped Within --seedMultimapNmax 160000 --seedPerReadNmax 10000 --seedPerWindowNmax 500 --seedNoneLociPerWindow 50 --alignSplicedMateMapLmin 12


-Ben


On Thursday, May 18, 2017 at 4:00:26 PM UTC-4, bj...@cornell.edu wrote:

Alexander Dobin

unread,
May 19, 2017, 3:36:45 PM5/19/17
to rna-star
Hi Ben,

could you please send a few reads which map with 12-mers but not with 11-mers, as well as links to genome FASTA and GTF.
I will try to figure out why it's not working.

Cheers
Alex

bj...@cornell.edu

unread,
May 19, 2017, 4:12:14 PM5/19/17
to rna-star
The fasta and gtf I used is the most current Ensembl build of the yeast genome which I downloaded from here:

I am attaching the reads I used here. They were actually made with a read simulator program, namely wgsim with the following parameters:

wgsim -N 1000 -d 150 -s 25 -1 65 -2 12 -e 0.001 ./mygenome.fa ./65bpReads1.fastq ./12bpReads2.fastq

On Thursday, May 18, 2017 at 4:00:26 PM UTC-4, bj...@cornell.edu wrote:

Alexander Dobin

unread,
Jun 22, 2017, 5:21:33 PM6/22/17
to rna-star
Hi Ben,

sorry for the delay, I just got around to fix this problem. There was actually a hard-coded parameter that limited the mate size to 12nt.
I have coded a new option --seedSplitMin that you can set to smaller values, e.g. --seedSplitMin 10 . This would map mates 10 and longer.
Note, that there is yet another hardcode limit at 5, if you need to go lower than that, please let me know.

Cheers
Alex

Benjamin Fair

unread,
Jun 22, 2017, 5:34:43 PM6/22/17
to Alexander Dobin, rna-star
Wow, much thanks for working that out! I actually turned out just sequencing 15nt of the short mate for my experiment, rather than 10nt like I originally planned, and now that I look at the mapping results of my actual data (rather than the simulated paired-end read data that spurred my original post) I think it was wise that I just sequenced 15 bases instead of 10.

cheers and thanks,
Ben

--
You received this message because you are subscribed to a topic in the Google Groups "rna-star" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/rna-star/IFj3fHG_Ysc/unsubscribe.
To unsubscribe from this group and all its topics, send an email to rna-star+unsubscribe@googlegroups.com.
Visit this group at https://groups.google.com/group/rna-star.

Reply all
Reply to author
Forward
0 new messages