STARlong: adjusting for variable length transcripts

zeynep

Oct 26, 2017, 10:47:52 AM

to rna-star

Hey there!

I am trying to align transcripts (assembled using Trinity) of variable length (200 b to ~6-7 kb) to the human genome.

I think the following options are relevant to my case:

STARlong --genomeDir path/to/indexed/genome

--runThreadN

--readFilesIn path/to/trinity/transcripts

--outFilterMismatchNoverReadLmax 0.10

--sjdbGTFfile path/to/annotation

--seedPerReadNmax 100000

--sjdbOverhang ??

My question is how to choose the overhang option when the transcript lengths vary this much. It would also be nice to hear any other suggestions :)

Thanks!

zeynep

Alexander Dobin

Oct 26, 2017, 2:09:55 PM

to rna-star

Hi Zeynep,

For variable read length I recommend leaving --sjdbOverhang at the default of 100. Note that this parameter has to be used at the genome generation step, together with --sjdbGTFfile.
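As a rough sketch of what this looks like in practice, the Python snippet below assembles a genome-generation command line in which --sjdbGTFfile and --sjdbOverhang appear together. The helper name `genome_generate_args`, the FASTA path, and the thread count are illustrative assumptions, not values from this thread; the remaining paths are the placeholders used in the question.

```python
def genome_generate_args(genome_dir, fasta, gtf, overhang=100, threads=8):
    """Build a STAR genome-generation argument list (hypothetical helper)."""
    return [
        "STAR",
        "--runMode", "genomeGenerate",
        "--runThreadN", str(threads),      # assumed thread count
        "--genomeDir", genome_dir,
        "--genomeFastaFiles", fasta,       # assumed FASTA location
        "--sjdbGTFfile", gtf,              # annotation is supplied at this step
        "--sjdbOverhang", str(overhang),   # left at the default of 100
    ]


args = genome_generate_args(
    "path/to/indexed/genome",
    "path/to/genome.fa",
    "path/to/annotation",
)
# Print the command for inspection; pass `args` to subprocess.run to execute it.
print(" ".join(args))
```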

I would also recommend the following parameters at the mapping stage:

--outFilterMismatchNmax 1000 --outFilterMismatchNoverReadLmax 0.10 : increases the number of allowed mismatches to the minimum of (1000, 0.1*ReadLength); more mismatches need to be allowed for longer reads

--seedSearchStartLmax 30 : increases the number of seed search start positions in the read; important for reads with a high error rate

--seedPerReadNmax 100000 --seedPerWindowNmax 100 : increase the number of allowed seeds for each read and alignment window; more seeds need to be stored for longer reads

--alignTranscriptsPerReadNmax 100000 --alignTranscriptsPerWindowNmax 10000 : increase the number of allowed alignments for each read and alignment window - need to store more putative alignments for longer reads
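For convenience, the mapping-stage parameters above can be collected into one argument list. This is only a sketch: the helper name `starlong_mapping_args` and the thread count are assumptions, and the paths are the placeholders from the original question.

```python
def starlong_mapping_args(genome_dir, reads, threads=8):
    """Build a STARlong mapping argument list (hypothetical helper)."""
    return [
        "STARlong",
        "--runThreadN", str(threads),              # assumed thread count
        "--genomeDir", genome_dir,
        "--readFilesIn", reads,
        # Allow more mismatches for longer reads.
        "--outFilterMismatchNmax", "1000",
        "--outFilterMismatchNoverReadLmax", "0.10",
        # More seed search start positions in the read.
        "--seedSearchStartLmax", "30",
        # Store more seeds per read and per alignment window.
        "--seedPerReadNmax", "100000",
        "--seedPerWindowNmax", "100",
        # Store more putative alignments per read and per window.
        "--alignTranscriptsPerReadNmax", "100000",
        "--alignTranscriptsPerWindowNmax", "10000",
    ]


mapping_args = starlong_mapping_args(
    "path/to/indexed/genome", "path/to/trinity/transcripts"
)
# Print the command for inspection; pass `mapping_args` to subprocess.run to execute it.
print(" ".join(mapping_args))
```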

Cheers

Alex

zeynep

Nov 1, 2017, 9:38:53 AM

to rna-star

Hey Alex, thanks for the reply :)

I am confused about the --outFilterMismatchNoverReadLmax option. The manual says

"float: alignment will be output only if its ratio of mismatches to *read* length is less than or equal to this value."

and that the default is 1.0.

So if I decrease this to 0.1, then for a read length of 1000 an alignment will be counted as mapped only if the number of mismatches is less than or equal to 100, which I would expect to decrease my mapping percentage. (And so it does: when I set this to 0.1, I get ~50% unmapped because of mismatches.)

However, when I set --outFilterMismatchNmax 1000, your other suggestion, I get 80% mapping.

So overall, I guess that for longer reads it is better to leave --outFilterMismatchNoverReadLmax at the default of 1?

I'd appreciate it if you could clarify this. Thank you very much for dealing with all our questions :)

Best,

Zeynep

On Thursday, October 26, 2017 at 20:09:55 UTC+2, Alexander Dobin wrote:

Alexander Dobin

Nov 1, 2017, 1:41:26 PM

to rna-star

Hi Zeynep,

There are 3 parameters that control the maximum allowed number of mismatches:

outFilterMismatchNmax 10

int: alignment will be output only if it has no more mismatches than this value.

outFilterMismatchNoverLmax 0.3

float: alignment will be output only if its ratio of mismatches to *mapped* length is less than or equal to this value.

outFilterMismatchNoverReadLmax 1.0

float: alignment will be output only if its ratio of mismatches to *read* length is less than or equal to this value.

These filters (like most of the filters in STAR) work in an AND fashion, i.e. the most stringent filter wins.

By default, --outFilterMismatchNmax 10 is likely to be the most stringent filter, while --outFilterMismatchNoverReadLmax 1.0 filter is switched off.

Note that (by default) STAR will soft-clip the reads as much as needed to satisfy the mismatch filters.
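To make the AND logic concrete, here is a minimal Python sketch that follows the wording of the three manual entries quoted above. The function name `passes_mismatch_filters` and its arguments are made up for illustration; this is a simplified model rather than STAR's actual code, and it ignores soft-clipping.

```python
def passes_mismatch_filters(n_mismatches, read_len, mapped_len,
                            n_max=10, n_over_l_max=0.3, n_over_read_l_max=1.0):
    """Return True only if an alignment passes all three mismatch filters.

    Defaults mirror the manual entries quoted above. Simplified model:
    soft-clipping and STAR's internal rounding are not reproduced.
    """
    return (
        n_mismatches <= n_max                                # outFilterMismatchNmax
        and n_mismatches / mapped_len <= n_over_l_max        # outFilterMismatchNoverLmax
        and n_mismatches / read_len <= n_over_read_l_max     # outFilterMismatchNoverReadLmax
    )


# Defaults: the absolute cap of 10 is the most stringent filter.
default_ok = passes_mismatch_filters(10, read_len=1000, mapped_len=1000)
default_fail = passes_mismatch_filters(11, read_len=1000, mapped_len=1000)

# Suggested long-read settings: the read-length ratio 0.10 becomes the limit.
long_ok = passes_mismatch_filters(100, 1000, 1000,
                                  n_max=1000, n_over_read_l_max=0.10)
long_fail = passes_mismatch_filters(101, 1000, 1000,
                                    n_max=1000, n_over_read_l_max=0.10)

print(default_ok, default_fail, long_ok, long_fail)
```

Under this model, a fully mapped 1000-base read with default settings is rejected at 11 mismatches because of --outFilterMismatchNmax 10, while with --outFilterMismatchNmax 1000 and --outFilterMismatchNoverReadLmax 0.10 it is accepted up to 100 mismatches.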

Cheers

Alex

zeynep

Dec 12, 2017, 5:35:08 PM

to rna-star

Hi again Alex,

I was trying to map marmoset transcript assemblies to the human reference genome using STARlong.

After all the settings and trials, the unique mapping rate would not go above 75%, and the unmapped transcripts were all longer than 3500 bp.

I then tried mapping those unmapped transcript assemblies to human with BLAT instead, and 40% of them map with more than 80% query coverage.

I was wondering whether the reason BLAT can map those very long transcripts and STARlong cannot is that STAR has some sort of limit on the number of "alignment blocks"?

Or is there a maximum number of splice junctions a read can span?

Thank you!

Best,

Zeynep

On Wednesday, November 1, 2017 at 18:41:26 UTC+1, Alexander Dobin wrote:

Alexander Dobin

Dec 14, 2017, 12:36:03 PM

to rna-star

Hi Zeynep,

If BLAT finds alignments with 80% identity, STAR will not find them: it is not (yet) designed to work with high-error-rate reads (identity <95%).

If you can send me a few cases where BLAT produced a good alignment and STAR did not, I can try to see whether any other parameter tweaks help.