STAR misalignment due to deletion

Zhenhua Wu

Jul 15, 2016, 4:27:35 PM

to rna-star

Hi, I have encountered the case shown in the following figure, from a TCGA sample. The top track is the RNA-seq data, the middle track is the WXS data from the same patient's tumor, and the bottom track is the WXS data from the matched normal. According to the tumor WXS data, there is a deletion from chr17:74732936 to chr17:74732959. However, STAR aligned some reads into this deleted region, which caused an apparent mutation call at chr17:74732942.

I examined three reads:

CCCCGTACCTGCGGGGTGGCGGTCCCCGGCGGCCGTAGCGAGCCATTTGC

GTCCCTGCGGGGTGGCGGTCCCCGGCGGCCGTAGCGCGCCATTTGCACCC

CACCGCCCCCGTACCTGCGGGGTGGCGGTCCCCGGCGGCCGTAGCGCGCC

They should be aligned as follows (where '|' marks the junction created by the deletion):

CCCCGTACCTGCGGGGTGGCGGTCCCC|GGCGGCCGTAGCGAGCCATTTGC

GTCCCTGCGGGGTGGCGGTCCCC|GGCGGCCGTAGCGCGCCATTTGCACCC

CACCGCCCCCGTACCTGCGGGGTGGCGGTCCCC|GGCGGCCGTAGCGCGCC

However, they are actually aligned as follows (where 'C' is the apparent mutation caused by the misalignment, and '[...]' marks the part of the read soft-clipped by STAR):

CCCCGTACCTGCGGGGTGGCGGTCCCCGGCGGCCGT[AGCGAGCCATTTGC]

GTCCCTGCGGGGTGGCGGTCCCCGGCGGCCGT[AGCGCGCCATTTGCACCC]

CACCGCCCCCGTACCTGCGGGGTGGCGGTCCCCGGCGGCCGT[AGCGCGCC]

Basically, STAR chose to align reads like these with one mismatch plus a soft-clip instead of splitting the read to find a better alignment. In this case, STAR should find a new junction within this SRSF2 exon, which according to the exome-seq data is actually a deletion. Is there a way to avoid such misalignments with STAR? The ideal behavior would be for STAR to call a new junction here, one that is actually due to a deletion within the exon.

Alexander Dobin

Jul 21, 2016, 6:51:25 PM

to rna-star

Hi Zhenhua,

here are my suggestions:

1. If your concern is miscalled mutations, you may want to filter out alignments in which mismatches occur near the ends of the reads, especially if they are followed by soft-clipping. Such mismatches are often caused by misalignment.

2. If you have the deletions detected from the WXS data and want to include them in the RNA-seq mapping, you can add them as "annotated" junctions using the --sjdbFileChrStartEnd <junction_file> option. This approach will be the most sensitive. In this example, after inserting the deletion as a junction, I get the correct alignment:

1 0 chr17 74732909 255 27M24N23M * 0 0 CCCCGTACCTGCGGGGTGGCGGTCCCCGGCGGCCGTAGCGAGCCATTTGC * NH:i:1 HI:i:1 AS:i:48 nM:i:1

3. If at least one read was "spliced" correctly through this deletion, you can use the 2-pass mode to map other reads to the same "junction".
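
As a rough illustration of suggestion 2, the sketch below turns the gap in the corrected alignment above (position 74732909, CIGAR 27M24N23M) into a line for the junction file. It assumes the four-column, tab-separated layout (chromosome, first gap base, last gap base, strand) with 1-based coordinates that the STAR manual describes for --sjdbFileChrStartEnd. The helper names `n_gaps` and `sjdb_line` are hypothetical, and the "." (undefined) strand is a placeholder choice, not something stated in this thread.

```python
import re

# One CIGAR operation: a length followed by an operation letter.
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def n_gaps(pos, cigar):
    """Return (first, last) 1-based reference coordinates of every N gap.

    pos is the 1-based leftmost aligned position from the SAM record.
    """
    gaps = []
    ref = pos
    for length, op in CIGAR_RE.findall(cigar):
        length = int(length)
        if op == "N":
            gaps.append((ref, ref + length - 1))
        if op in "MDN=X":  # operations that consume reference bases
            ref += length
    return gaps


def sjdb_line(chrom, first, last, strand="."):
    """Format one junction as chr<TAB>start<TAB>end<TAB>strand (assumed layout)."""
    return f"{chrom}\t{first}\t{last}\t{strand}"


if __name__ == "__main__":
    for first, last in n_gaps(74732909, "27M24N23M"):
        print(sjdb_line("chr17", first, last))
```

For this record the gap spans chr17:74732936-74732959, which matches the deletion reported from the WXS data at the start of the thread.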

Cheers

Alex

Zhenhua Wu

Jul 29, 2016, 3:03:35 PM

to rna-star

Thanks, Alex, for the suggestions. As another alternative, would you recommend setting "--alignEndsType" to "EndToEnd" to force end-to-end read alignment? What side effects would forcing end-to-end alignment have?

Zhenhua

Zhenhua Wu

Jul 29, 2016, 3:38:00 PM

to rna-star

Also, is it possible for STAR to re-align soft-clipped reads using a different strategy? For example, could aligning from the end of the soft-clipped part and extending from there find a better alignment and improve accuracy in cases like this?

Zhenhua

Alexander Dobin

Jul 29, 2016, 4:01:03 PM

to rna-star

Hi Zhenhua,

the --alignEndsType EndToEnd option will likely drop all these alignments altogether, since it will try to match the soft-clipped tails to the genome, which will result in too many mismatches.

To find such a deletion "de novo", i.e. without any prior knowledge, STAR would have to find a seed in the shorter overhang portion of the alignment, which seems to be hard in this case because of the repeat sequence near the deletion.

In principle, you are right, all reads with large soft-clippings should be considered suspicious and re-aligned more carefully.
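
One way to flag such suspicious alignments, combining this point with the earlier suggestion about mismatches near soft-clipped read ends, is sketched below. It is only an illustration: the function names and the `max_dist` and `min_clip` thresholds are assumptions rather than STAR features, and it handles only unspliced alignments whose CIGAR consists of M and S operations, using the MD tag to locate mismatches.

```python
import re


def mismatch_offsets(md):
    """Return 0-based offsets of mismatches within the aligned (M) bases.

    md is the value of the SAM MD tag, e.g. "27T2". Assumes the CIGAR
    has no insertions, so MD offsets map directly onto aligned read bases.
    """
    offsets = []
    pos = 0
    for num, deletion, mismatch in re.findall(r"(\d+)|(\^[A-Z]+)|([A-Z])", md):
        if num:
            pos += int(num)
        elif mismatch:
            offsets.append(pos)
            pos += 1
        # deleted reference bases (^...) do not consume read bases
    return offsets


def is_suspicious(cigar, md, max_dist=5, min_clip=5):
    """True if a mismatch lies within max_dist bases of a soft-clip of at
    least min_clip bases. Both thresholds are arbitrary illustrative values."""
    m = re.fullmatch(r"(?:(\d+)S)?(\d+)M(?:(\d+)S)?", cigar)
    if m is None:
        return False  # spliced or indel-containing alignments are not handled
    lead = int(m.group(1) or 0)
    aligned = int(m.group(2))
    trail = int(m.group(3) or 0)
    for off in mismatch_offsets(md):
        if lead >= min_clip and off < max_dist:
            return True
        if trail >= min_clip and aligned - 1 - off < max_dist:
            return True
    return False


if __name__ == "__main__":
    print(is_suspicious("30M20S", "27T2"))
    print(is_suspicious("21S29M", "29"))
```

An alignment with CIGAR 30M20S and MD:Z:27T2 is flagged because its mismatch sits two bases before a 20-base soft-clip, while a clipped alignment without mismatches is not.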

Cheers

Alex

Zhenhua Wu

Oct 13, 2016, 10:52:53 AM

to rna-star

Hi Alex, for the case above, I have always wondered why STAR failed to find a new junction here. Because of the small repeat, the left side of the junction is hard to recover correctly. However, STAR should at least be able to find a junction between positions 74732941 and 74732960 that aligns the reads perfectly, even though this junction is only partially correct.

I found one read that has two alignment records:

SOLEXA11_36:4:94:5021:9837 355 chr17 74732915 3 30M20S = 74733126 261 ACCTGCGGGGTGGCGGTCCCCGGCGGCCGTAGCGCGCCATTTGCACCCGC CCCCCCCCCCBCBBCCCCCCCCCCCCCCC>CCCCCCCC3BBBCABDCCDA NH:i:2 HI:i:2 AS:i:76 nM:i:1 NM:i:1 MD:Z:27T2 jM:B:c,-1 jI:B:i,-1

SOLEXA11_36:4:94:5021:9837 99 chr17 74732960 3 21S29M = 74733126 216 ACCTGCGGGGTGGCGGTCCCCGGCGGCCGTAGCGCGCCATTTGCACCCGC CCCCCCCCCCBCBBCCCCCCCCCCCCCCC>CCCCCCCC3BBBCABDCCDA NH:i:2 HI:i:1 AS:i:77 nM:i:0 NM:i:0 MD:Z:29 jM:B:c,-1 jI:B:i,-1

Basically, this read has two alignment records: the primary alignment (flag 99, 21S29M) has a 21 bp soft-clip on the 5' side, and the secondary alignment (flag 355, 30M20S) has a 20 bp soft-clip on the 3' side. I think STAR failed to stitch them together into one alignment. Do you think STAR could improve its stitching step to combine these two records into a single alignment? I feel there is potential for STAR to handle this case without much overhead. What do you think?
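
To make the stitching idea concrete, here is a minimal sketch of the coordinate arithmetic for combining the two records above into one gapped alignment. It is not how STAR's stitching works; `merge_clipped` is a hypothetical helper that accepts only one record of the form xMyS and one of the form aSbM, and it simply cuts the read at the start of the right-hand record's aligned part.

```python
import re


def merge_clipped(left_pos, left_cigar, right_pos, right_cigar):
    """Combine a 3'-clipped record (xMyS) and a 5'-clipped record (aSbM)
    of the same read into one alignment with an N gap.

    Positions are 1-based SAM POS values. Returns (pos, cigar, (gap_first, gap_last)).
    """
    left = re.fullmatch(r"(\d+)M(\d+)S", left_cigar)
    right = re.fullmatch(r"(\d+)S(\d+)M", right_cigar)
    if left is None or right is None:
        raise ValueError("expected CIGARs of the form xMyS and aSbM")
    left_m, left_s = int(left.group(1)), int(left.group(2))
    right_s, right_m = int(right.group(1)), int(right.group(2))
    if left_m + left_s != right_s + right_m:
        raise ValueError("records imply different read lengths")
    if right_s > left_m:
        raise ValueError("aligned parts do not cover the whole read")
    # Take the first right_s read bases from the left record and the
    # remaining right_m bases from the right record.
    gap_first = left_pos + right_s
    gap_last = right_pos - 1
    gap_len = gap_last - gap_first + 1
    if gap_len <= 0:
        raise ValueError("records do not leave a gap on the reference")
    return left_pos, f"{right_s}M{gap_len}N{right_m}M", (gap_first, gap_last)


if __name__ == "__main__":
    print(merge_clipped(74732915, "30M20S", 74732960, "21S29M"))
```

For the two records shown, this gives position 74732915 with CIGAR 21M24N29M and a gap at chr17:74732936-74732959, the same interval as the deletion seen in the WXS data. Because of the repeat, other cut points within the overlap of the two aligned parts would shift the gap boundaries, so this is one of several possible placements.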

Alexander Dobin

Oct 18, 2016, 3:53:35 PM

to rna-star

Hi Zhenhua,

the stitching algorithm in STAR (as well as many other algorithms) can definitely be improved. It is not optimal for finding short deletions whose overhangs have highly repetitive sequence. The two main problems with improving it would be an increase in computational time and an increase in false positives.

Cheers

Alex

Zhenhua Wu

Oct 18, 2016, 4:20:23 PM

to rna-star

Will some immediate improvements of this type be coming soon in a new release? I tried HISAT2 on the same sample, and it does not produce the false SNP in the deleted region. I think alignment accuracy is definitely an important factor. The computational time lost to a more comprehensive stitching algorithm could be offset by using more threads, so I think it might be worth it. Alternatively, it could be an option that lets users choose between accuracy and computational time based on their needs.

Alexander Dobin

Oct 19, 2016, 4:24:40 PM

to rna-star

Hi Zhenhua,

the accuracy improvements for this particular problem will not be coming soon.

I do not think this problem affects a significant proportion of reads, and I prefer to concentrate on cases that are more abundant, such as mismatches near annotated junctions. Overall, in my tests (as well as others'), STAR shows similar or better accuracy than HISAT2. For some specific cases, one or the other may be more accurate, of course.