Weird splicing junction identified by STAR 2.5.1a (start and end coordinates the same)

Qingqing Wang

Mar 11, 2016, 8:19:40 AM

to rna-star

Hi Alex,

While using STAR 2.5.1a to map paired-end 100bp RNA-seq data to the Drosophila genome (dm3), I noticed that the following junction got reported in the SJ.out.tab file:

chrX 20066072 20066072 0 0 1 360 0 7

This junction is bizarre since it has the same start and end coordinates.

I mapped three RNA-seq samples (ctrl+kd) and all three samples output this junction. Is there an explanation for this?

FYI, my 2-pass STAR command is:

STAR --runThreadN 2 --genomeDir $genomeDIR --outFileNamePrefix $output_file --readFilesIn $readfiles1 $readfiles2 --outSJfilterReads Unique --quantMode TranscriptomeSAM GeneCounts --outSAMstrandField intronMotif --outFilterMultimapNmax 1

Thank you!

Qingqing

Alexander Dobin

Mar 14, 2016, 6:37:00 AM

to rna-star

Hi Qingqing,

this junction appears to be annotated (col6=1), so it was supplied at the genome generation step.

How did you generate the genome?

Cheers

Alex

Qingqing Wang

Apr 14, 2016, 8:53:01 PM

to rna-...@googlegroups.com

Hi Alex (sorry my previous post contains an error... I rectified it. Basically all the junctions identified to have the same start and end coordinates are "annotated" and also there are only a few of them for each sample),

I generated the genome using the two-step procedure.

First step:

STAR --runThreadN 2 --runMode $run_mode --genomeDir $genomeDIR --genomeFastaFiles $genome_fasta --sjdbGTFfile $genome_annotation --sjdbOverhang 99

Here for $genome_annotation I used a genes.gtf file that is downloaded from the Tophat website: https://ccb.jhu.edu/software/tophat/igenomes.shtml

This annotation profile is under Drosophila UCSC dm3.

After that, I chose all the unannotated splicing junctions from the SJ files by performing (I have three samples, Index1,2,and 3):

awk '$6 == 0' Index1_SJ.out.tab > Index1_SJ.out.tab_unannotated

awk '$6 == 0' Index2_SJ.out.tab > Index2_SJ.out.tab_unannotated

awk '$6 == 0' Index3_SJ.out.tab > Index3_SJ.out.tab_unannotated

Then I combined all the unannotated junctions together and formatted them for the 2nd pass genome generation as follows:

cat Index1_SJ.out.tab_unannotated Index2_SJ.out.tab_unannotated Index3_SJ.out.tab_unannotated Index4_SJ.out.tab_unannotated > combined_SJ_out_tab_unannotated.txt

awk '{if($4==1) $4="+"; else if($4==2) $4="-"; print $1 "\t" $2 "\t" $3 "\t" $4}' combined_SJ_out_tab_unannotated.txt > combined_SJ_out_tab_unannotated_for_2nd_pass_genome_generation.txt

Now for the second step:

STAR --runThreadN 2 --runMode $run_mode --genomeDir $genomeDIR --genomeFastaFiles $genome_fasta --sjdbFileChrStartEnd $spliceDB --sjdbGTFfile $genome_annotation --sjdbOverhang 99

Here $spliceDB is the combined_SJ_out_tab_unannotated_for_2nd_pass_genome_generation.txt file produced above.

I noticed that for each sample I found around 7 junctions that have the exact same start and end coordinates, and all of them are "annotated". Below is an example from the SJ files from the first run of sample 1:

chr2L 21388502 21388502 1 3 1 321 0 5

chr2R 14309059 14309059 0 0 1 69 0 3

chr3R 24962423 24962423 0 0 1 117 0 3

chrX 8802354 8802354 0 0 1 5 0 4

chrX 8806978 8806978 1 1 1 377 0 3

chrX 13493447 13493447 0 0 1 1960 0 11

chrX 20066072 20066072 0 0 1 86 0 7

I look forward to your reply! Thanks!

Qingqing

Alexander Dobin

Apr 15, 2016, 6:15:31 PM

to rna-star

Hi Qingqing

can you look at these junctions after the 1st step?

They should also appear annotated at the 1st step, which would mean they are annotated in the GTF file.

Note that the junction coordinates are the start/end of the intron, so the introns in these junctions are exactly 1 base long.

These also happen to exist in human annotations. They are usually explained as artifacts required to maintain ORF integrity.

I guess you could think of them as 1-base deletions relative to the genome, owing to mis-assembly or polymorphism.
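
As a quick check of the point above, here is a minimal sketch that lists the SJ.out.tab rows whose intron is exactly 1 base (column 2 equals column 3) and prints the intron length as end - start + 1. The file names sample_SJ.out.tab and one_base_introns.txt are made up for illustration. The first toy row is the junction from the original post; the second is a hypothetical ordinary intron added only for contrast.

```shell
# Toy SJ.out.tab (tab-separated); both file names are hypothetical.
# Row 1: the junction from the original post. Row 2: a made-up normal intron.
printf 'chrX\t20066072\t20066072\t0\t0\t1\t360\t0\t7\n'  > sample_SJ.out.tab
printf 'chrX\t20070000\t20070500\t1\t1\t1\t12\t0\t30\n' >> sample_SJ.out.tab

# Columns 2 and 3 are the first and last base of the intron (1-based),
# so the intron length is end - start + 1. Column 6 is the annotation flag.
awk -F'\t' -v OFS='\t' \
    '$2 == $3 {print $1, $2, $3, "len=" ($3 - $2 + 1), "annotated=" $6}' \
    sample_SJ.out.tab > one_base_introns.txt

cat one_base_introns.txt
```

On a real run, replacing sample_SJ.out.tab with the actual SJ.out.tab should show whether every such row has column 6 equal to 1, as it does in this thread.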

Cheers

Alex

Qingqing Wang

Apr 18, 2016, 8:36:26 PM

to rna-star

Hi Alex,

Yes these junctions did appear annotated at the 1st step, so they are in the GTF file.
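
For anyone who wants to confirm this from the GTF side, the following is a rough sketch, not a tested pipeline. It collects the exon records per transcript, sorts them by start, and reports consecutive exons separated by exactly one base, which is what a 1-base annotated intron looks like in a GTF. The file toy_genes.gtf and the transcript "tx1" are invented for illustration, and the sketch assumes that exons within a transcript do not overlap and that each transcript_id occurs at only one locus per chromosome.

```shell
# Toy GTF; the file name, gene and transcript IDs are hypothetical.
# The first two exons flank a single unannotated base at chrX:20066072.
printf 'chrX\ttoy\texon\t20065900\t20066071\t.\t+\t.\tgene_id "g1"; transcript_id "tx1";\n'  > toy_genes.gtf
printf 'chrX\ttoy\texon\t20066073\t20066200\t.\t+\t.\tgene_id "g1"; transcript_id "tx1";\n' >> toy_genes.gtf
printf 'chrX\ttoy\texon\t20067000\t20067100\t.\t+\t.\tgene_id "g1"; transcript_id "tx1";\n' >> toy_genes.gtf

# Step 1: emit transcript_id, chr, exon start, exon end for every exon record.
# Step 2: sort by transcript, chromosome, then numeric exon start.
# Step 3: a gap of exactly one base means next_start - prev_end == 2.
awk -F'\t' -v OFS='\t' '$3 == "exon" {
        match($9, /transcript_id "[^"]+"/)
        tx = substr($9, RSTART + 15, RLENGTH - 16)   # strip the key and the quotes
        print tx, $1, $4, $5
    }' toy_genes.gtf |
sort -t "$(printf '\t')" -k1,1 -k2,2 -k3,3n |
awk -F'\t' -v OFS='\t' '
    $1 == tx && $2 == chr && $3 - prev_end == 2 {print $2, prev_end + 1, $3 - 1, tx}
    {tx = $1; chr = $2; prev_end = $4}' > gtf_one_base_introns.txt

cat gtf_one_base_introns.txt
```

The reported coordinates use the same convention as SJ.out.tab (first and last intron base), so they can be compared directly with the junction rows.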

Thank you for your explanation! I do not really get why these 1-base junctions are explained as artifacts to maintain ORF integrity. Do you mean that they reside right at the edge of, say, two assembly chunks that cross over an ORF, and that in order to keep the ORF intact after assembly people manually annotated them as junctions?

Thank you!

Qingqing

Alexander Dobin

Apr 20, 2016, 3:22:51 PM

to rna-star

Hi Qingqing,

my understanding is that assemblies might occasionally include incorrectly inserted bases.

When a protein-coding transcript is annotated over such a base, the ORF might get broken (i.e. a premature stop-codon occurs).

To avoid this problem without fixing the assembly, the annotators include a 1-base intron to excise the wrong base from the RNA sequence.
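
A toy illustration of this idea, using an invented 16-base sequence rather than any real gene: the intended mRNA is ATG GCC AAG GTG TGA, but the hypothetical "assembly" carries an extra T at position 7, which creates a premature TAA stop codon. Excising that single base, as a 1-base intron would, restores the reading frame. The position, the sequence and the helper name codons are all assumptions made for the example.

```shell
# Hypothetical assembly sequence with a wrongly inserted T at position 7 (1-based).
genome="ATGGCCTAAGGTGTGA"

# Helper (made up for this sketch): split a sequence into space-separated codons.
codons() {
    printf '%s\n' "$1" | awk '{
        out = ""
        for (i = 1; i <= length($0); i += 3)
            out = out (i > 1 ? " " : "") substr($0, i, 3)
        print out
    }'
}

# Remove the single base at position p, mimicking a 1-base intron being spliced out.
spliced=$(printf '%s\n' "$genome" | awk -v p=7 '{print substr($0, 1, p - 1) substr($0, p + 1)}')

unspliced_codons=$(codons "$genome")
spliced_codons=$(codons "$spliced")

echo "unspliced: $unspliced_codons"   # third codon is TAA, a premature stop
echo "spliced:   $spliced_codons"     # frame restored, stop codon only at the end
```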