Weird splicing junction identified by STAR 2.5.1a (start and end coordinates the same)


Qingqing Wang

Mar 11, 2016, 8:19:40 AM
to rna-star
Hi Alex,

While using STAR 2.5.1a to map paired-end 100bp RNA-seq data to the Drosophila genome (dm3), I noticed that the following junction got reported in the SJ.out.tab file:

chrX     20066072    20066072    0    0    1    360    0    7


This junction is bizarre since it has the same start and end coordinates.


I mapped three RNA-seq samples (ctrl+kd) and all three samples output this junction.  Is there an explanation for this?


FYI, my 2-pass STAR command is:

STAR --runThreadN 2 --genomeDir $genomeDIR --outFileNamePrefix $output_file --readFilesIn $readfiles1 $readfiles2 --outSJfilterReads Unique --quantMode TranscriptomeSAM GeneCounts --outSAMstrandField intronMotif --outFilterMultimapNmax 1


Thank you!


Qingqing 



Alexander Dobin

Mar 14, 2016, 6:37:00 AM
to rna-star
Hi Qingqing,

this junction appears to be annotated (col6=1), so it was supplied at the genome generation step.
How did you generate the genome?
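
For reference, a minimal sketch that labels the nine SJ.out.tab columns as the STAR manual describes them (intron start and end, strand, motif, annotation flag, read counts, overhang) and adds the intron length. `sj_describe` is a hypothetical helper name, not part of STAR, and the input is the junction from the first post rewritten with tab separators.

```shell
# Label the nine SJ.out.tab columns and report the intron length.
# sj_describe is a hypothetical helper name; column meanings follow the STAR manual.
sj_describe() {
  awk -F'\t' '{
    printf "chr=%s intron_start=%s intron_end=%s intron_len=%d strand=%s motif=%s annotated=%s unique_reads=%s multi_reads=%s max_overhang=%s\n",
           $1, $2, $3, $3 - $2 + 1, $4, $5, $6, $7, $8, $9
  }'
}

# The junction from the first post (tab-separated)
printf 'chrX\t20066072\t20066072\t0\t0\t1\t360\t0\t7\n' | sj_describe
```

Here `annotated=1` is the col6 flag, and `intron_len=1` follows from start and end being the same base.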

Cheers
Alex

Qingqing Wang

Apr 14, 2016, 8:53:01 PM
to rna-...@googlegroups.com
Hi Alex,

(Sorry, my previous post contained an error, which I have corrected. Basically, all the junctions identified as having the same start and end coordinates are "annotated", and there are only a few of them in each sample.)

I generated the genome using the two-step procedure.

First step:

STAR --runThreadN 2 --runMode $run_mode --genomeDir $genomeDIR --genomeFastaFiles $genome_fasta --sjdbGTFfile $genome_annotation --sjdbOverhang 99

Here for $genome_annotation I used a genes.gtf file that is downloaded from the Tophat website: https://ccb.jhu.edu/software/tophat/igenomes.shtml

This annotation profile is under Drosophila UCSC dm3.


After that, I selected all the unannotated splice junctions from the SJ.out.tab files (I have three samples: Index1, Index2, and Index3):

awk '$6 == 0' Index1_SJ.out.tab > Index1_SJ.out.tab_unannotated


awk '$6 == 0' Index2_SJ.out.tab > Index2_SJ.out.tab_unannotated


awk '$6 == 0' Index3_SJ.out.tab > Index3_SJ.out.tab_unannotated


Then I combined all the unannotated junctions and formatted them for the 2nd-pass genome generation as follows:

cat Index1_SJ.out.tab_unannotated Index2_SJ.out.tab_unannotated Index3_SJ.out.tab_unannotated Index4_SJ.out.tab_unannotated > combined_SJ_out_tab_unannotated.txt


awk '{if($4==1) $4="+"; else if($4==2) $4="-"; print $1 "\t" $2 "\t" $3 "\t" $4}' combined_SJ_out_tab_unannotated.txt > combined_SJ_out_tab_unannotated_for_2nd_pass_genome_generation.txt
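
A variant of the conversion above, sketched under two assumptions that are not in the original command: that the four-column junction file uses `.` for an undefined strand (the STAR manual describes the strand column as +, - or .), and that junctions shared between samples can be collapsed with `sort -u`. `sj_to_sjdb` is a hypothetical helper name.

```shell
# Convert SJ.out.tab rows read from stdin to chr/start/end/strand.
# sj_to_sjdb is a hypothetical helper name, not part of STAR.
sj_to_sjdb() {
  awk 'BEGIN { OFS = "\t" }
       {
         strand = "."                    # column 4 == 0: undefined strand (assumed to map to ".")
         if ($4 == 1) strand = "+"       # column 4 == 1: plus strand
         else if ($4 == 2) strand = "-"  # column 4 == 2: minus strand
         print $1, $2, $3, strand
       }' | sort -u                      # drop junctions repeated across samples
}
```

It could be used as `cat Index1_SJ.out.tab_unannotated Index2_SJ.out.tab_unannotated Index3_SJ.out.tab_unannotated | sj_to_sjdb > combined_SJ_out_tab_unannotated_for_2nd_pass_genome_generation.txt`.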


Now for the second step:

STAR --runThreadN 2 --runMode $run_mode --genomeDir $genomeDIR --genomeFastaFiles $genome_fasta --sjdbFileChrStartEnd $spliceDB --sjdbGTFfile $genome_annotation --sjdbOverhang 99


Here $spliceDB points to the combined_SJ_out_tab_unannotated_for_2nd_pass_genome_generation.txt file produced above.

I noticed that for each sample I found around 7 junctions that have the exact same start and end coordinates, and all of them are "annotated".  Below is an example from the SJ files from the first run of sample 1:


chr2L   21388502   21388502   1   3   1   321    0   5
chr2R   14309059   14309059   0   0   1   69     0   3
chr3R   24962423   24962423   0   0   1   117    0   3
chrX    8802354    8802354    0   0   1   5      0   4
chrX    8806978    8806978    1   1   1   377    0   3
chrX    13493447   13493447   0   0   1   1960   0   11
chrX    20066072   20066072   0   0   1   86     0   7


I look forward to your reply!  Thanks!


Qingqing

Alexander Dobin

Apr 15, 2016, 6:15:31 PM
to rna-star
Hi Qingqing

can you look at these junctions after the 1st step?
They should also appear annotated at the 1st step, which would mean they are annotated in the GTF file.
Note that the junction coordinates are the start and end of the intron, so the introns in these junctions are exactly 1 base long.
These also happen to exist in human annotations. They are usually explained as artifacts required to maintain ORF integrity.
I guess you could think of them as 1-base deletions in the genome, owing to mis-assembly or polymorphism.
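
A minimal sketch for pulling such 1-base introns out of an SJ.out.tab file and checking whether each is annotated. It relies only on columns 2, 3 and 6 as discussed in this thread. `sj_one_base_introns` is a hypothetical helper name.

```shell
# Print SJ.out.tab rows whose intron is exactly 1 base long (start == end),
# with an extra column stating whether column 6 marks them as annotated.
# sj_one_base_introns is a hypothetical helper name, not part of STAR.
sj_one_base_introns() {
  awk '$2 == $3 { print $0 "\t" ($6 == 1 ? "annotated" : "unannotated") }'
}
```

For example, `sj_one_base_introns < Index1_SJ.out.tab` would list the rows to compare between the 1st and 2nd step.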

Cheers
Alex

Qingqing Wang

Apr 18, 2016, 8:36:26 PM
to rna-star
Hi Alex,

Yes these junctions did appear annotated at the 1st step, so they are in the GTF file.  

Thank you for your explanation!  I do not really understand why these 1-base junctions are explained as artifacts required to maintain ORF integrity.  Do you mean that they sit right at the boundary between, say, two assembly chunks that span an ORF, and that people manually annotated them as junctions to keep the ORF intact after assembly?

Thank you!

Qingqing 

Alexander Dobin

Apr 20, 2016, 3:22:51 PM
to rna-star
Hi Qingqing,

my understanding is that assemblies might occasionally include incorrectly inserted bases.
When a protein-coding transcript is annotated over such a base, the ORF might get broken (i.e. a premature stop-codon occurs).
To avoid this problem without fixing the assembly, the annotators include a 1-base intron to excise the wrong base from the RNA sequence.
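
One way to see these 1-base introns directly in the annotation is to look for consecutive exons of the same transcript separated by a single base. The sketch below assumes a standard nine-column GTF with `exon` features and a `transcript_id "..."` attribute. `gtf_one_base_introns` is a hypothetical helper name.

```shell
# List 1-base introns implied by a GTF read from stdin.
# Output columns: chr, intron start, intron end, transcript_id.
# gtf_one_base_introns is a hypothetical helper name, not part of STAR.
gtf_one_base_introns() {
  # 1) keep exon rows and extract transcript_id, chr, start, end
  awk -F'\t' '$3 == "exon" && match($9, /transcript_id "[^"]+"/) {
        tid = substr($9, RSTART + 15, RLENGTH - 16)
        print tid "\t" $1 "\t" $4 "\t" $5
      }' |
  # 2) order exons by transcript, chromosome, then start coordinate
  sort -t "$(printf '\t')" -k1,1 -k2,2 -k3,3n |
  # 3) report a gap of exactly one base between consecutive exons
  awk -F'\t' 'BEGIN { OFS = "\t" }
      $1 == tid && $2 == chr && $3 - prev_end == 2 { print chr, prev_end + 1, prev_end + 1, tid }
      { tid = $1; chr = $2; prev_end = $4 }'
}
```

Running `gtf_one_base_introns < genes.gtf` should give coordinates comparable to columns 2 and 3 of SJ.out.tab, since both use 1-based intron positions.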

Cheers
Alex