STAR alignement to transcriptome + Salmon quantification fails

1,000 views
Skip to first unread message

Martin Selmansberger

unread,
Sep 29, 2016, 11:22:19 AM9/29/16
to rna-star
Hello,

I intend to do isoform/transcript quantification from RNAseq data with STAR alignment + Salmon quantification in (Alignment based mode).

Therefor I have

- generated a Ref-Genome with STAR

STAR --runThreadN 15 \
     --runMode genomeGenerate \
     --genomeDir  /home/zyto/SAN/zyto/MS/STAR_aligner/GRCh38/Genome_TCGA_TCversion \
     --genomeFastaFiles /home/zyto/SAN/zyto/MS/STAR_aligner/GRCh38/sequence/GRCh38_r85.all.fa \
     --sjdbGTFfile /home/zyto/SAN/zyto/MS/STAR_aligner/GRCh38/annotation/Homo_sapiens.GRCh38.85.gtf \
     --sjdbOverhang 47 \


- and aligned the reads with STAR, with the "quantMode TranscriptomeSAM" option in order to obtain transcriptome alignments


STAR --genomeDir /san/zyto/MS/STAR_aligner/GRCh38/Genome_TCGA\
     --runThreadN 12 \
     --readFilesIn /Ferdl/00529fb2-f36d-4f50-abf7-01b4e46739b1/read1.fq  /Ferdl/00529fb2-f36d-4f50-abf7-01b4e46739b1/read2.fq \
     --outFileNamePrefix star_out_fq_new \
     --quantMode TranscriptomeSAM \


a subsequent quantification with RSEM gave individual transcriptome quantification:

 transcript_id   gene_id length  effective_length        expected_count  TPM     FPKM    IsoPct
ENST00000373020 ENSG00000000003 2206    2140.99 211.67  11.25   6.87    93.81
ENST00000494424 ENSG00000000003 820     755.01  0.00    0.00    0.00    0.00
ENST00000496771 ENSG00000000003 1025    959.99  0.00    0.00    0.00    0.00
ENST00000612152 ENSG00000000003 3796    3730.99 24.33   0.74    0.45    6.19
ENST00000614008 ENSG00000000003 900     834.99  0.00    0.00    0.00    0.00
ENST00000373031 ENSG00000000005 1339    1273.99 0.00    0.00    0.00    0.00
ENST00000485971 ENSG00000000005 542     477.21  0.00    0.00    0.00    0.00
ENST00000371582 ENSG00000000419 1161    1095.99 0.00    0.00    0.00    0.00
ENST00000371584 ENSG00000000419 1073    1007.99 56.73   6.41    3.91    20.68


PROBLEM:

When I try to perform the quantification of the STAR aligned reads with Salmon (Alignment based mode),
I get the following Warning/Error which lead to abortion of the quantification with NO quant-results.

salmon quant -p 2 -t /san/zyto/MS/STAR_aligner/GRCh38/TRANSKRIPOME/Homo_sapiens.GRCh38.cdna.all.fa -g /san/zyto/MS/STAR_aligner/GRCh38/annotation/Homo_sapiens.GRCh38.85.gtf  -l IU -a ./star_out_fqAligned.toTranscriptome.out.bam  -o STAR_Salmon_quant

This is duo to this message (for every entry/transcript):

.
.
WARNING: Transcript ENST00000401542.2 appears in the reference but did not appear in the BAM
WARNING: Transcript ENST00000447063.6 appears in the reference but did not appear in the BAM
WARNING: Transcript ENST00000525249.1 appears in the reference but did not appear in the BAM
WARNING: Transcript ENST00000524591.6 appears in the reference but did not appear in the BAM
WARNING: Transcript ENST00000528335.5 appears in the reference but did not appear in the BAM
WARNING: Transcript ENST00000400069.7 appears in the reference but did not appear in the BAM
.
.
.
.


I think the problem is, that salmon merges the transcript ID + version (e.g. ENST00000401542, version 2  = ENST00000401542.2),
while STAR and RSEM use the ID only, without the version (e.g. only ENST00000401542).

The GTF and the ref-Seq-Fasta file were downloaded from ensebl......

 head of the BAM/SAM from STAR (Aligned.toTranscriptome.out.bam) looks like:

NC12-SN629:170:D075HACXX:4:2301:1931:47690     419     ENST00000620552 5102    3       48M     =       5142    88 CCTGGGCAAGGGGGACTTCGTGTCGCTGGCACTGCGGGACCGCCGCCT @@=DBAA?C;=C811?;:911C?;DEAGB7DDBFF645@E;9>>B?B@        NH:i:2  HI:i:1
UNC12-SN629:170:D075HACXX:4:2301:1931:47690     339     ENST00000620552 5142    3       48M     =       5102    -88CGCCGCCTGGAGTTCCGCTACGACCTGGGCAAGGGGGCAGCGGTCATC B@FHHIIJJJJIIGJIIJJJJIGJJJJJIHJJJJJGHHHHFFFFFCCB        NH:i:2  HI:i:1
UNC12-SN629:170:D075HACXX:4:2301:1931:47690     163     ENST00000379370 5102    3       48M     =       5142    88 CCTGGGCAAGGGGGACTTCGTGTCGCTGGCACTGCGGGACCGCCGCCT @@=DBAA?C;=C811?;:911C?;DEAGB7DDBFF645@E;9>>B?B@        NH:i:2  HI:i:2


head of GTF:

#!genome-build GRCh38.p7
#!genome-version GRCh38
#!genome-date 2013-12
#!genome-build-accession NCBI:GCA_000001405.22
#!genebuild-last-updated 2016-06
1       havana  gene    11869   14409   .       +       .       gene_id "ENSG00000223972"; gene_version "5"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; havana_gene "OTTHUMG00000000961"; havana_gene_version "2";
1       havana  transcript      11869   14409   .       +       .       gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; havana_gene "OTTHUMG00000000961"; havana_gene_version "2"; transcript_name "DDX11L1-002"; transcript_source "havana"; transcript_biotype "processed_transcript"; havana_transcript "OTTHUMT00000362751"; havana_transcript_version "1"; tag "basic"; transcript_support_level "1";
1       havana  exon    11869   12227   .       +       .       gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; exon_number "1"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; havana_gene "OTTHUMG00000000961"; havana_gene_version "2"; transcript_name "DDX11L1-002"; transcript_source "havana"; transcript_biotype "processed_transcript"; havana_transcript "OTTHUMT00000362751"; havana_transcript_version "1"; exon_id "ENSE00002234944"; exon_version "1"; tag "basic"; transcript_support_level "1";


head transcriptome/cDNA sequ:

>ENST00000448914.1 cdna:known chromosome:GRCh38:14:22449113:22449125:1 gene:ENSG00000228985.1 gene_biotype:TR_D_gene transcript_biotype:TR_D_gene gene_symbol:TRDD3 description:T cell receptor delta diversity 3 [Source:HGNC Symbol;Acc:HGNC:12256]
ACTGGGGGATACG
>ENST00000631435.1 cdna:known chromosome:GRCh38:CHR_HSCHR7_2_CTG6:142847306:142847317:1 gene:ENSG00000282253.1 gene_biotype:TR_D_gene transcript_biotype:TR_D_gene gene_symbol:TRBD1 description:T cell receptor beta diversity 1 [Source:HGNC Symbol;Acc:HGNC:12158]
GGGACAGGGGGC
>ENST00000632684.1 cdna:known chromosome:GRCh38:7:142786213:142786224:1 gene:ENSG00000282431.1 gene_biotype:TR_D_gene transcript_biotype:TR_D_gene gene_symbol:TRBD1 description:T cell receptor beta diversity 1 [Source:HGNC Symbol;Acc:HGNC:12158]
GGGACAGGGGGC


Did I understand the problem correctly?
Any suggestion to make the STAR + Salmon pipeline for isoform quantification work is appreciated.....


Thank you very much!

Martin

Alexander Dobin

unread,
Sep 30, 2016, 3:01:22 PM9/30/16
to rna-star
Hi Martin,

this appears to be the problem reported in the previous posting.
I think you found the reason for it - it seems like GENCODE is now splitting the transcript ID into transcript_id and transcript_version.
This did not happen in the previous versions - there thanscript_id contained the version.
The quick way to fix it would be to  remove the  transcript_version "N";  field from the GTF file - that would force Salmon not to add the version to the transcript name.
You can also add the version "N" into the trancsript_id field if you wish to keep that information.

Cheers
Alex
Reply all
Reply to author
Forward
0 new messages