STAR + Salmon

Martin Selmansberger

unread,

Sep 29, 2016, 11:23:25 AM9/29/16

to Sailfish Users Group

Hello,

I intend to do isoform/transcript quantification from RNAseq data with STAR alignment + Salmon quantification in (Alignment based mode).

Therefor I have

- generated a Ref-Genome with STAR

STAR --runThreadN 15 \
     --runMode genomeGenerate \
     --genomeDir /home/zyto/SAN/zyto/MS/STAR_aligner/GRCh38/Genome_TCGA_TCversion \
     --genomeFastaFiles /home/zyto/SAN/zyto/MS/STAR_aligner/GRCh38/sequence/GRCh38_r85.all.fa \
     --sjdbGTFfile /home/zyto/SAN/zyto/MS/STAR_aligner/GRCh38/annotation/Homo_sapiens.GRCh38.85.gtf \
     --sjdbOverhang 47 \

- and aligned the reads with STAR, with the "quantMode TranscriptomeSAM" option in order to obtain transcriptome alignments

STAR --genomeDir /san/zyto/MS/STAR_aligner/GRCh38/Genome_TCGA\
     --runThreadN 12 \
     --readFilesIn /Ferdl/00529fb2-f36d-4f50-abf7-01b4e46739b1/read1.fq /Ferdl/00529fb2-f36d-4f50-abf7-01b4e46739b1/read2.fq \
     --outFileNamePrefix star_out_fq_new \
    --quantMode TranscriptomeSAM \

a subsequent quantification with RSEM gave individual transcriptome quantification:

transcript_id   gene_id length effective_length        expected_count TPM     FPKM    IsoPct
ENST00000373020 ENSG00000000003 2206    2140.99 211.67 11.25   6.87    93.81
ENST00000494424 ENSG00000000003 820     755.01 0.00    0.00    0.00    0.00
ENST00000496771 ENSG00000000003 1025    959.99 0.00    0.00    0.00    0.00
ENST00000612152 ENSG00000000003 3796    3730.99 24.33   0.74    0.45    6.19
ENST00000614008 ENSG00000000003 900     834.99 0.00    0.00    0.00    0.00
ENST00000373031 ENSG00000000005 1339    1273.99 0.00    0.00    0.00    0.00
ENST00000485971 ENSG00000000005 542     477.21 0.00    0.00    0.00    0.00
ENST00000371582 ENSG00000000419 1161    1095.99 0.00    0.00    0.00    0.00
ENST00000371584 ENSG00000000419 1073    1007.99 56.73   6.41    3.91    20.68

PROBLEM:

When I try to perform the quantification of the STAR aligned reads with Salmon (Alignment based mode),
I get the following Warning/Error which lead to abortion of the quantification with NO quant-results.

salmon quant -p 2 -t /san/zyto/MS/STAR_aligner/GRCh38/TRANSKRIPOME/Homo_sapiens.GRCh38.cdna.all.fa -g /san/zyto/MS/STAR_aligner/GRCh38/annotation/Homo_sapiens.GRCh38.85.gtf -l IU -a ./star_out_fqAligned.toTranscriptome.out.bam -o STAR_Salmon_quant

This is duo to this message (for every entry/transcript):

.
.
WARNING: Transcript ENST00000401542.2 appears in the reference but did not appear in the BAM
WARNING: Transcript ENST00000447063.6 appears in the reference but did not appear in the BAM
WARNING: Transcript ENST00000525249.1 appears in the reference but did not appear in the BAM
WARNING: Transcript ENST00000524591.6 appears in the reference but did not appear in the BAM
WARNING: Transcript ENST00000528335.5 appears in the reference but did not appear in the BAM
WARNING: Transcript ENST00000400069.7 appears in the reference but did not appear in the BAM
.
.
.
.

I think the problem is, that salmon merges the transcript ID + version (e.g. ENST00000401542, version 2 = ENST00000401542.2),
while STAR and RSEM use the ID only, without the version (e.g. only ENST00000401542).

The GTF and the ref-Seq-Fasta file were downloaded from ensebl......

head of the BAM/SAM from STAR (Aligned.toTranscriptome.out.bam) looks like:

NC12-SN629:170:D075HACXX:4:2301:1931:47690     419     ENST00000620552 5102    3       48M     =       5142    88 CCTGGGCAAGGGGGACTTCGTGTCGCTGGCACTGCGGGACCGCCGCCT @@=DBAA?C;=C811?;:911C?;DEAGB7DDBFF645@E;9>>B?B@        NH:i:2 HI:i:1
UNC12-SN629:170:D075HACXX:4:2301:1931:47690     339     ENST00000620552 5142    3       48M     =       5102    -88CGCCGCCTGGAGTTCCGCTACGACCTGGGCAAGGGGGCAGCGGTCATC B@FHHIIJJJJIIGJIIJJJJIGJJJJJIHJJJJJGHHHHFFFFFCCB        NH:i:2 HI:i:1
UNC12-SN629:170:D075HACXX:4:2301:1931:47690     163     ENST00000379370 5102    3       48M     =       5142    88 CCTGGGCAAGGGGGACTTCGTGTCGCTGGCACTGCGGGACCGCCGCCT @@=DBAA?C;=C811?;:911C?;DEAGB7DDBFF645@E;9>>B?B@        NH:i:2 HI:i:2

head of GTF:

#!genome-build GRCh38.p7
#!genome-version GRCh38
#!genome-date 2013-12
#!genome-build-accession NCBI:GCA_000001405.22
#!genebuild-last-updated 2016-06
1       havana gene    11869   14409   .       +       .       gene_id "ENSG00000223972"; gene_version "5"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; havana_gene "OTTHUMG00000000961"; havana_gene_version "2";
1       havana transcript      11869   14409   .       +       .       gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; havana_gene "OTTHUMG00000000961"; havana_gene_version "2"; transcript_name "DDX11L1-002"; transcript_source "havana"; transcript_biotype "processed_transcript"; havana_transcript "OTTHUMT00000362751"; havana_transcript_version "1"; tag "basic"; transcript_support_level "1";
1       havana exon    11869   12227   .       +       .       gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; exon_number "1"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; havana_gene "OTTHUMG00000000961"; havana_gene_version "2"; transcript_name "DDX11L1-002"; transcript_source "havana"; transcript_biotype "processed_transcript"; havana_transcript "OTTHUMT00000362751"; havana_transcript_version "1"; exon_id "ENSE00002234944"; exon_version "1"; tag "basic"; transcript_support_level "1";

head transcriptome/cDNA sequ:

>ENST00000448914.1 cdna:known chromosome:GRCh38:14:22449113:22449125:1 gene:ENSG00000228985.1 gene_biotype:TR_D_gene transcript_biotype:TR_D_gene gene_symbol:TRDD3 description:T cell receptor delta diversity 3 [Source:HGNC Symbol;Acc:HGNC:12256]
ACTGGGGGATACG
>ENST00000631435.1 cdna:known chromosome:GRCh38:CHR_HSCHR7_2_CTG6:142847306:142847317:1 gene:ENSG00000282253.1 gene_biotype:TR_D_gene transcript_biotype:TR_D_gene gene_symbol:TRBD1 description:T cell receptor beta diversity 1 [Source:HGNC Symbol;Acc:HGNC:12158]
GGGACAGGGGGC
>ENST00000632684.1 cdna:known chromosome:GRCh38:7:142786213:142786224:1 gene:ENSG00000282431.1 gene_biotype:TR_D_gene transcript_biotype:TR_D_gene gene_symbol:TRBD1 description:T cell receptor beta diversity 1 [Source:HGNC Symbol;Acc:HGNC:12158]
GGGACAGGGGGC

Did I understand the problem correctly?
Any suggestion to make the STAR + Salmon pipeline for isoform quantification work is appreciated.....

Thank you very much!

Martin

Rob

unread,

Sep 29, 2016, 7:13:10 PM9/29/16

to Sailfish Users Group

Hi Martin,

Indeed, Salmon will assume that everything in the fasta header (apart from the '>') up to the first whitespace is part of the target's name. Here, this is disagreeing with what is present in the BAM file, and so Salmon is unable to match up the sequence in the fasta file you give it with the targets listed in the BAM file. One thing you can try is to process the reference fasta file with a simple awk script. This should rename the target sequences to get rid of the "version" information:

awk '{if($0 ~ "^>") { split($1, x, "."); print x[1]; } else { print $0; } }' input.fa > output.fa

This just splits header lines on the '.' and takes the part before the first '.' appearance. It leaves other lines unchanged. Could you let me know if this resolves the issue for you? Depending on how common this is, maybe it makes sense to provide a script to do something like this as part of Salmon. It could also be included as a flag for alignment mode, but that seems unnecessarily complex.

Best,

Rob

Martin Selmansberger

unread,

Sep 30, 2016, 3:54:22 AM9/30/16

to Sailfish Users Group

Hi Rob,

thank you very much for the fast reply!
Your awk script seems to work, but unfortunately I still can´t do the quantification.
Does the akw script modify the reference .fa the way you intended?

HEAD reference .fa (original):

>ENST00000448914.1 cdna:known chromosome:GRCh38:14:22449113:22449125:1 gene:ENSG00000228985.1 gene_biotype:TR_D_gene transcript_biotype:TR_D_gene gene_symbol:TRDD3 description:T cell receptor delta diversity 3 [Source:HGNC Symbol;Acc:HGNC:12256]
ACTGGGGGATACG
>ENST00000631435.1 cdna:known chromosome:GRCh38:CHR_HSCHR7_2_CTG6:142847306:142847317:1 gene:ENSG00000282253.1 gene_biotype:TR_D_gene transcript_biotype:TR_D_gene gene_symbol:TRBD1 description:T cell receptor beta diversity 1 [Source:HGNC Symbol;Acc:HGNC:12158]
GGGACAGGGGGC
>ENST00000632684.1 cdna:known chromosome:GRCh38:7:142786213:142786224:1 gene:ENSG00000282431.1 gene_biotype:TR_D_gene transcript_biotype:TR_D_gene gene_symbol:TRBD1 description:T cell receptor beta diversity 1 [Source:HGNC Symbol;Acc:HGNC:12158]
GGGACAGGGGGC

HEAD processed (your akw) reference .fa:

>ENST00000448914
ACTGGGGGATACG
>ENST00000631435
GGGACAGGGGGC
>ENST00000632684
GGGACAGGGGGC

The Warning/Error I get is:
.
.
WARNING: Transcript ENST00000572106 appears in the reference but did not appear in the BAM
WARNING: Transcript ENST00000576073 appears in the reference but did not appear in the BAM
WARNING: Transcript ENST00000573592 appears in the reference but did not appear in the BAM
..

No version numbers in the reference .fa (processed with awk), but still can´t find the IDs in the BAM.

Although the IDs do appear in the BAM, at least in the same format

Head BAM:

UNC12-SN629:170:D075HACXX:4:2301:1931:47690 419 ENST00000620552 5102 3 48M = 5142 88 CCTGGGCAAGGGGGACTTCGTGTCGCTGGCACTGCGGGACCGCCGCCT @@=DBAA?C;=C811?;:911C?;DEAGB7DDBFF645@E;9>>B?B@ NH:i:2 HI:i:1
UNC12-SN629:170:D075HACXX:4:2301:1931:47690 339 ENST00000620552 5142 3 48M = 5102 -88 CGCCGCCTGGAGTTCCGCTACGACCTGGGCAAGGGGGCAGCGGTCATC B@FHHIIJJJJIIGJIIJJJJIGJJJJJIHJJJJJGHHHHFFFFFCCB NH:i:2 HI:i:1

UNC12-SN629:170:D075HACXX:4:2301:1931:47690 163 ENST00000379370 5102 3 48M = 5142 88 CCTGGGCAAGGGGGACTTCGTGTCGCTGGCACTGCGGGACCGCCGCCT @@=DBAA?C;=C811?;:911C?;DEAGB7DDBFF645@E;9>>B?B@ NH:i:2 HI:i:2

Does the BAM look ok? I assume that the BAM is ok, because it worked with RSEM.....
Do I need the gtf for the quantification?

salmon quant -p 2 -t /san/zyto/MS/STAR_aligner/GRCh38/TRANSKRIPOME/Homo_sapiens.GRCh38.cdna.all_TCproc.fa -g /san/zyto/MS/STAR_aligner/GRCh38/annotation/Homo_sapiens.GRCh38.85.gtf -l IU -a ./star_out_fqAligned.toTranscriptome.out.bam -o STAR_Salmon_quant

Thanks again!
Best

Martin

Rob

unread,

Oct 16, 2016, 5:11:20 PM10/16/16

to Sailfish Users Group

Hi Martin,

Sorry I missed your response! It's hard for me to say without looking at some of the actual data. Would you be able to share your fasta file and e.g. a small subset of your aligned reads? Really, just the header and and the first few thousand records should be sufficient. My guess is that there's some small difference between the way the headers appear in these different files (and, presumably, RSEM works because it's processed the files during its prepare-reference step).