Salmon Testing2

Jonathan Keats

unread,

Feb 24, 2015, 5:22:00 PM2/24/15

to sailfis...@googlegroups.com

Hi Rob,

I managed to get a first complete run through of salmon 0.2.7. I've attached a powerpoint with some of the slides in case you want to look at them but I'll summarize in text my opinions

1) My previous issue with sailfish using paired-end libaries was clearly related to the fasta file we used, using a cDNA fasta from ensembl fixed the issue

2) In general my gut tells me salmon is an improvement over sailfish. Principally because there are a number of genes that were TPM>1 in sailfish that go to TPM=0 in salmon. I suspect this is expect due to the better use of paired-end fragments in salmon

3) Running salmon 0.2.7 read based or alignment based I'm seeing a weirdly near perfect TPM correlation for stranded versus unstranded using both read and alignment based quantitation. This seems fishy as I'd expect things to be similar but the added value of standed should change some expression measures more that we are seeing in my opinion.

4) The gene level summary is working differently for read and alignment based quantifications. If you do read based the number of genes is based on the fasta file used to create the index NOT the GTF supplied for the summary function, which seems correct. While the alignment based quant outputs genes based on the GTF provided to STAR during index build OR the same GTF provided to salmon, NOT the transcriptome fasta provided to Salmon during quantitation, which was my assumption since you are providing a fasta to salmon that quantitation would be limited to it. It looks like this is also the case for transcript level summary.

Last:

Is there any chance you can post a CentOS binary for v3.0 like you have for previous releases (https://github.com/kingsfordgroup/sailfish/releases). Our IT team is having little success building from source on our CentOS 6.5 cluster and I'd like to test v0.3.0 as you suggested earlier.

Jonathan

#!/bin/bash
#PBS -l nodes=1:ppn=12
#PBS -N Salmon_Dev
#PBS -l walltime=48:00:00
#PBS -m abe
#PBS -M jke...@tgen.org
#PBS -j oe
#PBS -q overflow


cd $PBS_O_WORKDIR


#Load sailfish module, can't seem to figure out how to build independent of IT
module load sailfish/0.6.3


#####################################################
##
##   VARIABLES
##
#####################################################


SALMON_BINARY=/home/jkeats/downloads/Salmon-v0.2.7_CentOS-6.3/bin/salmon
SALMON_INDEX=/scratch/jkeats/salmon/salmon_index_E74hs37d5
SAILFISH_INDEX_DIR=/scratch/jkeats/salmon/sailfish_index
STAR_BINARY=/home/jkeats/downloads/STAR-STAR_2.4.0i/bin/Linux_x86_64/STAR
STAR_INDEX=/scratch/jkeats/STAR_NEW/124bpOverHang/


#TRANSCRIPTOME_FASTA=/home/tgenref/pecan/sailfish_index/Homo_sapiens.GRCh37.74.gtf.hs37d5.EGFRvIII.fa
TRANSCRIPTOME_FASTA=/scratch/jkeats/salmon/Homo_sapiens.GRCh37.74.cdna.hs37d5.fasta


#GTF=/home/tgenref/pecan/ensembl_v74/Homo_sapiens.GRCh37.74_hs37d5_EGFRvIII_Sailfish.gtf
GTF=/home/tgenref/pecan/ensembl_v74/Homo_sapiens.GRCh37.74.gtf.hs37d5.EGFRvIII.gtf

FORWARD=/scratch/jkeats/salmon/RPMI8226_RETparental_p0_CL_Whole_T2_KBMRS_K12719.proj.R1.fastq.gz
REVERSE=/scratch/jkeats/salmon/RPMI8226_RETparental_p0_CL_Whole_T2_KBMRS_K12719.proj.R2.fastq.gz

#####################################################
##
##   PARAMETERIZED CODE
##
#####################################################


# Create the Salmon Index
${SALMON_BINARY} index -t ${TRANSCRIPTOME_FASTA} --index salmon_index_E74hs37d5 -p 12

${SALMON_BINARY} quant \
 --index ${SALMON_INDEX} \
 --libType IU \
 -1 <(zcat ${FORWARD}) \
 -2 <(zcat ${REVERSE}) \
 --threads 12 \
 --geneMap ${GTF} \
 --output salmon_unstranded

${SALMON_BINARY} quant \
 --index ${SALMON_INDEX} \
 --libType ISR \
 -1 <(zcat ${FORWARD}) \
 -2 <(zcat ${REVERSE}) \
 --threads 12 \
 --geneMap ${GTF} \
 --output salmon_stranded

${STAR_BINARY} --genomeDir ${STAR_INDEX} \
 --readFilesIn ${FORWARD} ${REVERSE} \
 --outSAMattributes NH HI \
 --outFilterMultimapNmax 20 \
 --outFilterMismatchNmax 999 \
 --outFilterMismatchNoverLmax 0.04 \
 --alignIntronMin 20 \
 --alignIntronMax 1000000 \
 --alignMatesGapMax 1000000 \
 --alignSJoverhangMin 8 \
 --alignSJDBoverhangMin 1 \
 --quantMode TranscriptomeSAM \
 --runThreadN 12 \
 --readFilesCommand zcat

${SALMON_BINARY} quant \
        --targets ${TRANSCRIPTOME_FASTA} \
        --libType IU \
        --alignments Aligned.toTranscriptome.out.bam \
        --threads 12 \
        --geneMap ${GTF} \
        --output salmon_alignment_unstranded

${SALMON_BINARY} quant \
 --targets ${TRANSCRIPTOME_FASTA} \
 --libType ISR \
 --alignments Aligned.toTranscriptome.out.bam \
 --threads 12 \
 --geneMap ${GTF} \
 --output salmon_alignment_stranded

sailfish index -t ${TRANSCRIPTOME_FASTA} -o sailfish_index -k 20 -p 16

#The libtype will need to be dynamic for Single End runs, or stranded runs.  Currently assumes Illumina PE, unstranded library
sailfish quant -i ${SAILFISH_INDEX_DIR} --libtype "T=PE:O=><:S=U" --mates1 <(zcat ${FORWARD}) --mates2 <(zcat ${REVERSE}) --out sailfish_unstranded --threads 12 --polya

cd sailfish_unstranded

#Rename and process output to generate gene level expression estimates
/home/tgenref/pecan/bin/Sailfish_0.6.3/TranscriptsToGenes.sh --exp-file quant_bias_corrected.sf --gtf-file ${GTF} --res-file sailfish_genes.expr

cd ..

#The libtype will need to be dynamic for Single End runs, or stranded runs.  Currently assumes Illumina PE, unstranded library
sailfish quant -i ${SAILFISH_INDEX_DIR} --libtype "T=PE:O=><:S=AS" --mates1 <(zcat ${FORWARD}) --mates2 <(zcat ${REVERSE}) --out sailfish_stranded --threads 12 --polya

cd sailfish_stranded

#Rename and process output to generate gene level expression estimates
/home/tgenref/pecan/bin/Sailfish_0.6.3/TranscriptsToGenes.sh --exp-file quant_bias_corrected.sf --gtf-file ${GTF} --res-file sailfish_genes.expr

cd ..

Fishtools_Testing.pptx

Rob

unread,

Feb 24, 2015, 5:51:19 PM2/24/15

to

Hi Jonathan,

First, thank you for the detailed usage information, the analysis and report. This is very useful to us. Let me do my best to address your questions / thoughts below:

Hi Rob,

I managed to get a first complete run through of salmon 0.2.7. I've attached a powerpoint with some of the slides in case you want to look at them but I'll summarize in text my opinions

1) My previous issue with sailfish using paired-end libaries was clearly related to the fasta file we used, using a cDNA fasta from ensembl fixed the issue

Ah yes --- it's very important that the reference be as expected. This is why we've added the feature to salmon to detect unexpected strand bias (more than anything, it helps in debugging a reference / annotation).

2) In general my gut tells me salmon is an improvement over sailfish. Principally because there are a number of genes that were TPM>1 in sailfish that go to TPM=0 in salmon. I suspect this is expect due to the better use of paired-end fragments in salmon

We certainly see salmon providing improved estimates in most scenarios. A significant part of this is due to better usage of the paired-end information. Further, it's also the case that salmon's probabilistic model accounts for more potential experimental confounding factors than does the model of sailfish. These two improvements, along with a new inference algorithm that we think is better, help salmon to produce more accurate results in the scenarios we've been testing so far.

3) Running salmon 0.2.7 read based or alignment based I'm seeing a weirdly near perfect TPM correlation for stranded versus unstranded using both read and alignment based quantitation. This seems fishy as I'd expect things to be similar but the added value of standed should change some expression measures more that we are seeing in my opinion.

This, I suspect, is a result of salmon's default settings. In particular, there are two command-line options that are relevant when dealing with a strand-specific library (where you want to enforce the strand specificity). These are:

-e [ --useReadCompat ] [Currently Experimental] : Use the
orientation in which fragments were
"mapped" to assign them a probability.
For example, fragments with an
incorrect relative orientation with
respect to the provided library format
string will be assigned the incompatPrior probability.

--incompatPrior arg (=1.0000000000000001e-05)
This option can only be used in
conjunction with --useReadCompat. It
sets the prior probability that an
alignment that disagrees with the
specified library type (--libType)
results from the true fragment origin.
Setting this to 0 says that alignments
that disagree with the library type
should be "impossible", while setting
it to 1 says that alignments that
disagree with the library type are no
less likely than those that do (in this
case, though, there is no reason to
even use --useReadCompat)

Even though --useReadCompat is marked experimental, it's been tested fairly thoroughly (and so this designation will likely be removed in the next release). Essentially, salmon always checks whether or not the mapping agrees with the provided library format (to let you know if what it found is what you expected). However, it only enforces this format if it has been passed --useReadCompat. The second parameter works in conjunction with --useReadCompat. Because one of the common use cases of salmon (e.g. in the Transrate tool) is quantification in de novo transcriptomes, it is typically the case (and I imagine this may not be much different even with an annotate transcriptome) that reads have a very small but not precisely 0 probability of mapping in a way that would violate the protocol (e.g. mis-assembled / chimeric transcripts). The --incompatPrior parameter lets one specify exactly how much a read that maps in a way other than is specified by the library type should be penalized. The smaller this value, the closer the scheme becomes to strictly disallowing reads from mapping in an unexpected manner.

4) The gene level summary is working differently for read and alignment based quantifications. If you do read based the number of genes is based on the fasta file used to create the index NOT the GTF supplied for the summary function, which seems correct. While the alignment based quant outputs genes based on the GTF provided to STAR during index build OR the same GTF provided to salmon, NOT the transcriptome fasta provided to Salmon during quantitation, which was my assumption since you are providing a fasta to salmon that quantitation would be limited to it. It looks like this is also the case for transcript level summary.

I see --- this is because in alignment-based mode, salmon uses the header present in the BAM file to determine the "universe" of reference sequences to which a read might map. The purpose of passing the separate fasta to alignment-based salmon is because it needs the actual transcript sequence to evaluate probabilities (e.g. alignment error probabilities) that are related to the actual transcript sequence, since these are not encoded in the BAM file. However, I absolutely see how what you are suggesting here makes sense. If I provide a BAM file and a FASTA file to salmon, and the FASTA file contains only a subset of the transcripts that appear in the header of the BAM file, then it make sense to expect my output to have entries only for the sequences present in the FASTA file. These things have been identical in our scenarios so far, but you bring up a good point that this need not be the case. I'll add an issue to GitHub for user discussion.

Last:
Is there any chance you can post a CentOS binary for v3.0 like you have for previous releases (https://github.com/kingsfordgroup/sailfish/releases). Our IT team is having little success building from source on our CentOS 6.5 cluster and I'd like to test v0.3.0 as you suggested earlier.

Absolutely. I'd be happy to build another CentOS binary. First, though, I'd encourage you to see if the Debian Squeeze binary works on your system. This binary was built on the oldest system where it was reasonably possible, and the goal is for it to be the most portable Linux binary we can provide. If the v0.2.7 CentOS binary worked for you, I'd expect this one will as well. However, please let me know if it doesn't and I can provide a CentOS one tomorrow.

Best,

Rob

...

Jonathan Keats

unread,

Feb 25, 2015, 2:40:50 AM2/25/15

to sailfis...@googlegroups.com

Thanks Rob,

The addition of the --useReadCompat to the stranded analysis definately makes the comparison of standed versus unstranded more believable. I wonder if this should be a required parameter if the user indicates it is a stranded library?

Rob

unread,

Feb 25, 2015, 10:02:51 AM2/25/15

to sailfis...@googlegroups.com

Hi Jonathan,

No problem --- I'm glad it seems to address the issue. I think that in the next release, it makes sense to have the behavior of --useReadCompat become the default, with an option to opt out (--ignoreReadCompat?) if the user doesn't want the read compatibility enforced.

--Rob

Reply all

Reply to author

Forward