Handling of insertions/deletions/point mutations


Chakri

Oct 14, 2016, 5:59:46 AM
to Sailfish Users Group
Hi Rob,

We are currently using Sailfish in a series of our studies (assuming that Sailfish and Salmon now have comparable performance). We would like to know how Sailfish handles insertions, deletions, and point mutations, and how we can use a relaxed mismatch criterion. In your blog post titled 'not quite alignments' (http://robpatro.com/blog/?p=248), you mentioned a liberal coverage threshold and a small MEM size. Is this Salmon-specific?

We wanted to give special handling to some reads that are known to contain true variation in the form of insertions, deletions, and point mutations. For example, in one case we know that there can be 20-30 point mutations, on average, in a 350 bp region. In other cases, we know that there can be indels of up to 15 bp. Do Sailfish/Salmon account for this many mismatches by default? If not, what would you suggest for handling such special regions? We plan to re-run Sailfish/Salmon for these genes with a relaxed mismatch criterion. Thanks a lot.

Regards,
Chakri 

Rob

Oct 16, 2016, 5:30:22 PM
to Sailfish Users Group
Hi Chakri,

  Hi Rob,

We are currently using Sailfish in a series of our studies (assuming that Sailfish and Salmon now have comparable performance). We would like to know how Sailfish handles insertions, deletions, and point mutations, and how we can use a relaxed mismatch criterion. In your blog post titled 'not quite alignments' (http://robpatro.com/blog/?p=248), you mentioned a liberal coverage threshold and a small MEM size. Is this Salmon-specific?

Yes, the liberal coverage threshold and small MEM sizes that I refer to there are indeed Salmon-specific.  This is because the MEM-coverage notion of lightweight alignment is only present in Salmon (and, in fact, only with Salmon's non-default FMD-based index).  For Sailfish and Salmon (using the default index), the only parameter that might have a substantial effect on the alignment rate / sensitivity is the size of the k-mers considered during index construction (set with the -k parameter).  A smaller value of k can lead to more sensitive mapping, but runs the risk of introducing spurious mappings if it is set too low.  I'll note that, though you can expect high accuracy from both tools, Salmon incorporates a number of features and improvements atop the most recent updates to Sailfish (like Salmon's advanced bias modeling, etc.).
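For concreteness, here is a minimal sketch (in Python, calling the salmon executable) of building indexes over a few k-mer sizes so their effect on mapping sensitivity can be compared; the transcript FASTA name and output directory names are hypothetical, and salmon is assumed to be on the PATH:

```python
# Minimal sketch: build salmon indexes with several k-mer sizes (-k) so that
# their effect on mapping sensitivity can be compared. The FASTA path and
# output directory names are hypothetical placeholders.
import subprocess

transcripts = "transcripts_with_alt_sequences.fa"  # hypothetical input FASTA

for k in (31, 25, 21, 19, 17):
    subprocess.run(
        ["salmon", "index",
         "-t", transcripts,
         "-i", f"salmon_index_k{k}",
         "-k", str(k)],
        check=True,
    )
```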


We wanted to give special handling to some reads that are known to contain true variation in the form of insertions, deletions, and point mutations. For example, in one case we know that there can be 20-30 point mutations, on average, in a 350 bp region. In other cases, we know that there can be indels of up to 15 bp. Do Sailfish/Salmon account for this many mismatches by default? If not, what would you suggest for handling such special regions? We plan to re-run Sailfish/Salmon for these genes with a relaxed mismatch criterion. Thanks a lot.

When run in quasi-mapping mode with default parameters, Sailfish and Salmon place no hard restrictions on indel size.  Thus, a 15 bp indel should be OK.  Regarding the point mutations, it really is a matter of how adversarially they are distributed.  If you have 30 point mutations evenly spaced over a 350 bp segment, then you have, on average, a mutation every 11-12 bp.  This will make finding a reasonably specific (i.e. long) anchoring match very difficult.  If, however, you expect the mutations to be unevenly distributed (more concentrated in some subrange), then you may still be able to find a reasonable match, and the mapping should be OK.  If you're using Salmon, one thing you might consider is to dump the mapping information to a file and see what, if anything, maps to these high-variability regions.  By decreasing the k-mer size (`-k`) used during index construction, you can adjust this parameter until you're able to map reads there (again, with the caveat that too small a k might allow some spurious mappings).  If you can collect some statistics, I'd be more than happy to advise further once you have some details about how the mappings look.
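In case it helps, here is a rough sketch of what I mean by dumping the mapping information and checking the targets of interest. It assumes a salmon build that supports the `--writeMappings` option (which emits SAM-formatted mapping records); the file names, library type, and transcript IDs below are only placeholders:

```python
# Rough sketch: run salmon quant while writing the quasi-mappings to a SAM
# file, then count how many mapping records land on the high-variability
# targets. Assumes --writeMappings is available in this salmon version; all
# paths and transcript IDs are hypothetical placeholders.
import subprocess

targets = {"GENE_X_VARIANT_1", "GENE_X_VARIANT_2"}  # hypothetical target IDs

subprocess.run(
    ["salmon", "quant",
     "-i", "salmon_index_k31",
     "-l", "A",                       # let salmon infer the library type
     "-1", "sample_R1.fastq.gz",
     "-2", "sample_R2.fastq.gz",
     "-o", "quant_k31",
     "--writeMappings=mappings_k31.sam"],
    check=True,
)

hits = 0
with open("mappings_k31.sam") as sam:
    for line in sam:
        if line.startswith("@"):
            continue                   # skip SAM header lines
        ref = line.split("\t")[2]      # column 3 = reference (transcript) name
        if ref in targets:
            hits += 1
print(f"mapping records on the high-variability targets: {hits}")
```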

Best,
Rob
 

Regards,
Chakri 

Chakri

Oct 29, 2016, 4:53:19 PM
to Sailfish Users Group
Hi Rob,

Thanks a lot for your replies and willingness to advise further. We switched from Sailfish to Salmon on your advice. I used the default quasi-mapping mode with VBOpt on 75 bp paired-end unstranded data.

To leverage known variation, we are using GRCh38, which contains alternate sequences for some regions. We are trying to understand how salmon handles the multi-mapping involved with these alternate sequences (annotated as patches in GRCh38). As an example, please see the attached plot. The plot contains seven horizontal panels; each panel represents an alternate sequence of a single gene that is well known to be highly polymorphic in the human genome. The x-axis shows the different samples and the y-axis shows log2 of the read counts estimated by salmon. Along with the default k=31, I generated the index multiple times using k = 29, 27, 25, 23, 21, 19, and 17. For each sample (on the x-axis), the read counts estimated with the different values of k are shown in different colors (color legend on the right).

1. Effect of 'k' during index generation: referring to my previous post in this thread, we wanted to handle some highly polymorphic sequences specially. As you suggested, to observe the effect of the match length 'k', I used different values of 'k' during index generation, as described above. As I understand from the plot, the lower values of 'k' (k=17, 19, and in some cases k=21) resulted in more variable read counts, while I do not see comparably large variability at the longer match lengths. However, I would be very eager to hear your insights and suggestions on this.

2. Multi-mapping: If I understand correctly, in the case of multi-mapping reads, salmon probabilistically assigns each read to only one of the highly similar sequences during inference (although it initially maps to all similar locations). In other words, the reads are counted only once (?) and the estimated read counts do not include multi-mapping reads. In the example run shown in the attached plot, the reads of a single gene in each sample are distributed across seven highly similar sequences when the alternative sequences are supplied. Does this mean that the read counts have to be summed across all the alternate sequences of a single gene to get its expression level? Is this valid?

Thanks a lot.

Regards,
Chakri


k_effect.pdf

Rob

Nov 13, 2016, 3:53:04 PM
to Sailfish Users Group
Hi Chakri,

  I will try and reply with my thoughts below.  Please let me know if what I say raises any further questions, or if you have any follow up questions.


On Saturday, October 29, 2016 at 4:53:19 PM UTC-4, Chakri wrote:
Hi Rob,

Thanks a lot for your replies and willingness to advise further. We switched from Sailfish to Salmon after your advice. I used the default quasi-mapping mode with VBOpt on a 75bp paired-end unstranded data. 

To leverage known variation, we are using GRCh38 that contains alternate sequences for some regions. We are trying to understand how salmon handles the multi-mapping involved with alternate sequences (annotated as patches in GRCh38). As an example, please see the attached plot. The plot contains seven horizontal grids; each grid represents an alternate sequence of a single gene that is famously known to be highly polymorphic in human genome. On X-axis are different samples and on Y-axis the log2(estimated readcounts by salmon). Along with default k=31, I generated the index multiple times using k=(29, 27, 25, 23, 21, 19 and 17). For each sample (on x-axis), the readcounts estimated by varying 'k' are shown in different colors in the plot (color legend on the right). 

1.  effect of 'k' during index generation:  referring to my previous post in this thread, we wanted to specially handle some of the sequences that are highly polymorphic. As suggested by you, to observe the effect of matching length 'k', I used different values of 'k' during index generation as described above. As I understand from the plot, the lower values of  'k' (k=17, 19 and in some cases k=21) resulted in more variable readcounts, while I do not see relatively huge variability with increased matching length. However, I would be very eager to hear your insights and suggestions on this.  

I think that your intention concerning mapping to "highly-variable" genes / transcripts is a bit different than I originally interpreted.  Specifically, I thought you would be mapping to a target gene for which you expected many alternative, unknown, highly variable variants to be present in the sample.  If you know which variants you expect to occur, and you include them in the salmon index, then salmon should quantify them well regardless of the k-mer size (and in this case, a large k-mer size should not be problematic).  The quasi-mapping algorithm used by salmon will search for the best mapping for a read (allowing for many equally best mappings if they exist).  In this regard, a multimapping read that originates from a non-variable region of such genes will be considered when quantifying all of them, but any read that overlaps known variation will map only to those specific targets containing these variants.  As an example / analogy, we get good results when using salmon to look at allele-specific expression: the quasi-mapping algorithm is able to map reads that overlap the known alleles specifically, and this informs the allocation of the other multi-mapping reads that derive from the non-unique regions of these transcripts.
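As an illustrative sketch (file names are hypothetical), "including the variants in the index" just means adding their sequences to the FASTA of quantification targets before indexing:

```python
# Sketch: concatenate the known variant sequences with the reference
# transcriptome so that they become part of the quantification targets.
# File names are hypothetical placeholders.
import shutil

with open("quant_targets.fa", "wb") as out:
    for fasta in ("reference_transcripts.fa", "known_variant_sequences.fa"):
        with open(fasta, "rb") as src:
            shutil.copyfileobj(src, out)

# The combined FASTA would then be indexed as usual, e.g.:
#   salmon index -t quant_targets.fa -i salmon_index_k31 -k 31
```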

Regarding the increased variability in read counts with shorter k-mers, this is interesting.  The quantification results should be robust to (though not completely independent of) the chosen k-mer size.  It could be that there are reads that map, either correctly or spuriously, to these transcripts with the shorter k-mer size but go completely unmapped with the larger k-mers.  The quasi-mapping procedure will look for the best mapping for a read (i.e. trying to match as much of the read as possible while skipping over errors / indels).  However, the minimum "threshold" for assigning a read is the k-mer length.  Thus, a read with no matching k-mers will be considered unmapped, whereas a read with at least one matching k-mer can be mapped (though, depending on how the rest of the read maps, it may still be considered unmapped).  Judging from the plots you uploaded, it looks like the quantifications of the variants are generally consistent in terms of their trend (i.e. more or fewer reads mapped to different variants), but larger k-mers result in fewer overall mapped reads.  It's hard to say without looking specifically at how the reads are mapping, but the increased variability with smaller values of k may be coming from reads that go completely unmapped when k is larger.  What do your overall mapping rates look like (these stats are recorded in the meta_info.json file under the aux_info directory)?
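If it's useful, something like the following sketch could pull those numbers out across your runs. It assumes the fields are named `percent_mapped`, `num_mapped`, and `num_processed` in `aux_info/meta_info.json` (field names can vary between salmon versions), and the quant directory names are hypothetical:

```python
# Sketch: read the overall mapping statistics recorded by salmon for each
# k-mer size. Field names are assumptions (they may differ between salmon
# versions); quant directory names are hypothetical placeholders.
import json

for k in (31, 25, 21, 19, 17):
    with open(f"quant_k{k}/aux_info/meta_info.json") as fh:
        meta = json.load(fh)
    print(f"k={k}: "
          f"percent_mapped={meta.get('percent_mapped')}, "
          f"num_mapped={meta.get('num_mapped')}, "
          f"num_processed={meta.get('num_processed')}")
```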
 

2. Multi-mapping: If I understand correctly, in the case of multi-mapping reads, salmon probabilistically assigns the read to only one of the highly similar sequences during inference (although it maps to all similar locations initially). In other words, the reads are counted only once (?) and the estimated readcounts do not contain multi-mapping reads.  In the example run shown in the attached plot, in each sample, the reads of a single gene are distributed across seven highly similar sequences when supplied with alternative sequences. Does this mean that the readcounts have to be summed up across all the alternate sequences of a single gene to get its expression level? Is this valid?

That's not quite right.  Salmon's statistical inference procedure does a soft assignment of reads to transcripts.  That is, it computes the probability that a specific read comes from each of the transcripts to which it maps (and this probability takes into account the mapping, the transcript features, and all other reads in the experiment --- by means of the estimated transcript abundances).  Thus, reads can be "partially" allocated to more than one transcript.  However, for each mapped read, the allocation probabilities over all of the transcripts where it receives a non-zero probability sum to 1.  So the answer to your last question is yes: to determine the total number of reads allocated to a gene, one would sum the 'NumReads' of all transcripts belonging to that gene.  If you're interested in doing a gene-level analysis with salmon's quantification results, I suggest you take a look at the excellent tximport (https://bioconductor.org/packages/release/bioc/html/tximport.html) package.  It supports salmon out of the box, and lets you easily aggregate salmon's transcript-level expression estimates to the gene level for downstream use with e.g. DESeq2, edgeR, etc.
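tximport is the recommended route, but just to make the aggregation concrete, here is a small sketch of the same idea in Python; the `tx2gene.tsv` mapping file (transcript ID, tab, gene ID) and the quant directory are hypothetical:

```python
# Sketch of the gene-level aggregation: sum salmon's per-transcript NumReads
# within each gene using a transcript-to-gene mapping. The tx2gene.tsv file
# and the quant.sf path are hypothetical placeholders.
import csv
from collections import defaultdict

tx2gene = {}
with open("tx2gene.tsv") as fh:
    for tx_id, gene_id in csv.reader(fh, delimiter="\t"):
        tx2gene[tx_id] = gene_id

gene_counts = defaultdict(float)
with open("quant_k31/quant.sf") as fh:
    for row in csv.DictReader(fh, delimiter="\t"):
        gene = tx2gene.get(row["Name"], row["Name"])  # fall back to transcript ID
        gene_counts[gene] += float(row["NumReads"])

for gene, count in sorted(gene_counts.items()):
    print(gene, round(count, 2))
```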

Sorry for the delayed reply.  We've been busy trying to finish up some lingering projects this past week so that has been taking up our time.  Let me know if you have any further questions.

Best,
Rob

Chakri

Nov 18, 2016, 6:54:56 AM
to Sailfish Users Group
Hi Rob,

Thanks a lot once again for the detailed explanations. I have a few follow-up questions below.


On Sunday, November 13, 2016 at 9:53:04 PM UTC+1, Rob wrote:

Hi Chakri,

  I will try and reply with my thoughts below.  Please let me know if what I say raises any further questions, or if you have any follow up questions.

On Saturday, October 29, 2016 at 4:53:19 PM UTC-4, Chakri wrote:
Hi Rob,

Thanks a lot for your replies and willingness to advise further. We switched from Sailfish to Salmon after your advice. I used the default quasi-mapping mode with VBOpt on a 75bp paired-end unstranded data. 

To leverage known variation, we are using GRCh38 that contains alternate sequences for some regions. We are trying to understand how salmon handles the multi-mapping involved with alternate sequences (annotated as patches in GRCh38). As an example, please see the attached plot. The plot contains seven horizontal grids; each grid represents an alternate sequence of a single gene that is famously known to be highly polymorphic in human genome. On X-axis are different samples and on Y-axis the log2(estimated readcounts by salmon). Along with default k=31, I generated the index multiple times using k=(29, 27, 25, 23, 21, 19 and 17). For each sample (on x-axis), the readcounts estimated by varying 'k' are shown in different colors in the plot (color legend on the right). 

1.  effect of 'k' during index generation:  referring to my previous post in this thread, we wanted to specially handle some of the sequences that are highly polymorphic. As suggested by you, to observe the effect of matching length 'k', I used different values of 'k' during index generation as described above. As I understand from the plot, the lower values of  'k' (k=17, 19 and in some cases k=21) resulted in more variable readcounts, while I do not see relatively huge variability with increased matching length. However, I would be very eager to hear your insights and suggestions on this.  

I think that your intention concerning mapping to "highly-variable" genes / transcripts is a bit different than I originally interpreted.  Specifically, I thought you would be mapping to a target gene for which you expected many alternative, unknown, highly variable variants to be present in the sample.  If you know which variants you expect to occur, and you include them in the salmon index, then salmon should quantify them well regardless of the k-mer size (and in this case, a large k-mer size should not be problematic).  The quasi-mapping algorithm used by salmon will search for the best mapping for a read (allowing for many equally best mappings if they exist).  In this regard, a multimapping read that originates from a non-variable region of such genes will be considered when quantifying all of them, but any read that overlaps known variation will map only to those specific targets containing these variants.  As an example / analogy, we get good results when using salmon to look at allele-specific expression: the quasi-mapping algorithm is able to map reads that overlap the known alleles specifically, and this informs the allocation of the other multi-mapping reads that derive from the non-unique regions of these transcripts.

As per your explanation (if I understood it correctly): let us assume a gene has 'n' possible alleles in the population, but only one of the 'n' possible alleles is expressed in each sample. If we include all 'n' possible sequences in the salmon index, salmon assigns read counts to all of those sequences, but the sequence with the higher read count is the allele truly expressed in the sample. If the sample has an allele that we did not supply in the salmon index, the next best match would give the estimated expression (although this would be on the lower side if there is large variability among the possible alleles). Is this correct?
 

Regarding the increased variability in read counts with shorter k-mers, this is interesting.  The quantification results should be robust to (though not completely independent of) the chosen k-mer size.  It could be that there are reads that map, either correctly or spuriously, to these transcripts with the shorter k-mer size but go completely unmapped with the larger k-mers.  The quasi-mapping procedure will look for the best mapping for a read (i.e. trying to match as much of the read as possible while skipping over errors / indels).  However, the minimum "threshold" for assigning a read is the k-mer length.  Thus, a read with no matching k-mers will be considered unmapped, whereas a read with at least one matching k-mer can be mapped (though, depending on how the rest of the read maps, it may still be considered unmapped).  Judging from the plots you uploaded, it looks like the quantifications of the variants are generally consistent in terms of their trend (i.e. more or fewer reads mapped to different variants), but larger k-mers result in fewer overall mapped reads.  It's hard to say without looking specifically at how the reads are mapping, but the increased variability with smaller values of k may be coming from reads that go completely unmapped when k is larger.  What do your overall mapping rates look like (these stats are recorded in the meta_info.json file under the aux_info directory)?

The overall mapping rates dipped with shorter k-mers (attached plot; colors correspond to different samples). I vaguely remember from the parameter list that, by default, a read mapping to more than 100 locations is considered unmapped. Does this option affect the mapping rate with shorter k-mers? I would be interested to hear your suggestions about the k-mer length.
 
 

2. Multi-mapping: If I understand correctly, in the case of multi-mapping reads, salmon probabilistically assigns the read to only one of the highly similar sequences during inference (although it maps to all similar locations initially). In other words, the reads are counted only once (?) and the estimated readcounts do not contain multi-mapping reads.  In the example run shown in the attached plot, in each sample, the reads of a single gene are distributed across seven highly similar sequences when supplied with alternative sequences. Does this mean that the readcounts have to be summed up across all the alternate sequences of a single gene to get its expression level? Is this valid?

That's not quite right.  Salmon's statistical inference procedure does a soft assignment of reads to transcripts.  That is, it computes the probability that a specific read comes from each of the transcripts to which it maps (and this probability takes into account the mapping, the transcript features, and all other reads in the experiment --- by means of estimated transcript abundances).  Thus, reads can be "partially" allocated to more than one transcript.  However, for each mapped read, the sum of probabilities for all of the transcripts where it is assigned a non-zero allocation probability will sum to 1.  Thus, your answer to your last question is right.  That is, to determine the total number of reads allocated to a gene, one would sum the 'NumReads' for all transcripts belonging to that gene.  If you're interested in doing a gene-level analysis with salmon's quantification results, I suggest you take a look at the excellent tximport (https://bioconductor.org/packages/release/bioc/html/tximport.html) package.  It supports salmon out of the box, and lets you easily aggregate salmon's transcript level expression estimates to the gene level to be used downstream with e.g. DESeq2, EdgeR, etc.

Thank you for this detailed explanation, it was helpful. Is there a graphical abstract or layman summary of salmon's methodology (read assignment and quantification) available somewhere? Thanks. 

Regards,
Chakri
mapping_percent_varyingK.pdf

Rob

Dec 2, 2016, 6:10:08 PM
to Sailfish Users Group
Hi Chakri,

  No problem (and sorry I'm slower than normal in getting back to you).  Here are my thoughts on the below:


On Friday, November 18, 2016 at 6:54:56 AM UTC-5, Chakri wrote:
Hi Rob,

Thanks a lot once again for the detailed explanations. I have a few follow-up questions below.


On Sunday, November 13, 2016 at 9:53:04 PM UTC+1, Rob wrote:

Hi Chakri,

  I will try and reply with my thoughts below.  Please let me know if what I say raises any further questions, or if you have any follow up questions.

On Saturday, October 29, 2016 at 4:53:19 PM UTC-4, Chakri wrote:
Hi Rob,

Thanks a lot for your replies and willingness to advise further. We switched from Sailfish to Salmon after your advice. I used the default quasi-mapping mode with VBOpt on a 75bp paired-end unstranded data. 

To leverage known variation, we are using GRCh38 that contains alternate sequences for some regions. We are trying to understand how salmon handles the multi-mapping involved with alternate sequences (annotated as patches in GRCh38). As an example, please see the attached plot. The plot contains seven horizontal grids; each grid represents an alternate sequence of a single gene that is famously known to be highly polymorphic in human genome. On X-axis are different samples and on Y-axis the log2(estimated readcounts by salmon). Along with default k=31, I generated the index multiple times using k=(29, 27, 25, 23, 21, 19 and 17). For each sample (on x-axis), the readcounts estimated by varying 'k' are shown in different colors in the plot (color legend on the right). 

1.  effect of 'k' during index generation:  referring to my previous post in this thread, we wanted to specially handle some of the sequences that are highly polymorphic. As suggested by you, to observe the effect of matching length 'k', I used different values of 'k' during index generation as described above. As I understand from the plot, the lower values of  'k' (k=17, 19 and in some cases k=21) resulted in more variable readcounts, while I do not see relatively huge variability with increased matching length. However, I would be very eager to hear your insights and suggestions on this.  

I think that your intention concerning mapping to "highly-variable" genes / transcripts is a bit different than I originally interpreted.  Specifically, I thought you would be mapping to a target gene for which you expected many alternative, unknown, highly variable variants to be present in the sample.  If you know which variants you expect to occur, and you include them in the salmon index, then salmon should quantify them well regardless of the k-mer size (and in this case, a large k-mer size should not be problematic).  The quasi-mapping algorithm used by salmon will search for the best mapping for a read (allowing for many equally best mappings if they exist).  In this regard, a multimapping read that originates from a non-variable region of such genes will be considered when quantifying all of them, but any read that overlaps known variation will map only to those specific targets containing these variants.  As an example / analogy, we get good results when using salmon to look at allele-specific expression: the quasi-mapping algorithm is able to map reads that overlap the known alleles specifically, and this informs the allocation of the other multi-mapping reads that derive from the non-unique regions of these transcripts.

As per your explanation (if I understood it correctly): let us assume a gene has 'n' possible alleles in the population, but only one of the 'n' possible alleles is expressed in each sample. If we include all 'n' possible sequences in the salmon index, salmon assigns read counts to all of those sequences, but the sequence with the higher read count is the allele truly expressed in the sample. If the sample has an allele that we did not supply in the salmon index, the next best match would give the estimated expression (although this would be on the lower side if there is large variability among the possible alleles). Is this correct?

Roughly, yes, this is the case.  However, the exact behavior will depend on (1) how similar / dissimilar the variants are, (2) how many reads you have that are uniquely assignable to a particular variant, and (3) which specific inference method you use (EM --- the default --- versus VBEM --- the `--useVBOpt` flag).  Basically, salmon will assign abundances to all variants in the way that tries to maximize, globally, the likelihood of observing the provided read mappings.  This means that the EM algorithm will try to allocate reads to transcripts in a "parsimonious" manner --- e.g. if I have many reads uniquely mapping to one variant, but no reads uniquely mapping to any other variant, then all (or almost all) of the reads may be assigned to the single variant with the uniquely mapping reads.  The VBEM algorithm will do something similar, but has a more "regularized" behavior --- i.e., it will tend to spread the multi-mappers around a bit more --- see here (https://github.com/COMBINE-lab/salmon/issues/107) for an interesting example of one of the (mostly uncommon) cases where the inference methods behave differently.  If a truly present variant is missing from the index, other variants will receive its reads in a manner proportional to how well they match that unknown variant.
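If you want to see how much the two optimizers differ on your variants, one simple sketch is to quantify the same sample twice and compare the NumReads columns; the paths and variant IDs below are hypothetical placeholders:

```python
# Sketch: quantify once with the default EM and once with --useVBOpt, then
# compare NumReads for the variant sequences of interest. Paths and IDs are
# hypothetical placeholders.
import csv
import subprocess

def quant(out_dir, extra_args):
    subprocess.run(
        ["salmon", "quant",
         "-i", "salmon_index_k31", "-l", "A",
         "-1", "sample_R1.fastq.gz", "-2", "sample_R2.fastq.gz",
         "-o", out_dir] + extra_args,
        check=True,
    )

quant("quant_em", [])
quant("quant_vbem", ["--useVBOpt"])

def num_reads(quant_sf):
    with open(quant_sf) as fh:
        return {row["Name"]: float(row["NumReads"])
                for row in csv.DictReader(fh, delimiter="\t")}

em = num_reads("quant_em/quant.sf")
vb = num_reads("quant_vbem/quant.sf")
for variant in ("GENE_X_VARIANT_1", "GENE_X_VARIANT_2"):  # hypothetical IDs
    print(variant, "EM:", em.get(variant), "VBEM:", vb.get(variant))
```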
 
 

Regarding the increased variability in read counts with shorter k-mers, this is interesting.  The quantification results should be robust to (though not completely independent of) the chosen k-mer size.  It could be that there are reads that map, either correctly or spuriously, to these transcripts with the shorter k-mer size but go completely unmapped with the larger k-mers.  The quasi-mapping procedure will look for the best mapping for a read (i.e. trying to match as much of the read as possible while skipping over errors / indels).  However, the minimum "threshold" for assigning a read is the k-mer length.  Thus, a read with no matching k-mers will be considered unmapped, whereas a read with at least one matching k-mer can be mapped (though, depending on how the rest of the read maps, it may still be considered unmapped).  Judging from the plots you uploaded, it looks like the quantifications of the variants are generally consistent in terms of their trend (i.e. more or fewer reads mapped to different variants), but larger k-mers result in fewer overall mapped reads.  It's hard to say without looking specifically at how the reads are mapping, but the increased variability with smaller values of k may be coming from reads that go completely unmapped when k is larger.  What do your overall mapping rates look like (these stats are recorded in the meta_info.json file under the aux_info directory)?

The overall mapping rates dipped with shorter k-mers (attached plot; colors correspond to different samples). I vaguely remember from the parameter list that, by default, a read mapping to more than 100 locations is considered unmapped. Does this option affect the mapping rate with shorter k-mers? I would be interested to hear your suggestions about the k-mer length.

  It's difficult to say exactly why the mapping rate might dip here with a decrease in the k-mer size; there are a couple of possible reasons.  It is possible that a shorter k-mer size leads to increased mapping sensitivity, and some reads exceed the default multi-mapping bound and so are not being considered.  Another possibility is that, with a longer minimum match size (and given the distribution of errors and un-mappable substrings in the reads), some regions of a read are completely uncovered by matching k=30-mers but are covered by matching k=18-mers (or some shorter length).  However, if the set of shorter matches doesn't consistently support at least one transcript, then the entire read can go unmapped.  Put another way, if there are short matches to more transcripts, but no single transcript accounts for all of the shorter matches, then a read that previously mapped could now fail to map.  You might be able to test one or the other of these hypotheses by raising the default limit on the number of locations to which a read is allowed to map.
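As a hedged sketch of that test: re-run one of the small-k quantifications with a larger limit on the number of mapping locations per read and compare the recorded mapping rate. I'm assuming here that your salmon version exposes that limit as `--maxReadOcc` (please check `salmon quant --help`); all paths are hypothetical:

```python
# Sketch: test whether the multi-mapping bound is limiting the mapping rate by
# re-running quantification with a larger limit on the number of locations a
# read may map to. The --maxReadOcc flag name is an assumption (check
# `salmon quant --help`); paths are hypothetical placeholders.
import subprocess

subprocess.run(
    ["salmon", "quant",
     "-i", "salmon_index_k17", "-l", "A",
     "-1", "sample_R1.fastq.gz", "-2", "sample_R2.fastq.gz",
     "-o", "quant_k17_maxocc1000",
     "--maxReadOcc", "1000"],
    check=True,
)
# Compare percent_mapped in quant_k17_maxocc1000/aux_info/meta_info.json with
# the default run to see whether the bound was responsible for the dip.
```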
 
 

2. Multi-mapping: If I understand correctly, in the case of multi-mapping reads, salmon probabilistically assigns the read to only one of the highly similar sequences during inference (although it maps to all similar locations initially). In other words, the reads are counted only once (?) and the estimated readcounts do not contain multi-mapping reads.  In the example run shown in the attached plot, in each sample, the reads of a single gene are distributed across seven highly similar sequences when supplied with alternative sequences. Does this mean that the readcounts have to be summed up across all the alternate sequences of a single gene to get its expression level? Is this valid?

That's not quite right.  Salmon's statistical inference procedure does a soft assignment of reads to transcripts.  That is, it computes the probability that a specific read comes from each of the transcripts to which it maps (and this probability takes into account the mapping, the transcript features, and all other reads in the experiment --- by means of estimated transcript abundances).  Thus, reads can be "partially" allocated to more than one transcript.  However, for each mapped read, the sum of probabilities for all of the transcripts where it is assigned a non-zero allocation probability will sum to 1.  Thus, your answer to your last question is right.  That is, to determine the total number of reads allocated to a gene, one would sum the 'NumReads' for all transcripts belonging to that gene.  If you're interested in doing a gene-level analysis with salmon's quantification results, I suggest you take a look at the excellent tximport (https://bioconductor.org/packages/release/bioc/html/tximport.html) package.  It supports salmon out of the box, and lets you easily aggregate salmon's transcript level expression estimates to the gene level to be used downstream with e.g. DESeq2, EdgeR, etc.

Thank you for this detailed explanation, it was helpful. Is there a graphical abstract or layman summary of salmon's methodology (read assignment and quantification) available somewhere? Thanks. 

There's not, but you've piqued my interest in making one.  I'll let you know when I have something.

Best,
Rob