general questions/notes about emirge output

soniat

Apr 6, 2012, 4:32:07 PM

to emirge...@googlegroups.com

1. It initially wasn't clear to me what the EMIRGE output was, but I realized that the iter.xx.cons.fasta file had sequences that were a few percent different from the sequences with the same names in the reference database. So I assumed those are my results.

2. On the EMIRGE github home page, it says "In the process, [EMIRGE] also provides estimates of the sequences' abundances."
Does this mean indirectly, by parsing the BAM file of the final iteration or by remapping one's reads to the output consensus sequences? Or am I not seeing something?

Chris Miller

Apr 7, 2012, 4:10:21 PM

to emirge...@googlegroups.com

Here's the relevant section from the README:

Once an EMIRGE run is completed, run emirge_rename_fasta.py on the
final iterations directory, for example:
emirge_rename_fasta.py iter.40 > renamed.fasta
Also see:

emirge_rename_fasta.py --help

emirge_rename_fasta.py orders the fasta file by decreasing abundance, and places those abundances in the headers.

Prior is the abundance estimate used in the Genome Biology paper, and is exactly that: the prior in the EM algorithm, the best guess of the relative abundance of that sequence.

NormPrior is a length-normalized version of the prior, redistributing the Prior by weighting each sequence based on its length. This is an attempt to make sure that longer sequences (with more reads mapping) do not get abundances which are over-estimated. In most cases, especially if the community is all bacterial, Prior is usually very close to NormPrior.
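To make the idea of length normalization concrete, here is a minimal sketch. It assumes that each prior is divided by its sequence length and the results are rescaled to sum to 1. The thread does not give the exact formula EMIRGE uses, so treat this as an illustration of the idea only; the function name length_normalize is hypothetical.

```python
def length_normalize(priors, lengths):
    """Illustrative length normalization (an assumed formula, not EMIRGE's code).

    Each prior is divided by its sequence length, and the weighted values
    are then rescaled so that they sum to 1 again.
    """
    if len(priors) != len(lengths):
        raise ValueError("priors and lengths must have the same number of entries")
    # Longer sequences attract more reads, so down-weight them by length.
    weighted = [p / n for p, n in zip(priors, lengths)]
    total = sum(weighted)
    return [w / total for w in weighted]
```

Under this assumption, sequences of equal length keep their original Prior values, which fits the remark that Prior and NormPrior are usually very close when sequence lengths are similar.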

The SILVA candidate SSU sequence that EMIRGE started with is also listed in the header. Usually, EMIRGE will have adjusted this sequence to reflect what the reads indicate was actually in your sample, which is why the EMIRGE sequence can differ from its starting reference by a few to several percent identity.
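If you want to pull the abundances out of renamed.fasta programmatically, something like the following might work. It assumes the header is a sequence name followed by whitespace-separated key=value fields such as Prior=, Length= and NormPrior=. That layout is an assumption, so check it against your own file first; parse_header is a hypothetical helper, not part of EMIRGE.

```python
def parse_header(line):
    """Split a FASTA header into (name, {key: value}).

    Assumes a layout like '>NAME Prior=0.25 Length=1400 NormPrior=0.26';
    this layout is an assumption and should be checked against real output.
    Values are kept as strings so that nothing is lost in conversion.
    """
    fields = line.strip().lstrip(">").split()
    name = fields[0]
    values = {}
    for token in fields[1:]:
        if "=" in token:
            key, _, value = token.partition("=")
            values[key] = value
    return name, values
```

For example, float(values["NormPrior"]) would then give the length-normalized abundance for that sequence, provided the field is present in your headers.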

Hope that's more clear.

Chris

manue...@gmail.com

Apr 4, 2017, 11:48:45 PM

to EMIRGE users

Hi Chris,

Could you please confirm that the EMIRGE output never reports how many times a given sequence was found as an integer count (60 times, 20 times, 120 times)? EMIRGE only outputs the relative abundance as a percentage, i.e., in this case 30%, 10% and 60%. Is this correct?