Hi Jaci,
Until now, my main focus has been on guided placement. This is
meant for cases where one studies a "known" species (let's say 'camel').
We assume that we know (roughly) the phylogenetic position of the
target species; the species itself has little sequence data available
whereas many related species have lots. By efficiently using this
information, we can make accurate alignments of fragmented sequences
(DNA, protein, NGS reads).
I aim to extend PAGAN towards unguided placement and have implemented
functions that seem to work in my small-scale simulated examples (up
to 100 sequences in the reference alignment, not extremely diverged).
I haven't tested them on larger problems.
> For an unguided search, how exactly are the sequences aligned? Is PAGAN
> trying to pairwise align every read to every node on the tree to find
> the best score?
In sequence placement, PAGAN reconstructs the sequence history and
infers the ancestral sequences at the internal nodes of the reference
phylogeny. When searching for the place for a query sequence, PAGAN
aligns the query against the selected set of reference sequences.
In "guided" placement the selected set is indicated in the reference
tree using special NHX tags. In "unguided" placement, the selected set
is either all nodes (= internal+leaf nodes, using option
--test-every-node) or just internal nodes
(--test-every-internal-node).
To make the search faster, PAGAN uses Exonerate to help find the
potential targets. With the option --fast-placement, PAGAN first
calls the Exonerate *local alignment* and finds the best N hits (by
default N=5) between the query and the selected set of reference
alignment nodes; it then calls the Exonerate *exact alignment*,
keeping the best M hits (by default M=1).
If M is more than one (there's an option to change that), PAGAN makes
its own exhaustive alignments against all the M target nodes and
selects the best one. If M is zero, that query is (by default)
excluded. This can be changed with option --exhaustive-placement which
forces PAGAN to make exhaustive (and rather slow) alignments against
all target nodes and choose the best one.
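
As a rough sketch of the two-stage filtering described above (this is
not PAGAN's actual code; the function names and the score dictionaries
are made up for illustration, and higher scores are assumed to be
better):

```python
import heapq


def select_targets(local_scores, exact_scores, n=5, m=1):
    """Keep the N best local hits, then the M best of those.

    Both arguments are hypothetical dicts mapping a node name to a
    score; a node missing from a dict produced no hit at that stage.
    """
    # Stage 1: the N nodes with the highest local-alignment score.
    stage1 = heapq.nlargest(n, local_scores, key=local_scores.get)
    # Stage 2: re-score the survivors and keep the best M.
    scored = [node for node in stage1 if node in exact_scores]
    return heapq.nlargest(m, scored, key=exact_scores.get)


def nodes_to_align(targets, all_nodes, exhaustive=False):
    """Decide which nodes get a full alignment for one query."""
    if targets:
        return list(targets)
    # No hit: the query is skipped unless exhaustive placement is on.
    return list(all_nodes) if exhaustive else []
```

With n=5 and m=1 a query gets at most one full alignment; an empty
list from nodes_to_align() corresponds to an excluded query.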
If --fast-placement (or --use-exonerate-reads-local and
--use-exonerate-reads-gapped; those allow running the two modes
separately) is not used, PAGAN doesn't use Exonerate and performs an
exhaustive search against all target nodes. That can be very slow.
One should also note that PAGAN can place one query sequence at
multiple locations if they score equally well. This makes sense in
some analyses but may cause problems in others. This behaviour can be
disabled with option --one-placement-only.
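
The difference between the two behaviours can be illustrated like this
(a toy example, not PAGAN code; the function and its arguments are
hypothetical, and how a single node is chosen among ties is my
arbitrary choice here):

```python
def best_nodes(scores, one_placement_only=False):
    """Return the target node(s) sharing the top score.

    scores is a hypothetical dict of node name -> alignment score.
    """
    if not scores:
        return []
    top = max(scores.values())
    tied = [node for node, score in scores.items() if score == top]
    # Keeping the first tied node is an arbitrary rule for this sketch.
    return tied[:1] if one_placement_only else tied
```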
> Does PAGAN create 'profiles' (like an HMM) for all the internal nodes
> to which the reads are aligned to?
PAGAN is based on partial-order graphs and represents every sequence
(extant sequences at leaf nodes and inferred sequences at internal
nodes) using the same graph representation, each sequence being
represented by a graph of its own. PAGAN doesn't lose information by
compressing the data into a consensus profile.
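
To give an idea of what a graph can keep that a single consensus
string cannot, here is a heavily simplified toy graph (the Site class
and the spell() function are invented for this example and are far
simpler than PAGAN's real data structures):

```python
from dataclasses import dataclass, field


@dataclass
class Site:
    char: str                                  # character at this site
    next: list = field(default_factory=list)   # indices of successor sites


def spell(graph, start=0):
    """List every sequence readable from `start` to an end site."""
    site = graph[start]
    if not site.next:
        return [site.char]
    return [site.char + tail
            for nxt in site.next
            for tail in spell(graph, nxt)]


# Site 1 ("C") is optional: site 0 links both to it and past it.
graph = [Site("A", [1, 2]), Site("C", [2]), Site("G")]
print(spell(graph))   # ['ACG', 'AG']
```

Both the path with the extra site and the path without it stay
available for later alignments.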
> And are reads pairwise aligned to the extant terminal nodes?
Yes, if option --test-every-node is used.
> Also, how are mismatched residues between best node match and read
> handled: is a gap added or is the read's base deleted?
PAGAN first finds the locations where to place the query sequences and
then starts adding them progressively, one target node at a time. If
many query sequences map to one target node, they are aligned there
one after another, each alignment creating a new node to represent the
alignment of the reference node and the query sequence. PAGAN takes
into account and correctly aligns sites that are shared by the query
sequences but not present in the reference alignments and shows those
sites as insertions. PAGAN *never* deletes sites in the query
sequences unless option --trim-read-ends is selected (to clean up the
low quality ends of NGS reads; even then insertions are shown as
insertions).
> Also, can you use the output of a PAGAN alignment directly with a
> placement program like pplacer or EPA?
Currently PAGAN only accepts FASTA as input and writes FASTA as
output. I have code (in PRANK) to support some other formats and can
easily add those if needed. FASTA can also be trivially converted to
other formats with external tools.
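
If you prefer not to depend on an external tool, a conversion is only
a few lines. The sketch below turns an aligned FASTA file into
sequential relaxed PHYLIP; the function names are mine, and you should
check which format your placement program expects:

```python
def read_fasta(text):
    """Parse FASTA text into a list of (name, sequence) pairs."""
    records, name, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            # Use the first word of the header as the sequence name.
            name, chunks = (line[1:].split() or [""])[0], []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def to_relaxed_phylip(records):
    """Format aligned records as sequential relaxed PHYLIP text."""
    lengths = {len(seq) for _, seq in records}
    if len(lengths) != 1:
        raise ValueError("sequences must be aligned to equal length")
    width = max(len(name) for name, _ in records)
    lines = [f"{len(records)} {lengths.pop()}"]
    lines += [f"{name.ljust(width)}  {seq}" for name, seq in records]
    return "\n".join(lines) + "\n"
```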
> My ideal phylogeny aware alignment program for taxonomic placements
> would be able to handle long query sequence reads that overhang on
> either end of my reference alignment without penalty.
PAGAN does have a low penalty for terminal gaps, but it should work fine.
> It would also give me the option of converting my nucleotide sequences
> to amino acids, conducting the alignment in amino acid space, and then
> convert back to nucleotides for the final output of the alignment
> which I could then use in a placement program.
PAGAN can place DNA query sequences to DNA reference alignments and
protein query sequences to protein reference alignments. It does not
currently support translation and backtranslation (although that is
doable and my other program PRANK has that feature).
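
For reference, the idea itself fits in a few lines of Python. This is
only an illustration and is not how PRANK implements it: the function
names are mine, the standard genetic code is assumed, the reads are
assumed to be error-free and in frame 0, and the aligned protein row is
assumed to come from whatever protein aligner is used in between:

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate in frame 0; unknown codons become 'X'."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def back_translate(aligned_protein, dna):
    """Thread the original codons onto an aligned protein row."""
    codons = [dna[i:i + 3] for i in range(0, len(dna) - 2, 3)]
    residues = sum(ch != "-" for ch in aligned_protein)
    if residues != len(codons):
        raise ValueError("protein row and DNA disagree in length")
    it = iter(codons)
    # Each protein gap becomes a three-column gap in the DNA row.
    return "".join("---" if ch == "-" else next(it)
                   for ch in aligned_protein)
```

For example, translate("ATGAAATTT") gives "MKF", and threading the
codons onto the aligned row "MK-F" gives "ATGAAA---TTT".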
Before implementing that feature in PAGAN, I would need to be
convinced that it's really worth it. First, one rarely has perfect DNA
query sequences that translate to protein in the correct frame. Most
data come from NGS and contain errors; they are also often short and
may contain multiple open reading frames. Second, when the sequence
aligner reconstructs the ancestral sequences, it has much more
information available to make a good DNA alignment; the amount of
information can further be increased by a denser sampling of
(relevant) reference sequences. Unlike profile-HMM methods, PAGAN
improves its performance with a larger and denser reference alignment.
Third, DNA sequences contain more information than amino-acid
sequences and in ideal cases are aligned more accurately.
> I'm concerned about artificial frame shift mutations, and that I am
> trying to force alignments of non-conserved regions in nucleotide
> space.
> Aligning in the AA space gives me the security that I am aligning
> truly conserved regions, then being able to back translate to the NA
> sequence for the placement program would give me a finer scale
> resolution for placement of a query sequence read among sister taxa
> and closely related groups.
I believe that non-conserved regions have partially evolved through
insertions and deletions (multiple events getting the sequence back to
the reading frame). If that's true, the sequences can't be correctly
aligned (or analysed by other methods) in amino-acid space.
> I know the program TranslatorX enables this conversion to AA space,
> then back conversion to NA sequence. The TranslatorX webserver even
> has the option of running your alignment in Prank. I think it would
> be absolutely wonderful to do the same with PAGAN.
I will think about that. I believe that the correct way would be to
allow (and correct) for reading frame errors but that is very
challenging. Supporting "just" the translation/back-translation would
be much easier.
Regards, Ari