MUSCLE Alignment with UNITE


Andrew Belus

Mar 5, 2012, 3:25:29 PM
to Qiime Forum
Hello,

I started an alignment last Thursday (3/1) afternoon with the
following command:

screen -S muscle_alignment align_seqs.py -i rep_set_uclust.fna -m
muscle -o muscle_alignment -t unite_public_22.07.11.fasta

I am working with about 70k sequences and running it on an m1.large
instance on EC2.

The problem is that the process still appears to be running. Should I
stop the instance and then check for errors in the log file (assuming
one is produced), or let it keep going?

Thanks for the help!

Andrew Belus

Daniel McDonald

Mar 5, 2012, 3:32:11 PM
to qiime...@googlegroups.com
Hey Andrew,

I believe MUSCLE relies on a distance matrix, which means you'd need
70k x 70k x 8 bytes of memory to represent that matrix (~40 GB). I'm not
sure what the runtime looks like, but it is likely not friendly. I'm not
familiar with the UNITE database; is there a template that you can
align against?
-Daniel
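
Daniel's back-of-the-envelope figure can be checked with a few lines of Python. This is only a rough sketch: it assumes a dense matrix of 8-byte values, as in his estimate, and the function names are made up for illustration. How MUSCLE actually stores its distances is not stated in this thread.

```python
def full_matrix_bytes(n_sequences: int, bytes_per_value: int = 8) -> int:
    """Bytes for a dense n x n matrix of pairwise distances."""
    return n_sequences * n_sequences * bytes_per_value


def half_matrix_bytes(n_sequences: int, bytes_per_value: int = 8) -> int:
    """Bytes if only the n*(n-1)/2 distinct pairs are stored (assumption)."""
    return n_sequences * (n_sequences - 1) // 2 * bytes_per_value


if __name__ == "__main__":
    n = 70_000  # approximate number of sequences in the original question
    for label, size in (("full", full_matrix_bytes(n)),
                        ("half", half_matrix_bytes(n))):
        print(f"{label}: {size / 1e9:.1f} GB ({size / 2**30:.1f} GiB)")
```

Even storing only one triangle of the matrix stays in the tens of gigabytes, so the conclusion does not depend much on the storage layout assumed here.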

Andrew Belus

Mar 5, 2012, 3:49:48 PM
to Qiime Forum
Daniel,

Thanks for the quick reply! I'm still very new to this, so forgive me
if I'm wrong, but I thought the database was the template to which
sequences are compared. Is there another file that I could include in
the command that would make the process more efficient?

A little more background... A grad student at my university has been
running similar alignments (~60K sequences, ITS data, MUSCLE, UNITE)
in a fraction of the time, usually a max of 4 hours. Is it likely that
there is an error in my instance that is causing it to take so much
longer, or is this something that I just have to be patient with? My
only concern is that I might be wasting the little amount of grant
money that I have budgeted for AWS.

Andrew

Daniel McDonald

Mar 5, 2012, 4:01:14 PM
to qiime...@googlegroups.com
Hey Andrew,

I'm not immediately familiar with the runtime characteristics of
MUSCLE and what would make one set of sequences run faster than
another set. I suspect it is likely tied to the level of divergence
within the sequence set as that can lead to inflation. MUSCLE should
bail if it runs into a problem, so it is doubtful that killing the
process and checking output will yield any information other than that
it is still trying to align. I'm not sure if I can advise you as to
whether continuing the MUSCLE run is worthwhile as that is going to be
dependent on what you want out of the alignment. Why did you choose to
use MUSCLE?

A template is a set of highly trusted and aligned sequences. PyNAST
can use a template and align sequences against the template if one is
available. I'm not sure how well PyNAST would work with ITS sequences.
-Daniel

priesgo

Mar 26, 2012, 7:18:35 PM
to qiime...@googlegroups.com
Hi there!

I am also using UNITE and QIIME to analyse fungal ITS regions.
I didn't manage to make it work using the default alignment method, PyNAST, apparently because of format problems.

I just launched an alignment with the muscle method as you proposed above and initially it worked; we'll see if it finishes...

Pablo.

Jeff Werner

Mar 27, 2012, 6:29:18 AM
to qiime...@googlegroups.com
Hi Pablo,

The muscle alignment option in QIIME does not use the template option; that is used exclusively by -m pynast (the default).  So, your command as written will ignore the -t option and build a de novo alignment based on your input sequences alone (which will take a long time).  If you want to align your ITS reads to a standard alignment frame in this UNITE reference alignment, you could use PyNAST.  Otherwise, if you really want to use muscle AND an existing alignment, you could run muscle from the command line without QIIME, using its option for adding sequences to an existing alignment: http://www.drive5.com/muscle/muscle_userguide3.8.html#_Toc260497021

However, this will NOT necessarily preserve the alignment frame from your reference UNITE alignment.  The output alignment from muscle, in that case, would be a combined alignment of your experimental sequences and the UNITE sequences, and it *may* have extra inserts that weren't present in the original UNITE alignment.  That means that, if you want to combine these seqs with other aligned seqs in the future, you'd have to use muscle again to merge the two alignments. 

Also, muscle will be much slower than PyNAST.  PyNAST is specifically made to run quickly and preserve the alignment frame of the reference. It does so by (1) finding the reference seq most similar to your query seq, (2) aligning the query seq to that one reference seq, and (3) forcing the query to be in-frame with the reference alignment, first by using the gaps that are available in the reference and then by deleting bases from the query if absolutely necessary.  So, PyNAST will work best if your query seqs are close matches to reference alignment seqs and the reference alignment includes some extra gaps at strategic locations to allow for the odd/extra insertions.
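
To make step (1) concrete, here is a toy sketch of picking the most similar reference for a query. It is not PyNAST's code: shared k-mers (Jaccard similarity) are used here purely as a stand-in for whatever search PyNAST performs, and the function names and the choice of k = 4 are assumptions for illustration. Steps (2) and (3) are not shown.

```python
from __future__ import annotations


def kmers(seq: str, k: int = 4) -> set[str]:
    """Set of k-mers in a sequence, ignoring alignment gaps."""
    seq = seq.replace("-", "").upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def closest_reference(query: str, references: dict[str, str], k: int = 4) -> str:
    """Name of the reference sharing the largest fraction of k-mers with the query."""
    query_kmers = kmers(query, k)

    def jaccard(ref_seq: str) -> float:
        ref_kmers = kmers(ref_seq, k)
        union = query_kmers | ref_kmers
        return len(query_kmers & ref_kmers) / len(union) if union else 0.0

    return max(references, key=lambda name: jaccard(references[name]))


if __name__ == "__main__":
    # Hypothetical aligned references; "-" marks gap columns.
    refs = {"refA": "ACGTACGTAC--GT", "refB": "TTTTGGGGCCCCAA"}
    print(closest_reference("ACGTACGTACGT", refs))
```

The point of the sketch is only that each query is matched to a single reference before any alignment happens, which is why the cost grows with the number of queries rather than with all pairs of queries.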

Cheers,
Jeff



Jeff Werner

Mar 27, 2012, 6:32:22 AM
to qiime...@googlegroups.com
Sorry, I addressed that to Pablo but I guess I was more specifically replying to Andrew... :)

Jeff Werner

Mar 27, 2012, 6:40:28 AM
to qiime...@googlegroups.com
Is your UNITE file a reference alignment?  I think UNITE is only a reference sequence database, and not an alignment.

kpeay

Mar 28, 2012, 9:02:12 AM
to Qiime Forum
Just a quick note of caution: fungal ITS sequences are not really
meaningful to align across large groups (i.e. UNITE or an environmental
dataset) because they are highly variable and chock full of indels.

Serena Thomson

May 1, 2012, 9:31:24 AM
to Qiime Forum
I am having a similar issue. I need to align my ITS fungal sequences
using PyNAST against an aligned reference dataset. Presumably UNITE
is therefore not suitable for this? How else can you align
fungal ITS data in order to remove chimeras and feed back into the
pipeline?

Serena



Greg Caporaso

May 1, 2012, 11:07:39 AM
to qiime...@googlegroups.com
There is not currently a good reference alignment for this, so instead
of using PyNAST you can use muscle for sequence alignment, and then
the entropy-based alignment filter. If you're using one of the
workflow scripts, the relevant values in your parameters file will be:

align_seqs:alignment_method muscle
filter_alignment:suppress_lane_mask_filter True
filter_alignment:entropy_threshold 0.10

The value of 0.10 for the entropy threshold has not been extensively
evaluated, so it might be worth experimenting with to see if/how it
affects your UniFrac PCoA plots.
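
For readers wondering what an entropy-based alignment filter does, here is a minimal sketch. It assumes the threshold means "drop that fraction of alignment columns with the highest Shannon entropy" and it counts gap characters like any other symbol. Both are assumptions made for illustration, and this is not the code of filter_alignment.py.

```python
import math
from collections import Counter


def column_entropy(column: str) -> float:
    """Shannon entropy (bits) of the characters in one alignment column."""
    total = len(column)
    counts = Counter(column)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def filter_high_entropy_columns(alignment: list, fraction: float = 0.10) -> list:
    """Remove the given fraction of columns with the highest entropy."""
    n_cols = len(alignment[0])
    entropies = [column_entropy("".join(seq[i] for seq in alignment))
                 for i in range(n_cols)]
    n_drop = int(n_cols * fraction)
    # Column indices ordered from most to least variable.
    by_entropy = sorted(range(n_cols), key=lambda i: entropies[i], reverse=True)
    drop = set(by_entropy[:n_drop])
    return ["".join(ch for i, ch in enumerate(seq) if i not in drop)
            for seq in alignment]


if __name__ == "__main__":
    # Toy alignment: only the last column varies between sequences.
    aln = ["ACGTACGTAA", "ACGTACGTAC", "ACGTACGTAG", "ACGTACGTAT"]
    print(filter_high_entropy_columns(aln, fraction=0.10))
```

Raising the fraction removes more of the noisy columns but also discards signal, which is why the threshold is worth experimenting with.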

Greg

Serena Thomson

May 1, 2012, 11:23:53 AM
to qiime...@googlegroups.com
Hi Greg

Thanks for getting back to me. So if I use MUSCLE to align my sequences against the UNITE dataset, I can then feed this output into 'identify_chimeric_seqs.py' as I would normally with PyNAST? Is PyNAST better than MUSCLE? Is one technique more suitable than the other? The reason I ask is that I have had issues with getting the alignments and reference datasets into the correct PyNAST format in the past (on other datasets), and MUSCLE as an option wasn't really explored. Once the chimera check has been run, I am just creating an OTU table, so the rest of the downstream analysis is less important.

Many thanks for your help.

Serena

Serena Thomson

May 2, 2012, 8:43:06 AM
to Qiime Forum
So would the script below be ok? Could you just confirm the
following:

If wanting to align ITS fungal data, the best option is to use
MUSCLE?

This will take much longer than PyNAST given the de novo approach, but
will result in a compatible file that can then be tested for chimeras
(i.e. using rep_set.aligned.fasta)?

If using PyNAST, a suitable PyNAST-aligned reference file is needed,
which UNITE is not (it is only a collection of reference sequences).
Therefore you cannot incorporate UNITE in the align_seqs step and need
to be prepared for a long processing time?

Would UNITE be utilised in the assign step only then?

Greg, I am not sure where the options are for altering the entropy
threshold or suppressing the lane mask filter. I am just entering the
script as follows (I don't think it is part of the workflow you
mentioned):

nohup align_seqs.py -i rep_set2.fna -m muscle -o muscle_alignment &

Did you get my questions about whether Muscle is better than PyNAST?

Many thanks

Serena


Greg Caporaso

May 2, 2012, 9:33:10 AM
to qiime...@googlegroups.com
Hi Serena,
ChimeraSlayer won't work with the muscle alignment, so I would
recommend skipping that step unfortunately. At this point we don't
have a good reference alignment for ITS, and that is required for
ChimeraSlayer.

As far as which alignment method is better, you'd have to define
"better". The goal here is to get a rough idea of the relationship
between OTUs for the purpose of computing UniFrac distances or PD of
the samples: in that case both generally do well, in that we haven't
noticed differences in the conclusions we'd draw from UniFrac/PD
results associated with choice of alignment method. PyNAST can easily
be run in parallel, so it is very convenient in these studies. Given
the difficulty of aligning ITS, it might also be worth computing
Bray-Curtis distances here, in case the tree turns out to be very bad.
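
Bray-Curtis is attractive here because it needs only the OTU counts per sample and no tree or alignment. A minimal sketch of the standard formula, sum of |a - b| divided by sum of (a + b), follows; the function name and the example counts are made up for illustration.

```python
def bray_curtis(counts_a: list, counts_b: list) -> float:
    """Bray-Curtis dissimilarity between two samples' OTU count vectors."""
    if len(counts_a) != len(counts_b):
        raise ValueError("both samples must cover the same OTUs")
    numerator = sum(abs(a - b) for a, b in zip(counts_a, counts_b))
    denominator = sum(counts_a) + sum(counts_b)
    # Two empty samples are treated as identical (a convention chosen here).
    return numerator / denominator if denominator else 0.0


if __name__ == "__main__":
    sample_1 = [10, 0, 5]  # hypothetical counts for three OTUs
    sample_2 = [5, 5, 5]
    print(round(bray_curtis(sample_1, sample_2), 3))
```

A value of 0 means the two samples have identical counts and 1 means they share no OTUs, regardless of how the OTUs are related phylogenetically.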

The relevant options would be:

align_seqs.py --alignment_method muscle ...
filter_alignment.py --suppress_lane_mask_filter --entropy_threshold 0.10 ...

Yes, you would only use UNITE in the taxonomy assignment step.

I hope this helps!

Greg

Serena Thomson

May 9, 2012, 12:39:30 PM
to qiime...@googlegroups.com
Hi Greg,

I'm just trying to think of other possible options and wondered whether UCHIME (standalone version) would be an option? I understand that this program is not currently available in QIIME, but might it be a workaround? I wouldn't want to run OTUpipe, given that denoise_wrapper seems to be a better denoiser.

Thanks

Serena 

Jose Carlos Clemente

May 9, 2012, 2:22:02 PM
to qiime...@googlegroups.com
Serena,

you can use uchime as described in this post:

http://groups.google.com/group/qiime-forum/browse_thread/thread/ad58d2862f4b54fd/fe0413cd72c6249e?lnk=gst&q=otupipe+uchime#fe0413cd72c6249e

check Tony's entry from March 19th, and the otupipe tutorial as he suggested.

Jose


Serena Thomson

May 9, 2012, 6:23:42 PM
to qiime...@googlegroups.com
Hi Jose

That thread states that OTUpipe is a pseudo-denoising step and therefore less effective. Having spoken to you about UCHIME before, I thought the only other option was UCHIME as a standalone application. Am I getting confused?

Serena 

Jose Carlos Clemente

May 9, 2012, 6:26:44 PM
to qiime...@googlegroups.com
Serena,

you can use denoise_wrapper first. OTUpipe includes uchime as one of
its steps, so you can just de-activate all other steps in OTUpipe and
use uchime with sequences denoised through Denoiser.

Jose


Serena Thomson

May 9, 2012, 6:36:20 PM
to qiime...@googlegroups.com
Ok thanks Jose. I'll give this a try on my fungal ITS data and then feed it back into the assign taxonomy and make OTU table scripts. Presumably this is what you would recommend for chimera checking this kind of data? I've just not come across anyone else trying this out in this situation.

Jose Carlos Clemente

May 9, 2012, 8:10:40 PM
to qiime...@googlegroups.com
Since I haven't worked with the type of data you have, I am reluctant
to recommend one method above others, but maybe you can give it a try
and let us know how it worked. Again, if others in the forum
(including users) have suggestions, I'm sure Serena would appreciate
them.

Jose


Serena Thomson

May 10, 2012, 10:39:28 AM
to qiime...@googlegroups.com
Hi Jose,

Ok, so I have attempted to run UCHIME on my ITS fungal data.  Could you please check this script and let me know if you notice any errors with it. If I disabled the correct steps, should my abundance_sorted.fasta file not be empty (I don't want to 'denoise' twice)? Mine is 1.4Mb. The denoised_seqs_inflated_otus.log file shows the following:


UsearchOtuPicker parameters:
Application:usearch
abundance_skew:2
chimeras_retention:union
cluster_size_filtering:False
count_start:0
db_filepath:/Users/serena/Documents/Academia/B_Soi_2nd_run/FUNGI/Usearch_inputs/unite_ref_seqs_21nov2011.fasta
de_novo_chimera_detection:False
global_alignment:True
label_prefix:
label_suffix:
maxrejects:500
minlen:64
minsize:4
output_dir:usearch_qf_results/
perc_id_blast:0.97
percent_id:0.97
percent_id_err:0.97
reference_chimera_detection:True
remove_usearch_logs:False
retain_label_as_comment:False
rev:True
save_intermediate_files:True
sizein:True
sizeout:True
slots:16769023
verbose:False
w:64
Num OTUs:3570
Num failures:7941


Is this indicating that de novo would be better, given the number of failures? What does this failure value mean? Should I be experimenting with the abundance skew value? 

Also I understand that de novo chimera checking is computationally more demanding but is it more accurate to run de novo? Presumably this depends on how good the reference dataset is? I ran de novo using this command (which only took 20 minutes on ~350k sequences):

nohup macqiime pick_otus.py -i denoised_seqs_inflated.fna -m usearch -o usearch_qf_results/ --word_length 64 --cluster_size_filtering -x


And although the denoised_seqs_inflated_otus.txt files are very similar (7 OTUs difference) between de novo and reference-based, the number of failures is almost 4x as high. The OTU count, however, is very similar to the value I achieved with the UCLUST pick OTU option. I just expected the de novo and ref-based techniques to differ more.

Passing the -F option will involve running de novo and ref-based chimera removal at the same time, is this right?  Do you have any recommendations as to whether to use Union or Intersection? Again, I have run both options and the results are similar in terms of OTUs returned but obviously much higher failures under the intersection option. 

Ideally I would like to see how many chimeras are being flagged for removal using this step. Which file tells me this? Is UCHIME more stringent than ChimeraSlayer?

Many thanks

Serena

priesgo

Jul 11, 2012, 9:45:50 AM
to qiime...@googlegroups.com
Hi,

I would like to ask a pretty straightforward question.
Has anybody ever aligned fungal ITS sequences against UNITE using muscle or any other aligner?

I did a first try myself with 12k seqs, which ran for more than a week until it was using almost 8 GB of memory and I killed it.
Speaking of which, the parameter --muscle_max_memory doesn't seem to work well, or at least I didn't know how to configure it properly.


As a second approach I applied stricter filtering and ran the denoising step, which I hadn't done before, and I now have 4.5K seqs. Muscle has already been running for 24h. We'll see!

And furthermore, is there any way to run it in parallel?


Regards,
Pablo.

Jeff Werner

Jul 11, 2012, 10:00:11 AM
to qiime...@googlegroups.com
Hi Pablo,

UNITE is a reference sequence database, but is not a reference alignment (the sequences in UNITE are not aligned). As far as I know, it is not very useful to try to align fungal ITS reads unless they are all very similar/closely related.

Cheers,
Jeff

Daniel McDonald

Jul 11, 2012, 10:03:20 AM
to qiime...@googlegroups.com
FWIW, I took the UNITE sequences plus a bunch of other ITS sequences
down to clusters at 55% similarity and still had a few hundred OTUs.
I'd have to agree with Jeff.
-Daniel

Jeff Werner

Jul 11, 2012, 10:03:30 AM
to qiime...@googlegroups.com
And, if you're using muscle through qiime, I think it's a de-novo sequence alignment, and does not use the reference alignment, even if you input one. So, that's related to your high memory usage.

Pablo Riesgo

Jul 12, 2012, 4:24:32 AM
to qiime...@googlegroups.com
Thanks!

Yes, you are right, I was mixing things up. Muscle does not use a reference sequence. I obtained my OTUs using UNITE and I am trying to multi-align these OTUs de novo.

I really don't know if the degree of similarity is greater for bacterial 16S sequences than for fungal ITS sequences, but my aim is the same as what is done with bacteria: obtaining the phylogenetic tree of the OTUs to study the variability of the population.

Does anybody know of a way to launch a parallel muscle de novo multiple alignment?


Regards,
Pablo.

Jeff Werner

Jul 12, 2012, 7:12:07 AM
to qiime...@googlegroups.com
Hi Pablo,

You might try using muscle directly through the command-line, rather than wrapping it in qiime. That will give you more flexibility for the command-line options. The regular command-line instructions for muscle are here:
http://www.drive5.com/muscle/downloads.htm

There are some specific approaches for large alignments:
http://www.drive5.com/muscle/muscle_userguide3.8.html#_Toc260497014

If your sequence set is too big for muscle, one approach is to clump together groups of similar sequences, align them within those smaller clumps, and then take the resulting multiple sequence alignments and merge them together (muscle can do that).
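
The clumping step might look roughly like the sketch below. It is a toy greedy grouping by shared k-mers, written only to illustrate the idea: the similarity measure, the 0.5 cutoff, and the function names are all assumptions, and it is not what muscle or USEARCH does internally. Each resulting clump would then be written to its own FASTA file and aligned separately.

```python
def kmer_set(seq: str, k: int = 4) -> set:
    """Set of k-mers in an unaligned sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def greedy_clumps(seqs: dict, min_shared: float = 0.5, k: int = 4) -> list:
    """Group sequence names; each joins the first clump whose seed it resembles."""
    clumps = []  # list of (seed k-mer set, [member names])
    for name, seq in seqs.items():
        current = kmer_set(seq, k)
        for seed, members in clumps:
            union = seed | current
            if union and len(seed & current) / len(union) >= min_shared:
                members.append(name)
                break
        else:
            # No existing clump is similar enough: start a new one.
            clumps.append((current, [name]))
    return [members for _, members in clumps]


if __name__ == "__main__":
    # Hypothetical sequences; "a" and "b" differ by one base.
    toy = {"a": "ACGTACGTACGT", "b": "ACGTACGTACGA", "c": "TTTTGGGGCCCC"}
    print(greedy_clumps(toy))
```

The trade-off is that sequences in different clumps are never compared directly, so the quality of the final merged alignment depends on how sensibly the clumps were formed.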

Also, make sure you have the 64-bit version of muscle. 

Good luck! Cheers -- Jeff

Pablo Riesgo

Jul 12, 2012, 9:09:33 AM
to qiime...@googlegroups.com
Thanks Jeff!

I'll try this. In a way I have already done that when selecting the OTUs, but I'll see if this can work.
I'll keep you updated.


Pablo.

Jeff Werner

Jul 12, 2012, 9:26:45 AM
to qiime...@googlegroups.com
Hi Pablo,

Clumping the reads in order to divide up the alignment problem is a little different from OTU picking.  The USEARCH manual
http://drive5.com/usearch/usearch_docs.html
has a section on the "UHIRE" option, which divides up very large sequence sets into several smaller fasta files that are easier to align one at a time with muscle, and then you can combine all the resultant multiple sequence alignments into one big alignment.  It takes a bit of personal care and time, and can give mixed results depending on the options you choose (I've used it in the past with large sets of related protein sequences), but it's one of the few ways out there to force an alignment out of a set of sequences that's too large to handle all at once.

Cheers,
Jeff

Pablo Riesgo

Jul 25, 2012, 5:15:29 AM
to qiime...@googlegroups.com
Hi all,

I finally got the alignment! It took more or less 3 days to execute.
Thanks for all your suggestions, but stricter filtering plus the denoising was enough. Previously I was filtering out 21.4% of the sequences, while now 33.9% are filtered out; in absolute numbers, the multiple alignment worked with 131444 sequences. The denoising was also important, I guess...


Thanks all!
Pablo.

manpreet

Aug 23, 2012, 12:37:57 AM
to qiime...@googlegroups.com
Hi guys,

Just popping in with previous experience with this that might be of use to anyone else facing this problem.

I too have an ITS dataset, with UNITE as the reference database. I tried using muscle for days to no avail... it would just keep trying and I would end up killing the process or the machine because it would hang.

Eventually I used MAFFT to align. It also builds a de novo alignment and seems to take less than an hour to run.  The phylogenetic tree looks alright and, after removing chimeras, most sequences clustered more or less meaningfully.  So I guess for some people this might be an alternative.

It does need to be installed, but there are super-easy instructions here: http://mafft.cbrc.jp/alignment/software/linux.html

Hopefully of use!

Manpreet

Lluvia Vargas

May 27, 2013, 12:32:52 PM
to qiime...@googlegroups.com
Hi Pablo!

I'm working with ITS sequences too, and I am having problems aligning sequences with MUSCLE. Can you help me with this? I am not really sure I understand why you did it.


Thanks, and I hope you can help with this.

Pablo Riesgo

May 28, 2013, 4:21:26 AM
to qiime...@googlegroups.com
Hi Lluvia,

I'll try to explain briefly why I did the multiple alignment with muscle.
If you take a look at http://qiime.org/tutorials/tutorial.html there are a series of steps:
  1. Cluster your sequences to define your OTUs
  2. Define a representative sequence for each OTU
  3. Align the representative sequence set (multiple alignment)
  4. ...more steps...
The desired output of step 3 is the distance between every pair of representative sequences, which you will use later to build your phylogenetic tree.
So for the alignment in step 3 you have three options: PyNAST, INFERNAL and MUSCLE. PyNAST uses a pre-aligned database of sequences. At the time I used QIIME, this database existed for 16S regions, but it didn't for ITS regions (you should check whether this is still the case). INFERNAL is specific to RNA. So... MUSCLE is all you've got: brute force.
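
To show why the alignment has to come first, here is a small sketch of one simple distance that can be read off two aligned sequences: the proportion of differing positions (p-distance), skipping columns where either sequence has a gap. This is only an illustration under those assumptions; the thread does not say which distance the tree-building step actually uses.

```python
def p_distance(aligned_a: str, aligned_b: str) -> float:
    """Fraction of mismatching positions among columns without gaps."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    mismatches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" or b == "-":
            continue  # a gap in either sequence: column is not comparable
        compared += 1
        if a != b:
            mismatches += 1
    return mismatches / compared if compared else 0.0


if __name__ == "__main__":
    # Two hypothetical rows taken from the same multiple alignment.
    print(p_distance("ACGT-A", "ACCTTA"))
```

Without a shared alignment, position 3 of one sequence has no defined counterpart in another, which is what makes the multiple alignment a prerequisite for the tree.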

Keep in mind that you will only need the multiple alignment to obtain the phylogenetic tree, but not for the taxonomic assignment.


Good luck!
Pablo.

PS: nice name Lluvia! I guess you know it is rain in Spanish

Lluvia Vargas

May 29, 2013, 2:47:19 PM
to qiime...@googlegroups.com
Thanks Pablo!!! It's working!! :)