I believe MUSCLE relies on a distance matrix, which means you'd need
70k x 70k x 8 bytes of memory to represent that matrix (~40 GB). I'm not
sure what the runtime looks like, but it is likely not friendly. I'm not
familiar with the UNITE database. Is there a template that you can
align against?
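
As a rough check of that estimate, here is a minimal Python sketch. It assumes a dense square matrix of 8-byte values, which may not match what MUSCLE actually allocates; the function name is made up for illustration.

```python
def distance_matrix_bytes(n_sequences, bytes_per_value=8):
    """Bytes needed for a dense n x n matrix of fixed-width values."""
    return n_sequences * n_sequences * bytes_per_value


n = 70_000  # approximate number of sequences mentioned above
total = distance_matrix_bytes(n)

print(f"{total / 1e9:.1f} GB (decimal)")
print(f"{total / 2**30:.1f} GiB (binary)")
```

Because the cost grows with the square of the number of sequences, halving the input would cut the estimate to roughly a quarter.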
-Daniel
I'm not immediately familiar with the runtime characteristics of
MUSCLE and what would make one set of sequences run faster than
another set. I suspect it is likely tied to the level of divergence
within the sequence set as that can lead to inflation. MUSCLE should
bail if it runs into a problem, so it is doubtful that killing the
process and checking output will yield any information other than that
it is still trying to align. I'm not sure if I can advise you as to
whether continuing the MUSCLE run is worthwhile as that is going to be
dependent on what you want out of the alignment. Why did you choose to
use MUSCLE?
A template is a set of highly trusted and aligned sequences. PyNAST
can use a template and align sequences against the template if one is
available. I'm not sure how well PyNAST would work with ITS sequences.
-Daniel
Hi there!
I am also using UNITE and QIIME to analyse fungal ITS regions.
I didn't manage to make it work using the default alignment method, PyNAST, apparently because of format problems.
I just launched an alignment with the MUSCLE method as you proposed above, and initially it worked; we'll see if it finishes...
Pablo.
On Monday, March 5, 2012 22:01:14 UTC+1, Daniel McDonald wrote:
Hey Andrew,
I'm not immediately familiar with the runtime characteristics of
MUSCLE and what would make one set of sequences run faster than
another set. I suspect it is likely tied to the level of divergence
within the sequence set as that can lead to inflation. MUSCLE should
bail if it runs into a problem, so it is doubtful that killing the
process and checking output will yield any information other than that
it is still trying to align. I'm not sure if I can advise you as to
whether continuing the MUSCLE run is worthwhile as that is going to be
dependent on what you want out of the alignment. Why did you choose to
use MUSCLE?
A template is a set of highly trusted and aligned sequences. PyNAST
can use a template and align sequences against the template if one is
available. I'm not sure how well PyNAST would work with ITS sequences.
-Daniel
Ok, so I have attempted to run UCHIME on my ITS fungal data. Could you please check this script and let me know if you notice any errors with it? If I disabled the correct steps, should my abundance_sorted.fasta file not be empty (I don't want to 'denoise' twice)? Mine is 1.4 MB. The denoised_seqs_inflated_otus.log file shows the following:
UsearchOtuPicker parameters:
Application:usearch
abundance_skew:2
chimeras_retention:union
cluster_size_filtering:False
count_start:0
db_filepath:/Users/serena/Documents/Academia/B_Soi_2nd_run/FUNGI/Usearch_inputs/unite_ref_seqs_21nov2011.fasta
de_novo_chimera_detection:False
global_alignment:True
label_prefix:
label_suffix:
maxrejects:500
minlen:64
minsize:4
output_dir:usearch_qf_results/
perc_id_blast:0.97
percent_id:0.97
percent_id_err:0.97
reference_chimera_detection:True
remove_usearch_logs:False
retain_label_as_comment:False
rev:True
save_intermediate_files:True
sizein:True
sizeout:True
slots:16769023
verbose:False
w:64
Num OTUs:3570
Num failures:7941
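
To compare several runs side by side (reference-based vs de novo, union vs intersection), the log can be read into a dictionary. This is only a sketch that assumes every line follows the plain `key:value` layout shown above; the function name is hypothetical and the sample text is a shortened copy of the log.

```python
def parse_otu_picker_log(text):
    """Turn 'key:value' lines into a dict; lines without a colon are skipped."""
    params = {}
    for line in text.splitlines():
        line = line.strip()
        if ":" not in line:
            continue
        # Split only on the first colon so values containing ':' stay intact.
        key, value = line.split(":", 1)
        params[key.strip()] = value.strip()
    return params


# Shortened copy of the log above, used as sample input.
log_text = """\
UsearchOtuPicker parameters:
Application:usearch
chimeras_retention:union
de_novo_chimera_detection:False
reference_chimera_detection:True
Num OTUs:3570
Num failures:7941
"""

params = parse_otu_picker_log(log_text)
num_otus = int(params["Num OTUs"])
num_failures = int(params["Num failures"])
print(f"OTUs: {num_otus}, failures: {num_failures}")
```

Running the same function over the log from each run makes it easy to see which settings changed alongside the OTU and failure counts.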
Does this indicate that de novo would be better, given the number of failures? What does this failure value mean? Should I be experimenting with the abundance skew value?
Also, I understand that de novo chimera checking is computationally more demanding, but is it more accurate to run de novo? Presumably this depends on how good the reference dataset is? I ran de novo using this command (which only took 20 minutes on ~350k sequences):
nohup macqiime pick_otus.py -i denoised_seqs_inflated.fna -m usearch -o usearch_qf_results/ --word_length 64 --cluster_size_filtering -x
And although the denoised_seqs_inflated_otus.txt files are very similar (7 OTUs difference) between de novo and reference-based, the number of failures is almost 4x as high. The OTU count is very similar to the value I obtained with the UClust pick OTU option. I just expected the de novo and ref-based techniques to differ more.
Passing the -F option will involve running de novo and ref-based chimera removal at the same time. Is this right? Do you have any recommendations as to whether to use Union or Intersection? Again, I have run both options and the results are similar in terms of OTUs returned, but with obviously many more failures under the intersection option.
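
To illustrate why intersection would be expected to discard more sequences than union, here is a small Python sketch using sets. It rests on an assumption about how the retention option behaves (union keeps sequences that either check considers non-chimeric, intersection keeps only those that both checks consider non-chimeric); it is not QIIME's implementation, and the sequence IDs and function name are invented.

```python
# Hypothetical sequence IDs that each check considered non-chimeric.
de_novo_non_chimeras = {"seq1", "seq2", "seq3", "seq5"}
reference_non_chimeras = {"seq2", "seq3", "seq4", "seq5"}


def retain(de_novo_ok, reference_ok, mode):
    """Combine two sets of non-chimeric IDs under the assumed retention rule."""
    if mode == "union":
        return de_novo_ok | reference_ok  # passed at least one check
    if mode == "intersection":
        return de_novo_ok & reference_ok  # passed both checks
    raise ValueError(f"unknown mode: {mode}")


kept_union = retain(de_novo_non_chimeras, reference_non_chimeras, "union")
kept_intersection = retain(de_novo_non_chimeras, reference_non_chimeras, "intersection")

print(sorted(kept_union))
print(sorted(kept_intersection))
```

Under that assumption the intersection can never be larger than the union, which would be consistent with seeing more failures under the intersection option.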
Ideally I would like to see how many chimeras are being flagged for removal using this step. Which file tells me this? Is UCHIME more stringent than ChimeraSlayer?
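
One way I could estimate this myself, assuming the intermediate FASTA files from before and after the chimera step are kept (save_intermediate_files is True in the log), is to count the records in each and take the difference. The sketch below is generic Python; the file names are placeholders written to a temporary directory, not actual QIIME output names.

```python
import os
import tempfile


def count_fasta_records(path):
    """Count FASTA records by counting header lines that start with '>'."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


def removed_between(before_path, after_path):
    """Number of records present before a filtering step but not after it."""
    return count_fasta_records(before_path) - count_fasta_records(after_path)


# Demo on two tiny throwaway files (placeholder names and sequences).
with tempfile.TemporaryDirectory() as tmp:
    before = os.path.join(tmp, "before_filter.fasta")
    after = os.path.join(tmp, "after_filter.fasta")
    with open(before, "w") as fh:
        fh.write(">a\nACGT\n>b\nGGCC\n>c\nTTAA\n")
    with open(after, "w") as fh:
        fh.write(">a\nACGT\n>c\nTTAA\n")
    n_removed = removed_between(before, after)

print(f"records removed: {n_removed}")
```

If the files hold dereplicated sequences with size annotations, this counts unique sequences rather than reads.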
Many thanks
Serena