vsearch --derep_full lib_clean.fasta --output derep.fasta --log=log --sizeout --minuniquesize 2
2) Get chimeras
vsearch --uchime_denovo derep.fasta --chimeras chimera.out.fasta --nonchimeras non.chimera.out.fasta
The chimera job has been running for 10+ hours now.
3) I'll filter them out using the filter_fasta.py script.
4) How do I proceed with pick_otus.py using vsearch (.uc, centroid steps)?
--
Please help and guide me.
Best,
Sanjeev
- dereplicate using VSEARCH
- cluster them at 97% using VSEARCH
- remove chimera using VSEARCH
- pick representative sequences using QIIME
Hi Colin,
Thanks for your reply, again. It's embarrassing that I've not worked with .biom before. I'd need your patience here.
I've always worked with this process in QIIME: remove chimeras, cluster, pick representative seqs, and classify them using our in-house training set.
What I've done on small data is:
1- Dereplicate
vsearch --derep_full small_test.fasta --output small_derep.fasta --log=log --sizeout --minuniquesize 2
(getting rid of singletons)
2- Cluster them
vsearch -cluster_fast small_derep.fasta -id 0.97 -uc results.uc -minsize 2 -consout otus.fa
(getting rid of singletons)
3- Some step:
vsearch -usearch_global small_derep.fasta -db otus.fa -strand plus -id 0.97 -uc no.uc
To be honest, I don't know what I'm doing. I have the following doubts:
A) I'm getting rid of singletons in both step 1 and step 2. Is that redundant, or is it advised?
B) In step 2, I hope -consout otus.fa is the correct and required file. Please correct me if I'm wrong.
C) In step 2 I have -uc results.uc, and in step 3 I again have -uc no.uc.
Which one is correct, and what is what? (uc - uclust tool)
D) What is step 3 for?
E) Is step 3 necessary?
F) How do I use the .biom file further, and how do I generate representative sequences to get back to classification?
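On question C, it may help to look at what a .uc file actually contains. The sketch below assumes the usual 10-column, tab-separated UC layout (record type in column 1, query label in column 9, target label in column 10) and simply tallies the record types. As far as I understand it, a clustering run writes S (centroid), H (cluster member) and C (cluster summary) records, while a usearch_global run writes H (hit) and N (no hit) records. The sample records and the function name are made up for illustration and are not taken from any real run.

```python
from collections import Counter


def tally_uc_records(uc_text):
    """Count record types (first column) in the text of a .uc file."""
    counts = Counter()
    for line in uc_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        record_type = line.split("\t")[0]
        counts[record_type] += 1
    return counts


# Hypothetical records with 10 tab-separated columns each:
# type, cluster, length or size, %identity, strand, -, -, alignment, query, target
clustering_uc = "\n".join([
    "S\t0\t253\t*\t*\t*\t*\t*\tseqA;size=10;\t*",
    "H\t0\t253\t98.4\t+\t0\t0\t253M\tseqB;size=3;\tseqA;size=10;",
    "C\t0\t2\t*\t*\t*\t*\t*\tseqA;size=10;\t*",
])
search_uc = "\n".join([
    "H\t0\t253\t99.2\t+\t0\t0\t253M\tS1_1\tOTU_1",
    "N\t*\t*\t*\t.\t*\t*\t*\tS1_2\t*",
])

print(dict(tally_uc_records(clustering_uc)))
print(dict(tally_uc_records(search_uc)))
```

Read this way, the two -uc files are not alternatives: the one from clustering describes how the dereplicated sequences grouped into clusters, and the one from the search step describes which OTU each query sequence matched.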
My apologies for throwing out so many naive queries, but I've been failing to get my head around the multiple files and steps.
I appreciate your time and replies. :)
Best,
Sanjeev
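Hi Colin,
Thank you for the detailed replies and for walking through the steps. :)
Just to be certain I'm understanding you completely, I'd like to tally the steps.
1) De-replicate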
vsearch --derep_full small_test.fasta --output small_derep.fasta --log=log --sizeout --minuniquesize 2
2) Cluster them at 97% and relabel; the centroid file is the one used later for the chimera check:
vsearch -cluster_fast small_derep.fasta -id 0.97 -uc results.uc --sizein --sizeout --relabel OTU_ --centroids otus.fna
3) Perform chimera check (de novo in my case):
vsearch --uchime_denovo otus.fna --nonchimeras otus_checked.fna --sizein --xsize --chimeras chimeras.fasta
4) Get the .uc file, which is a sort of OTU table:
vsearch -usearch_global small_test.fasta -db otus.fna -strand plus -id 0.97 -uc otu_table_mapping.uc
5) I used scripts from this link, with the function fixed as mentioned on the thread:
python drive5/mod_uc2otutab.py otu_table_mapping.uc > tabfile.tsv
6) Things went well so far. I didn't like ;size=N; concatenated with my OTUs, so another step:
sed -i -E 's/;size=[0-9]+;//g' tabfile.tsv
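For anyone who would rather do this clean-up inside a script, here is a minimal Python sketch of the same substitution. The table row is invented for illustration. Note that the pattern, like the sed expression, only matches when the annotation ends with a semicolon, so a label written as `OTU_2;size=5` (no trailing `;`) would be left untouched.

```python
import re

# Same pattern as the sed expression: ';size=' + digits + ';'
SIZE_ANNOTATION = re.compile(r";size=[0-9]+;")


def strip_size(text):
    """Remove every ';size=N;' annotation from the given text."""
    return SIZE_ANNOTATION.sub("", text)


# Hypothetical OTU table row: OTU label followed by per-sample counts
row = "OTU_1;size=120;\t7\t3"
print(strip_size(row))
```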
Questions:
A- I clustered after sorting by length, using the --cluster_fast flag.
How does one decide which way to go: clustering after sorting by length, or after sorting by abundance?
Or is it to each his/her own?
B- I don't use the results.uc generated in step 2.
- What is its use?
If it has no use, I'd get rid of that flag and its argument.
C- Just to get the entire process down with small data, I took the initial 2000 sequences from my big demultiplexed file.
-- 2000 input sequences
-- After de-replicating I'm left with 148: 7.4% of the original
-- otus.fna (after clustering, that is, my representative sequences): 75
-- After running the chimera check, I had 75 non-chimeric sequences. That means no chimeras. That is understandable; it might be because I grabbed so few sequences.
-- I'm using the non-chimeric ones for RDP classification and further analyses.
In other words, I have 3.75% (rep) reads of the original 2000 to work on with RDP.
Do these numbers look legit to you, or is something horrible going on? (Again, I took too small a sample.)
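A small sanity check on these numbers, as a sketch only. The 148 and 75 count unique sequences and centroids, not reads, so the percentages compare different things. Because --sizeout writes a `;size=N` abundance annotation into each header, summing those annotations gives the number of input reads still represented after dereplication. The function names and the two-record FASTA below are made up for illustration.

```python
import re


def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


def reads_represented(fasta_text):
    """Sum the ';size=N' annotations over all FASTA headers.

    A header without an annotation is counted as a single read.
    """
    total = 0
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            match = re.search(r";size=(\d+)", line)
            total += int(match.group(1)) if match else 1
    return total


# Figures quoted in the post
print(f"unique sequences kept: {percent(148, 2000):.2f}% of 2000")
print(f"OTU centroids:         {percent(75, 2000):.2f}% of 2000")

# Hypothetical dereplicated FASTA: 2 unique sequences standing for 13 reads
example = ">seqA;size=10;\nACGT\n>seqB;size=3;\nACGA\n"
print(reads_represented(example))
```

Running `reads_represented` on the real small_derep.fasta would show how many of the 2000 reads survive the --minuniquesize 2 filter, which is probably a more telling figure than 7.4%.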
D- In step 4:
vsearch -usearch_global small_test.fasta -db otus.fna -strand plus -id 0.97 -uc otu_table_mapping.uc
Shouldn't I be using otus_checked.fna instead of otus.fna?
The OTU table should be made from the reads which passed the chimera check and other filters.
--
Thanks again for your extensive support, and encouragement throughout this process.
Best,
Sanjeev
I have filtered the .biom file resulting from the pipeline, but now, I am not able to pick the representative sequences of the filtered OTU table by using the pick_rep_set.py script.
Maybe the problem is the OTU map. I am using the .uc file resulting from the pipeline and translated to .txt. Is that correct?
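In case it helps to compare against what I have: as far as I know, a QIIME 1 OTU map is a tab-separated text file with one line per OTU, the OTU ID first and then the IDs of the reads assigned to it. A .uc file has a different layout (one line per read), so renaming it to .txt would not be enough. The sketch below shows one possible conversion, assuming the standard UC columns (record type in column 1, query label in column 9, target label in column 10) and using only the H (hit) records. The function names and sample records are hypothetical.

```python
from collections import defaultdict


def uc_to_otu_map(uc_text):
    """Group query labels (column 9) by target label (column 10) for H records."""
    otu_map = defaultdict(list)
    for line in uc_text.splitlines():
        fields = line.split("\t")
        if len(fields) < 10 or fields[0] != "H":
            continue  # only hits assign a read to an OTU
        query, target = fields[8], fields[9]
        otu_map[target].append(query)
    return otu_map


def format_otu_map(otu_map):
    """One line per OTU: the OTU ID, then its member read IDs, tab-separated."""
    return "\n".join(
        "\t".join([otu] + reads) for otu, reads in sorted(otu_map.items())
    )


# Hypothetical usearch_global output: three hits and one unmatched read
sample_uc = "\n".join([
    "H\t0\t253\t99.2\t+\t0\t0\t253M\tS1_1\tOTU_1",
    "H\t0\t253\t98.0\t+\t0\t0\t253M\tS2_7\tOTU_1",
    "H\t1\t251\t97.6\t+\t0\t0\t251M\tS1_3\tOTU_2",
    "N\t*\t*\t*\t.\t*\t*\t*\tS2_9\t*",
])

print(format_otu_map(uc_to_otu_map(sample_uc)))
```

The read IDs in the resulting map would need to match the IDs in the FASTA file handed to pick_rep_set.py, and the OTU IDs would need to match those left in the filtered table.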
Also, how does vsearch behave if I try to map reads with lower identity to the 97% rep set, e.g., in case that a read is 95% similar to more than one OTU?
When I tried to explain the problem again for this answer
1- Dereplicate
vsearch --derep_full full_tagclean.fasta --output full_derep.fasta --log=vsearch_log --sizeout --minuniquesize 2 2>derep_log.txt
2- Cluster
vsearch -cluster_fast full_derep.fasta -id 0.97 --sizein --sizeout --relabel OTU_ --centroids otus.fna 2> cluster_log.txt
3- Chimera check
vsearch --uchime_denovo otus.fna --nonchimeras otus_checked.fna --sizein --chimeras chimeras.fasta --borderline border_line.fasta --xsize 2> chimera_log.txt
4- What does this aim for?
vsearch -usearch_global full_tagclean.fasta -db otus_checked.fna -strand plus -id 0.97 -uc otu_table_mapping.uc --xsize 2> usearch_global_log.txt
5- Get OTU table:
python drive5/mod_uc2otutab.py otu_table_mapping.uc > tabfile.tsv
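To make the purpose of steps 4 and 5 more concrete, here is a rough sketch of the idea as I understand it: step 4 assigns every read to its best-matching OTU, and step 5 counts those assignments per sample. This is not the drive5 script itself. It assumes QIIME-style read labels of the form `SampleID_readnumber` (the real script may extract the sample differently), the standard UC columns, and invented sample records; the `OTUId` header and all function names are illustrative.

```python
from collections import Counter, defaultdict


def sample_of(read_label):
    """Assumed label style 'SampleID_readnumber': text before the last underscore."""
    return read_label.rsplit("_", 1)[0]


def uc_to_otu_table(uc_text):
    """Count hits (H records) per OTU and per sample."""
    table = defaultdict(Counter)  # OTU -> sample -> read count
    for line in uc_text.splitlines():
        fields = line.split("\t")
        if len(fields) < 10 or fields[0] != "H":
            continue  # reads without a hit (N records) are not counted
        query, otu = fields[8], fields[9]
        table[otu][sample_of(query)] += 1
    return table


def format_table(table):
    """Render the counts as a tab-separated table, OTUs as rows, samples as columns."""
    samples = sorted({s for counts in table.values() for s in counts})
    lines = ["\t".join(["OTUId"] + samples)]
    for otu in sorted(table):
        lines.append("\t".join([otu] + [str(table[otu][s]) for s in samples]))
    return "\n".join(lines)


# Hypothetical mapping output for two samples, S1 and S2
sample_uc = "\n".join([
    "H\t0\t253\t99.2\t+\t0\t0\t253M\tS1_1\tOTU_1",
    "H\t0\t253\t98.8\t+\t0\t0\t253M\tS1_2\tOTU_1",
    "H\t0\t253\t97.6\t+\t0\t0\t253M\tS2_1\tOTU_1",
    "H\t1\t251\t99.6\t+\t0\t0\t251M\tS2_2\tOTU_2",
    "N\t*\t*\t*\t.\t*\t*\t*\tS2_3\t*",
])

print(format_table(uc_to_otu_table(sample_uc)))
```

Mapping the full read set (full_tagclean.fasta) rather than the dereplicated file is what lets the counts reflect every read, including the singletons dropped in step 1.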