File size too big for 32-bit version (17.4Gb)
Input file size 17G
Reads: 32084664
I need to get two things done: chimera identification and OTU picking.
I'd love to hear your suggestions.
Any help would be greatly appreciated.
vsearch v1.10.2_linux_x86_64, 63.0GB RAM, 32 cores
What I'm doing is:
1- dereplicate and remove singletons
vsearch --derep_fulllength lib_tagclean.fasta --output derep.fasta --log=log --sizeout --minuniquesize 2
2- run chimera check
vsearch --uchime_denovo derep.fasta --chimeras chimera.out.fasta --nonchimeras non.chimera.out.fasta
---------
3-
I don't know whether I should be doing this post-dereplication:
vsearch --sortbysize derep.fasta --output derep_sorted.fasta
4-
- Then run the chimera check on the dereplicated and sorted sequences from above?
vsearch --uchime_denovo derep_sorted.fasta --chimeras sorted.chimera.out.fasta --nonchimeras sorted.non.chimera.out.fasta
I do not know whether I should run step 3 after step 1 and only then do the chimera check.
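As a side note on what --sortbysize actually does: it orders records by their ;size=N; abundance annotation, largest first. A toy coreutils sketch of that ordering (file name and sequences are made up for illustration; use vsearch itself on real data):

```shell
# Toy dereplicated file with ;size=N; abundance annotations (made-up data)
printf '>seq1;size=5;\nACGT\n>seq2;size=120;\nGGCC\n>seq3;size=17;\nTTAA\n' > toy.fasta

# Pair header+sequence, key each pair on its size value,
# sort descending by abundance, then unfold back to FASTA
paste - - < toy.fasta \
  | awk -F'\t' '{ n = $1; sub(/.*;size=/, "", n); sub(/;.*/, "", n); print n "\t" $0 }' \
  | sort -k1,1nr \
  | cut -f2- \
  | tr '\t' '\n' > toy_sorted.fasta

cat toy_sorted.fasta   # seq2 (120) first, then seq3 (17), then seq1 (5)
```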
---
Thanks very much for all your support and guidance!
Not to flood the inbox, here is my full pipeline:
1)
vsearch --derep_fulllength small_test.fasta --output small_derep.fasta --log=log --sizeout --minuniquesize 2
2)
vsearch --cluster_fast small_derep.fasta --id 0.97 --sizein --sizeout --relabel OTU_ --centroids otus.fna
3)
vsearch --uchime_denovo otus.fna --nonchimeras otus_checked.fna --sizein --xsize --chimeras chimeras.fasta
4)
vsearch --usearch_global small_test.fasta --db otus_checked.fna --strand plus --id 0.97 --uc otu_table_mapping.uc
5)
python drive5/mod_uc2otutab.py otu_table_mapping.uc > tabfile.tsv
I downloaded this script (mod_uc2otutab.py) from Robert Edgar's website.
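For reference, the .uc mapping file is tab-separated; 'H' records are hits, with the query label in column 9 and the matched OTU in column 10. A minimal awk sketch of the kind of tabulation such a script performs (the labels and the sample-ID convention, everything before the first underscore, are assumptions for illustration; mod_uc2otutab.py's actual parsing may differ):

```shell
# Build a tiny fake .uc mapping file (labels are made up for illustration)
{
  printf 'H\t0\t253\t99.2\t+\t0\t0\t253M\tsampleA_read1\tOTU_1\n'
  printf 'H\t0\t253\t98.8\t+\t0\t0\t253M\tsampleA_read2\tOTU_1\n'
  printf 'H\t1\t250\t97.5\t+\t0\t0\t250M\tsampleB_read1\tOTU_2\n'
  printf 'N\t*\t*\t*\t*\t*\t*\t*\tsampleB_read2\t*\n'
} > example.uc

# Count reads per (sample, OTU): only 'H' (hit) records contribute
awk -F'\t' '$1 == "H" {
    split($9, q, "_")               # sample ID = label prefix before "_" (assumed convention)
    count[q[1] "\t" $10]++
} END {
    for (k in count) print k "\t" count[k]
}' example.uc | sort > uc_counts.tsv

cat uc_counts.tsv
```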
6)
sed -i -E 's/;size=[0-9]+;//g' tabfile.tsv
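To see what that sed call does on a single label (the OTU name and count are made up): it strips the ;size=N; abundance annotations so downstream tools see clean OTU IDs.

```shell
# A relabeled centroid ID as produced with --relabel OTU_ and --sizeout (made-up size)
printf 'OTU_1;size=1234;\n' | sed -E 's/;size=[0-9]+;//g'
# prints: OTU_1
```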
I discussed this at length with a QIIME developer (link to the respective thread).
Hope this is helpful.
Cheers!
I might be wrong, but I don't see why we need to map the non-chimera OTU list back to the original file if, with this pipeline, I get everything I need (I think) from step 3.
Sequence labels must have sample identifiers (input set) and OTU identifiers (database) as explained later in this page. This means that you cannot use the input file to cluster_otus for this step because several samples often have the same unique sequence, so the dereplicated (unique) sequence labels either do not have a sample identifier, or have a misleading sample identifier because the same sequence may be found in other samples. The way to deal with this is usually to go back to the "raw" reads after merging or truncating to a fixed length.
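One common way to satisfy that requirement is to prefix each raw read header with its sample ID before pooling. A small sketch (file names, sample names, and sequences are all hypothetical):

```shell
# Two toy per-sample read files (contents made up)
printf '>read1\nACGT\n' > sampleA.fasta
printf '>read1\nGGCC\n' > sampleB.fasta

# Prefix every header with the sample name so the per-sample origin
# survives pooling into one file for the mapping step
for f in sampleA.fasta sampleB.fasta; do
  s=${f%.fasta}
  awk -v s="$s" '/^>/ { print ">" s "_" substr($0, 2); next } { print }' "$f"
done > pooled.fasta

cat pooled.fasta
```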