mothur output file is too large

Duan Copeland

Nov 2, 2022, 4:48:22 PM11/2/22
to VSEARCH Forum
Hello, I am running a meta-analysis using a closed-reference approach.

I used mothur to combine my fastq files into a fasta file with the make.contigs command, then I used this file as the input for vsearch.
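
(For reference, the mothur step was roughly of this form; the input file name and processor count below are placeholders rather than my exact call:)

mothur "#make.contigs(file=stability.files, processors=16)"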

Here are the commands I'm using:

./vsearch --fastx_uniques mothur.fasta --sizein --sizeout --fasta_width 0 --uc all.derep.uc --fastaout new.mothur.fasta

vsearch v2.22.1_linux_x86_64, 376.5GB RAM, 96 cores
https://github.com/torognes/vsearch

Dereplicating file newmeta.trim.contigs.good.renamed.fasta 100%
89035602982 nt in 219831151 seqs, min 35, max 625, avg 405
Sorting 100%
86773373 unique sequences, avg cluster 2.5, median 1, max 527191
Writing FASTA output file 100%
Writing uc file, first part 100%
Writing uc file, second part 100%

./vsearch --usearch_global new.mothur.fasta --db sintax___BEEx_FL-TS.fa --id 0.97 --strand both --sizein --sizeout --uc 97new-hits.uc --notmatched 97new-miss.fasta --dbmatched 97new.otus.fasta --biomout 97new.biom --mothur_shared_out newmeta.original.shared

The problem I am having is that my shared file from the mothur output is 533 GB. I am wondering if there's a way to merge the matched hits at the species level before writing the shared file? I believe this would greatly reduce the size of the file.

Colin Brislawn

Nov 2, 2022, 10:50:01 PM11/2/22
to VSEARCH Forum
Hello Duan,

>The problem I am having is that my shared file from the mothur output is 533GB.

Wow, that's a big data set! What a good problem to have!

Is the 533GB file in question the input (mothur.fasta) to the `vsearch --fastx_uniques` command or its output (new.mothur.fasta)? There are a couple of different ways you could break this process into chunks to get it done in less RAM, and I want to better understand your input and output data first.
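
For example, one rough way to split the input FASTA into chunks would be something like this (chunk count and file names are purely illustrative, not a tested recipe):

awk -v n=8 '/^>/{f=sprintf("chunk%02d.fasta", ++i % n)} {print > f}' mothur.fasta

Each chunk could then be dereplicated on its own, and the per-chunk results concatenated and passed through --fastx_uniques --sizein --sizeout once more to merge the abundances.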

Torbjørn Rognes

Nov 3, 2022, 4:20:15 AM11/3/22
to VSEARCH Forum
It probably won't solve your problem, but with such large datasets the new "--derep_smallmem" command in VSEARCH version 2.22 may be helpful for dereplicating datasets using much less memory.
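
A minimal invocation, mirroring the --fastx_uniques command above, would be roughly as follows (the extra options are simply carried over from that call and may need adjusting; please check the v2.22 manual for the output options --derep_smallmem supports):

./vsearch --derep_smallmem mothur.fasta --sizein --sizeout --fasta_width 0 --fastaout new.mothur.fasta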

Duan Copeland

Nov 7, 2022, 12:32:52 PM11/7/22
to VSEARCH Forum
Hello, the --mothur_shared_out newmeta.original.shared is the 533 GB file from my --usearch_global command.


I am dereplicating "mothur.fasta" with --fastx_uniques and outputting to new.mothur.fasta.

Then I use new.mothur.fasta with --usearch_global to match bacteria to my database at 97% sequence similarity.

I tried the --derep_smallmem option, but unfortunately my file sizes were the same.

Colin Brislawn

Nov 9, 2022, 9:34:31 AM11/9/22
to VSEARCH Forum
>--mothur_shared_out newmeta.original.shared is the 533 GB file from my  --usearch_global command
Got it. While you could split up the inputs to --usearch_global, I'm not sure how best to merge the outputs of --mothur_shared_out later on.

>I tried the --derep_smallmem, but unfortunately my file sizes were the same.
This makes sense, as the derep results should be the same regardless of the algorithm used at this step.

So the new pipeline might look like this:
1. vsearch --fastx_uniques
2. vsearch --<make that fasta file smaller somehow, maybe by merging at species>
3. vsearch --usearch_global

There is no good way to cluster reads in a fasta file by species because the reads don't represent the species concept. But you can still cluster / denoise these reads to reduce file size before this next step!

For step 2, I would recommend one of these options:

Cluster reads into 99% OTUs:
2. vsearch --cluster_size new.mothur.fasta --id 0.99 --sizein --sizeout --centroids otu99.mothur.fasta

Denoise reads into ASVs:
2. vsearch --cluster_unoise new.mothur.fasta --sizein --sizeout --centroids asv.mothur.fasta

In both examples, similar reads are merged so your output file is smaller. You can add other options like --threads and --fasta_width 0, as you have done before.
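
Putting it together, the new pipeline could look roughly like this (just a sketch reusing your existing file names and options; the thread count and the 99% threshold are placeholders to adjust):

./vsearch --cluster_size new.mothur.fasta --id 0.99 --sizein --sizeout --threads 32 --fasta_width 0 --centroids otu99.mothur.fasta
./vsearch --usearch_global otu99.mothur.fasta --db sintax___BEEx_FL-TS.fa --id 0.97 --strand both --sizein --sizeout --uc 97new-hits.uc --notmatched 97new-miss.fasta --dbmatched 97new.otus.fasta --mothur_shared_out newmeta.original.shared

With far fewer (but abundance-annotated) query sequences going into --usearch_global, the shared file it writes should be much smaller.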
