Hello,
I am trying to run dDocent-2.9.4 on a cluster computer with the RPE method, but after repeated attempts, I keep getting insanely huge output files for the second cutoff step in the process to reduce the dataset for cdhit. In my most recent attempt, you can see in the file list below that the uniq.k.2.c.2.seqs file got to 5.5TB(!!) before I canceled the job because clearly something is wrong.
The uniq.seqs file is quite large as well (93GB), and when I scrolled through it, I saw that although the counts column seemed accurate, seqs with more than 1 copy were still listed multiple times in uniq.seqs, when my understanding is that the seq should be listed only once, whatever the count. For example, one seq has a count of three, and in the uniq.seqs file, it is listed 3 times:
3 AGCACTCGATTTACATTTTCGCCGGCGGTGCACATCATTAAAAAGGCGGCGTGGCGACGCCTCATTAATCCCCTTCTGCGGCGACAATGGCGCCGCAAGAAGCGATCGACTCCATTCAAAAGCGCGTGTCACGCTTTTGCCNNNNNNNNNNGTCGCGACCGTTCGCGGGTGCACAACTGTTCTTTTATTGCGGTCGCGACCGGAACCTCTCCTCTCTCGCCTCTCAACTTGGTGTTCCTCGCATGCATCGAAGCAAAAGCGTGACACGCGCTTTTGAATGGAGTCGATCGCTTCTTGCGGCG
3 AGCACTCGATTTACATTTTCGCCGGCGGTGCACATCATTAAAAAGGCGGCGTGGCGACGCCTCATTAATCCCCTTCTGCGGCGACAATGGCGCCGCAAGAAGCGATCGACTCCATTCAAAAGCGCGTGTCACGCTTTTGCCNNNNNNNNNNGTCGCGACCGTTCGCGGGTGCACAACTGTTCTTTTATTGCGGTCGCGACCGGAACCTCTCCTCTCTCGCCTCTCAACTTGGTGTTCCTCGCATGCATCGAAGCAAAAGCGTGACACGCGCTTTTGAATGGAGTCGATCGCTTCTTGCGGCG
3 AGCACTCGATTTACATTTTCGCCGGCGGTGCACATCATTAAAAAGGCGGCGTGGCGACGCCTCATTAATCCCCTTCTGCGGCGACAATGGCGCCGCAAGAAGCGATCGACTCCATTCAAAAGCGCGTGTCACGCTTTTGCCNNNNNNNNNNGTCGCGACCGTTCGCGGGTGCACAACTGTTCTTTTATTGCGGTCGCGACCGGAACCTCTCCTCTCTCGCCTCTCAACTTGGTGTTCCTCGCATGCATCGAAGCAAAAGCGTGACACGCGCTTTTGAATGGAGTCGATCGCTTCTTGCGGCG
Shouldn't it only be listed one time? This applies to all multiples in that file, so it's not surprising the file is still large after the first data cutoff. Similarly, the uniq.k.2.c.2.seqs has lines repeated that I think should only be there once, for example, the following 4 lines are identical:
2 AACAAGAAAAAGCCAGCTGCGGCCGCAGCCGCCGCCGCCCCCGCGCCACCCTCTGACAGCGACAGCGACAAAGAGAGCGGCACCGAGAAAGACTCTGAGGACTCTGGCAGCTCGAACAAAGATCGGAAGAGCACACGTCTGNNNNNNNNNNTTGTTCGAGCTGCCAGAGTCCTCAGAGTCTTTCTCGGTGCCGCTCTCTTTGTCGCTGTCGCTGTCAGAGGGTGGCGCGGGGGCGGCGGCGGCTGCGGCCGCAGCTGGCTTTTTCTTGTTCTTGTCGTCCAGATCGGAAGAGCGTCGTGTAG
2 AACAAGAAAAAGCCAGCTGCGGCCGCAGCCGCCGCCGCCCCCGCGCCACCCTCTGACAGCGACAGCGACAAAGAGAGCGGCACCGAGAAAGACTCTGAGGACTCTGGCAGCTCGAACAAAGATCGGAAGAGCACACGTCTGNNNNNNNNNNTTGTTCGAGCTGCCAGAGTCCTCAGAGTCTTTCTCGGTGCCGCTCTCTTTGTCGCTGTCGCTGTCAGAGGGTGGCGCGGGGGCGGCGGCGGCTGCGGCCGCAGCTGGCTTTTTCTTGTTCTTGTCGTCCAGATCGGAAGAGCGTCGTGTAG
2 AACAAGAAAAAGCCAGCTGCGGCCGCAGCCGCCGCCGCCCCCGCGCCACCCTCTGACAGCGACAGCGACAAAGAGAGCGGCACCGAGAAAGACTCTGAGGACTCTGGCAGCTCGAACAAAGATCGGAAGAGCACACGTCTGNNNNNNNNNNTTGTTCGAGCTGCCAGAGTCCTCAGAGTCTTTCTCGGTGCCGCTCTCTTTGTCGCTGTCGCTGTCAGAGGGTGGCGCGGGGGCGGCGGCGGCTGCGGCCGCAGCTGGCTTTTTCTTGTTCTTGTCGTCCAGATCGGAAGAGCGTCGTGTAG
2 AACAAGAAAAAGCCAGCTGCGGCCGCAGCCGCCGCCGCCCCCGCGCCACCCTCTGACAGCGACAGCGACAAAGAGAGCGGCACCGAGAAAGACTCTGAGGACTCTGGCAGCTCGAACAAAGATCGGAAGAGCACACGTCTGNNNNNNNNNNTTGTTCGAGCTGCCAGAGTCCTCAGAGTCTTTCTCGGTGCCGCTCTCTTTGTCGCTGTCGCTGTCAGAGGGTGGCGCGGGGGCGGCGGCGGCTGCGGCCGCAGCTGGCTTTTTCTTGTTCTTGTCGTCCAGATCGGAAGAGCGTCGTGTAG
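To put a number on this duplication, a check along the following lines could be run on a sample of either file. This is only a minimal sketch: the function name count_dup_lines is made up for illustration and is not part of dDocent, and it assumes each record (count plus sequence) sits on a single line.

```shell
# Hypothetical helper, not part of dDocent.
# Prints how many distinct lines occur more than once in the given file.
# LC_ALL=C forces plain byte-wise comparison so sort and uniq agree on ordering.
count_dup_lines() {
    LC_ALL=C sort "$1" | LC_ALL=C uniq -d | wc -l | tr -d ' '
}
```

Because the full files are very large, it would make sense to test a slice first, for example by writing the first million lines of uniq.seqs to a scratch file with head and passing that file to count_dup_lines. A result of 0 would mean every line in the sample is unique; anything higher confirms that identical count/sequence lines are being repeated.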
When I looked in the dDocent executable, I saw that the RPE method uses a special_uniq function, which makes it hard for me to troubleshoot what the problem might be.
Any ideas for how to fix this? Requested info below.
Thanks!
Melanie
dDocent.runs
Variables used in dDocent (version 2.9.6) run at Fri Sep 22 11:04:22 PDT 2023
Number of Processors
8
Trimming
yes
Assembly?
yes
Type_of_Assembly
RPE
Clustering_Similarity%
0.9
Minimum within individaul coverage level to include a read for assembly (K1)
2
Minimum number of individuals a read must be present in to include for assembly (K2)
2
Mapping_Reads?
yes
Mapping_Match_Value
1
Mapping_MisMatch_Value
4
Mapping_GapOpen_Penalty
6
Calling_SNPs?
yes
Email
ls -l [NOTE: I was unable to paste the entire list (webpage became unresponsive) - there are many more samples with the same files and roughly the same file sizes as the two shown here]
-rw-rw-r-- 1 mlacava mlacava 57M Sep 22 11:19 70704_15.F.fq.gz
-rw-rw-r-- 1 mlacava mlacava 62M Sep 22 13:20 70704_15.R.fq.gz
-rw-rw-r-- 1 mlacava mlacava 61M Sep 22 16:52 70704_15.R1.fq.gz
-rw-rw-r-- 1 mlacava mlacava 67M Sep 22 16:52 70704_15.R2.fq.gz
-rw-rw-r-- 1 mlacava mlacava 1.4K Sep 22 16:52 70704_15.trim.log
-rw-rw-r-- 1 mlacava mlacava 218M Sep 22 12:09 70704_15.uniq.seqs
-rw-rw-r-- 1 mlacava mlacava 31M Sep 22 11:19 70704_16.F.fq.gz
-rw-rw-r-- 1 mlacava mlacava 34M Sep 22 13:20 70704_16.R.fq.gz
-rw-rw-r-- 1 mlacava mlacava 34M Sep 22 16:52 70704_16.R1.fq.gz
-rw-rw-r-- 1 mlacava mlacava 36M Sep 22 16:52 70704_16.R2.fq.gz
-rw-rw-r-- 1 mlacava mlacava 1.4K Sep 22 16:52 70704_16.trim.log
-rw-rw-r-- 1 mlacava mlacava 119M Sep 22 12:09 70704_16.uniq.seqs
-rw-rw-r-- 1 mlacava mlacava 417 Sep 15 16:14 config.file
-rw-rw-r-- 1 mlacava mlacava 478 Sep 22 11:04 dDocent.runs
-rw-rw-r-- 1 mlacava mlacava 54K Sep 22 15:11 lengths.txt
-rw-rw-r-- 1 mlacava mlacava 476 Sep 19 09:06 my_ddocent.sh
-rw-rw-r-- 1 mlacava mlacava 3.4K Sep 22 11:04 namelist
-rw-rw-r-- 1 mlacava mlacava 1.9M Sep 22 16:53 slurm-7525263.out
-rw-rw-r-- 1 mlacava mlacava 1.8M Sep 22 16:53 temp.LOG
-rw-rw-r-- 1 mlacava mlacava 27G Sep 22 16:07 total.f.uniq
-rw-rw-r-- 1 mlacava mlacava 72G Sep 22 15:18 total.fr
-rw-rw-r-- 1 mlacava mlacava 35G Sep 22 13:43 total.u.F
-rw-rw-r-- 1 mlacava mlacava 37G Sep 22 13:52 total.u.R
-rw-rw-r-- 1 mlacava mlacava 72G Sep 22 13:32 total.uniqs
-rw-rw-r-- 1 mlacava mlacava 0 Sep 22 11:04 trim.log
drwxrwsr-x 2 mlacava mlacava 770 Sep 22 16:52 trim_reports/
-rw-rw-r-- 1 mlacava mlacava 5.5T Sep 25 09:24 uniq.k.2.c.2.seqs
-rw-rw-r-- 1 mlacava mlacava 93G Sep 22 12:21 uniq.seqs
-rw-rw-r-- 1 mlacava mlacava 27G Sep 22 13:02 uniqCperindv
head temp.LOG [I had to cancel the job because the uniq.k.2.c.2.seqs file was getting out of control, so I only have the temp.LOG. This LOG has thousands of lines about a perl locale error, so I cannot paste the whole log here. Here is the top of the log, up to the first perl error]
dDocent version 2.9.6 started Fri Sep 22 11:04:22 PDT 2023
At this point, all configuration information has been entered and dDocent may take several hours to run.
It is recommended that you move this script to a background operation and disable terminal input and output.
All data and logfiles will still be recorded.
To do this:
Press control and Z simultaneously
Type bg and press enter
Type disown -h and press enter
Now sit back, relax, and wait for your analysis to finish
Trimming reads and simultaneously assembling reference sequences
perl: warning: Setting locale failed.
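
Whether this locale warning has anything to do with the duplicated lines is unknown, but it may be worth ruling out. That perl warning generally means the locale named in the environment is not available on the node. A minimal sketch of forcing the always-available "C" locale before dDocent starts is below; the function name use_c_locale is made up for illustration, and calling it near the top of my_ddocent.sh is an assumption about where it would belong.

```shell
# Hypothetical helper, not part of dDocent.
# Assumption: the job inherits a locale that is not installed on the compute node.
# Switch to the "C" locale, which is always available.
use_c_locale() {
    export LANG=C
    export LC_ALL=C
}
```

After calling use_c_locale, running the locale command in the same script should list "C" for every category, and the perl warning should no longer appear in temp.LOG if the locale was the cause.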