Hello,
I am very excited about redundans and I hope to use it consistently for all my projects ... if I can figure this out!
My unexpected results occur during the reduction phase of the pipeline. I seem to be losing a lot of content, including all of my smaller contigs.
Original Assembly QUAST results (contigs.fa)
# contigs (>= 0 bp) 46087
# contigs (>= 1000 bp) 23529
# contigs (>= 5000 bp) 11560
# contigs (>= 10000 bp) 6759
# contigs (>= 25000 bp) 1685
# contigs (>= 50000 bp) 221
After reduction (contigs.reduced.fa)
# contigs (>= 0 bp) 3907
# contigs (>= 1000 bp) 3907
# contigs (>= 5000 bp) 3907
# contigs (>= 10000 bp) 3907
# contigs (>= 25000 bp) 1685
# contigs (>= 50000 bp) 221
So it is basically removing every contig shorter than 10,000 bp, regardless of whether it is duplicated. Only one contig (~15,000 bp) shows up in the contigs.reduced.fa.hetero.tsv file.
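In case it helps with diagnosis, here is a minimal sketch of how the dropped contigs could be listed by comparing the two FASTA files. It assumes that contig IDs in contigs.reduced.fa are unchanged from contigs.fa (I have not verified that redundans preserves headers), and the function names are my own, not part of redundans or QUAST.

```python
import sys


def fasta_lengths(path):
    """Return {contig_id: sequence_length} for a plain-text FASTA file."""
    lengths = {}
    name = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # Use the first whitespace-delimited token of the header as the ID.
                name = line[1:].split()[0]
                lengths[name] = 0
            elif name is not None:
                lengths[name] += len(line)
    return lengths


def dropped_contigs(original, reduced):
    """Contigs present in the original assembly but absent after reduction.

    Assumption: IDs are identical in both files.
    """
    return {cid: n for cid, n in original.items() if cid not in reduced}


def count_at_least(lengths, thresholds=(0, 1000, 5000, 10000, 25000, 50000)):
    """Count contigs with length >= each threshold (same bins as the QUAST rows above)."""
    return {t: sum(1 for n in lengths.values() if n >= t) for t in thresholds}


if __name__ == "__main__" and len(sys.argv) == 3:
    # Example: python dropped.py contigs.fa contigs.reduced.fa
    orig = fasta_lengths(sys.argv[1])
    red = fasta_lengths(sys.argv[2])
    dropped = dropped_contigs(orig, red)
    print("dropped contigs:", len(dropped))
    for threshold, n in count_at_least(dropped).items():
        print(f"dropped contigs (>= {threshold} bp): {n}")
```

Comparing the dropped IDs against contigs.reduced.fa.hetero.tsv should show how many of the removed contigs were actually reported as heterozygous.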
Here are some BUSCO results showing the loss of content.
BUSCO original:
554 Complete BUSCOs (C)
504 Complete and single-copy BUSCOs (S)
50 Complete and duplicated BUSCOs (D)
118 Fragmented BUSCOs (F)
306 Missing BUSCOs (M)
978 Total BUSCO groups searched
BUSCO contigs.reduced.fa:
389 Complete BUSCOs (C)
369 Complete and single-copy BUSCOs (S)
20 Complete and duplicated BUSCOs (D)
92 Fragmented BUSCOs (F)
497 Missing BUSCOs (M)
978 Total BUSCO groups searched
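To put the two BUSCO summaries on the same scale, the counts can be converted to percentages of the 978 groups searched. This is only a small arithmetic sketch using the numbers listed above; the dictionary layout and function name are my own.

```python
TOTAL = 978  # total BUSCO groups searched, from both summaries above

busco = {
    "original": {"C": 554, "S": 504, "D": 50, "F": 118, "M": 306},
    "reduced": {"C": 389, "S": 369, "D": 20, "F": 92, "M": 497},
}


def percentages(counts, total=TOTAL):
    """Convert raw BUSCO counts to percentages of the total, to one decimal place."""
    return {key: round(100 * value / total, 1) for key, value in counts.items()}


for label, counts in busco.items():
    print(label, percentages(counts))
```

By this calculation, complete BUSCOs drop from about 56.6% to about 39.8% after reduction.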
Switching the options around does not change this observation. Here is my typical run:
redundans.py --identity 0.6 -i $forward $reverse -f $contigs -o redundans_out_60 -t 24
If it matters, I am currently working with a nematode that has a ~150 Mb genome, using Illumina 250 bp reads. I am running redundans-0.13c.
I appreciate any feedback.
Happy Holidays,
Joseph