Too intensive reduction

159 views
Skip to first unread message

l.p.p...@gmail.com

unread,
Feb 7, 2017, 7:44:52 AM2/7/17
to Fuyou Fu, Redundans
Hi Fuyou, 

Yes, there shouldn't be any spaces in contig names (but spaces are allowed in fasta headers ie `contig0001 some description` is fine, but `some description` will be trimmed. 
In the past I was renaming the contigs, but some users were asking to leave original contig names. It makes sense if you want to see which contigs are redundant and analyse them. 

Regarding reduction, in general it's better to start from contigs. But you can try starting from scaffold as well. 
If Redundans removed too much, have a look at identity histogram (OUTDIR/contigs.reduced.fa.hist.png). It'll tell you how similar are the regions that were reduced. In the following runs you can increase --identity threshold if there are old paralogous regions that were removed. 
Are you working with plants? 

Bests, 
L.

2017-02-07 13:38 GMT+01:00 Fuyou Fu <fuf...@gmail.com>:
Hi Leszek,
Meanwhile, I find the genome size is much lower than we predict. I have a question, we should use raw contigs or raw scaffolds?
Thanks,
Fuyou

On Tue, Feb 7, 2017 at 6:31 AM, Fuyou Fu <fuf...@gmail.com> wrote:
Hi Leszek,
It is running now. I think the program can not recognize the space in contigs name. But I find Redundans think many reads are poor quality.

[WARNING] Poor quality: Major orientation (FR) represent 71.79% of pairs in /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/10000_1.fastq.gz - /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/10000_2.fastq.gz: [29, 7179, 2761, 31]
/scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/10000_1.fastq.gz /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/10000_2.fastq.gz	94	83	114.76	93.55	29	7179	2761	31
[WARNING] Highly variable insert size (115 +- 93.55) in /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/10000_1.fastq.gz - /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/10000_2.fastq.gz!
[WARNING] Highly variable insert size (130 +- 173.08) in /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/5000_1.fastq.gz - /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/5000_2.fastq.gz!
[WARNING] Highly variable insert size (411 +- 287.91) in /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/550_1.fastq.gz - /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/550_2.fastq.gz!

##################################################
[Sun Feb  5 20:00:57 2017] Scaffolding...
 iteration 1.1 of 3.2 ...
   67528945 pairs. 16540841 passed filtering [24.49%]. 15185628 in different contigs [22.49%].
   67528945 pairs. 16372026 passed filtering [24.24%]. 14774474 in different contigs [21.88%].
 iteration 1.2 of 3.2 ...
   67528945 pairs. 16800580 passed filtering [24.88%]. 15208902 in different contigs [22.52%].
   67528945 pairs. 16577555 passed filtering [24.55%]. 14760283 in different contigs [21.86%].
[WARNING] Poor quality: Major orientation (RF) represent 52.72% of pairs in /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/2000_1.fastq.gz - /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/2000_2.fastq.gz: [5, 4713, 5272, 10]
[WARNING] Poor quality: Major orientation (FR) represent 68.2% of pairs in /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/5000_1.fastq.gz - /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/5000_2.fastq.gz: [6, 6820, 3166, 8]
[WARNING] Poor quality: Major orientation (FR) represent 65.99% of pairs in /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/10000_1.fastq.gz - /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/10000_2.fastq.gz: [19, 6599, 3363, 19]
[WARNING] Highly variable insert size (182 +- 172.35) in /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/10000_1.fastq.gz - /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/10000_2.fastq.gz!
[WARNING] Highly variable insert size (203 +- 240.25) in /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/5000_1.fastq.gz - /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/5000_2.fastq.gz!
[WARNING] Highly variable insert size (411 +- 287.91) in /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/550_1.fastq.gz - /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/550_2.fastq.gz!
 iteration 2.1 of 3.2 ...
   67528945 pairs. 18790470 passed filtering [27.83%]. 10650650 in different contigs [15.77%].
 iteration 2.2 of 3.2 ...
   67528945 pairs. 19517198 passed filtering [28.90%]. 8149748 in different contigs [12.07%].
[WARNING] Poor quality: Major orientation (RF) represent 58.04% of pairs in /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/2000_1.fastq.gz - /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/2000_2.fastq.gz: [9, 4177, 5804, 10]
[WARNING] Poor quality: Major orientation (FR) represent 58.66% of pairs in /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/5000_1.fastq.gz - /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/5000_2.fastq.gz: [14, 5866, 4105, 15]
[WARNING] Poor quality: Major orientation (FR) represent 67.36% of pairs in /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/10000_1.fastq.gz - /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/10000_2.fastq.gz: [23, 6736, 3214, 27]
[WARNING] Highly variable insert size (196 +- 206.38) in /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/10000_1.fastq.gz - /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/10000_2.fastq.gz!
[WARNING] Highly variable insert size (240 +- 270.10) in /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/5000_1.fastq.gz - /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/5000_2.fastq.gz!
[WARNING] Highly variable insert size (411 +- 287.91) in /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/550_1.fastq.gz - /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/550_2.fastq.gz!
 iteration 3.1 of 3.2 ...
   67528945 pairs. 17146191 passed filtering [25.39%]. 11899340 in different contigs [17.62%].


On Mon, Feb 6, 2017 at 10:00 AM, l.p.p...@gmail.com <l.p.p...@gmail.com> wrote:
Dear Fuyou Fu, 

Does it fail or run hang? 
Can you list output directory? 

ls -lah strv1.0

Bests,

L.

2017-02-05 5:29 GMT+01:00 Fuyou Fu <fuf...@gmail.com>:
Hi Leszek,
I used the new version. But my job can not keep running. Could you help me check the error files?
Thanks,
Fuyou
Options: Namespace(fasta='/depot/tmengist/data/StrigaGenomeData/StrigaV1.0.fa', fastq=['/scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/550_1.fastq.gz', '/scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/550_2.fastq.gz', '/scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/2000_1.fastq.gz', '/scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/2000_2.fastq.gz', '/scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/5000_1.fastq.gz', '/scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/5000_2.fastq.gz', '/scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/10000_1.fastq.gz', '/scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/10000_2.fastq.gz'], identity=0.51, iters=2, joins=5, limit=0.2, linkratio=0.7, log=<open file '<stderr>', mode 'w' at 0x2af84bf5a1e0>, longreads=['/scratch/snyder/f/fu115/Genome_assembly/fastq/seq/filtered_subreads.fastq'], mapq=10, minLength=200, nocleaning=True, nogapclosing=True, norearrangements=False, noreduction=True, noscaffolding=True, outdir='strv1.0', overlap=0.66, reference='', resume=False, threads=20, verbose=True)

##################################################
[Sat Feb  4 20:24:12 2017] Preparing contigs...
 1   
 100001   
 200001   
 300001   
 324162 sequences stored.

##################################################
[Sat Feb  4 20:27:27 2017] Reduction...
#file name	genome size	contigs	heterozygous size	[%]	heterozygous contigs	[%]	identity [%]	possible joins	homozygous size	[%]	homozygous contigs	[%]

On Fri, Feb 3, 2017 at 5:32 PM, Ching-Ho Chang <hily...@gmail.com> wrote:
Hi Leszek,

The new version still lost some unique sequences.
I guess that it may put Ns to some regions where should have sequence information before, say mapped regions.

Thanks,
Ching-Ho

lpryszcz於 2017年1月31日星期二 UTC-5下午3時04分19秒寫道:
Dear Ching-Ho,

Thanks for the information. The problem is because pyScaf didn't report contigs that were not scaffolded. This should be now solved, just pull the latest version (note, you'll need to update submodules as well, thus please execute both below commands).

git pull
git submodule update --recursive

pyScaf is under heavy development - I'd appreciate your feedback!

btw: if you don't need reduction & gap closing, you can use pyScaf directly.

Bests,
Leszek

L.

2017-01-27 5:32 GMT+01:00 Ching-Ho Chang <hily...@gmail.com>:
Hi redundans developer,

I tried to use Pacbio sequences to scaffold the genome.
Here is the result.
#fname  contigs bases   GC [%]  contigs >1kb    bases in contigs >1kb   N50     N90     Ns      longest
runmecate/contigs.fa    9274    503102573       36.370  9274    503102573       105793  21932   5492    1053140
runmecate/contigs.reduced.fa    9142    501703970       36.367  9142    501703970       106102  22145   5492    1053140
runmecate/scaffolds.longreads.1.fa      4583    443058017       36.370  4583    443058017       404424  28820   2398945 7166112
runmecate/scaffolds.longreads.fa        4583    443058017       36.370  4583    443058017       404424  28820   2398945 7166112

It greatly increased the N50 and longest contig size while lost many sequences.
I did find some sequences it removed are not redundant sequences.
It lost ~8% BUSCO (Benchmarking Universal Single-Copy Orthologs) genes
I was wondering if there is any parameter I can tune for the long read scaffolding.

Thank you,
Ching-Ho Chang


--
You received this message because you are subscribed to the Google Groups "Redundans" group.
To unsubscribe from this group and stop receiving emails from it, send an email to redundans+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/redundans/27d8d050-5619-44b4-8cec-4415b51ae236%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "Redundans" group.
To unsubscribe from this group and stop receiving emails from it, send an email to redundans+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/redundans/061b00ef-eefb-4380-89e1-bef0d0236afa%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.



--
Fuyou Fu, Ph.D.
Department of Botany and Plant Pathology
Purdue University
USA

--
You received this message because you are subscribed to the Google Groups "Redundans" group.
To unsubscribe from this group and stop receiving emails from it, send an email to redundans+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/redundans/CAFUo82g_-oOpwttXXsnJiX_ZnyhgCHmw64rN_G1UouJFVEVw8g%40mail.gmail.com.

For more options, visit https://groups.google.com/d/optout.




--
Fuyou Fu, Ph.D.
Department of Botany and Plant Pathology
Purdue University
USA




--
Fuyou Fu, Ph.D.
Department of Botany and Plant Pathology
Purdue University
USA


Fuyou Fu

unread,
Feb 7, 2017, 7:56:47 AM2/7/17
to l.p.p...@gmail.com, Redundans
Hi Leszek,
It looks the hist figure is not good.
Thanks,
Fuyou
contigs.reduced.fa.hist.png

Fuyou Fu

unread,
Feb 7, 2017, 7:58:11 AM2/7/17
to l.p.p...@gmail.com, Redundans
Hi Leszek,
My genome is plant genome, the genome size is about 1.6Gb.
Thanks,
Fuyou

Fuyou Fu

unread,
Feb 7, 2017, 8:01:43 AM2/7/17
to l.p.p...@gmail.com, Redundans
Hi Leszek,
You think how much --identity threshold is better based on my results? 
Thanks,
Fuyou

l.p.p...@gmail.com

unread,
Feb 7, 2017, 8:54:12 AM2/7/17
to Fuyou Fu, Redundans
Yes, the regions identified by redundans are likely quite ancient whole genome duplication events / paralogs. Try increasing --identity to 0.9 or even 0.95. 

One important point, your mate-pair libraries are heavily contaminated by paired-end reads ie. 71.79% reads in 10000_?.fastq.gz is FR instead of RF, thus this library is recognised as paired-end...

Fuyou Fu

unread,
Feb 7, 2017, 9:12:31 AM2/7/17
to l.p.p...@gmail.com, Redundans
Hi Leszek,
My command is correct?
/group/bioinfo/apps/apps/redundans-0.13a-try3/redundans.py -v -t 20 \
-i /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/550_1.fastq.gz /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/550_2.fastq.gz \
/scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/2000_1.fastq.gz /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/2000_2.fastq.gz 2000_2.fq.gz.is.txt \
/scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/5000_1.fastq.gz /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/5000_2.fastq.gz 5000_2.fq.gz.is.txt \
/scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/10000_1.fastq.gz /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/10000_2.fastq.gz 10000_2.fq.gz.is.txt \
-l /scratch/snyder/f/fu115/Genome_assembly/fastq/seq/filtered_subreads.fastq \
-f /scratch/snyder/f/fu115/Genome_assembly/Redundans/striga_initial.contig \
--identity 0.95  \
-o str_soapd
Thanks,
Fuyou

mht

unread,
Mar 13, 2018, 5:09:02 AM3/13/18
to Redundans
Hi Leszek,

Hoping to get insight into the same topic. My assembly went from 1.8Gb to 1.2Gb after redundans, expected genome size at ~1.6Gb. I'm guessing some of the repeats were collapsed as well? What is the best practice to decide what --identity threshold to set for a specific set of scaffolds? From the previous example you recommended 0.9 or 0.95, how did you get to that number?

Thanks,
mht





L.

--
Fuyou Fu, Ph.D.
Department of Botany and Plant Pathology
Purdue University
USA

--
You received this message because you are subscribed to the Google Groups "Redundans" group.
To unsubscribe from this group and stop receiving emails from it, send an email to redundans+...@googlegroups.com.

mht

unread,
Mar 13, 2018, 7:44:52 PM3/13/18
to Redundans
Attached my identity histogram to use as example.

Thanks.
contigs.reduced.fa.hist.png

l.p.p...@gmail.com

unread,
Mar 14, 2018, 11:41:16 AM3/14/18
to mht, Redundans
Hi, from your histogram I'd go for --identity cut-off of ~0.95. 
You can also check contigs.reduced.fa.hetero.tsv to see what is avg. identity and overlap between heterozygous contigs . 

Hope it helps,
L. 

L.

To unsubscribe from this group and stop receiving emails from it, send an email to redundans+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/redundans/dc81eb10-cd13-483b-acc9-fe581e16782e%40googlegroups.com.

mht

unread,
Mar 15, 2018, 12:19:16 AM3/15/18
to Redundans
Thanks Leszek for the tip. May I know what the rationale is behind setting 0.95 in this case (for future analyses)?
From the graph, majority of the contigs look like they still share 0.95 identity, it's even the peak of the graph.


L.

l.p.p...@gmail.com

unread,
Mar 15, 2018, 5:27:08 AM3/15/18
to mht, Redundans
Dear Mun, the identity of repetitive regions seems to peak ~.95, so if you set cut-off around that, the program will remove only the most identical repeats. 
Note, in my experience, slight over reduction isn't a big problem, as I was recovering correct genome size during scaffolding and gap closing (see NAR paper for details). 
Bests, 
L.

L.

To unsubscribe from this group and stop receiving emails from it, send an email to redundans+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/redundans/714b1072-91ef-40ba-ab4c-87cb262345b5%40googlegroups.com.
Reply all
Reply to author
Forward
0 new messages