Too intensive reduction

l.p.p...@gmail.com

unread,

Feb 7, 2017, 7:44:52 AM2/7/17

to Fuyou Fu, Redundans

Hi Fuyou,

Yes, there shouldn't be any spaces in contig names (but spaces are allowed in fasta headers ie `contig0001 some description` is fine, but `some description` will be trimmed.

In the past I was renaming the contigs, but some users were asking to leave original contig names. It makes sense if you want to see which contigs are redundant and analyse them.

Regarding reduction, in general it's better to start from contigs. But you can try starting from scaffold as well.

If Redundans removed too much, have a look at identity histogram (OUTDIR/contigs.reduced.fa.hist.png). It'll tell you how similar are the regions that were reduced. In the following runs you can increase --identity threshold if there are old paralogous regions that were removed.

Are you working with plants?

Bests,

L.

2017-02-07 13:38 GMT+01:00 Fuyou Fu <fuf...@gmail.com>:

Hi Leszek,
Meanwhile, I find the genome size is much lower than we predict. I have a question, we should use raw contigs or raw scaffolds?
Thanks,
Fuyou
On Tue, Feb 7, 2017 at 6:31 AM, Fuyou Fu <fuf...@gmail.com> wrote:
Hi Leszek,
It is running now. I think the program can not recognize the space in contigs name. But I find Redundans think many reads are poor quality.
[WARNING] Poor quality: Major orientation (FR) represent 71.79% of pairs in /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/10000_1.fastq.gz - /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/10000_2.fastq.gz: [29, 7179, 2761, 31]
/scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/10000_1.fastq.gz /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/10000_2.fastq.gz	94	83	114.76	93.55	29	7179	2761	31
[WARNING] Highly variable insert size (115 +- 93.55) in /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/10000_1.fastq.gz - /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/10000_2.fastq.gz!
[WARNING] Highly variable insert size (130 +- 173.08) in /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/5000_1.fastq.gz - /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/5000_2.fastq.gz!
[WARNING] Highly variable insert size (411 +- 287.91) in /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/550_1.fastq.gz - /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/550_2.fastq.gz!

##################################################
[Sun Feb  5 20:00:57 2017] Scaffolding...
 iteration 1.1 of 3.2 ...
   67528945 pairs. 16540841 passed filtering [24.49%]. 15185628 in different contigs [22.49%].
   67528945 pairs. 16372026 passed filtering [24.24%]. 14774474 in different contigs [21.88%].
 iteration 1.2 of 3.2 ...
   67528945 pairs. 16800580 passed filtering [24.88%]. 15208902 in different contigs [22.52%].
   67528945 pairs. 16577555 passed filtering [24.55%]. 14760283 in different contigs [21.86%].
[WARNING] Poor quality: Major orientation (RF) represent 52.72% of pairs in /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/2000_1.fastq.gz - /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/2000_2.fastq.gz: [5, 4713, 5272, 10]
[WARNING] Poor quality: Major orientation (FR) represent 68.2% of pairs in /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/5000_1.fastq.gz - /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/5000_2.fastq.gz: [6, 6820, 3166, 8]
[WARNING] Poor quality: Major orientation (FR) represent 65.99% of pairs in /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/10000_1.fastq.gz - /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/10000_2.fastq.gz: [19, 6599, 3363, 19]
[WARNING] Highly variable insert size (182 +- 172.35) in /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/10000_1.fastq.gz - /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/10000_2.fastq.gz!
[WARNING] Highly variable insert size (203 +- 240.25) in /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/5000_1.fastq.gz - /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/5000_2.fastq.gz!
[WARNING] Highly variable insert size (411 +- 287.91) in /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/550_1.fastq.gz - /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/550_2.fastq.gz!
 iteration 2.1 of 3.2 ...
   67528945 pairs. 18790470 passed filtering [27.83%]. 10650650 in different contigs [15.77%].
 iteration 2.2 of 3.2 ...
   67528945 pairs. 19517198 passed filtering [28.90%]. 8149748 in different contigs [12.07%].
[WARNING] Poor quality: Major orientation (RF) represent 58.04% of pairs in /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/2000_1.fastq.gz - /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/2000_2.fastq.gz: [9, 4177, 5804, 10]
[WARNING] Poor quality: Major orientation (FR) represent 58.66% of pairs in /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/5000_1.fastq.gz - /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/5000_2.fastq.gz: [14, 5866, 4105, 15]
[WARNING] Poor quality: Major orientation (FR) represent 67.36% of pairs in /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/10000_1.fastq.gz - /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/10000_2.fastq.gz: [23, 6736, 3214, 27]
[WARNING] Highly variable insert size (196 +- 206.38) in /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/10000_1.fastq.gz - /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/10000_2.fastq.gz!
[WARNING] Highly variable insert size (240 +- 270.10) in /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/5000_1.fastq.gz - /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/5000_2.fastq.gz!
[WARNING] Highly variable insert size (411 +- 287.91) in /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/550_1.fastq.gz - /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/550_2.fastq.gz!
 iteration 3.1 of 3.2 ...
   67528945 pairs. 17146191 passed filtering [25.39%]. 11899340 in different contigs [17.62%].
On Mon, Feb 6, 2017 at 10:00 AM, l.p.p...@gmail.com <l.p.p...@gmail.com> wrote:
Dear Fuyou Fu,

Does it fail or run hang?
Can you list output directory?

ls -lah strv1.0

Bests,
L.
2017-02-05 5:29 GMT+01:00 Fuyou Fu <fuf...@gmail.com>:
Hi Leszek,
I used the new version. But my job can not keep running. Could you help me check the error files?
Thanks,
Fuyou
Options: Namespace(fasta='/depot/tmengist/data/StrigaGenomeData/StrigaV1.0.fa', fastq=['/scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/550_1.fastq.gz', '/scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/550_2.fastq.gz', '/scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/2000_1.fastq.gz', '/scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/2000_2.fastq.gz', '/scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/5000_1.fastq.gz', '/scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/5000_2.fastq.gz', '/scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/10000_1.fastq.gz', '/scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/10000_2.fastq.gz'], identity=0.51, iters=2, joins=5, limit=0.2, linkratio=0.7, log=<open file '<stderr>', mode 'w' at 0x2af84bf5a1e0>, longreads=['/scratch/snyder/f/fu115/Genome_assembly/fastq/seq/filtered_subreads.fastq'], mapq=10, minLength=200, nocleaning=True, nogapclosing=True, norearrangements=False, noreduction=True, noscaffolding=True, outdir='strv1.0', overlap=0.66, reference='', resume=False, threads=20, verbose=True)

##################################################
[Sat Feb  4 20:24:12 2017] Preparing contigs...
 1   
 100001   
 200001   
 300001   
 324162 sequences stored.

##################################################
[Sat Feb  4 20:27:27 2017] Reduction...
#file name	genome size	contigs	heterozygous size	[%]	heterozygous contigs	[%]	identity [%]	possible joins	homozygous size	[%]	homozygous contigs	[%]
On Fri, Feb 3, 2017 at 5:32 PM, Ching-Ho Chang <hily...@gmail.com> wrote:
Hi Leszek,

The new version still lost some unique sequences.
I guess that it may put Ns to some regions where should have sequence information before, say mapped regions.

Thanks,
Ching-Ho

lpryszcz於 2017年1月31日星期二 UTC-5下午3時04分19秒寫道：
Dear Ching-Ho,

Thanks for the information. The problem is because pyScaf didn't report contigs that were not scaffolded. This should be now solved, just pull the latest version (note, you'll need to update submodules as well, thus please execute both below commands).

git pull
git submodule update --recursive

pyScaf is under heavy development - I'd appreciate your feedback!

btw: if you don't need reduction & gap closing, you can use pyScaf directly.

Bests,
Leszek

L.

2017-01-27 5:32 GMT+01:00 Ching-Ho Chang <hily...@gmail.com>:
Hi redundans developer,

I tried to use Pacbio sequences to scaffold the genome.
Here is the result.
#fname contigs bases GC [%] contigs >1kb bases in contigs >1kb N50 N90 Ns longest
runmecate/contigs.fa 9274 503102573 36.370 9274 503102573 105793 21932 5492 1053140
runmecate/contigs.reduced.fa 9142 501703970 36.367 9142 501703970 106102 22145 5492 1053140
runmecate/scaffolds.longreads.1.fa 4583 443058017 36.370 4583 443058017 404424 28820 2398945 7166112
runmecate/scaffolds.longreads.fa 4583 443058017 36.370 4583 443058017 404424 28820 2398945 7166112

It greatly increased the N50 and longest contig size while lost many sequences.
I did find some sequences it removed are not redundant sequences.
It lost ~8% BUSCO (Benchmarking Universal Single-Copy Orthologs) genes
I was wondering if there is any parameter I can tune for the long read scaffolding.

Thank you,
Ching-Ho Chang

--
You received this message because you are subscribed to the Google Groups "Redundans" group.
To unsubscribe from this group and stop receiving emails from it, send an email to redundans+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/redundans/27d8d050-5619-44b4-8cec-4415b51ae236%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
--
You received this message because you are subscribed to the Google Groups "Redundans" group.
To unsubscribe from this group and stop receiving emails from it, send an email to redundans+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/redundans/061b00ef-eefb-4380-89e1-bef0d0236afa%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.
--
Fuyou Fu, Ph.D.
Department of Botany and Plant Pathology
Purdue University
USA
--
You received this message because you are subscribed to the Google Groups "Redundans" group.
To unsubscribe from this group and stop receiving emails from it, send an email to redundans+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/redundans/CAFUo82g_-oOpwttXXsnJiX_ZnyhgCHmw64rN_G1UouJFVEVw8g%40mail.gmail.com.

For more options, visit https://groups.google.com/d/optout.
--
Fuyou Fu, Ph.D.
Department of Botany and Plant Pathology
Purdue University
USA
--
Fuyou Fu, Ph.D.
Department of Botany and Plant Pathology
Purdue University
USA

Fuyou Fu

unread,

Feb 7, 2017, 7:56:47 AM2/7/17

to l.p.p...@gmail.com, Redundans

Hi Leszek,

It looks the hist figure is not good.

Thanks,

Fuyou

contigs.reduced.fa.hist.png

Fuyou Fu

unread,

Feb 7, 2017, 7:58:11 AM2/7/17

to l.p.p...@gmail.com, Redundans

Hi Leszek,

My genome is plant genome, the genome size is about 1.6Gb.

Thanks,

Fuyou

Fuyou Fu

unread,

Feb 7, 2017, 8:01:43 AM2/7/17

to l.p.p...@gmail.com, Redundans

Hi Leszek,

You think how much --identity threshold is better based on my results?

Thanks,

Fuyou

l.p.p...@gmail.com

unread,

Feb 7, 2017, 8:54:12 AM2/7/17

to Fuyou Fu, Redundans

Yes, the regions identified by redundans are likely quite ancient whole genome duplication events / paralogs. Try increasing --identity to 0.9 or even 0.95.

One important point, your mate-pair libraries are heavily contaminated by paired-end reads ie. 71.79% reads in 10000_?.fastq.gz is FR instead of RF, thus this library is recognised as paired-end...

Try to set insert size manually for all libraries: https://github.com/lpryszcz/redundans/tree/master/docs#estimation-of-my-library-statistics-is-incorrect-can-i-specify-these-values-manually

Fuyou Fu

unread,

Feb 7, 2017, 9:12:31 AM2/7/17

to l.p.p...@gmail.com, Redundans

Hi Leszek,

My command is correct?

/group/bioinfo/apps/apps/redundans-0.13a-try3/redundans.py -v -t 20 \

-i /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/550_1.fastq.gz /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/550_2.fastq.gz \

/scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/2000_1.fastq.gz /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/2000_2.fastq.gz 2000_2.fq.gz.is.txt \

/scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/5000_1.fastq.gz /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/5000_2.fastq.gz 5000_2.fq.gz.is.txt \

/scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/10000_1.fastq.gz /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/10000_2.fastq.gz 10000_2.fq.gz.is.txt \

-l /scratch/snyder/f/fu115/Genome_assembly/fastq/seq/filtered_subreads.fastq \

-f /scratch/snyder/f/fu115/Genome_assembly/Redundans/striga_initial.contig \

--identity 0.95 \

-o str_soapd

Thanks,

Fuyou

mht

unread,

Mar 13, 2018, 5:09:02 AM3/13/18

to Redundans

Hi Leszek,

Hoping to get insight into the same topic. My assembly went from 1.8Gb to 1.2Gb after redundans, expected genome size at ~1.6Gb. I'm guessing some of the repeats were collapsed as well? What is the best practice to decide what --identity threshold to set for a specific set of scaffolds? From the previous example you recommended 0.9 or 0.95, how did you get to that number?

Thanks,

mht

L.

To view this discussion on the web visit https://groups.google.com/d/msgid/redundans/061b00ef-eefb-4380-89e1-bef0d0236afa%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

--
Fuyou Fu, Ph.D.
Department of Botany and Plant Pathology
Purdue University
USA

--
You received this message because you are subscribed to the Google Groups "Redundans" group.

To unsubscribe from this group and stop receiving emails from it, send an email to redundans+...@googlegroups.com.

mht

unread,

Mar 13, 2018, 7:44:52 PM3/13/18

to Redundans

Attached my identity histogram to use as example.

Thanks.

contigs.reduced.fa.hist.png

l.p.p...@gmail.com

unread,

Mar 14, 2018, 11:41:16 AM3/14/18

to mht, Redundans

Hi, from your histogram I'd go for --identity cut-off of ~0.95.

You can also check contigs.reduced.fa.hetero.tsv to see what is avg. identity and overlap between heterozygous contigs .

Hope it helps,

L.

To unsubscribe from this group and stop receiving emails from it, send an email to redundans+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/redundans/dc81eb10-cd13-483b-acc9-fe581e16782e%40googlegroups.com.

mht

unread,

Mar 15, 2018, 12:19:16 AM3/15/18

to Redundans

Thanks Leszek for the tip. May I know what the rationale is behind setting 0.95 in this case (for future analyses)?

From the graph, majority of the contigs look like they still share 0.95 identity, it's even the peak of the graph.

L.

l.p.p...@gmail.com

unread,

Mar 15, 2018, 5:27:08 AM3/15/18

to mht, Redundans

Dear Mun, the identity of repetitive regions seems to peak ~.95, so if you set cut-off around that, the program will remove only the most identical repeats.

Note, in my experience, slight over reduction isn't a big problem, as I was recovering correct genome size during scaffolding and gap closing (see NAR paper for details).

Bests,

L.

To unsubscribe from this group and stop receiving emails from it, send an email to redundans+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/redundans/714b1072-91ef-40ba-ab4c-87cb262345b5%40googlegroups.com.

Reply all

Reply to author

Forward