Lost sequences after scaffolding by Pacbio

Ching-Ho Chang

unread,

Jan 26, 2017, 11:32:13 PM1/26/17

to Redundans

Hi redundans developer,

I tried to use Pacbio sequences to scaffold the genome.

Here is the result.

#fname contigs bases GC [%] contigs >1kb bases in contigs >1kb N50 N90 Ns longest

runmecate/contigs.fa 9274 503102573 36.370 9274 503102573 105793 21932 5492 1053140

runmecate/contigs.reduced.fa 9142 501703970 36.367 9142 501703970 106102 22145 5492 1053140

runmecate/scaffolds.longreads.1.fa 4583 443058017 36.370 4583 443058017 404424 28820 2398945 7166112

runmecate/scaffolds.longreads.fa 4583 443058017 36.370 4583 443058017 404424 28820 2398945 7166112

It greatly increased the N50 and longest contig size while lost many sequences.

I did find some sequences it removed are not redundant sequences.

It lost ~8% BUSCO (Benchmarking Universal Single-Copy Orthologs) genes

I was wondering if there is any parameter I can tune for the long read scaffolding.

Thank you,

Ching-Ho Chang

l.p.p...@gmail.com

unread,

Jan 31, 2017, 3:04:19 PM1/31/17

to Ching-Ho Chang, Redundans

Dear Ching-Ho,

Thanks for the information. The problem is because pyScaf didn't report contigs that were not scaffolded. This should be now solved, just pull the latest version (note, you'll need to update submodules as well, thus please execute both below commands).

git pull

git submodule update --recursive

pyScaf is under heavy development - I'd appreciate your feedback!

btw: if you don't need reduction & gap closing, you can use pyScaf directly.

Bests,

Leszek

L.

--
You received this message because you are subscribed to the Google Groups "Redundans" group.
To unsubscribe from this group and stop receiving emails from it, send an email to redundans+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/redundans/27d8d050-5619-44b4-8cec-4415b51ae236%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Ching-Ho Chang

unread,

Feb 3, 2017, 6:32:53 PM2/3/17

to Redundans, hily...@gmail.com

Hi Leszek,

The new version still lost some unique sequences.

I guess that it may put Ns to some regions where should have sequence information before, say mapped regions.

Thanks,

Ching-Ho

lpryszcz於 2017年1月31日星期二 UTC-5下午3時04分19秒寫道：

L.

To unsubscribe from this group and stop receiving emails from it, send an email to redundans+...@googlegroups.com.

Fuyou Fu

unread,

Feb 4, 2017, 11:29:42 PM2/4/17

to Ching-Ho Chang, Redundans

Hi Leszek,

I used the new version. But my job can not keep running. Could you help me check the error files?

Thanks,

Fuyou

Options: Namespace(fasta='/depot/tmengist/data/StrigaGenomeData/StrigaV1.0.fa', fastq=['/scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/550_1.fastq.gz', '/scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/550_2.fastq.gz', '/scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/2000_1.fastq.gz', '/scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/2000_2.fastq.gz', '/scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/5000_1.fastq.gz', '/scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/5000_2.fastq.gz', '/scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/10000_1.fastq.gz', '/scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/10000_2.fastq.gz'], identity=0.51, iters=2, joins=5, limit=0.2, linkratio=0.7, log=<open file '<stderr>', mode 'w' at 0x2af84bf5a1e0>, longreads=['/scratch/snyder/f/fu115/Genome_assembly/fastq/seq/filtered_subreads.fastq'], mapq=10, minLength=200, nocleaning=True, nogapclosing=True, norearrangements=False, noreduction=True, noscaffolding=True, outdir='strv1.0', overlap=0.66, reference='', resume=False, threads=20, verbose=True)

##################################################
[Sat Feb  4 20:24:12 2017] Preparing contigs...
 1   
 100001   
 200001   
 300001   
 324162 sequences stored.

##################################################
[Sat Feb  4 20:27:27 2017] Reduction...
#file name	genome size	contigs	heterozygous size	[%]	heterozygous contigs	[%]	identity [%]	possible joins	homozygous size	[%]	homozygous contigs	[%]

To unsubscribe from this group and stop receiving emails from it, send an email to redundans+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/redundans/061b00ef-eefb-4380-89e1-bef0d0236afa%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

--

Fuyou Fu, Ph.D.
Department of Botany and Plant Pathology
Purdue University
USA

l.p.p...@gmail.com

unread,

Feb 6, 2017, 10:58:26 AM2/6/17

to Ching-Ho Chang, Redundans

Dear Ching-Ho, Can you give me some number, how much regions is missing? Also I'd appreciate some if you can share some data, so I can reproduce it.

Bests,

L.

To unsubscribe from this group and stop receiving emails from it, send an email to redundans+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/redundans/061b00ef-eefb-4380-89e1-bef0d0236afa%40googlegroups.com.

l.p.p...@gmail.com

unread,

Feb 6, 2017, 11:00:47 AM2/6/17

to Fuyou Fu, Ching-Ho Chang, Redundans

Dear Fuyou Fu,

Does it fail or run hang?

Can you list output directory?

ls -lah strv1.0

Bests,

L.

To view this discussion on the web visit https://groups.google.com/d/msgid/redundans/CAFUo82g_-oOpwttXXsnJiX_ZnyhgCHmw64rN_G1UouJFVEVw8g%40mail.gmail.com.

Ching-Ho Chang

unread,

Feb 6, 2017, 3:49:08 PM2/6/17

to Redundans, hily...@gmail.com

Hi Leszek,

There is the information.

#fname contigs bases GC [%] contigs >1kb bases in contigs >1kb N50 N90 Ns longest

Masurca4.fasta 10273 502084516 36.418 8129 500731438 142864 28039 572485 1043053

Masurca4_2/contigs.fa 10272 502084326 36.418 8129 500731438 142864 28039 572485 1043053

Masurca4_2/scaffolds.longreads.1.fa 6760 432357666 36.425 4687 431050726 462377 40997 2803906 5036293

Masurca4_2/scaffolds.longreads.fa 6760 432357666 36.425 4687 431050726 462377 40997 2803906 5036293

Before scaffolding,

INFO C:96.9%[S:82.9%,D:14.0%],F:2.0%,M:1.1%,n:2442

INFO 2366 Complete BUSCOs (C)

INFO 2025 Complete and single-copy BUSCOs (S)

INFO 341 Complete and duplicated BUSCOs (D)

INFO 49 Fragmented BUSCOs (F)

INFO 27 Missing BUSCOs (M)

INFO 2442 Total BUSCO groups searched

After scaffoldiing

INFO C:85.2%[S:74.6%,D:10.6%],F:2.3%,M:12.5%,n:2442

INFO 2081 Complete BUSCOs (C)

INFO 1821 Complete and single-copy BUSCOs (S)

INFO 260 Complete and duplicated BUSCOs (D)

INFO 57 Fragmented BUSCOs (F)

INFO 304 Missing BUSCOs (M)

INFO 2442 Total BUSCO groups searched

I don't have permission to share the sequences, but I will ask.

Could you tell me your algorithm in the long read scaffolding?

It perform much better than other programs, ex. pbjelly and sspace longread, even though I don't have the experimental data to validate the junction.

Thank you,

Ching-Ho

lpryszcz於 2017年2月6日星期一 UTC-5上午10時58分26秒寫道：

L.

l.p.p...@gmail.com

unread,

Feb 7, 2017, 6:20:36 AM2/7/17

to Ching-Ho Chang, Redundans

Hi,

It's pyScaf (https://github.com/lpryszcz/pyScaf), I've developed it myself last year. This is still quite alpha version (this is why I'd be happy about the feedback), but it works well on test data sets.

From the stats it's difficult to say what's wrong... It'd be good if you can share contigs and reads with me. I'll keep them private.

In meantime, can you execute below commands (in redundans install dir) and send me output?

cd ~/src/redundans

git log | head -n5

cd bin/pyScaf

git log | head -n5

Greetings from Warsaw!

L.

To unsubscribe from this group and stop receiving emails from it, send an email to redundans+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/redundans/40b0b362-7f9b-4a94-a01c-b8d1413f1716%40googlegroups.com.

Fuyou Fu

unread,

Feb 7, 2017, 7:31:29 AM2/7/17

to l.p.p...@gmail.com, Ching-Ho Chang, Redundans

Hi Leszek,

It is running now. I think the program can not recognize the space in contigs name. But I find Redundans think many reads are poor quality.

[WARNING] Poor quality: Major orientation (FR) represent 71.79% of pairs in /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/10000_1.fastq.gz - /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/10000_2.fastq.gz: [29, 7179, 2761, 31]
/scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/10000_1.fastq.gz /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/10000_2.fastq.gz	94	83	114.76	93.55	29	7179	2761	31
[WARNING] Highly variable insert size (115 +- 93.55) in /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/10000_1.fastq.gz - /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/10000_2.fastq.gz!
[WARNING] Highly variable insert size (130 +- 173.08) in /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/5000_1.fastq.gz - /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/5000_2.fastq.gz!
[WARNING] Highly variable insert size (411 +- 287.91) in /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/550_1.fastq.gz - /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/550_2.fastq.gz!

##################################################
[Sun Feb  5 20:00:57 2017] Scaffolding...
 iteration 1.1 of 3.2 ...
   67528945 pairs. 16540841 passed filtering [24.49%]. 15185628 in different contigs [22.49%].
   67528945 pairs. 16372026 passed filtering [24.24%]. 14774474 in different contigs [21.88%].
 iteration 1.2 of 3.2 ...
   67528945 pairs. 16800580 passed filtering [24.88%]. 15208902 in different contigs [22.52%].
   67528945 pairs. 16577555 passed filtering [24.55%]. 14760283 in different contigs [21.86%].
[WARNING] Poor quality: Major orientation (RF) represent 52.72% of pairs in /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/2000_1.fastq.gz - /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/2000_2.fastq.gz: [5, 4713, 5272, 10]
[WARNING] Poor quality: Major orientation (FR) represent 68.2% of pairs in /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/5000_1.fastq.gz - /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/5000_2.fastq.gz: [6, 6820, 3166, 8]
[WARNING] Poor quality: Major orientation (FR) represent 65.99% of pairs in /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/10000_1.fastq.gz - /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/10000_2.fastq.gz: [19, 6599, 3363, 19]
[WARNING] Highly variable insert size (182 +- 172.35) in /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/10000_1.fastq.gz - /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/10000_2.fastq.gz!
[WARNING] Highly variable insert size (203 +- 240.25) in /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/5000_1.fastq.gz - /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/5000_2.fastq.gz!
[WARNING] Highly variable insert size (411 +- 287.91) in /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/550_1.fastq.gz - /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/550_2.fastq.gz!
 iteration 2.1 of 3.2 ...
   67528945 pairs. 18790470 passed filtering [27.83%]. 10650650 in different contigs [15.77%].
 iteration 2.2 of 3.2 ...
   67528945 pairs. 19517198 passed filtering [28.90%]. 8149748 in different contigs [12.07%].
[WARNING] Poor quality: Major orientation (RF) represent 58.04% of pairs in /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/2000_1.fastq.gz - /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/2000_2.fastq.gz: [9, 4177, 5804, 10]
[WARNING] Poor quality: Major orientation (FR) represent 58.66% of pairs in /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/5000_1.fastq.gz - /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/5000_2.fastq.gz: [14, 5866, 4105, 15]
[WARNING] Poor quality: Major orientation (FR) represent 67.36% of pairs in /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/10000_1.fastq.gz - /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/10000_2.fastq.gz: [23, 6736, 3214, 27]
[WARNING] Highly variable insert size (196 +- 206.38) in /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/10000_1.fastq.gz - /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/10000_2.fastq.gz!
[WARNING] Highly variable insert size (240 +- 270.10) in /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/5000_1.fastq.gz - /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/5000_2.fastq.gz!
[WARNING] Highly variable insert size (411 +- 287.91) in /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/550_1.fastq.gz - /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/550_2.fastq.gz!
 iteration 3.1 of 3.2 ...
   67528945 pairs. 17146191 passed filtering [25.39%]. 11899340 in different contigs [17.62%].

Fuyou Fu

unread,

Feb 7, 2017, 7:38:44 AM2/7/17

to l.p.p...@gmail.com, Ching-Ho Chang, Redundans

Hi Leszek,

Meanwhile, I find the genome size is much lower than we predict. I have a question, we should use raw contigs or raw scaffolds?

Thanks,

Fuyou

stephan...@ugent.be

unread,

Feb 7, 2017, 7:57:53 AM2/7/17

to Redundans

Hi,
Did I understand that correctly, that you can scaffold with short (eg illumina) as well as long (eg pacbio) sequences?
How come Redundans automatically identifies all reads as "pairs"?

Cheers,

Stephanie

l.p.p...@gmail.com

unread,

Feb 7, 2017, 8:06:56 AM2/7/17

to stephan...@ugent.be, Redundans

Dear Stephanie,

As you mentioned, scaffolding with short reads uses paired-end information.

With long reads, you can find individual reads that span gaps, this is connect 2 or more contigs and join them together.

Bests,

L.

--
You received this message because you are subscribed to the Google Groups "Redundans" group.
To unsubscribe from this group and stop receiving emails from it, send an email to redundans+unsubscribe@googlegroups.com.

To view this discussion on the web visit https://groups.google.com/d/msgid/redundans/e3c70eef-e3c9-4152-865c-405f694c6c0c%40googlegroups.com.

stephan...@ugent.be

unread,

Feb 7, 2017, 8:37:50 AM2/7/17

to Redundans, stephan...@ugent.be

:) I think I was not clear, can Redundans use long-read information as well? If so, corrected or non-corrected (eg lordec)?
If so, How come Redundans automatically identifies all reads as "pairs"? See information from Fuyou Fu (assuming that e.g.

5000_2.fastq.gz are PacBio reads of 5000bp length on average):

[WARNING] Poor quality: Major orientation (RF) represent 58.04% of pairs in /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/2000_1.fastq.gz - /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/2000_2.fastq.gz: [9, 4177, 5804, 10]

[WARNING] Poor quality: Major orientation (FR) represent 58.66% of pairs in /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/5000_1.fastq.gz - /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/5000_2.fastq.gz: [14, 5866, 4105, 15]
[WARNING] Poor quality: Major orientation (FR) represent 67.36% of pairs in /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/10000_1.fastq.gz - /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/10000_2.fastq.gz: [23, 6736, 3214, 27]
[WARNING] Highly variable insert size (196 +- 206.38) in /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/10000_1.fastq.gz - /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/10000_2.fastq.gz!
[WARNING] Highly variable insert size (240 +- 270.10) in /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/5000_1.fastq.gz - /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/5000_2.fastq.gz!
[WARNING] Highly variable insert size (411 +- 287.91) in /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/550_1.fastq.gz - /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/550_2.fastq.gz!

L.

To unsubscribe from this group and stop receiving emails from it, send an email to redundans+...@googlegroups.com.

l.p.p...@gmail.com

unread,

Feb 7, 2017, 8:44:03 AM2/7/17

to stephan...@ugent.be, Redundans

Hi Staphanie,

:) I think I was not clear, can Redundans use long-read information as well?

Yes

If so, corrected or non-corrected (eg lordec)?

Both

If so, How come Redundans automatically identifies all reads as "pairs"?

See information from Fuyou Fu (assuming that e.g.

You misunderstood, the dataset below is Illumina paired-end (ie. 550_1.fastq.gz and 550_2.fastq.gz) and mate-pairs (ie. 5000_2.fastq.gz and 5000_2.fastq.gz), not long reads...

To unsubscribe from this group and stop receiving emails from it, send an email to redundans+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/redundans/6b6bce30-e061-4e01-bdea-d3ffeb2163f8%40googlegroups.com.

stephan...@ugent.be

unread,

Feb 7, 2017, 9:12:30 AM2/7/17

to Redundans, stephan...@ugent.be

That's great! I should have read the manual more thoroughly, my apologies (trying too many softwares at once).
Redundans takes into account every scenario (short, hybrid or long). Would it be lenient enough to use a closely related genome for scaffolding?

Hi Staphanie,

Reply all

Reply to author

Forward