Lost sequences after scaffolding by Pacbio

109 views
Skip to first unread message

Ching-Ho Chang

unread,
Jan 26, 2017, 11:32:13 PM1/26/17
to Redundans
Hi redundans developer,

I tried to use Pacbio sequences to scaffold the genome.
Here is the result.
#fname  contigs bases   GC [%]  contigs >1kb    bases in contigs >1kb   N50     N90     Ns      longest
runmecate/contigs.fa    9274    503102573       36.370  9274    503102573       105793  21932   5492    1053140
runmecate/contigs.reduced.fa    9142    501703970       36.367  9142    501703970       106102  22145   5492    1053140
runmecate/scaffolds.longreads.1.fa      4583    443058017       36.370  4583    443058017       404424  28820   2398945 7166112
runmecate/scaffolds.longreads.fa        4583    443058017       36.370  4583    443058017       404424  28820   2398945 7166112

It greatly increased the N50 and longest contig size while lost many sequences.
I did find some sequences it removed are not redundant sequences.
It lost ~8% BUSCO (Benchmarking Universal Single-Copy Orthologs) genes
I was wondering if there is any parameter I can tune for the long read scaffolding.

Thank you,
Ching-Ho Chang


l.p.p...@gmail.com

unread,
Jan 31, 2017, 3:04:19 PM1/31/17
to Ching-Ho Chang, Redundans
Dear Ching-Ho,

Thanks for the information. The problem is because pyScaf didn't report contigs that were not scaffolded. This should be now solved, just pull the latest version (note, you'll need to update submodules as well, thus please execute both below commands).

git pull
git submodule update --recursive

pyScaf is under heavy development - I'd appreciate your feedback!

btw: if you don't need reduction & gap closing, you can use pyScaf directly.

Bests,
Leszek

L.

--
You received this message because you are subscribed to the Google Groups "Redundans" group.
To unsubscribe from this group and stop receiving emails from it, send an email to redundans+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/redundans/27d8d050-5619-44b4-8cec-4415b51ae236%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Ching-Ho Chang

unread,
Feb 3, 2017, 6:32:53 PM2/3/17
to Redundans, hily...@gmail.com
Hi Leszek,

The new version still lost some unique sequences.
I guess that it may put Ns to some regions where should have sequence information before, say mapped regions.

Thanks,
Ching-Ho

lpryszcz於 2017年1月31日星期二 UTC-5下午3時04分19秒寫道:

L.

To unsubscribe from this group and stop receiving emails from it, send an email to redundans+...@googlegroups.com.

Fuyou Fu

unread,
Feb 4, 2017, 11:29:42 PM2/4/17
to Ching-Ho Chang, Redundans
Hi Leszek,
I used the new version. But my job can not keep running. Could you help me check the error files?
Thanks,
Fuyou
Options: Namespace(fasta='/depot/tmengist/data/StrigaGenomeData/StrigaV1.0.fa', fastq=['/scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/550_1.fastq.gz', '/scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/550_2.fastq.gz', '/scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/2000_1.fastq.gz', '/scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/2000_2.fastq.gz', '/scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/5000_1.fastq.gz', '/scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/5000_2.fastq.gz', '/scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/10000_1.fastq.gz', '/scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/10000_2.fastq.gz'], identity=0.51, iters=2, joins=5, limit=0.2, linkratio=0.7, log=<open file '<stderr>', mode 'w' at 0x2af84bf5a1e0>, longreads=['/scratch/snyder/f/fu115/Genome_assembly/fastq/seq/filtered_subreads.fastq'], mapq=10, minLength=200, nocleaning=True, nogapclosing=True, norearrangements=False, noreduction=True, noscaffolding=True, outdir='strv1.0', overlap=0.66, reference='', resume=False, threads=20, verbose=True)

##################################################
[Sat Feb  4 20:24:12 2017] Preparing contigs...
 1   
 100001   
 200001   
 300001   
 324162 sequences stored.

##################################################
[Sat Feb  4 20:27:27 2017] Reduction...
#file name	genome size	contigs	heterozygous size	[%]	heterozygous contigs	[%]	identity [%]	possible joins	homozygous size	[%]	homozygous contigs	[%]

To unsubscribe from this group and stop receiving emails from it, send an email to redundans+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/redundans/061b00ef-eefb-4380-89e1-bef0d0236afa%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.



--
Fuyou Fu, Ph.D.
Department of Botany and Plant Pathology
Purdue University
USA

l.p.p...@gmail.com

unread,
Feb 6, 2017, 10:58:26 AM2/6/17
to Ching-Ho Chang, Redundans
Dear Ching-Ho, Can you give me some number, how much regions is missing? Also I'd appreciate some if you can share some data, so I can reproduce it. 
Bests, 

L.

To unsubscribe from this group and stop receiving emails from it, send an email to redundans+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/redundans/061b00ef-eefb-4380-89e1-bef0d0236afa%40googlegroups.com.

l.p.p...@gmail.com

unread,
Feb 6, 2017, 11:00:47 AM2/6/17
to Fuyou Fu, Ching-Ho Chang, Redundans
Dear Fuyou Fu, 

Does it fail or run hang? 
Can you list output directory? 

ls -lah strv1.0

Bests,

L.

Ching-Ho Chang

unread,
Feb 6, 2017, 3:49:08 PM2/6/17
to Redundans, hily...@gmail.com
Hi Leszek,
There is the information.

#fname  contigs bases   GC [%]  contigs >1kb    bases in contigs >1kb   N50     N90     Ns      longest
Masurca4.fasta  10273   502084516       36.418  8129    500731438       142864  28039   572485  1043053
Masurca4_2/contigs.fa   10272   502084326       36.418  8129    500731438       142864  28039   572485  1043053
Masurca4_2/scaffolds.longreads.1.fa     6760    432357666       36.425  4687    431050726       462377  40997   2803906 5036293
Masurca4_2/scaffolds.longreads.fa       6760    432357666       36.425  4687    431050726       462377  40997   2803906 5036293

Before scaffolding,
INFO    C:96.9%[S:82.9%,D:14.0%],F:2.0%,M:1.1%,n:2442
INFO    2366 Complete BUSCOs (C)
INFO    2025 Complete and single-copy BUSCOs (S)
INFO    341 Complete and duplicated BUSCOs (D)
INFO    49 Fragmented BUSCOs (F)
INFO    27 Missing BUSCOs (M)
INFO    2442 Total BUSCO groups searched

After scaffoldiing
INFO    C:85.2%[S:74.6%,D:10.6%],F:2.3%,M:12.5%,n:2442
INFO    2081 Complete BUSCOs (C)
INFO    1821 Complete and single-copy BUSCOs (S)
INFO    260 Complete and duplicated BUSCOs (D)
INFO    57 Fragmented BUSCOs (F)
INFO    304 Missing BUSCOs (M)
INFO    2442 Total BUSCO groups searched

I don't have permission to share the sequences, but I will ask.
Could you tell me your algorithm in the long read scaffolding?
It perform much better than other programs, ex. pbjelly and sspace longread, even though I don't have the experimental data  to validate the junction.

Thank you,
Ching-Ho

lpryszcz於 2017年2月6日星期一 UTC-5上午10時58分26秒寫道:

L.

l.p.p...@gmail.com

unread,
Feb 7, 2017, 6:20:36 AM2/7/17
to Ching-Ho Chang, Redundans
Hi, 

It's pyScaf (https://github.com/lpryszcz/pyScaf), I've developed it myself last year. This is still quite alpha version (this is why I'd be happy about the feedback), but it works well on test data sets. 
From the stats it's difficult to say what's wrong... It'd be good if you can share contigs and reads with me. I'll keep them private. 

In meantime, can you execute below commands (in redundans install dir) and send me output? 

cd ~/src/redundans
git log | head -n5

cd bin/pyScaf
git log | head -n5

Greetings from Warsaw!  
L.

To unsubscribe from this group and stop receiving emails from it, send an email to redundans+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/redundans/40b0b362-7f9b-4a94-a01c-b8d1413f1716%40googlegroups.com.

Fuyou Fu

unread,
Feb 7, 2017, 7:31:29 AM2/7/17
to l.p.p...@gmail.com, Ching-Ho Chang, Redundans
Hi Leszek,
It is running now. I think the program can not recognize the space in contigs name. But I find Redundans think many reads are poor quality.

[WARNING] Poor quality: Major orientation (FR) represent 71.79% of pairs in /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/10000_1.fastq.gz - /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/10000_2.fastq.gz: [29, 7179, 2761, 31]
/scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/10000_1.fastq.gz /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/10000_2.fastq.gz	94	83	114.76	93.55	29	7179	2761	31
[WARNING] Highly variable insert size (115 +- 93.55) in /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/10000_1.fastq.gz - /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/10000_2.fastq.gz!
[WARNING] Highly variable insert size (130 +- 173.08) in /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/5000_1.fastq.gz - /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/5000_2.fastq.gz!
[WARNING] Highly variable insert size (411 +- 287.91) in /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/550_1.fastq.gz - /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/550_2.fastq.gz!

##################################################
[Sun Feb  5 20:00:57 2017] Scaffolding...
 iteration 1.1 of 3.2 ...
   67528945 pairs. 16540841 passed filtering [24.49%]. 15185628 in different contigs [22.49%].
   67528945 pairs. 16372026 passed filtering [24.24%]. 14774474 in different contigs [21.88%].
 iteration 1.2 of 3.2 ...
   67528945 pairs. 16800580 passed filtering [24.88%]. 15208902 in different contigs [22.52%].
   67528945 pairs. 16577555 passed filtering [24.55%]. 14760283 in different contigs [21.86%].
[WARNING] Poor quality: Major orientation (RF) represent 52.72% of pairs in /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/2000_1.fastq.gz - /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/2000_2.fastq.gz: [5, 4713, 5272, 10]
[WARNING] Poor quality: Major orientation (FR) represent 68.2% of pairs in /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/5000_1.fastq.gz - /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/5000_2.fastq.gz: [6, 6820, 3166, 8]
[WARNING] Poor quality: Major orientation (FR) represent 65.99% of pairs in /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/10000_1.fastq.gz - /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/10000_2.fastq.gz: [19, 6599, 3363, 19]
[WARNING] Highly variable insert size (182 +- 172.35) in /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/10000_1.fastq.gz - /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/10000_2.fastq.gz!
[WARNING] Highly variable insert size (203 +- 240.25) in /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/5000_1.fastq.gz - /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/5000_2.fastq.gz!
[WARNING] Highly variable insert size (411 +- 287.91) in /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/550_1.fastq.gz - /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/550_2.fastq.gz!
 iteration 2.1 of 3.2 ...
   67528945 pairs. 18790470 passed filtering [27.83%]. 10650650 in different contigs [15.77%].
 iteration 2.2 of 3.2 ...
   67528945 pairs. 19517198 passed filtering [28.90%]. 8149748 in different contigs [12.07%].
[WARNING] Poor quality: Major orientation (RF) represent 58.04% of pairs in /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/2000_1.fastq.gz - /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/2000_2.fastq.gz: [9, 4177, 5804, 10]
[WARNING] Poor quality: Major orientation (FR) represent 58.66% of pairs in /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/5000_1.fastq.gz - /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/5000_2.fastq.gz: [14, 5866, 4105, 15]
[WARNING] Poor quality: Major orientation (FR) represent 67.36% of pairs in /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/10000_1.fastq.gz - /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/10000_2.fastq.gz: [23, 6736, 3214, 27]
[WARNING] Highly variable insert size (196 +- 206.38) in /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/10000_1.fastq.gz - /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/10000_2.fastq.gz!
[WARNING] Highly variable insert size (240 +- 270.10) in /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/5000_1.fastq.gz - /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/5000_2.fastq.gz!
[WARNING] Highly variable insert size (411 +- 287.91) in /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/550_1.fastq.gz - /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/550_2.fastq.gz!
 iteration 3.1 of 3.2 ...
   67528945 pairs. 17146191 passed filtering [25.39%]. 11899340 in different contigs [17.62%].

Fuyou Fu

unread,
Feb 7, 2017, 7:38:44 AM2/7/17
to l.p.p...@gmail.com, Ching-Ho Chang, Redundans
Hi Leszek,
Meanwhile, I find the genome size is much lower than we predict. I have a question, we should use raw contigs or raw scaffolds?
Thanks,
Fuyou

stephan...@ugent.be

unread,
Feb 7, 2017, 7:57:53 AM2/7/17
to Redundans
Hi,
Did I understand that correctly, that you can scaffold with short (eg illumina) as well as long (eg pacbio) sequences?
How come Redundans automatically identifies all reads as "pairs"?

Cheers,

Stephanie

l.p.p...@gmail.com

unread,
Feb 7, 2017, 8:06:56 AM2/7/17
to stephan...@ugent.be, Redundans
Dear Stephanie, 

As you mentioned, scaffolding with short reads uses paired-end information. 
With long reads, you can find individual reads that span gaps, this is connect 2 or more contigs and join them together.  

Bests, 
L. 

L.

--
You received this message because you are subscribed to the Google Groups "Redundans" group.
To unsubscribe from this group and stop receiving emails from it, send an email to redundans+unsubscribe@googlegroups.com.

stephan...@ugent.be

unread,
Feb 7, 2017, 8:37:50 AM2/7/17
to Redundans, stephan...@ugent.be
:) I think I was not clear, can Redundans use long-read information as well? If so, corrected or non-corrected (eg lordec)?
If so, How come Redundans automatically identifies all reads as "pairs"? See information from Fuyou Fu (assuming that e.g.
5000_2.fastq.gz are PacBio reads of 5000bp length on average):
[WARNING] Poor quality: Major orientation (RF) represent 58.04% of pairs in /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/2000_1.fastq.gz - /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/2000_2.fastq.gz: [9, 4177, 5804, 10]
[WARNING] Poor quality: Major orientation (FR) represent 58.66% of pairs in /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/5000_1.fastq.gz - /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/5000_2.fastq.gz: [14, 5866, 4105, 15]
[WARNING] Poor quality: Major orientation (FR) represent 67.36% of pairs in /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/10000_1.fastq.gz - /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/10000_2.fastq.gz: [23, 6736, 3214, 27]
[WARNING] Highly variable insert size (196 +- 206.38) in /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/10000_1.fastq.gz - /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/10000_2.fastq.gz!
[WARNING] Highly variable insert size (240 +- 270.10) in /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/5000_1.fastq.gz - /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/5000_2.fastq.gz!
[WARNING] Highly variable insert size (411 +- 287.91) in /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/550_1.fastq.gz - /scratch/snyder/f/fu115/Genome_assembly/DBG/Redundans/seq/550_2.fastq.gz!



L.

To unsubscribe from this group and stop receiving emails from it, send an email to redundans+...@googlegroups.com.

l.p.p...@gmail.com

unread,
Feb 7, 2017, 8:44:03 AM2/7/17
to stephan...@ugent.be, Redundans
Hi Staphanie, 

:) I think I was not clear, can Redundans use long-read information as well?
Yes 
If so, corrected or non-corrected (eg lordec)?
Both 
If so, How come Redundans automatically identifies all reads as "pairs"? 
See information from Fuyou Fu (assuming that e.g.
You misunderstood, the dataset below is Illumina paired-end (ie. 550_1.fastq.gz and 550_2.fastq.gz) and mate-pairs (ie. 5000_2.fastq.gz and 5000_2.fastq.gz), not long reads... 

 
To unsubscribe from this group and stop receiving emails from it, send an email to redundans+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/redundans/6b6bce30-e061-4e01-bdea-d3ffeb2163f8%40googlegroups.com.

stephan...@ugent.be

unread,
Feb 7, 2017, 9:12:30 AM2/7/17
to Redundans, stephan...@ugent.be
That's great! I should have read the manual more thoroughly, my apologies (trying too many softwares at once).
Redundans takes into account every scenario (short, hybrid or long). Would it be lenient enough to use a closely related genome for scaffolding?
Hi Staphanie, 

Reply all
Reply to author
Forward
0 new messages