Using the ABySS pipeline to scaffold a DISCOVAR de novo assembly

115 views
Skip to first unread message

Dan Browne

unread,
Dec 8, 2015, 10:10:51 PM12/8/15
to ABySS
Hi guys,

As the title says it, I have a DISCOVAR de novo assembly that I generated with a 2x250 bp paired end library with a fragment size of ~800. It's not a bad assembly, but it's not very contiguous, which is to be expected given that there is no mate pair data included. Since I have a couple 2x150 bp mate pair libraries (inserts 1.5 kb and 4.0 kb) I want to try using the ABySS pipeline (everything after the ABYSS-P step) to scaffold.

Everything works fine up until the first map step (ABySS was compiled with --enable-maxk=256, BTW):

AdjList -v   -k200 -m50 --dot SXPX_DDN_BROKEN-1.fa >SXPX_DDN_BROKEN-1.dot
abyss-filtergraph -v --dot   -k200 -g SXPX_DDN_BROKEN-2.dot1 SXPX_DDN_BROKEN-1.dot SXPX_DDN_BROKEN-1.fa >SXPX_DDN_BROKEN-1.path
MergeContigs -v  -k200 -g SXPX_DDN_BROKEN-2.dot -o SXPX_DDN_BROKEN-2.fa SXPX_DDN_BROKEN-1.fa SXPX_DDN_BROKEN-2.dot1 SXPX_DDN_BROKEN-1.path
PopBubbles -v --dot -j20 -k200  -p0.9  -g SXPX_DDN_BROKEN-3.dot SXPX_DDN_BROKEN-2.fa SXPX_DDN_BROKEN-2.dot >SXPX_DDN_BROKEN-2.path
MergeContigs -v  -k200 -o SXPX_DDN_BROKEN-3.fa SXPX_DDN_BROKEN-2.fa SXPX_DDN_BROKEN-2.dot SXPX_DDN_BROKEN-2.path
awk '!/^>/ {x[">" $1]=1; next} {getline s} $1 in x {print $0 "\n" s}' \
                SXPX_DDN_BROKEN-2.path SXPX_DDN_BROKEN-1.fa >SXPX_DDN_BROKEN-indel.fa
ln -sf SXPX_DDN_BROKEN-3.fa SXPX_DDN_BROKEN-unitigs.fa
abyss-map -v  -j20 -l200    ../02_Separated_Pairs/Library.SXPX.L.fq.gz ../02_Separated_Pairs/Library.SXPX.R.fq.gz SXPX_DDN_BROKEN-3.fa \
                |abyss-fixmate -v  -l200  -h pe1-3.hist \
                |sort -snk3 -k4 \
                |DistanceEst -v  -j20 -k200 -l200 -s500 -n10   -o pe1-3.dist pe1-3.hist


Here is the corresponding output:


Reading `SXPX_DDN_BROKEN-1.fa'...
Finding overlaps of exactly k-1 bp...
V=353130 E=130474 E/V=0.369
Degree: █▂
        01234
0: 70% 1: 24% 2-4: 6.2% 5+: 0% max: 4
Finding overlaps of fewer than k-1 bp...
V=353130 E=168079956 E/V=476
Degree: ▅█▁
        01234
0: 20% 1: 30% 2-4: 14% 5+: 37% max: 8203
Loading graph from file: SXPX_DDN_BROKEN-1.dot
Graph stats before:
V=353130 E=168079956 E/V=476
Degree: ▅█▁
        01234
0: 20% 1: 30% 2-4: 14% 5+: 37% max: 8203
Removing shim contigs from the graph...
Pass 1: Checking 9670 contigs.
Pass 2: Checking 17 contigs.
Shim removal stats:
Removed: 4498 Too Complex: 55937 Tails: 55781 Too Long: 59860 Self Adjacent: 488 Parallel Edges: 72
Reading `SXPX_DDN_BROKEN-1.fa'...
Edge removal stats:
Removed: 0
Graph stats after:
V=344132 E=168070667 E/V=488
Degree: ▆█▂_
        01234
0: 20% 1: 28% 2-4: 14% 5+: 38% max: 8203
Reading `SXPX_DDN_BROKEN-2.dot1'...
Read 344132 vertices. Using 1.86 GB of memory.
Reading `SXPX_DDN_BROKEN-1.fa'...
Read 172066 sequences. Using 2.18 GB of memory.
Reading `SXPX_DDN_BROKEN-1.path'...
Read 0 paths. Using 2.18 GB of memory.
Writing `SXPX_DDN_BROKEN-2.dot'...
V=344132 E=168070667 E/V=488
Degree: ▆█▂_
        01234
0: 20% 1: 28% 2-4: 14% 5+: 38% max: 8203
Reading `SXPX_DDN_BROKEN-2.dot'...
V=344132 E=168070667 E/V=488
Degree: ▆█▂_
        01234
0: 20% 1: 28% 2-4: 14% 5+: 38% max: 8203
Reading `SXPX_DDN_BROKEN-2.fa'...
Bubbles: 534 Popped: 388 Scaffolds: 0 Complex: 85 Too long: 0 Too many: 45 Dissimilar: 16
V=294910 E=168020663 E/V=570
Degree: █▆▃__
        01234
0: 23% 1: 17% 2-4: 16% 5+: 44% max: 8203
Reading `SXPX_DDN_BROKEN-2.dot'...
Read 344132 vertices. Using 1.86 GB of memory.
Reading `SXPX_DDN_BROKEN-2.fa'...
Read 172066 sequences. Using 2.08 GB of memory.
Reading `SXPX_DDN_BROKEN-2.path'...
Read 20613 paths. Using 2.09 GB of memory.
The minimum coverage of single-end contigs is inf.
The minimum coverage of merged contigs is inf.
n       n:200   L50     min     N80     N50     N20     E-size  max     sum     name
147455  147455  12083   300     751     3925    10323   34989   1317114 207.5e6 SXPX_DDN_BROKEN-3.fa
Reading from standard input...
Reading `SXPX_DDN_BROKEN-3.fa'...
Using 18.6 MB of memory and 126 B/sequence.
Reading `SXPX_DDN_BROKEN-3.fa'...
Building the suffix array...
Building the Burrows-Wheeler transform...
Building the character occurrence table...
Read 212 MB in 147455 contigs.
Using 1.88 GB of memory and 8.87 B/bp.
Read 15 alignments. Hash load: 5 / 8 = 0.625 using 975 kB.
Read 131 alignments. Hash load: 9 / 16 = 0.5625 using 975 kB.
Read 1000000 alignments. Hash load: 4 / 16 = 0.25 using 975 kB.
Read 2000000 alignments. Hash load: 4 / 16 = 0.25 using 975 kB.
Read 3000000 alignments. Hash load: 0 / 16 = 0 using 975 kB.
Read 4000000 alignments. Hash load: 2 / 16 = 0.125 using 975 kB.
Read 5000000 alignments. Hash load: 4 / 16 = 0.25 using 975 kB.
[............]
Read 499000000 alignments. Hash load: 4 / 32 = 0.125 using 1.22 MB.
Mapped 302150492 of 499073402 reads (60.5%)
Mapped 299438060 of 499073402 reads uniquely (60%)
Read 499073402 alignments
Mateless           0
Unaligned   56417991  22.6%
Singleton   84086928  33.7%
FR          92547076  37.1%
RF              7349  0.00295%
FF             23043  0.00923%
Different   16454314  6.59%
Total      249536701
FR Stats mean: 740.3 median: 796 sd: 149.4 n: 92539745 min: 227 max: 1096 ignored: 14680
                        _____________▁▁▁▁▁▁▁▂▂▂▂▂▃▃▄▅▇███▇▆▅▃▂▁_
Mate orientation FR: 92547076 (100%) RF: 7349 (0.00794%)
The library pe1-3.hist is oriented forward-reverse (FR).
Stats mean: 740.3 median: 796 sd: 149.4 n: 92539745 min: 227 max: 1096
                        _____________▁▁▁▁▁▁▁▂▂▂▂▂▃▃▄▅▇███▇▆▅▃▂▁_
Minimum and maximum distance are set to -199 and 1096 bp.
error: input must be sorted: saw `flattened_line_137693_1' before `flattened_line_86968_1'

 Everything beyond this point essentially fails. The problem is that the file "pe1-3.dist" is empty, probably because the "sort" command in the mapping pipe is not working properly with the names of the sequences in the SXPX_DDN_BROKEN-1.fa file, which look like:

>flattened_line_0_1

I realize this is pretty different from what ABYSS-P puts out something like:

>0 100 315

My question is, should I modify the sort command? If so, how? I don't really understand how it is sorting the input (though I have tried to examine the sort manual to figure out what the flags are doing). Or should I perhaps try to rename the fasta sequences in the SXPX_DDN_BROKEN-1.fa? It would be fairly easy enough to determine names for the first two parameters in the ABYSS-P format, but it would be harder to determine the 3rd one.

Any guidance on this is greatly appreciated. I'm really excited to see how ABySS scaffolds this assembly, in contrast with SSPACE.

Best,
Dan
 

Ben Vandervalk

unread,
Dec 9, 2015, 1:00:46 PM12/9/15
to Dan Browne, ABySS
Hi Dan,

Sorry for the slow response.

If you just want to ABySS for scaffolding with MPET then what you will have to do "trick" ABySS pipeline into starting at the scaffolding stage with your assembly FASTA file.  With your current setup, you are actually running a couple stages of the assembly pipeline prior to scaffolding (the unitig and contig stages), in addition to doing the scaffolding stage itself.  

So how do you "trick" ABySS to starting at the scaffolding stage?  It is not as user-friendly as we would like.  Since ABySS is a Makefile, what you need to do is setup the prerequisite files for the scaffold stage with exactly the right filenames and then run your abyss-pe as normal.

Scaffolding starts after stage 6 (contig stage), so I think the prerequisite files would be:

SXPX_DDN_BROKEN-6.fa
SXPX_DDN_BROKEN-6.dot

To set those up, rename the FASTA IDs in your assembly FASTA file to numeric IDS, change the filename to SXPX_DDN_BROKEN-6.fa, and then build the corresponding dot file with the `AdjList` program. (The 'dot' file is a graph describing end-to-end overlaps between the sequences).  You are correct that ABySS expects integers for the FASTA IDs.  Only the first number in the FASTA headers (the ID) should be important.  Don't worry about replicating the other numbers, as they shouldn't have any effect. 

In addition, I think you will also need to tell `make` not to build any of the prerequisite files for SXPX_DDN_BROKEN-6.fa / SXPX_DDN_BROKEN-6.dot
by adding the following options to the end of your `abyss-pe` command:

-o SXPX_DDN_BROKEN-6.fa -o SXPX_DDN_BROKEN-6.dot

As you may be aware, you can add the `-n` (--dry-run) to see what commands would be run by abyss-pe before actually running the ABySS pipeline for real.  I highly recommend that. If everything goes as planned, the first commands should be alignments of the MPET files against the contigs (-6 files) with `abyss-map`.   Then following that, the commands to generate the -7 and -8 files.  The -8.fa is the scaffolds file.

Phew.  HTH,

- Ben

--
You received this message because you are subscribed to the Google Groups "ABySS" group.
To unsubscribe from this group and stop receiving emails from it, send an email to abyss-users...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Dan Browne

unread,
Dec 9, 2015, 2:02:50 PM12/9/15
to Ben Vandervalk, ABySS
Thanks Ben, that's exactly what I needed to know! A little bit of sed script and I should be good to go:
sed -e 's/flattened_line_//g' -e 's/_[0-9]*$//g' < SXPX_DDN_BROKEN-1.fa > SXPX_DDN_BROKEN_RN-1.fa
Best,
Dan

Dan Browne

unread,
Dec 14, 2015, 12:16:46 PM12/14/15
to Ben Vandervalk, ABySS
FYI for anyone who reads this in the future, the sed command I specified actually caused some problems. It worked as intended, but I didn't realize that there would be duplicate fasta headers as a result, i.e.:

>flattened_line_100_1
>flattened_line_100_2

Both become
 
>100
>100

Which causes errors if you try to run the ABySS pipeline. Here's a command that works much better (source: https://www.biostars.org/p/53212/#53219):

awk '/^>/{print ">" ++i;next}{print}' < SXPX_DDN_BROKEN-1.fa > SXPX_DDN_BROKEN_RN-1.fa 
Reply all
Reply to author
Forward
0 new messages