Hi guys,
As the title says it, I have a DISCOVAR de novo assembly that I generated with a 2x250 bp paired end library with a fragment size of ~800. It's not a bad assembly, but it's not very contiguous, which is to be expected given that there is no mate pair data included. Since I have a couple 2x150 bp mate pair libraries (inserts 1.5 kb and 4.0 kb) I want to try using the ABySS pipeline (everything after the ABYSS-P step) to scaffold.
Everything works fine up until the first map step (ABySS was compiled with --enable-maxk=256, BTW):
AdjList -v -k200 -m50 --dot SXPX_DDN_BROKEN-1.fa >SXPX_DDN_BROKEN-1.dot
abyss-filtergraph -v --dot -k200 -g SXPX_DDN_BROKEN-2.dot1 SXPX_DDN_BROKEN-1.dot SXPX_DDN_BROKEN-1.fa >SXPX_DDN_BROKEN-1.path
MergeContigs -v -k200 -g SXPX_DDN_BROKEN-2.dot -o SXPX_DDN_BROKEN-2.fa SXPX_DDN_BROKEN-1.fa SXPX_DDN_BROKEN-2.dot1 SXPX_DDN_BROKEN-1.path
PopBubbles -v --dot -j20 -k200 -p0.9 -g SXPX_DDN_BROKEN-3.dot SXPX_DDN_BROKEN-2.fa SXPX_DDN_BROKEN-2.dot >SXPX_DDN_BROKEN-2.path
MergeContigs -v -k200 -o SXPX_DDN_BROKEN-3.fa SXPX_DDN_BROKEN-2.fa SXPX_DDN_BROKEN-2.dot SXPX_DDN_BROKEN-2.path
awk '!/^>/ {x[">" $1]=1; next} {getline s} $1 in x {print $0 "\n" s}' \
SXPX_DDN_BROKEN-2.path SXPX_DDN_BROKEN-1.fa >SXPX_DDN_BROKEN-indel.fa
ln -sf SXPX_DDN_BROKEN-3.fa SXPX_DDN_BROKEN-unitigs.fa
abyss-map -v -j20 -l200 ../02_Separated_Pairs/Library.SXPX.L.fq.gz ../02_Separated_Pairs/Library.SXPX.R.fq.gz SXPX_DDN_BROKEN-3.fa \
|abyss-fixmate -v -l200 -h pe1-3.hist \
|sort -snk3 -k4 \
|DistanceEst -v -j20 -k200 -l200 -s500 -n10 -o pe1-3.dist pe1-3.hist
Here is the corresponding output:
Reading `SXPX_DDN_BROKEN-1.fa'...
Finding overlaps of exactly k-1 bp...
V=353130 E=130474 E/V=0.369
Degree: █▂
01234
0: 70% 1: 24% 2-4: 6.2% 5+: 0% max: 4
Finding overlaps of fewer than k-1 bp...
V=353130 E=168079956 E/V=476
Degree: ▅█▁
01234
0: 20% 1: 30% 2-4: 14% 5+: 37% max: 8203
Loading graph from file: SXPX_DDN_BROKEN-1.dot
Graph stats before:
V=353130 E=168079956 E/V=476
Degree: ▅█▁
01234
0: 20% 1: 30% 2-4: 14% 5+: 37% max: 8203
Removing shim contigs from the graph...
Pass 1: Checking 9670 contigs.
Pass 2: Checking 17 contigs.
Shim removal stats:
Removed: 4498 Too Complex: 55937 Tails: 55781 Too Long: 59860 Self Adjacent: 488 Parallel Edges: 72
Reading `SXPX_DDN_BROKEN-1.fa'...
Edge removal stats:
Removed: 0
Graph stats after:
V=344132 E=168070667 E/V=488
Degree: ▆█▂_
01234
0: 20% 1: 28% 2-4: 14% 5+: 38% max: 8203
Reading `SXPX_DDN_BROKEN-2.dot1'...
Read 344132 vertices. Using 1.86 GB of memory.
Reading `SXPX_DDN_BROKEN-1.fa'...
Read 172066 sequences. Using 2.18 GB of memory.
Reading `SXPX_DDN_BROKEN-1.path'...
Read 0 paths. Using 2.18 GB of memory.
Writing `SXPX_DDN_BROKEN-2.dot'...
V=344132 E=168070667 E/V=488
Degree: ▆█▂_
01234
0: 20% 1: 28% 2-4: 14% 5+: 38% max: 8203
Reading `SXPX_DDN_BROKEN-2.dot'...
V=344132 E=168070667 E/V=488
Degree: ▆█▂_
01234
0: 20% 1: 28% 2-4: 14% 5+: 38% max: 8203
Reading `SXPX_DDN_BROKEN-2.fa'...
Bubbles: 534 Popped: 388 Scaffolds: 0 Complex: 85 Too long: 0 Too many: 45 Dissimilar: 16
V=294910 E=168020663 E/V=570
Degree: █▆▃__
01234
0: 23% 1: 17% 2-4: 16% 5+: 44% max: 8203
Reading `SXPX_DDN_BROKEN-2.dot'...
Read 344132 vertices. Using 1.86 GB of memory.
Reading `SXPX_DDN_BROKEN-2.fa'...
Read 172066 sequences. Using 2.08 GB of memory.
Reading `SXPX_DDN_BROKEN-2.path'...
Read 20613 paths. Using 2.09 GB of memory.
The minimum coverage of single-end contigs is inf.
The minimum coverage of merged contigs is inf.
n n:200 L50 min N80 N50 N20 E-size max sum name
147455 147455 12083 300 751 3925 10323 34989 1317114 207.5e6 SXPX_DDN_BROKEN-3.fa
Reading from standard input...
Reading `SXPX_DDN_BROKEN-3.fa'...
Using 18.6 MB of memory and 126 B/sequence.
Reading `SXPX_DDN_BROKEN-3.fa'...
Building the suffix array...
Building the Burrows-Wheeler transform...
Building the character occurrence table...
Read 212 MB in 147455 contigs.
Using 1.88 GB of memory and 8.87 B/bp.
Read 15 alignments. Hash load: 5 / 8 = 0.625 using 975 kB.
Read 131 alignments. Hash load: 9 / 16 = 0.5625 using 975 kB.
Read 1000000 alignments. Hash load: 4 / 16 = 0.25 using 975 kB.
Read 2000000 alignments. Hash load: 4 / 16 = 0.25 using 975 kB.
Read 3000000 alignments. Hash load: 0 / 16 = 0 using 975 kB.
Read 4000000 alignments. Hash load: 2 / 16 = 0.125 using 975 kB.
Read 5000000 alignments. Hash load: 4 / 16 = 0.25 using 975 kB.
[............]
Read 499000000 alignments. Hash load: 4 / 32 = 0.125 using 1.22 MB.
Mapped 302150492 of 499073402 reads (60.5%)
Mapped 299438060 of 499073402 reads uniquely (60%)
Read 499073402 alignments
Mateless 0
Unaligned 56417991 22.6%
Singleton 84086928 33.7%
FR 92547076 37.1%
RF 7349 0.00295%
FF 23043 0.00923%
Different 16454314 6.59%
Total 249536701
FR Stats mean: 740.3 median: 796 sd: 149.4 n: 92539745 min: 227 max: 1096 ignored: 14680
_____________▁▁▁▁▁▁▁▂▂▂▂▂▃▃▄▅▇███▇▆▅▃▂▁_
Mate orientation FR: 92547076 (100%) RF: 7349 (0.00794%)
The library pe1-3.hist is oriented forward-reverse (FR).
Stats mean: 740.3 median: 796 sd: 149.4 n: 92539745 min: 227 max: 1096
_____________▁▁▁▁▁▁▁▂▂▂▂▂▃▃▄▅▇███▇▆▅▃▂▁_
Minimum and maximum distance are set to -199 and 1096 bp.
error: input must be sorted: saw `flattened_line_137693_1' before `flattened_line_86968_1'
Everything beyond this point essentially fails. The problem is that the file "pe1-3.dist" is empty, probably because the "sort" command in the mapping pipe is not working properly with the names of the sequences in the SXPX_DDN_BROKEN-1.fa file, which look like:
>flattened_line_0_1
I realize this is pretty different from what ABYSS-P puts out something like:
>0 100 315
My question is, should I modify the sort command? If so, how? I don't really understand how it is sorting the input (though I have tried to examine the sort manual to figure out what the flags are doing). Or should I perhaps try to rename the fasta sequences in the SXPX_DDN_BROKEN-1.fa? It would be fairly easy enough to determine names for the first two parameters in the ABYSS-P format, but it would be harder to determine the 3rd one.
Any guidance on this is greatly appreciated. I'm really excited to see how ABySS scaffolds this assembly, in contrast with SSPACE.
Best,
Dan