little help with abyss assembler

99 views
Skip to first unread message

Pavle Eric

unread,
Aug 9, 2023, 11:22:56 AM8/9/23
to ABySS
Hello everyone,


Recently I got some sequencing data, pair-end 150bp Illumina reads of a bird species that has not been assembled to this date. So I tried your assembler with a few k values (K=69, 79, 89, 99, 109, 119). Out of these values, K=99 gave the best assembly parameters, and a busco score of 94.4%. Also, the genome size is very close to the genome size of a species from the same genus, that has chromosome level assembly on NCBI. On the other hand, I am aware that when using only short reads, the genomes are going to have a lot of gaps, but having over 500k scaffolds seems a lot.
Wanting to try something else, to make a better assembly, I read in one of the discussions here, that you recommend pair merging before the assembly so I wanted to try and get an even better assembly.
 First, I tried your abyss-mergepairs for overlapping pairs, but since it is single threaded it was taking too long, so I tried pear.
 My understanding from the discussion I read, is that I should first merge overlapping read pairs with (abyss-mergepairs or pear), and then feed the unassembled read pairs
 to konnector so it can merge non-overlaping read pairs. (after Konnector only a few hundred megabytes of data remained (out of 150GB per F and R reads).
 My next understanding is that I should feed the abyss-pe with both the raw reads as pair-end reads as well as the two sets of assembled reads and remaining unassembled
 reads from konnector as single-end reads. After doing so my assemblies were significantly worse than assemblies without pair merging as they had more than double the number of scaffolds (I did this with k values 99 and 95).
 I have assembly statistics included here as well as the commands I used in the process.
 Am I doing something wrong in the merging process or is it just possible that the basic assembly is giving better results? 

n       n:500   L50     min     N75     N50     N25     E-size  max     sum     name
1172362 43763   5419    500     34143   64748   111937  81861   503470  1.19e9  ACH01_abyss_k95-scaffolds.fa   *with merged pairs
1114044 43155   5280    500     34787   66368   114968  84482   700626  1.189e9 ACH01_abyss_k99-2-scaffolds.fa  *with merged pairs
541770  92136   13320   500     13550   25915   45024   33741   242779  1.185e9 ACH01_abyss_k119_3-contigs.fa
510987  29948   3211    500     55726   107529  191111  138094  751714  1.188e9 ACH01_abyss_109-scaffolds.fa
610357  29148   3167    500     56070   109390  192459  140203  981356  1.188e9 ACH01_abyss_99-scaffolds.fa
733825  29529   3281    500     53768   104471  185035  135435  958488  1.19e9  ACH01_abyss_89-scaffolds.fa
916489  30935   3600    500     48840   95807   168852  123747  806307  1.192e9 ACH01_abyss_79-scaffolds.fa
1169123 35922   4568    500     39426   76236   133473  98583   780728  1.196e9 ACH01_abyss_69-scaffolds.fa


pear -f ACH01_1.fq.gz -r ACH01_2.fq.gz -o ACH01  -j 63
konnector -j 63 -k 95 -b30G -E -D8G -v --fastq -o k95 ACH01.unassembled.forward.fastq ACH01.unassembled.reverse.fastq
abyss-pe B=320G k=109 j=63 l=40 s=1000 v=-vv name=ACH01_abyss_k109> ACH01_abyss_k109.log 2>&1  lib='pe1' pe1='../ACH01_1.fq.gz ../ACH01_2.fq.gz' se='../ACH01.assembled.fastq.gz ../k95_pseudoreads.fq.gz ../k95_reads_1.fq.gz ../k95_reads_2.fq.gz' 

Best Regards,
Pavle Eric

Lauren Coombe

unread,
Aug 9, 2023, 11:34:31 AM8/9/23
to ABySS
Hi Pavle Eric,

A couple follow-up questions so that I can best help you.
- What is the difference between your read files "../ACH01_1.fq.gz ../ACH01_2.fq.gz" and "../k95_reads_1.fq.gz ../k95_reads_2.fq.gz"? Just trying to understand why one set is specified as paired-end and the other is not.
- For a given k-value, could you provide the stats for unitigs, contigs and scaffolds for both the original and merged pairs versions? That will help me to see the improvement at each step (since the PE information is used post-unitig stage).

In general, 94.4% BUSCOs is a really great result! Are there any plans to do any further sequencing with a different data type (ex. linked reads, long reads, etc)? I just ask because further scaffolding with longer-range evidence could be beneficial depending on how repetitive this genome is.

Thanks for your interest in ABySS!
Lauren

Pavle Eric

unread,
Aug 9, 2023, 12:59:46 PM8/9/23
to ABySS

Hello Lauren,

Firstly thank you for a very quick reply. So ACH01_1.fq.gz and ACH01_2.fq.gz are F and R raw reads and  k95_reads_1.fq.gz,  k95_reads_2.fq.gz are leftover unmerged reads that konnector could not merge (they are a few hundred megabytes large, before the gunzipping, (although they are not the same size which was a little odd)).  I used k=95 for konnector pair merging. I am sending you assembly statistics for untigs, contigs, scaffolds stage of each assembly.
As for further sequencing with longer reads, I would very much love to have that data, but I think that it won't happen in the near future. As for how repetitive the genome is, how informative can a closely related species genome be helpful with that? The data I am working on is Aquila heliaca and there is a chromosome level assembly of Aquila chrysaetos. Could its genome be helpful with scaffolding or resolving some uncertain regions?

abyss-fac   ACH01_abyss_69-unitigs.fa ACH01_abyss_69-contigs.fa ACH01_abyss_69-scaffolds.fa |tee ACH01_abyss_69-stats.tab

n       n:500   L50     min     N75     N50     N25     E-size  max     sum     name
1455448 152296  25844   500     7402    13626   22870   16940   105156  1.168e9 ACH01_abyss_69-unitigs.fa
1239815 67084   10008   500     19065   35372   60288   44825   295068  1.192e9 ACH01_abyss_69-contigs.fa

1169123 35922   4568    500     39426   76236   133473  98583   780728  1.196e9 ACH01_abyss_69-scaffolds.fa

abyss-fac   ACH01_abyss_79-unitigs.fa ACH01_abyss_79-contigs.fa ACH01_abyss_79-scaffolds.fa |tee ACH01_abyss_79-stats.tab

n       n:500   L50     min     N75     N50     N25     E-size  max     sum     name
1121854 124446  20119   500     9430    17454   29456   21917   151514  1.174e9 ACH01_abyss_79-unitigs.fa
985312  61772   9007    500     21251   39468   66613   49524   385127  1.189e9 ACH01_abyss_79-contigs.fa

916489  30935   3600    500     48840   95807   168852  123747  806307  1.192e9 ACH01_abyss_79-scaffolds.fa

abyss-fac   ACH01_abyss_89-unitigs.fa ACH01_abyss_89-contigs.fa ACH01_abyss_89-scaffolds.fa |tee ACH01_abyss_89-stats.tab

n       n:500   L50     min     N75     N50     N25     E-size  max     sum     name
904927  116149  18142   500     10300   19320   32686   24318   169967  1.177e9 ACH01_abyss_89-unitigs.fa
808154  63455   9124    500     20796   38689   66045   48632   320585  1.187e9 ACH01_abyss_89-contigs.fa

733825  29529   3281    500     53768   104471  185035  135435  958488  1.19e9  ACH01_abyss_89-scaffolds.fa

abyss-fac   ACH01_abyss_99-unitigs.fa ACH01_abyss_99-contigs.fa ACH01_abyss_99-scaffolds.fa |tee ACH01_abyss_99-stats.tab

n       n:500   L50     min     N75     N50     N25     E-size  max     sum     name
766979  116570  18015   500     10336   19406   33068   24573   214013  1.179e9 ACH01_abyss_99-unitigs.fa
691604  66516   9591    500     19713   36880   62595   46197   288764  1.186e9 ACH01_abyss_99-contigs.fa

610357  29148   3167    500     56070   109390  192459  140203  981356  1.188e9 ACH01_abyss_99-scaffolds.fa

abyss-fac   ACH01_abyss_109-unitigs.fa ACH01_abyss_109-contigs.fa ACH01_abyss_109-scaffolds.fa |tee ACH01_abyss_109-stats.tab

n       n:500   L50     min     N75     N50     N25     E-size  max     sum     name
668763  126913  19591   500     9370    17717   30544   22652   167456  1.179e9 ACH01_abyss_109-unitigs.fa
601667  72845   10527   500     17840   33447   57032   42146   269613  1.186e9 ACH01_abyss_109-contigs.fa

510987  29948   3211    500     55726   107529  191111  138094  751714  1.188e9 ACH01_abyss_109-scaffolds.fa

n       n:200   L50     min     N75     N50     N25     E-size  max     sum     name
541770  264685  14218   200     12246   24865   44072   32499   242779  1.231e9 ACH01_abyss_k119_3-6.fa       *k=119 run didn't finish all the way through it stoped on contigs stage

abyss-fac   ACH01_abyss_k99-unitigs.fa ACH01_abyss_k99-contigs.fa ACH01_abyss_k99-scaffolds.fa |tee ACH01_abyss_k99-stats.tab         *merged pairs

n       n:500   L50     min     N75     N50     N25     E-size  max     sum     name
1522528 220381  40930   500     4842    8721    14302   10597   64413   1.162e9 ACH01_abyss_k99-unitigs.fa
1265907 112637  17918   500     10729   19904   33336   24695   195268  1.184e9 ACH01_abyss_k99-contigs.fa
1114044 43155   5280    500     34787   66368   114968  84482   700626  1.189e9 ACH01_abyss_k99-scaffolds.fa

abyss-fac   ACH01_abyss_k95-unitigs.fa ACH01_abyss_k95-contigs.fa ACH01_abyss_k95-scaffolds.fa |tee ACH01_abyss_k95-stats.tab         *merged pairs

n       n:500   L50     min     N75     N50     N25     E-size  max     sum     name
1593678 225390  42004   500     4713    8472    13939   10324   77844   1.161e9 ACH01_abyss_k95-unitigs.fa
1323310 112747  18004   500     10721   19779   33160   24697   174015  1.185e9 ACH01_abyss_k95-contigs.fa

1172362 43763   5419    500     34143   64748   111937  81861   503470  1.19e9  ACH01_abyss_k95-scaffolds.fa


Best Regards,
Pavle

Lauren Coombe

unread,
Aug 9, 2023, 1:25:31 PM8/9/23
to ABySS
Hi Pavle,

Thanks for the additional information!

One more request - could you please send me the full log from one of your merged pairs runs? I'd just like to double check how the various files are being supplied to the run.

Thanks!
Lauren

Lauren Coombe

unread,
Aug 9, 2023, 5:05:20 PM8/9/23
to ABySS
Hi Pavle,

Thanks for sending your log. It confirmed my suspicion - I believe based on your command, your description of your files, and the logs that you are accidentally inputting the same read information multiple times into the initial unitig assembly step of ABySS.

This was the initial step of ABySS, where the reads are k-merized to start the Bloom filter-based de Bruijn graph assembly:
abyss-stack-size 65536 abyss-bloom-dbg -k99 -q3 -vv -b320G -j63  ../ACH01_1.fq.gz ../ACH01_2.fq.gz ../ACH01.assembled.fastq.gz ../k95_pseudoreads.fq.gz ../k95_reads_1.fq.gz ../k95_reads_2.fq.gz > ACH01_abyss_k99-1.fa

../ACH01_1.fq.gz ../ACH01_2.fq.gz are the raw reads, so including them as well as the other merged/unmerged reads in this step will erroneously increase apparent coverage by supplying redundant sequencing data, and could interfere with the assembly.

This should be fixed by replacing `lib` in your command by `pe` - this should let ABySS know that you want to use this data for paired-end information, but not k-mers. You can do a dry-run (add -n) of the command to ensure that the expected read files are being input at the respective steps.

Another thing to note is that merging reads allows for the opportunity to go higher with the k-mer size, so it could be worth exploring higher k-mer sizes.

Let me know if you have any questions!
Lauren

Pavle Eric

unread,
Aug 10, 2023, 11:09:36 AM8/10/23
to ABySS
Dear Lauren,

Thank you for your help with my assembly process, I did a dry run and now the raw reads are not used in the bloom filter building. Also thank you for your suggestion to try and explore a higher k-mer size landscape to find the assembly with the best parameters and busco score. The only question I have for now is, what tools do you recommend for postprocessing? I read in this group that you recommend using abyss-sealer for gap filling, but is there any other tool that can help improve the assembly even more?

Once again thank you for your quick assistance and helpful tips!

Best Regards, 
Pavle

Lauren Coombe

unread,
Aug 10, 2023, 11:30:10 AM8/10/23
to ABySS
Hi Pavle,

You're very welcome!

In terms of post-ABySS processing (using the same short reads):
- Yes, using abyss-sealer for gap-filling is a good idea, and something that we always do in our assembly projects. You won't get scaffold contiguity gains here, but filling in gaps is always a good idea to help get more ACTG bases in your assembly, which can impact various downstream steps such as annotation.
- Another approach that we have tried in the past is using the generated assemblies from other k-mer sizes to improve your chosen assembly using ntJoin (https://github.com/bcgsc/ntJoin). ntJoin is our assembly/reference-guided assembler. So your target assembly in this case would be your chosen assembly (presumably the one with the highest N50/BUSCOs), and you could use some other ABySS assemblies that you did not choose as 'references' to improve your chosen assembly. 
- If there is a closely related reference assembly, and you trust that the general genomic structure is the same or similar as your assembly, you could also use ntJoin with that reference assembly. In that case, it would organize your ABySS scaffolds into longer sequences or chromosomes based on that reference. Whether that makes sense in your case would just require some thought on your part! If you do go that route, I'd suggest using the no_cut=True option. 

Good luck!
Lauren

Pavle Eric

unread,
Sep 7, 2023, 11:23:04 AM9/7/23
to ABySS

Hi Lauren,

Thank you for your help with the assembly. Still, merging pairs before the abyss assembler did not produce better busco scores on the assemblies, but as you said a busco score of 94,5 using only PE reads is a great score so I moved on. As you suggested I tried improving my chosen assembly using assemblies generated with other k-mer sizes with your ntJoin tool, but I am experiencing an error. I tried a few assemblies as a target and as a reference, but the error persists. The fasta files are single-lined as they should be, I tried both with soft-links and the "real" files in the directory but the error is still there. Also your test_installation.sh script works. These are some of the commands that I tried:

ntJoin assemble target=ACH01_abyss_k105-8.fa target_weight=1 references='ACH01_abyss_k159-8.fa' reference_weights='1' t=63 k=32 w=500

ntJoin assemble target=ACH01_abyss_k105-8_s.fa target_weight=1 reference_config=config_multiple1.csv prefix=f-1_k101 t=63 k=101 w=1000 

And this is the error that I get:

2023-09-07 10:28:45.515713 : Printing output scaffolds
Traceback (most recent call last):
  File "/home/peric/anaconda3/envs/mamba/envs/ntjoin_new/bin/share/ntjoin-1.1.1-2/bin/ntjoin_assemble.py", line 1115, in <module>
    main()
  File "/home/peric/anaconda3/envs/mamba/envs/ntjoin_new/bin/share/ntjoin-1.1.1-2/bin/ntjoin_assemble.py", line 1112, in main
    Ntjoin().main()
  File "/home/peric/anaconda3/envs/mamba/envs/ntjoin_new/bin/share/ntjoin-1.1.1-2/bin/ntjoin_assemble.py", line 1102, in main
    self.print_scaffolds(paths, intersecting_regions)
  File "/home/peric/anaconda3/envs/mamba/envs/ntjoin_new/bin/share/ntjoin-1.1.1-2/bin/ntjoin_assemble.py", line 873, in print_scaffolds
    outfile = self.print_unassigned(assembly, assembly_fa, incorporated_segments, outfile, params)
  File "/home/peric/anaconda3/envs/mamba/envs/ntjoin_new/bin/share/ntjoin-1.1.1-2/bin/ntjoin_assemble.py", line 890, in print_unassigned
    missing_bed = genome_bed.complement(i=incorporated_segments_bed, g=genome_dict)
  File "/home/peric/anaconda3/envs/mamba/envs/ntjoin_new/lib/python3.10/site-packages/pybedtools/bedtool.py", line 909, in decorated
    result = method(self, *args, **kwargs)
  File "/home/peric/anaconda3/envs/mamba/envs/ntjoin_new/lib/python3.10/site-packages/pybedtools/bedtool.py", line 388, in wrapped
    stream = call_bedtools(
  File "/home/peric/anaconda3/envs/mamba/envs/ntjoin_new/lib/python3.10/site-packages/pybedtools/helpers.py", line 460, in call_bedtools
    raise BEDToolsError(subprocess.list2cmdline(cmds), stderr)
pybedtools.helpers.BEDToolsError:
Command was:

        bedtools complement -i /tmp/pybedtools._df0h1ef.tmp -g /tmp/pybedtools.8thg8ecw.tmp

Error message was:
Error: Sorted input specified, but the file /tmp/pybedtools._df0h1ef.tmp has the following record with a different sort order than the genomeFile /tmp/pybedtools.8thg8ecw.tmp
10408   0       22111

make: *** [/home/peric/anaconda3/envs/mamba/envs/ntjoin_new/bin/share/ntjoin-1.1.1-2/ntJoin:227: ACH01_abyss_k105-8_s.fa.k101.w1000.n1.assigned.scaffolds.fa] Error 1
make: *** Deleting file 'ACH01_abyss_k105-8_s.fa.k101.w1000.n1.assigned.scaffolds.fa'


Best Regards,

Pavle

Lauren Coombe

unread,
Sep 7, 2023, 11:31:01 AM9/7/23
to ABySS
Hi Pavle,

Hmm that's an interesting one - I haven't seen that error before. Good to know that the test installation script worked.

What version of bedtools and pybedtools are you using? I have seen errors before with bedtools steps that end up being related to the version, so it would be worth ensuring that is totally up to date first.
Are the output scaffold files directly from ABySS? Or did you do some other steps on either of them?
Also, since our discussion has evolved past ABySS to ntJoin, I invite you to open an issue at our ntJoin GitHub page, so that other ntJoin users can benefit from our discussion (https://github.com/bcgsc/ntJoin/issues).

Cheers,
Lauren

Reply all
Reply to author
Forward
0 new messages