abyss-explorer

62 views
Skip to first unread message

Paige M.

unread,
Feb 17, 2022, 12:48:20 PM2/17/22
to ABySS
Hello,
I am new to abyss-pe but the service provider we have uses it for whole genome sequence alignment.  I am trying to use it with different k-mers suggested through kmergenie.  Actually I have used it, but I am struggling to get much out of the output files.  I tried to use abyss-explorer but it won't work for me and it hasn't been updated in so long (2013?) I think it just fails due to conflicts.  (I am not capable of writing or editing software so that is out).  I tried to use rgraphviz but the files (even the very last dist.dot file) is too big to open.  I found suggestions online to try using "dot" or "neato" but those fail too. 

The organism I'm working with isn't a model so I don't even know how many chromosomes it has and the genome size has only a rough estimate from the 70's.  Any suggestions on how to subset the data or access it in anyway that would help me determine quality for use a draft reference genome?

Thank you,

johnath...@gmail.com

unread,
Feb 17, 2022, 1:06:36 PM2/17/22
to ABySS
Hello Paige,

Can I confirm that you have somePrefix-contigs.fa or somePrefix-scaffolds.fa in your directory, Typically when we assess the quality of the genome we look into the fa files instead of the graphs. For example, something we can do is to compare the genome size estimate from the 70s with the genome reconstructed by ABySS. Can you post the results of of the command "abyss-fac  somePrefix-scaffolds.fa" ? It will also contain the N50 metric, a useful assembly assessment metric when you do not have a good genome size estimate.

Paige M.

unread,
Feb 17, 2022, 3:01:13 PM2/17/22
to ABySS
Hello Johnathon, Yes I do that that information:
n        n:500        L50        min        N75        N50        N25        E-size        max        sum        name
6095211        359038        42570        500        1532        4145        9307        6644        71791        731.2e6        XXX-contigs.fa

n        n:500        L50        min        N75        N50        N25        E-size        max        sum        name
6001008        332968        36003        500        1816        5078        11029        7812        73778        740.2e6        XXX-scaffolds.fa


The estimated diploid genome size was 1.7 Gb based on the 2C value from the 70's.  These abyss output does not seem to match that, or perhaps I am not using the information from abyss correctly.

Thank you!

johnath...@gmail.com

unread,
Feb 17, 2022, 3:33:04 PM2/17/22
to ABySS
Hello Paige,

To confirm, when you say "The estimated diploid genome size was 1.7 Gb ", you meant the diploid genome size was 1.7Gb and the haploid genome size was 850Mb based on the estimation. This is somewhat in line with the genome size reconstruction from ABySS, which is the sum column. It is showing 740Mb. You can also try running genomescope on your data to get another source of genome size estimation. The contiguity does seem a little low at 5kb. Can you post the command you used to run ABySS, and the fold coverage of the short read dataset you used using 850Mb as the genome size. Do you also happen to know what characteristics your genome of interest has, like whether it has a lot of repeat elements?

Thanks

Paige M.

unread,
Feb 18, 2022, 12:36:40 AM2/18/22
to ABySS
Hi Johnathon,
Here is the command I used:

abyss-pe -C /Users/paige/Desktop/Silvetia/BGI_data/1015_S/k70 name=rockweed k=70 kc=2 B=25G v=-v in="/Users/paige/Desktop/Silvetia/BGI_data/1015_S/merged_1.fastq /Users/paige/Desktop/Silvetia/BGI_data/1015_S/merged_2.fastq"  > /Users/paige/Desktop/Silvetia/BGI_data/1015_S/k70/rockweed.PE.70.out

The coverage is 40X.  I don't know if it has a "high number" of repeat elements.  The closest organism sequenced is only matched at Class in taxonomy and and it does have a lot of repeats.   I went ahead and ran kmergenie and found out that I should try running k69.  Just using jellyfish I looked at k69 and used genomescope to visualize the histogram.  It shows what I think is a decent plot but the size estimate is 3,378,118bp but only 25% is unique and it only estimates kcov:28.7.  I'm kind of floundering trying to understand what all the kmergenie output means, but it clearly states k69 is best so I am running abyss-pe using k69 now.  It will take a couple of days to run (at least) based on last time!

Should I be using different commands to improve the alignment?  I will  happily restart it if I need to do that.

Thank you.

johnath...@gmail.com

unread,
Feb 18, 2022, 1:20:43 PM2/18/22
to ABySS
Hello Paige,

While 25GB should be sufficient for a genome size of your interest, you could try using a higher Bloom filter size. Where are you running this command? If you are running on a server, you can try running with more threads using the j parameter, which should drastically reduce the time taken. The genome size estimate from genomescope does seem very small but I am not familiar with jellyfish genome size. The reason I asked about if there are a lot of repeats in the genome is because short read assemblies have problem traversing repeats, which can contribute to lower contiguity,

johnath...@gmail.com

unread,
Feb 18, 2022, 3:45:34 PM2/18/22
to ABySS
Also what k did you use to run genomescope?

Paige M.

unread,
Feb 19, 2022, 2:29:00 PM2/19/22
to ABySS
Hi Johnathon,
Thanks for the suggestions on the commands.  I'm rerunning abyss-pe using more threads and B=50G and also c=2 (kmergenie suggested that value). 

I used genomescope on the website they have set up and that is where it said to run jellyfish kmer program to get the histogram files to upload. I ran kmergenie again using a wider range of kmers so now I've tested (9-121)  and it looks like k=21 might be a better choice for me, I was told that with a genome this size I should start a
round k=31 but perhaps that wasn't a good idea.

I am testing abyss-pe on k=69 with the B=50G and also will test k=21.  Using k=21 on genomescope I got:

GenomeScope version 1.0 k = 21 property min max Heterozygosity 1.1393% 1.17654% Genome Haploid Length 709,280,042 bp 713,505,533 bp Genome Repeat Length 266,318,737 bp 267,905,314 bp Genome Unique Length 442,961,305 bp 445,600,219 bp Model Fit 94.906% 99.3008% Read Error Rate 0.502263% 0.502263%

This is definitely a better genome size match (yes?).

johnath...@gmail.com

unread,
Feb 23, 2022, 3:20:13 PM2/23/22
to ABySS
Hello Paige,

That does seem to in line with the ABySS reconstruction. And just to confirm, you are running with a higher j parameter? A human assembly with j=48 typically takes less than a day for ABySS to finish running so I highly recommend using more threads.

Paige M.

unread,
Feb 23, 2022, 7:51:40 PM2/23/22
to ABySS
Hi ,
Yes, I ran the assembly with j=12 and it finished in 1 day.  I actually tried both k=69 and k=21, expecting k=21 to be better.  Oddly, the output from abyss-pe was much better with k=69.  Now I am wondering if something is off with the jellyfish/genomescope information.  The commands are very simple, so I don't think I could have messed it up (their is an example on genomescope that you can directly copy and just fill in your file name).  I have rerun my data on kmergenie now and it does not indicate that k=21 is a good fit either.  I'm not sure how the models work to find k-mer values, but there must be some variability.  I guess next time I will check at least 2 programs to find a good starting kmer.

The output for the data for k=69 seems okay I think. The N50 isn't better and  the largest scaffold is longer.

n        |n:500   |L50    |min  |N75   |N50   |N25    |E-size  |max    |sum      |name
---      |---     |---    |---  |---   |---   |---    |---     |---    |---      |---
6808396  |436814  |68489  |500  |1037  |2419  |5642   |4191    |63665  |702.7e6  |rockweed69-B50-unitigs.fa
6296348  |358703  |42890  |500  |1510  |4078  |9172   |6554    |96973  |724.8e6  |rockweed69-B50-contigs.fa
6200472  |332955  |36374  |500  |1788  |4975  |10905  |7696    |96973  |733.9e6  |rockweed69-B50-scaffolds.fa

For k=21 it is definitely a poorer result.
n        |n:500  |L50    |min  |N75  |N50  |N25  |E-size  |max   |sum      |name
---      |---    |---    |---  |---  |---  |---  |---     |---   |---      |---
57.27e6  |89391  |35228  |500  |568  |668  |847  |754     |3175  |61.73e6  |rockweed21-unitigs.fa
57.26e6  |88361  |33485  |500  |570  |675  |873  |826     |4638  |62.85e6  |rockweed21-contigs.fa
57.25e6  |87429  |31773  |500  |572  |683  |898  |914     |6678  |63.87e6  |rockweed21-scaffolds.fa

At this point I'm not sure if I can do any better than the k69.

Thanks for your help.

johnath...@gmail.com

unread,
Feb 28, 2022, 7:15:31 PM2/28/22
to ABySS
Hello Paige,

Assembly a genome and calculating genome size are two very different task. The paradigms used to accomplish one is different from the other, so the best k for one is not necessarily the best for the other.  For assembly, typically the higher the k, the better the assembly provided you have enough coverage to use such a high k. You can consider using BUSCO to evaluate your assembly. BUSCO basically find how many genes are fully constructed based on a given set of genes. There is probably a gene set closely related to the genome of your interest to use.

Paige M.

unread,
Mar 3, 2022, 2:55:45 PM3/3/22
to ABySS
Hi Jonathon,
Thanks for the clarification.  That makes sense.  As I am very new to assembly and also calculating genome size, I really hadn't gotten that distinction clear  - though I've been reading about k-mers and this helps me understand why the perspectives are so different from different sources!  I suppose authors' think that should be obvious to the reader without considering that some of us are too overwhelmed to get things that aren't explicitly stated. ;-)

I used ntedit to polish the genome and it improved.  Then I took your advice and used BUSCO (though there are only 2 single celled protists in the regular euk database and they are not closely related at all!).  It did give me some insight though. I attached the busco figure. I think it is not bad for assembly based on one short-read library. I am actually considering trying to train my own database with Augustus (another daunting task, but we will be studying related spp. of multicellular protists for years to come).

I really appreciate all your help!  It has made this experience a much better learning process - instead of just frustration after frustration.

Thanks,
busco_figure.png

johnath...@gmail.com

unread,
Mar 8, 2022, 7:00:28 PM3/8/22
to ABySS
Hi Paige,

I'm always happy to help. Best of luck win your endeavor. Using bioinformatics software can be quite a steep learning curve but it should be easier as you get the hang of it.

Johnathan

Reply all
Reply to author
Forward
0 new messages