Benchmark for Abyss

tianhe yu

Feb 17, 2017, 12:15:30 AM

to Trans-ABySS

Hi guys,

I am a CS student trying to accelerate the ABySS algorithm from a pure computer science perspective.

We have just started profiling, that is, measuring how much time each part of the assembly process takes.

We only care about how much time each part takes, and we will try to accelerate the most time-consuming part first.

The only benchmark I found is the "Abyss for Intel Xeon Processor" document, which uses ERR194147, a 2×50 GB human genome data set that is too large for our purposes.

An input data set smaller than 1 GB is large enough for us to profile. Can you suggest a specific data set we could use for this?

Thanks a lot in advance.

Cheers,

Theodore

Ka Ming Nip

Feb 17, 2017, 12:41:15 PM

to trans...@googlegroups.com, be...@gmail.com

Hi Theodore,

I recommend starting with a species that has a (much) smaller genome, such as E. coli:
http://www.ebi.ac.uk/ena/data/view/ERA000206&display=html

If that is too easy for you, you could consider a C. elegans dataset, e.g.
https://www.ncbi.nlm.nih.gov/sra/?term=SRR065390

I recommend using ABySS v2.0 because it uses much less memory than previous versions.

Thanks,
Ka Ming

--
Ka Ming Nip
Graduate Student | Dr. Inanc Birol Lab (BTL)
Canada's Michael Smith Genome Sciences Centre
--
You received this message because you are subscribed to the Google Groups "Trans-ABySS" group.
To unsubscribe from this group and stop receiving emails from it, send an email to trans-abyss...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

tianhe yu

Feb 17, 2017, 6:15:58 PM

to Trans-ABySS

Thanks a lot!

I will try both.

Theodore

Tianhe Yu

Feb 22, 2017, 9:16:50 PM

to trans...@googlegroups.com

Hi Ka Ming,

Thanks a lot for your last reply. I am now using the E. coli data as input and it works. However, I am a bit concerned about the scaffolding step.

I read on the ABySS website that abyss-pe can only assemble the fragments into contigs, and that mate-pair data is needed to assemble the contigs into scaffolds. So my questions are:

1. From a biology point of view, is it meaningful to assemble only to contigs and not to scaffolds?

For example, the E. coli data you mentioned above presumably can never be assembled into scaffolds, since it has no mate-pair library.

2. Could you suggest another data set that has a mate-pair library and is smaller than a few GB?

Thanks a lot!!

Regards,

Theodore

Ka Ming Nip

Feb 22, 2017, 10:42:30 PM

to trans...@googlegroups.com, Ben Vandervalk

Hi Theodore,

I have CC'd Ben Vandervalk, who is the lead developer of ABySS.

I think it really depends on which aspects of the genome's biology you are interested in. For example, a gene that is represented by multiple contigs could be joined together by scaffolding.

Also, you can still scaffold the contigs with just the paired-end reads, although a mate-pair library would give you longer scaffolds.

Regards,

Ka Ming

Ben Vandervalk

Feb 23, 2017, 3:12:35 PM

to Trans-ABySS, be...@bcgsc.ca

Hi Theodore,

I agree with Ka Ming that it would probably be better to profile and optimize ABySS 2.0 (i.e. the Bloom filter assembly mode), as this is what we plan to use going forward. Instructions for running ABySS in Bloom filter mode are here: https://github.com/bcgsc/abyss#assembling-using-a-bloom-filter-de-bruijn-graph and an early print of the ABySS 2.0 paper is here: http://genome.cshlp.org/content/early/2017/02/23/gr.214346.116.full.pdf+html. Another benefit of optimizing the Bloom filter mode is that its code is simpler and probably much easier to understand, primarily because it does not use MPI.

I think your strategy of profiling and optimizing on a small data set is wise. MPET (mate-pair) data for small genomes is relatively rare, and unfortunately I don't have any links handy for such data, although I am pretty sure it exists (for example, see the Illumina whitepaper "De Novo Assembly of Small Genome Nextera Mate Pair Libraries with a Single MiSeq System Run"). As Ka Ming said, assemblies without MPET data can still be valuable for biological studies. If I were you, I would keep things simple and leave MPET data out of the equation. However, if you are keen on including MPET data anyway, Illumina BaseSpace is a good place to look (under "Public Data"). You will have to create an account on their website to access the data.

As a first profiling step, I would recommend timing the individual ABySS commands that are run by the `abyss-pe` Makefile, which is the main driver script for ABySS assemblies. You can then zero in on individual programs based on your initial timing results.
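One way to do that per-command timing is sketched below in Python. This is only an illustration under stated assumptions: the helper names `time_commands` and `rank_by_time` are made up for this example, and the commands in the demo are stand-ins, not real ABySS invocations. In practice you would paste in the command lines that `abyss-pe` runs for your assembly (a dry run of the Makefile with make's `-n` option should print them, but check that against your ABySS version).

```python
import subprocess
import sys
import time


def time_commands(commands):
    """Run each shell command in order; return (command, seconds, exit_code) tuples."""
    results = []
    for cmd in commands:
        start = time.perf_counter()
        # shell=True so that a command line containing pipes is timed as one stage.
        completed = subprocess.run(
            cmd,
            shell=True,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        elapsed = time.perf_counter() - start
        results.append((cmd, elapsed, completed.returncode))
    return results


def rank_by_time(results):
    """Sort stages slowest first and attach each stage's share of the total wall time."""
    total = sum(seconds for _, seconds, _ in results) or 1.0
    ranked = sorted(results, key=lambda r: r[1], reverse=True)
    return [(cmd, seconds, seconds / total) for cmd, seconds, _ in ranked]


if __name__ == "__main__":
    # Hypothetical stand-in commands; replace them with the real pipeline stages.
    demo = [
        f'"{sys.executable}" -c "import time; time.sleep(0.2)"',
        f'"{sys.executable}" -c "pass"',
    ]
    for cmd, seconds, share in rank_by_time(time_commands(demo)):
        print(f"{seconds:8.3f} s  {share:6.1%}  {cmd}")
```

This measures wall-clock time only, and a piped command line is reported as a single stage. Once the slowest stage is known, a sampling profiler on that one program will give a finer breakdown.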

Best of luck,

- Ben

Tianhe Yu

Feb 23, 2017, 3:33:42 PM

to trans...@googlegroups.com

Dear Ben,

Thanks for your reply! That is what I am currently doing to profile abyss-pe. I will focus a bit on the Bloom filter mode in the future.

Regards,

Theodore
