Re: exome and genome data storage


Conner Berthold

Nov 24, 2012, 10:23:39 AM
to diy...@googlegroups.com
Sounds cool! As far as I know, the only other way would be LTO: the tapes are cheap for storage, but the drives cost a fortune. The newest generation, LTO-6, holds 2.5 TB native and up to 6.25 TB compressed.

-Conner
On Friday, November 23, 2012 4:37:34 PM UTC-5, Giovanni wrote:
By one estimate I read, storing full genome sequence data would be pricey (50 terabytes), although I read this article about a new optical disc, possibly to be released in 2015, that will store 1-15 terabytes. I'm not sure what media formats genomic sequencing services use, but Blu-ray discs cost about $1 per BD-R and make exome data storage relatively affordable. It's not unlikely that terabyte optical discs will cost $4-5 in 2018. A full genome sequence would likely benefit from something like the above link, because 4TB hard drives aren't inexpensive or lightweight.

Jonathan Street

Nov 24, 2012, 11:16:14 AM
to diy...@googlegroups.com
A quick look at the datasets released by 1000genomes would suggest the storage requirements are not quite so onerous. Taking a look at a few examples I'm getting 50-100 GB per genome.

ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/


--
-- You received this message because you are subscribed to the Google Groups DIYbio group. To post to this group, send email to diy...@googlegroups.com. To unsubscribe from this group, send email to diybio+un...@googlegroups.com. For more options, visit this group at https://groups.google.com/d/forum/diybio?hl=en
Learn more at www.diybio.org
---
You received this message because you are subscribed to the Google Groups "DIYbio" group.
To post to this group, send email to diy...@googlegroups.com.
To unsubscribe from this group, send email to diybio+un...@googlegroups.com.
Visit this group at http://groups.google.com/group/diybio?hl=en.
To view this discussion on the web visit https://groups.google.com/d/msg/diybio/-/PGBQYTCTQmAJ.

For more options, visit https://groups.google.com/groups/opt_out.
 
 

Nathan McCorkle

Nov 24, 2012, 4:21:07 PM
to diybio
Are you sure you're not thinking of before assembly vs. after assembly? If humans are around 6.5 gigabases, at 2 bits per base that's 1.625 gigabytes. If we just use one byte per base, that leaves 6 extra bits for storing methylation status, etc., and comes to 6.5 gigabytes (also what it costs to store that as ASCII).

Am I missing something?
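
For anyone who wants to check the arithmetic, here is a minimal Python sketch. It assumes the 6.5 gigabase figure from this message and decimal gigabytes (10^9 bytes); the function name is just made up for illustration.

```python
def genome_storage_bytes(bases: int, bits_per_base: int) -> float:
    """Storage needed for `bases` nucleotides at a fixed width per base."""
    return bases * bits_per_base / 8


# Base count taken from the message above (an estimate, not a measured value).
GENOME_BASES = 6_500_000_000

for label, bits in [("2 bits per base", 2), ("1 byte per base (ASCII)", 8)]:
    gigabytes = genome_storage_bytes(GENOME_BASES, bits) / 1e9
    print(f"{label}: {gigabytes:.3f} GB")
```

Either way, an assembled sequence fits comfortably on a single DVD-sized or Blu-ray-sized medium; the terabyte figures only show up when raw reads are kept.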





--
-Nathan

Cathal Garvey

Nov 26, 2012, 7:20:47 AM
to diy...@googlegroups.com
That raises another prospect of course, a "Genome diff": if a reference genome is known for a species that is "good enough", or if there are several known reference genomes for subgroups that narrow the gaps usefully, then your "genome" can become a string of differences between your genome and the chosen reference. Far smaller and easier to send/store/share.
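
A toy sketch of the idea in Python, to make it concrete. It is deliberately simplified: it only handles single-base substitutions between two sequences of equal length (no indels or rearrangements), and the names `make_diff` and `apply_diff` are made up for illustration.

```python
from typing import List, Tuple

# One entry per difference: (0-based position, base found in the sample).
Diff = List[Tuple[int, str]]


def make_diff(reference: str, sample: str) -> Diff:
    """List the positions where `sample` differs from `reference`."""
    if len(reference) != len(sample):
        raise ValueError("this sketch only handles equal-length sequences")
    return [(i, s) for i, (r, s) in enumerate(zip(reference, sample)) if r != s]


def apply_diff(reference: str, diff: Diff) -> str:
    """Rebuild the sample sequence from the reference plus its diff."""
    bases = list(reference)
    for position, base in diff:
        bases[position] = base
    return "".join(bases)


reference = "ACGTACGTAC"
sample = "ACGTTCGTAA"
diff = make_diff(reference, sample)
print(diff)  # [(4, 'T'), (9, 'A')]
print(apply_diff(reference, diff) == sample)  # True
```

Only the diff needs to be stored or sent, as long as both sides agree on exactly which reference was used.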

On 25 November 2012 21:06, Eric Kelsic <kel...@gmail.com> wrote:
Since people have very similar genomes, storage requirements for multiple genomes drop considerably when the data is compressed:

Human genomes as email attachments
Scott Christley, Yiming Lu, Chen Li and Xiaohui Xie
(open access) http://bioinformatics.oxfordjournals.org/content/25/2/274.full 

In this case they compress the SNPs and indels of a human genome, relative to a reference, into a 4 MB file.  There are other types of genomic variation that this method doesn't handle, like structural rearrangements, but getting that info is more a problem with sequencing technology than with file compression.

Keeping the data for individual reads from a next generation sequencer requires a lot of storage.  That's the easiest way to end up with terabytes of data.  My main point is just that the differences you would actually care about for personal genomics are a relatively small part of the information contained in a human genome.

-e



--
www.indiebiotech.com
twitter.com/onetruecathal
joindiaspora.com/u/cathalgarvey
PGP Public Key: http://bit.ly/CathalGKey


Eugen Leitl

Nov 26, 2012, 2:40:25 AM
to diy...@googlegroups.com
On Sat, Nov 24, 2012 at 04:38:17PM -0800, Giovanni wrote:
> Yours seems correct. The only thing I would use would be error correction
> code (in RAM and/or in disk writing)/redundant sectors/RAID, but even that

You're looking for ECC RAM and zfs, plus nearline or enterprise
disks (SAS).

> wouldn't be more than 40GB. I'm wrong or missing something too. This link
> has a curious number:
> http://www.kinexus.ca/pdf/graphs_charts/HumanGenomeSequence.pdf
--
Eugen* Leitl http://leitl.org
______________________________________________________________
ICBM: 48.07100, 11.36820 http://www.ativel.com http://postbiota.org
8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE

Bastian Greshake

Nov 26, 2012, 7:42:46 AM
to diy...@googlegroups.com
Hi,
that's exactly what's already done with the human genome. The Variant Call Format (VCF) is mainly used for SNP data: http://www.1000genomes.org/node/101
And the Personal Genome Project delivers its genome data in GFF format, which records a stretch such as "chromosome 1, position 1-1243214" as REF when the data matches the reference genome. So you only incur larger overhead for the variations that aren't in the reference genome.

Those formats are still miles from perfect in terms of usability and memory efficiency, but they're a first step in the direction you mentioned. :)
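
To give a feel for how little a VCF record actually stores, here is a rough Python sketch that reads the first five fixed, tab-separated columns (CHROM, POS, ID, REF, ALT) of a data line. It is not a full parser: it ignores the header metadata, the QUAL/FILTER/INFO columns and any per-sample genotype columns. The `Variant` type and function names are made up for illustration, and the example record is only there to have something to parse.

```python
from typing import Iterable, Iterator, List, NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int          # 1-based position on the reference
    ref: str          # base(s) in the reference
    alt: List[str]    # alternative allele(s) observed


def parse_vcf_line(line: str) -> Variant:
    """Parse the first five fixed columns of one VCF data line."""
    chrom, pos, _variant_id, ref, alt = line.rstrip("\n").split("\t")[:5]
    return Variant(chrom, int(pos), ref, alt.split(","))


def read_variants(lines: Iterable[str]) -> Iterator[Variant]:
    """Yield variants, skipping header lines (starting with '#') and blanks."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        yield parse_vcf_line(line)


example = [
    "##fileformat=VCFv4.1\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14\n",
]
for variant in read_variants(example):
    print(variant)
```

Each record is just "at this position, instead of the reference base, this sample has X", which is the diff idea in file form.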

cheers,
Bastian

Eugen Leitl

Nov 26, 2012, 10:44:38 AM
to diy...@googlegroups.com
On Mon, Nov 26, 2012 at 08:40:25AM +0100, Eugen Leitl wrote:
> On Sat, Nov 24, 2012 at 04:38:17PM -0800, Giovanni wrote:
> > Yours seems correct. The only thing I would use would be error correction
> > code (in RAM and/or in disk writing)/redundant sectors/RAID, but even that
>
> You're looking for ECC RAM and zfs, plus nearline or enterprise
> disks (SAS).

Actually, if you have enough resources to switch on dedup
on zfs you'll catch most of the longer repeats, and
also redundancy across multiple genomes, without
having to do delta patches at application level.
Alternatively, zfs compression. Switching
on both options at the same time might or might not
be a bad idea.

Somebody should try this, and report to the group.
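
Before committing a pool to it, a rough estimate is possible in plain Python. The sketch below is only an approximation of block-level deduplication, not what zfs does internally: it cuts the data into fixed-size blocks, hashes each with SHA-256 and counts how many are unique. The 128 KiB default block size and the function name are assumptions for illustration.

```python
import hashlib


def dedup_ratio(data: bytes, block_size: int = 128 * 1024) -> float:
    """Total blocks divided by unique blocks for fixed-size chunking."""
    unique_hashes = set()
    total_blocks = 0
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        unique_hashes.add(hashlib.sha256(block).digest())
        total_blocks += 1
    return total_blocks / len(unique_hashes) if unique_hashes else 1.0


# Tiny block size so the effect is visible on toy data.
print(dedup_ratio(b"ACGTACGT", block_size=4))   # 2.0: both blocks identical
print(dedup_ratio(b"ACGTxACGT", block_size=4))  # 1.0: one inserted byte shifts every later block
```

The second call hints at a caveat: with fixed-size blocks, a repeat is only caught when it lines up on block boundaries, so a single indel early in a file could hide all the redundancy after it. Whether that matters in practice for real genome files is exactly what a test would show.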