Re: [DIYbio] full genome sequencing and exome data storage


Cathal Garvey

Nov 26, 2012, 7:14:34 AM
to diy...@googlegroups.com
Thinking about whole genome storage, the human genome is 3.2 Gbp in size. That's a base-four data-type, if you discount methylation and other forms of chemical DNA modification.

However, how do they encode that to binary? Do they store it as ASCII, which (if memory serves) uses a full byte to store each character? Encoding a base-four datatype like a/c/t/g to binary 1/0 could be minimised to only two bits per base, so storing 8 bits per base (ASCII or similar) would inflate the data 4x, making your 3.2 Gbp genome occupy 3.2 GB instead of a minimum of 0.8 GB before compression?

Or am I wildly off the mark? :)
That accounts only for the actual base sequence of course; your annotations etc. will still outstrip the DNA vastly in size, although they will probably have a better compression ratio.

Has anyone looked into a direct basepair->bit codec for minimisation of genome storage? Even a compression from a full byte per base down to four bits would halve storage size while leaving plenty of headroom for modified bases etc.?
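To make the two-bits-per-base idea concrete, here is a minimal sketch of such a codec. The function names and the A/C/G/T-to-bits mapping are arbitrary choices for illustration, not any standard format, and it assumes the sequence contains only the four unmodified bases (no N, no methylation).

```python
# Hypothetical 2-bit codec; the base-to-code mapping is an arbitrary choice.
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"

def pack(seq: str) -> bytes:
    """Pack an A/C/G/T string into bytes, four bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODES[base]
        # Left-align a short final chunk so unpack can read every byte the same way.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)

def unpack(data: bytes, n_bases: int) -> str:
    """Recover n_bases bases; the sequence length must be stored separately."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    return "".join(bases[:n_bases])

seq = "GATTACA"
packed = pack(seq)
print(len(seq), "bases ->", len(packed), "bytes")
print(unpack(packed, len(seq)))

# Back-of-envelope sizes for a 3.2 Gbp genome, before any compression:
genome_bp = 3_200_000_000
print(genome_bp / 1e9, "GB at 8 bits per base")
print(genome_bp * 2 / 8 / 1e9, "GB at 2 bits per base")
```

A four-bit variant would work the same way with two bases per byte, leaving twelve spare codes for ambiguous or modified bases.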

On 23 November 2012 23:42, Giovanni <giovanni...@gmail.com> wrote:
A bit of a speculative post into the future. Reading about new ventures into exome sequencing, with data amounting to >6 GB just for the base pairs, I was curious about the media formats that are used, such as DVDs. A couple of articles got me thinking about full genome sequencing (3x10^9 base pairs and about 50 TB of storage): that would be possible on several 4 TB HDDs, but terabyte (1-15 TB) optical discs like the ones by Fujifilm may make portable genomic data a lot simpler to handle. A few links I found on it:
http://fudzilla.com/home/item/29581-1tb-optical-discs-coming-in-2015
http://www.tweaktown.com/news/26908/1tb_optical_discs_are_coming_but_you_ll_have_to_wait_until_2015/index.html
http://news.yahoo.com/1-000-genome-almost-ready-111300774.html
http://www.genomeweb.com/clinical-genomics/23andme-opens-research-portal-outside-investigators-effort-advance-genomics-know
http://www.wired.com/wiredscience/2012/11/social-codes/
http://www.kinexus.ca/pdf/graphs_charts/HumanGenomeSequence.pdf

I guess if full genome sequencing is available, the most practical storage medium would be one that doesn't comprise a major part of the cost of sequencing. I think the cost of TB discs, like Blu-ray and DVDs before them, might be as little as $0.20 to a few dollars each, but the price may only get that low if TB discs see popular adoption, which would come with UltraHD cinema discs for 4K resolution televisions and Playstation 4 discs (if they exceed 25-50 GB). The idea of an entire genome fitting on just a few optical discs instead of 50 is actually a little encouraging.

--
-- You received this message because you are subscribed to the Google Groups DIYbio group. To post to this group, send email to diy...@googlegroups.com. To unsubscribe from this group, send email to diybio+un...@googlegroups.com. For more options, visit this group at https://groups.google.com/d/forum/diybio?hl=en
Learn more at www.diybio.org

--
www.indiebiotech.com
twitter.com/onetruecathal
joindiaspora.com/u/cathalgarvey
PGP Public Key: http://bit.ly/CathalGKey


loïc lauréote

Nov 26, 2012, 7:36:29 AM
to diy...@googlegroups.com
hi,

you can store it in dna. :D
I wrote something about it on my blog; the principle of the conversion may be useful.
http://hackolite.wordpress.com/2012/11/25/explore-a-new-biological-filetype-the-dna/



2012/11/26 Cathal Garvey <cathal...@gmail.com>

loïc lauréote

Nov 26, 2012, 7:40:00 AM
to diy...@googlegroups.com
You can use the Fano method:

A=1,
T=01,
C=000,
G=001,

so 00001 can only be CT.
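The table above is a prefix code: no codeword is the start of another, so a bit stream decodes unambiguously without separators. A minimal sketch of encoding and decoding with exactly that table follows; the function names are made up for illustration. Note that the average codeword length is (1+2+3+3)/4 = 2.25 bits if all four bases are equally common, so this only beats a fixed two bits per base when A (and then T) are clearly over-represented.

```python
# Code table taken from the post above.
CODE = {"A": "1", "T": "01", "C": "000", "G": "001"}
DECODE = {bits: base for base, bits in CODE.items()}

def encode(seq: str) -> str:
    """Concatenate the codeword of each base into one bit string."""
    return "".join(CODE[base] for base in seq)

def decode(bits: str) -> str:
    """Read bits left to right, emitting a base whenever a codeword completes."""
    out, buf = [], ""
    for bit in bits:
        buf += bit
        if buf in DECODE:
            out.append(DECODE[buf])
            buf = ""
    if buf:
        raise ValueError("trailing bits do not form a complete codeword")
    return "".join(out)

print(decode("00001"))
print(decode(encode("GATTACA")))
```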

2012/11/26 loïc lauréote <loic.l...@gmail.com>

Mega

Nov 26, 2012, 1:19:14 PM
to diy...@googlegroups.com
You can save it electrically, two states. (current/ no current)
With quantum computers - atom - no atom - charged atom. 3 states.
or with DNA -> four states


are DNA-computers feasible?

Eugen Leitl

Nov 26, 2012, 2:48:37 PM
to diy...@googlegroups.com
They're feasible, but they're largely useless.
Their intrinsic advantages will become less and
less relevant over time.

Nathan McCorkle

Nov 26, 2012, 5:33:41 PM
to diybio
On Mon, Nov 26, 2012 at 10:19 AM, Mega <masters...@gmail.com> wrote:
You can save it electrically, two states. (current/ no current)
With quantum computers - atom - no atom - charged atom. 3 states.
or with DNA -> four states

Storing DNA as DNA, huh... 

do non-americans (or those that don't watch MTV) even get this?:

 

Patrik D'haeseleer

Nov 27, 2012, 3:08:32 AM
to diy...@googlegroups.com
Sounds like my plan to make a 3D model of DNA our of DNA origami...

Patrik D'haeseleer

Nov 27, 2012, 3:09:30 AM
to diy...@googlegroups.com
OUT of DNA origami. Sory - cant spel.

loïc lauréote

Nov 27, 2012, 12:36:06 PM
to diy...@googlegroups.com
haha, funny meme. Origami can be useful as an indexing tool to find a
DNA file; the DNA motif can give some information about the file.

Jonathan Cline

Jan 13, 2013, 4:05:06 AM
to diy...@googlegroups.com, jcline


On Monday, November 26, 2012 4:14:37 AM UTC-8, Cathal wrote:
Thinking about whole genome storage

Store only the differences between your genome and some standardized reference genome, then compress the result using an encoding algorithm. The more genomes stored, the smaller the compressed data can become, since more delta comparisons (e.g. dictionary searches) yield smaller token data. Basically.

http://en.wikipedia.org/wiki/General_feature_format
http://www.sanger.ac.uk/resources/software/gff/spec.html
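As a toy illustration of the "store only the differences" idea, the sketch below records substitutions against a reference and rebuilds the sample from them. It assumes the two sequences are already aligned and of equal length (no insertions or deletions), which real genomes are not; the function names and the tuple layout are hypothetical and not the GFF format linked above.

```python
def diff_to_reference(reference: str, sample: str):
    """Return (position, ref_base, sample_base) for every substitution.

    Assumes equal-length, already aligned sequences (no indels).
    """
    if len(reference) != len(sample):
        raise ValueError("this sketch only handles equal-length sequences")
    return [
        (pos, ref_base, alt_base)
        for pos, (ref_base, alt_base) in enumerate(zip(reference, sample))
        if ref_base != alt_base
    ]

def apply_diff(reference: str, diff) -> str:
    """Rebuild the sample sequence from the reference plus the stored diff."""
    seq = list(reference)
    for pos, ref_base, alt_base in diff:
        if seq[pos] != ref_base:
            raise ValueError(f"diff does not match reference at position {pos}")
        seq[pos] = alt_base
    return "".join(seq)

reference = "ACGTACGTAC"
sample = "ACGTTCGTAA"
diff = diff_to_reference(reference, sample)
print(diff)
print(apply_diff(reference, diff) == sample)
```

Only the diff list needs to be stored per person, as long as everyone agrees on the same reference.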




## Jonathan Cline
## jcl...@ieee.org
## Mobile: +1-805-617-0223
########################

Lisa Thalheim

Jan 14, 2013, 3:21:47 PM
to diy...@googlegroups.com
Hey Giovanni,

depends what you want to store, exactly. Do you want to store all
reads from a run of an NGS sequencer? That's going to be a lot more
data than just storing 3 billion base pairs, also depending on the
coverage at which the genome was sequenced.

Do you want to store an alignment of a read dataset to a reference?
Again, more data than 3 billion base pairs. Do you just want to store
the "difference" to a reference genome? That's going to be a lot less
data, but extracting this information reliably from an NGS sequencing
data set is a world of headaches all its own.

Example here: A recent sequencing project I was involved in sequenced
the exomes of four human cell lines at 30X coverage. The data size is
approx. 5 GB per cell line gzip compressed, and 12 GB uncompressed.
The alignment per cell line is 5 GB large (also compressed), though
the size of an alignment depends on the alignment parameters used. A
file containing all called SNVs for one cell line, on the other hand,
is only 2.5 MB large. Again, this somewhat depends on the parameters
used during SNV calling, but you get the idea. Going from that, if
you're worried about storage space, it would probably make sense to
put in the work of crunching the sequencing data, and only storing a
"diff".

As an aside, there is research into compressing DNA data which has had
some impressive results. Intro/summary here:
http://genomeinformatician.blogspot.de/2012/05/dna-compression-reprise.html

Was that useful?

Lisa Thalheim

Jan 14, 2013, 3:27:33 PM
to diy...@googlegroups.com
Hey Cathal,

On Mon, Nov 26, 2012 at 1:14 PM, Cathal Garvey <cathal...@gmail.com> wrote:

> However, how do they encode that to binary? Do they store as ASCII, which
> (if memory serves) uses a full byte to store each character?

Depends? "Raw" sequencing data (well, not quite raw; see reply to
Giovanni) is usually stored as FASTQ, which is plaintext ASCII, and
then compressed using GZIP. Alignments are usually stored either in
SAM format (plaintext ASCII) or BAM (compressed). Then there's the
2bit format (the name says it all), which is used to pass around the
reference genome assemblies, for example. In this format, the human
reference genome file is about 700MB large.
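For a feel of what "plaintext ASCII, then gzip" means in practice, here is a small sketch that builds a few fake FASTQ records and compresses them with Python's standard gzip module. The read names, read length and constant quality string are made-up toy values, so the resulting ratio says nothing about real sequencer output (real quality strings compress much worse than a constant one).

```python
import gzip
import random

random.seed(0)  # fixed seed so the toy data is reproducible

def fake_fastq(n_reads: int = 1000, read_len: int = 100) -> bytes:
    """Build toy FASTQ text: 4 lines per read (name, bases, '+', qualities)."""
    records = []
    for i in range(n_reads):
        seq = "".join(random.choice("ACGT") for _ in range(read_len))
        qual = "I" * read_len  # constant quality: unrealistically compressible
        records.append(f"@read{i}\n{seq}\n+\n{qual}\n")
    return "".join(records).encode("ascii")

raw = fake_fastq()
compressed = gzip.compress(raw)
print(len(raw), "bytes raw,", len(compressed), "bytes gzipped")
```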

> That accounts only for the actual base sequence of course; your annotations
> etc. will still outstrip the DNA vastly in size, although they will probably
> have a better compression ratio.

Nah, annotations are cheap. But alignments are costly, if you're
hell-bent on keeping them (though I don't see why you would once
you're done with the analysis).

> Has anyone looked into a direct basepair->bit codec for minimisation of
> genome storage? Even a compression from a full byte per base down to four
> bits would halve storage size while leaving plenty of overhead for modified
> bases etc.?

See above, and the link in the reply to Giovanni.

Cheers,
Lisa

Giovanni

Jan 14, 2013, 5:45:24 PM
to diy...@googlegroups.com, ltha...@googlemail.com
Thanks Lisa. It is useful.