Long-term genomic data storage

Jeswin

Jan 13, 2015, 12:04:40 PM
to diy...@googlegroups.com
Hey all,
I'm kind of curious as to how companies store DNA seq data for
archival purposes. The company I work at has local storage, but
there are always issues like running out of space on the HDDs. There
are only so many HDDs you can buy before you need to expand the server
room. So what's done with old data, in case there is a need to dig it
out 3-5 years later? I heard most companies just limit the holding
time for raw data to 6 months to a year.

scoc...@gmail.com

Jan 13, 2015, 12:25:22 PM
to diy...@googlegroups.com
I got a Life Vault USB drive and it claims to be stable for 100 years. The catch is that storage space is on the low side. If it's raw DNA sequence, how much space would one really need for a personal DNA repo? Are you talking about long-term, large-scale needs, or enough to burn onto a few redundant Blu-ray discs?

Sebastian S. Cocioba
CEO & Founder
New York Botanics, LLC
Plant Biotech R&D

--
-- You received this message because you are subscribed to the Google Groups DIYbio group. To post to this group, send email to diy...@googlegroups.com. To unsubscribe from this group, send email to diybio+un...@googlegroups.com. For more options, visit this group at https://groups.google.com/d/forum/diybio?hl=en
Learn more at www.diybio.org
---
You received this message because you are subscribed to the Google Groups "DIYbio" group.
To unsubscribe from this group and stop receiving emails from it, send an email to diybio+un...@googlegroups.com.
To post to this group, send email to diy...@googlegroups.com.
Visit this group at http://groups.google.com/group/diybio.
To view this discussion on the web visit https://groups.google.com/d/msgid/diybio/CAAhF0R%2BNi2WqjTdm13XV7_BUxyOQ%3DVCJX0m1_nyJZq3trSHAhQ%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

Jeswin

Jan 13, 2015, 2:22:41 PM
to diy...@googlegroups.com
Yeah, I don't really know much, except that our sequencers put out
something like ~50 GB data files, usually from exome seq. They're
compressed before being sent to clients, and I see that they end up
around 30 GB. I think our local server has 50 or more terabytes of
space. At this point we're low-throughput compared to the big guys,
but still, there's a lot of data. When I first joined, those numbers
were mind-boggling. So, I'm just wondering out of curiosity.

Cathal Garvey

Jan 13, 2015, 2:28:39 PM
to diy...@googlegroups.com
Large QR codes on vellum.

scoc...@gmail.com

Jan 13, 2015, 2:39:29 PM
to diy...@googlegroups.com
That would make for a really beautiful family crest. How huge do you think it would have to be to store the entire RefSeq human genome? Also, the camera to snap that picture would have to be pretty nice to counteract noise.

Cathal Garvey

Jan 13, 2015, 2:55:40 PM
to diy...@googlegroups.com
Someone more awake can do the math: Assume a high-DPI scanner and
scripted re-splicing of the scanned sheet. So, you pack the sheet with QR
codes at minimum size, assuming the high-DPI coverage. You reduce error
correction but don't eliminate it. Each QR code is assumed to hold ~3 kB of
binary data at max. You use a quick first-pass binary encoding scheme
for DNA (2 bits per base instead of one text character), compressing it
immediately by 4x, then you use a more complicated compression scheme
to maximise compression.
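
A minimal sketch of the "quick first-pass binary encoding" described above, assuming the sequence contains only A, C, G and T (real data with N or other ambiguity codes would need a wider code). The names `pack` and `unpack` and the particular base-to-bits mapping are illustrative choices, not a standard format.

```python
# Assumed mapping: A=00, C=01, G=10, T=11. Any fixed mapping would work.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack(seq: str) -> bytes:
    """Pack an A/C/G/T string into bytes, four bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]
        # Left-align a short final chunk so every byte is read the same way.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack(data: bytes, n_bases: int) -> str:
    """Recover the first n_bases bases from packed bytes."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    return "".join(bases[:n_bases])


seq = "GATTACAGATTACA"
packed = pack(seq)
print(len(seq), "bases ->", len(packed), "bytes")
assert unpack(packed, len(seq)) == seq
```

The sequence length has to be stored alongside the packed bytes, because the last byte may carry padding.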

The human genome is, rounding up (Fermi estimation done wrong), 4
gigabases. Binary encoding gives 1 gigabyte. Compression might buy a 30%
reduction. 700 MB (700,000 kB) into QR codes, assuming 6 codes per side
of a sheet (so 18 kB per side, 36 kB per double-sided sheet), means close
to 20,000 sheets.

WE CAN DO EET
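
For anyone who wants to redo the estimate with different assumptions, here is the same arithmetic as a short script. Every constant is taken from the post above (including the deliberately rounded-up genome size and the guessed 30% compression saving); none of them is a measured value, and decimal units (1 kB = 1000 bytes) are assumed.

```python
GENOME_BASES = 4_000_000_000   # rounded-up figure from the post
BITS_PER_BASE = 2              # A/C/G/T packed into 2 bits each
COMPRESSED_PERCENT = 70        # size left after the guessed 30% saving
QR_BYTES = 3_000               # ~3 kB per QR code, as assumed in the post
CODES_PER_SIDE = 6
SIDES_PER_SHEET = 2

raw_bytes = GENOME_BASES * BITS_PER_BASE // 8
compressed_bytes = raw_bytes * COMPRESSED_PERCENT // 100
bytes_per_sheet = QR_BYTES * CODES_PER_SIDE * SIDES_PER_SHEET
# Integer ceiling division: a partly filled last sheet still counts.
sheets = (compressed_bytes + bytes_per_sheet - 1) // bytes_per_sheet

print(f"raw: {raw_bytes:,} bytes")
print(f"compressed: {compressed_bytes:,} bytes")
print(f"per sheet: {bytes_per_sheet:,} bytes")
print(f"sheets needed: {sheets:,}")
```

With these numbers the result is a little under 20,000 double-sided sheets, which matches the estimate in the post.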

John Griessen

Jan 13, 2015, 4:51:17 PM
to diy...@googlegroups.com
On 01/13/2015 01:55 PM, Cathal Garvey wrote:
> So, you pack the sheet with QR codes at minimum size assuming the high DPI coverage.

The image would need overlap zones at the paper fold lines, or else no folds and a margin,
in which case no repeated overlap zones would be needed.

But what if the insects we have in North America called silverfish got in the box?
They eat paper! Even in the absence of much humidity.

If stored in a pharaoh's tomb, with a wax seal on the wooden case, they might go 4000 years...

Sebastian Cocioba

Jan 13, 2015, 5:04:52 PM
to diy...@googlegroups.com
Then laser-etch the QR code into a piece of clear cubic zirconia with that fancy focal-point etching thing they do at shopping malls to immortalize some mundane family photo or moment... and frame it in a platinum-iridium alloy so thermal expansion is damn near zero. Should last a few millennia at least, no? :P


Meredith L. Patterson

Jan 13, 2015, 5:27:54 PM
to DIYBio Mailing List
Fabienne Serriere and Dan Kaminsky looked into the long-term archival problem for print matter in general, and they concluded that microfiche is ideal in terms of durability and really quite inexpensive. Bonus: constructing a microfiche viewer requires basically just a lens and a light source.


but I'd think it would be possible to laser-etch QR codes onto microfilm quite densely.

Cheers,
--mlp

Jeswin

Jan 13, 2015, 7:00:46 PM
to diy...@googlegroups.com
How did we get to printing QR codes? I was thinking more along the
lines of magnetic tape drives. Is that still a thing in the IT world?
I thought I heard it started going away when cheap multi-terabyte
HDDs came out. That and the cloud, but the cloud is not feasible
for storing sequencing data.

John Griessen

Jan 13, 2015, 8:11:32 PM
to diy...@googlegroups.com
On 01/13/2015 06:00 PM, Jeswin wrote:
> I thought I heard it started going away when cheap
> multi-terabyte HDDs came out.

Yes, hard drives make sense for storage. Dirvish and rsnapshot are
higher-level programs that drive rsync to save backups to separate,
redundant locations over networks on a schedule. They let you save the
state of your whole computer by storing only the files that changed and
hard-linking everything else to the previous snapshot, so you can keep
many, many slightly different versions of a 20GB file directory in a
space of just 40GB.
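
To illustrate why many snapshots take so little space, here is a toy sketch of the hard-link idea in Python. It is not how Dirvish or rsnapshot are implemented (they delegate the work to rsync, which has a `--link-dest` option for this); the function name `snapshot` is made up, and for simplicity it compares full file contents and ignores empty directories, symlinks and permissions.

```python
import filecmp
import os
import shutil
from pathlib import Path
from typing import Optional


def snapshot(source: Path, previous: Optional[Path], dest: Path) -> None:
    """Copy `source` into `dest`, hard-linking files unchanged since `previous`."""
    for src_file in source.rglob("*"):
        if not src_file.is_file():
            continue
        rel = src_file.relative_to(source)
        target = dest / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        old = previous / rel if previous is not None else None
        if old is not None and old.is_file() and filecmp.cmp(src_file, old, shallow=False):
            # Unchanged: add a second directory entry for the same data on disk.
            os.link(old, target)
        else:
            # New or changed: store a fresh copy.
            shutil.copy2(src_file, target)
```

Calling `snapshot(data_dir, None, snap1)` makes a full copy; a later `snapshot(data_dir, snap1, snap2)` only uses extra space for files that changed in between. Hard links require both snapshots to live on the same filesystem.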

If your data is totally new and different each time, then plain rsync,
used in some scripts that write to different drives kept in drive trays,
will be plenty. After the copying is done, pull the drives and put them on shelves,
disconnected, unpowered, and not all in the same building, to have a redundant backup
that will last through time.

Even hard drives have a shelf life though. You'd need to turn them on and test
that the data is still good, and transfer to a new one every ten years or so...
They can freeze up just sitting there on a shelf.
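
One way to "test that the data is still good" is to record a checksum for every file when the drive is written and recompute them at each check-up. The sketch below uses SHA-256 from Python's standard library; the function names and the idea of keeping the manifest as a plain dictionary are illustrative assumptions, and in practice the manifest should be saved somewhere other than the drive being checked.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large sequencing files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> Dict[str, str]:
    """Map each file's path (relative to root) to its checksum."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify(root: Path, manifest: Dict[str, str]) -> List[str]:
    """Return relative paths that are missing or whose checksum changed."""
    bad = []
    for rel, expected in manifest.items():
        path = root / rel
        if not path.is_file() or sha256_of(path) != expected:
            bad.append(rel)
    return bad
```

An empty list from `verify` means every recorded file is present and unchanged; anything listed should be restored from one of the other copies.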

Cathal Garvey

Jan 14, 2015, 6:10:11 AM
to diy...@googlegroups.com
> How did we get to printing QR codes?

Because LOL vellum genomes?

SC

Jan 19, 2015, 8:22:38 PM
to diy...@googlegroups.com
Digital data can, of course, be submitted to NCBI: GenBank or SRA, depending on whether or not it's been assembled. They store it, make it available with umpteen tools, and do offsite backups just to make sure.

Otto Heringer

Jan 20, 2015, 12:29:18 PM
to diy...@googlegroups.com

Has anyone already made this genome QR code!? Sounds pretty cool.

For the information to last thousands of years, why not carve QR codes on golden microchips!? Gold is very inert, and a microscope would be enough to get the info, which is almost the same thing that Meredith suggested.

Cathal (Phone)

Jan 20, 2015, 12:55:59 PM
to diy...@googlegroups.com, Otto Heringer
I've an idea: encode the DNA with error correction codes (Look 'em up) in...DNA! Place in a long-generation-time plant and there you go.

Kermit Henson

Jan 21, 2015, 7:10:16 AM
to diy...@googlegroups.com, ottowh...@gmail.com, cathal...@cathalgarvey.me
We can wipe out the data with a UV lamp instead of a magnet ;)

Seriously, private companies store the data on hard drives or in the cloud, (theoretically) always with a backup. If the company is not the owner of the data, they just send it to the client by FTP or on hard drives.
Research centres buy expensive storage data centres or create mirrors in other research centres (ideally in other countries). That also gives them a perfect excuse to ask for a public grant.

There's no perfect storage system right now, but I would try MiniDisc (has the patent expired?) ;)

Koeng

Jan 21, 2015, 9:12:17 AM
to diy...@googlegroups.com, ottowh...@gmail.com, cathal...@cathalgarvey.me
That could probably work if we got to working with large enough DNA.

If each base pair could hold 2 bits (A, T, G, C instead of 0, 1), then it would take 4 base pairs for a byte of data. Using that, it would take a whopping 4 million base pairs (an entire bacterial genome!) to encode a megabyte of data. That's a little much for my taste.

Alternatively, you could store computer data in synthesized DNA that isn't cloned.

-Koeng
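
The 4-bases-per-byte figure above can be checked with a small round-trip sketch. The mapping A=00, C=01, G=10, T=11 and the function names are arbitrary assumptions for illustration; real DNA storage schemes also have to deal with synthesis constraints and error correction, which this ignores.

```python
BASES = "ACGT"  # assumed mapping: A=00, C=01, G=10, T=11


def bytes_to_dna(data: bytes) -> str:
    """Write each byte as four bases, most significant bit pair first."""
    return "".join(
        BASES[(byte >> shift) & 3] for byte in data for shift in (6, 4, 2, 0)
    )


def dna_to_bytes(seq: str) -> bytes:
    """Inverse of bytes_to_dna; the length must be a multiple of 4."""
    if len(seq) % 4:
        raise ValueError("sequence length must be a multiple of 4")
    out = bytearray()
    for i in range(0, len(seq), 4):
        byte = 0
        for base in seq[i:i + 4]:
            byte = (byte << 2) | BASES.index(base)
        out.append(byte)
    return bytes(out)


message = b"DIYbio"
strand = bytes_to_dna(message)
print(strand, len(strand), "bases for", len(message), "bytes")
assert dna_to_bytes(strand) == message

bases_per_megabyte = 4 * 1_000_000
print(f"{bases_per_megabyte:,} bases per megabyte")
```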

Cory Geesaman

Jan 21, 2015, 9:46:03 AM
to diy...@googlegroups.com, ottowh...@gmail.com, cathal...@cathalgarvey.me
I'd be hesitant to trust a third party to store data reliably. Firstly, digital storage mechanisms don't have much of a track record for minimal data loss to begin with (a couple of decades). Secondly, there's even less trust you can place in the integrity of modern devices (with the push for more compact data storage, volatility goes up along with size). And the companies with any kind of reliable record for data storage have existed barely over a decade, which is less than the rated life of most devices (which again haven't existed long enough for their estimated lifetimes to have really been tested), while enormous numbers of data storage companies have lost data and either chalked it up to a freak accident or gone out of business as a result. If you're looking at long-term storage of data (assuming little expense spared), your best bet would include an off-site storage location (probably of tape backups) along with on-site storage. Your backup location(s) should also be shielded from EM and cosmic rays; ideally the off-site storage would be situated under a mountain.

Cory Geesaman

Jan 21, 2015, 9:49:29 AM
to diy...@googlegroups.com, ottowh...@gmail.com, cathal...@cathalgarvey.me
You could probably fit 2 bp per byte, but you'd want to avoid going for 2 bits per bp. There are more than 4 nucleotides in all, so if you're talking about any life you'd probably want 3 bits, and it wouldn't hurt to have a few flag codes for things like likely misreads (which most sequencers will inform you of when they happen; otherwise it might be difficult to tell if a sequence is ATTAGC or ATTTTTTTTTAGC). Sorry to spam this thread, but I actually did a bit of research on the subject of long-term DNA data storage for a business plan recently, so I hope the information is helpful.
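
A rough sketch of what a 3-bit-per-base layout with two bases per byte could look like. The eight-symbol alphabet here is purely hypothetical (it is not a scheme described in this thread or any standard): A, C, G, T, U, N for an unknown base, "?" as a stand-in flag for a likely misread, and "-" reserved for padding.

```python
# Hypothetical 3-bit alphabet: index in this list is the 3-bit code.
SYMBOLS = ["A", "C", "G", "T", "U", "N", "?", "-"]
PAD = "-"


def pack3(seq: str) -> bytes:
    """Store two 3-bit symbols per byte; the top two bits stay unused."""
    if PAD in seq:
        raise ValueError("'-' is reserved for padding")
    if len(seq) % 2:
        seq += PAD
    out = bytearray()
    for i in range(0, len(seq), 2):
        high = SYMBOLS.index(seq[i])
        low = SYMBOLS.index(seq[i + 1])
        out.append((high << 3) | low)
    return bytes(out)


def unpack3(data: bytes) -> str:
    """Inverse of pack3, dropping the padding symbol if present."""
    symbols = []
    for byte in data:
        symbols.append(SYMBOLS[(byte >> 3) & 7])
        symbols.append(SYMBOLS[byte & 7])
    return "".join(symbols).rstrip(PAD)


read = "ATTNGC?"
assert unpack3(pack3(read)) == read
print(len(read), "symbols ->", len(pack3(read)), "bytes")
```

This costs twice the space of a 2-bit packing, which is the trade-off for being able to record ambiguous or flagged positions.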