Amazon Glacier?


David Prothero

Aug 21, 2012, 12:44:51 PM
to s3...@googlegroups.com
Are the s3ql developers looking at Amazon Glacier as another potential backend? Not sure if that model works since Glacier is designed to be somewhere you dump data and then only access it if you really need to (free to dump data IN, super-cheap to store long-term, but then a little more to pull the data back out).

Not sure if routine s3ql operations would end up incurring more "out" transfers than Glacier is suited for.

Thought I'd ask, though.

David

Russell Jones

Aug 21, 2012, 12:49:12 PM
to s3...@googlegroups.com
This looks like a great service. I'm trying to figure out what this
sentence means though:

"Amazon Glacier is optimized for data that is infrequently accessed and
for which retrieval times of several hours are suitable"


Does this mean that the network bandwidth is slower than normal S3?

David Prothero

Aug 21, 2012, 12:58:12 PM
to s3...@googlegroups.com
From the Amazon Glacier FAQ:

Q: How should I choose between Amazon Glacier and Amazon Simple Storage Service (Amazon S3)?

Amazon S3 is a durable, secure, simple, and fast storage service designed to make web-scale computing easier for developers. Use Amazon S3 if you need low latency or frequent access to your data. Use Amazon Glacier if low storage cost is paramount, your data is rarely retrieved, and data retrieval times of several hours are acceptable.

In the coming months, Amazon S3 will introduce an option that will allow customers to seamlessly move data between Amazon S3 and Amazon Glacier based on data lifecycle policies.    




--
You received this message because you are subscribed to the Google Groups "s3ql" group.
To post to this group, send email to s3...@googlegroups.com.
To unsubscribe from this group, send email to s3ql+unsubscribe@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/s3ql?hl=en.




--
David

Russell Jones

Aug 21, 2012, 1:02:34 PM
to s3...@googlegroups.com
Still doesn't answer whether the throughput is slower or not (unless I'm missing something). I couldn't care less about a few extra milliseconds of latency, but I would be interested in knowing whether you take a hit in actual data retrieval speed. The way they word it leaves questions.



David Prothero

Aug 21, 2012, 1:05:14 PM
to s3...@googlegroups.com
Read some more. You don't have real-time read capability at all. You have to create a "job" to pull data out of Glacier. They say a job typically takes 3-5 hours to complete.

That's pretty bad read latency :P

It's still a great service for long-term cold storage of data, but probably not something you'd hook s3ql directly to.

David

David Prothero

Aug 21, 2012, 1:18:46 PM
to s3...@googlegroups.com
I suppose you could use s3ql against a local file store and then periodically back up the s3ql filesystem to Glacier for "cold" storage (/rimshot). That would give you the local encryption benefits of s3ql, but not so much the deduplication benefit for minimizing data transfer to Amazon.

Or maybe, we see what tools Amazon develops for moving data from S3 to Glacier.

David

Cliff Stanford

Aug 23, 2012, 10:02:37 AM
to s3...@googlegroups.com
On 21/08/12 17:44, David Prothero wrote:

> Not sure if routine s3ql operations would end up incurring more "out"
> transfers than Glacier is suited for.

As far as I can see, s3ql would work for writes on Glacier other than
the initial check it makes to make sure its cache and DB are up to date.
s3ql is very good about not pulling stuff down the line if it doesn't
have to.

The problem would be on any kind of data retrieval. If Nikolaus wants
to support it, it will take some thought but I think it would be very
useful.

Has anyone looked at the API? Is it similar to the S3 one?

Regards,
Cliff.

--
Cliff Stanford
Office: +44 20 0222 1666 UK Mobile: +44 7973 616 666
Spain: +34 952 587 666
http://www.may.be/

David Harrison

Aug 23, 2012, 10:10:59 AM
to s3...@googlegroups.com
Perhaps Glacier can somehow be utilized for the snapshot feature.

I was thinking, however, that since Amazon states they will be introducing some features to automate S3-to-Glacier replication, you could just use Glacier to back up your s3ql buckets for somewhat similar results.

Restoring would be a bit tedious, as you would have to copy your backed-up bucket back to S3 and then mount a separate file system to retrieve the version you are looking for.




David Prothero

Aug 23, 2012, 11:32:05 AM
to s3...@googlegroups.com
I've looked at the API. Data retrieval is an asynchronous operation. You create a job to request the data and then wait "3-4" hours for your data to become available to download. I've read elsewhere that people are theorizing that Amazon is using older hardware for Glacier, trying to get more life out of the equipment, and that they presumably power down the equipment once the data is written. So requesting data back may require powering the equipment back up.

I haven't seen any confirmation from Amazon that's how they're doing it, though.

David


Nikolaus Rath

Aug 23, 2012, 11:43:38 AM
to s3...@googlegroups.com
On 08/23/2012 10:02 AM, Cliff Stanford wrote:
> On 21/08/12 17:44, David Prothero wrote:
>
>> Not sure if routine s3ql operations would end up incurring more "out"
>> transfers than Glacier is suited for.
>
> As far as I can see, s3ql would work for writes on Glacier other than
> the initial check it makes to make sure its cache and DB are up to date.
> s3ql is very good about not pulling stuff down the line if it doesn't
> have to.
>
> The problem would be on any kind of data retrieval. If Nikolaus wants
> to support it, it will take some thought but I think it would be very
> useful.
>
> Has anyone looked at the API? Is it similar to the S3 one?

The upload API is trivial. It shouldn't take more than about 1-2 hours
to support it in S3QL.

Downloading data is much more complex. The API itself is very easy, but
it does not fit into the S3QL programming model at all. S3QL could
easily create download jobs for the data it needs, but it has no way to
notice when a job is ready for download. Also, when downloading data, it
has to be identified by a changing job id rather than the name it was
stored under. The same object will therefore have a different id every
time it's being downloaded. This means that the S3QL data structures
have to be extended to track the id in addition to the object identifier.
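
A minimal sketch of the extra bookkeeping described above, assuming a small SQLite table that maps the name an object was stored under to the id of its most recent retrieval job. The table name, column names, and helper functions are hypothetical and are not part of S3QL.

```python
import sqlite3


def open_tracker():
    """Create an in-memory table mapping stored objects to retrieval jobs."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE retrieval_jobs ("
        " obj_key TEXT PRIMARY KEY,"  # name the object was stored under
        " job_id  TEXT NOT NULL)"     # id of the most recent retrieval job
    )
    return db


def record_job(db, obj_key, job_id):
    """Remember the job for obj_key, replacing any earlier job id."""
    db.execute(
        "INSERT OR REPLACE INTO retrieval_jobs (obj_key, job_id) VALUES (?, ?)",
        (obj_key, job_id),
    )


def job_for(db, obj_key):
    """Return the current job id for obj_key, or None if none is recorded."""
    row = db.execute(
        "SELECT job_id FROM retrieval_jobs WHERE obj_key = ?", (obj_key,)
    ).fetchone()
    return row[0] if row else None
```

Because the same object gets a new job id on every retrieval, the sketch overwrites the old id rather than keeping a history.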

Finally, any read request from userspace would block for several minutes
at least. If a request blocks that long, it's going to create lots of
kernel warning messages. While a program is blocked in this way, it is
also in "uninterruptible sleep", so it cannot even be kill -9'ed.

If I made S3QL only able to write data to glacier, the problem is how
the file system should react to attempts to read data. Probably an IO
error should be generated, but this means that the highest S3QL layer
(which talks to the FUSE kernel module) suddenly needs to talk to the
lowest S3QL layer (the backend) for every request to determine if the
file system is write-only. This isn't hard to program, but it's going to
introduce a lot of ugly identical boilerplate code in lots of places.
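
To illustrate the kind of check being described, here is a rough sketch in which a top layer asks the backend whether it is write-only before serving a read and raises an I/O error if so. The class names, the `write_only` attribute, and the method names are all hypothetical and do not reflect the actual S3QL code.

```python
import errno


class WriteOnlyBackend:
    """Hypothetical backend that can store objects but not fetch them."""

    write_only = True

    def __init__(self):
        self.objects = {}

    def store(self, key, data):
        self.objects[key] = data


class FileSystemLayer:
    """Hypothetical top layer that answers requests coming from FUSE."""

    def __init__(self, backend):
        self.backend = backend

    def _check_readable(self):
        # This check would have to be repeated in every read-type handler.
        if getattr(self.backend, "write_only", False):
            raise OSError(errno.EIO, "backend is write-only")

    def read(self, key):
        self._check_readable()
        return self.backend.objects[key]

    def write(self, key, data):
        self.backend.store(key, data)
```

Every handler that can trigger a read would need the same `_check_readable()` call, which is the repetition the paragraph above objects to.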


Therefore, it seems to me that if data has to be copied from Glacier
into S3 for reading anyway, the most sensible approach would be to
require the same procedure for writing it and not add any direct Glacier
support to S3QL.


Best,

-Nikolaus

--
»Time flies like an arrow, fruit flies like a Banana.«

PGP fingerprint: 5B93 61F8 4EA2 E279 ABF6 02CF A9AD B7F8 AE4E 425C

Jar Jar

Aug 24, 2012, 7:47:09 AM
to s3...@googlegroups.com


On Thursday, August 23, 2012 11:43:38 AM UTC-4, Nikolaus Rath wrote:
> If I made S3QL only able to write data to glacier, the problem is how
> the file system should react to attempts to read data. Probably an IO
> error should be generated, but this means that the highest S3QL layer
> (which talks to the FUSE kernel module) suddenly needs to talk to the
> lowest S3QL layer (the backend) for every request to determine if the
> file system is write-only. This isn't hard to program, but it's going to
> introduce a lot of ugly identical boilerplate code in lots of places.

I know of a couple of use cases where a write-only S3QL would be very handy (big data, analysis, regulatory backups).
If the promised S3 -> Glacier tools materialize, one system could provide online and offline backup with minimal programming risk at about $10/TB per month.

Working with Glacier raises one other question: with that much cheap storage, can the S3QL structures and processes scale into the range of petabytes of data, billions of files, millions of snapshots, and millions of data blocks?
Nikolaus, do you have a good feeling for what the outer limits of S3QL (its structures and processes) are?




Nikolaus Rath

Aug 24, 2012, 10:17:32 AM
to s3...@googlegroups.com
On 08/24/2012 07:47 AM, Jar Jar wrote:
> One other thing working with Glacier could bring up, with that much
> cheap storage, can the S3QL structures and processes scale
> into petabyte data/billion files/million snapshots/million data-block range?
> Nikolaus, do you have a good feeling for what the outer limits of the
> S3QL (its structures/processes) are?

As far as the amount of stored data is concerned, S3QL is only limited
by the metadata database. The limits on those are given by SQLite. The
size of the metadata scales approximately linearly with the size of the
stored data, and the proportionality constant depends on the block size
of the S3QL file system and the average size per file in the file
system. SQLite can handle terabyte sized databases just fine, so I
suppose from that point of view you could easily store petabytes of data.

However, I believe you will hit performance issues much earlier. S3QL
performance is roughly logarithmic in stored data, so it will not get
significantly slower no matter how much data you store. However, I am
not sure if S3QL is fast enough to read and store petabytes of data in a
reasonable amount of time. Because of its architecture (FUSE and Python)
it doesn't scale nearly as well as an in-kernel file system.

Finally, S3QL cannot upload incremental metadata updates. So every time
you upload the metadata you'd have to upload, say, the entire 1 TB
database, even if you added only 5 MB of data. I can imagine this
becoming very annoying :-).
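
As a back-of-the-envelope illustration of the numbers in the previous paragraph, the following sketch computes how many bytes of metadata are uploaded per byte of new data, and how long a full metadata upload would take. The 100 Mbit/s uplink is purely an assumption for the example, and decimal units (1 TB = 10^12 bytes) are used.

```python
def upload_amplification(metadata_bytes, new_data_bytes):
    """Ratio of metadata bytes uploaded to bytes of newly added data."""
    return metadata_bytes / new_data_bytes


def upload_hours(num_bytes, link_bits_per_second):
    """Hours needed to push num_bytes over a link of the given speed."""
    return num_bytes * 8 / link_bits_per_second / 3600


TB = 10**12
MB = 10**6

# 1 TB of metadata re-uploaded after adding only 5 MB of data.
ratio = upload_amplification(1 * TB, 5 * MB)

# Assumed uplink of 100 Mbit/s; substitute the real figure for your link.
hours = upload_hours(1 * TB, 100 * 10**6)

print("amplification: %.0fx, upload time: %.1f hours" % (ratio, hours))
```

With these assumed figures the metadata upload is 200,000 times larger than the data that was added and takes roughly 22 hours.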

Domen Kožar

Sep 1, 2012, 2:31:24 PM
to s3...@googlegroups.com
According to http://docs.amazonwebservices.com/amazonglacier/latest/dev/api-initiate-job-post.html:

Reading data from Glacier means requesting a job. You get back an ID, which you can use to check every n seconds whether the job is ready to be downloaded. Even though the API is async, s3ql could also implement read access.
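
The request-then-poll pattern can be sketched as follows. `FakeArchiveService` is an in-memory stand-in written for this example; its method names are hypothetical and are not the real Glacier API. Only the control flow (start a job, poll until it is ready, then download) is the point.

```python
import time


class FakeArchiveService:
    """In-memory stand-in for an asynchronous archive store (not the real API)."""

    def __init__(self, polls_until_ready=3):
        self._archives = {}
        self._jobs = {}
        self._polls_until_ready = polls_until_ready

    def upload(self, archive_id, data):
        self._archives[archive_id] = data

    def initiate_job(self, archive_id):
        job_id = "job-%d" % (len(self._jobs) + 1)
        # Each job remembers its archive and how many polls remain.
        self._jobs[job_id] = [archive_id, self._polls_until_ready]
        return job_id

    def job_ready(self, job_id):
        job = self._jobs[job_id]
        if job[1] > 0:
            job[1] -= 1  # pretend some time has passed
        return job[1] == 0

    def job_output(self, job_id):
        return self._archives[self._jobs[job_id][0]]


def fetch(service, archive_id, poll_interval=0.0, max_polls=100):
    """Start a retrieval job and poll until its output can be downloaded."""
    job_id = service.initiate_job(archive_id)
    for _ in range(max_polls):
        if service.job_ready(job_id):
            return service.job_output(job_id)
        time.sleep(poll_interval)
    raise TimeoutError("job %s not ready after %d polls" % (job_id, max_polls))
```

Note that `fetch` blocks its caller for the entire wait; with jobs that take hours, whatever called it stays blocked for just as long.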


Nikolaus Rath

Sep 1, 2012, 2:58:17 PM
to s3...@googlegroups.com
On 09/01/2012 02:31 PM, Domen Kožar wrote:
> According
> to http://docs.amazonwebservices.com/amazonglacier/latest/dev/api-initiate-job-post.html:
>
> Reading data from Glacier means requesting a job. You get back an ID
> which you can check every n seconds if job is ready to be downloaded.
> Even though API is async, s3ql could implement also read access.

Of course it is possible in principle, but that doesn't mean that it's a
good idea. I have outlined the architectural problems in another mail.


-Nikolaus

--
»Time flies like an arrow, fruit flies like a Banana.«

Domen Kožar

Sep 1, 2012, 3:25:08 PM
to s3...@googlegroups.com
Hi Nikolaus,

it seems you outlined three issues:

> Finally, any read request from userspace would block for several minutes
> at least. If a request blocks that long, it's going to create lots of
> kernel warning messages. While a program is blocked in this way, it is
> also in "uninterruptible sleep", so it cannot even be kill -9'ed.

This wouldn't be a problem in most cases.

> However, I believe you will hit performance issues much earlier. S3QL
> performance is roughly logarithmic in stored data, so it will not get
> significantly slower no matter how much data you store. However, I am
> not sure if S3QL is fast enough to read and store petabytes of data in a
> reasonable amount of time. Because of its architecture (FUSE and Python)
> it doesn't scale nearly as well as an in-kernel file system.

Meaning there is a transfer speed limit? Python has pretty fast I/O,
otherwise pypy could be used. Are there any benchmarks for this?

> Finally, S3QL can not upload incremental metadata updates. So every time
> you upload the metadata you'd have to upload, say, the entire 1 TB
> database, even if you added only 5 MB of data. I can imagine this
> becoming very annoying :-).

You can easily upload each file as a new archive. Changing a file would indeed mean downloading it, but for some use cases that's not needed.

Cheers, Domen

Nikolaus Rath

Sep 1, 2012, 3:57:45 PM
to s3...@googlegroups.com
On 09/01/2012 03:25 PM, Domen Kožar wrote:
> Hi Nikolaus,
>
> it seems you outlined three issues:
>
> Finally, any read request from userspace would block for several
> minutes
> at least. If a request blocks that long, it's going to create lots of
> kernel warning messages. While a program is blocked in this way, it is
> also in "uninterruptible sleep", so it cannot even be kill -9'ed.
>
>
> This wouldn't be a problem in most cases.

Why is that?

> However, I believe you will hit performance issues much earlier. S3QL
> performance is roughly logarithmic in stored data, so it will not get
> significantly slower no matter how much data you store. However, I am
> not sure if S3QL is fast enough to read and store petabytes of data
> in a
> reasonable amount of time. Because of its architecture (FUSE and
> Python)
> it doesn't scale nearly as well as an in-kernel file system.
>
>
> Meaning there is a transfer speed limit? Python has pretty fast I/O,
> otherwise pypy could be used. Are there any benchmarks for this?

No, there is no limit, it's just slower. Try copying a big file (that
still fits into the S3QL cache to not be affected by network bandwidth)
into an S3QL mountpoint and an ext3 mountpoint and you'll see the
difference.
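
The comparison suggested above could be scripted roughly like this. The function names are made up for the example, and the mountpoint paths in the usage note are placeholders to replace with a real S3QL mountpoint and an ext3 directory.

```python
import os
import shutil
import time


def time_copy(src, dest_dir):
    """Copy src into dest_dir, flush it, and return the elapsed seconds."""
    dest = os.path.join(dest_dir, os.path.basename(src))
    start = time.perf_counter()
    shutil.copyfile(src, dest)
    # Flush the written data to the underlying storage so that buffering
    # does not hide part of the cost.
    with open(dest, "r+b") as fh:
        os.fsync(fh.fileno())
    return time.perf_counter() - start


def compare(src, mountpoints):
    """Return a dict mapping each mountpoint to its copy time in seconds."""
    return {mp: time_copy(src, mp) for mp in mountpoints}
```

A call such as `compare("/tmp/bigfile.bin", ["/mnt/s3ql", "/mnt/ext3"])` (placeholder paths) returns one timing per target. As noted above, the test file should still fit into the S3QL cache so that network bandwidth does not affect the result.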


> Finally, S3QL can not upload incremental metadata updates. So every
> time
> you upload the metadata you'd have to upload, say, the entire 1 TB
> database, even if you added only 5 MB of data. I can imagine this
> becoming very annoying :-).
>
>
> You can easily upload each file as new archive. Changing file would mean
> downloading it indeed, but for some use cases that's not needed.

This wasn't referring to Glacier. Storing petabytes in S3QL is probably
going to be annoying no matter what backend you use.


Best,

Jaka Hudoklin

Sep 15, 2012, 4:11:56 PM
to s3...@googlegroups.com
Hello,

we are implementing a Python command-line interface for Glacier, which already works (https://github.com/uskudnik/amazon-glacier-cmd-interface). At the same time we are developing the GlacierWrapper library (currently in this branch: https://github.com/uskudnik/amazon-glacier-cmd-interface/tree/botoonly), which besides basic Glacier functionality supports SimpleDB as storage for archive metadata. Our long-term plan is to add filesystem support using FUSE (https://github.com/uskudnik/amazon-glacier-cmd-interface/issues/2) and maybe to integrate with s3ql. We are also planning to integrate with boto when their Glacier support is ready.

While my knowledge of FUSE, and especially of how your project is implemented, is limited, I would really like to know what you think about integrating our project with s3ql.

Thanks!

Nikolaus Rath

Sep 16, 2012, 8:45:28 AM
to s3...@googlegroups.com
On 09/15/2012 04:11 PM, Jaka Hudoklin wrote:
> Hello,
>
> we are implementing python glacier command line interface, which already
> works(https://github.com/uskudnik/amazon-glacier-cmd-interface). At the
> same time we are developing GlacierWrapper library(currently in this
> branch
> https://github.com/uskudnik/amazon-glacier-cmd-interface/tree/botoonly)
> that besides basic glacier functionality supports SimpleDB as storage
> for archive metadata. Our long term plan is to make filesystem support
> using
> fuse(https://github.com/uskudnik/amazon-glacier-cmd-interface/issues/2)
> and maybe to integrate with s3ql. We are also planning to integrate with
> boto when their glacier support is ready.
>
> While my knowledge of fuse and especially how your project is
> implemented is bad, I would really like to know what do you think about
> integration of our project with s3ql.


I think getting more people to work on his code is every open source
programmer's dream, so I'd of course be happy to support you with that :-).

That said, I'm not sure I understand what kind of integration you have
in mind. Getting S3QL to store data in SimpleDB has been requested
several times before and would be a great feature to have. Being able to
support Glacier directly would also be nice, but (as discussed a few
weeks ago), the problem isn't so much one of coding but one of architecture.

I deliberately eliminated the use of boto in S3QL a few years ago. Back
then the boto S3 code was (in my opinion) a terrible mess, and got even
worse when GS support was merged.

Jaka Hudoklin

Sep 16, 2012, 10:55:23 AM
to s3...@googlegroups.com
> I'm not sure I understand what kind of integration you have
> in mind. Getting S3QL to store data in SimpleDB has been requested
> several times before and would be a great feature to have.

I'm not talking about storing data to SimpleDB, but only file metadata, because there's no sane way that you can store archive metadata into glacier. You need some kind of database. So that's why we decided to use SimpleDB.
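
A rough sketch of such a side index, assuming the storage service hands back only an opaque archive id on upload, so that names, sizes, and checksums have to be tracked elsewhere. A plain dict stands in for the database here; the class and field names are hypothetical and are not taken from the GlacierWrapper code.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class ArchiveRecord:
    archive_id: str  # opaque id handed back by the storage service
    name: str        # human-readable file name
    size: int        # payload size in bytes
    sha256: str      # hex digest, for verifying a later download


class ArchiveIndex:
    """Hypothetical side index; a dict stands in for a database such as SimpleDB."""

    def __init__(self):
        self._by_name = {}

    def add(self, archive_id, name, data):
        """Record the metadata for an uploaded archive and return the record."""
        record = ArchiveRecord(
            archive_id, name, len(data), hashlib.sha256(data).hexdigest()
        )
        self._by_name[name] = record
        return record

    def lookup(self, name):
        """Return the record stored under name, or None if it is unknown."""
        return self._by_name.get(name)
```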


> Glacier directly would also be nice, but (as discussed a few
> weeks ago), the problem isn't so much one of coding but one of architecture.

I agree it's not a problem of coding but of architecture, and we still have some progress to make on our project before we decide whether FUSE support is a sane option or whether we will just stay a command-line interface. Of course we are not going to implement FUSE support if s3ql is going to add it, because we like s3ql and there's no need for two implementations.


> I deliberately eliminated the use of boto in S3QL a few years ago. Back
> then the boto S3 code was (in my opinion) a terrible mess, and got even
> worse when GS support was merged.

As far as boto is concerned, I don't know how much of a mess it is, and changing the interface should not be more than an hour of work (at least in our case), so it's not even that important. But I don't like to have duplicated code, so that's why we will probably use boto when it's ready.

Nikolaus Rath

Sep 16, 2012, 12:38:18 PM
to s3...@googlegroups.com
On 09/16/2012 10:55 AM, Jaka Hudoklin wrote:
> I'm not sure I understand what kind of integration you have
> in mind. Getting S3QL to store data in SimpleDB has been requested
> several times before and would be a great feature to have.
>
>
> I'm not talking about storing data to SimpleDB, but only file metadata,
> because there's no sane way that you can store archive metadata into
> glacier. You need some kind of database. So that's why we decided to use
> SimpleDB.

Yeah, I was talking about metadata too, sorry for the confusion.

Mark

Apr 8, 2013, 11:42:38 AM
to s3...@googlegroups.com


On Monday, 17 September 2012 02:38:18 UTC+10, Nikolaus Rath wrote:
> On 09/16/2012 10:55 AM, Jaka Hudoklin wrote:
>>> I'm not sure I understand what kind of integration you have
>>> in mind. Getting S3QL to store data in SimpleDB has been requested
>>> several times before and would be a great feature to have.
>>
>> I'm not talking about storing data to SimpleDB, but only file metadata,
>> because there's no sane way that you can store archive metadata into
>> glacier. You need some kind of database. So that's why we decided to use
>> SimpleDB.
>
> Yeah, I was talking about metadata too, sorry for the confusion.
 
What are the architectural issues preventing supporting SimpleDB instead of SQLite for master and slave?

(Non-architectural concerns I can appreciate, e.g. that backing S3QL meta into AWS isn't as 'open' and portable as SQLite for non-AWS users. The added option would also burden S3QL maintenance.) 

Thanks
Mark



Nikolaus Rath

Apr 8, 2013, 11:52:38 AM
to s3...@googlegroups.com
On 04/08/2013 08:42 AM, Mark wrote:
>
>
> On Monday, 17 September 2012 02:38:18 UTC+10, Nikolaus Rath wrote:
>
> On 09/16/2012 10:55 AM, Jaka Hudoklin wrote:
> > I'm not sure I understand what kind of integration you have
> > in mind. Getting S3QL to store data in SimpleDB has been
> requested
> > several times before and would be a great feature to have.
> >
> >
> > I'm not talking about storing data to SimpleDB, but only file
> metadata,
> > because there's no sane way that you can store archive metadata into
> > glacier. You need some kind of database. So that's why we decided
> to use
> > SimpleDB.
>
> Yeah, I was talking about metadata too, sorry for the confusion.
>
>
> What are the architectural issues preventing supporting SimpleDB instead
> SQlite for master and slave?

None, it's just going to make everything much slower. But I still
maintain my original position: it'd be a great feature, someone just has
to write the code :-).


Best,
-Niko

Mark

Apr 8, 2013, 11:57:49 AM
to s3...@googlegroups.com


On Tuesday, 9 April 2013 01:42:38 UTC+10, Mark wrote:
> What are the architectural issues preventing supporting SimpleDB instead of SQLite for master and slave?
>
> (Non-architectural concerns I can appreciate, e.g. that backing S3QL meta into AWS isn't as 'open' and portable as SQLite for non-AWS users. The added option would also burden S3QL maintenance.)

From Wikipedia:

Store limitations

Attribute                Maximum
domains                  250 active domains per account. More can be requested by filling out a form.[6]
size of each domain      10 GB
attributes per domain    1,000,000,000
attributes per item      256 attributes
size per attribute       1024 bytes

Query limitations

Attribute                              Maximum
items returned in a query response     2500 items
seconds a query may run                5 seconds
attribute names per query predicate    1 attribute name
comparisons per predicate              22 operators
predicates per query expression        20 predicates
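
To make the store limits above concrete, here is a small sketch that checks whether an item respects the per-item limits and estimates how many domains a given amount of metadata would need. The constants come from the table; the helper names are hypothetical, "10 GB" is read as decimal gigabytes, and sizes are measured as UTF-8 bytes, both of which are assumptions.

```python
MAX_ATTRS_PER_ITEM = 256
MAX_BYTES_PER_ATTR = 1024
MAX_DOMAIN_BYTES = 10 * 10**9  # 10 GB, taken as decimal gigabytes here
MAX_DOMAINS = 250


def item_fits(attributes):
    """True if a dict of attribute name -> string value respects the per-item limits."""
    if len(attributes) > MAX_ATTRS_PER_ITEM:
        return False
    return all(
        len(value.encode("utf-8")) <= MAX_BYTES_PER_ATTR
        for value in attributes.values()
    )


def domains_needed(total_bytes):
    """Smallest number of domains that could hold total_bytes (ceiling division)."""
    return -(-total_bytes // MAX_DOMAIN_BYTES)
```

Under these assumptions, 1 TB of metadata would need at least 100 domains, and the default 250 domains cap the total at about 2.5 TB.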
