Help deciding where to go with S3 filesystems


auada

Aug 27, 2012, 4:09:12 PM
to s3...@googlegroups.com
Hi, I am studying the different S3 filesystems available and I would appreciate it if somebody could help me figure out which is most appropriate for my situation:

I need a place to store the maildir files of an email system. So I need performance, and also a safe system that is online 24/7 with millions of files (or more) distributed in a directory tree.

I have seen that the most active systems are S3QL and s3backer. I wonder what I should take into account to make up my mind, and whether somebody already has experience with them.

I was also wondering whether building a filesystem and splitting it among different S3 buckets would make any difference with regard to performance and security.

If anyone can jump in, it would be really appreciated.

Best regards,
--
Ricardo Auada

Russell Jones

Aug 27, 2012, 4:12:20 PM
to s3...@googlegroups.com
Hey Ricardo,

Just a heads up that I've tried using S3ql before for storing Dovecot
maildir files without much luck - Dovecot kept thinking the filesystem /
index files were corrupted every time I restarted the dovecot service.

I haven't had a chance to try going back and fiddling with it more. Let
me know if you have luck getting this working with your mailserver
properly; I may give it another shot if so :-)

Nikolaus Rath

Aug 27, 2012, 4:16:28 PM
to s3...@googlegroups.com
On 08/27/2012 04:09 PM, auada wrote:
> Also I was thinking and wondering if, building a filesystem and
> splitting it among different S3 buckets would make any difference in
> relation to performance and security.

No, as far as I can see that would be pointless.

Best,

-Nikolaus

--
»Time flies like an arrow, fruit flies like a Banana.«

PGP fingerprint: 5B93 61F8 4EA2 E279 ABF6 02CF A9AD B7F8 AE4E 425C

auada

Aug 27, 2012, 9:12:35 PM
to s3...@googlegroups.com
Nikolaus, thanks a lot for your response.

First, I want to congratulate you on the great job you've done. It's clear, from the quality of the manual and the attention you give to everyone, that this is a great project.

I would like to add four more questions. As a matter of fact, there are lots of questions popping up in my mind:

1- regarding security and durability, if we use an S3QL filesystem as a production resource, should we keep backups outside the filesystem, or would snapshots be enough?
2- would it be possible to synchronize SQLite (or another DB) among different instances and try to make multiple mounts of the same file system, or is there more to multiple mounts than just the DB? I ask because it is like a dream to be able to have multiple mounts, leave NFS behind and get rid of a single point of failure.
3- in the manual you say that standard S3 at Amazon has an eventual consistency window and other regions have immediate consistency. Where did you take that from: from AWS or from experience? And do you count Oregon as standard as well?
4- what is your feeling about using an S3QL filesystem for maildirs, regarding performance, durability and security? Do you think we could have a bottleneck?

Thanks a lot.

Regards,

Ricardo Auada
 

On Monday, August 27, 2012 at 5:16:28 PM UTC-3, Nikolaus Rath wrote:
On 08/27/2012 04:09 PM, auada wrote:
> Also I was thinking and wondering if, building a filesystem and
> splitting it among different S3 buckets would make any difference in
> relation to performance and security.

No, as far as I can see that would be pointless.

Best,

   -Nikolaus

--
 »Time flies like an arrow, fruit flies like a Banana.«

auada

Aug 27, 2012, 9:14:32 PM
to s3...@googlegroups.com
Hi Russell,

we might be trying it in the next few weeks. As soon as I have any results, I'll let you know.

But please note that we use Courier IMAP, so I'm not sure the test will be 100% valid for your case.

Regards,
--
Ricardo Auada

Martin van Es

Aug 28, 2012, 4:44:47 AM
to s3...@googlegroups.com
Hi Ricardo,

I'm using s3ql as the backend for my (personal) Zimbra mail server.
Currently it holds a total of approximately 3G of mail for me and a
couple of friends.

The only problem I've seen is that during startup of the mailbox
daemon, Zimbra misinterprets the size of the message store mountpoint
and complains about a shortage of storage. It continues to start,
however, and of course never reaches the limit.


Regards,
Martin



--
If 'but' was any useful, it would be a logic operator

Nikolaus Rath

Aug 28, 2012, 8:56:12 AM
to s3...@googlegroups.com
On 08/27/2012 09:12 PM, auada wrote:
> 1- regarding security and durability, if we use s3ql filesystem as a
> production resource, should we keep backups outside the filesystem, or
> the snapshots would be enough ?

That's something only you can decide.

Since the release of S3QL 1.0, any reports of data loss were restricted
to data that was in the process of being copied while either the computer or
network connection crashed. I have not had any reports of data being
lost after it was written and the metadata was uploaded. For my own
data, I use S3QL backups exclusively.


> 2- would it be possible to sinchronize sql lite (or other db) among
> different instances and try to make multiple mounts of same file system,
> or there is more to multiple mounts than only the db ? I ask, because it
> is like a dream to be able to have multiple mounts and leave nfs behind
> and get rid of a single point of failure.

This has been discussed a few times on the list, you may want to take a
look at the archives. The short answer is: no, it's not feasible.

> 3- in the manual you say that standard S3 at Amazon has an eventual
> consistency window and other regions have immediate. Where did you take
> that from: from AWS or from experience ? And do you include Oregon as
> standard as well ?

That's from the AWS documentation. I don't know about Oregon off the top of
my head; if you want to be sure you should probably check the AWS
documentation.


> 4- what is your feeling about using s3ql filesystem for maildirs,
> regarding performance, durability and security ? Do you think we could
> have a bottle neck ?

I think you're trying to make a very complex question a bit too simple
here. This all depends on your requirements. S3QL is slower than
in-kernel file systems like ext4 for sure. It's probably more durable
than running ext3 on a hard disk, because Amazon S3 has much more
redundancy, but you can of course compensate for that by setting up a
giant RAID system yourself. I am not sure what you mean by security.

Every system always has a bottleneck. Whether it will be S3QL depends
on your hardware, software, user base and requirements.


Best,

-Nikolaus

--
»Time flies like an arrow, fruit flies like a Banana.«

auada

Aug 28, 2012, 10:33:05 AM
to s3...@googlegroups.com


On Tuesday, August 28, 2012 at 9:56:14 AM UTC-3, Nikolaus Rath wrote:
On 08/27/2012 09:12 PM, auada wrote:
> 1- regarding security and durability, if we use s3ql filesystem as a
> production resource, should we keep backups outside the filesystem, or
> the snapshots would be enough ?

That's something only you can decide.

Since the release of S3QL 1.0, any reports of data loss where restricted
to data that was in the process of being copied while either computer or
network connection crashed. I have not had any reports of data being
lost after it was written and the metadata was uploaded. For my own
data, I use S3QL backups exclusively.


I wonder if, after a server crash, the DB can get corrupted and the data become unrecoverable, even with fsck.s3ql ... is that a situation that can happen?

 

> 2- would it be possible to sinchronize sql lite (or other db) among
> different instances and try to make multiple mounts of same file system,
> or there is more to multiple mounts than only the db ? I ask, because it
> is like a dream to be able to have multiple mounts and leave nfs behind
> and get rid of a single point of failure.

This has been discussed a few times on the list, you may want to take a
look at the archives. The short answer is: no, it's not feasible.

OK, the short answer is enough, thanks.

Regarding the --nfs flag, how does it help an NFS server?

> 3- in the manual you say that standard S3 at Amazon has an eventual
> consistency window and other regions have immediate. Where did you take
> that from: from AWS or from experience ? And do you include Oregon as
> standard as well ?

That's from the AWS documentation. I don't know about Oregon on top of
my head, if you want to be sure you should probably check the AWS
documentation.

Just checked. Oregon is not Standard, so it's immediately consistent.
 
> 4- what is your feeling about using s3ql filesystem for maildirs,
> regarding performance, durability and security ? Do you think we could
> have a bottle neck ?

I think you're trying to make a very complex question a bit too simple
here. This all depends on your requirements. S3QL is slower than
in-kernel file systems like ext4 for sure. It's probably more durable
than running ext3 on a hard disk, because Amazon S3 has much more
redundancy, but you can of course compensate for that by setting up a
giant RAID system yourself.  I am not sure what you mean with security.

Every system always has a bottle neck. Whether it will be S3QL depends
on your hardware, software, user base and requirements.


Sure, I see.

I have been researching the list and I found something saying that for huge filesystem sizes, the metadata can get too big. Would that be a bottleneck? Would it be difficult to migrate the filesystem to another server? What would be the consequences of this metadata being too large?

Best regards,
--
Ricardo Auada
 
Best,

   -Nikolaus

--
 »Time flies like an arrow, fruit flies like a Banana.«

Brad Knowles

Aug 28, 2012, 2:08:59 PM
to s3...@googlegroups.com, Brad Knowles
On Aug 27, 2012, at 3:09 PM, auada <rau...@gmail.com> wrote:

> I need a place to store the maildir files of an email system. So, I need performance and also a safe system to be online 24/7 with millions of files (or more) distributed in a directory tree.

I've designed and built some large-scale mail systems in the past (see <http://www.shub-internet.org/brad/papers/dihses/>, which was based on my experience at AOL and the largest ISP in Belgium), as well as some systems that are slightly smaller in scale (see <http://www.shub-internet.org/brad/papers/sistpni/>, which was based on experience at a customer in the Netherlands with ~3500 users at their HQ near Eindhoven and a total of ~7000 users world-wide). Let me give you my perspective on this issue.

First off, you need to keep in mind the Cardinal Rule:

Performance, Reliability, and Cost -- Pick Two

Everything else being equal, the #1 bottleneck for any message store is synchronous meta-data updates -- creating new files, deleting old files, renaming files, moving files from one location to another, etc.... This is because of the locking mechanisms required to ensure that these operations happen safely and correctly, and that if they should happen to fail, they do so in a way that is most likely to be recoverable.

> I have seen the most active systems are S3ql and S3backer. I wonder what I should take into account to make up my mind and if somebody already has experience related to them.

When you're building filesystems to store mail messages, the top priority is usually reliability, with performance a secondary consideration -- and you end up paying large amounts of money to get those two things (see the Cardinal Rule above).


However, if you're building something on top of S3, then clearly price is your biggest driver, with reliability and performance being very secondary. For example, S3 makes a very popular choice for online backup solutions, such as Jungledisk, Mozy, etc.... And the reason these solutions are popular is because they are so inexpensive, they have sufficient reliability, and customers don't care how slow they are.

With regard to s3ql, Nikolaus does the best he can in terms of reliability, but there's only so much he can do. Among other things, this is not a process that can be distributed easily, due to the fundamental nature of locks and locking mechanisms. And even then it's not enough reliability for a message store.

However, the biggest loser here is performance.

Therefore, S3 is just about the worst possible solution you could choose for the basis of a message store filesystem.


Even if you could kit-bash something together on top of s3ql that would "work" (for certain values of "work"), it would always be something held together with spit, baling wire, and mostly prayer. And that's just not something that you want to base a mail server on top of.

--
Brad Knowles <br...@shub-internet.org>
LinkedIn Profile: <http://tinyurl.com/y8kpxu>

Nikolaus Rath

Aug 29, 2012, 6:25:46 PM
to s3...@googlegroups.com
On 08/28/2012 02:08 PM, Brad Knowles wrote:
> With regards to s3ql, Niklaus does the best he can in terms of reliability, but there's only so much he can do. Among other things, this is not a process that can be distributed easily, due to the fundamental nature of locks and locking mechanisms. And even then it's not enough reliability for a message store.

Hmm. Did you mean performance rather than reliability here? How would
distributing the workload help with reliability?

Best,

-Nikolaus

--
»Time flies like an arrow, fruit flies like a Banana.«

Nikolaus Rath

Aug 29, 2012, 6:32:17 PM
to s3...@googlegroups.com
On 08/28/2012 10:33 AM, auada wrote:
>
>
> On Tuesday, August 28, 2012 at 9:56:14 AM UTC-3, Nikolaus Rath
> wrote:
>
> On 08/27/2012 09:12 PM, auada wrote:
> > 1- regarding security and durability, if we use s3ql filesystem as a
> > production resource, should we keep backups outside the
> filesystem, or
> > the snapshots would be enough ?
>
> That's something only you can decide.
>
> Since the release of S3QL 1.0, any reports of data loss where
> restricted
> to data that was in the process of being copied while either
> computer or
> network connection crashed. I have not had any reports of data being
> lost after it was written and the metadata was uploaded. For my own
> data, I use S3QL backups exclusively.
>
>
> I wonder if from a server crash the db can get corrupted and the data
> not recoverable, even with s3ql.fsck ... is that a situation that can
> happen ?

The local copy of the metadata may get corrupted beyond recovery, but
the last remote copy (which is uploaded periodically even if the file
system is never unmounted) is always available.


> > 2- would it be possible to sinchronize sql lite (or other db) among
> > different instances and try to make multiple mounts of same file
> system,
> > or there is more to multiple mounts than only the db ? I ask,
> because it
> > is like a dream to be able to have multiple mounts and leave nfs
> behind
> > and get rid of a single point of failure.
>
> This has been discussed a few times on the list, you may want to take a
> look at the archives. The short answer is: no, it's not feasible.
>
>
> OK, short answer is enoguh, thanks.
>
> Regarding the nfs flag, how does it help NFS server ?

NFS occasionally needs to look up the name of a directory entry from its
inode. The --nfs flag creates a special index for that. Without this
index, the time required to process such a request is linear in the
number of directory entries (read: terrible). The flag increases the
size of the cached metadata (but not the amount of metadata that will be
uploaded) but has no adverse effects. I don't know how often an NFS
server performs this sort of query.
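If it helps to picture it, here is a rough SQLite sketch of the idea (the
table layout is simplified, it is not S3QL's actual schema): without an
index on the inode column, finding the name for a given inode is a full
scan; with it, it becomes a direct lookup.

    import sqlite3

    # Simplified directory-entry table, not S3QL's real schema.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE contents (parent_inode INT, name TEXT, inode INT)")
    db.executemany("INSERT INTO contents VALUES (1, ?, ?)",
                   (("msg%d" % i, 1000 + i) for i in range(100000)))

    # Without an index: SQLite scans every directory entry.
    print(db.execute("EXPLAIN QUERY PLAN SELECT name FROM contents "
                     "WHERE inode = 50000").fetchall())

    # What --nfs effectively adds: an index keyed on the inode.
    db.execute("CREATE INDEX ix_inode ON contents (inode)")
    print(db.execute("EXPLAIN QUERY PLAN SELECT name FROM contents "
                     "WHERE inode = 50000").fetchall())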


> I have been researching the list and I found something saying that for
> huge filesystem size, the netada can get too big. Would that be a bottle
> neck ? Would it be difficult to migrate the filesystem to another server
> ? What would be the consequences of this metada being too large ?

The metadata will grow, and will take longer and longer to upload. "Too
big" thus depends on how often you want to upload metadata, and how much
time and bandwidth you are willing to invest in metadata uploads. But as
far as S3QL is concerned, there is no hard limit on the metadata size.

"Migrating" the file system just involves unmounting it on one server
and mounting it on the other, i.e. it means uploading and downloading
the metadata once. An S3QL file system should be thought of as being
located in S3, not on the computer that has mounted it at a given point
in time.


Best,

-Nikolaus

--
»Time flies like an arrow, fruit flies like a Banana.«

Brad Knowles

Aug 29, 2012, 6:45:22 PM
to s3...@googlegroups.com, Brad Knowles
On Aug 29, 2012, at 5:25 PM, Nikolaus Rath <Niko...@rath.org> wrote:

> On 08/28/2012 02:08 PM, Brad Knowles wrote:
>> With regards to s3ql, Niklaus does the best he can in terms of reliability, but there's only so much he can do. Among other things, this is not a process that can be distributed easily, due to the fundamental nature of locks and locking mechanisms. And even then it's not enough reliability for a message store.
>
> Hmm. Did you mean performance rather than reliability here? How would
> distributing the workload help with reliability?

You are correct -- distributing the workload could potentially help with performance, depending on the implementation details. But distributing the workload would also be likely to have a significant negative impact on reliability.

Since a message store depends on reliability first and foremost, and performance second, this would not be a good trade-off to make.


Thank you for giving me the opportunity to clarify!

Nikolaus Rath

Aug 29, 2012, 10:11:26 PM
to s3...@googlegroups.com
On 08/28/2012 02:08 PM, Brad Knowles wrote:
> On Aug 27, 2012, at 3:09 PM, auada <rau...@gmail.com> wrote:
>
>> I need a place to store the maildir files of an email system. So, I need performance and also a safe system to be online 24/7 with millions of files (or more) distributed in a directory tree.
>
> I've designed and built some large-scale mail systems in the past (see <http://www.shub-internet.org/brad/papers/dihses/>, which was based on my experience at AOL and the largest ISP in Belgium), as well as some systems that are slightly smaller in scale (see <http://www.shub-internet.org/brad/papers/sistpni/>, which was based on experience at a customer in the Netherlands with ~3500 users at their HQ near Eindhoven and a total of ~7000 users world-wide). Let me give you my perspective on this issue.
>
[...]
>
> Therefore, S3 is just about the worst possible solution you could choose for the basis of a message store filesystem.
>
>
> Even if you could kit-bash something together on top of s3ql that would "work" (for certain values of "work"), it would always be something held together with spit, bailing wire, and mostly prayer. And that's just not something that you want to base a mail server on top of.

I think that's putting it a bit too strongly. I don't see any problems in
e.g. putting the mail spool for a 100-employee company on an S3QL
volume. All your points are valid, but handling e-mail is not a
particularly demanding job for a computer if it's not at the same scale
you've been working with.


That said, personally I would always put the spool on a regular file
system and set up an rsync job to back it up in S3QL. Even though I
don't see a problem with using S3QL directly for small servers, I don't
see any advantages either.

Brad Knowles

Aug 30, 2012, 10:43:43 AM
to s3...@googlegroups.com, Brad Knowles
On Aug 29, 2012, at 9:11 PM, Nikolaus Rath <Niko...@rath.org> wrote:

> That said, personally I would always put the spool on a regular file
> system and set up an rsync job to back it up in S3QL.

That's much more along the lines of what S3 is good at -- being an online backup solution.

Note that backups != archives, because Amazon has a different solution for archives -- called Glacier. To fully understand the difference between backups and archives, I recommend the writing of my friend Curtis Preston -- see <http://www.backupcentral.com/mr-backup-blog-mainmenu-47/13-mr-backup-blog/404-amazon-glacier-release.html>.

auada

Aug 30, 2012, 11:08:58 AM
to s3...@googlegroups.com


On Wednesday, August 29, 2012 at 7:32:19 PM UTC-3, Nikolaus Rath wrote:
On 08/28/2012 10:33 AM, auada wrote:
>
>
> On Tuesday, August 28, 2012 at 9:56:14 AM UTC-3, Nikolaus Rath
> wrote:
>
>     On 08/27/2012 09:12 PM, auada wrote:
>     > 1- regarding security and durability, if we use s3ql filesystem as a
>     > production resource, should we keep backups outside the
>     filesystem, or
>     > the snapshots would be enough ?
>
>     That's something only you can decide.
>
>     Since the release of S3QL 1.0, any reports of data loss where
>     restricted
>     to data that was in the process of being copied while either
>     computer or
>     network connection crashed. I have not had any reports of data being
>     lost after it was written and the metadata was uploaded. For my own
>     data, I use S3QL backups exclusively.
>
>
> I wonder if from a server crash the db can get corrupted and the data
> not recoverable, even with s3ql.fsck ... is that a situation that can
> happen ?

The local copy of the metadata may get corrupted beyond recovery, but
the last remote copy (which is uploaded periodically even if the file
system is never unmounted) is always available.

And this last copy, if not 100% in sync with the local metadata, will it be able to access all files?

>     > 2- would it be possible to sinchronize sql lite (or other db) among
>     > different instances and try to make multiple mounts of same file
>     system,
>     > or there is more to multiple mounts than only the db ? I ask,
>     because it
>     > is like a dream to be able to have multiple mounts and leave nfs
>     behind
>     > and get rid of a single point of failure.
>
>     This has been discussed a few times on the list, you may want to take a
>     look at the archives. The short answer is: no, it's not feasible.
>
>
> OK, short answer is enoguh, thanks.
>
> Regarding the nfs flag, how does it help NFS server ?

NFS occasionally needs to look up the name of a directory entry from its
inode. The --nfs flag creates a special index for that. Without this
index, the time required to process such a request is linear in the
number of directory entries (read: terrible). The flag increases the
size of the cached metadata (but not the amount of metadata that will be
uploaded) but has no adverse effects. I don't know how often an NFS
server performs this sort of query.



Great !
 
> I have been researching the list and I found something saying that for
> huge filesystem size, the netada can get too big. Would that be a bottle
> neck ? Would it be difficult to migrate the filesystem to another server
> ? What would be the consequences of this metada being too large ?

The metadata will grow, and will take longer and longer to upload. "Too
big" thus depends on how often you want to upload metadata, and how much
time and bandwith you are willing to invest in metadata uploads. But as
far as S3QL is concerned, there is no hard limit on the metadata size.


And can we (the user) decide the frequency of uploads? I haven't seen that in the manual, but I might have missed it.
And is there any relation, like a ratio between the size of the filesystem and the metadata? Any rule of thumb?

 
"Migrating" the file system just involves unmounting it on one server
and mounting it on the other, i.e. it means uploading and downloading
the metadata once. An S3QL file server should be thought of as being
located in S3, not on the computer that has mounted it at a given point
in time.


OK, great explanation.
 

Best,

   -Nikolaus

--
 »Time flies like an arrow, fruit flies like a Banana.«

auada

Aug 30, 2012, 11:14:41 AM
to s3...@googlegroups.com
The advantages we see are two:

- being able to remount the filesystem in another zone (AWS) without needing to build EBS disks from snapshots and then having to rsync the contents, etc. ... after a crash, for instance.
- no worry about filesystem size, since it can grow indefinitely

I would ask:

- if we build the cache on a local drive, but configure it to be huge, with many more file descriptors and a larger cache size, trying to reach e.g. 10 GBytes, would that bring a performance benefit? How do you see that?


 Regards,

Ricardo Auada

Brad Knowles

Aug 30, 2012, 1:11:30 PM
to auada, Brad Knowles, s3...@googlegroups.com
On Aug 30, 2012, at 10:17 AM, auada <rau...@gmail.com> wrote:

>> You are correct -- distributing the workload could potentially help with performance, depending on the implementation details. But distributing the workload would also be likely to be a significant negative impact on reliability.
> Brad, sorry for my ignorance, but how is that so ?

It gets back to the issue of the fundamental nature of directory and file locking. Locking is a hard enough problem to solve correctly when the problem is localized to a single place, but it becomes orders of magnitude more difficult when you try to do distributed locking.

Indeed, Distributed Lock Management is such a hard problem to solve in the field of Computer Science that it is almost as well known as the Dining Philosophers Problem. DLM is a key part of the more general issue of distributed computing, and Wikipedia has an interesting article at <http://en.wikipedia.org/wiki/Distributed_computing> that you might want to read. Various solutions to different aspects of DLM have been proposed, each of which has its own strengths and weaknesses.


There are few, if any, distributed/cluster filesystems out there that solve the DLM problem well enough that they could be used to support a general message store. Veritas VxFS is one such solution, but is damn bloody expensive.

For message stores that are appropriately designed, NFS can be a useful alternative, if you put them on suitably powered hardware with suitable local direct-attach or SAN-mounted disks.


But there are no distributed/cluster/network filesystem solutions I know of that would be suitable for use as a message store across a wide-area network, and with S3 everything is across the WAN. I'm not saying that VxFS or NFS can't be used across the WAN, just that they don't perform well enough across the WAN to be used underneath a message store.

If you were to locate the message store itself across the WAN (in close proximity to the disk storage system), then you would arrive at the same kind of situation that you have today with standard IMAP/SMTP clients accessing an ISP mailbox across the WAN.

Even if s3ql was to the point where it was a totally suitable filesystem when used with locally attached disk devices, it would still fall all over itself when the back-end is pushed across the WAN. The two key factors there are latency of key operations that occur frequently (getting back to those meta-data operations I was talking about earlier), and bandwidth.

Both of those factors kill you with S3 regardless of what solution you implement on top of it, because your message store cannot be located in close proximity to the S3 servers. Even if you could locate your IMAP message store in close proximity to the S3 servers (e.g., on AWS), the fundamental design of S3 is to optimize storage capacity at the expense of everything else, so even then S3 would not make a suitable back-end for a message store.


Due to the design of S3, it simply is not suitable for use as a back-end for a message store, and there's not really anything that s3ql or any other S3 front-end can do to fix that.

S3 is fine as part of an online backup solution, and s3ql brings a near native filesystem interface to S3 that makes it much more useable for a broad variety of tools that expect to operate on something that looks like a normal filesystem.

Brad Knowles

Aug 30, 2012, 2:28:07 PM
to auada, Brad Knowles, s3...@googlegroups.com
On Aug 30, 2012, at 12:11 PM, Brad Knowles <br...@shub-internet.org> wrote:

> S3 is fine as part of an online backup solution, and s3ql brings a near native filesystem interface to S3 that makes it much more useable for a broad variety of tools that expect to operate on something that looks like a normal filesystem.

Fundamentally, the truth is in the testing. The best tool I've found for simulating the kind of activity you get on a mail server is bonnie++ by Russ Coker. With the right options, bonnie++ is a decent way to simulate both the meta-data and raw data I/O operations that you tend to see in mail servers, and you can tune the parameters so that you get the same minimum/maximum/average I/O operation sizes as you can measure from your mail system.

So, get s3ql to the point where you can run a suitably configured bonnie++ benchmark on it, and compare that to what you would get with local disk.

Then build an EC2 system where you compare bonnie++ on EBS vs. bonnie++ on S3+s3ql -- that will give you the closest "Apples to Apples" comparison that you can get, and you will see just how much it costs you to use S3 even when your client is in close proximity to the storage.


The last time I tried to benchmark s3ql with bonnie++, I wasn't able to get anywhere. Maybe you'll have more luck than I did.
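If bonnie++ refuses to cooperate for you as well, a crude stand-in is a small script that just hammers the create/rename/delete pattern a maildir generates. The counts and paths below are made up and it is no substitute for a properly parameterized bonnie++ run, but it gives a first feel for the meta-data cost:

    import os, tempfile, time, uuid

    target = tempfile.mkdtemp()   # point this at the s3ql mountpoint to compare
    n = 1000
    start = time.time()
    for _ in range(n):
        name = uuid.uuid4().hex
        tmp = os.path.join(target, "tmp." + name)
        new = os.path.join(target, "new." + name)
        with open(tmp, "wb") as fh:
            fh.write(b"x" * 4096)   # roughly message-sized
        os.rename(tmp, new)         # the rename that maildir delivery relies on
        os.unlink(new)              # and the eventual expunge
    elapsed = time.time() - start
    print("%d create+rename+delete cycles in %.1fs (%.0f ops/s)"
          % (n, elapsed, n / elapsed))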

Nikolaus Rath

Aug 30, 2012, 5:29:50 PM
to s3...@googlegroups.com
On 08/30/2012 11:08 AM, auada wrote:
> > I wonder if from a server crash the db can get corrupted and the data
> > not recoverable, even with s3ql.fsck ... is that a situation that can
> > happen ?
>
> The local copy of the metadata may get corrupted beyond recovery, but
> the last remote copy (which is uploaded periodically even if the file
> system is never unmounted) is always available.
>
> And this last copy, if not updated 100 % with local metada, will be able
> to access all files ?

All files that were present at the time the metadata was uploaded and
that have not been deleted afterwards.


> > I have been researching the list and I found something saying that
> for
> > huge filesystem size, the netada can get too big. Would that be a
> bottle
> > neck ? Would it be difficult to migrate the filesystem to another
> server
> > ? What would be the consequences of this metada being too large ?
>
> The metadata will grow, and will take longer and longer to upload. "Too
> big" thus depends on how often you want to upload metadata, and how
> much
> time and bandwith you are willing to invest in metadata uploads. But as
> far as S3QL is concerned, there is no hard limit on the metadata size.
>
>
> And can we (user) decide the frequency of uploads ? I haven't seen that
> in the manual, but I might have missed it.

Look for --metadata-upload-interval in
http://www.rath.org/s3ql-docs/mount.html
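Something along these lines, wrapped in Python only for illustration; the storage URL and mountpoint are placeholders, and the value is given in seconds (here six hours):

    import subprocess

    # Placeholder bucket and mountpoint; 21600 seconds = upload metadata
    # every six hours.
    subprocess.run(["mount.s3ql",
                    "--metadata-upload-interval", "21600",
                    "s3://my-mail-bucket", "/mnt/mailstore"],
                   check=True)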

> And is there any relation like, a ratio between the size of the
> filesystem and the metada ? Any rule of thumb ?


I haven't looked into that. If you decide to investigate that, please
share your results :-).



Best,

-Nikolaus

--
»Time flies like an arrow, fruit flies like a Banana.«

Nikolaus Rath

Aug 30, 2012, 5:35:50 PM
to s3...@googlegroups.com
On 08/30/2012 11:14 AM, auada wrote:
> - if we build a cache in a local driver, but configure it to grow huge,
> with many more file descriptors and cache size, like trying to reach 10
> GBytes, would that bring performance benefit ? How do you see that ?

That depends on your access pattern. If you previously accessed more
files simultaneously than you used descriptors, then increasing the
number of descriptors will give you a dramatic performance increase.

Same with cache size. If you simultaneously access more blocks than you
have cache, increasing cache size will give a dramatic performance
increase. Otherwise you'll not gain anything.

I think the thing that would help you most is to simply set up a server
and test.

auada

Aug 30, 2012, 5:44:17 PM
to s3...@googlegroups.com


On Thursday, August 30, 2012 at 6:29:52 PM UTC-3, Nikolaus Rath wrote:
On 08/30/2012 11:08 AM, auada wrote:
>     > I wonder if from a server crash the db can get corrupted and the data
>     > not recoverable, even with s3ql.fsck ... is that a situation that can
>     > happen ?
>
>     The local copy of the metadata may get corrupted beyond recovery, but
>     the last remote copy (which is uploaded periodically even if the file
>     system is never unmounted) is always available.
>
> And this last copy, if not updated 100 % with local metada, will be able
> to access all files ?

All files that were present at the time the metadata was uploaded and
that have not been deleted afterwards.


hmmm ... so I could lose data if some crash occurs ... Or should I try an fsck on the local metadata first?
 

>     > I have been researching the list and I found something saying that
>     for
>     > huge filesystem size, the netada can get too big. Would that be a
>     bottle
>     > neck ? Would it be difficult to migrate the filesystem to another
>     server
>     > ? What would be the consequences of this metada being too large ?
>
>     The metadata will grow, and will take longer and longer to upload. "Too
>     big" thus depends on how often you want to upload metadata, and how
>     much
>     time and bandwith you are willing to invest in metadata uploads. But as
>     far as S3QL is concerned, there is no hard limit on the metadata size.
>
>
> And can we (user) decide the frequency of uploads ? I haven't seen that
> in the manual, but I might have missed it.

Look for --metadata-upload-interval in
http://www.rath.org/s3ql-docs/mount.html

Great. I really did miss it.
 
> And is there any relation like, a ratio between the size of the
> filesystem and the metada ? Any rule of thumb ?


I haven't looked into that. If you decide to investigate that, please
share your results :-).



Alright! We'll do it.

Regards,
Ricardo Auada 

Best,

   -Nikolaus

--
 »Time flies like an arrow, fruit flies like a Banana.«

auada

Aug 30, 2012, 5:45:27 PM
to s3...@googlegroups.com


On Thursday, August 30, 2012 at 6:35:51 PM UTC-3, Nikolaus Rath wrote:
On 08/30/2012 11:14 AM, auada wrote:
> - if we build a cache in a local driver, but configure it to grow huge,
> with many more file descriptors and cache size, like trying to reach 10
> GBytes, would that bring performance benefit ?  How do you see that ?

That depends on your access pattern. If you previously accessed more
files simultaneously than you used descriptors, then increasing the
number of descriptors will give you a dramatic performance increase.

Same with cache size. If you simultaneously access more blocks than you
have cache, increasing cache size will give a dramatic performance
increase. Otherwise you'll not gain anything.

I think the thing that would help you most is to simply set up a server
and test.

Yes, we'll definitely do it.

Regards,
Ricardo Auada
 
Best,

   -Nikolaus

--
 »Time flies like an arrow, fruit flies like a Banana.«

Brad Knowles

Aug 30, 2012, 6:05:53 PM
to auada, Brad Knowles, s3...@googlegroups.com
On Aug 30, 2012, at 4:41 PM, auada <rau...@gmail.com> wrote:

> Brad, when you mean lock, do you mean, like in a MBOX mailbox ? Because Maildir mailbox usually don´t need locks. Please, consider my ignorance.

That's file-based locking, and you are correct that Maildir and Maildir+ don't typically use file-based locking. They depend on other filesystem semantics to provide atomic operations that theoretically allow them to avoid file-based locking.
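To make that concrete, maildir delivery boils down to roughly this (the paths are made up, and this is a sketch of the convention, not any particular MDA's code):

    import os, socket, time, uuid

    maildir = "/var/mail/example/Maildir"   # hypothetical mailbox
    unique = "%d.%d.%s.%s" % (time.time(), os.getpid(),
                              socket.gethostname(), uuid.uuid4().hex)
    tmp_path = os.path.join(maildir, "tmp", unique)
    new_path = os.path.join(maildir, "new", unique)

    with open(tmp_path, "wb") as fh:
        fh.write(b"From: a@example.com\r\n\r\nhello\r\n")
        fh.flush()
        os.fsync(fh.fileno())        # make sure the message data is on disk

    os.rename(tmp_path, new_path)    # atomic on POSIX: readers never see a partial file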

But whenever you do an operation in a filesystem, there is a whole 'nother level of locking that is going on -- that's for meta-data operations. When you create a file, the entire directory has to be locked against modifications by any other process, then the metadata for that file can be updated, and then the directory can be unlocked. All this happens at a level below file-based locking, and is internal to the filesystem itself.

This directory-level locking happens every time you create a file, delete a file, change the name for a file, or move a file from one directory to another (in which case two separate directories have to be locked for the entire time the operation is in progress).


Now, s3ql implements its own filesystem and meta-data on top of S3, and for handling meta-data operations I believe that it currently uses SQLite. But SQLite has to implement its own locking operations internally when performing these same kinds of operations.

So, even if s3ql is not itself doing explicit locking when doing meta-data operations, SQLite is certainly doing either row or table locking internally when it gets called by s3ql. So, you get the same effect.


Of course, in a mail system, you tend to get lots and lots of tiny files that get created, deleted, or renamed quite frequently. Maildir is particularly bad at doing an excessive number of these types of meta-data operations due to unnecessary file renaming, etc.... Maildir is also well known for using filenames that are excessively long, which completely blows the kernel-level inode caching algorithms that are found in most OSes.

Maildir+ manages to solve some of these issues relative to Maildir, but not all of them.


So, first thing to do is to get an s3ql installation to the point where you can reliably benchmark it with bonnie++, and make sure that you use benchmark parameters that are reasonably close to what you see in the real world with a mail server running on a local filesystem.

Once you get bonnie++ working on s3ql, then you can try getting a real mail server running on it. I'd recommend Dovecot instead of Courier-IMAP, if only because Dovecot supports Maildir+ and I believe that Courier-IMAP only does plain Maildir.

Nikolaus Rath

Aug 30, 2012, 6:21:02 PM
to s3...@googlegroups.com
On 08/30/2012 05:44 PM, auada wrote:
>
>
> Em quinta-feira, 30 de agosto de 2012 18h29min52s UTC-3, Nikolaus Rath
> escreveu:
>
> On 08/30/2012 11:08 AM, auada wrote:
> > > I wonder if from a server crash the db can get corrupted and
> the data
> > > not recoverable, even with s3ql.fsck ... is that a situation
> that can
> > > happen ?
> >
> > The local copy of the metadata may get corrupted beyond
> recovery, but
> > the last remote copy (which is uploaded periodically even if
> the file
> > system is never unmounted) is always available.
> >
> > And this last copy, if not updated 100 % with local metada, will
> be able
> > to access all files ?
>
> All files that were present at the time the metadata was uploaded and
> that have not been deleted afterwards.
>
>
> hmmm ... so I could lose data, if some crash occurs ...
> Or I would try
> an fsck on the local metada before ?

Yes, you'd of course only use the remote backup if the local metadata is
truly beyond recovery. If you're really worried about it, you can also
tell S3QL/SQLite to fsync() at crucial moments (by editing a .py file,
it's not a command line option). Your metadata will then always be in a
consistent state, no matter when a crash occurs, but there is a
performance penalty for that.
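In plain SQLite terms, the kind of knob involved is the synchronous pragma; a generic sketch (not the actual S3QL code):

    import sqlite3

    db = sqlite3.connect("sketch.db")          # hypothetical metadata file
    db.execute("PRAGMA synchronous = FULL")    # fsync() at the critical moments
    db.execute("CREATE TABLE IF NOT EXISTS t (x)")
    db.execute("INSERT INTO t VALUES (1)")
    db.commit()                                # with FULL, this commit is flushed to disk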



Best,

-Nikolaus

--
»Time flies like an arrow, fruit flies like a Banana.«

Nikolaus Rath

Aug 30, 2012, 6:32:17 PM
to s3...@googlegroups.com
On 08/30/2012 06:05 PM, Brad Knowles wrote:
> On Aug 30, 2012, at 4:41 PM, auada <rau...@gmail.com> wrote:
>
>> Brad, when you mean lock, do you mean, like in a MBOX mailbox ? Because Maildir mailbox usually don´t need locks. Please, consider my ignorance.
>
> That's file-based locking, and you are correct that Maildir and Maildir+ don't typically use file-based locking. They depend on other filesystem semantics to provide atomic operations that theoretically allow them to avoid file-based locking.
>
> But whenever you do an operation in a filesystem, there is a whole 'nother level of locking that is going on -- that's for meta-data operations. When you create a file, the entire directory has to be locked against modifications by any other process, then the metadata for that file can be updated, and then the directory can be unlocked. All this happens at a level below file-based locking, and is internal to the filesystem itself.
>
> This directory-level locking happens every time you create a file, delete a file, change the name for a file, or move a file from one directory to another (in which case two separate directories have to be locked for the entire time the operation is in progress).

Actually most of this happens in the VFS, not the file system.


> Now, s3ql implements its own filesystem and meta-data on top of S3, and for handling meta-data operations I believe that it currently uses SQLite. But SQLite has to implement its own locking operations internally when performing these same kinds of operations.
>
> So, even if s3ql is not itself doing explicit locking when doing meta-data operations, SQLite is certainly doing either row or table locking internally when it gets called by s3ql. So, you get the same effect.

That's not quite correct. S3QL actually disables all SQLite locking
(more precisely, it grabs the db lock when it starts and only releases
it on unmount) and relies on the VFS for directory entry locking. We can
get away with that because the metadata is only accessed by the S3QL
process.

As far as locking is concerned, S3QL is therefore as fast as any
in-kernel file system.
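The pattern is roughly the following (a toy sketch, not the actual S3QL code): open the database, take the exclusive lock once, and hold it until unmount, so individual operations never pay a locking cost.

    import sqlite3

    db = sqlite3.connect("metadata.db", isolation_level=None)  # hypothetical file name
    db.execute("BEGIN EXCLUSIVE")     # grab the database lock once, at "mount"

    # ... every file-system operation now touches the DB without further locking ...
    db.execute("CREATE TABLE IF NOT EXISTS inodes (id INTEGER PRIMARY KEY, size INT)")
    db.execute("INSERT INTO inodes (size) VALUES (0)")

    db.execute("COMMIT")              # released only at "unmount"
    db.close()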


In practice, S3QL is much slower for a different reason: since it is
written in Python, it runs single-threaded most of the time. Real
concurrency only happens for a well-defined set of operations that
are considered potentially slow (compression, encryption, and
block/network I/O).

David Harrison

Aug 30, 2012, 8:43:02 PM
to s3...@googlegroups.com
can we maybe see an fsync option available in the next release? ;)


 


auada

Aug 30, 2012, 8:48:59 PM
to s3...@googlegroups.com
Nikolaus, in the case of losing the metadata beyond recovery and having to use an online version that is older than the current one, we would not be able to see the most up-to-date data, correct? And what would happen to the data that is not visible anymore? Would it be removed from S3, or would it rest there, lost forever?

Regards,
Ricardo Auada 

auada

Aug 30, 2012, 8:49:44 PM
to s3...@googlegroups.com
WOW, that sounds great ! 

Nikolaus Rath

Aug 30, 2012, 9:25:19 PM
to s3...@googlegroups.com
On 08/30/2012 08:48 PM, auada wrote:
>
> Nikolaus, in the case of losing the metadata beyond recovery and having
> to use one online version, that is older that the actual, we would not
> be able to see the most up to date data, correct ?

Yes.

> And what would happen
> to this data that is not visible anymore ? Would it be removed from the
> S3 or would rest there, lost forever ?

It would pop up in /lost+found with a weird file name after running fsck.

Ricardo Buchalla Auada

Aug 28, 2012, 3:58:35 PM
to s3...@googlegroups.com
Brad, thanks a lot for your great explanation.

Does it apply to web hosting as well, not considering DBs of course, only web code?

Regards,
-- 
Ricardo Buchalla Auada

Brad Knowles

Sep 1, 2012, 10:11:30 AM
to s3...@googlegroups.com, Brad Knowles
On Aug 28, 2012, at 2:58 PM, Ricardo Buchalla Auada <rau...@gmail.com> wrote:

> Does it apply for web hosting as well, not considering DB´s of course, only web codes ?

Latency and bandwidth between your server and S3 are always a potential issue to be concerned about. Only you can decide whether or not that is actually a problem for you, or if what you get can perform to a level that you find to be acceptable.

Even if your server is located in "close proximity" to S3 (i.e., you have one or more EC2 instances that are mounting your S3 storage and providing whatever services you want), that could still be an issue to be concerned about.

For my part, all I can do is try to give you some insight as to potential hidden complications due to implementation details of the service you're considering.


For a low-bandwidth server that is not frequently accessed, but which does have significant storage requirements (e.g., a private picture archive site just for you and your family), the potential performance issues may actually not pose a significant problem for you. That's one of those things where you'd just have to put something up and try it.

Personally, I have over 50,000 pictures that my wife and I have taken over the years, and even semi-pro local media management tools like Aperture have a hard time dealing with that many pictures. And then we have many gigabytes of video we've recorded. If I wanted to store all of that online and make it available to our friends and family, but I didn't want to use a service like Flickr or YouTube, then building my own S3-backed media server might actually be something that I could consider.

On the other hand, a small server with a terabyte of storage is not that expensive, and it would be a lot easier to set that up as a media server of this sort. A Dynamic DNS service would allow me to host that at my house, and then I wouldn't have to worry about trying to find a co-location provider.


I still think that using S3 as a method of providing primary storage is not a great idea, pretty much regardless of the application. I believe that it was designed as a secondary storage solution (including backups), and that it works reasonably well in that application. I also believe that s3ql is a good interface to S3 to give the community a simpler way to do things like online backups through existing standard filesystem-based tools.

But neither S3 nor s3ql was designed to be used as primary storage.