Assorted questions related to new deployment

Vladimir Mosgalin

Sep 7, 2017, 3:22:22 PM
to LeoProject.LeoFS
Hello everybody,

First, I'd like to thank LeoFS developers for this product. It's working without problems in our dev environment for a few months and production testing went fine as well, so we are about to launch real production and start data migration to LeoFS cluster. When preparing production setup I got a few various questions, since they are so random I decided to post them here instead of github.

1) Using DNS node names vs. IPs: do nodes resolve the names on start and just use IPs internally after that (i.e. there will be no problems if DNS stops working for a moment once all nodes are already launched)? Or will they resolve the names each time a command like "leofs-adm status <node>" is executed?

2) Amount of AVS files and backend_db.eleveldb.max_open_files parameter.
We are using storage nodes with 4 drives, 6 TB each; I plan to set 64 AVS files per drive to keep each file under 100 GB (though right now, after moving all data to the cluster, they'll probably be around 50 GB each). That's 4*64 databases, each requiring 8 (or maybe 6?) open handles; that means backend_db.eleveldb.max_open_files definitely needs to be increased so it's over 1000, right? Or am I misunderstanding this setting?

Size of metadata: in our case it's 420-440 bytes per object; according to my calculations, with 64 AVS files per drive that's 263 MB per metadata directory (of 64) right now, but up to 530 MB in theory when the servers are more filled with data. Is that OK (considering that it's stored on HDD, the R:W ratio is about 5:1 and there are almost no updates / deletes), or would it be better to use 128 AVS files per drive?
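
For reference, my back-of-the-envelope arithmetic for both the handle count and the metadata volume (the handles-per-leveldb-instance figure is just my guess):

```sh
# 4 drives x 64 AVS containers = 256 metadata leveldb instances per node;
# assuming ~8 open files each, that is well above a 1000-ish limit.
echo $((4 * 64))        # 256 leveldb instances
echo $((4 * 64 * 8))    # 2048 handles in the worst case
# Metadata volume per container, at ~430 bytes per object:
echo $((263 * 1024 * 1024 / 430))   # ~640k objects per container now
echo $((530 * 1024 * 1024 / 430))   # ~1.3M objects per container when the drives fill up
```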

3) For this configuration (4 drives per node, each having 64 or maybe 128 AVS files) should num_of_vnodes be increased, or is the default value fine?

4) For manual compaction with "compact-start", I don't quite understand the "num-of-targets" option. First of all, what's the point in specifying anything but "all" there if you can't specify which AVS files exactly would be compacted? I mean, when executing "leofs-adm compact-start 5" - which 5 would those be? Now, I've noticed that when I wait until compaction of those 5 is over and then execute the same command again, the next set of 5 is somehow picked instead of re-compacting the same ones - but how does that work?

5) On new clusters, there is user "_test_leofs" with admin rights (!) by default. Why? Should it be removed for security reasons (or at least stripped of admin rights)? Isn't it a bad idea that it's created by default even on systems not set up for testing?

6) What do "admin rights" mean for a user, anyway?

7) In "create-user" command, what is "password" argument for? There is also "update-user-password" which kind of works but doesn't seem to have any visible effect.

8) leo_gateway generates lots of lines (256 in my case) like this upon startup (in erlang.log):
slab class   1: chunk size       264 perslab   31775
slab class   2: chunk size       528 perslab   15887
slab class   3: chunk size      1056 perslab    7943
slab class   4: chunk size      2112 perslab    3971
slab class   5: chunk size      4224 perslab    1985
slab class   6: chunk size      8448 perslab     992
slab class   7: chunk size     16896 perslab     496
slab class   8: chunk size     33792 perslab     248
slab class   9: chunk size     67584 perslab     124
slab class  10: chunk size    135168 perslab      62
slab class  11: chunk size    270336 perslab      31
slab class  12: chunk size    540672 perslab      15
slab class  13: chunk size   1081344 perslab       7
slab class  14: chunk size   2162688 perslab       3
slab class  15: chunk size   8388608 perslab       1
ps:0x7ff1a401c570

Is some memory debug option enabled by default? It doesn't happen for other node types.

9) When building the LeoFS package using an Erlang package from erlang-solutions.com, those packages have HiPE enabled. Official LeoFS packages are built with an Erlang runtime without HiPE, and README.md in the repo recommends the --disable-hipe option when building Erlang. Is a HiPE-enabled runtime a problem that should be avoided, or does it not matter? (At the very least, I haven't noticed any problems running LeoFS with Erlang from erlang-solutions.com in my testing.)

10) This question is about a future project unrelated to the current cluster. Some of our services (which are running in docker) actively write certain objects which are only needed for a short time. Actually, most of them fall into the "WOR(A)N" (write once - read (almost) never) category; there are lots of these objects (e.g. 50 GB per day) and they are on a TTL: all objects which are > 7 days old are removed.

Currently the storage for that is an NFS share and the TTL is implemented in a simple way: objects are stored under paths starting with the date the object was created, e.g.
2017-09-01/...
2017-09-02/...

and so on. A script removes old directories (effectively wiping all objects that are too old).
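
The cleanup on the NFS side is essentially a one-liner along these lines (path and retention are illustrative):

```sh
# Remove date-named directories older than 7 days (the path is just an example)
find /mnt/objects -maxdepth 1 -type d -name '20??-??-??' -mtime +7 -exec rm -rf {} +
```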

NFS+docker is a bit of a problematic combo, so we'd like to replace it with LeoFS. However, since removing objects by TTL currently isn't supported, we had the idea of using buckets for that: create buckets "2017-09-01", "2017-09-02" and so on, put objects into the bucket matching the day of object creation, and every day remove the bucket that has become a few days old. The plan is to do it after https://github.com/leo-project/leofs/issues/725 is fixed and there are no known issues with deleting buckets, at least (because bucket removal will be done by a script in an unattended way; see the sketch below). Now the question is: will this work as planned, or are there some obvious reasons why it won't? For example, would creating buckets for a whole year (or a few years) in advance cause some problems, or is there something else?
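
The unattended rotation would be roughly like this sketch (it assumes GNU date and an s3cmd already configured against the LeoFS gateway, and that deleting a bucket also removes its objects once the issue above is fixed):

```sh
#!/bin/sh
RETENTION_DAYS=7

# Create today's bucket (YYYY-MM-DD) if it doesn't exist yet.
TODAY=$(date +%Y-%m-%d)
s3cmd mb "s3://${TODAY}" 2>/dev/null || true

# Drop the bucket that has just fallen out of the retention window.
EXPIRED=$(date -d "${RETENTION_DAYS} days ago" +%Y-%m-%d)
s3cmd rb --recursive --force "s3://${EXPIRED}"
```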

yoshiyuki kanno

Sep 12, 2017, 3:49:41 AM
to Vladimir Mosgalin, LeoProject.LeoFS
Hi Vladimir,

> First, I'd like to thank LeoFS developers for this product. It's working without problems in our dev environment for a few months and production testing went fine as well, so we are about to launch real production and start data migration to LeoFS cluster. When preparing production setup I got a few various questions, since they are so random I decided to post them here instead of github.

Great to hear that. We will keep helping you bring LeoFS into
production as much as we can :)

> 1) Using DNS node names vs. IPs? Do nodes resolve the names on start and just use IPs internally after that (i.e. there will be no problems if DNS doesn't work for a some moment if all nodes are already launched)? Or they will resolve each time commands like "leofs-adm status <node>" is executed?

The timing to resolve the names depends on the erlang runtime (how
distributed erlang is implemented on the name resolving) and IIRC it
tries to resolve the name each time when the RPC(remote procedure
call) is invoked (I will check when I can spare time).

> 2) Amount of AVS files and backend_db.eleveldb.max_open_files parameter.
> We are using storage nodes with 4 drives, 6 TB each; I plan to set 64 AVS files per drive to keep each file under 100 GB (though right now after moving all data to the cluster they'll be probably like 50 GB each). That's 4*64 databases, each requires 8 (or maybe 6?) open handles; that means that backend_db.eleveldb.max_open_files definitely needs to be increased so it's over 1000, right? Or I'm misunderstanding this setting?

I've checked the latest implementation of the leveldb fork maintained
by basho and noticed that max_open_files seems to be no longer used,
according to
https://github.com/basho/leveldb/blob/44cb7cbc85590280c9a73856470d5880f4015927/include/leveldb/options.h#L126-L131
So you shouldn't need to increase that param.

> Size of metadata: in our case it's 420-440 bytes per object, according to my calculations with 64 AVS files per drive it will be 263 MB per each (of 64) metadata directory right now but up to 530 MB in theory when the servers are more filled with data. Is that OK (considering that it's stored on HDD, R:W ratio is about 5:1 and there are almost no updates / deletes) or it would be better to use 128 AVS files per drive?

Yes, that should be OK.

> 3) For this configuration (4 drives per node, each having 64 or maybe 128 AVS files) should num_of_vnodes be increased or default value is fine?

The optimal num_of_vnodes is not affected by the parameters related to
AVS (how many AVS files/directories are there on the host) but
affected by the number of nodes in your cluster and also how many
objects are supposed to be stored on the cluster.
@yosukehara will teach you the actual calculator to estimate the
optimal num_of_vnodes later on.

> 4) For manual compaction with "compact-start", I don't quite understand "num-of-targets" option. First of all, what's the point in specifying anything but "all" there if you can't specify which AVS files exactly would be compacted? I mean, when executing "leofs-adm compact-start 5" - which 5 would these be? Now, I've seen that somehow when I wait till compaction of these 5 is over and then execute the same command again, the next set of 5 would be picked instead of re-compacting the same ones - but how does it work?

The next set of 5 is picked as long as leo_storage keeps running (the
offset of where to pick up next is reset when leo_storage goes down,
since that information is stored only in memory).
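
If you want to see which containers a partial run actually picks,
checking from the admin CLI before and after makes it visible, e.g.
(a rough sketch; substitute your storage node name):

```sh
leofs-adm compact-start <storage-node> 5   # compacts the next 5 containers
leofs-adm compact-status <storage-node>    # shows how many targets are pending/ongoing/finished
leofs-adm du detail <storage-node>         # per-AVS-file statistics on the node
```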

> 5) On new clusters, there is user "_test_leofs" with admin rights (!) by default. Why? Should it be removed for security reasons (or at least stripped of admin rights)? Isn't it a bad idea that it's created by default even on systems not set up for testing?

You are right. Filed at https://github.com/leo-project/leofs/issues/825

> 6) What do "admin rights" mean for user, anyway?

They had been used by the web admin UI for LeoFS, leo_center
(https://github.com/leo-project/leo_center), to provide the login
functionality; however, it's outdated (not actively maintained now).

> 7) In "create-user" command, what is "password" argument for? There is also "update-user-password" which kind of works but doesn't seem to have any visible effect.

The password is for logging in to leo_center.

> 8) leo_gateway generates lots of lines (256 in my case) like this upon startup (in erlang.log)
> Some memory debug option is enabled by default? It doesn't happen for other node types.

Currently the debug information for off-heap memory is always output
on leo_gateway (it should be suppressed). Filed at
https://github.com/leo-project/leofs/issues/824

> 9) When building LeoFS package using Erlang package from erlang-solutions.com, these packages have HiPE enabled. Official LeoFS packages are build with Erlang runtime without HiPE, also README.md in repo recommends --disable-hipe options when building Erlang. Is HiPE-enabled runtime a problem and should be disabled, or it doesn't matter? (at very least, I haven't noticed any problems running LeoFS using Erlang from erlang-solutions.com in my testing)

The performance with HiPE enabled was somewhat lower than with HiPE
disabled when we benchmarked it around 2 years ago; however, that
might have changed, so we are going to validate whether it is still
true on OTP 19.x. Filed at https://github.com/leo-project/leofs/issues/826

> 10) This question is about future project unrelated to current cluster. Some of our services (which are running in docker) actively write certain objects, which are only needed for a short time. Actually, most of them fall into "WOR(A)N" (write once - read (almost) never) category; there are lots of these objects (e.g. 50 GB per day) and they are on TTL, all objects which are > 7 days old are removed.
> NFS+docker is a bit of problematic combo, so we'd like to replace it with LeoFS. However, since removing objects by TTL currently isn't supported, so we had an idea of using buckets for that. Like, create buckets "2017-09-01", "2017-09-02" and so on and put objects to the bucket according to the day of object creation, and every day remove bucket which is few days old. The plan is to do it after https://github.com/leo-project/leofs/issues/725 is fixed and there are no known issues with deleting buckets, at least (because bucket removal will be done by script in unattended way). Now the question is - will this work as planned or there are some obvious reasons why it won't work? Like, creating buckets for whole year (or few years) in advance would cause some problems, or something else maybe.

It should work as planned, so please file an issue if you find
something wrong/unexpected with this use case (TTL implemented via
buckets named by date).

Best,
Kanno.



--
Yoshiyuki Kanno
LeoFS Committer(http://leo-project.net/leofs/index.html)

Vladimir Mosgalin

Sep 12, 2017, 2:15:56 PM
to LeoProject.LeoFS
On Tuesday, September 12, 2017 at 10:49:41 UTC+3, mocchira wrote:

Hello Kanno,

Thanks for answering and filling the issues on github.

> 1) Using DNS node names vs. IPs? Do nodes resolve the names on start and just use IPs internally after that (i.e. there will be no problems if DNS doesn't work for a some moment if all nodes are already launched)? Or they will resolve each time commands like "leofs-adm status <node>" is executed?

> The timing to resolve the names depends on the erlang runtime (how
> distributed erlang is implemented on the name resolving) and IIRC it
> tries to resolve the name each time when the RPC(remote procedure
> call) is invoked (I will check when I can spare time).

OK, this doesn't sound good if that is the case; better to stick to IPs then.
 
> 3) For this configuration (4 drives per node, each having 64 or maybe 128 AVS files) should num_of_vnodes be increased or default value is fine?

> The optimal num_of_vnodes is not affected by the parameters related to
> AVS (how many AVS files/directories are there on the host) but
> affected by the number of nodes in your cluster and also how many
> objects are supposed to be stored on the cluster.
> @yosukehara will teach you the actual calculator to estimate the
> optimal num_of_vnodes later on.

I see. These parameters aren't anything extraordinary (6 storage nodes, 350M-something objects, planned to grow a few times from that), so it's probably fine. I was worried that this parameter would be affected by the amount or total size of AVS files on the node, which are a bit extreme (but we calculated that we get the performance we need even with drives this huge, and using smaller drives would have required more nodes, which can increase cost quite a bit).
 
> 6) What do "admin rights" mean for user, anyway?

> They had been used by the web admin UI for LeoFS, leo_center
> (https://github.com/leo-project/leo_center), to provide the login
> functionality; however, it's outdated (not actively maintained now).

Ah, I see. I've tried it out before; it looks kind of nice, but unfortunately it couldn't replace properly configured monitoring, and once we were gathering all kinds of statistics from the nodes and feeding them into monitoring / configuring alerts (even though that's much more work), I couldn't see much purpose in it.
 
> The performance with HiPE enabled was somewhat lower than with HiPE
> disabled when we benchmarked it around 2 years ago; however, that
> might have changed, so we are going to validate whether it is still
> true on OTP 19.x. Filed at https://github.com/leo-project/leofs/issues/826

OK. Well, according to the report linked from that ticket it's about the same or even slightly better with HiPE, I think (latency-wise).
 
> 10) This question is about future project unrelated to current cluster. Some of our services (which are running in docker) actively write certain objects, which are only needed for a short time. Actually, most of them fall into "WOR(A)N" (write once - read (almost) never) category; there are lots of these objects (e.g. 50 GB per day) and they are on TTL, all objects which are > 7 days old are removed.
> NFS+docker is a bit of problematic combo, so we'd like to replace it with LeoFS. However, since removing objects by TTL currently isn't supported, so we had an idea of using buckets for that. Like, create buckets "2017-09-01", "2017-09-02" and so on and put objects to the bucket according to the day of object creation, and every day remove bucket which is few days old. The plan is to do it after https://github.com/leo-project/leofs/issues/725 is fixed and there are no known issues with deleting buckets, at least (because bucket removal will be done by script in unattended way). Now the question is - will this work as planned or there are some obvious reasons why it won't work? Like, creating buckets for whole year (or few years) in advance would cause some problems, or something else maybe.

> It should work as planned, so please file an issue if you find
> something wrong/unexpected with this use case (TTL implemented via
> buckets named by date).

OK, great, thanks.

yoshiyuki kanno

Sep 13, 2017, 2:40:38 AM
to Vladimir Mosgalin, LeoProject.LeoFS
Hello Vladimir,

>> The timing to resolve the names depends on the erlang runtime (how
>> distributed erlang is implemented on the name resolving) and IIRC it
>> tries to resolve the name each time when the RPC(remote procedure
>> call) is invoked (I will check when I can spare time).
>OK, this doesn't sound good if it is the case, better to stick to IPs then.

As I could spare some time, I vetted this topic in detail and it
turned out to be my paramnesia!
There are numerous name lookup methods implemented in Erlang; the
default one uses a syscall (getaddrinfo) in a separate process named
*inet_gethost*, and it seems all names used in RPC calls get looked
up only once. Subsequent lookups never happened on my dev-box, so
those names seem to be cached somewhere in the Erlang runtime. That
said, it might be better to stick with DNS.

Just in case, I will share how to check whether the DNS lookup
actually happens on your dev-box.

```diff
diff --git a/rel/common/launch.sh b/rel/common/launch.sh
index 04861e7..da24323 100755
--- a/rel/common/launch.sh
+++ b/rel/common/launch.sh
@@ -337,6 +337,8 @@ case "$1" in
export BINDIR
export PROGNAME

+ export ERL_INET_GETHOST_DEBUG=5
+
# Dump environment info for logging purposes
echo "Exec: $CMD"
echo "Root: $ROOTDIR"
```

With this hack, a log like the one below appears in erlang.log when
getaddrinfo is invoked.

```
inet_gethost[17995] (DEBUG):Saved domainname .localhost.
inet_gethost[17996] (DEBUG):Worker got request, op = 1, proto = 1,
data = teste.localhost.
inet_gethost[17996] (DEBUG):Starting gethostbyname(teste.localhost)
inet_gethost[17996] (DEBUG):gethostbyname OK
```

At this point, I'm still not sure how long the cache entries remain,
or whether there is some way to configure this behavior (TTL,
disabling the cache mechanism and so on).
I will share more as I get further info through my investigation.

> I see. These parameters aren't anything extraordinary (6 storage nodes, 350M-something objects, planned to grow a few times from that) so it's probably fine. I was worried that this parameter would be affected by the amount or total size of AVS files on the node which are a bit extreme (but we calculated that we are getting performance we need even when using drives that huge, and using smaller drives would've require more nodes which can increase cost quite a bit).

Got it.
AFAIK, with your planned amount of data, the default (168) should be fine.
Anyway, @yosukehara will drop the calculator here or maybe write it
down in the docs.
Please wait for a while.

Best,
Kanno.

yoshiyuki kanno

Sep 14, 2017, 3:34:19 AM
to Vladimir Mosgalin, LeoProject.LeoFS
Hi Vladimir,

> At this point, I'm still not sure how long the cache entries remain,
> or whether there is some way to configure this behavior (TTL,
> disabling the cache mechanism and so on).
> I will share more as I get further info through my investigation.

I've finally understood how it works and realized this question could
be a FAQ, so I decided to write it up in the docs.

PR: https://github.com/leo-project/leofs/pull/830
Preview: https://mocchira.github.io/leofs/faq/administration/#when-will-dns-lookups-for-nodes-in-a-cluster-happen

Please check the preview, which is supposed to become part of the
official documentation, for more detail.

In short, there is no cache; the persistent (kept-alive) connections
between nodes in a cluster do the trick.
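
If you want to confirm it on your own boxes, something like the
following should show the long-lived distribution connections (a
rough sketch; process and port details will differ):

```sh
epmd -names               # Erlang nodes registered on this host and their listening ports
ss -tnp | grep beam.smp   # established TCP connections held by the Erlang VMs
```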

Let me know if you have any questions or suggestions for the doc.

Best,
Kanno.