User-level sharding


Robert Norris

Mar 20, 2015, 1:54:55 AM3/20/15
to proso...@googlegroups.com
Hi all,
 
I've got a prototype for a Prosody feature that I want/need. I'm keen to get your input on it.
 
What I want to do is distribute users for a single domain across multiple Prosody instances while having them look like a single server on the outside. The reasons for this are partly to do with load - I'm preparing for a time where there are thousands of users online - but also to fit neatly into our existing infrastructure, where we spread email users across multiple servers. I like the idea of mirroring our email architecture and doing redundancy the same way.
 
Note that I'm expressly not talking about real live clustering, where a server can fail and the user remains connected. A Prosody instance (or its underlying machine) can fail and the sessions will be disconnected. That's totally fine with me - most clients will quietly reconnect without issue.
 
It seems that all that's necessary to make this happen is to hook into the stanza delivery code. If we have a session for a user on the current server, we deliver it to them. If not, we do a lookup (via an external daemon) to see where the "home" server for the user is. If we find one, we pass the stanza over to that server and let it deliver it. If not, we continue as normal.
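
In very rough pseudo-module form, the hook looks something like this (just a simplified sketch rather than the real plugin code; the shard_lookup stub, the shard_name option and the shard/deliver event are placeholder names):

local jid_split = require "util.jid".split;
local bare_sessions = prosody.bare_sessions;

-- Placeholder for the call out to the external user->shard lookup daemon.
local function shard_lookup(bare_jid)
    return nil; -- nil means "no home shard known, deliver locally as normal"
end

local local_shard = module:get_option_string("shard_name"); -- placeholder option

local function route_to_shard(event)
    local stanza = event.stanza;
    local node, host = jid_split(stanza.attr.to);
    if not node then return; end -- not addressed to a user
    local bare_jid = node.."@"..host;

    -- A session for this user exists on this instance: normal delivery applies.
    if bare_sessions[bare_jid] then return; end

    -- Otherwise look up the user's home shard and hand the stanza over.
    local shard = shard_lookup(bare_jid);
    if shard and shard ~= local_shard then
        module:fire_event("shard/deliver", { shard = shard, stanza = stanza }); -- placeholder event
        return true; -- handled; stop normal processing
    end
    -- No home shard found: continue as normal.
end

module:hook("message/bare", route_to_shard, 10);
module:hook("message/full", route_to_shard, 10);
module:hook("presence/bare", route_to_shard, 10);
module:hook("iq/bare", route_to_shard, 10);
module:hook("iq/full", route_to_shard, 10);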
 
So I have a very rough prototype in these three plugins:
 
https://github.com/robn/prosody/blob/shard/plugins/mod_shard.lua
https://github.com/robn/prosody/blob/shard/plugins/mod_shard_client.lua
https://github.com/robn/prosody/blob/shard/plugins/mod_shard_server.lua
 
mod_shard is the shard lookup hook. mod_shard_client and mod_shard_server are the bits that pass stanzas between servers. They're not especially interesting, and are mostly hacked from mod_component and mod_component_client.
 
I'm assuming a very controlled environment here - secure networks (no auth required between shards), a user->shard lookup daemon, common storage and auth, and a proxy that can get incoming user connections to the right shard (in my case that's nginx-xmpp). I'm expecting s2s to be covered by a pair of Prosody instances that accept stanzas and pass them to the right shard. I should be able to run multiple of those because they're basically stateless.
 
So that's the whole description. Hopefully it all made sense! What I'm looking for is some feedback on whether the way I've implemented this makes sense (half-arsed prototype code notwithstanding) and any thoughts about what user-visible differences in XMPP it might present (stanza delivery order comes to mind). And any other thoughts you might have!
 
Thanks!
Rob N.

Victor Seva

Mar 20, 2015, 4:02:38 AM3/20/15
to proso...@googlegroups.com, ro...@fastmail.fm
On 03/20/2015 06:54 AM, Robert Norris wrote:
> Hi all,
>
> I've got a prototype for a Prosody feature that I want/need. I'm keen to
> get your input on it.
>
> What I want to do is distribute users for a single domain across
> multiple Prosody instances while having them look like a single server
> on the outside. The reasons for this are partly to do with load - I'm
> preparing for a time where there are thousands of users online - but
> also to fit neatly into our existing infrastructure, where we spread
> email users across multiple servers. I like the idea of mirroring our
> email architecture and doing redundancy the same way.
>
> Note that I'm expressly not talking about real live clustering, where a
> server can fail and the user remains connected. A Prosody instance (or
> its underlying machine) can fail and the sessions will be disconnected.
> That's totally fine with me - most clients will quietly reconnect
> without issue.

Great. I was already working on that, so maybe we can share ideas. Be
aware that my work was done several months ago for 0.9.4, and I had to
stop due to workload.

> It seems that all that's necessary to make this happen is to hook into
> the stanza delivery code. If we have a session for a user on the current
> server, we deliver it to them. If not, we do a lookup (via an external
> daemon) to see where the "home" server for the user is. If we find one,
> we pass the stanza over to that server and let it deliver it. If not, we
> continue as normal.
>
> So I have a very rough prototype in this three plugins:
>
> https://github.com/robn/prosody/blob/shard/plugins/mod_shard.lua
> https://github.com/robn/prosody/blob/shard/plugins/mod_shard_client.lua
> https://github.com/robn/prosody/blob/shard/plugins/mod_shard_server.lua
>
> mod_shard is the shard lookup hook. mod_shard_client and
> mod_shard_server are the bits that pass stanzas between servers. They're
> not especially interesting, and are mostly hacked from mod_component and
> mod_component_client.

I'd implemented a module that stores which JID lives on which node in
Redis, here:

https://github.com/sipwise/prosody/blob/master/plugins/mod_sipwise_redis_sessions.lua

You just have to ask redis_sessions.get(jid) to know where to send the
stanza.
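
The idea boils down to a key/value lookup, something like this standalone sketch (using the redis-lua client; this is not the module's actual code, and the key naming is made up):

local redis = require "redis"; -- redis-lua client
local client = redis.connect("127.0.0.1", 6379);

-- On session start/close, record (or clear) which node owns the JID.
local function set_node(bare_jid, node_host)
    client:set("shard:" .. bare_jid, node_host);
end

-- Before routing, ask Redis where to send the stanza.
local function get_node(bare_jid)
    return client:get("shard:" .. bare_jid);
end

set_node("alice@example.com", "node1.internal");
print(get_node("alice@example.com")); --> node1.internal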

> I'm assuming a very controlled environment here - secure networks (no
> auth required between shards), a user->shard lookup daemon, common
> storage and auth and a proxy that can get incoming user connections to
> the right shard (in my case that's nginx-xmpp). I'm expecting s2s to be
> covered by a pair of Prosody instances that accept stanzas and pass them
> to the right shard. I should be able to run multiple of those because
> they're basically stateless.

But did you try that?

My approach was to extend mod_s2s as mod_s2sc to connect the nodes:

https://github.com/sipwise/prosody/blob/master/core/s2scmanager.lua
https://github.com/sipwise/prosody/tree/master/plugins/mod_s2s_cluster
https://github.com/sipwise/prosody/blob/master/plugins/mod_s2sc_dialback.lua

And this module is the one that hooks and decides if the stanza needs to be
routed somewhere else:

https://github.com/sipwise/prosody/blob/master/plugins/mod_sipwise_cluster.lua

> So that's the whole description. Hopefully it all made sense! What I'm
> looking for is some feedback on whether the way I've implemented this
> makes sense (half-arsed prototype code notwithstanding) and any thoughts
> about what user-visible differences in XMPP it might present (stanza
> delivery order comes to mind). And any other thoughts you might have!
>

So I think we are trying to do the same thing. I should point out that I
didn't finish polishing this, and it certainly won't work as it is right now.
But you can get the idea.

Looking forward to your comments,
Victor Seva


Matthew Wild

Mar 23, 2015, 4:39:38 PM3/23/15
to Prosody IM Developers Group
Hi Rob,

On 20 March 2015 at 05:54, Robert Norris <ro...@fastmail.fm> wrote:
> What I want to do is distribute users for a single domain across multiple
> Prosody instances while having them look like a single server on the
> outside. The reasons for this are partly to do with load - I'm preparing for
> a time where there are thousands of users online - but also to fit neatly
> into our existing infrastructure, where we spread email users across
> multiple servers. I like the idea of mirroring our email architecture and
> doing redundancy the same way.

Makes sense.

> Note that I'm expressly not talking about real live clustering, where a
> server can fail and the user remains connected. A Prosody instance (or its
> underlying machine) can fail and the sessions will be disconnected. That's
> totally fine with me - most clients will quietly reconnect without issue.

Agreed. I should note that for our initial clustering implementation
we're not aiming for live migration of users between nodes either; I
don't think many people need that feature compared to the number of
people who just want availability and load balancing.

> It seems that all that's necessary to make this happen is to hook into the
> stanza delivery code. If we have a session for a user on the current server,
> we deliver it to them. If not, we do a lookup (via an external daemon) to
> see where the "home" server for the user is. If we find one, we pass the
> stanza over to that server and let it deliver it. If not, we continue as
> normal.

Yes, this is the right approach.
I understand the code is rough, so I haven't really "reviewed" it -
but you're definitely going in the right direction. Comments below...

> I'm assuming a very controlled environment here - secure networks (no auth
> required between shards), a user->shard lookup daemon, common storage and
> auth and a proxy that can get incoming user connections to the right shard
> (in my case that's nginx-xmpp). I'm expecting s2s to be covered by a pair of
> Prosody instances that accept stanzas and pass them to the right shard. I
> should be able to run multiple of those because they're basically stateless.

Right, what we can and can't assume drastically affects the complexity
of clustering, so this is where things get interesting.

Auth between nodes: easily done, but fine if you don't need it.

The shard lookup daemon. This would be a problem with the current
design. XMPP requires in-order processing and delivery of stanzas. If
you have to delay a stanza based on a call to an out-of-process
third-party, you're going to lose this unless you block the whole
session until you receive the result. There isn't currently a way to
do this in Prosody 0.9 (but there is in 0.10, with the async work).

Alternatively you could assign users to shards based on a hash or some
other means that can be determined immediately without an external
lookup. If you wanted to get fancy you could use "consistent hashing"
to minimize disruption when adding or removing cluster nodes.
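
For example, something as simple as this plain-Lua sketch (the shard list is made up) gives you a deterministic, lookup-free assignment:

local shards = { "shard1.internal", "shard2.internal", "shard3.internal" };

local function shard_for(bare_jid)
    -- djb2-style string hash; any stable hash of the bare JID will do
    local h = 5381;
    for i = 1, #bare_jid do
        h = (h * 33 + bare_jid:byte(i)) % 2^32;
    end
    return shards[(h % #shards) + 1];
end

print(shard_for("alice@example.com")); -- always the same shard for a given JID

Consistent hashing replaces the plain modulo with a hash ring, so adding or removing a shard only moves the users mapped near the affected node instead of reshuffling almost everyone.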

For the s2s nodes, make sure you set dialback_secret to some secret
static string (the same on both nodes).
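
That is, in the Prosody config on each of those front-end nodes (the value here is just an example):

dialback_secret = "replace-with-a-long-random-string"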

> So that's the whole description. Hopefully it all made sense! What I'm
> looking for is some feedback on whether the way I've implemented this makes
> sense (half-arsed prototype code notwithstanding) and any thoughts about
> what user-visible differences in XMPP it might present (stanza delivery
> order comes to mind). And any other thoughts you might have!

Hopefully the above is a good start.

There are some other things to consider if you haven't already. For
example, off the top of my head... what shard will be responsible for
offline users? Various things happen with offline users, including
handling roster state changes (e.g. in response to subscription
authorization) and the storage of offline messages. If every user has
a fixed "home" shard, this is easy enough to handle.

Regards,
Matthew

Robert Norris

Apr 1, 2015, 5:17:41 PM4/1/15
to proso...@googlegroups.com
On Tue, 24 Mar 2015, at 07:39 AM, Matthew Wild wrote:
> The shard lookup daemon. This would be a problem with the current
> design. XMPP requires in-order processing and delivery of stanzas. If
> you have to delay a stanza based on a call to an out-of-process
> third-party, you're going to lose this unless you block the whole
> session until you receive the result. There isn't currently a way to
> do this in Prosody 0.9 (but there is in 0.10, with the async work).
 
Yeah, I was planning to build on the async work (which I still need to read up on; first pass months ago made no sense to me, but hopefully I'm smarter these days).
 
The nice thing is that I can cache pretty aggressively. A user will only need to change servers when they're moved to another store (manual process) or during failover. Both of these trigger a network broadcast that tells everyone to invalidate their caches.
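
Something like this is the shape of it (just a sketch; the daemon call is stubbed out and the names are placeholders):

local shard_cache = {};

-- Placeholder for the (eventually async) call out to the lookup daemon.
local function lookup_via_daemon(bare_jid)
    return nil;
end

local function shard_for(bare_jid)
    local shard = shard_cache[bare_jid];
    if shard == nil then
        shard = lookup_via_daemon(bare_jid);
        shard_cache[bare_jid] = shard;
    end
    return shard;
end

-- Called when the "user moved" / failover broadcast arrives.
local function invalidate(bare_jid)
    if bare_jid then
        shard_cache[bare_jid] = nil;
    else
        shard_cache = {}; -- flush everything
    end
end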
 
> Alternatively you could assign users to shards based on a hash or some
> other means that can be determined immediately without an external
> lookup. If you wanted to get fancy you could use "consistent hashing"
> to minimize disruption when adding/removing nodes to the cluster.
 
I did think about it, and I probably would if I were planning a separate cluster of Prosody servers. Since I'm planning to run them on the mail stores, though, it makes sense to keep all of a user's data in the same place (and it makes their storage access super-fast because it's on the same host, not that the internal network is slow).
 
> For the s2s nodes, make sure you set dialback_secret to some secret
> static string (the same on both nodes).
 
Yes. I knew that, but it's a point that bears repeating :)
 
> There are some other things to consider if you haven't already. For
> example, off the top of my head... what shard will be responsible for
> offline users? Various things happen with offline users, including
> handling roster state changes (e.g. in response to subscription
> authorization) and the storage of offline messages. If every user has
> a fixed "home" shard, this is easy enough to handle.
 
Yep, fixed home shard.
 
Thanks for the feedback, I'm happy to know I'm not totally off track. Now I just need to find time to finish it!
 
Cheers,
Rob N.

Robert Norris

Apr 1, 2015, 5:24:54 PM4/1/15
to proso...@googlegroups.com
On Fri, 20 Mar 2015, at 07:02 PM, Victor Seva wrote:
> Great. I'm was working on that already so maybe we can share ideas. Be
> aware that this work was done several months ago for 0.9.4 and I had to
> stop working on this due to work load.
 
Sure! I haven't had time to read your code yet, but I will very soon. Thanks for mentioning it!
 
>> I'm expecting s2s to be
>> covered by a pair of Prosody instances that accept stanzas and pass them
>> to the right shard. I should be able to run multiple of those because
>> they're basically stateless.
 
> But did you try that?
 
No, not yet, but it seems to make sense. Do you anticipate a problem with that approach?
 
> My approach was to extend mod_s2s as mod_s2sc to connect the nodes:
 
So you're doing s2s/dialback between nodes? That will work I suppose, but in my case it would be unnecessary overhead (not a lot of overhead, but still).
 
Do you do outbound s2s from the cluster nodes, or push that through a central s2s node? I'm thinking I'll have inbound through a central node, but outbound directly from the shards - much easier to manage.
 
> So I think we are trying to do the same thing. I have to point that I
> didn't finish to polish this and for sure it could not work as it is right now.
> But you can get the idea.
 
Of course. I'll have a good read of your code as soon as I possibly can. Thanks again!
 
Cheers,
Rob N.

Victor Seva

Apr 2, 2015, 8:10:54 AM4/2/15
to proso...@googlegroups.com
On 04/01/2015 11:24 PM, Robert Norris wrote:
> On Fri, 20 Mar 2015, at 07:02 PM, Victor Seva wrote:
> My approach was to extend mod_s2s as mod_s2sc to connect the nodes:
>
> So you're doing s2s/dialback between nodes? That will work I suppose,
> but in my case would be unnecessary overhead (not a lot of overhead, but
> still).

Yes, maybe not a wise decision.

> Do you do outbound s2s from the cluster nodes, or push that through a
> central s2s node? I'm thinking I'll have inbound through a central node,
> but outbound directly from the shards - much easier to manage.

I do outbound s2s from the nodes.


>
> So I think we are trying to do the same thing. I have to point that I
> didn't finish to polish this and for sure it could not work as it is
> right now.
> But you can get the idea.
>
>
> Of course. I'll have a good read of your code as soon as I possibly can.
> Thanks again!


Looking forward to your comments,
Victor


Victor Seva

Nov 3, 2015, 9:09:01 AM11/3/15
to proso...@googlegroups.com, rob...@its-hoffmann.net
Finally, I had some time to play with the mod_shard* modules from Robert.

His approach is cleaner and simpler.

So, I've started implementing our user/MUC sharding using his work as a
base [0].
I'm using Redis to store which shard server has which JID
(mod_sipwise_redis_sessions.lua) or MUC (mod_sipwise_redis_mus.lua).

Thanks!

[0]
https://github.com/sipwise/prosody/commit/c5f0784cebefd2ba29ebad529331698db2f744e8
--
Victor Seva
Software Engineer

