Scaling / load balancing log aggregators in a high-traffic setup


Ed James

unread,
Mar 3, 2015, 9:27:58 AM
to flu...@googlegroups.com
Hi all


I've been taking a look at the suggested high-availability setup here: http://docs.fluentd.org/articles/high-availability

We are starting to design our own infrastructure using this as a template. We will be running a small application together with a td-agent instance on multiple servers which will send traffic to log aggregators, as per the suggested setup. We are expecting approx 10k/min of data when we go live, although this will increase quite fast.
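
For reference, the forwarder side of what we're planning follows that article's template fairly closely - something like this (the tag and hostnames are just placeholders, not our real setup):

    # forwarder (td-agent) side - send everything to the aggregators
    <match app.**>
      type forward
      flush_interval 10s

      # primary aggregator
      <server>
        host aggregator1.example.com
        port 24224
      </server>

      # backup aggregator, used when the primary is unreachable
      <server>
        host aggregator2.example.com
        port 24224
        standby
      </server>

      # spill to local disk if no aggregator is reachable
      <secondary>
        type file
        path /var/log/td-agent/forward-failed
      </secondary>
    </match>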

My main concern here is scaling at the log aggregator level.

I think scaling horizontally at the log-forwarder level should be quite straightforward, and it's something we've done before. However, what is the suggested approach for horizontal scaling at the aggregator level? Let's say we have 2 servers running as log aggregators that start struggling with a sudden spike in traffic.

If I want to double my capacity by adding 2 more aggregator servers, what is the best way of doing this? The reverse also applies - how do I then cleanly and safely scale back down again when the traffic drops?

Many thanks and thanks for a great product.

Ed.

Satoshi Tagomori

unread,
Mar 3, 2015, 11:09:12 AM
to flu...@googlegroups.com
Hi Ed,

It is a good idea to add aggregator servers for higher scalability, but you should also consider running more fluentd processes on those servers, each with a different listen port. Fluentd (CRuby) can use only 1 CPU core, so if your servers have 2 or more cores, adding processes first is a good way to get more CPU power and higher throughput. This method doesn't require any more money :)
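
For example, each process only needs its own in_forward port (and its own buffer/output path if you use file buffers). The ports and paths below are only examples:

    # aggregator-1.conf  (first process, e.g. fluentd -c aggregator-1.conf)
    <source>
      type forward
      port 24224
    </source>
    <match **>
      type file
      path /var/log/fluent/aggregator-1    # keep paths separate per process
    </match>

    # aggregator-2.conf  (second process on the same host, different port and path)
    <source>
      type forward
      port 24225
    </source>
    <match **>
      type file
      path /var/log/fluent/aggregator-2
    </match>

Forwarder nodes then list each host:port pair as its own <server> in their forward output.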

Scaling down fluentd nodes is a harder problem. The cleanest way is:
(1) rewrite the configuration of the forwarder nodes to remove some of the aggregation nodes
(2) restart all forwarder nodes
(3) stop those aggregation nodes

But the Fluentd forward plugin can also remove stopped nodes automatically, once its heartbeat monitor detects that a node has gone down. So simply stopping an aggregation node first works well too... except for the warning messages on the forwarder-level nodes ("aggregation node xxx is down").

tagomoris.

On Tuesday, March 3, 2015 at 11:27:58 PM UTC+9, Ed James wrote:

Ed James

unread,
Mar 3, 2015, 12:49:50 PM
to flu...@googlegroups.com
Hi Satoshi

Thanks for your reply.

If I understand you correctly, it sounds like scaling horizontally is not that easy. Sure, running more processes on each aggregator server is a good idea, and we will obviously start off running one process per CPU on each machine. However, it sounds a little tricky to add more aggregators, and even harder to remove them afterwards.

I'm sure there are people in this group who have setups that are dealing with huge amounts of traffic - do you know anything more about what other people are doing around this issue i.e. how are you handling scaling at the aggregator level?

Also, you mentioned the heartbeat monitor - I didn't see that in the docs. Where can I find out more about that?

Many thanks,
Ed.

Mr. Fiber

unread,
Mar 3, 2015, 1:37:51 PM
to flu...@googlegroups.com
Also, you mentioned the heartbeat monitor - I didn't see that in the docs. Where can I find out more about that?

The out_forward article mentions the heartbeat parameters, heartbeat_type and heartbeat_interval.
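
For example (values here are only illustrative; see the article for the defaults):

    <match app.**>
      type forward
      heartbeat_type udp       # udp (default) or tcp
      heartbeat_interval 1s    # how often each <server> is probed

      <server>
        host aggregator1.example.com    # placeholder
        port 24224
      </server>
    </match>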


do you know anything more about what other people are doing around this issue i.e. how are you handling scaling at the aggregator level?

Some people put a load-balancing proxy like ELB in front of the aggregators.
This approach resolves Satoshi's configuration-rewrite issue.
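
Forwarders then only need to know the balancer's address, so aggregators behind it can be added or removed without touching forwarder configs. A rough sketch (the hostname is only a placeholder, and because a TCP balancer won't pass out_forward's default UDP heartbeat, heartbeat_type tcp is one option):

    <match app.**>
      type forward
      heartbeat_type tcp    # UDP heartbeat packets will not pass through a TCP load balancer

      <server>
        host log-aggregators.example.elb.amazonaws.com    # placeholder ELB endpoint
        port 24224
      </server>
    </match>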


Masahiro


Christian Hedegaard

unread,
Mar 3, 2015, 8:11:09 PM
to flu...@googlegroups.com

Hey guys, I'd like to spread our fluentd processing across multiple cores. Does anyone have an example of an HAProxy config I could use to load balance connections to two local fluentd/td-agent processes?

It would be something like:

(APP) -> (HAProxy) tcp/24224 -> (fluentd) tcp/24334
                              \-> (fluentd) tcp/24444

Has anyone set this up before and has an example config I could see? Thanks!
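
For concreteness, what I have in mind is roughly the following, completely untested (local ports are just examples):

    defaults
        mode tcp
        timeout connect 5s
        timeout client  60s
        timeout server  60s

    frontend fluentd_in
        bind *:24224
        default_backend fluentd_procs

    backend fluentd_procs
        balance roundrobin
        server fluentd1 127.0.0.1:24334 check
        server fluentd2 127.0.0.1:24444 check

If anything upstream uses out_forward with UDP heartbeats, I assume it would also need heartbeat_type tcp, since UDP won't go through a TCP proxy.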

Naotoshi Seo

unread,
Mar 4, 2015, 1:24:27 AM
to flu...@googlegroups.com, chede...@red5studios.com
We (Fluentd committers) have never tried it.
Can you try and share the knowledge?

Regards,
Naotoshi a.k.a. sonots

Ed James

unread,
Mar 4, 2015, 6:51:47 AM
to flu...@googlegroups.com
Thanks Masahiro.

We use AWS extensively, so using an ELB will be easy. We will almost certainly also use HAProxy on each aggregator machine to load-balance between the agent processes, since I'm sure we will have multiple processes on each machine.

When we get something operational I'll be sure to post our findings back here to share.

Best,
Ed.

Lance N.

unread,
Mar 9, 2015, 6:33:46 PM
to flu...@googlegroups.com
We run one server with several Fluentd processes and the 'pen' load-balancer app.
The Fluentd processes run in one Docker container and the pen program runs in another. The whole thing scales well. We have a complication in that I did not want to allow multiple writers to the same output directory (in S3), so I have two layers of indirection: the 'pen' program load-balances across four Fluentds, which all do intermediate processing and send to 4 back-end processes for S3 writing. The 4 back-end Fluentds each write to separate S3 buckets. This sounds weird, and maybe it's too ornate, but I had a LOT of trouble working around the CRuby 1-processor limitation, and this has run for weeks with no problems.
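
For what it's worth, pen itself needs almost no configuration; the front tier is launched with roughly this one-liner (ports are examples, and -r asks pen for round-robin selection instead of its default client-affinity behaviour):

    # balance incoming connections on 24224 across the four intermediate Fluentd processes
    pen -r 24224 localhost:24334 localhost:24335 localhost:24336 localhost:24337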

Does Fluentd run under JRuby? It would be nice to have one process and not 11.

Cheers,

Lance

Mr. Fiber

unread,
Mar 10, 2015, 8:30:54 AM
to flu...@googlegroups.com
Does Fluentd run under JRuby? It would be nice to have one process and not 11.


One guy keeps working on this in his spare time.
msgpack-ruby now supports JRuby, and cool.io will support JRuby soon.
After that, we can support JRuby officially.


Masahiro


