High-volume logging on Kubernetes


Matthias Rampke

Jul 22, 2016, 1:12:38 PM
to google-c...@googlegroups.com
Hello,

we've hit the throughput limit of fluentd at ~10k lines/second (shipping to Kafka, but load testing showed the bottleneck to be fluentd itself).

We are now considering alternatives, and I am wondering whether anyone else has run up against this and how you solved it.

Cheers,
MR
--
Matthias Rampke
Engineer

SoundCloud Ltd. | Rheinsberger Str. 76/77, 10115 Berlin, Germany | +49 173 6395215

Managing Director: Alexander Ljung | Incorporated in England & Wales with Company No. 6343600 | Local Branch Office | AG Charlottenburg  | HRB 110657B

Juho Mäkinen

Jul 22, 2016, 5:14:43 PM
to Containers at Google
We have a Go program which accepts a UDP syslog feed from Docker, verifies that each message has a valid JSON structure (or encapsulates it into JSON), and then feeds it into Kafka. We deploy the binary to every machine in our fleet. I haven't benchmarked it, but if you're interested I'm happy to share the code.

It's plugged into Docker using the following arguments: "--log-driver=syslog --log-opt syslog-address=udp://localhost:8061 --log-opt tag={{.Name}}/{{.ID}}/{{.ImageName}}".
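To give a rough idea of the shape of it (this is only a sketch, not the actual code; the broker address, topic name and the syslog parsing here are placeholders), the core loop looks something like:

package main

import (
	"encoding/json"
	"log"
	"net"
	"strings"
	"time"

	"github.com/Shopify/sarama"
)

func main() {
	// Listen on the UDP port the Docker syslog driver is pointed at.
	conn, err := net.ListenPacket("udp", "127.0.0.1:8061")
	if err != nil {
		log.Fatal(err)
	}

	// Fire-and-forget producer; a real forwarder would drain Errors().
	config := sarama.NewConfig()
	config.Producer.Return.Errors = false
	producer, err := sarama.NewAsyncProducer([]string{"kafka-1:9092"}, config)
	if err != nil {
		log.Fatal(err)
	}

	buf := make([]byte, 65536)
	for {
		n, _, err := conn.ReadFrom(buf)
		if err != nil {
			continue
		}
		line := string(buf[:n])

		// The syslog payload after the first ": " is the actual log line;
		// the tag set via --log-opt tag=... sits in the header before it.
		// This split is deliberately naive.
		raw := line
		if i := strings.Index(line, ": "); i >= 0 {
			raw = line[i+2:]
		}
		encoded, _ := json.Marshal(map[string]string{
			"ts":  time.Now().UTC().Format(time.RFC3339),
			"msg": raw,
		})

		producer.Input() <- &sarama.ProducerMessage{
			Topic: "logs",
			Value: sarama.ByteEncoder(encoded),
		}
	}
}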

 - Garo

Matthias Rampke

Jul 23, 2016, 5:17:12 PM
to Containers at Google

This is very valuable, thank you! How do you deal with `kubectl logs` / live-tailing? What Kafka library do you use?

/MR



Alex Robinson

Jul 25, 2016, 2:31:09 PM
to Containers at Google
If you don't need any parsing or any special metadata tagging, it shouldn't be too much code to write something like what Juho has written, but reading the logs off disk instead to maintain `kubectl logs` support. Or grab one of the existing open-source lightweight log forwarders. Either will save you a lot of resources over using fluentd.
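The on-disk variant isn't much code either. On a Kubernetes node the kubelet symlinks each container's json-file log under /var/log/containers/, and every line is a small JSON object, so the core is roughly this (a sketch only, with no tailing, offset tracking or rotation handling):

package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
)

// dockerLogLine is the shape of one line written by Docker's json-file driver.
type dockerLogLine struct {
	Log    string `json:"log"`
	Stream string `json:"stream"`
	Time   string `json:"time"`
}

func main() {
	// The kubelet symlinks container logs here; the filename encodes the pod,
	// namespace and container, so some metadata comes for free.
	files, _ := filepath.Glob("/var/log/containers/*.log")
	for _, path := range files {
		f, err := os.Open(path)
		if err != nil {
			continue
		}
		scanner := bufio.NewScanner(f)
		for scanner.Scan() {
			var entry dockerLogLine
			if err := json.Unmarshal(scanner.Bytes(), &entry); err != nil {
				continue // not a json-file line, skip it
			}
			// A real forwarder would tail the file and ship to Kafka
			// instead of printing.
			fmt.Printf("%s %s %s", entry.Time, entry.Stream, entry.Log)
		}
		f.Close()
	}
}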

The other option may just be to give fluentd more CPU resources, but that'll depend on whether it's currently CPU-bound or just not parallelizing its I/O sufficiently.

Matthias Rampke

Jul 26, 2016, 8:49:31 AM
to Containers at Google
Thank you, that is what we've been thinking about too.

Fluentd has no resource limits set and is still CPU-bound (Ruby is not very good at multiprocessing). We do need some metadata handling; we'll have to see whether a full JSON unmarshal/marshal cycle per line is acceptable in Go.
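A quick sanity check would be a micro-benchmark along these lines (the sample line and the added metadata field are made up):

package logbench

import (
	"encoding/json"
	"testing"
)

var sample = []byte(`{"ts":"2016-07-26T08:49:31Z","service":"api","host":"node-1.example.com","level":"INFO","msg":"request handled in 12ms"}`)

// BenchmarkRoundTrip measures one unmarshal + marshal cycle per log line,
// roughly what a forwarder that injects metadata has to do.
func BenchmarkRoundTrip(b *testing.B) {
	for i := 0; i < b.N; i++ {
		var m map[string]interface{}
		if err := json.Unmarshal(sample, &m); err != nil {
			b.Fatal(err)
		}
		m["kubernetes_pod"] = "example-pod-1234" // hypothetical metadata
		if _, err := json.Marshal(m); err != nil {
			b.Fatal(err)
		}
	}
}

(run with `go test -bench .` against a representative log line)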

Thank you!
Matthias

Alex Robinson

Jul 26, 2016, 4:10:53 PM
to Containers at Google
You might be interested in the discussion happening on https://github.com/kubernetes/kubernetes/issues/29411


Juho Mäkinen

Jul 29, 2016, 4:15:07 PM
to google-c...@googlegroups.com, m...@soundcloud.com
Hi MR. Sorry for the delayed response.

I pushed my code to https://github.com/garo/logs2kafka - consider it beta quality, but I'm happy to get some feedback. The software is part of our Kafka logging stack. The other part is a similar program which feeds logs to Elasticsearch in a Kibana-supported format, but that's not yet ready, so we are still using Logstash for that. Our idea is that each application can choose to simply log to STDERR/STDOUT, or in a more complex case log either to logs2kafka (via TCP, not yet implemented) or directly to Kafka. Once the JSON log messages are in Kafka we can use multiple different consumers: push to Elasticsearch, store to S3, write files to disk for easy grepping, write custom code to analyse messages and update data, and so on.

The program enforces that each message is a single JSON document on a single line. Each document must contain a "ts" field in ISO 8601 format, a "service" field which tells which service produced the message, and a "host" field which contains the fully qualified hostname of the system where the message was generated.
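For example, a complete message might look like this (the values and the "msg" field name are just illustrative; only "ts", "service" and "host" are enforced):

{"ts":"2016-07-29T16:15:07Z","service":"billing-api","host":"worker-03.example.net","level":"ERROR","msg":"payment gateway timeout"}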

The application can either log complete JSON documents or choose to log raw log messages. logs2kafka detects which mode each message uses and either encapsulates the raw log message into JSON, or verifies that the JSON document contains the required fields and adds those that are missing. As the Docker syslog format contains the name of the container, logs2kafka will use that as the service if the "service" field is missing, but it will not use the autogenerated Docker container names (to detect those, I copy-pasted the name generation code from Docker).

Logs2kafka also supports analysing the "level" field and updating statsd counters, so that you get DEBUG/ERROR/WARN log message metrics for each of your applications. This can be handy if you want to build alerts on ERROR messages downstream later.
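The statsd side of that is simple; a sketch of the idea (the metric naming and addresses here are made up, not necessarily what logs2kafka uses):

package main

import (
	"fmt"
	"net"
)

// bumpLevelCounter sends a single statsd counter increment over UDP,
// e.g. "logs.billing-api.ERROR:1|c".
func bumpLevelCounter(conn net.Conn, service, level string) {
	fmt.Fprintf(conn, "logs.%s.%s:1|c", service, level)
}

func main() {
	// statsd listens on UDP 8125 by default.
	conn, err := net.Dial("udp", "127.0.0.1:8125")
	if err != nil {
		return
	}
	defer conn.Close()
	bumpLevelCounter(conn, "billing-api", "ERROR")
}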

> This is very valuable, thank you! How do you deal with `kubectl logs` / live-tailing? What Kafka library do you use?

I haven't deployed Kubernetes yet. I'm still worried about whether we can get good enough networking performance in a multi-availability-zone setup on EC2. I'm hoping there's a way to use, for example, Calico so that it uses native L3 within the same subnet but falls back to IPIP for cross-subnet connectivity. But that's a different story.

logs2kafka uses sarama, and I'm experimenting with the sarama-cluster library, which uses the new Kafka 0.9.x-based consumer group tracking instead of ZooKeeper. Kafka is a complex protocol to implement, so unfortunately most of the libraries outside the JVM world aren't yet mature.

Please take a look and tell me what you think!

 - Garo

 

Matthias Rampke

Aug 2, 2016, 5:49:50 AM
to Juho Mäkinen, google-c...@googlegroups.com
On Fri, Jul 29, 2016 at 8:15 PM Juho Mäkinen <juho.m...@gmail.com> wrote:
> Hi MR. Sorry for the delayed response.


Thank you!
 
> each application can choose to simply log to STDERR/STDOUT, or in a more complex case log either to logs2kafka (via TCP, not yet implemented) or directly to Kafka.

12factor4lyfe! We are doing the same. We had a lot of issues because we mixed semantic event streams (needing exactly-once delivery) and operational logs; now the former go directly to a separate Kafka cluster and the latter to stdout/err. We probably won't support anything but these two paths.

 
> Once the JSON log messages are in Kafka we can use multiple different consumers: push to Elasticsearch, store to S3, write files to disk for easy grepping, write custom code to analyse messages and update data, and so on.

We're going to ship to HDFS. For log analysis (~Kibana), we are thinking about ingesting into a temporary log analysis tool (ELK or Graylog) on an as-needed basis, instead of continuously indexing log lines that 99% of the time won't ever be queried.


 

> I haven't deployed Kubernetes yet. I'm still worried about whether we can get good enough networking performance in a multi-availability-zone setup on EC2. I'm hoping there's a way to use, for example, Calico so that it uses native L3 within the same subnet but falls back to IPIP for cross-subnet connectivity. But that's a different story.

Slightly veering off topic here, but I wouldn't worry about IPIP performance. We've been using it in different contexts (IPVS) and haven't noticed _any_ slowdown.
 

> logs2kafka uses sarama, and I'm experimenting with the sarama-cluster library, which uses the new Kafka 0.9.x-based consumer group tracking instead of ZooKeeper. Kafka is a complex protocol to implement, so unfortunately most of the libraries outside the JVM world aren't yet mature.

Thanks, sarama seems to be the Go library of choice.

Thank you!
Matthias

Prashanth B

Aug 3, 2016, 1:42:06 PM
to Containers at Google, juho.m...@gmail.com
Some nice ideas on this thread. Please continue brainstorming on https://github.com/kubernetes/kubernetes/issues/30006 so they don't get lost.

Eduardo Silva

Aug 3, 2016, 4:58:58 PM
to Containers at Google
Hi Matthias,

Would you please describe the throughput limit you mentioned in more detail? Also, if you share your Fluentd configuration file and details of the load, we should be able to help fix the problem.

regards,

Matthias Rampke

Aug 4, 2016, 5:01:20 AM
to Containers at Google
Hey,

I don't have the exact configuration any more; it was derived from the fluentd-es example in Kubernetes. We tested with just the tail plugin and a null output, and at 10k messages/sec it was using ~50% of 2 cores; adding the Kubernetes metadata (from the upstream fluentd config) brought it up to 2x100% and buffer overflows. When actually writing to Kafka we were repeatedly disconnected by the Kafka broker, adding to the pushback issues.
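From memory, the tail-to-null test was essentially a config like this (paths and tag are approximate, not the exact file):

<source>
  @type tail
  path /var/log/containers/*.log
  pos_file /var/log/fluentd-containers.log.pos
  tag kubernetes.*
  format json
  read_from_head true
</source>

<match **>
  @type null
</match>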

Our Docker daemons currently rotate log files at 10 MB (I want to raise this). Profiling the Ruby process, we could not identify any particular bottleneck.
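For context, the rotation comes from the daemon's default json-file options, roughly (flag spelling depends on the Docker version):

dockerd --log-driver=json-file --log-opt max-size=10m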

/MR
