Loggregator dropping application logs and warning 'message output too high'


Yann ROBERT

Feb 26, 2015, 6:11:13 AM
to vcap...@cloudfoundry.org
Hi CF users and developers,

This is a follow-up to a GitHub issue I created earlier this week about Loggregator truncating messages under some conditions:
https://github.com/cloudfoundry/loggregator/issues/42

Alex suggested we move the discussion to this group.

You can find the main comments below.


Loggregator (LGR) sometimes drops messages with the following warning:

Log message output too high. We've dropped 100 messages 

This occurs when the application logs a lot of messages in a short time.
Dropping may be appropriate when the application is logging GB/sec for several seconds or minutes.

But sometimes the application just logs 200 lines at a time, then stops or goes back to a moderate logging rate.

In that case it may not be appropriate to just drop those messages, as they are important.

Why does LGR use a static threshold (100 messages) instead of a dynamic or configurable one?
Is it possible to configure it on a per-application basis? (We are running in a public cloud.)

Would it be possible to change LGR so that it truncates its buffer only when the syslog endpoint fails to keep up with the current rate?
Or, would it be possible to enable the flood protection only when the flood lasts for a 'long' time? For instance, if the application is logging 300 lines/second for 10 seconds, only drop after the 3rd or 4th second, so that we get the content of the flood for the first 2 or 3 seconds. It could also send the last buffer when the flood stops, so that we get the 10th second of logs.


Hi Yann,
Loggregator will output the message you referred to when clients can't keep up with the volume. So if you have a syslog drain or websocket consumer that isn't keeping up with the flow of logs, loggregator will truncate its buffer and send the client this warning message. So the answer to your question about only truncating the buffer when the syslog endpoint fails to keep up with the current rate is that this is exactly what is happening when this message appears.

The buffer size is configurable when deploying loggregator with the doppler.maxRetainedLogMessages property that can be added in the bosh manifest.
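
For reference, a deployment manifest fragment setting that property might look roughly like this (the surrounding nesting is illustrative; only the property name comes from the reply above, and 100 is the default mentioned later in this thread):

properties:
  doppler:
    maxRetainedLogMessages: 100  # raise this to retain more lines per consumer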

This property is set by the operators of the CF instance and can't be set for an individual application. Since the buffer size is heavily dependent on the available resources for the loggregator VMs, it's not really appropriate to allow users of a public cloud to set this property.

We are exploring ways to throttle log messages from noisy applications based on time scales as you suggest but currently we don't have anything in the system that does this.

Hope this helps to explain why you're seeing this output and what loggregator can and can't currently do.


Hi Alex, thank you for the reply.
I am not sure I understand. Could we please discuss that further?

In particular, you are referring to "clients" that can't keep up. But the endpoint is a syslog server (namely Papertrail).
If the syslog server cannot keep up with the flow sent by CloudFoundry, it should at least receive a few bytes of logs. But here there is nothing (except the warning message).

For example, when I send a big text file to a syslog server, like this:

cat big.txt | nc -v -u logs.papertrailapp.com 514

I get at least the beginning of the file.

So I suppose there is a condition that makes LGR not even try to send the logs to the syslog server; for example, when the application logs 150-300 lines at once (the exact number differs between the public clouds I tried).

What exactly is the network condition that makes LGR say "the syslog server can't keep up, I'm going to truncate the buffer but still send it a warning message"? Is it a connection reset? A connection timeout? A blocking socket write?

 
Could we come up with a detailed explanation of the exact conditions under which loggregator drops messages? Why is it designed to work like that? And how can one (either the developer or the CloudFoundry provider) ensure that the syslog endpoint gets every message, even when the application logs 150 to 500 lines in the same second?

Is the problem inside/between CF components, or between CF and the external syslog endpoint server?

Thanks in advance for any help.

Yann

Alexander Jackson

Feb 26, 2015, 10:13:10 PM
to vcap...@cloudfoundry.org
Yann, thanks for moving the discussion over here. Hopefully more folks will benefit from it.

Sorry for the confusing language in my earlier reply. When I mentioned "clients" being slow, I was referring to clients of loggregator. You are correct that these "clients" include things that are "servers", such as syslog endpoints.

When does loggregator emit this message?

Loggregator will emit the "Log message output too high. We've dropped 100 messages " message when an endpoint (websocket or syslog drain) can't keep up. Loggregator keeps a buffer of 100 messages. Once that limit is reached, the 101st message will cause the buffer to be truncated. The warning message will be added to the buffer and then the 101st message will be added, resulting in a new buffer size of 2 messages. This message is only sent when we have issues with external syslog or websocket endpoints. It is not emitted when there are issues inside/between CF components.
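
In rough pseudocode terms (a minimal Java sketch of the behaviour just described, not loggregator's actual Go implementation):

import java.util.ArrayDeque;
import java.util.Deque;

class TruncatingBuffer {
    private static final int LIMIT = 100; // loggregator's per-consumer buffer size

    private final Deque<String> messages = new ArrayDeque<>();

    // Called for each log line while the downstream consumer is too slow to drain us.
    void offer(String message) {
        if (messages.size() >= LIMIT) {
            messages.clear(); // the whole backlog of 100 messages is dropped
            messages.add("Log message output too high. We've dropped " + LIMIT + " messages");
        }
        messages.add(message); // the 101st message is kept, so the buffer size becomes 2
    }
}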

What could cause the buffer to overflow?  

The buffer could be overflowing for a variety of reasons. It could be because the connection to the server has timed out (or other connection issues). It could be due to slow blocking writes. In the case of websockets it could be that the remote side has stopped responding to "pings".

Why is it designed to work like this?  

Basically, loggregator assumes that CF operators don't have infinite resources, that it is running in a cloud environment that can be somewhat hostile (machines going up and down, networks that do experience packet loss, etc.), and that we are operating in a multi-tenant environment where we need to attempt to provide fair shares of available resources to all applications. Loggregator attempts to make trade-offs to protect itself and other tenants when faced with large log volumes or slow/unavailable downstream log consumers. It also attempts to never put back pressure on the application or CF component when faced with large log volumes or resource constraints. It assumes you'd rather have your application serve your clients while maybe losing a log message here and there, than have your application and users "waiting" for a log pipeline to clear due to a slow external log consumer. Now, all this said, loggregator is far from perfect and we continually work to improve both the implementation and the design.

I am of the opinion that in a large distributed environment that spans internal/external networks and third-party systems, on shared/limited hardware, you can never build a system that can deliver 100% of the messages 100% of the time while still ensuring adequate response time from the logging application. This is very much in line with the CAP theorem[0].

The long and short of it is that if you need to ensure the data in your log messages is "never" lost, logs are not the best place for that data. Rather, you should put this information in a more formal datastore like a database. Such systems make different decisions/trade-offs and use different protocols. In theory this results in a much better chance that your data will be safely stored.

How can I better debug issues with third-party syslog drains?

Make sure you can see log messages with `cf logs <app>`

If you perform `cf logs <app>` you will see additional messages regarding connection issues with syslog endpoints. Try doing this and see if it provides any more debugging information.


Are there other known issues with logging?

Make sure that your log lines are actually different lines with a newline character of some type between them.

A common issue with some application frameworks is that stdout is buffered. This can result in logs either not making it past the buffer (very little logging, or short-lived apps with improper buffer flushing on shutdown/crash) or being delayed.
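
For a Java app, one way to rule out stdout buffering (a sketch, assuming plain System.out logging rather than a logging framework) is to reinstall stdout with autoflush enabled:

import java.io.FileDescriptor;
import java.io.FileOutputStream;
import java.io.PrintStream;

public class Main {
    public static void main(String[] args) {
        // Wrap the raw stdout file descriptor in a PrintStream that flushes
        // on every println, so lines reach the platform without sitting in a buffer.
        System.setOut(new PrintStream(new FileOutputStream(FileDescriptor.out), true));
        System.out.println("this line is flushed immediately");
    }
}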

Another common issue with third-party syslog drains on run.pivotal.io is that you need to whitelist the group of IP addresses that loggregator could be sending data from. This list does change from time to time and is located here[1].

Hope that sheds some light on loggregator.
- Alex.




Yann ROBERT

Mar 5, 2015, 6:25:55 AM
to vcap...@cloudfoundry.org
Thank you, Alex, for this interesting piece of knowledge that I am sure many will benefit from.

It appears to me that applications hosted on CloudFoundry are just not supposed to log much to STDOUT/STDERR, at least when hosted on a public cloud.

And that the CloudFoundry feature that sends the application's STDOUT/STDERR to a syslog drain is to be used either:
- to monitor the application's logs at deployment, initialization, and undeployment
- to get the normal output of applications with a very low logging rate (that do not log more than about 50-100 lines per second)

It can also be the right solution for
- applications that are not modifiable and that use STDOUT/STDERR as the only output method
- applications in early development phase (where logs are not critical or alternative log method is not developed yet)

However, when the application is:
- modifiable (you have the source code and can change it)
- mission critical, and logs are important and should not be arbitrarily discarded

then the developer should make the application send the logs to the log drain directly.

This has the following effects:

CloudFoundry will only send to the syslog drain the deployment logs (DEA and STG), the initialization logs emitted before the application starts sending directly to the log drain (when the application runs in a container like Tomcat, Tomcat will write its logs to STDOUT, and the logging framework initialization may output a few lines too), and the DEA logs at shutdown.
The amount of logs for all of this is very low, so CloudFoundry has no problem with that.

The application's logs should not be written to STDOUT/STDERR, so the application has to send them to the syslog drain by itself.
To do that, at startup, the application can read the VCAP_SERVICES environment variable to get the syslog drain URL.
It can then use a syslog library, if one is available for your language, to send the logs to this endpoint.
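
In Java that could look roughly like this (a sketch: the regex-based extraction and the syslog_drain_url field layout are assumptions; a real application should parse the JSON properly):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DrainUrl {
    public static void main(String[] args) {
        // VCAP_SERVICES holds a JSON document describing all bound services.
        String services = System.getenv("VCAP_SERVICES");
        if (services == null) {
            return; // not running on Cloud Foundry
        }
        // Pull out the first syslog_drain_url value, e.g. "syslog://logs.example.com:514".
        Matcher m = Pattern.compile("\"syslog_drain_url\"\\s*:\\s*\"([^\"]+)\"").matcher(services);
        if (m.find()) {
            System.out.println("syslog drain: " + m.group(1));
        }
    }
}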

In the case of a Java application (which is my case), you can benefit from logback's SyslogAppender or the logback-syslog4j library.
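
Wiring up logback's SyslogAppender programmatically looks roughly like this (a sketch; the host and port are placeholders you would take from the drain URL above):

import ch.qos.logback.classic.Logger;
import ch.qos.logback.classic.LoggerContext;
import ch.qos.logback.classic.net.SyslogAppender;
import org.slf4j.LoggerFactory;

public class SyslogSetup {
    public static void configure(String host, int port) {
        LoggerContext context = (LoggerContext) LoggerFactory.getILoggerFactory();
        SyslogAppender appender = new SyslogAppender();
        appender.setContext(context);
        appender.setSyslogHost(host);  // e.g. logs.papertrailapp.com
        appender.setPort(port);        // e.g. 514
        appender.setFacility("USER");
        appender.setSuffixPattern("[%thread] %logger %msg");
        appender.start();
        // Attach to the root logger so log statements bypass STDOUT entirely.
        Logger root = context.getLogger(Logger.ROOT_LOGGER_NAME);
        root.addAppender(appender);
    }
}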

It is working fine in my case so far, even when the application is under heavy load and generating a lot of logs.

I may be wrong to say that CloudFoundry is not suited for applications that log at a high rate to STDOUT. I apologize if that offends anyone.
And maybe it was never advertised to be so (as the CloudBees RUN platform I used before was).

Anyway, I certainly hope that CloudFoundry will improve on this matter in the near future.

I hope this story will benefit others.

Yann

James Bayer

Mar 5, 2015, 9:14:27 AM
to vcap...@cloudfoundry.org
hi yann,

i view this issue a little bit differently.

I may be wrong to say that CloudFoundry is not suited for applications that log at a high rate to STDOUT.

what "high logging rate" means is going to vary by situation. in this case, if your "cf logs myapp" has good enough bandwidth connection to loggregator to keep up with the amount of logs coming from all of your app instances, then it's likely that when you're using syslog that the syslog server is not able to keep up. so that's why alex was recommending that path i'm guessing.

if you always have more log volume coming in than can be sent out of the system, at some point you'll overflow capacity in the middle. in CF's default configuration, the number of log lines allowed to back up is 100 per app, to make sure we don't have cascading failures in a multi-tenant system where 1 tenant's very chatty logging app stops other apps' logs from getting to where they need to go.

this has not been much of a problem for others that we've noticed yet, so i would investigate whether the system you're draining to can keep up with the incoming volume.




--
Thank you,

James Bayer

Yann ROBERT

Mar 5, 2015, 11:17:54 AM
to vcap...@cloudfoundry.org
Hi James,

Just to clarify the logging rate:
The main issue, which occurs pretty often, is when an application logs, let's say, 120 lines at a time and then keeps quiet. This is a pretty common scenario when a Java application prints a stack trace after a failure. For instance, if you have 2 or 3 threads accessing a backend and the connection to the backend fails, then all of the running threads will get a failure, which may result in more than 100 lines of logs at the same time.
This simple scenario is enough to overflow the LGR buffer, because LGR cannot send to the syslog server before its buffer is full.

NOTE: The number of lines that must be sent at once to overflow LGR differs between the public cloud providers I tested.

In light of the previous explanation, because the buffer size is only 100, it is simply impossible to support that scenario.

I am not talking about a scenario where the application is constantly logging 200 lines per second for several minutes or hours.

On the syslog server's ability to keep up, and the available bandwidth between CF and the syslog server:
My tests show that when the application sends its logs directly to the syslog server, no log is lost, and the syslog server is able to keep up with the flow, be it a temporary peak or a constant flow of hundreds of lines per second.
So I am pretty confident that the hosted, highly available syslog service is able to keep up with a peak of 100 lines.

The question is just how much time it takes for the LGR buffer to fill up. If you send 120 lines at the same time to STDOUT, the buffer is full instantly and overflows immediately.
It ends up being FIRST a problem with the buffer size.

Java applications are very chatty when it comes to failures, because a common practice is to print the stack trace.
Supporting very temporary peaks in log rate would be great. The default buffer size should be 1000, not 100.

Also the strategy of what is done when the buffer is full is questionable.

Putting backpressure on the application is fine with me.
I guess some application developers would rather have the logging thread block while the buffer is full than lose some logs.
It would end up being the developer's responsibility to either leave it that way, if it's rare enough, or optimize the application so that it logs less. The developer would be in control.

Another strategy would be to discard the new messages instead of discarding the content of the buffer. This would at least give you the first 100 lines of logs instead of none, or just the latest lines if you're lucky.
Alternatively, a circular buffer may also be interesting, to avoid dropping whole chunks; on overflow, it only drops the oldest messages.
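
To illustrate the circular-buffer idea (a minimal Java sketch of the proposal, not loggregator's actual code):

import java.util.ArrayDeque;
import java.util.Deque;

class RingBuffer {
    private final int capacity;
    private final Deque<String> messages = new ArrayDeque<>();

    RingBuffer(int capacity) {
        this.capacity = capacity;
    }

    void offer(String message) {
        if (messages.size() == capacity) {
            messages.removeFirst(); // drop only the single oldest line on overflow
        }
        messages.addLast(message); // the newest lines are always kept
    }
}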

A dynamic strategy that discards logs based on the level (INFO, WARN, ERROR, ...), like logback's AsyncAppender does [ http://logback.qos.ch/apidocs/ch/qos/logback/classic/AsyncAppender.html ], would be great, but we don't have a level on a plain STDOUT flow. Also, when all the logs are important because they are at WARNING or ERROR level, you don't want to filter them.

A dynamic strategy that arbitrarily discards 1 line out of 2 may be better than dropping a full chunk of 100 messages, because when you look at the logs it's easier to 'read between the lines'.

We can think of many more alternative strategies for discarding messages. It all depends on the developer's needs.
Putting backpressure on the application, to give control back to the developer, is what will work best.

Yann

Mike Heath

Mar 5, 2015, 1:42:43 PM
to vcap...@cloudfoundry.org
I have to disagree with regard to putting back pressure on the application. If someone is tailing the logs of an app over a slow VPN connection, that should not slow the application down. I do agree that 100 retained log messages is too few. We retain 500 log messages on our Cloud Foundry instances and that seems to work well for us.

-Mike