Diagnostic tools for Quarkus apps in container


Emmanuel Bernard

Jan 18, 2022, 7:18:40 AM
to Quarkus Development mailing list, Fabian Martinez Gonzalez, Eric Wittmann
Hi,

I am having conversations with teams that run Quarkus in production in containers, and I am trying to figure out the tooling they need to diagnose one of their containerized Quarkus apps when something goes wrong in production.

Their initial thought was to use JMX as the entry point for running some operations. But
1. I'm not sure JMX is the right entry point
2. JMX in native executables is still not a thing anyway

So my question is: do we have a document describing the options for diagnosing a Quarkus app in containers deployed on Kubernetes?
If not, how can we discuss it to find the best options and build such a document?

Specific needs that got raised thus far:
- thread dump capture
- heap dump capture
- maybe flamegraph
- maybe JFR (though I don't think they have instrumented their app for it)

They need a wide net as it is about diagnostics before finding a clue and zooming in.

Thoughts

Emmanuel

Max Rydahl Andersen

Jan 18, 2022, 8:23:05 AM
to Emmanuel Bernard, Quarkus Development mailing list, Fabian Martinez Gonzalez, Eric Wittmann

On 18 Jan 2022, at 13:18, Emmanuel Bernard wrote:

- maybe JFR (though I don't think they have instrumented their app for it)

They don't need to instrument their app for it - can be enabled via command line flags or with a sidecar.

They need a wide net as it is about diagnostics before finding a clue and
zooming in.

So first - I don't think we actually have a good doc for it, but it makes sense that we write one. Below are my own thoughts on what to cover.

  1. One should use traditional cloud native means first (enable monitoring/telemetry etc.) - the expectation is that they have existing infrastructure that can gather that info.
  2. If the app is running in JVM mode then I don't see why we would exclude JMX - it is a rich source of info. It would be a mistake not to use it IMO.
  3. For JFR - look into Cryostat, which can gather JFR data (https://developers.redhat.com/blog/2021/01/25/introduction-to-containerjfr-jdk-flight-recorder-for-containers)
  4. Beyond that - it is just Java, so all the tricks in the book apply; i.e. if you have access to the container via sidecars, port forwards etc. you can use jcmd, jps, jstat etc. to get tons of info without stopping the instance, using things like oc exec (I believe kubectl has something similar now but I forget its name)
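
To illustrate point 2, here is a minimal sketch, assuming JVM mode, of the kind of information the standard platform MXBeans expose in-process, without opening a remote JMX port. The class name JmxSnapshot and the selection of values are purely illustrative; none of this applies to a native executable.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: reads a few platform MXBeans in-process (JVM mode only). */
public class JmxSnapshot {

    /** Collects a small set of JVM health figures from the standard platform MXBeans. */
    public static Map<String, Long> collect() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        Map<String, Long> snapshot = new LinkedHashMap<>();
        snapshot.put("heapUsedBytes", heap.getUsed());
        snapshot.put("heapCommittedBytes", heap.getCommitted());
        snapshot.put("liveThreads", (long) ManagementFactory.getThreadMXBean().getThreadCount());
        snapshot.put("loadedClasses", (long) ManagementFactory.getClassLoadingMXBean().getLoadedClassCount());
        snapshot.put("uptimeMillis", ManagementFactory.getRuntimeMXBean().getUptime());
        return snapshot;
    }

    public static void main(String[] args) {
        // Print one "name = value" line per figure.
        collect().forEach((name, value) -> System.out.println(name + " = " + value));
    }
}
```

The same beans are what a remote JMX client would show; reading them in-process just avoids the port question.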

/max

You received this message because you are subscribed to the Google Groups "Quarkus Development mailing list" group.
To unsubscribe from this group and stop receiving emails from it, send an email to quarkus-dev...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/quarkus-dev/CANYWk7Md%3DxRTsb54CX6aMcAhEy6Lzjtt%2B7U_7fKoJoFaZ1ro_A%40mail.gmail.com.

Emmanuel Bernard

Jan 18, 2022, 11:52:14 AM
to Max Rydahl Andersen, Quarkus Development mailing list, Fabian Martinez Gonzalez, Eric Wittmann
On Tue, Jan 18, 2022 at 2:23 PM Max Rydahl Andersen <mand...@redhat.com> wrote:

On 18 Jan 2022, at 13:18, Emmanuel Bernard wrote:

- maybe JFR (though I don't think they have instrumented their app for it)

They don't need to instrument their app for it - can be enabled via command line flags or with a sidecar.

Ah yes sure for the existing events published, I was thinking of custom events, my bad.
 

They need a wide net as it is about diagnostics before finding a clue and
zooming in.

So first - I don't actually think we have a good doc for it but makes sense we make one - below my own thoughts on what to cover.

  1. One should be using traditional cloud native means first (enable monitoring/telemetry etc.) - expectation is that they have existing infrastructure that can gather that info.
  2. if the app is running in jvm mode then I don't see why to exclude using JMX - it is a rich source of info. Would be a mistake to not utilize it IMO.
How do I do the same in a natively compiled version?
Should I have two approaches if I run an A/B of Quarkus in JVM mode and in native mode?
 
  3. for JFR - look into cryostat which could gather JFR data (https://developers.redhat.com/blog/2021/01/25/introduction-to-containerjfr-jdk-flight-recorder-for-containers)
  4. beyond that - it is just java; so all the tricks in the book apply; i.e. if you have access to the container via sidecars, port forwards etc. you can use jcmd, jps, jstat etc. to get tons of info without stopping the instance using things like oc exec (I believe kubectl has similar now but forget its name)
Cool, I'd love a recommended, documented version on how to enable that. Maybe we will discover that enabling something different in our kubernetes setup could facilitate the users here.

Here is my initial dump to get started

Telemetry is your first line of defense


Metrics to your dashboard

Logs -> to some centralized logging system of your choice (e.g. Elastic, CloudWatch)

Trace -> ...


Enable JFR event collection

They are very useful and cost ~1% overhead (default jfr profile)


Use Cryostat to enable JFR in your Quarkus or Java container

(https://developers.redhat.com/blog/2021/01/25/introduction-to-containerjfr-jdk-flight-recorder-for-containers)


Set the default.jfc (the JFR settings file): see JAVA_HOME/lib/jfr/default.jfc


JFR can create:

  • thread dumps every “chunk” whatever that is

  • Heap dumps every n nanoseconds


Do JFR and Cryostat work with Mandrel?

How do you change these settings (chunk and nanosecond value)?

How can we make it easier for people to set up Cryostat for a Quarkus app in a Kubernetes deployment?
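
On the question of changing these settings in JVM mode: besides editing a .jfc file, individual settings can be overridden through the jdk.jfr API. What follows is a minimal sketch, assuming a HotSpot JDK 11 or later with the jdk.jfr module available. The class name JfrSketch, the one-second thread dump period and the temp directory are illustrative choices, not recommendations, and the sketch says nothing about whether the same code works in a Mandrel native executable.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.text.ParseException;
import java.time.Duration;
import jdk.jfr.Configuration;
import jdk.jfr.Recording;

/** Hypothetical helper: records JFR data with the "default" settings and one overridden period. */
public class JfrSketch {

    /** Records for the given duration and writes the result to a .jfr file in a temp directory. */
    public static Path record(Duration howLong)
            throws IOException, ParseException, InterruptedException {
        // "default" is the predefined configuration backed by JAVA_HOME/lib/jfr/default.jfc.
        Configuration base = Configuration.getConfiguration("default");
        Path out = Files.createTempDirectory("jfr-sketch").resolve("diagnostic.jfr");
        try (Recording recording = new Recording(base)) {
            // Override a single setting: emit the periodic thread dump event every second
            // (an arbitrary value picked for this sketch).
            recording.enable("jdk.ThreadDump").withPeriod(Duration.ofSeconds(1));
            recording.start();
            Thread.sleep(howLong.toMillis());
            recording.stop();
            // Write everything recorded so far to the target file.
            recording.dump(out);
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        Path file = record(Duration.ofSeconds(3));
        System.out.println("Wrote " + Files.size(file) + " bytes to " + file);
    }
}
```

The resulting file can be copied out of the pod and opened in JDK Mission Control.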

Privileged scripts - run by someone with cluster privilege

Next step is to use a script that captures the relevant information from the Kubernetes pod(s).

This script has to be run by a specially authorized group of people.



Specific needs that got raised thus far:

- thread dump capture

- heap dump capture

- maybe flamegraph

- maybe JFR 


Note that JFR can do thread and heap dump for you.


Otherwise, use the jmap/jstack/etc tools

https://stackoverflow.com/questions/64121941/how-to-get-a-heap-dump-from-kubernetes-k8s-pod 

https://www.jaktech.co.uk/java/how-to-get-java-memory-heap-dump-from-pod-in-kubernetes/ 


TODO get the specific examples here
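
As one possible concrete example for JVM mode, and as an in-process alternative to running jmap inside the pod rather than the approach from the links above: the HotSpot diagnostic MXBean can write an .hprof file. This is a minimal sketch, assuming a HotSpot-based JDK; the class name HeapDumpSketch and the file naming are hypothetical.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.lang.management.ManagementFactory;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper: triggers a heap dump from inside the running JVM (HotSpot only). */
public class HeapDumpSketch {

    /** Writes a heap dump into the given directory and returns the created file. */
    public static Path dumpHeap(Path directory, boolean liveObjectsOnly) throws IOException {
        // The target must not exist yet, and recent JDKs require the .hprof suffix.
        Path target = directory.resolve("heap-" + System.nanoTime() + ".hprof");
        HotSpotDiagnosticMXBean diagnostics =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        // liveObjectsOnly = true dumps only reachable objects.
        diagnostics.dumpHeap(target.toString(), liveObjectsOnly);
        return target;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("heap-dump-sketch");
        Path file = dumpHeap(dir, true);
        System.out.println("Wrote " + Files.size(file) + " bytes to " + file);
    }
}
```

The file still has to be copied out of the pod, and a heap dump can contain sensitive data, so the access question raised above remains.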


How to make things better

Create higher level tools


Azure Spring Cloud has higher level commands for it

https://docs.microsoft.com/en-us/azure/spring-cloud/how-to-capture-dumps

I like it.

Find a way to automate this process for people for capturing heap dump on OOME

https://medium.com/@281332/how-collect-java-heap-dumps-from-stateless-microservices-running-on-kubernets-a1aed5575128

Loïc MATHIEU

Jan 18, 2022, 12:23:19 PM
to Emmanuel Bernard, Quarkus Development mailing list, Fabian Martinez Gonzalez, Eric Wittmann
Hi,

I usually do not rely on JMX on containerized apps deployed on Kubernetes as it's not easy to locate pods and gather JMX metrics as pods are ephemeral. And you'll need to open a port for it.

I usually add extensions for observability instead:
- metrics: usually Prometheus with Grafana dashboard. I add meters for the time taken for the entrypoints of my apps (@Timed for message based application or OOTB support for JAX-RS).
- opentracing / opentelemetry (but I almost never use them)

When I need to debug / analyze deeper (once a month), I connect inside the pod and launch a profiling session with async-profiler or JFR. It's not very convenient but as it's not often it's not an issue. But you'll need to kubectl cp / kubectl exec, and not everybody can. Some tools can help (kubectl flame for async-profiler or cryostat for JFR).

Async-profiler inside a container has some limitations due to missing symbols but it's still my tool of choice. Some JVM options allow more fine grained profile information: -XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints, you'll want to add them to your JVM command line.

You raise a good point about thread dumps / heap dumps: again, you'll need to connect to the pod to take them, and this is not very convenient. I never take heap dumps; I usually find a clever way to analyze the heap (heap histogram, GC logs, allocation profiling), but there is a legitimate use case for thread dumps. One can imagine exposing an endpoint to take them, but do it with caution, as it can be very easy to crash the app by calling it too often.
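
The "with caution" part can be sketched as a thread dump helper that refuses to run more often than a configured interval. This is a minimal illustration in plain Java, assuming JVM mode; the class name ThrottledThreadDump and the throttling policy are hypothetical, and wiring it to an actual (secured) endpoint is left out.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.time.Duration;
import java.util.Optional;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical helper: takes a thread dump at most once per configured interval. */
public class ThrottledThreadDump {

    private final long minIntervalNanos;
    private final AtomicLong lastDumpNanos;

    public ThrottledThreadDump(Duration minInterval) {
        this.minIntervalNanos = minInterval.toNanos();
        // Start one interval in the past so that the first call is always allowed.
        this.lastDumpNanos = new AtomicLong(System.nanoTime() - minIntervalNanos);
    }

    /** Returns a textual thread dump, or empty if the previous dump was too recent. */
    public Optional<String> tryDump() {
        long now = System.nanoTime();
        long last = lastDumpNanos.get();
        // The compareAndSet ensures only one of several concurrent callers wins.
        if (now - last < minIntervalNanos || !lastDumpNanos.compareAndSet(last, now)) {
            return Optional.empty();
        }
        StringBuilder dump = new StringBuilder();
        // false/false skips lock and synchronizer details to keep the dump cheaper.
        for (ThreadInfo info : ManagementFactory.getThreadMXBean().dumpAllThreads(false, false)) {
            dump.append('"').append(info.getThreadName()).append("\" ")
                .append(info.getThreadState()).append('\n');
            for (StackTraceElement frame : info.getStackTrace()) {
                dump.append("    at ").append(frame).append('\n');
            }
            dump.append('\n');
        }
        return Optional.of(dump.toString());
    }

    public static void main(String[] args) {
        ThrottledThreadDump dumper = new ThrottledThreadDump(Duration.ofMinutes(1));
        System.out.println(dumper.tryDump().orElse("throttled"));
        // The second call falls inside the interval and is refused.
        System.out.println(dumper.tryDump().orElse("throttled"));
    }
}
```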

Last but not least, don't forget to activate GC logs. On Java 9 and later, -Xlog:gc*:file=/tmp/gc.log is enough. They contain invaluable information about the memory usage of your application (and way more if you know the dark art of interpreting them). Again, to access them you'll need to kubectl cp them.
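
Since these options are easy to lose between a local setup and the container image, a small startup check can report which of them are missing. This is a minimal sketch: the class name DiagnosticFlagCheck is hypothetical, and the matching is a plain string comparison on the JVM input arguments, which is an assumption about how the flags were passed.

```java
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: reports which of the diagnostic JVM options mentioned above are absent. */
public class DiagnosticFlagCheck {

    /** Returns a hint for every option group that is not present in the given JVM arguments. */
    public static List<String> missing(List<String> jvmArgs) {
        List<String> missing = new ArrayList<>();
        // Finer-grained profiles for async-profiler.
        if (!jvmArgs.contains("-XX:+DebugNonSafepoints")) {
            missing.add("-XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints");
        }
        // Unified GC logging (Java 9 and later).
        if (jvmArgs.stream().noneMatch(arg -> arg.startsWith("-Xlog:gc"))) {
            missing.add("-Xlog:gc*:file=<path>");
        }
        return missing;
    }

    public static void main(String[] args) {
        // Check the arguments the current JVM was started with.
        List<String> current = ManagementFactory.getRuntimeMXBean().getInputArguments();
        missing(current).forEach(flag -> System.out.println("Not set: " + flag));
    }
}
```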

Regards,

Loïc

--

Max Rydahl Andersen

Jan 19, 2022, 4:03:51 AM
to Loïc MATHIEU, Emmanuel Bernard, Quarkus Development mailing list, Fabian Martinez Gonzalez, Eric Wittmann
On 18 Jan 2022, at 18:23, Loïc MATHIEU wrote:

> Hi,
>
> I usually do not rely on JMX on containerized apps deployed on Kubernetes
> as it's not easy to locate pods and gather JMX metrics as pods are
> ephemeral. And you'll need to open a port for it.

A port is needed if you want external access to it, yes; I would still make use of JMX:
if I can actually get to the pod to run jstack etc., I can also access JMX info.

I fully agree that it is better to add metrics/opentracing to the app for the key numbers/info you are after;
but in case you missed something, accessing them via JMX in prod is an option not to ignore.

Regarding GC logs: is this still the case? With JFR in newer Java versions, my understanding was that
it contains the same information and can be had with less overhead.

/max
https://xam.dk/about

Max Rydahl Andersen

Jan 19, 2022, 4:11:35 AM
to Emmanuel Bernard, Quarkus Development mailing list, Fabian Martinez Gonzalez, Eric Wittmann
>> 1. One should be using traditional cloud native means first (enable
>> monitoring/telemetry etc.) - expectation is that they have existing
>> infrastructure that can gather that info.
>> 2. if the app is running in jvm mode then I don't see why to exclude
>> using JMX - it is a rich source of info. Would be a mistake to not utilize
>> it IMO.
>>
>> How do I do the same in a natively compiled version?
> Should I have two ways if I have a A/B of Quarkus in JVM and in native?

Yes, it's effectively two different scenarios. No way around it. You also can't do jstack, etc.
with a native image. JFR is the only somewhat common denominator.
>
>
>>
>> 3. for JFR - look into cryostat which could gather JFR data (
>> https://developers.redhat.com/blog/2021/01/25/introduction-to-containerjfr-jdk-flight-recorder-for-containers
>> )
>> 4. beyond that - it is just java; so all the tricks in the book
>> apply; i.e. if you have access to the container via sidecars, port
>> forwards etc. you can use jcmd, jps, jstat etc. to get tons of info without
>> stopping the instance using things like oc exec (I believe kubectl has
>> similar now but forget its name)
>>
>> Cool, I'd love a recommended, documented version on how to enable that.
> Maybe we will discover that enabling something different in our kubernetes
> setup could facilitate the users here.
> Looks like it's kubectl exec too
> https://www.jaktech.co.uk/java/how-to-get-java-memory-heap-dump-from-pod-in-kubernetes/

yes, correct - I feel someone told me there were some slight differences between the two privilege wise
but I don't know for certain how aligned they are now. I assume fairly close :)

> Here is my initial dump to get started

Looks good - wanna make a PR for a guide starting point ?

> Does JFR and cryostat work for mandrel?

It should, but that is something we should figure out. Let's get the guide PR started and we can drag in those involved to tell us where we are wrong :)

> How to make things better
>
> Create higher level tools
>
> Azure Spring Cloud has higher level commands for it
>
> https://docs.microsoft.com/en-us/azure/spring-cloud/how-to-capture-dumps
> I like it.

AFAICS this is literally what Cryostat can/should enable: triggering capture and gathering of the JVM-relevant data.
I don't know if they have it all, but that would be useful for sure.

> Find a way to automate this process for people for capturing heap dump on
> OOME
> https://medium.com/@281332/how-collect-java-heap-dumps-from-stateless-microservices-running-on-kubernets-a1aed5575128

yeah - the big challenge is ensuring it is done securely.

/max

Loïc MATHIEU

Jan 19, 2022, 4:30:40 AM
to Max Rydahl Andersen, Emmanuel Bernard, Quarkus Development mailing list, Fabian Martinez Gonzalez, Eric Wittmann
I'm not sure how JFR generates GC events. By the way, GC streaming events also exist with JMX, but they are known to be less accurate.
The best source of GC activity is the GC logs: they are very low overhead and well known, with a lot of tooling (much better than the tooling based on JFR GC events), so I still use them in all circumstances.

Loïc MATHIEU

Jan 19, 2022, 4:43:46 AM
to Max Rydahl Andersen, Emmanuel Bernard, Quarkus Development mailing list, Fabian Martinez Gonzalez, Eric Wittmann
I think one of the answers was not going to the ML so I will try to answer globally.

For native applications, JFR is a new solution that I haven't tested yet, but looking at the documentation it has a limited set of events and can only be activated at startup, see https://www.graalvm.org/22.0/reference-manual/native-image/JFR/. So you cannot really use it in production, as it needs to be started when your application starts and the profile only becomes available when you stop it (or at a time defined when starting the application). On the JVM I usually use jcmd to start a profiling session when needed.

For native applications, you should rely on standard Linux tools and native image capabilities, so you can:
- take thread dump with kill
- if you're on the enterprise version you can take heap dump (never tested): https://www.graalvm.org/22.0/reference-manual/native-image/NativeImageHeapdump/


Emmanuel Bernard

Jan 19, 2022, 10:32:00 AM
to Max Rydahl Andersen, Quarkus Development mailing list, Fabian Martinez Gonzalez, Eric Wittmann
On Wed, Jan 19, 2022 at 10:11 AM Max Rydahl Andersen <mand...@redhat.com> wrote:

Looks good - wanna make a PR for a guide starting point ?

Fabian Martinez Gonzalez

Jan 20, 2022, 5:17:30 AM
to Emmanuel Bernard, Max Rydahl Andersen, Quarkus Development mailing list, Eric Wittmann
Hey Emmanuel

I find that list you made really useful

btw, for our specific situation it turned out that running privileged scripts is going to be difficult so I think we will look into setting up JFR event collection in our environment

Thanks!
--

Fabian Martinez Gonzalez

Senior Software Engineer

Red Hat
