I would like to take the opportunity of the recent, excellent cross-team collaboration to squash a devmode-related issue [1,2] to discuss a couple of concerns that I've been meaning to raise for a while now:
(1) compatibility and testing of Quarkus features (esp. devmode/classloading)
(2) CR vs. Final releases
I have tried to keep this mail short and stick to the data as much as possible, with a few, most recent examples, so that together, we can discuss solutions and improve the overall development experience.
- We all value moving fast, and we wouldn't want the Quarkus team to slow down just for the sake of it.
The business automation team definitely has its own issues with keeping up with the pace, but we are doing our best :-)
Part of it is due to the number of interconnected components that have to be cross-tested, but also the number and types of artifacts (e.g. cloud images) that have to be released:
all the BA-managed dependencies in Kogito have been released together at once, as part of the same train, for a while now
Part of it is due to technical limitations that we've been trying to address (e.g. using JBoss infra instead of deploying to Maven Central).
Finally, like everyone, we have conflicting priorities and we cannot always interrupt other tasks to investigate Quarkus-related issues.
However, this has sometimes resulted in a poor experience for end users: Quarkus may ship with a broken Kogito release and/or even not ship Kogito at all (luckily, this has only happened in extreme cases)
In our experience, the features that are most commonly a source of issues are:
- native image building
- devmode/classloading
- integration between different extensions
Now:
- with native-image, there is little we can do, as most issues derive from changes in the upstream GraalVM native image builder
- over DevMode, classloading and cross-extension integration we have more control
For instance, when instrumentation was first introduced [3], it dramatically changed the behavior of "real devmode", sometimes departing significantly from the behavior that we could reproduce in supported DevMode tests. We were able to reproduce the behavior of instrumentation only by using a mechanism (DevMojoIT [4]) that is supposedly for internal Quarkus use only.
A recent release (Quarkus 2.6) introduced a major bump of the Kafka version that inadvertently broke a Vert.X extension [5]. Admittedly this extension was not among the advised Kafka connectors, but it still shipped as part of the core. I remember other similar incidents in the past (for instance with the OpenAPI extension). While evidence of such situations often surfaces within the Kogito codebase, this seems to suggest that some extensions may not be verifying some integration paths. Kogito, being a "leaf" consumer of many features, is more likely to run into breakage, but this also means that user-level code may run into similar breakage. Indeed, this is a thorny issue, because the compatibility matrix may be very large.
Another cause of disruption is that, sometimes in the past, fixes for issues in a CR release have shipped as part of a Final release, without an intermediate CR. While it is completely understandable why this was so, it has sometimes (e.g. [5]) resulted in a change in behavior in the middle of the platform release window. As already mentioned, due to our release process this window is already rather short, and issues occurring within it shrink it even further, because of the time needed to find a solution and release the fix (sometimes on both ends).
What kind of actions can we take to further improve over the current state?
On the one hand, I think that we, as the business automation/Kogito team, may be able to contribute representative tests to Quarkus core itself when some edge case is detected, instead of only creating integration tests in our own codebase. Again, the reason why such tests are often not provided on our end is that it is harder for us to create completely self-contained reproducers, because most bugs involve at least our Quarkus extension.
Do you have any other ideas on how to improve this process further?
[1] Zulip: Kogito and RESTEasy Reactive
[2] GitHub PR
[3] Zulip: Kogito extension: instrumentation may not reload some classes
[4] DevMojoIT
[5] Kogito + 2.4.2.Final/2.5.0.CR1 issues
quarkus-jackson no longer a (transitive) dependency
Quarkus-provided Vert.X Kafka Client incompatible with the Quarkus Kafka client version (3.0.0)
--
You received this message because you are subscribed to the Google Groups "Quarkus Development mailing list" group.
To unsubscribe from this group and stop receiving emails from it, send an email to quarkus-dev...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/quarkus-dev/CACPHShxoOYA0cdzS-gX8%2Bu2XLC9i7VTX3Ujn%3DVYe2h3RMQ58gw%40mail.gmail.com.
Thanks everyone for chiming in.
So, to summarize:
- it is not possible to introduce a further round of CR after a CR1, because it would make Final release dates less predictable
- we need to address the issue with Kogito + Quarkus Ecosystem CI
- document which components are being consumed by our extension
- introduce post-mortems after significant incidents
W.r.t. Ecosystem CI: indeed, I understand the sentiment. We ourselves have a list of nightly builds that are often in a broken state (e.g. full native build), and in the last few weeks we've been trying to address those with a more thorough "guardian" schedule (i.e. we are taking turns at watching the logs); but admittedly, for some reason, the Quarkus Ecosystem CI was overlooked.
I am raising this issue to make sure that we are notified on more channels, so more eyes can be kept on it.
The Kogito Ecosystem CI would need further improvement too, because currently it only builds kogito-runtimes, while issues have sometimes also been found in kogito-apps and kogito-examples.
On the other hand, I think that we should not rely only on Kogito's Quarkus Ecosystem CI build, or we may end up being frustrated again and stop watching it again :-)
There are at least two reasons why it often breaks:
1) The Kogito surface itself is large, because it integrates quite a few components. This is also one reason why we have nightly builds ourselves: to run longer integration tests, native builds, etc. We are also starting to consider some degree of manual testing to verify devmode and/or other features that have sometimes proven challenging to test in an automated way.
2) Because the surface is large, it integrates several different upstream components in the core, which means that breakage may occur in many different places.
Hence, as much as I am frustrated myself with it being routinely broken, even if we address it more promptly (which, again, is indeed necessary), it will continue to break often (and, if we include -apps and -examples, even more often) unless we also extend the coverage of Quarkus core when it comes to integrating multiple extensions together.
In other words, we should strive to make it so that a red status in Kogito Ecosystem CI only means "Kogito did something wrong, that was rectified in Quarkus, and Kogito needs fixing", and not just "either Kogito or Quarkus broke, and the issue may be on either end".
<insert spiderman meme here>
:)
Interesting read and thanks for the link to issues that I'm still exploring.
I see it as a two-part solution:
- fix the Ecosystem CI for Kogito. As far as I can see, the Kogito team holds all the answers here: add additional tests and keep an eye on it?
- add tests to Quarkus where the Kogito team finds that coverage is not good enough. That makes perfect sense to me. We have two places for this, depending on the level at which the tests make most sense: in the Quarkus core extensions and integration tests, or in the platform test suite.
And doing post-mortems makes perfect sense to me.
/max
https://xam.dk/about
I am happy to report that the Ecosystem CI is back to green! Now let us all keep it that way! 🚀