(Resending as I don't see it showing up)
Hello,
The subject line hints at how I feel about a system I helped design
(Treeherder/pulse_actions/mozci/Heroku/Pulse/TaskCluster/Buildbot).
In this post I will do the following:
i) explain how Treeherder actions work
ii) describe some difficulties I'm facing
iii) propose an idea that could help systems like this
iv) ask how it could have been different
i) How Treeherder actions work
------------------------------
NOTE: I'm ignoring simple requests like cancelling or re-triggering
jobs, because those are not handled by pulse_actions.
* The developer authenticates on TH
* The developer triggers an action (on either a job or a push)
* Treeherder sends a pulse message with the action name + relevant data
* pulse_actions consumes multiple Treeherder exchanges and topics
* pulse_actions uses mozci to find metadata about the push
** e.g. relationship of jobs (build jobs trigger test jobs)
* pulse_actions schedules jobs through Buildapi or TaskCluster/BBB
* If everything goes well, the job will show up on Treeherder
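The flow above can be sketched in Python. All names here
(handle_message, find_jobs_for_push, the message fields, the recorded
backends) are illustrative stand-ins I made up, not the real
Treeherder/pulse_actions/mozci APIs:

```python
# Illustrative sketch of the pulse_actions consumer step described above.
# Every name in this sketch is hypothetical.

scheduled = []  # records (backend, job name) pairs, for illustration only

def find_jobs_for_push(repo, revision):
    """Stand-in for the mozci step that finds metadata about the push,
    e.g. which build jobs trigger which test jobs."""
    return [
        {"name": "linux64 opt build", "backend": "buildbot"},
        {"name": "linux64 opt mochitest-1", "backend": "taskcluster"},
    ]

def handle_message(message):
    """Dispatch one Treeherder action message, as pulse_actions would."""
    repo, revision = message["project"], message["revision"]
    for job in find_jobs_for_push(repo, revision):
        # Schedule through Buildapi or TaskCluster/BBB depending on backend.
        scheduled.append((job["backend"], job["name"]))

handle_message({"action": "backfill",
                "project": "mozilla-inbound",
                "revision": "abcdef123456"})
```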
ii) Some difficulties I'm facing
--------------------------------
This system is composed of various tools working together. Determining
why something doesn't work requires checking a myriad of places, and it
requires very specialized knowledge of how the pieces work with each
other.
Here are some places where the system might fail:
* Treeherder does not tell the developer that he/she does not belong to
the right group (this is not an issue anymore, IIUC)
* pulse_actions receives the request but might fail due to a recent
mozci deployment
* A TaskCluster graph is scheduled, but the Buildbot Bridge fails, or
the jobs do not show up on Treeherder (e.g. a regression in BBB)
As you can see, each system can break, and there's no unified view of
what happened from the beginning of the request to the end.
iii) An idea that could help systems like this
----------------------------------------------
Most of these systems have logs on Papertrail; however, there's no way
to tie together all the log lines that apply to a single request.
A naive solution: the first system requests a UUID from a new tool;
each tool reports its log lines to this new tool and passes the UUID on
to the next tool in the chain. The new tool can then show a unified log
from all the tools involved in the request.
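A minimal sketch of that naive idea, assuming a hypothetical unified-log
service; request_uuid(), report() and the in-memory store are all made
up for illustration:

```python
import uuid

# Stand-in for the new tool's storage; a real service would persist this.
LOG_STORE = {}

def request_uuid():
    """The first system in the chain asks the new tool for a request UUID."""
    rid = str(uuid.uuid4())
    LOG_STORE[rid] = []
    return rid

def report(rid, tool, line):
    """Each tool reports its log lines under the shared UUID."""
    LOG_STORE[rid].append((tool, line))

# Each tool passes the UUID on to the next one:
rid = request_uuid()
report(rid, "treeherder", "action requested: backfill")
report(rid, "pulse_actions", "consumed message, querying mozci")
report(rid, "buildbot-bridge", "scheduling graph")
```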
Each system should also be able to handle the case where something goes
wrong and declare that the request has ended.
I don't know how each system would authenticate itself, or how we would
declare upfront which tools may report to this new tool.
This new tool would produce pulse messages to indicate when requests
start and end.
Querying an API on this new tool with the UUID would show all messages
related to the request.
It would be ideal to allow two levels of logging: one for developers to
read and understand when filing bugs, and one for maintainers to dig
deep and understand what failed.
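The two-level query could look something like this. unified_log(), the
level names, and the toy data are assumptions for illustration, not a
real API:

```python
# Toy message store for the hypothetical unified-log tool:
# (uuid, level, tool, line) tuples.
MESSAGES = [
    ("1234", "developer",  "treeherder",    "backfill requested"),
    ("1234", "maintainer", "pulse_actions", "mozci deployment detected"),
    ("1234", "developer",  "pulse_actions", "scheduling 4 jobs"),
]

def unified_log(rid, level="developer"):
    """Return all messages for one request UUID.

    Developers see only developer-facing lines; maintainers see both
    levels so they can dig deep into what failed."""
    wanted = {"developer"} if level == "developer" else {"developer", "maintainer"}
    return [(tool, line)
            for (u, lvl, tool, line) in MESSAGES
            if u == rid and lvl in wanted]
```

For example, unified_log("1234") returns the two developer-facing lines,
while unified_log("1234", level="maintainer") returns all three.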
iv) How could it have been different?
-------------------------------------
Is there a way we could have designed all of our systems differently to
not get into this situation?
Are these problems inherent to micro-services? (I believe what we have
is a set of micro-services rather than a monolithic app.)
regards,
Armen
--
Zambrano Gasparnian, Armen (armenzg)
Mozilla Senior Automation and Tools
https://mozillians.org/en-US/u/armenzg/
http://armenzg.blogspot.ca