(Resending as I don't see it showing up)
Hello,
The subject line hints at how I feel about a system I helped design
(Treeherder/pulse_actions/mozci/Heroku/Pulse/TaskCluster/Buildbot).
In this post I will do the following:
i) explain how Treeherder actions work
ii) describe some difficulties I'm facing
iii) propose an idea that could help systems like this
iv) ask how it could have been different
i) How Treeherder actions work
------------------------------
NOTE: I'm ignoring simple requests like cancelling or re-triggering
jobs, because those are not handled by pulse_actions.
* The developer authenticates on TH
* The developer triggers an action (on either a job or a push)
* Treeherder sends a pulse message with the action name + relevant data
* pulse_actions consumes multiple Treeherder exchanges and topics
* pulse_actions uses mozci to find metadata about the push
** e.g. relationship of jobs (build jobs trigger test jobs)
* pulse_actions schedules jobs through Buildapi or TaskCluster/BBB
* If everything goes well, the job will show up on Treeherder
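The flow above can be sketched in Python. All names here
(handle_message, find_jobs_for_push, the message fields, the recorded
backends) are illustrative stand-ins I made up, not the real
Treeherder/pulse_actions/mozci APIs:

```python
# Illustrative sketch of the pulse_actions consumer step described above.
# Every name in this sketch is hypothetical.

scheduled = []  # records (backend, job name) pairs, for illustration only

def find_jobs_for_push(repo, revision):
    """Stand-in for the mozci step that finds metadata about the push,
    e.g. which build jobs trigger which test jobs."""
    return [
        {"name": "linux64 opt build", "backend": "buildbot"},
        {"name": "linux64 opt mochitest-1", "backend": "taskcluster"},
    ]

def handle_message(message):
    """Dispatch one Treeherder action message, as pulse_actions would."""
    repo, revision = message["project"], message["revision"]
    for job in find_jobs_for_push(repo, revision):
        # Schedule through Buildapi or TaskCluster/BBB depending on backend.
        scheduled.append((job["backend"], job["name"]))

handle_message({"action": "backfill",
                "project": "mozilla-inbound",
                "revision": "abcdef123456"})
```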
ii) Some difficulties I'm facing
--------------------------------
This system is composed of various tools working together. Determining
why something doesn't work requires checking a myriad of places, and it
requires very specialized knowledge of how the pieces work with each
other.
Here are some places where the system might fail:
* Treeherder does not tell the developer that he/she does not belong to
the right group (this is not an issue anymore, IIUC)
* pulse_actions receives the request but might fail due to a recent
mozci deployment
* A TaskCluster graph is scheduled, but the Buildbot Bridge fails, or
the jobs do not show up on Treeherder (e.g. a regression in BBB)
As you can see, each system can break, and there's no unified view of
what happened from the beginning of the request to the end.
iii) An idea that could help systems like this
----------------------------------------------
Most of these systems have logs on Papertrail; however, there's no way
to tie together all the log lines that apply to a single request.
A naive solution: the first system requests a UUID from a new tool;
each tool reports its log lines to this new tool and passes the UUID on
to the next tool in the chain. The new tool can then show a unified log
from all the tools involved in the request.
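A minimal sketch of that naive idea, assuming a hypothetical unified-log
service; request_uuid(), report() and the in-memory store are all made
up for illustration:

```python
import uuid

# Stand-in for the new tool's storage; a real service would persist this.
LOG_STORE = {}

def request_uuid():
    """The first system in the chain asks the new tool for a request UUID."""
    rid = str(uuid.uuid4())
    LOG_STORE[rid] = []
    return rid

def report(rid, tool, line):
    """Each tool reports its log lines under the shared UUID."""
    LOG_STORE[rid].append((tool, line))

# Each tool passes the UUID on to the next one:
rid = request_uuid()
report(rid, "treeherder", "action requested: backfill")
report(rid, "pulse_actions", "consumed message, querying mozci")
report(rid, "buildbot-bridge", "scheduling graph")
```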
Each system should also be able to handle the case where something goes
wrong and declare that the request has ended.
I don't know how each system would authenticate itself, or how we would
declare upfront which tools may report to this new tool.
This new tool would produce pulse messages to indicate when requests
start and end.
Querying an API on this new tool with the UUID would show all messages
related to the request.
It would be ideal to allow two levels of logging: one for developers to
read and understand when filing bugs, and one for maintainers to dig
deep and understand what failed.
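The two-level query could look something like this. unified_log(), the
level names, and the toy data are assumptions for illustration, not a
real API:

```python
# Toy message store for the hypothetical unified-log tool:
# (uuid, level, tool, line) tuples.
MESSAGES = [
    ("1234", "developer",  "treeherder",    "backfill requested"),
    ("1234", "maintainer", "pulse_actions", "mozci deployment detected"),
    ("1234", "developer",  "pulse_actions", "scheduling 4 jobs"),
]

def unified_log(rid, level="developer"):
    """Return all messages for one request UUID.

    Developers see only developer-facing lines; maintainers see both
    levels so they can dig deep into what failed."""
    wanted = {"developer"} if level == "developer" else {"developer", "maintainer"}
    return [(tool, line)
            for (u, lvl, tool, line) in MESSAGES
            if u == rid and lvl in wanted]
```

For example, unified_log("1234") returns the two developer-facing lines,
while unified_log("1234", level="maintainer") returns all three.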
iv) How could it have been different?
-------------------------------------
Is there a way we could have designed all of our systems differently to
not get into this situation?
Are these problems inherent to micro-services? (I believe what we have
is a set of micro-services rather than a monolithic app.)
regards,
Armen
--
Zambrano Gasparnian, Armen (armenzg)
Mozilla Senior Automation and Tools
https://mozillians.org/en-US/u/armenzg/
http://armenzg.blogspot.ca