Treeherder Job Status and Color


gar...@mozilla.com

Apr 18, 2016, 6:51:13 PM
to mozilla-tool...@lists.mozilla.org
There has been a long-standing issue with how certain jobs report their status to treeherder, depending on the job type and what failed within the job.

As it stands today, all failed jobs within taskcluster are reported as orange (failed). We do not infer anything from exit codes, nor do we do any custom log parsing.

Here are some possible scenarios that give unexpected results:

1. Build job that builds successfully but the post-build steps fail. This will be reported by buildbot as red.
2. Infrastructure related issue, with the expectation that the task will be retried. Buildbot reports purple.

These are the two that come immediately to mind.

What I would like to bikeshed are possible solutions to this situation (and to others not mentioned here) for reporting job status to treeherder.

I hope that a solution exists that does not rely on log parsing or on particular exit codes. Both seem prone to failure and inconsistency when new jobs are stood up that do not match the log-parsing regular expressions, or that do not output the right exit code because something else in the chain of commands failed in a different way.

Some possibilities that have been brought up:
1. Rely on exit codes.
* The worker does not update treeherder and has no knowledge of it. Our integration component would still need to parse logs to find the exit code.
* Still prone to things not exiting in a way that's expected
2. Parse messages in the log and infer job state
* similar problem as #1
* creates the regex situation that currently lives within treeherder
3. Introduce a "step" syntax that tasks can use to document individual steps within a task. Similar to buildbot steps.
* still requires log parsing
* possibly could contain additional information in the "step" as to the job state
4. Create artifact with information on how to update treeherder if it should deviate from the norm
* integration component would need to check for the existence of this artifact
* commands executed within the task would need to do The Right Thing for creating this artifact.
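
As a rough illustration of option 4, the integration component could look for an override artifact and fall back to its default mapping when the artifact is absent or malformed. This is only a sketch under stated assumptions: the artifact name `public/treeherder-status.json`, its `result` field, and the function `resolve_treeherder_result` are all hypothetical, and the result names are modeled on treeherder's result values but should be checked before relying on them.

```python
import json

# Hypothetical artifact name; nothing in the platform defines it today.
OVERRIDE_ARTIFACT = "public/treeherder-status.json"

# Default mapping from a run resolution to a treeherder result (assumed names).
DEFAULT_RESULT = {
    "completed": "success",
    "failed": "testfailed",
    "exception": "exception",
}

# Results a task may request through the artifact (assumed names).
ALLOWED_OVERRIDES = {"busted", "testfailed", "retry", "exception"}


def resolve_treeherder_result(resolution, artifacts):
    """Return the result to report for a finished run.

    `artifacts` maps artifact names to their raw text content.
    """
    default = DEFAULT_RESULT[resolution]
    raw = artifacts.get(OVERRIDE_ARTIFACT)
    if raw is None:
        # No artifact: the task did not ask to deviate from the norm.
        return default
    try:
        requested = json.loads(raw).get("result")
    except (ValueError, AttributeError):
        # Malformed artifact: ignore it rather than report something odd.
        return default
    if isinstance(requested, str) and requested in ALLOWED_OVERRIDES:
        return requested
    return default
```

The fallback on a missing or malformed artifact is deliberate: a task that does nothing special keeps today's behavior.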


I definitely look forward to any ideas that come up. Hopefully this is something we can iron out in the coming weeks and make our platform report status to treeherder in a way everyone expects.

gar...@mozilla.com

Apr 18, 2016, 8:32:52 PM
to mozilla-tool...@lists.mozilla.org

> 1. Build job that builds successfully but the post-build steps fail. This will be reported by buildbot as red.

Hrm, I'm not sure if google groups ate my last post or not, but I wanted to clarify that the colors are incorrect here. If the build fails, then the job symbol is red, but if it fails during the post-build steps it's orange.

After talking with :glandium, I have also learned that there is additional behavior I was not aware of. If the build is orange, the tests will still run the moment the build artifacts are created and uploaded. These can run in parallel with the final build steps.

In taskcluster, we do not have the concept of a "kind of successful" task. Either it completed successfully or it did not, so this might become more difficult.

One possibility is splitting the build steps and the post-build steps into two separate tasks. This would allow tests to depend on the build step alone, while post-build steps like `make check` could still run in parallel without affecting whether tests run.

Here are two useful bugs that were pointed out to me:
https://bugzilla.mozilla.org/show_bug.cgi?id=992323
https://bugzilla.mozilla.org/show_bug.cgi?id=1210759

gar...@mozilla.com

Apr 18, 2016, 8:41:04 PM
to mozilla-tool...@lists.mozilla.org
> 1. Build job that builds successfully but the post-build steps fail. This will be reported by buildbot as red.

I was incorrect with this. The colors are reversed. Build jobs that fail during the build will be red and post-build failures will be reported as orange.

It's important to note that these two colors, based on the underlying job statuses that were sent in, are useful because:
1. visual representation of what failed within the build process
2. Orange builds will still have tests triggered. This does not happen currently for taskcluster jobs. In buildbot, once the artifact is created and uploaded, the tests could start in parallel even if the build will eventually be orange.


One possible solution is to work with the other teams on splitting the build from the post-build checks: create two tasks, with the tests depending only on the actual build task succeeding. Things like `make check` could happen in a separate task and not influence what gets scheduled.
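
To make the idea concrete, here is a minimal sketch of the dependency graph such a split would produce. The task names and the helper `blocked_by` are hypothetical, and this models only the scheduling relationship, not any real queue API.

```python
# Each task maps to the list of tasks it depends on (names are illustrative).
GRAPH = {
    "build": [],
    "make-check": ["build"],
    "test-1": ["build"],
    "test-2": ["build"],
}


def blocked_by(graph, failed_task):
    """Return the tasks that cannot run because they depend, directly or
    transitively, on `failed_task`."""
    blocked = set()
    changed = True
    while changed:
        changed = False
        for task, deps in graph.items():
            if task in blocked:
                continue
            if any(dep == failed_task or dep in blocked for dep in deps):
                blocked.add(task)
                changed = True
    return blocked
```

With this graph, a failing `make-check` task blocks nothing, while a failing `build` task blocks everything downstream of it.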


There is also some good info in these two bugs, which discuss what should be included in `make check` and how it should be executed:
https://bugzilla.mozilla.org/show_bug.cgi?id=1210759
https://bugzilla.mozilla.org/show_bug.cgi?id=992323

Gregory Arndt

Apr 18, 2016, 9:07:34 PM
to mozilla-tool...@lists.mozilla.org
Looks like google groups didn't eat it, just took forever to post it.
Sorry about the duplicated info.

-Greg

On Apr 18, 2016, at 7:41 PM, "gar...@mozilla.com" <gar...@mozilla.com> wrote:

>> 1. Build job that builds successfully but the post-build steps fail. This will be reported by buildbot as red.
>
> _______________________________________________
> tools-taskcluster mailing list
> tools-ta...@lists.mozilla.org
> https://lists.mozilla.org/listinfo/tools-taskcluster

Chris AtLee

Apr 19, 2016, 10:02:12 AM
to Gregory Arndt, mozilla-tool...@lists.mozilla.org
I think that in general we should rely on the tasks themselves to correctly
report their error status. That can be implemented by inspecting the error
code, or log parsing as part of the task, or something else. Requiring
Treeherder to make the determination about job status that's different from
what the task thinks is a bit of an architecture/design fail IMO.

On 18 April 2016 at 19:33, <gar...@mozilla.com> wrote:

> > 1. Build job that builds successfully but the post-build steps fail.
> This will be reported by buildbot as red.
>
> I was incorrect with this. The colors are reversed. Build jobs that fail
> during the build will be red and post-build failures will be reported as
> orange.
>
> It's important to note that these two colors, based on the underlying job
> statuses that were sent in, are useful because:
> 1. visual representation of what failed within the build process
> 2. Orange builds will still have tests triggered. This does not happen
> currently for taskcluster jobs. In buildbot, once the artifact is created
> and uploaded, the tests could start in parallel even if the build will
> eventually be orange.
>
>
> One possible solution is to work with the other teams at splitting the
> build from the post-build checks. Create two tasks where the tests only
> depend on the actual build task to succeed. Things like `make check` could
> happen in a separate task and not influence what gets scheduled.
>
>
> Also some good info are in these two bugs that discuss what should be
> included in `make check` and how it should be executed:
> https://bugzilla.mozilla.org/show_bug.cgi?id=1210759
> https://bugzilla.mozilla.org/show_bug.cgi?id=992323
>

I'm not sure how practical it is to move all the post-build checks into
separate tasks. There can be a lot of state that needs to be shared between
the build itself and the post-build checks.

Could we have tasks that are allowed to unblock their dependencies before
they're finished? That way test tasks could start before the build task is
completed.

gar...@mozilla.com

Apr 19, 2016, 10:41:11 AM
to mozilla-tool...@lists.mozilla.org
Chris,

Thanks for the feedback! I added some comments below.


> I think that in general we should rely on the tasks themselves to correctly
> report their error status. That can be implemented by inspecting the error
> code, or log parsing as part of the task, or something else. Requiring
> Treeherder to make the determination about job status that's different from
> what the task thinks is a bit of an architecture/design fail IMO.

I'm sorry, I was not implying that treeherder would make the determination as to job status at all. Something needs to tell treeherder what that status is, and treeherder just displays it. Right now the only thing reporting status is our integration component (mozilla-taskcluster). It does so by translating a taskcluster run status into a treeherder status, and in this case we only have "failed", not "build succeeded but there were other failures that should not prevent other tasks from running". Task run failures are all treated the same regardless of task type or what failed within the task's command execution.

If we are going to have an in-between status (not really a failure, but not a 100% success) then we need a way for the task to indicate that so it's reported to treeherder correctly while also unblocking the remaining tasks. Treeherder would not make this determination. This of course means relying on exit codes, log parsing, etc. that the task produces that can be inspected by mozilla-taskcluster when reporting job status.
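
A minimal sketch of what such an in-between status could look like follows. Everything here is an assumption for illustration: `build_artifacts_uploaded` is a hypothetical signal that nothing reports today, `Outcome` and `classify` are made-up names, and the result names are modeled on treeherder's result values (orange for a test failure, red for a busted build).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    treeherder_result: str    # what would be reported to treeherder
    unblock_dependents: bool  # whether downstream tasks should still run


def classify(resolution, build_artifacts_uploaded):
    """Classify a finished build run.

    `build_artifacts_uploaded` is a hypothetical signal from the task itself.
    """
    if resolution == "completed":
        return Outcome("success", True)
    if resolution == "failed" and build_artifacts_uploaded:
        # The build worked but a later step failed: orange, tests can still run.
        return Outcome("testfailed", True)
    if resolution == "failed":
        # The build itself failed: red, there is nothing to test.
        return Outcome("busted", False)
    # Anything else, for example an infrastructure exception.
    return Outcome("exception", False)
```

The point of returning two fields is that the color shown and the scheduling decision are separate questions, even though buildbot's orange status answers both at once.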

> I'm not sure how practical it is to move all the post-build checks into
> separate tasks. There can be a lot of state that needs to be shared between
> the build itself and the post-build checks.
>

That's good to know. I was just throwing crazy ideas out there to see if one stuck. Clearly this might not be the best route to go down.

> Could we have tasks that are allowed to unblock their dependencies before
> they're finished? That way test tasks could start before the build task is
> completed.

I'm not 100% sure, but there might be a way in our build tasks that the last command that is run will remove the dependencies within the queue for those test tasks. These tasks will need to use the new task.dependencies feature added to the queue and the build task will need to know all tasks that depend on it so it can change the dependency list after the fact. This is assuming that it's possible to do so from within a task like that using the taskcluster proxy.

Selena Deckelmann

Apr 20, 2016, 3:00:20 PM
to Chris AtLee, Gregory Arndt, mozilla-tool...@lists.mozilla.org
On Tue, Apr 19, 2016 at 7:02 AM Chris AtLee <cat...@mozilla.com> wrote:

>
> I'm not sure how practical it is to move all the post-build checks into
> separate tasks. There can be a lot of state that needs to be shared between
> the build itself and the post-build checks.
>
> Could we have tasks that are allowed to unblock their dependencies before
> they're finished? That way test tasks could start before the build task is
> completed.
>

I talked with Greg and Jonas about this yesterday independently, and will
summarize those conversations. They can weigh in if I make an error.

As far as I can tell, there are three options for moving forward:

1. Split the tasks apart, having post-build checks be in separate tasks and
tests being dependent on the build task itself
pros: follows the TC model of having each task being "self contained"
cons: lots of shared state, will take someone a long time to do the work

2. Run two tasks: one that makes a build, a dependent task that makes a
build AND runs the post-build checks, and tests depend on the build without
post-build checks
pros: easy to set up
cons: "wasteful" (although, compared to developer time, I argue we should
just pay)

3. Schedule tests from inside the build step

In Jonas' pseudocode:

build-task:
    make build
    A) upload build.tar.gz
    B) for each dependent test-task:
           queue.scheduleTask(test-task)
    make check

pros: supports unblocking dependencies before they are finished
cons: fragile, requires proxy support and manual uploading (docker-worker
would have taken care of it) in the task, and makes the build task aware of
dependencies
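
For illustration, option 3 could be sketched in Python as below. This is not real worker or queue code: `RecordingQueue` is a stand-in that only records calls, `schedule_task` mirrors the `queue.scheduleTask` call in the pseudocode, and the build, upload and check steps are passed in as plain callables.

```python
class RecordingQueue:
    """Stand-in for the queue client; it only records what was scheduled."""

    def __init__(self):
        self.scheduled = []

    def schedule_task(self, task_id):
        self.scheduled.append(task_id)


def run_build_task(queue, dependent_test_tasks, make_build, upload, make_check):
    """Run the build, unblock tests as soon as the artifact exists, then check.

    `make_build` and `make_check` return True on success.
    """
    if not make_build():
        # No artifact, so no tests are scheduled.
        return "failed"
    upload("build.tar.gz")
    for task_id in dependent_test_tasks:
        queue.schedule_task(task_id)
    # Tests are already scheduled, so a failure here no longer blocks them.
    return "completed" if make_check() else "failed"
```

The sketch also shows the fragility noted in the cons: the build task has to be handed the list of its dependents, and it has to do its own upload before scheduling them.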

I think we should go with #2, with the longer term goal being #1.

-selena

Gregory Arndt

Apr 20, 2016, 3:43:43 PM
to Dustin Mitchell, mozilla-tool...@lists.mozilla.org, Selena Deckelmann, Ted Mielczarek, Chris AtLee
Hey Ted,

Do you have any insight into whether `make check` will be dropped from our
build tasks?

On Wed, Apr 20, 2016 at 2:32 PM, Dustin Mitchell <dus...@mozilla.com> wrote:

> Check in with Ted -- I think the plan is to drop `make check`
> entirely, which makes all of this moot.
>
> Sorry to be brief.. TRIBE starts in 2 minutes and "be on time" is in
> our designed alliance ;)
>
> Dustin

Gregory Arndt

Apr 20, 2016, 3:53:13 PM
to Armen Zambrano G., mozilla-tool...@lists.mozilla.org, Sheriffs
>
> Can someone please refresh my memory when we need the treeherder regexes
> to kick in?
> IIRC this is only necessary for the nice log view to highlight errors [3]
> and show successful steps.
> If this is what you're referring to I think this is a different issue and
> we should not mix it together.
> Having metadata about logs for enhanced log viewing is a different matter.
>
>
I'm referring to the fact that there is a list of regexes that take care
of special scenarios when something appears in the log. What I would
like to avoid adding to that list is something that parses out the exit
code or other things that people later on want to add. Right now only the
worker knows the exit code of the container, and that is not the component
that updates treeherder. Mozilla-taskcluster, or whatever is updating
treeherder in the future, will need to somehow know this exit code,
probably by parsing the tail end of the log.


> I'm happy that we would fix 2, 3 & 4 but these are not related to showing
> the right colour of the job


I was tossing out ideas in the hope that doing something like 2, 3 or 4
might lend itself to fixing this issue as well (maybe).

Gregory Arndt

Apr 20, 2016, 4:57:03 PM
to Armen Zambrano G., mozilla-tool...@lists.mozilla.org, Sheriffs
On Wed, Apr 20, 2016 at 3:22 PM, Armen Zambrano G. <arm...@mozilla.com>
wrote:

On 16-04-20 03:52 PM, Gregory Arndt wrote:
>
>>
>>> Can someone please refresh my memory when we need the treeherder
>>> regexes to kick in? IIRC this is only necessary for the nice log
>>> view to highlight errors [3] and show successful steps. If this is
>>> what you're referring to I think this is a different issue and we
>>> should not mix it together. Having metadata about logs for
>>> enhanced log viewing is a different matter.
>>>
>>>
>>> I'm referring to the fact that there are a list of regexes that take
>> care of special scenarios of when something appears in the log.
>> What I would like to not add to this is something that can parse out
>> the exit code or other things that people later on want to add.
>>
>
> We're on the same page wrt to not adding parsing exit codes to the
> regexes of treeherder.
>
> Right now only the worker knows the exit code of the container, and
>> that is not the component that updates treeherder.
>>
>
> Where's the code that takes care of it?
>
> I only understands this part of the pipeline:
> * Mozharness exit code
> * test*.sh exit code
> * docker container exit code
> * <nebula reporting to treeherder>
> * worker running the task


Where is the code that takes care of which piece? The worker will mark a
task as successful if the final container exit code is 0, failed otherwise
[1] and report to taskcluster the right resolution (success/failed) [2].

From there the taskcluster queue publishes a pulse message to the
task-failed/task-exception/task-completed exchange for that run.
Mozilla-taskcluster listens for this message, produces a Treeherder job
payload and submits it to the treeherder API (soon to be publishing to a
pulse exchange instead).

Here is an example of a handler for a failed task run in
mozilla-taskcluster [3]. Currently we do no log inspection. The log URL
is provided to treeherder for parsing. Note that mozilla-taskcluster
is also responsible for posting job results to treeherder for non-hg repos
(such as gaia and bmo). Soon this will be expanded to more github repos.
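
The flow described above, from container exit code to treeherder job, can be sketched roughly as follows. This is an illustration and not the code in docker-worker or mozilla-taskcluster: the exchange names come from this thread, while the resolution names, the function names and the payload shape are assumptions and do not reflect the real Treeherder schema.

```python
# Exchange names as given in this thread; the resolution names are assumed.
EXCHANGE_FOR_RESOLUTION = {
    "completed": "task-completed",
    "failed": "task-failed",
    "exception": "task-exception",
}


def resolution_for_exit_code(exit_code):
    """Worker side: exit code 0 is a success, anything else is a failure."""
    return "completed" if exit_code == 0 else "failed"


def treeherder_job(task_id, run_id, exchange, log_url):
    """Integration side: build a job payload from the exchange a message
    arrived on. The payload shape is illustrative only."""
    result_by_exchange = {
        "task-completed": "success",
        "task-failed": "testfailed",  # every failure looks the same here
        "task-exception": "exception",
    }
    return {
        "job_guid": f"{task_id}/{run_id}",
        "result": result_by_exchange[exchange],
        # The log is handed to treeherder for parsing; it is not inspected here.
        "log_url": log_url,
    }
```

Note that the exit code itself never leaves the worker in this sketch: by the time the message reaches the integration component, only the exchange name distinguishes one outcome from another.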

I hope that this clears things up. I can hop on vidyo at any point to help
explain things more if I was unclear.


[1]
https://github.com/taskcluster/docker-worker/blob/master/lib/task.js#L925
[2]
https://github.com/taskcluster/docker-worker/blob/master/lib/task.js#L564
[3]
https://github.com/taskcluster/mozilla-taskcluster/blob/master/src/treeherder/job_handler.js#L447-L482

Gregory Arndt

Apr 21, 2016, 10:53:57 AM
to Armen Zambrano Gasparnian, mozilla-tool...@lists.mozilla.org, Sheriffs
No problem! Thanks for your help in trying to come up with a solution to
this.

The worker receives the exit code of whatever command the container ran
(the command that's in the task.payload).

The issue is not so much with determining the exit code
(mozilla-taskcluster could do that from the logs if it really wanted to
without polluting the queue with that). The problem is that not everything
that will be reported to treeherder will be forced to use that convention,
such as tasks running from github repos or other tasks that are not a
build/test task.



On Thu, Apr 21, 2016 at 9:31 AM, Armen Zambrano Gasparnian <
arm...@mozilla.com> wrote:

> Does the worker receive the exit code of the script it called inside of
> the container? (e.g. test-linux.sh)
> Can the taskcluster queue also include the exit code for
> mozilla-taskcluster to use when reporting to Treeherder?
>
> Thanks for the clarification!
>
> --
> Zambrano Gasparnian, Armen
> Engineering productivy engineer - #ateam
> http://armenzg.blogspot.ca
>

Dustin Mitchell

Apr 27, 2016, 9:19:13 AM
to Gregory Arndt, mozilla-tool...@lists.mozilla.org, Sheriffs, Armen Zambrano Gasparnian
I have some context from Ted regarding the future of make check.
Basically, `make check` will be short and quick, but will remain, and we
will still want to run dependent tasks if the build succeeds but the test
fails. This removes the requirement that we unblock downstream tasks
mid-build. And the requirement that downstream tasks run even if 'make
check' fails seems "soft" as it's optimizing for intermittency in just a
handful of tests.

09:03:07 <dustin> ted: was I dreaming or did you say recently that there's
work afoot to remove 'make check' from the build process?
09:03:23 <ted> you are correct:
https://bugzilla.mozilla.org/show_bug.cgi?id=992323
09:03:26 <firebot> Bug 992323 — NEW, nob...@mozilla.org — Get everything
useful out of "make check"
09:03:32 <ted> it's a little stalled at the moment because i'm busy with
other things
09:03:35 <dustin> dank u wel
09:03:47 <ted> mostly the biggest thing left is
https://bugzilla.mozilla.org/show_bug.cgi?id=1210759
09:03:48 <dustin> no worries, I'm just hoping to use that information to
forestall an elaborate workaround :)
09:03:51 <firebot> Bug 1210759 — NEW, nob...@mozilla.org — Run
PYTHON_UNIT_TESTS somewhere other than `make check`
09:03:57 <ted> n.b., we'll still have to run *some* tests after the build
09:03:59 <ted> but they should be few and short
09:04:04 <dustin> ok
09:04:10 <ted> we have some tests that want to check things in the binary
or whatever
09:04:18 <ted> and splitting them out to a separate job is just silly
09:04:33 <ted> like i have an integration test for symbol dumping
09:04:41 <dustin> is it worth the effort to try to run the separate test
tasks anyway, if those make-check things fail?
09:04:45 <ted> yeah
09:05:05 <ted> well
09:05:07 <ted> i *think*
09:05:17 <ted> the problem we keep hitting is that builds are expensive
09:05:31 <ted> so you don't want to have to re-run a whole build because
some test failed
09:06:38 <dustin> right
09:06:53 <dustin> I guess the question is, how common is that
09:07:13 <dustin> it's a rather significant refactor to be able to have a
job "fail" but still trigger its downstream tasks
09:07:26 <dustin> anyway, I'll add that to the thread
09:07:33 <dustin> I'm kinda trying to stay out of this one :)
09:07:34 <dustin> thanks
09:09:17 <ted> gotcha
09:09:26 <ted> right now it's kinda common, because we still have a bunch
of stuff in `make check`
09:09:32 <ted> if it was just a handful of things? dunno
09:09:42 <ted> probably would be less common, but tests gonna fail
intermittently
