On 11/19/2012 11:33 PM, Eric Rescorla wrote:
> TL;DR:
> I think this is almost 180 degrees opposed from the right strategy.
> What we need right now is coverage of the things that work so that
> when we make changes we know we're not making the situation worse.
> *All* other testing work should wait on this. In particular, crashtests
> are not useful at this point.
I'll note that "crashtests" is a term being used loosely here.
Crashtests are nothing more than "when I load this nothing crashes",
while mochitests are "when I load this it calls Ok or No Ok"
(effectively). So really, there isn't a lot of difference there beyond
(for most WebRTC things) "we got to the normal end of the test".
The way some people are using "crashtests" is really "we once crashed
here, capture that as a crashtest", with the assumption you're
protecting against that (exact) regression.
In active development of a new feature, tests for real-world(ish) cases
are most important. This is (as ekr points out) that the first order of
things is "don't break existing functionality". Testing for that on each
checkin is far more important than checking if we regressed a minimized
testcase from a specific bug.
Once we have the main functionality tests in place (both "sunny day" and
"normal failure" cases), then more-specific regression tests begin to
rise in importance, though they still should usually be as broad a test
as possible that captures the failure instead of all minimized
testcases. The reason is that for it to have value for the future, it
should not just capture someone undoing the bugfix, but should also
capture any similarly-provoked failures as much as is possible, since
most future errors will not precisely duplicate an old bug, but will
arise from similar stress on the component.
An example might be a bug triggered by having >10 DataChannels open.
The minimized crashtest testcase might stop there, and it *would*
prevent the specific bug from being regressed. But if some new checkin
caused >16 DataChannels to fail, this crashtest wouldn't catch it, so a
good (IMO) test would create an unreasonable number of DataChannels, do
something basic with them to verify they all work, and then return
success. Note that this ends up being a Mochitest, not a crashtest!
So, don't think "this was a crash bug, it needs a crashtest". I would
use crashtests only where there's no reasonable way to report success or
failure. For example, if we changed an API, a crashtest would happily
stay green, while as a mochitest it would flag that we broke the test.
This does mean almost all tests should be mochitests. That's not a bad
thing, it's a good thing! But the bigger takeaway is that tests should
start with functionality (as the feature is useless without that), then
add general stressors (create/destroy fast, create a lot, evil
arrangements, obvious edge cases, etc), then (especially as code matures
and stabilizes) add very specific tests. If you focus on the very
specific tests too early, you end up with a ton of those many of which
no longer are relevant due to architectural changes and refactors (and
may spend a lot of time revising them when an API is still in flux!)
This isn't an absolute, and it doesn't mean that a crash bug doesn't get
morphed into a (mochi)test - many still will, but again those should be
broader tests wherever possible, or it should be worked into an existing
broad test.
That's another point: a few broad tests are a lot simpler to revise when
we have an unstable API than a thousand specific-instance testcases.
Given the W3C (and IETF) are still changing the API, we should be
cautious about this. If a large, generic sunny-day test that hits much
of the API fails, that's generally of (much) more use to me today the
the same set of features being tested in 100 small separate tests, if
for no other reason than the maintenance effect (and there are other
reasons, per above.)
> Analysis follows:
> Currently, the state of the code can be best described as brittle. The
> sunny day cases work and work fairly reliably, but as you deviate from
> them, you start to get crashes. This isn't surprising; it's a fairly normal
> state of affairs for software that was developed in a hurry with an
> emphasis on features rather than testing, which is what this is.
>
> There are two major classes of problems in the code:
>
> - Functional defects (stuff which should work and does not).
> - Crashes (i.e., memory errors, whether security relevant or not.)
> - Memory and other resource leaks
>
> There are a lot of all of these, and a number of them require fairly
> significant refactoring to fix. The result of this is that:
>
> 1. It's not going to be possible to *run* automated unit tests for some
> time (this is partly due to deficiencies in the testing framework, but you
> go to the release with the testing framework you have, not the one you
> wish you had.)
>
> 2. It's quite common to have multiple reports of problems due to the same
> underlying defect. Even when that's not true, we're often going to
> accidentally
> fix one defect while working on another.
Not just accidentally, but often on purpose as we notice a logical hole
that can or has caused a number of similar failures. A perfect example
is the current "big lock" patch, which causes numerous semi-random
failures related to timing.
> 3. It's quite easy to break one piece of functionality while refactoring
> another.
> We've already had a number of instances on this.
Yes; that's my number one concern for tests at this point. We're
checking in code with the only true cross-check being ourselves and
other devs running manual tests and using the feature. While doing that
is good, tests (automated or run-by-hand-but-scripted) would greatly
reduce the "you broke calls" problem.
> Rather, what is useful is to develop unit tests for the stuff that does
> work so
> that when we are fixing one thing we don't accidentally break another.I
> realize
> that these cannot currently be run automatically on m-c, but they can be run
> either (a) on a developer's own system or (b) on alder under the more
> reslient
> test framework Ted is hacking up. It already takes far too long to validate
> manually
> that your changes haven't broken stuff and we don't have anything like full
> coverage in JS tests of the functionality that works.
>
> If we want people not to break the system, it needs to be reasonably easy
> to verify that one has not done so.
Agreed.
> The nice thing about this level of test is that it's easy to develop.
> Indeed, we
> already have a number of tests that test one variant manually. All we need
> is someone to stamp out a bunch of tests that (a) are automatic and (b)
> cover more of the space. If you find bugs doing that, then great. Otherwise,
> we have a regression harness and you can move on to crashtests.
Agreed, and honestly a basic set of such functionality tests should be
fairly quick to develop and provide significant help to the project
quickly. And they don't (in my mind) have to be a lot of specific small
functionality tests. For example, I'd rather have one test that covered
all the basic call scenarios (audio-only, video-only, audio/video,
one-way audio, one-way video, etc) - because they are likely to be
easier to revise if the API changes, and if it fails, we can find out
why quickly. Not saying it has to be that way, but I might prefer that,
especially if I can build a "basic call test framework" I can then
internally automate. See things like the test_URIs xpcshell test for
the URI code where it has a framework and then runs a bazillion
test/edge cases through it, all in one "test".
--
Randell Jesup