Re: Refactor serving_dirs. (issue 838253005 by aiolos@chromium.org)

aio...@chromium.org

unread,

Mar 3, 2015, 5:10:47 PM3/3/15

to d...@chromium.org, sull...@chromium.org, nedn...@google.com, chris...@google.com, to...@chromium.org, chromium...@chromium.org, telemet...@chromium.org

I changed the file paths to work across all os. Anyone want to take another
look
before I submit?

https://codereview.chromium.org/838253005/

nedn...@google.com

unread,

Mar 3, 2015, 5:20:13 PM3/3/15

to aio...@chromium.org, d...@chromium.org, sull...@chromium.org, chris...@google.com, to...@chromium.org, chromium...@chromium.org, telemet...@chromium.org

https://codereview.chromium.org/838253005/diff/80001/tools/telemetry/telemetry/util/cloud_storage.py
File tools/telemetry/telemetry/util/cloud_storage.py (right):

https://codereview.chromium.org/838253005/diff/80001/tools/telemetry/telemetry/util/cloud_storage.py#newcode253
tools/telemetry/telemetry/util/cloud_storage.py:253: if
os.path.dirname(directory) == directory: # If in the root directory.
This will always return True if directory is a folder directory. e.g:
os.path.dirname('/tmp/foo/') == '/tmp/foo'

https://codereview.chromium.org/838253005/

aio...@chromium.org

unread,

Mar 3, 2015, 6:16:37 PM3/3/15

to d...@chromium.org, sull...@chromium.org, nedn...@google.com, chris...@google.com, to...@chromium.org, chromium...@chromium.org, telemet...@chromium.org

https://codereview.chromium.org/838253005/diff/80001/tools/telemetry/telemetry/util/cloud_storage.py
File tools/telemetry/telemetry/util/cloud_storage.py (right):

https://codereview.chromium.org/838253005/diff/80001/tools/telemetry/telemetry/util/cloud_storage.py#newcode253
tools/telemetry/telemetry/util/cloud_storage.py:253: if
os.path.dirname(directory) == directory: # If in the root directory.

On 2015/03/03 22:20:13, nednguyen wrote:
> This will always return True if directory is a folder directory. e.g:
> os.path.dirname('/tmp/foo/') == '/tmp/foo'

Good catch. Done.

https://codereview.chromium.org/838253005/

nedn...@google.com

unread,

Mar 4, 2015, 10:58:24 AM3/4/15

to aio...@chromium.org, d...@chromium.org, sull...@chromium.org, chris...@google.com, to...@chromium.org, chromium...@chromium.org, telemet...@chromium.org

lgtm again

https://codereview.chromium.org/838253005/

dpr...@chromium.org

unread,

Mar 5, 2015, 11:03:58 PM3/5/15

to aio...@chromium.org, d...@chromium.org, sull...@chromium.org, nedn...@google.com, chris...@google.com, to...@chromium.org, chromium...@chromium.org, telemet...@chromium.org

lgtm

https://codereview.chromium.org/838253005/diff/90001/tools/telemetry/telemetry/util/cloud_storage.py
File tools/telemetry/telemetry/util/cloud_storage.py (right):

https://codereview.chromium.org/838253005/diff/90001/tools/telemetry/telemetry/util/cloud_storage.py#newcode264
tools/telemetry/telemetry/util/cloud_storage.py:264:
GetIfChanged(path_name, bucket)
Is it possible that you might call into this from multiple unit tests
running at the same time? If that happened, windows might have multiple
processes try to write to the same file, and bad things would happen.

https://codereview.chromium.org/838253005/

dpr...@chromium.org

unread,

Mar 5, 2015, 11:05:55 PM3/5/15

to aio...@chromium.org, d...@chromium.org, sull...@chromium.org, nedn...@google.com, chris...@google.com, to...@chromium.org, chromium...@chromium.org, telemet...@chromium.org

On 2015/03/06 04:03:58, Dirk Pranke wrote:
> lgtm

er, I didn't mean to actually lgtm this :).

https://codereview.chromium.org/838253005/

nedn...@google.com

unread,

Mar 5, 2015, 11:54:37 PM3/5/15

to aio...@chromium.org, d...@chromium.org, sull...@chromium.org, chris...@google.com, dpr...@chromium.org, to...@chromium.org, chromium...@chromium.org, telemet...@chromium.org

https://codereview.chromium.org/838253005/diff/90001/tools/telemetry/telemetry/util/cloud_storage.py
File tools/telemetry/telemetry/util/cloud_storage.py (right):

https://codereview.chromium.org/838253005/diff/90001/tools/telemetry/telemetry/util/cloud_storage.py#newcode264
tools/telemetry/telemetry/util/cloud_storage.py:264:
GetIfChanged(path_name, bucket)

On 2015/03/06 04:03:58, Dirk Pranke wrote:

> Is it possible that you might call into this from multiple unit tests
running at
> the same time? If that happened, windows might have multiple processes
try to
> write to the same file, and bad things would happen.

Does typ provide a way to synchronize the function calls?

https://codereview.chromium.org/838253005/

dpr...@chromium.org

unread,

Mar 6, 2015, 1:51:34 PM3/6/15

to aio...@chromium.org, d...@chromium.org, sull...@chromium.org, nedn...@google.com, chris...@google.com, to...@chromium.org, chromium...@chromium.org, telemet...@chromium.org

On 2015/03/06 04:54:36, nednguyen wrote:
> Does typ provide a way to synchronize the function calls?

Not generally, no. Tests are run in multiple independent processes,
so this may not be straightforward to do. In addition, we should
be writing tests so that they can be run concurrently, so I wouldn't
want to encourage people to synchronize things.

That said, you can decorate the tests themselves to indicate that
they need to be run in isolation, in which case they will be run at
the end of the test run, after the parallel phase is complete.

But, annotating tests isn't the same as annotating functions the
tests call, so I'm not sure if any of those things really helps here.

If the problem is a concurrency one, we'll need to look more closely
at what this patch is trying to do and figure out an alternative.

https://codereview.chromium.org/838253005/

nedn...@google.com

unread,

Mar 6, 2015, 2:02:41 PM3/6/15

to aio...@chromium.org, d...@chromium.org, sull...@chromium.org, chris...@google.com, dpr...@chromium.org, to...@chromium.org, chromium...@chromium.org, telemet...@chromium.org

I agree that this is the best approach. However, to alleviate the pain of
debugging concurrency issue, is there a way for us to check which methods
got
called in parallel? I think the reality is there is a significant portion of
telemetry are global objects/methods. Every once in a while, someone will
add a
test on those global entities without knowing the context of all other tests
that also use them. The solution maybe mock out the global object, maybe
specifying the test to be in isolation, etc. but figuring out the culprit
is the
key in fixing this.

Also looking at
http://build.chromium.org/p/tryserver.chromium.win/builders/win_chromium_x64_rel_ng/builds/33751/steps/telemetry_unittests%20%28with%20patch%29/logs/stdio,
it's not very clear to me which tests are failling here.

https://codereview.chromium.org/838253005/

aio...@chromium.org

unread,

Mar 6, 2015, 2:10:52 PM3/6/15

to d...@chromium.org, sull...@chromium.org, nedn...@google.com, chris...@google.com, dpr...@chromium.org, to...@chromium.org, chromium...@chromium.org, telemet...@chromium.org

On 2015/03/06 18:51:34, Dirk Pranke wrote:
> On 2015/03/06 04:54:36, nednguyen wrote:
> > Does typ provide a way to synchronize the function calls?

> Not generally, no. Tests are run in multiple independent processes,
> so this may not be straightforward to do. In addition, we should
> be writing tests so that they can be run concurrently, so I wouldn't
> want to encourage people to synchronize things.

> That said, you can decorate the tests themselves to indicate that
> they need to be run in isolation, in which case they will be run at
> the end of the test run, after the parallel phase is complete.

> But, annotating tests isn't the same as annotating functions the
> tests call, so I'm not sure if any of those things really helps here.

> If the problem is a concurrency one, we'll need to look more closely
> at what this patch is trying to do and figure out an alternative.

This patch is consolidating logic that is already happening. If we're doing
it
right (which isn't quite the case) we should never hit this code (or
GetIfChanged or any other real cloud_storage functions) in a unittest. That
should only happen when actually running a benchmark. I recently replaced
the
cloud_storage usage in archive_info_unittest since I was able to find it on
my
Mac due to a failing test. But I con't actually tell which test is failing
on
Win. It may be worthwhile to audit all of the unittests to make sure they
are
using the stub instead of the real cloud_storage. That being said, having an
error that can't be tied to a specific test is still an issue.

The part that I don't understand is why would we not hit this on any
non-Windows
platforms, but consistently on Win? Are the tests run concurrently on
Windows
and not the other platforms? Or are they scheduled in different orders? I
don't
know enough about Windows to know other causes.

https://codereview.chromium.org/838253005/

dpr...@chromium.org

unread,

Mar 6, 2015, 3:17:17 PM3/6/15

to aio...@chromium.org, d...@chromium.org, sull...@chromium.org, nedn...@google.com, chris...@google.com, to...@chromium.org, chromium...@chromium.org, telemet...@chromium.org

On 2015/03/06 19:02:41, nednguyen wrote:
> On 2015/03/06 18:51:34, Dirk Pranke wrote:
> > On 2015/03/06 04:54:36, nednguyen wrote:
> > > Does typ provide a way to synchronize the function calls?
> >
> > Not generally, no. Tests are run in multiple independent processes,
> > so this may not be straightforward to do. In addition, we should
> > be writing tests so that they can be run concurrently, so I wouldn't
> > want to encourage people to synchronize things.
> >
> > That said, you can decorate the tests themselves to indicate that
> > they need to be run in isolation, in which case they will be run at
> > the end of the test run, after the parallel phase is complete.
> >
> > But, annotating tests isn't the same as annotating functions the
> > tests call, so I'm not sure if any of those things really helps here.
> >
> > If the problem is a concurrency one, we'll need to look more closely
> > at what this patch is trying to do and figure out an alternative.

> I agree that this is the best approach. However, to alleviate the pain of
> debugging concurrency issue, is there a way for us to check which methods
> got
> called in parallel?

I think every test in telemetry is run in parallel today; I don't see any
users
of
the @decorators.Isolated call. I know when we were first switching telemetry
to use typ and run tests in parallel I made several fixes to make sure
things
work
in parallel, and tonyg was of the opinion that any test that couldn't be run
in parallel with other tests should just be disabled, but I don't think we
needed
it in practice.

> I think the reality is there is a significant portion of
> telemetry are global objects/methods. Every once in a while, someone will
> add
a
> test on those global entities without knowing the context of all other
> tests
> that also use them. The solution maybe mock out the global object, maybe
> specifying the test to be in isolation, etc. but figuring out the culprit
> is
the
> key in fixing this.

Yes of course. However, we avoid many of the potential issues by running
tests
in different processes, so in a given python process only one test is
running at
a
time. So, the only time you tend to hit things in practice is when
concurrently
modifying the file system (or things like that).

> Also looking at

http://build.chromium.org/p/tryserver.chromium.win/builders/win_chromium_x64_rel_ng/builds/33751/steps/telemetry_unittests%20%28with%20patch%29/logs/stdio,
> it's not very clear to me which tests are failling here.

Yup, that's the bug I need to keep hunting down. typ should never fail in
this
way.

https://codereview.chromium.org/838253005/

dpr...@chromium.org

unread,

Mar 6, 2015, 3:19:45 PM3/6/15

to aio...@chromium.org, d...@chromium.org, sull...@chromium.org, nedn...@google.com, chris...@google.com, to...@chromium.org, chromium...@chromium.org, telemet...@chromium.org

On 2015/03/06 19:10:52, aiolos wrote:
> On 2015/03/06 18:51:34, Dirk Pranke wrote:
> > On 2015/03/06 04:54:36, nednguyen wrote:
> > > Does typ provide a way to synchronize the function calls?
> >
> > Not generally, no. Tests are run in multiple independent processes,
> > so this may not be straightforward to do. In addition, we should
> > be writing tests so that they can be run concurrently, so I wouldn't
> > want to encourage people to synchronize things.
> >
> > That said, you can decorate the tests themselves to indicate that
> > they need to be run in isolation, in which case they will be run at
> > the end of the test run, after the parallel phase is complete.
> >
> > But, annotating tests isn't the same as annotating functions the
> > tests call, so I'm not sure if any of those things really helps here.
> >
> > If the problem is a concurrency one, we'll need to look more closely
> > at what this patch is trying to do and figure out an alternative.

> This patch is consolidating logic that is already happening. If we're
> doing it
> right (which isn't quite the case) we should never hit this code (or
> GetIfChanged or any other real cloud_storage functions) in a unittest.
> That
> should only happen when actually running a benchmark.

Okay.

Like I said in my original note, I wasn't sure that there was something
wrong
with what you were doing, it just looked suspicious at a glance.

> I recently replaced the
> cloud_storage usage in archive_info_unittest since I was able to find it
> on my
> Mac due to a failing test. But I con't actually tell which test is
> failing on
> Win. It may be worthwhile to audit all of the unittests to make sure they
> are
> using the stub instead of the real cloud_storage. That being said, having
> an
> error that can't be tied to a specific test is still an issue.

Yes, definitely!

> The part that I don't understand is why would we not hit this on any
non-Windows
> platforms, but consistently on Win? Are the tests run concurrently on
> Windows
> and not the other platforms? Or are they scheduled in different orders? I
don't
> know enough about Windows to know other causes.

The tests are run concurrently on every platform, but the timing and
scheduling
can certainly be different.

Windows is much slower to launch subprocesses, the filesystem is much
slower,
and it is much more sensitive to conflicts over open files.

https://codereview.chromium.org/838253005/

dpr...@chromium.org

unread,

Mar 6, 2015, 3:48:22 PM3/6/15

to aio...@chromium.org, d...@chromium.org, sull...@chromium.org, nedn...@google.com, chris...@google.com, to...@chromium.org, chromium...@chromium.org, telemet...@chromium.org

In case it wasn't obvious before, perhaps it bears repeating:

1) I'm not sure there's something wrong w/ this patch, it was a guess

2) There's definitely a problem in typ that I need to fix; you should never
see
the error you're seeing.

I still don't know the root cause and am investigating ...

https://codereview.chromium.org/838253005/

nedn...@google.com

unread,

Mar 6, 2015, 3:57:20 PM3/6/15

to aio...@chromium.org, d...@chromium.org, sull...@chromium.org, chris...@google.com, dpr...@chromium.org, to...@chromium.org, chromium...@chromium.org, telemet...@chromium.org

On 2015/03/06 20:17:17, Dirk Pranke wrote:

> On 2015/03/06 19:02:41, nednguyen wrote:
> > On 2015/03/06 18:51:34, Dirk Pranke wrote:
> > > On 2015/03/06 04:54:36, nednguyen wrote:
> > > > Does typ provide a way to synchronize the function calls?
> > >
> > > Not generally, no. Tests are run in multiple independent processes,
> > > so this may not be straightforward to do. In addition, we should
> > > be writing tests so that they can be run concurrently, so I wouldn't
> > > want to encourage people to synchronize things.
> > >
> > > That said, you can decorate the tests themselves to indicate that
> > > they need to be run in isolation, in which case they will be run at
> > > the end of the test run, after the parallel phase is complete.
> > >
> > > But, annotating tests isn't the same as annotating functions the
> > > tests call, so I'm not sure if any of those things really helps here.
> > >
> > > If the problem is a concurrency one, we'll need to look more closely
> > > at what this patch is trying to do and figure out an alternative.
> >

> > I agree that this is the best approach. However, to alleviate the pain
> of
> > debugging concurrency issue, is there a way for us to check which
> methods
got
> > called in parallel?

> I think every test in telemetry is run in parallel today; I don't see any
users
> of
> the @decorators.Isolated call.

Hi, by checking which methods got called in parallel, I meant s.t like:
when text X fails, then typ tells us that it X fails when test A, B are
running.
This will help us nail down the cause of bug much easier.

dpr...@chromium.org

unread,

Mar 6, 2015, 4:13:24 PM3/6/15

to aio...@chromium.org, d...@chromium.org, sull...@chromium.org, nedn...@google.com, chris...@google.com, to...@chromium.org, chromium...@chromium.org, telemet...@chromium.org

On 2015/03/06 20:57:20, nednguyen wrote:
> Hi, by checking which methods got called in parallel, I meant s.t like:
> when text X fails, then typ tells us that it X fails when test A, B are
running.
> This will help us nail down the cause of bug much easier.

Ah. There isn't a way of getting that directly, but the logs should tell you
roughly which tests were running.

https://codereview.chromium.org/838253005/

dpr...@chromium.org

unread,

Mar 6, 2015, 9:10:04 PM3/6/15

to aio...@chromium.org, d...@chromium.org, sull...@chromium.org, nedn...@google.com, chris...@google.com, to...@chromium.org, chromium...@chromium.org, telemet...@chromium.org

https://codereview.chromium.org/838253005/diff/90001/tools/telemetry/telemetry/util/cloud_storage_unittest.py
File tools/telemetry/telemetry/util/cloud_storage_unittest.py (right):

https://codereview.chromium.org/838253005/diff/90001/tools/telemetry/telemetry/util/cloud_storage_unittest.py#newcode36
tools/telemetry/telemetry/util/cloud_storage_unittest.py:36: stubs =
system_stub.Override(cloud_storage, ['open', 'subprocess'])
Okay, it turns out that the underlying problem is that you can't
override 'open' this way.

'open' is a built-in function, nor an imported module. So, when you
attempt to override it in system_stub, the first time you try to save
the old value, you call getattr(cloud_storage, 'open', None), and it
returns None (because open is not an attribute on cloud_storage).
system_stub.Override() saves the None away as the "original value", and
when you call stubs.Restore(), it *creates* an attribute in open that
shadows out the built-in name.

Then, down the road, when you end up calling open from inside
cloud_storage -- which happens when you call @unittest.skipUnless in
find_dependencies_unittest -- you end up trying to dereference None.

This is the bug in your CL (or, at least, the bug that your CL is
exposing).

Because the @unittest.skipUnless is evaluated when we construct the test
suite for the test, it isn't caught by the normal try/catch block we
wrap each unit test in; this is bug #1 in typ. We're probably tripping
over this now just because the ordering of tests between processes has
shifted and we happen to run the find_dependencies test after one of
these tests.

bug #2 in typ is that we're not propagating the stack trace back, making
it hard to see what the heck happened (though it would've been
nigh-impossible to figure out what happened from a log regardless).

I will fix the bugs in typ, but you will want to restructure your tests
and the code in cloud_storage as well.

https://codereview.chromium.org/838253005/

Reply all

Reply to author

Forward