Hi guys,
Here is an extremely interesting read about how Google handles its code: https://cacm.acm.org/magazines/2016/7/204032-why-google-stores-billions-of-lines-of-code-in-a-single-repository/fulltext
It is very long, but worth the read in my opinion. I am also discussing
it with a friend working on the Google search engine code, in order to
clarify some more points.
In any case, I will try to summarize it a bit and point out what I think
we should take away from it. It is not necessarily in order, so bear with me.
First of all, Google has a monolithic development model. This basically means that:
* all the code for all the projects is in a single repository (Gmail, Calendar, the search engine, Drive, whatever... all in one);
* everyone develops only on the head;
* they do not make branches, except as snapshots of a given release, for long-term support;
* they do bug fixes on the head, and then cherry-pick them into the release branch.
For us, I'd say it would mean having DIRAC, the WebApp and the
externals (+ bundle?) together. I am not sure we would gain much, but I
have seen a PR this week that would have broken the WebApp if Zoltan had
not spotted it! (kudos :-) ).
Speaking of externals and lcgbundle, I would just like to stress that Google uses static linking!
When they develop new features, they do it progressively.
The old code and the new code live together in the same code base, and
you switch from one to the other using a configuration flag (note that
this is something I have used a lot, for example for
the move from FTS2 to FTS3, and this is exactly what is meant to be used
in the two-year-old PR for the new FTS3). After a while, they remove the
flag and the old code. It involves extra development work and care, but
it allows them to never block a release (sounds
familiar? :-) ), to test in production earlier, and also to do live
tests on part of their infrastructure (for example, expose a feature
only in the US).
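To make this concrete, here is a minimal sketch of the pattern as I
would write it, assuming DIRAC's gConfig helper; the option path and the
class names are made up, this is not the actual FTS code:

    # The old and the new implementation live side by side in the code
    # base; a configuration flag decides which one is used.
    from DIRAC import gConfig

    class OldTransferClient(object):
        def submit(self, transfer):
            ...  # legacy code path, removed once the flag goes away

    class NewTransferClient(object):
        def submit(self, transfer):
            ...  # new code path, rolled out progressively

    def getTransferClient():
        """Select the implementation from a CS flag."""
        if gConfig.getValue('/Operations/DataManagement/UseNewTransferClient', False):
            return NewTransferClient()
        return OldTransferClient()

Once everyone has switched, the flag and OldTransferClient simply get
deleted, and getTransferClient() becomes trivial again.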
Also, and this is something I have long believed we should really be
doing: if a PR has been reviewed and the tests are OK, the PR is merged,
so it is available to everyone. You do not need to wait for weeks until
your code is there. This is, for example,
a problem Philippe keeps running into. Of course, this would mean a bit
more effort on our side to review pending PRs QUICKLY, and not let them
accumulate. It would also mean an effort from some people to
make small and topical PRs ;-)
About reviewing PRs, they have the concept of "owners" and
"developers". "Developers" are what the name says, and "owners" are
normally the main developers of a given part of the source code. To some
extent, we already have that... we are all developers,
and Federico is owner of the TS, Philippe of the DMS, etc. When there is
a PR, any developer can review the code, and an owner has to approve the
changes as being meaningful/useful/etc. So we are quite close to this
model.
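Just to illustrate the idea (this is purely hypothetical, not a tool we
have): ownership is essentially a mapping from parts of the source tree
to the people whose approval a PR touching them needs, something like:

    # Purely illustrative: map subsystems to their owners, and work out
    # whose approval a given PR needs from the files it touches.
    OWNERS = {
        'DIRAC/TransformationSystem/': ['federico'],
        'DIRAC/DataManagementSystem/': ['philippe'],
    }

    def requiredApprovers(changedFiles):
        """Owners whose approval is needed for this set of changed files."""
        approvers = set()
        for path in changedFiles:
            for prefix, owners in OWNERS.items():
                if path.startswith(prefix):
                    approvers.update(owners)
        return approvers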
Speaking of code cleanup, they have a team that regularly runs
massive cleaning campaigns. Many things are automated (find dead code,
make the appropriate fixes, split them into several PRs, and assign the
appropriate reviewers). Obviously, we can't do that.
But I believe we should clean up regularly (and break compatibility
at some point). I started looking a bit, and there is a nice Python tool
called Vulture that we could use for that. Obviously, you get false
positives, especially in places where we have
dynamic loading and things like that. But I tried it, and it does throw
up interesting results! I'll maybe try to integrate it into LHCbDIRAC...
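For the record, a first pass can be as simple as running the vulture
command line over the tree and eyeballing the report; a tiny wrapper
(the path and the confidence threshold are just examples) could be:

    # Report dead-code candidates in the DIRAC tree with vulture.
    # Anything flagged in dynamically-loaded code is a likely false positive.
    import subprocess

    result = subprocess.run(
        ['vulture', 'DIRAC/', '--min-confidence', '80'],
        capture_output=True, text=True,
    )
    print(result.stdout)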
An approach that I found interesting is that, by default,
developers are asked to make all their interfaces private. That forces
them to think carefully about whether a method should be public or not,
which makes interfaces cleaner and naturally stabilizes
the API. They do have the case of some special code, stored somewhere
else, which is used as "open source central libraries", and there it
is "expected that the teams follow the API". I also think we should
not be scared to ask the extensions to follow
the API without delay.
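In Python terms this simply means starting everything with an
underscore and only "promoting" what we genuinely want to support; a
trivial sketch (the class and method names are made up):

    class StorageElementProxy(object):

        def getFile(self, lfn):
            """Public, supported API: extensions may rely on this signature."""
            return self._resolveReplica(lfn)

        def _resolveReplica(self, lfn):
            """Private helper: free to change or disappear without notice."""
            ...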
Google uses an amazing amount of tests. We know that we should
be increasing our test coverage and automation, that's no news; I just
want to stress it again. They run automated tests on the code change
itself (i.e. in the PR), but also on the whole code
base when several PRs are merged, to check that it is all consistent.
The advantage of their monolithic approach is that they can actually
test the whole code base. If we want to test the externals, for example,
we have no other way than releasing them, making
a release of DIRAC, and hoping it works. We could of course track down
all the calls to these libraries in the DIRAC code and reproduce them as
some sort of unit/integration tests, but this is brain dead.
In general, I'd say that we are in any case unable to do proper tests
on the head. We always have to create a tag to be able to test, simply
because of the way dirac-install works (usual story...). We cannot
"install the head". Of course, we can install
the previous tag and then overwrite the code, but still, it is not the
same.
One thing I am starting to wonder about, and that will certainly bite
us (in fact, it already did), is how to handle mocks properly: how do
you make sure that all the mocks reflect the latest API of what they are
emulating? We have had the problem in the past that
tests passed because a mock was still exposing the old API, while the
real API had changed. We could think of having centralized mocks, but
it is not always possible. So I am not sure how to do that. I asked my
friend at Google how they do it, and I am still waiting for an answer.
I must also admit that I have not yet searched for a solution to this
online.
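One partial answer I do know of on the Python side is autospeccing with
the standard mock library (the standalone mock package on Python 2
behaves the same): if the mock is built from the real object, calls
that no longer match the real signature raise TypeError, so at least a
stale mock makes the test fail instead of pass. A self-contained sketch
(the class here is a stand-in, not real DIRAC code):

    from unittest import mock

    class ReplicaManager(object):
        # Imagine the real API used to be getReplicas(lfn) and has since
        # grown a mandatory 'protocol' argument.
        def getReplicas(self, lfn, protocol):
            raise NotImplementedError("talks to the real catalog")

    def test_client_against_current_api():
        rmMock = mock.create_autospec(ReplicaManager, instance=True)
        rmMock.getReplicas.return_value = {'OK': True, 'Value': {}}

        rmMock.getReplicas('/lhcb/some/lfn', 'xroot')   # matches the real API
        try:
            rmMock.getReplicas('/lhcb/some/lfn')        # old signature
        except TypeError:
            print("stale call caught by autospec")

It does not solve the problem for mocks written completely by hand, but
it at least catches signature drift automatically.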
I am also trying to find out more about their release cycle, as well as how they handle DB schema changes.
All in all, I find this idea of always working on the head very
interesting. I think it fits our needs quite well; it would of course
require a bit of work and adaptation, but it would pay off. Personally,
I am fed up with my PRs waiting for years,
and with operations/developments being blocked because we can't make a
release because we are waiting for one feature or another.
Any thoughts ?
Cheers
Chris