DIRAC development strategy


Christophe HAEN

Oct 12, 2017, 10:41:21 AM
to diracgrid-develop
I will reproduce here a series of emails. The idea is to review the development strategy of DIRAC. These emails contain many considerations and propositions. They will all be discussed during a meeting, whose date is to be decided on the following Doodle:

Please fill it in before Wednesday, October 18th.


The aim will be to find solutions to the following issues, directly related to our development procedure (this list can be extended on demand):
* Getting to a release takes us forever.
* One feature can end up blocking EVERYTHING (example: the v6r17 -> v6r19 drama).
* Tests are currently a load on CERN Building 2 only, and this needs to change.
* The code base is getting larger and larger, and contains a lot of dead code, or code that is only there to support features deprecated 2923 versions ago.

It is expected that people come to this meeting having read the proposals below and having thought about them, so that we can discuss concrete points and come to an agreement.

I personally have a list of proposals I want to present and discuss. If you have others, you are more than welcome to present them as well, as long as they address the points above.
The Indico page for the meeting will be created as soon as we have decided on a date.


Cheers
Chris

Christophe HAEN

Oct 12, 2017, 10:42:09 AM
to diracgrid-develop
Hi guys,

Here is an extremely interesting read about how Google handles its code: https://cacm.acm.org/magazines/2016/7/204032-why-google-stores-billions-of-lines-of-code-in-a-single-repository/fulltext

It is very long, but worth the read in my opinion. I am also discussing it with a friend working on the Google search engine code, in order to clarify a few more points.
In any case, I will try to summarize it a bit and spell out what I think we should take away from it. It is not necessarily in order, so bear with me.

First of all, Google has a monolithic development model. This basically means that:
* all the code is in a single repository, for all the projects (Gmail, Calendar, the search engine, Drive, whatever... all in one);
* everyone develops only on the head;
* they do not make branches, except as snapshots of a given release for long-term support;
* they fix bugs in the head, and then cherry-pick them into the release branch.


For us, I'd say it would mean having DIRAC, the WebApp and the externals (+ bundle?) together. I am not sure we would gain much, but I have seen a PR this week that would have broken the WebApp if Zoltan had not spotted it! (kudos :-))

Speaking of externals and lcgbundle, I would just like to stress that Google uses static linking!

When they develop new features, they do it progressively. The old code and the new code live together in the same code base, and you switch from one to the other using a configuration flag (note that this is something I used a lot, for example for the move from FTS2 to FTS3, and this is exactly what is meant to be used in the two-year-old PR for the new FTS3). After a while, they remove the flag and the old code. It involves extra development work and care, but it allows them to never block a release (sounds familiar? :-)), to test in production earlier, and also to run live tests on part of their infrastructure (for example, expose a feature only in the US).
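
To make it concrete, here is a minimal sketch of what such a switch could look like for us. The option path 'DataManagement/UseNewFTS3' and the client class names are made up for illustration; this is not the actual FTS3 PR code:

from DIRAC.ConfigurationSystem.Client.Helpers.Operations import Operations

def getTransferClient():
  """Return the new FTS3 client if the flag is set, the old one otherwise."""
  # 'DataManagement/UseNewFTS3' is a made-up option path for illustration
  if Operations().getValue('DataManagement/UseNewFTS3', False):
    from DIRAC.DataManagementSystem.Client.NewFTS3Client import NewFTS3Client  # hypothetical
    return NewFTS3Client()
  from DIRAC.DataManagementSystem.Client.OldFTSClient import OldFTSClient  # hypothetical
  return OldFTSClient()

Once the new code has proven itself in production, the flag, the old client and this switch all disappear together.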

Also, and this is something I have long believed we should really be doing: if a PR has been reviewed and the tests are OK, the PR is merged, so it is available to everyone. You do not need to wait for weeks until your code is there. This is, for example, a problem Philippe keeps running into. Of course, this would mean a bit more effort on our side to review pending PRs QUICKLY, and not let them accumulate. It would also mean an effort from some people to make small and topical PRs ;-)

About reviewing PRs, they have the concept of "owners" and "developers". "Developers" are what the name says, and "owners" are normally the main developers of a given part of the source code. To some extent, we already have that: we are all developers, and Federico is owner of the TS, Philippe of the DMS, etc. When there is a PR, any developer can review the code, and an owner has to approve the changes as being meaningful/useful/etc. So we are quite close to this model.

Speaking of code cleanup, they have a team that regularly runs massive cleaning campaigns. Many things are automated (find dead code, make the appropriate fixes, split them into several PRs, and assign the appropriate reviewers). Obviously, we can't do all that. But I believe we should clean up regularly (and break compatibility at some point). I started looking around a bit, and there is a nice Python tool called Vulture that we could use for this. Obviously, you get false positives, especially in places where we do dynamic loading and things like that. But I tried it, and it does throw up interesting results! I'll maybe try to integrate it into LHCbDIRAC...
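
To give an idea, here is a small sketch of how Vulture could be driven from Python over our tree. This assumes Vulture's library API (Vulture.scavenge / get_unused_code); the path is just an example:

import vulture

v = vulture.Vulture()
v.scavenge(['DIRAC/'])  # path to the checkout; adjust as needed
for item in v.get_unused_code():
  # expect false positives for dynamically loaded agents/services
  print('%s:%s unused %s %r' % (item.filename, item.first_lineno, item.typ, item.name))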

An approach I found interesting is that, by default, developers are asked to make all their interfaces private. That forces them to think carefully about whether a method should be public or not. This makes interfaces cleaner and naturally stabilizes the API. They do have the case of some special code, stored elsewhere, which is used as "open source central libraries", and there it is "expected that the teams follow the API". I also think we should not be scared to ask the extensions to follow the API without delay.
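
In Python terms, "private by default" maps onto underscore-prefixed names plus an explicit __all__; a tiny illustration (all the names here are made up):

__all__ = ['replicateAndRegister']  # the one deliberately public entry point

def replicateAndRegister(lfn, destSE):
  """Public, stable API."""
  return _submitTransfer(lfn, destSE)

def _submitTransfer(lfn, destSE):
  """Private helper: not part of the API, free to change without notice."""
  return {'OK': True, 'Value': (lfn, destSE)}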

Google uses an amazing number of tests. And we know that we should be increasing our test coverage and automation, that's no news; I just want to stress it again. They run automated tests for each code change (i.e. in the PR), but also on the whole code base when several PRs are merged, to check that it is all consistent. The advantage of their monolithic approach is that they can actually test the whole code base. If we want to test the Externals, for example, we have no other way than releasing them, making a release of DIRAC, and hoping it works. We could of course track in the DIRAC code all the calls to these libraries, and reproduce them as some sort of unit/integration tests, but this is brain dead.
In general, I'd say that we are in any case unable to do proper tests on the head. We always have to create a tag to be able to test it, simply because of the way dirac-install works (the usual story...). We cannot "install the head". Of course, we can install the previous tag and then overwrite the code, but still, it is not the same.

One thing I am starting to wonder about, and that will certainly bite us (in fact, it already did), is how to handle mocks properly: how do you make sure that all the mocks reflect the latest API of what they are emulating? We had the problem in the past that tests passed because a mock still exposed the old API, while the real API had changed. We could think of having centralized mocks, but that is not always possible. So I am not sure how to do it. I asked my friend at Google how they do it; still waiting for an answer. I must also admit I have not yet searched for a solution on the internet.
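
For illustration, one standard Python mechanism in this direction is autospeccing: a mock built with create_autospec enforces the signatures of the real class, so a test using an outdated API fails instead of silently passing. A sketch (the return value is just an example):

from mock import create_autospec  # unittest.mock on Python 3

from DIRAC.DataManagementSystem.Client.DataManager import DataManager

# The mock gets the same surface as the real class: a call that no longer
# matches the real API fails the test instead of silently passing.
dmMock = create_autospec(DataManager, instance=True)
dmMock.removeFile.return_value = {'OK': True,
                                  'Value': {'Successful': {}, 'Failed': {}}}

dmMock.removeFile('/a/lfn.txt')   # fine: removeFile exists on DataManager
dmMock.removeFiles('/a/lfn.txt')  # AttributeError: not in the real API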

I am also trying to find out more about their release cycle, as well as how they handle DB schema changes.

All in all, I find this idea of always working on the head very interesting. I think it fits our needs quite well; it would of course require a bit of work and adaptation, but it would pay off. Personally, I am literally fed up with my PRs waiting for years, and with operations/developments being blocked because we can't release, because we are waiting for one feature or another.



Any thoughts ?

Cheers
Chris

Christophe HAEN

Oct 12, 2017, 10:42:34 AM
to diracgrid-develop
[REPLY FROM FEDERICO]

Hi,
nice post!

I will go by points:
- DIRAC is not providing a web service, so unfortunately we need to keep several branches. I would say it's OK, not a big downside.
- We may gain a bit by putting some more code inside the same repo, but, as you say, not enormously.
- I tried Vulture 2 or 3 years ago and it was pretty terrible (95% of the reports were false positives); maybe it has improved since, I don't know.
- I really like the point about being able to test the "HEAD" more easily. I am not sure how we could do that, though.
- I also like proper mocks: https://github.com/DIRACGrid/DIRAC/blob/integration/DataManagementSystem/Client/test/mock_DM.py is an example: one mock only, for a certain object, that should be used everywhere. We have too few of them.

The main problem we have ALWAYS had is that it takes us ages to move from "development" to production. How does Google manage to speed that up? They do it because they have written, and they maintain, an enormous amount of automated tests (for which they even wrote their own build system: https://bazel.build/).

"When the review is marked as complete, the tests will run; if they pass, the code will be committed to the repository without further human intervention."
so it is written in the article. Great! It is also written that:
"Each day the repository serves billions of file read requests, with approximately 800,000 queries per second during peak traffic and an average of approximately 500,000 queries per second each workday. Most of this traffic originates from Google's distributed build-and-test systems."

Where are we? Very far from that! The "certification" is a full, lengthy process that not only takes time to run, but above all takes a lot of time to write down. The reason is that we had to code ALL the tests that nobody, for many years, ever wanted to write. Can we finally speed up now? Hopefully, but out of the VERY long list of tests that we have in https://trello.com/b/lIyCB6uO/lhcbdirac-certification-template and https://trello.com/b/cp8ULOhQ/dirac-certification-template, not many are automated. Now, let's compare this to the Google approach, where you click "review OK" and a full chain of tests starts. Take one example PR, e.g. your famous PR about FTS3: what would it take to do the same? I mean, to review it OK right now, and then have tests start that can guarantee it is completely safe to be merged and to go into production. That is a full certification, including interactions with FTS3 servers, of course. It's a lot of work ahead. Which we should do, of course!

Cheers,
Federico

Christophe HAEN

Oct 12, 2017, 10:42:55 AM
to diracgrid-develop
[REPLY FROM ME]

I'll answer by points as well :-)


 - DIRAC is not providing a web service, so unfortunately we need to keep several branches. I would say it's OK, not a big downside.

I am not sure I understand the connection here. This strategy applies to (almost) all Google code, even internal. And they do have branches, but more as checkpoints, and no development is targeted at them. How would it be a problem for us if we made all our PRs against integration and backported them?

 - I tried Vulture 2 or 3 years ago and it was pretty terrible (95% of the reports were false positives); maybe it has improved since, I don't know.
I'd say it deserves another try. There has been recent development on it.

 - I really like the point about being able to test the "HEAD" more easily. I am not sure how we could do that, though.
I'd say that, as a very first step, dirac-install should take the URLs it uses as parameters. This is what I suggested in https://github.com/DIRACGrid/DIRAC/issues/3478 (for the change from AFS to EOS).
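
Just to illustrate the idea (this is not the actual dirac-install option handling; the option name and the default URL are made up):

import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--install-source',
                    default='http://example.cern.ch/dirac/tars/',  # made-up default
                    help='base URL (or local path) to fetch release tarballs from')
args = parser.parse_args()
print('Installing from %s' % args.install_source)

With that, "installing the head" becomes pointing --install-source at a local checkout or at any test location, instead of waiting for a tag.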


- I also like proper mocks: https://github.com/DIRACGrid/DIRAC/blob/integration/DataManagementSystem/Client/test/mock_DM.py is an example: one mock only, for a certain object, that should be used everywhere. We have too few of them.

TBH, I don't think this mock is very useful. It is extremely targeted, and it will only work if I call my LFNs pippo, /a/lfn/at/SE1.txt or /a/lfn/at/SE2.txt.
I started a series of slightly more advanced mocks where the behaviour is based on the LFNs. Basically, part of the LFN is used to decide what will happen to it.
For example, for /lhcb/removal/error_removeFile/failRemove.txt, the "error_removeFile" part tells the mocked DataManager that the presence of this LFN should trigger S_ERROR when calling removeFile. I agree it is already an "advanced" mock, but I think we could do it for a few entities. This gives complete freedom in choosing LFNs, makes it very flexible, AND you immediately know what will happen to an LFN just by reading it. I don't know what will happen to "pippo" :-)
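
A minimal sketch of what I mean (the helper name is mine; the return structures follow the usual S_OK/S_ERROR conventions):

from DIRAC import S_OK, S_ERROR

def mockRemoveFile(lfns):
  """Mimic DataManager.removeFile, driven by markers embedded in the LFN."""
  if isinstance(lfns, basestring):  # Python 2, as DIRAC was at the time
    lfns = [lfns]
  successful = {}
  for lfn in lfns:
    if 'error_removeFile' in lfn:
      return S_ERROR('removeFile failed, as requested by %s' % lfn)
    successful[lfn] = True
  return S_OK({'Successful': successful, 'Failed': {}})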


About the automation of our testing:
I agree with you. I think everyone does. There clearly is work needed to automate all that. Now, to be fair, I'd say we are on a good track: unit tests are ran automatically for DIRAC and LHCbDIRAC now. The integration tests are ALMOST automated: it is one click in jenkins, but we are blocked by the fact that we cannot test the head (back to the original problem, which I think should be the first to be addressed now).As for the automation of what is on Trello, I don't know all the tests, but for sure all the client installation can be automated. Also, yes there are zillions of automated tests in Google, but they still have certifiers: our system tests. What is key point here I think is that the old code and the new code are switched on and off by flags. And to answer your question about FTS3: unit tests are there, Integration tests are there, the code is only enabled by a configuration flag, but yes, it would need system tests.

About merging a PR as soon as it has been reviewed OK and the tests have passed: I don't see how it would in fact be different from what we have now in terms of "security". There are no extra checks done now anyway. The PR is just sitting there, waiting for someone to click the merge button. And this has only negative consequences, for example code getting desynchronized because one developer did not take into account that, 17 weeks from now, another PR touching the same file will be merged. Also, all our unit tests are now failing in LHCbDIRAC because we are waiting for your DIRAC PR to be merged (maybe you could have done it in another PR, but that's a detail here).

Can we try to get a list of actions out of this? Maybe also agreement on some practices?
I would propose:

* Any new functionality should be enabled/disabled by a flag. I think this is very important, because this way one feature will never again block a release the way v6r19 (just to give one example...) blocked us.
* Any PR which has been reviewed OK and passes the tests should be merged straight away. I do not see any "danger" here, only upsides. This however requires that PRs are topic-targeted and not expanded again and again and again ;-)
* About backward compatibility: I would go for a clear statement like "features/options declared deprecated live for two releases; at the third one, they are removed". This will help us keep a cleaner code base, and also better target the tests! (A sketch is shown after this list.)
* dirac-install should be more flexible.
* There should be some work done to automate tests, and it should not come from CERN Building 2 only.
* We could develop some standard mocks. I volunteer to write them, provided we agree on the approach I described. I am also pretty sure it will help us keep the real interfaces and the mocks in sync.
* Also, I would give developing against the head, and eventually backporting to the release branch, a go. Currently, we make integration unusable, and we make fixes AND small new features in the production branch, which we then need to port to the head. If we develop against the head, and the new big features are conditioned by flags, then we save a lot of time and complexity. Since we are at the end of life of one release, I would say it is a good time to try it out...
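
To make the deprecation statement concrete, here is a sketch of how deprecated code could be marked (this decorator is hypothetical, not something we have today; the release name is an example):

import functools
import warnings

def deprecated(removeInVersion):
  """Warn whenever the decorated function is used."""
  def decorate(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
      warnings.warn('%s is deprecated and will be removed in %s'
                    % (func.__name__, removeInVersion),
                    DeprecationWarning, stacklevel=2)
      return func(*args, **kwargs)
    return wrapper
  return decorate

@deprecated(removeInVersion='v6r21')
def getReplicasOld(lfn):
  """Old implementation, kept for two releases after deprecation."""
  return {'OK': True, 'Value': lfn}

A cleanup campaign then only needs to grep for the decorator and check the release name.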


These are my proposals. Just to reframe: I welcome any others, but the problems I really would like to see addressed are:
* releases take us forever;
* one feature can end up blocking EVERYTHING;
* we need to better automate our tests.

I think the proposals I make address all these points concretely.

Cheers
Chris

Christophe HAEN

Oct 18, 2017, 12:49:56 PM
to diracgrid-develop
People have spoken: this Friday, October 20th, at 9:30 it is (https://doodle.com/poll/849w7xpv8p3pha7f).
The Indico page (still empty) is here: https://indico.cern.ch/event/674501/
The Vidyo channel to use is the same as for the BILD meeting.

Christophe HAEN

Oct 23, 2017, 2:40:37 AM
to diracgrid-develop
The minutes of the meeting are attached to the agenda. Send me any corrections/additions you would like to see.

Thanks for participating
Chris