New RFC #21: "Transformation System evolution" available for comments


Luisa Arrabito

Mar 31, 2015, 10:35:48 AM
to diracgri...@googlegroups.com
Dear all,

a new RFC about the evolution of the Transformation System is available for comments:

https://github.com/DIRACGrid/DIRAC/wiki/Transformation-System-evolution

Cheers,

Luisa




Luisa Arrabito

Jun 8, 2015, 8:30:57 AM
to diracgri...@googlegroups.com
In order to gather your comments, I summarise here some private exchanges with Federico, Chris, Andrei, Johan and some open questions related to this RFC.

For reference the current dev version is under:
https://github.com/arrabito/DIRAC/tree/TSrelv6r13/TransformationSystem
and integration tests under:
https://github.com/arrabito/TestDIRAC/blob/testTestDIRACTS/Integration/TransformationSystem/TestClientTransformation.py

* About the list of methods to instrument with meta-data filters:
- addFile
- setMetadata
- addReplica

Although the need to instrument the addFile and setMetadata methods is evident, the need for the addReplica method is less so. Indeed, it makes sense only if input data queries include the location of the input data in the list of metadata. This comes down to the computing model: do we consider data at certain locations to be "different" from the same data at other locations? In LHCb the answer is "no", and for the moment also in CTA. There could also be a performance issue: if the whole list of queries is scanned every time a replica is added or removed, this extra logic may degrade the performance of the system.
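
To make the cost concrete, here is a minimal sketch of scanning every registered query on each addFile or setMetadata call. It assumes plain equality matching on a metadata dict. The names `matches` and `transformations_for_file` are hypothetical and not part of DIRAC, and the real MetaQuery supports richer conditions than equality.

```python
def matches(query, metadata):
    """True if every key of the query equals the file's metadata value."""
    return all(metadata.get(key) == value for key, value in query.items())


def transformations_for_file(queries, metadata):
    """Scan all registered queries; the cost grows linearly with their number."""
    return sorted(tid for tid, query in queries.items() if matches(query, metadata))


# Illustrative data only: transformation ID -> input data query.
queries = {
    1: {"particle": "gamma", "site": "Paranal"},
    2: {"particle": "proton"},
}
file_metadata = {"particle": "gamma", "site": "Paranal", "run": 42}
print(transformations_for_file(queries, file_metadata))  # [1]
```

If the location were part of the metadata, the same scan would have to run on every replica change as well, which is the overhead discussed above.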

* Appending/removing files to a transformation

When a change of file meta-data makes the file match the query condition of a transformation, the file is attached to the transformation. What should happen when a change of file meta-data means that a previously satisfied query condition is no longer satisfied?
Should the file be removed from the transformation? If yes, we could end up with a quite confusing situation, since some of the files attached to a transformation could remain attached while others would be removed. One could also imagine removing only the files that are unprocessed; however, this means that in the end the transformation would have processed only part of the files having the same meta-data.
To avoid this confusing situation, it may be better for a file to always remain attached to a transformation until the transformation is cleaned or the file itself is removed. This means that we consider the files attached to a transformation to be the files that matched the inputdata query at a given time.

* Implementation of meta-data filters on client or server side?

Currently it's on the client side. However, after some discussions we concluded that it would be better to move it to the server side (not necessarily into the service code, but at least server side). Indeed, if a file is added to a transformation, we want this to be logged in a single place together with the reason, rather than having to check all the jobs' logs. Especially if a message queue is used, we would want to simply dump the info into it and let the listener apply the filter.
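
A rough sketch of the message-queue idea, using the standard-library queue as a stand-in for a real message broker. All names here (`publish`, `drain`, the log format) are hypothetical. The point is only that the producer applies no filter and every attachment decision is logged in one place.

```python
import queue  # on Python 2 the module is named Queue


def publish(events, lfn, metadata):
    """Producer side: dump the raw information, apply no filter here."""
    events.put((lfn, dict(metadata)))


def drain(events, queries, log):
    """Listener side: apply every filter and record each decision in one log."""
    attached = []
    while not events.empty():
        lfn, metadata = events.get()
        for trans_id in sorted(queries):
            query = queries[trans_id]
            if all(metadata.get(key) == value for key, value in query.items()):
                log.append("%s -> transformation %d, matched %r" % (lfn, trans_id, query))
                attached.append((trans_id, lfn))
    return attached


events = queue.Queue()
log = []
publish(events, "/vo/data/run1.root", {"particle": "gamma"})
publish(events, "/vo/data/run2.root", {"particle": "proton"})
attached = drain(events, {7: {"particle": "gamma"}}, log)
print(attached)  # [(7, '/vo/data/run1.root')]
```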

* Instantiation of FileCatalog within the TransformationClient

The TransformationClient is meant to be used as a FileCatalog plugin. However, in its current implementation we instantiate a FileCatalog there, so there is something conceptually wrong. Indeed, we end up with a FileCatalog instantiating a TSCatalog instantiating a TransformationClient instantiating a FileCatalog instantiating a TSCatalog, luckily stopping there.
The reason for instantiating the FileCatalog there is that we need to call a few meta-data methods to implement the logic of the filter, essentially to update the metadict of a file.
How can we avoid this? Even moving the logic to the server side would in the end result in the same chain of instantiations.
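
One possible direction, offered only as a sketch and not as what the current code does: if the caller that already holds a catalog passes the updated metadict along with the call, the plugin no longer needs a FileCatalog of its own. Whether the real FileCatalog plugin interface allows this is an assumption, and all class and method names below are toy stand-ins.

```python
class ToyFileCatalog(object):
    """Stand-in for a FileCatalog; counts how many times it is built."""
    instances = 0

    def __init__(self, plugins=()):
        ToyFileCatalog.instances += 1
        self.plugins = list(plugins)
        self.metadata = {}

    def set_metadata(self, lfn, new_metadata):
        merged = dict(self.metadata.get(lfn, {}))
        merged.update(new_metadata)
        self.metadata[lfn] = merged
        for plugin in self.plugins:
            # The full, updated metadict travels with the call.
            plugin.set_metadata(lfn, dict(merged))


class ToyTSPlugin(object):
    """Stand-in for the TS plugin: it never builds a catalog of its own."""

    def __init__(self):
        self.seen = {}

    def set_metadata(self, lfn, metadata):
        self.seen[lfn] = metadata


plugin = ToyTSPlugin()
catalog = ToyFileCatalog([plugin])
catalog.set_metadata("/vo/f1", {"particle": "gamma"})
catalog.set_metadata("/vo/f1", {"run": 42})
print(plugin.seen["/vo/f1"] == {"particle": "gamma", "run": 42})  # True
print(ToyFileCatalog.instances)  # 1
```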

* Multiple inheritance in the TransformationClient

Currently, and also in v6r13 we have:
class TransformationClient( Client, FileCatalogueBase ):
but we only call the __init__ of Client, not that of FileCatalogueBase. Is there any reason for this?
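
For illustration, this is what happens in plain Python when only the first base class is initialised explicitly. The classes are toy stand-ins with hypothetical attributes, not the real DIRAC Client and FileCatalogueBase.

```python
class ToyClient(object):
    def __init__(self):
        self.client_ready = True


class ToyFileCatalogueBase(object):
    def __init__(self):
        self.catalog_ready = True


class OnlyFirstBase(ToyClient, ToyFileCatalogueBase):
    def __init__(self):
        ToyClient.__init__(self)  # the second base is never initialised


class BothBases(ToyClient, ToyFileCatalogueBase):
    def __init__(self):
        ToyClient.__init__(self)
        ToyFileCatalogueBase.__init__(self)


print(hasattr(OnlyFirstBase(), "catalog_ready"))  # False
print(hasattr(BothBases(), "catalog_ready"))      # True
```

So any state that the second base sets up in its __init__ is simply absent, which is harmless only as long as none of the inherited methods rely on it.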

* Add the possibility to create a transformation without any inputdataquery and to add an inputdataquery afterwards

* Missing unit-tests

Thanks in advance for your comments,

Luisa

Federico Stagni

Jun 17, 2015, 6:52:48 AM
to diracgri...@googlegroups.com
Hi Luisa,
I finally found the time to comment on this important topic.

- UserMetadata -> FileMetadata?
Names are important. Can we rename what the RFC refers to as "UserMetadata" to "FileMetadata"? It has nothing to do with the users and everything to do with the files.

- TransformationCatalog().addReplica():
I share your concerns regarding "addReplica". I would add that, as far as the transformation system is concerned, the logic for determining the destination does NOT apply to the files but to the tasks, and this logic is implemented by the TransformationPlugins used within the TransformationAgent. So, "addReplica" should just return S_OK()

- TransformationCatalog().removeFile()
This function is missing. Is that on purpose? IMHO removing a file from a Transformation doesn't mean fully removing it; rather, it means changing its status (e.g. in the LHCbDIRAC extension of the TS there is a "Removed" status in the transformation files state machine).

- Changing FileMetadata
First of all, I would say there's an unwritten rule regarding Transformation InputDataFilters: if they change, they should never become more restrictive, since this may have bad implications.
At the same time, consider the following case: file F1 has property P1 and transformation T1 processes it. If F1 loses property P1, what would happen to all the output files already created in T1 from F1? This is already post-processing.
IMHO, things should be kept as simple as possible: ideally, if F1 is still in "Unused" status in T1 at the time it loses P1, then this file may be removed (i.e. marked as "Removed"). In practice, though, I am tempted to leave it where it is and let T1 process it; it will then be part of the post-processing that makes the necessary fixes, which should happen anyway and is not tied to the transformation system itself.
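
The two options above can be sketched as follows. This is a toy table, not the TransformationDB schema: the "Unused" and "Removed" statuses come from the discussion, while the class, the method names and the `drop_unused` flag are hypothetical.

```python
class ToyTransformationFiles(object):
    """Per-transformation file table: LFN -> status."""

    def __init__(self):
        self.status = {}

    def add_file(self, lfn):
        self.status.setdefault(lfn, "Unused")

    def remove_file(self, lfn):
        """'Removing' only changes the status; the entry is kept."""
        if lfn in self.status:
            self.status[lfn] = "Removed"

    def on_filter_mismatch(self, lfn, drop_unused=False):
        """Called when a file stops matching the transformation's filter."""
        if drop_unused and self.status.get(lfn) == "Unused":
            self.remove_file(lfn)
        # Otherwise leave the file alone and defer to post-processing.


files = ToyTransformationFiles()
files.add_file("/vo/f1")
files.on_filter_mismatch("/vo/f1")                    # default: keep it
print(files.status["/vo/f1"])                         # Unused
files.on_filter_mismatch("/vo/f1", drop_unused=True)  # stricter option
print(files.status["/vo/f1"])                         # Removed
```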

- Implementation: who should be the client of whom?
I would say that the logical chain should be (for a typical addFile operation): 
  
  (Catalog ->) TransformationCatalog -> (TransformationClient ->) TransformationHandler -> MetaQuery -> TransformationDB

It is true that TransformationClient right now inherits from FileCatalogueBase but I think this is a relic that should be removed.

There are some unit tests in TransformationSystem/Agent/test and TransformationSystem/Client/test and they should all be expanded.

Cheers,
Federico



Luisa Arrabito

Aug 24, 2016, 5:33:09 AM
to diracgrid-develop
Dear Federico et al.,

I'm coming back to this topic, since I've started to work on a new version of the code based on v6r16. This new version is available here:

https://github.com/arrabito/DIRAC/tree/TSrel-v6r16

This version is similar to the previous one mentioned in this post (based on v6r13), but it implements some of your comments. Some details below:

* First of all, in v6r16 there is no longer any need to implement fake methods for catalog plugins, since each plugin declares which methods it implements. So I've removed some fake methods from the TSCatalog and implemented only the ones actually used.
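
As a rough illustration of the idea (not the actual v6r16 mechanism, whose attribute and method names I do not reproduce here), a plugin can list what it implements and the caller can skip it for everything else. `IMPLEMENTED` and `call_plugins` are hypothetical names.

```python
class ToyTSCatalog(object):
    # Hypothetical attribute: the plugin lists only what it really implements.
    IMPLEMENTED = ("addFile", "removeFile", "setMetadata")

    def addFile(self, lfn):
        return "addFile %s" % lfn

    def removeFile(self, lfn):
        return "removeFile %s" % lfn

    def setMetadata(self, lfn, metadata):
        return "setMetadata %s %r" % (lfn, metadata)


def call_plugins(plugins, method, *args):
    """Invoke `method` only on plugins that declare it; skip the others."""
    results = []
    for plugin in plugins:
        if method in plugin.IMPLEMENTED:
            results.append(getattr(plugin, method)(*args))
    return results


plugins = [ToyTSCatalog()]
print(call_plugins(plugins, "addFile", "/vo/f1"))     # ['addFile /vo/f1']
print(call_plugins(plugins, "addReplica", "/vo/f1"))  # []
```

With this kind of declaration, a method such as addReplica needs no fake implementation: the plugin is simply not called for it.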

* According to:

- TransformationCatalog().addReplica():
I share your concerns regarding "addReplica". I would add that, as far as the transformation system is concerned, the logic for determining the destination does NOT apply to the files but to the tasks, and this logic is implemented by the TransformationPlugins used within the TransformationAgent. So, "addReplica" should just return S_OK()

and in line with my previous comment, I've simply removed the addReplica method from the TSCatalog and TransformationClient.

* According to:


- TransformationCatalog().removeFile()
This function is missing. Is that on purpose? IMHO removing a file from a Transformation doesn't mean fully removing it; rather, it means changing its status (e.g. in the LHCbDIRAC extension of the TS there is a "Removed" status in the transformation files state machine).

For this method, I've kept the implementation of the original v6r16, where removing a file from a Transformation means exactly what you say, i.e. changing its status. My only change is to slightly modify the chain of calls to be consistent with the other methods implemented in the TSCatalog. So in v6r16, it was:

TransformationCatalog -> TransformationHandler -> TransformationDB

while now I've just introduced an intermediate call to the TransformationClient, i.e.:

TransformationCatalog -> (TransformationClient ->) TransformationHandler -> TransformationDB

* Changing FileMetadata


- Changing FileMetadata
First of all, I would say there's an unwritten rule regarding Transformation InputDataFilters: if they change, they should never become more restrictive, since this may have bad implications.
At the same time, consider the following case: file F1 has property P1 and transformation T1 processes it. If F1 loses property P1, what would happen to all the output files already created in T1 from F1? This is already post-processing.
IMHO, things should be kept as simple as possible: ideally, if F1 is still in "Unused" status in T1 at the time it loses P1, then this file may be removed (i.e. marked as "Removed"). In practice, though, I am tempted to leave it where it is and let T1 process it; it will then be part of the post-processing that makes the necessary fixes, which should happen anyway and is not tied to the transformation system itself.

I've not yet worked on that.

* Implementation: who should be the client of whom?

I would say that the logical chain should be (for a typical addFile operation): 
  
  (Catalog ->) TransformationCatalog -> (TransformationClient ->) TransformationHandler -> MetaQuery -> TransformationDB

It is true that TransformationClient right now inherits from FileCatalogueBase but I think this is a relic that should be removed.

For addFile and setMetadata, I've not moved the code to the server side yet (see my previous post). So the current implementation chain is:

  (Catalog ->) TransformationCatalog -> TransformationClient -> MetaQuery

but I plan to implement it this way:


  (Catalog ->) TransformationCatalog -> (TransformationClient ->) TransformationHandler -> MetaQuery -> TransformationDB

as you proposed. However, I still don't know how to solve the problem of multiple FileCatalog instantiations. See:


* Instantiation of FileCatalog within the TransformationClient

The TransformationClient is meant to be used as a FileCatalog plugin. However, in its current implementation we instantiate a FileCatalog there, so there is something conceptually wrong. Indeed, we end up with a FileCatalog instantiating a TSCatalog instantiating a TransformationClient instantiating a FileCatalog instantiating a TSCatalog, luckily stopping there.
The reason for instantiating the FileCatalog there is that we need to call a few meta-data methods to implement the logic of the filter, essentially to update the metadict of a file.
How can we avoid this? Even moving the logic to the server side would in the end result in the same chain of instantiations.

Any idea?

* Concerning the text of the RFC:


- UserMetadata -> FileMetadata?
Names are important. Can we rename what the RFC refers to as "UserMetadata" to "FileMetadata"? It has nothing to do with the users and everything to do with the files.

I've used UserMetadata to avoid confusion, since this is the name of the method in the FileCatalogClient that corresponds to what is described in the text.

* I've expanded TransformationSystem/Client/test with some unit tests and adapted the integration test to the new way of using the FileCatalog in v6r16.

So, the next steps in my plan would be:

1. Move the logic of addFile and setMetadata to the server side (in TransformationDB)

2. Work on:
- Changing FileMetadata
First of all, I would say there's an unwritten rule regarding Transformation InputDataFilters: if they change, they should never become more restrictive, since this may have bad implications.
At the same time, consider the following case: file F1 has property P1 and transformation T1 processes it. If F1 loses property P1, what would happen to all the output files already created in T1 from F1? This is already post-processing.
IMHO, things should be kept as simple as possible: ideally, if F1 is still in "Unused" status in T1 at the time it loses P1, then this file may be removed (i.e. marked as "Removed"). In practice, though, I am tempted to leave it where it is and let T1 process it; it will then be part of the post-processing that makes the necessary fixes, which should happen anyway and is not tied to the transformation system itself.

3. Add the possibility to create a transformation without any inputdataquery and to add an inputdataquery afterwards

4. Collect your suggestions on how to avoid multiple FileCatalog instantiations

Finally, I would like to know what you think about opening a PR already now, so that we can follow the review there. Also consider that the current code should already work in production (both unit and integration tests pass), while the new feature remains optional.

Indeed, the filters based on meta-data queries are used only if a filter is passed as argument to:

TransformationClient.addTransformation
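
A toy sketch of this opt-in behaviour, which also covers step 3 of the plan (adding an inputdataquery afterwards). The real signature of addTransformation is not reproduced here; the function names and the `input_meta_query` parameter are hypothetical.

```python
def add_transformation(registry, name, input_meta_query=None):
    """Register a transformation; the filter is optional."""
    registry[name] = input_meta_query


def set_input_meta_query(registry, name, input_meta_query):
    """Attach a filter to a transformation created without one."""
    registry[name] = input_meta_query


def matching_transformations(registry, metadata):
    """Only transformations that carry a filter take part in the matching."""
    return sorted(
        name for name, query in registry.items()
        if query is not None
        and all(metadata.get(key) == value for key, value in query.items())
    )


registry = {}
add_transformation(registry, "legacy")  # no filter: behaves as before
add_transformation(registry, "filtered", {"particle": "gamma"})
print(matching_transformations(registry, {"particle": "gamma"}))  # ['filtered']
set_input_meta_query(registry, "legacy", {"particle": "gamma"})
print(matching_transformations(registry, {"particle": "gamma"}))  # ['filtered', 'legacy']
```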

Thanks in advance for your comments.

Cheers,

Luisa

