How to distinguish read and download/get operations

mustafa dikmen

Oct 17, 2023, 5:10:33 AM
to iRODS-Chat
Hello,

I would like to use PEPs to distinguish "read" and "download/get" operations performed by clients other than iCommands. As we know, the iCommands (istream, iget) fire different PEPs for these.

In other words, I will decide/name an operation based on PEPs triggered. As far as I can see, the "read" and "get" operations by python-irodsclient or go-client trigger these:
pep_api_data_obj_open_post
pep_api_data_obj_read_post
pep_api_data_obj_close_post


More specifically by the PRC:
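import os
from irods.session import iRODSSession

# Assumed location of the iRODS environment file; adjust as needed.
env_file = os.path.expanduser("~/.irods/irods_environment.json")

with iRODSSession(irods_env_file=env_file) as session:
    # A download operation
    session.data_objects.get("/u0137480/home/user1/hello.txt", "hello_downloaded.txt")
    # A read operation
    obj = session.data_objects.get("/u0137480/home/user1/hello.txt")
    with obj.open('r+') as f:
        print(f.read())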

The only differences that I observed in the PEPs are:
- the "len" key in pep_api_data_obj_read_post contains different values; for a download it is larger than for a read,
- the "open_flags" key in pep_api_data_obj_open_post has a different value; it is 0 for read and 2 for download/get.

But I have no idea what these represent. Could you tell me if there is a way to distinguish these two operations?

Thanks.

Alan King

Jun 10, 2024, 11:29:27 AM
to irod...@googlegroups.com
Hi Mustafa,

Sorry for the delayed response here... Did you figure anything out here? Any updates?

I see... 3 questions. I will try to address them, but I mostly just have follow-up questions for now:

Question 1:

- the "len" key in pep_api_data_obj_read_post contains different values; for a download it is larger than for a read,

How different is the value? Is either value the correct value? How large is the file in question, or, what is the correct/expected value?

Question 2:

- the "open_flags" key in pep_api_data_obj_open_post has a different value; it is 0 for read and 2 for download/get.

I think this can be explained by the obj.open() call in your code. The open mode is "r+" which is read-write. The get() call should be an open() call in read-only mode. From the manual:

   File access mode
       Unlike the other values that can be specified in flags, the
       access mode values O_RDONLY, O_WRONLY, and O_RDWR do not specify
       individual bits.  Rather, they define the low order two bits of
       flags, and are defined respectively as 0, 1, and 2.  In other
       words, the combination O_RDONLY | O_WRONLY is a logical error,
       and certainly does not have the same meaning as O_RDWR.
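
As a rough illustration of the manual excerpt above, the access mode can be recovered from an open_flags value by masking its low-order two bits. This is only a sketch: the function name access_mode is made up for this example, and it assumes the open_flags value reported by the PEP uses the same 0/1/2 encoding the manual describes.

```python
# Access mode values as given in the manual excerpt above.
ACCESS_MODES = {0: "O_RDONLY", 1: "O_WRONLY", 2: "O_RDWR"}


def access_mode(open_flags):
    """Return the access mode name held in the low two bits of open_flags.

    The thread shows PEP values quoted as strings (e.g. "65"),
    so both str and int inputs are accepted.
    """
    mode = int(open_flags) & 0b11
    # 3 is not a valid access mode; report it rather than guess.
    return ACCESS_MODES.get(mode, "invalid")


print(access_mode(0))  # O_RDONLY
print(access_mode(2))  # O_RDWR
```

Under that assumption, a read-only open reports 0 and a read-write open such as obj.open('r+') reports 2.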

Question 3:

Could you tell me if there is a way to distinguish these two operations?

I don't think there's a way to distinguish the python-irodsclient calls because they are invoking the same API. From the perspective of the server (and POSIX, for that matter, as I understand it), this is just an open() call and there is no difference as to which client application made the call. What is the objective of differentiating the origin of an open() API invocation from the client calls?

--
Alan King
Senior Software Developer | iRODS Consortium

mustafa dikmen

Jun 11, 2024, 12:21:09 PM
to irod...@googlegroups.com
Hi Alan, thanks for looking at this.

What is the objective of differentiating the origin of an open() API invocation from the client calls?

For our auditing, we would like to know the action type. We transform raw audit messages into more meaningful output: we categorize each action and add the category as a key in the transformed JSON, so that we can generate a history for an object or a general audit report. Here are a couple of examples:

{
  "zone_name": "u0137480",
  "timestamp": 1718117003963,
  "user_name": "rods",
  "proxy_user_name": "rods",
  "client": "python-irodsclient",
  "force_flag": "",
  "pid": 1687093,
  "action": "create",
  "data_id": 10268,
  "path": "/u0137480/home/rods/hello1.txt"
}
{
  "zone_name": "u0137480",
  "timestamp": 1718117004079,
  "user_name": "rods",
  "proxy_user_name": "rods",
  "client": "python-irodsclient",
  "force_flag": "",
  "pid": 1687093,
  "action": "write",
  "data_id": "",
  "path": "/u0137480/home/rods/hello1.txt"
}
{
  "zone_name": "u0137480",
  "timestamp": 1718114900863,
  "user_name": "rods",
  "proxy_user_name": "rods",
  "client": "python-irodsclient",
  "force_flag": "",
  "pid": 1684873,
  "action": "upload",
  "data_id": 10264,
  "path": "/u0137480/home/rods/hello.txt"
}
{
  "zone_name": "u0137480",
  "timestamp": 1718117531278,
  "user_name": "rods",
  "proxy_user_name": "rods",
  "client": "mango-portal",
  "force_flag": "",
  "pid": 1687649,
  "action": "read",
  "data_id": "",
  "path": "/u0137480/home/rods/hello.txt"
}

I have not figured it out completely yet, but I have some findings:
- If there is an upload by the PRC (calling the put method), I can see the dataSize key with the correct value in pep_api_data_obj_open_post, and I can also see the selObjType key in the same PEP. However, if there is a download by the PRC (calling the get method with a second argument, the local file path), I don't see any difference from a regular read (instantiating the object to read it).
- It looks like the open_flags key in pep_api_data_obj_open_post holds different values from what is documented/coded here: https://github.com/irods/python-irodsclient/blob/main/irods/manager/data_object_manager.py#L73-L80. For example, when an object is created by the create method in the PRC, I capture "open_flags": "65", and while writing to an already existing object, I see "openType": "3" in pep_api_data_obj_open_post.
- A single large file upload triggers pep_api_data_obj_write_post many times, and I guess the same holds for reading/downloading. But I have not been able to find a correlation with the parallel upload chunks (32 MB); the PEPs appear to be triggered far more often than the number of chunks. Also, for example, reading a 6-byte data object invokes pep_api_data_obj_write_post twice. The first invocation shows the actual length of the object in the len key, but the second shows a completely different value (8192). (I am still investigating this to understand the behaviour better.)
- I was expecting to see the same @pid for all PEPs triggered by one API call (for example, reading or writing a large file), but I see different @pid values. I am guessing there is a dedicated @pid for each thread. This is a bit confusing: it seems we cannot rely on @pid for the logic needed to process raw messages.

I think there should be a way to get standard, consistent information about the actions users can take, which is why I am still investigating. The iCommands' behaviour looks more consistent in this context (I know they are a different client from the others, which invoke different APIs).

Best Regards,
Mustafa



Alan King

Jun 12, 2024, 10:54:14 AM
to irod...@googlegroups.com
Okay, thanks for that. I'll provide some comments on these observations to see if that helps...

- If there is an upload by the PRC (calling the put method), I can see the dataSize key with the correct value in pep_api_data_obj_open_post, and I can also see the selObjType key in the same PEP. However, if there is a download by the PRC (calling the get method with a second argument, the local file path), I don't see any difference from a regular read (instantiating the object to read it).

Right, I don't think we can do much about this as currently implemented. PRC uses open/write/close for uploads and open/read/close for downloads.
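
To make that limitation concrete, here is a minimal sketch of labelling one open/close window by the PEP names fired inside it. The function name classify_window and the labels are hypothetical and not part of iRODS or the PRC. The point is only that a PRC download and a plain read produce the same sequence, so they end up with the same label.

```python
def classify_window(pep_names):
    """Label one open()/close() window by the data PEPs seen inside it.

    pep_names: iterable of PEP names in the order they were fired.
    The labels are hypothetical and chosen for this example only.
    """
    seen = set(pep_names)
    wrote = "pep_api_data_obj_write_post" in seen
    read = "pep_api_data_obj_read_post" in seen
    if wrote and read:
        return "read-write"
    if wrote:
        return "write"
    if read:
        return "read"
    return "open-only"


# PEPs reported earlier in the thread for a PRC get() to a local file.
prc_download = [
    "pep_api_data_obj_open_post",
    "pep_api_data_obj_read_post",
    "pep_api_data_obj_close_post",
]
# The same PEPs fire when the object is opened and read in place.
prc_read = list(prc_download)

print(classify_window(prc_download), classify_window(prc_read))
```

Both calls return the same label, which is why the PEP sequence alone cannot separate "download" from "read" for this client.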

- It looks like the open_flags key in pep_api_data_obj_open_post holds different values from what is documented/coded here: https://github.com/irods/python-irodsclient/blob/main/irods/manager/data_object_manager.py#L73-L80. For example, when an object is created by the create method in the PRC, I capture "open_flags": "65", and while writing to an already existing object, I see "openType": "3" in pep_api_data_obj_open_post.

This is expected. The values in the link you provided are a mix of open modes and open flag values which can themselves be modified by bitwise operations. Here is the map in the open() method between the open modes and open flags passed to the iRODS open API call: https://github.com/irods/python-irodsclient/blob/71d787fe1f79d81775d892c59f3e9a9f60262c78/irods/manager/data_object_manager.py#L362-L369 Notice that the open modes are being modified by the createFlag and O_TRUNC, which would affect the integer value that you are seeing.
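
As a sketch of how a value such as 65 breaks down, the snippet below splits an open_flags integer into an access mode plus modifier bits. The numeric values used for O_CREAT (64) and O_TRUNC (512) are the usual Linux ones and are an assumption here; check them against your platform and the PRC source before relying on them. The function name decode_open_flags is hypothetical.

```python
# Assumed Linux values; other platforms may define these differently.
O_ACCMODE = 0b11
ACCESS_MODES = {0: "O_RDONLY", 1: "O_WRONLY", 2: "O_RDWR"}
MODIFIERS = {64: "O_CREAT", 512: "O_TRUNC"}


def decode_open_flags(open_flags):
    """Split an open_flags value into its access mode and modifier names."""
    value = int(open_flags)
    names = [ACCESS_MODES.get(value & O_ACCMODE, "invalid")]
    rest = value & ~O_ACCMODE
    for bit, name in sorted(MODIFIERS.items()):
        if rest & bit:
            names.append(name)
            rest &= ~bit
    if rest:
        # Bits this sketch does not know about are reported in hex.
        names.append(hex(rest))
    return names


print(decode_open_flags("65"))  # ['O_WRONLY', 'O_CREAT']
```

Under these assumed values, 65 is O_WRONLY (1) combined with O_CREAT (64), which is consistent with an open mode being modified by a create flag.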

- A single large file upload triggers pep_api_data_obj_write_post many times, and I guess the same holds for reading/downloading. But I have not been able to find a correlation with the parallel upload chunks (32 MB); the PEPs appear to be triggered far more often than the number of chunks. Also, for example, reading a 6-byte data object invokes pep_api_data_obj_write_post twice. The first invocation shows the actual length of the object in the len key, but the second shows a completely different value (8192). (I am still investigating this to understand the behaviour better.)
- I was expecting to see the same @pid for all PEPs triggered by one API call (for example, reading or writing a large file), but I see different @pid values. I am guessing there is a dedicated @pid for each thread. This is a bit confusing: it seems we cannot rely on @pid for the logic needed to process raw messages.

These are both related to parallel transfer. I think the unexpected number of write calls needs a little more detail before we can say anything conclusively. This raises a very good point: we should document the expectations here a bit better. Currently, one must read the parallel transfer implementation in the PRC code to understand how it splits up the work and makes the API calls. Feel free to open an issue for this.

Yes, this is the expected behavior for PRC's implementation of parallel transfer (what we refer to as "multi-1247" parallel transfer). A set of separate client connections are made to the same server and bytes are written in parallel. Each connection has a separate agent servicing requests.

Now for the matter of associating these with a particular "action". The separate writes are coordinated in that each one should be calling open() at the start with a "replicaToken", found in the condInput of the dataObjInp. So, the open PEP could try looking for that key in the condInput of the dataObjInp and printing that in your audit for later use. Apart from that, I think that it should be possible to associate the separate writes based on the client IP, target logical path, and resource name in a given open()/close() window thanks to logical locking.
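
One way to picture this correlation is sketched below: events from different agents are grouped under a shared key, using the replica token when one was captured and falling back to client IP, logical path, and resource name otherwise. The record layout and the field names (replica_token, client_ip, path, resource, pid) are hypothetical; they stand in for whatever your transformed audit messages actually carry.

```python
from collections import defaultdict


def correlation_key(event):
    """Build a grouping key for one audit event (hypothetical field names)."""
    token = event.get("replica_token")
    if token:
        return ("token", token)
    # Fallback when no replica token was captured for this event.
    return ("window", event.get("client_ip"), event.get("path"), event.get("resource"))


def group_events(events):
    """Group audit events that appear to belong to the same transfer."""
    groups = defaultdict(list)
    for event in events:
        groups[correlation_key(event)].append(event)
    return dict(groups)


# Made-up sample events: two agents sharing a token, one unrelated read.
events = [
    {"pid": 101, "replica_token": "abc", "path": "/zone/home/u/big.bin"},
    {"pid": 102, "replica_token": "abc", "path": "/zone/home/u/big.bin"},
    {"pid": 103, "client_ip": "10.0.0.5", "path": "/zone/home/u/a.txt", "resource": "demoResc"},
]

for key, members in group_events(events).items():
    print(key, [e["pid"] for e in members])
```

The fallback key ignores the open()/close() time window, so a real implementation would also need to bound the grouping in time.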

Hope some of that helps!
