Questions about the bulk upload and PEPs


mustafa dikmen

Apr 1, 2022, 11:22:48 AM
to iRODS-Chat
Hello,

I have some questions about the bulk upload and possible PEPs that need to be triggered:

1. I have a test folder containing several files of different sizes. `iput -rb test` always triggers `pep_api_data_obj_put_post`, which I am guessing is normal (if there is no bug), though I expected it to trigger `pep_api_bulk_data_obj_put_post`. So I think what is written here is valid only for the static `acPostProcForPut`, not for the dynamic `pep_api_data_obj_put_post`. Is this correct? That is, should we expect `iput -rb` to trigger `pep_api_data_obj_put_post` n times, where n is the number of files in the test folder?

2. Using `iput -rb`, I could trigger a bulk PEP (`pep_api_bulk_data_obj_put_post`) only for very small files (I am guessing under 1MB) inside a folder. This bulk PEP is triggered only two times regardless of how many small files there are, and for those files `pep_api_data_obj_put_post` is not triggered. However, if `iput -rb` is run on a single small file, only `pep_api_data_obj_put_post` is triggered. Under which conditions is a bulk PEP triggered? Is it normal that `iput -rb` triggers the bulk PEP only two times for more than two files?

3. In line with the observation in question 2, how can we tell when a bulk operation is being performed? As the verbose output below shows, the bulk upload seems to apply only to files of certain sizes:

-bash-4.2$ iput -brv test_KB
Running recursive pre-scan... pre-scan complete... transferring data...
C- /u0137480/home/rods/test_KB:
   5MB                             5.000 MB | 2.378 sec | 0 thr |  2.103 MB/s
C- /u0137480/home/rods/test_KB:
Bulk upload 5 files.
   test100k.db                     0.488 MB | 1.919 sec | 0 thr |  0.254 MB/s

4. `irsync -r test i:test` never triggers any bulk PEPs. So can we say that for small files `iput -rb` and `irsync` use different APIs? Is there any correlation between iput -rv and irsync -r for small files?

5. For many small files inside a source folder that needs to be synced to iRODS, `pep_api_data_obj_put_post` is triggered once for every file that needs to be synced. Does this create any performance issue?

Answers to these questions will help us understand an issue we recently experienced: once we enabled a Python rule that includes the `pep_api_data_obj_put_post` PEP, we observed that iput -rv and irsync -r operations got stuck and did nothing, so recursive uploads failed (of course, there might be other causes).

Thanks in advance.

Best Regards,
Mustafa

Alan King

Apr 4, 2022, 12:37:32 PM
to irod...@googlegroups.com
Hi,

1. Your assessment seems correct. acBulkPutPostProcPolicy will only be triggered on bulk puts, and the microservice called therein will only cause the static PEP acPostProcForPut to be called (as seen here: https://github.com/irods/irods/blob/a86b0ed40bcb2bd23cf526a52244286b78285468/server/api/src/rsBulkDataObjPut.cpp#L988).

2/3. Unfortunately, I don't think the bulk upload behavior is very well understood by anybody at this point. We have had users bring up this situation in the past and it is still something we are considering.

However, I did some sniffing around and found a few things to say.

First, if iput targets a single file, the recursive and bulk options have no effect. rcDataObjPut is invoked, which triggers policy for signatures like pep_api_data_obj_put_* in a default configuration.

If iput targets a directory, there are two distinct steps:
    a. First, a "dir put" occurs. It starts in a special bulk mode called BULK_OPR_LARGE_FILES.
           https://github.com/irods/irods/blob/eb361f5e49e817ff498dbb417eb935030e5b2322/lib/core/src/putUtil.cpp#L936
       This will basically skip all of the files under a certain size (hard-coded to 4MB)
           https://github.com/irods/irods/blob/eb361f5e49e817ff498dbb417eb935030e5b2322/lib/core/include/irods/rodsDef.h#L120
       For the files which meet the size threshold, a "normal put" happens *for each file*
           https://github.com/irods/irods/blob/eb361f5e49e817ff498dbb417eb935030e5b2322/lib/core/src/putUtil.cpp#L873
           https://github.com/irods/irods/blob/eb361f5e49e817ff498dbb417eb935030e5b2322/lib/core/src/putUtil.cpp#L434

    b. Next, the directory is scanned in a special bulk mode called BULK_OPR_SMALL_FILES
           https://github.com/irods/irods/blob/eb361f5e49e817ff498dbb417eb935030e5b2322/lib/core/src/putUtil.cpp#L949
       This is the bulk upload behavior that one expects to see. The contents of the directory are traversed and collected into a tar file, which is then put here via the bulk API:
           https://github.com/irods/irods/blob/eb361f5e49e817ff498dbb417eb935030e5b2322/lib/core/src/putUtil.cpp#L1080-L1084
           https://github.com/irods/irods/blob/eb361f5e49e817ff498dbb417eb935030e5b2322/lib/core/src/putUtil.cpp#L1114

Hopefully this dissection helps with understanding the behavior you're seeing.
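
To make the two passes concrete, here is a minimal Python sketch of how a directory's files might be split between per-file puts and the bulk bundle. It is a simulation only and does not call any iRODS API. The names `plan_directory_put`, `PutPlan`, and `BULK_THRESHOLD_BYTES` are hypothetical, and two details are assumptions: that "4MB" means 4 * 1024 * 1024 bytes, and that a file exactly at the threshold counts as large. The sketch also models a single bundle, whereas the real client may send more than one bulk request.

```python
from dataclasses import dataclass

# Assumption: "4MB" is taken as 4 * 1024 * 1024 bytes, and a file exactly at
# the threshold is treated as large. The thread does not specify either detail.
BULK_THRESHOLD_BYTES = 4 * 1024 * 1024


@dataclass
class PutPlan:
    normal_puts: list  # pass (a): one normal put per file
    bulk_bundle: list  # pass (b): files collected into one bundled upload


def plan_directory_put(file_sizes):
    """Split a {name: size_in_bytes} mapping into the two passes described above."""
    normal_puts = sorted(
        name for name, size in file_sizes.items() if size >= BULK_THRESHOLD_BYTES
    )
    bulk_bundle = sorted(
        name for name, size in file_sizes.items() if size < BULK_THRESHOLD_BYTES
    )
    return PutPlan(normal_puts, bulk_bundle)


if __name__ == "__main__":
    plan = plan_directory_put(
        {"5MB": 5 * 1024 * 1024, "test100k.db": 500_000, "a.txt": 10}
    )
    print("normal puts:", plan.normal_puts)
    print("bulk bundle:", plan.bulk_bundle)
```

Under these assumptions, only the files in `normal_puts` would trigger `pep_api_data_obj_put_*`, while the files in `bulk_bundle` would go through the bulk API instead.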

4. irsync is a different operation and uses a separate API. As far as I know, there is no correlation between these two and irsync does not support bulk uploads.

5. I would not consider this a performance issue if invoking the PEPs for each data object is a requirement of your data management policy. The bulk put functionality was implemented as a way of improving performance at the cost of granular policy invocation.
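
Since the PEP fires once per data object, it helps to keep its body cheap. Below is a minimal sketch of such a PEP for the Python rule engine plugin, assuming the plugin is configured and loads this module as its rulebase. It only writes a line to the server log. `FakeCallback` is a hypothetical stand-in, not part of iRODS, included so the function can be exercised outside a server.

```python
# Sketch of a rulebase module for the iRODS Python rule engine plugin.
# Assumption: the plugin is installed and configured to load this module.


def pep_api_data_obj_put_post(rule_args, callback, rei):
    # Runs once per normal put, so avoid slow or blocking work here.
    callback.writeLine("serverLog", "pep_api_data_obj_put_post fired")


class FakeCallback:
    """Hypothetical stand-in for the server-provided callback object."""

    def __init__(self):
        self.lines = []

    def writeLine(self, target, message):
        # Record the call instead of writing to a real server log.
        self.lines.append((target, message))


if __name__ == "__main__":
    cb = FakeCallback()
    pep_api_data_obj_put_post([], cb, None)
    print(cb.lines)
```

If uploads hang after enabling a rule like this, temporarily reducing the PEP to a single log line is one way to check whether the rule body or something else is responsible.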

Finally, I would like to suggest a potential alternative solution... You could try to use a tool like automated ingest (https://github.com/irods/irods_capability_automated_ingest/) to upload files in large quantities and the vanilla rcDataObjPut endpoint would be hit every time on each upload.

Hopefully that helps :)

Alan

--
Alan King
Senior Software Developer | iRODS Consortium