Rebalance question/issue

Jan de Graaf

Mar 13, 2025, 8:50:13 AM
to iRODS-Chat
Hi,

I have a question about rebalance.

We're tiering data from operational storage to an archive and ran into an issue on the archive side. Our archive needs to store data (geo)redundantly. The first option was to let our cloud provider do the geo-replication, so iRODS would not have to. It turns out there are some issues that make this impossible, so I now need to do it in iRODS.
A complicating factor is that data has already been offloaded to the archive because of space issues on the operational storage.

So I figured I could put the existing storage resource under a replication resource and add a second resource that writes data to the other geo location.

So my tree now looks like this.

res-01-store04:passthru
└── res-01-pt04:passthru
    └── res-01-repl04:replication
        ├── res-01-repl04-pt04a:passthru
        │   └── res-01-04a:unixfilesystem
        └── res-01-repl04-pt04b:passthru
            └── res-01-04b:unixfilesystem

Here res-01-04a is the original resource, and res-01-04b was newly added together with the replication resource above it.
For new data this works fine: new files are written to both resources.
But there is an issue with the old data.
From the docs I understand that the rebalance command can solve this and replicate the data from res-01-04a to res-01-04b so they end up in sync. (Is this correct to begin with?)

If so, then the following: I tried the rebalance command for this and it fails with this error.

iadmin modresc res-01-store04 rebalance
remote addresses: 172.31.32.83 ERROR: rcGeneralAdmin failed with error -808000 CAT_NO_ROWS_FOUND

In the log file I found this line:
select DATA_NAME, COLL_NAME, DATA_MODE, DATA_RESC_ID where DATA_ID = '1334243' and DATA_RESC_ID IN ('538280','79878534')

The resc ID 538280 is res-01-04a and 79878534 is res-01-04b.
Of course this query will result in 0 rows, because the data is not yet on the 79878534/res-01-04b resource.

So should I be able to use rebalance for this? And if not, how can this be done?

Thanks!
Jan de Graaf
NKI

Log file:
 {"log_category":"legacy","log_level":"error","log_message":"iRODS Exception:
    file: /irods_source/plugins/resources/replication/src/irods_repl_rebalance.cpp
    function: (anonymous namespace)::ReplicationSourceInfo (anonymous namespace)::get_source_data_object_attributes(rsComm_t *, const rodsLong_t, const std::vector<leaf_bundle_t> &)
    line: 129
    code: -808000 (CAT_NO_ROWS_FOUND)
    message:
        rsGenQuery failed. genquery_inp contents:
maxRows: 256 continueInx: 0 rowOffset: 0
options: 0
selectInp.len: 4
    column: 403 DATA_NAME
    options: 1
    column: 501 COLL_NAME
    options: 1
    column: 421 DATA_MODE
    options: 1
    column: 423 DATA_RESC_ID
    options: 1
sqlCondInp.len: 3
    column: 401 DATA_ID
    condition: = '1334243'
    column: 423 DATA_RESC_ID
    condition: IN ('538280','79878534')
    column: 413 DATA_REPL_STATUS
    condition: = '1'



 possible iquest [select DATA_NAME, COLL_NAME, DATA_MODE, DATA_RESC_ID where DATA_ID = '1334243' and DATA_RESC_ID IN ('538280','79878534') and DATA_REPL_STATUS = '1']
        : [-]\t/irods_source/plugins/resources/src/passthru.cpp:849:irods::error passthru_file_rebalance(irods::plugin_context &) :  status [CAT_NO_ROWS_FOUND]  errno [] -- message []
stack trace:
--------------
 0# irods::stacktrace::dump() const in /lib/libirods_common.so.4.3.1
 1# irods::exception::assemble_full_display_what() const in /lib/libirods_common.so.4.3.1
 2# irods::exception::what() const in /lib/libirods_common.so.4.3.1
 3# irods::error::result() const in /lib/libirods_common.so.4.3.1
 4# irods::log(irods::error const&) in /lib/libirods_common.so.4.3.1
 5# passthru_file_rebalance(irods::plugin_context&) in /usr/lib/irods/plugins/resources/libpassthru.so
 6# std::__1::__function::__func<irods::error (*)(irods::plugin_context&), std::__1::allocator<irods::error (*)(irods::plugin_context&)>, irods::error (irods::plugin_context&)>::operator()(irods::plugin_context&) in /lib/libirods_server.so.4.3.1
 7# irods::error irods::plugin_base::call<>(RsComm*, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, boost::shared_ptr<irods::first_class_object>)::'lambda'(irods::plugin_context&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >*)::operator()(irods::plugin_context&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >*) const in /lib/libirods_server.so.4.3.1
 8# std::__1::__function::__func<irods::error irods::plugin_base::call<>(RsComm*, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, boost::shared_ptr<irods::first_class_object>)::'lambda'(irods::plugin_context&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >*), std::__1::allocator<irods::error irods::plugin_base::call<>(RsComm*, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, boost::shared_ptr<irods::first_class_object>)::'lambda'(irods::plugin_context&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >*)>, irods::error (irods::plugin_context&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >*)>::operator()(irods::plugin_context&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >*&&) in /lib/libirods_server.so.4.3.1
 9# irods::error irods::plugin_base::call<>(RsComm*, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, boost::shared_ptr<irods::first_class_object>) in /lib/libirods_server.so.4.3.1
10# passthru_file_rebalance(irods::plugin_context&) in /usr/lib/irods/plugins/resources/libpassthru.so
11# std::__1::__function::__func<irods::error (*)(irods::plugin_context&), std::__1::allocator<irods::error (*)(irods::plugin_context&)>, irods::error (irods::plugin_context&)>::operator()(irods::plugin_context&) in /lib/libirods_server.so.4.3.1
12# irods::error irods::plugin_base::call<>(RsComm*, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, boost::shared_ptr<irods::first_class_object>)::'lambda'(irods::plugin_context&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >*)::operator()(irods::plugin_context&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >*) const in /lib/libirods_server.so.4.3.1
13# std::__1::__function::__func<irods::error irods::plugin_base::call<>(RsComm*, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, boost::shared_ptr<irods::first_class_object>)::'lambda'(irods::plugin_context&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >*), std::__1::allocator<irods::error irods::plugin_base::call<>(RsComm*, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, boost::shared_ptr<irods::first_class_object>)::'lambda'(irods::plugin_context&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >*)>, irods::error (irods::plugin_context&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >*)>::operator()(irods::plugin_context&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >*&&) in /lib/libirods_server.so.4.3.1
14# irods::error irods::plugin_base::call<>(RsComm*, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, boost::shared_ptr<irods::first_class_object>) in /lib/libirods_server.so.4.3.1
15# _rsGeneralAdmin(RsComm*, GeneralAdminInput*) in /lib/libirods_server.so.4.3.1
16# rsGeneralAdmin(RsComm*, GeneralAdminInput*) in /lib/libirods_server.so.4.3.1
17# irods::api_call_adaptor<GeneralAdminInput*>::operator()(irods::plugin_context&, RsComm*, GeneralAdminInput*) in /lib/libirods_server.so.4.3.1
18# std::__1::__function::__func<irods::api_call_adaptor<GeneralAdminInput*>, std::__1::allocator<irods::api_call_adaptor<GeneralAdminInput*> >, irods::error (irods::plugin_context&, RsComm*, GeneralAdminInput*)>::operator()(irods::plugin_context&, RsComm*&&, GeneralAdminInput*&&) in /lib/libirods_server.so.4.3.1
19# int irods::api_entry::call_handler<GeneralAdminInput*>(RsComm*, GeneralAdminInput*) in /lib/libirods_server.so.4.3.1
20# rsApiHandler(RsComm*, int, BytesBuf*, BytesBuf*) in /lib/libirods_server.so.4.3.1
21# readAndProcClientMsg(RsComm*, int) in /lib/libirods_server.so.4.3.1
22# agentMain(RsComm*) in /lib/libirods_server.so.4.3.1
23# runIrodsAgentFactory(sockaddr_un) in /lib/libirods_server.so.4.3.1
24# main::$_5::operator()() const at rodsServer.cpp:?
25# main in /usr/sbin/irodsServer
26# __libc_start_main in /lib/x86_64-linux-gnu/libc.so.6
27# _start in /usr/sbin/irodsServer

","request_api_name":"GENERAL_ADMIN_AN","request_api_number":701,"request_api_version":"d","request_client_user":"rods","request_host":"172.31.32.83","request_proxy_user":"rods","request_release_version":"rods4.3.1","server_host":"p-irods-001","server_pid":3315516,"server_timestamp":"2025-03-13T10:38:21.009Z","server_type":"agent","server_zone":"nki"}
 {"log_category":"legacy","log_level":"error","log_message":"iRODS Exception:
    file: /irods_source/plugins/resources/replication/src/irods_repl_rebalance.cpp
    function: (anonymous namespace)::ReplicationSourceInfo (anonymous namespace)::get_source_data_object_attributes(rsComm_t *, const rodsLong_t, const std::vector<leaf_bundle_t> &)
    line: 129
    code: -808000 (CAT_NO_ROWS_FOUND)
    message:
        rsGenQuery failed. genquery_inp contents:
maxRows: 256 continueInx: 0 rowOffset: 0
options: 0
selectInp.len: 4
    column: 403 DATA_NAME
    options: 1
    column: 501 COLL_NAME
    options: 1
    column: 421 DATA_MODE
    options: 1
    column: 423 DATA_RESC_ID
    options: 1
sqlCondInp.len: 3
    column: 401 DATA_ID
    condition: = '1334243'
    column: 423 DATA_RESC_ID
    condition: IN ('538280','79878534')
    column: 413 DATA_REPL_STATUS
    condition: = '1'



 possible iquest [select DATA_NAME, COLL_NAME, DATA_MODE, DATA_RESC_ID where DATA_ID = '1334243' and DATA_RESC_ID IN ('538280','79878534') and DATA_REPL_STATUS = '1']
        : [-]\t/irods_source/plugins/resources/src/passthru.cpp:855:irods::error passthru_file_rebalance(irods::plugin_context &) :  status [CAT_NO_ROWS_FOUND]  errno [] -- message []
        : [-]\t/irods_source/plugins/resources/src/passthru.cpp:849:irods::error passthru_file_rebalance(irods::plugin_context &) :  status [CAT_NO_ROWS_FOUND]  errno [] -- message []
stack trace:
--------------
 0# irods::stacktrace::dump() const in /lib/libirods_common.so.4.3.1
 1# irods::exception::assemble_full_display_what() const in /lib/libirods_common.so.4.3.1
 2# irods::exception::what() const in /lib/libirods_common.so.4.3.1
 3# irods::error::result() const in /lib/libirods_common.so.4.3.1
 4# irods::log(irods::error const&) in /lib/libirods_common.so.4.3.1
 5# passthru_file_rebalance(irods::plugin_context&) in /usr/lib/irods/plugins/resources/libpassthru.so
 6# std::__1::__function::__func<irods::error (*)(irods::plugin_context&), std::__1::allocator<irods::error (*)(irods::plugin_context&)>, irods::error (irods::plugin_context&)>::operator()(irods::plugin_context&) in /lib/libirods_server.so.4.3.1
 7# irods::error irods::plugin_base::call<>(RsComm*, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, boost::shared_ptr<irods::first_class_object>)::'lambda'(irods::plugin_context&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >*)::operator()(irods::plugin_context&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >*) const in /lib/libirods_server.so.4.3.1
 8# std::__1::__function::__func<irods::error irods::plugin_base::call<>(RsComm*, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, boost::shared_ptr<irods::first_class_object>)::'lambda'(irods::plugin_context&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >*), std::__1::allocator<irods::error irods::plugin_base::call<>(RsComm*, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, boost::shared_ptr<irods::first_class_object>)::'lambda'(irods::plugin_context&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >*)>, irods::error (irods::plugin_context&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >*)>::operator()(irods::plugin_context&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >*&&) in /lib/libirods_server.so.4.3.1
 9# irods::error irods::plugin_base::call<>(RsComm*, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, boost::shared_ptr<irods::first_class_object>) in /lib/libirods_server.so.4.3.1
10# _rsGeneralAdmin(RsComm*, GeneralAdminInput*) in /lib/libirods_server.so.4.3.1
11# rsGeneralAdmin(RsComm*, GeneralAdminInput*) in /lib/libirods_server.so.4.3.1
12# irods::api_call_adaptor<GeneralAdminInput*>::operator()(irods::plugin_context&, RsComm*, GeneralAdminInput*) in /lib/libirods_server.so.4.3.1
13# std::__1::__function::__func<irods::api_call_adaptor<GeneralAdminInput*>, std::__1::allocator<irods::api_call_adaptor<GeneralAdminInput*> >, irods::error (irods::plugin_context&, RsComm*, GeneralAdminInput*)>::operator()(irods::plugin_context&, RsComm*&&, GeneralAdminInput*&&) in /lib/libirods_server.so.4.3.1
14# int irods::api_entry::call_handler<GeneralAdminInput*>(RsComm*, GeneralAdminInput*) in /lib/libirods_server.so.4.3.1
15# rsApiHandler(RsComm*, int, BytesBuf*, BytesBuf*) in /lib/libirods_server.so.4.3.1
16# readAndProcClientMsg(RsComm*, int) in /lib/libirods_server.so.4.3.1
17# agentMain(RsComm*) in /lib/libirods_server.so.4.3.1
18# runIrodsAgentFactory(sockaddr_un) in /lib/libirods_server.so.4.3.1
19# main::$_5::operator()() const at rodsServer.cpp:?
20# main in /usr/sbin/irodsServer
21# __libc_start_main in /lib/x86_64-linux-gnu/libc.so.6
22# _start in /usr/sbin/irodsServer

","request_api_name":"GENERAL_ADMIN_AN","request_api_number":701,"request_api_version":"d","request_client_user":"rods","request_host":"172.31.32.83","request_proxy_user":"rods","request_release_version":"rods4.3.1","server_host":"p-irods-001","server_pid":3315516,"server_timestamp":"2025-03-13T10:38:21.749Z","server_type":"agent","server_zone":"nki"}
 {"log_category":"legacy","log_level":"error","log_message":"iRODS Exception:
    file: /irods_source/plugins/resources/replication/src/irods_repl_rebalance.cpp
    function: (anonymous namespace)::ReplicationSourceInfo (anonymous namespace)::get_source_data_object_attributes(rsComm_t *, const rodsLong_t, const std::vector<leaf_bundle_t> &)
    line: 129
    code: -808000 (CAT_NO_ROWS_FOUND)
    message:
        rsGenQuery failed. genquery_inp contents:
maxRows: 256 continueInx: 0 rowOffset: 0
options: 0
selectInp.len: 4
    column: 403 DATA_NAME
    options: 1
    column: 501 COLL_NAME
    options: 1
    column: 421 DATA_MODE
    options: 1
    column: 423 DATA_RESC_ID
    options: 1
sqlCondInp.len: 3
    column: 401 DATA_ID
    condition: = '1334243'
    column: 423 DATA_RESC_ID
    condition: IN ('538280','79878534')
    column: 413 DATA_REPL_STATUS
    condition: = '1'



 possible iquest [select DATA_NAME, COLL_NAME, DATA_MODE, DATA_RESC_ID where DATA_ID = '1334243' and DATA_RESC_ID IN ('538280','79878534') and DATA_REPL_STATUS = '1']
        : [-]\t/irods_source/plugins/resources/src/passthru.cpp:855:irods::error passthru_file_rebalance(irods::plugin_context &) :  status [CAT_NO_ROWS_FOUND]  errno [] -- message []
        : [-]\t/irods_source/plugins/resources/src/passthru.cpp:855:irods::error passthru_file_rebalance(irods::plugin_context &) :  status [CAT_NO_ROWS_FOUND]  errno [] -- message []
        failed to rebalance resource: [-]\t/irods_source/server/api/src/rsGeneralAdmin.cpp:1100:int _rsGeneralAdmin(rsComm_t *, generalAdminInp_t *) :  status [CAT_NO_ROWS_FOUND]  errno [] -- message [failed to rebalance resource]
stack trace:
--------------
 0# irods::stacktrace::dump() const in /lib/libirods_common.so.4.3.1
 1# irods::exception::assemble_full_display_what() const in /lib/libirods_common.so.4.3.1
 2# irods::exception::what() const in /lib/libirods_common.so.4.3.1
 3# irods::error::result() const in /lib/libirods_common.so.4.3.1
 4# irods::log(irods::error const&) in /lib/libirods_common.so.4.3.1
 5# _rsGeneralAdmin(RsComm*, GeneralAdminInput*) in /lib/libirods_server.so.4.3.1
 6# rsGeneralAdmin(RsComm*, GeneralAdminInput*) in /lib/libirods_server.so.4.3.1
 7# irods::api_call_adaptor<GeneralAdminInput*>::operator()(irods::plugin_context&, RsComm*, GeneralAdminInput*) in /lib/libirods_server.so.4.3.1
 8# std::__1::__function::__func<irods::api_call_adaptor<GeneralAdminInput*>, std::__1::allocator<irods::api_call_adaptor<GeneralAdminInput*> >, irods::error (irods::plugin_context&, RsComm*, GeneralAdminInput*)>::operator()(irods::plugin_context&, RsComm*&&, GeneralAdminInput*&&) in /lib/libirods_server.so.4.3.1
 9# int irods::api_entry::call_handler<GeneralAdminInput*>(RsComm*, GeneralAdminInput*) in /lib/libirods_server.so.4.3.1
10# rsApiHandler(RsComm*, int, BytesBuf*, BytesBuf*) in /lib/libirods_server.so.4.3.1
11# readAndProcClientMsg(RsComm*, int) in /lib/libirods_server.so.4.3.1
12# agentMain(RsComm*) in /lib/libirods_server.so.4.3.1
13# runIrodsAgentFactory(sockaddr_un) in /lib/libirods_server.so.4.3.1
14# main::$_5::operator()() const at rodsServer.cpp:?
15# main in /usr/sbin/irodsServer
16# __libc_start_main in /lib/x86_64-linux-gnu/libc.so.6
17# _start in /usr/sbin/irodsServer

","request_api_name":"GENERAL_ADMIN_AN","request_api_number":701,"request_api_version":"d","request_client_user":"rods","request_host":"172.31.32.83","request_proxy_user":"rods","request_release_version":"rods4.3.1","server_host":"p-irods-001","server_pid":3315516,"server_timestamp":"2025-03-13T10:38:22.292Z","server_type":"agent","server_zone":"nki"}
 {"log_category":"api","log_level":"info","log_message":"rsGeneralAdmin: rcGeneralAdmin error -808000","request_api_name":"GENERAL_ADMIN_AN","request_api_number":701,"request_api_version":"d","request_client_user":"rods","request_host":"172.31.32.83","request_proxy_user":"rods","request_release_version":"rods4.3.1","server_host":"p-irods-001","server_pid":3315516,"server_timestamp":"2025-03-13T10:38:22.335Z","server_type":"agent","server_zone":"nki"}


Terrell Russell

Mar 13, 2025, 9:22:32 AM
to irod...@googlegroups.com
Hi Jan,

What *is* the result of running that query?

select DATA_NAME, COLL_NAME, DATA_MODE, DATA_RESC_ID where DATA_ID = '1334243' and DATA_RESC_ID IN ('538280','79878534') and DATA_REPL_STATUS = '1'

I expect it will find no rows, just as it says.

This query is asking for the four columns for that data_id... on either of those resources... where the status is 'good'.
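
A minimal sketch of the two queries worth comparing here: the one the rebalance log prints (restricted to good replicas) and the same query without the status filter. The helper name `rebalance_source_query` is hypothetical and only assembles a GenQuery string to paste into `iquest`; the column names and IDs are copied from the log above.

```python
def rebalance_source_query(data_id, resc_ids, good_only=True):
    """Build a GenQuery string shaped like the one in the rebalance log.

    Hypothetical helper for illustration; it does not talk to iRODS.
    """
    ids = ",".join(f"'{r}'" for r in resc_ids)
    query = (
        "select DATA_NAME, COLL_NAME, DATA_MODE, DATA_RESC_ID "
        f"where DATA_ID = '{data_id}' and DATA_RESC_ID IN ({ids})"
    )
    if good_only:
        # This is the condition that turns "replica exists" into "good replica exists".
        query += " and DATA_REPL_STATUS = '1'"
    return query


# Query as logged by rebalance (good replicas only).
print(rebalance_source_query(1334243, [538280, 79878534]))
# Same query without the status filter, to see whether any replica is there at all.
print(rebalance_source_query(1334243, [538280, 79878534], good_only=False))
```

If the second query returns a row and the first does not, the replica exists on one of the leaves but is not marked good.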


If you DON'T have a good replica on res-01-04a:unixfilesystem, then... this result of NO ROWS FOUND is expected and correct.

If you DO have a good replica on res-01-04a:unixfilesystem... then you have found a bug.   I see you're running 4.3.1.

This is the most-likely relevant bug fixed since 4.3.1...

Terrell





--
--
The Integrated Rule-Oriented Data System (iRODS) - https://irods.org
 
iROD-Chat: http://groups.google.com/group/iROD-Chat

Jan de Graaf

Mar 17, 2025, 6:11:19 AM
to iRODS-Chat
Hi Terrell,

 iquest "select DATA_NAME, COLL_NAME, DATA_MODE, DATA_RESC_ID where DATA_ID = '1334243' and DATA_RESC_ID IN ('538280','79878534') and DATA_REPL_STATUS = '1'"
CAT_NO_ROWS_FOUND: Nothing was found matching your query

Object 1334243 doesn't exist on resource 79878534 (that's res-01-04b), so you're correct: no rows is the result. There is a replica on 538280 (res-01-04a, the archive tier). I'm trying to get 79878534 in sync with 538280, so the data is not yet on both.

iquest "select DATA_NAME, COLL_NAME, DATA_MODE, DATA_REPL_STATUS ,DATA_RESC_ID where DATA_ID = '1334243'"
DATA_NAME = 20131014_Fusion_OB_KK_VBIM_10_1.raw
COLL_NAME = /nki/home/archive/mass_spec/02_Projects/Project_001_to_550/Project_001_to_050/Project_003_Peeper_Kristel_Kemper
DATA_MODE = 0
DATA_REPL_STATUS = 0
DATA_RESC_ID = 538280
------------------------------------------------------------
DATA_NAME = 20131014_Fusion_OB_KK_VBIM_10_1.raw
COLL_NAME = /nki/home/archive/mass_spec/02_Projects/Project_001_to_550/Project_001_to_050/Project_003_Peeper_Kristel_Kemper
DATA_MODE = 0
DATA_REPL_STATUS = 1
DATA_RESC_ID = 538240
------------------------------------------------------------
538240 is the first resource and 538280 is the second resource in a tiering group. I do see that DATA_REPL_STATUS is 0 on 538280, where 1 would be expected. So that's the issue here, if I read your response correctly. (So no access-related bug.)
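
A small sketch of how output like the above could be scanned for replicas that are not marked good. It assumes the `KEY = value` layout with dashed separator lines that `iquest` printed here; `parse_iquest` and `not_good` are hypothetical helper names, and the sample is trimmed to three of the columns.

```python
def parse_iquest(text):
    """Split iquest's 'KEY = value' output into one dict per row."""
    rows, current = [], {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("-----"):
            # A dashed line closes the current row.
            if current:
                rows.append(current)
                current = {}
        elif " = " in line:
            key, value = line.split(" = ", 1)
            current[key] = value
    if current:
        rows.append(current)
    return rows


def not_good(rows):
    """Return rows whose DATA_REPL_STATUS is anything other than '1' (good)."""
    return [r for r in rows if r.get("DATA_REPL_STATUS") != "1"]


sample = """\
DATA_NAME = 20131014_Fusion_OB_KK_VBIM_10_1.raw
DATA_REPL_STATUS = 0
DATA_RESC_ID = 538280
------------------------------------------------------------
DATA_NAME = 20131014_Fusion_OB_KK_VBIM_10_1.raw
DATA_REPL_STATUS = 1
DATA_RESC_ID = 538240
------------------------------------------------------------
"""

for row in not_good(parse_iquest(sample)):
    print(row["DATA_RESC_ID"], row["DATA_NAME"])
```

With the sample above, only the replica on 538280 is reported.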

ils -L /nki/home/archive/mass_spec/02_Projects/Project_001_to_550/Project_001_to_050/Project_003_Peeper_Kristel_Kemper/20131014_Fusion_OB_KK_VBIM_10_1.raw
  rods              0 res-01-store01;res-01-pt01;res-01-01    162496006 2013-10-15.05:10 & 20131014_Fusion_OB_KK_VBIM_10_1.raw
        generic    /irods/res-01-01/home/archive/mass_spec/02_Projects/Project_001_to_550/Project_001_to_050/Project_003_Peeper_Kristel_Kemper/20131014_Fusion_OB_KK_VBIM_10_1.raw
  rods              1 res-01-store04;res-01-pt04;res-01-repl04;res-01-repl04-pt04a;res-01-04a    162496006 2024-11-17.21:05 X 20131014_Fusion_OB_KK_VBIM_10_1.raw
        generic    /mnt_nfs_azure_archive/res-01-04/home/archive/mass_spec/02_Projects/Project_001_to_550/Project_001_to_050/Project_003_Peeper_Kristel_Kemper/20131014_Fusion_OB_KK_VBIM_10_1.raw

There it also shows an X in the status for the file on res-01-04a.

So I ran: irepl -U -S res-01-store01 -R res-01-store04 /nki/home/archive/mass_spec/02_Projects/Project_001_to_550/Project_001_to_050/Project_003_Peeper_Kristel_Kemper/20131014_Fusion_OB_KK_VBIM_10_1.raw
to see if that would bring the two replicas (created by the tiering) back in sync.

ils -L /nki/home/archive/mass_spec/02_Projects/Project_001_to_550/Project_001_to_050/Project_003_Peeper_Kristel_Kemper/20131014_Fusion_OB_KK_VBIM_10_1.raw
  rods              0 res-01-store01;res-01-pt01;res-01-01    162496006 2013-10-15.05:10 & 20131014_Fusion_OB_KK_VBIM_10_1.raw
        generic    /irods/res-01-01/home/archive/mass_spec/02_Projects/Project_001_to_550/Project_001_to_050/Project_003_Peeper_Kristel_Kemper/20131014_Fusion_OB_KK_VBIM_10_1.raw
  rods              1 res-01-store04;res-01-pt04;res-01-repl04;res-01-repl04-pt04a;res-01-04a    162496006 2025-03-14.09:57 & 20131014_Fusion_OB_KK_VBIM_10_1.raw
        generic    /mnt_nfs_azure_archive/res-01-04/home/archive/mass_spec/02_Projects/Project_001_to_550/Project_001_to_050/Project_003_Peeper_Kristel_Kemper/20131014_Fusion_OB_KK_VBIM_10_1.raw
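
For more than a handful of objects, the same irepl call could be generated per path. This is only a sketch: `irepl_update_command` is a hypothetical helper, the example paths are made up, and the flags and resource names are copied from the command shown earlier in this message. It only prints command lines and assumes the replica on the source resource is known to be good.

```python
import shlex


def irepl_update_command(path, source="res-01-store01", target="res-01-store04"):
    """Return the argv list for the irepl call used in this thread."""
    return ["irepl", "-U", "-S", source, "-R", target, path]


paths = [
    "/nki/home/archive/example/file one.raw",  # hypothetical path with a space
    "/nki/home/archive/example/file_two.raw",  # hypothetical path
]

for p in paths:
    # shlex.join quotes each argument so the printed line can be pasted into a shell.
    print(shlex.join(irepl_update_command(p)))
```

Printing the commands first makes it possible to review them before running anything against the zone.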

The iquest query:
iquest "select DATA_NAME, COLL_NAME, DATA_MODE, DATA_RESC_ID where DATA_ID = '1334243' and DATA_RESC_ID IN ('538280','79878534') and DATA_REPL_STATUS = '1'"
DATA_NAME = 20131014_Fusion_OB_KK_VBIM_10_1.raw
COLL_NAME = /nki/home/archive/mass_spec/02_Projects/Project_001_to_550/Project_001_to_050/Project_003_Peeper_Kristel_Kemper
DATA_MODE = 0
DATA_RESC_ID = 538280
------------------------------------------------------------
It now returns 1 row.

The rebalance output now:

iadmin modresc res-01-store04 rebalance
remote addresses: 172.31.32.83 ERROR: rcGeneralAdmin failed with error -808000 CAT_NO_ROWS_FOUND

Still an error but now on a different object: 
 {"log_category":"legacy","log_level":"error","log_message":"iRODS Exception:\n    file: /irods_source/plugins/resources/replication/src/irods_repl_rebalance.cpp\n    function: (anonymous namespace)::ReplicationSourceInfo (anonymous namespace)::get_source_data_object_attributes(rsComm_t *, const rodsLong_t, const std::vector<leaf_bundle_t> &)\n    line: 129\n    code: -808000 (CAT_NO_ROWS_FOUND)\n    message:\n        rsGenQuery failed. genquery_inp contents:\nmaxRows: 256 continueInx: 0 rowOffset: 0\noptions: 0\nselectInp.len: 4\n    column: 403 DATA_NAME\n    options: 1\n    column: 501 COLL_NAME\n    options: 1\n    column: 421 DATA_MODE\n    options: 1\n    column: 423 DATA_RESC_ID\n    options: 1\nsqlCondInp.len: 3\n    column: 401 DATA_ID\n    condition: = '1334248'\n    column: 423 DATA_RESC_ID\n    condition: IN ('538280','79878534')\n    column: 413 DATA_REPL_STATUS\n    condition: = '1'\n\n\n\n possible iquest [select DATA_NAME, COLL_NAME, DATA_MODE, DATA_RESC_ID where DATA_ID = '1334248' and DATA_RESC_ID IN ('538280','79878534') and DATA_REPL_STATUS = '1']\n 

This object has the same DATA_REPL_STATUS issue as the previous object.

So replication of objects with the tiering plugin does not seem to play nice? I checked a random set of files, and they are all the same: the data matches, but they have an invalid status on the archive tier. Something is causing the files to end up in an invalid status while the data is OK. This keeps the rebalance from picking up the files. For now, I manually corrected the file status and the rebalance is now running.
I will see whether files newly tiered through the storage tiering plugin also end up in an invalid state, or whether this was a one-off issue.

One more question about the rebalance: is this an async process that syncs multiple files (out of a queue) at once, just like the storage tiering? Or does it just go RBAR through the list of files and process one file at a time? I ask because the process is slow: during the weekend it only synced 1 TB per day, and I'm trying to find out what is making it slow.

Best,
Jan


On Thursday, March 13, 2025 at 14:22:32 UTC+1, Terrell Russell wrote:

Terrell Russell

Mar 17, 2025, 8:07:14 AM
to irod...@googlegroups.com
Hi Jan,

Good detective work.   Yes, looks like some other reason is causing the replicas on your archive to be 'stale'.   The rebalance logic appears to be looking for replicas correctly.

To answer your final question...  rebalance is currently a serial, synchronous process, correct.  We wanted to start with a no-surprises, methodical way of handling any errors or edge cases that we had not anticipated.  To make it faster, we have almost all the parts needed...

Absorbing metadata guard

so that we can

Allow rebalance to continue on any failures

which would make possible

Adding parallel / asynchronous behavior to rebalance


Terrell



Jan de Graaf

Mar 17, 2025, 11:36:07 AM
to iRODS-Chat
Terrell,

Good to hear that work is being done to improve this.

Another thing that got my attention is the high database load that the rebalance produces. It's constantly pulling between 0.5 and 1 Gbit/s
(my iCAT server is separate from the actual DB server due to infrastructure issues here at the NKI). Since my iCAT server is connected with at most a 1 Gbit network link, that's my complete network capacity on that connection just for coordinating a rebalance operation?
On the PostgreSQL server side I also see a lot of connections and processing.

 
[Attached images: db connections.png, cpuload.png]
*The drop is where I stopped the rebalance.

[Attached image: iftop.png]

And this is on a Standard_D8ds_v5 (8 vCores, 32 GiB memory) Azure machine. If the processing is serial/synchronous, that is a high load and a lot of traffic for a relatively slow process.
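
To put the numbers from this thread side by side, here is a quick back-of-the-envelope calculation. It assumes 1 TB means 10^12 bytes and that the 1 TB per day was spread evenly over 24 hours; under those assumptions the payload rate is about 93 Mbit/s, so the 0.5 to 1 Gbit/s seen on the database link is roughly 5 to 11 times the data actually being synced.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def mbit_per_s(bytes_per_day):
    """Convert a daily byte volume into an average rate in Mbit/s."""
    return bytes_per_day * 8 / SECONDS_PER_DAY / 1e6


# Assumption: 1 TB = 10**12 bytes, synced evenly over a full day.
payload = mbit_per_s(1e12)
print(f"payload rate: {payload:.1f} Mbit/s")

# Observed traffic range on the iCAT-to-database link, in Mbit/s.
for db_link in (500, 1000):
    ratio = db_link / payload
    print(f"database link at {db_link} Mbit/s is {ratio:.1f}x the payload rate")
```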

Jan





On Monday, March 17, 2025 at 13:07:14 UTC+1, Terrell Russell wrote:

Terrell Russell

May 20, 2025, 2:39:56 PM
to irod...@googlegroups.com