irods performance and profiling

Gareth Williams

May 27, 2010, 2:49:15 AM
to irod...@googlegroups.com
Hi,

I've been looking at the performance of irods (as deployed by ARCS) to
try to understand some performance limitations we see and potentially
target areas to investigate for speedup. A couple of the motivating
performance limitations are:

1) recently I was running many separate clients (ils in a loop) to
debug memory leak problems, and each ils took a relatively long time
(we maxed out at about 4 per second).

arcs-admin@arcs-df ~/tmp $ time for i in `seq 1 100`; do ils > /dev/null; done
real 0m27.489s
user 0m0.281s
sys 0m0.360s

2) ils -l over a large collection is quite slow (and consumes a lot of
memory - but I'll ask about the memory separately!)

3) creation (and removal) of a lot of empty collections or small files
is relatively slow when done one at a time, and also slow when done as
a single recursive operation

-- see data at http://projects.arcs.org.au/trac/systems/wiki/DataServices/iRODS_profiling

I'd been running valgrind's memcheck to investigate the leaking (though
it didn't actually lead to finding the setenv/putenv problem so much as
it eliminated other potential problems) and, after having trouble
getting useful info out of gprof, I turned to callgrind (on the
irodsServer and irodsAgent), which gave me some results (not shown in
any detail) that I'm having trouble interpreting.

I'd guessed that the slowness of a succession of simple ils commands
might indicate a problem with the speed of forking new Agents to deal
with successive connections (knowing that apache keeps a pool of
workers - I think for this reason), but I could see no indication in
the profiling that significant time was spent in this activity. On the
other hand, I could not see cycles in 'select' - waiting for clients -
so I'm pretty unsure of the data. Maybe some of the extra time is spent
authenticating to irods and to postgres.

The time I do see dominating the Agents seems to come from strncpy and
strncat (also chlGenQuery and copyStrFromPiBuf).

From searching for info on strncpy, it seems that it can have a large
overhead if the strings written are typically much smaller than the
destination buffers, since strncpy pads the rest of the buffer with
null bytes. Could that be the case here, and would it be a clear target
for optimization (reducing the buffer sizes, or using alternative
string handling routines)?

I guess it is likely that the chlGenQuery time is mostly spent waiting
on the database, and that we would need to speed the database up to get
an improvement. However, if it is CPU bound in one thread, the change
would need to be a query optimization rather than more hardware or a
better network (unless better use of memory is possible on the db
server).
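
One way to check whether the time really is going to the database might
be to turn on slow-statement logging on the Postgres side and watch its
log while re-running the ils loop. A minimal sketch, assuming the iCAT
database is named ICAT and that you can edit postgresql.conf:

# see whether slow-statement logging is enabled (-1 means off; value is in ms)
psql -d ICAT -c "select name, setting from pg_settings where name = 'log_min_duration_statement';"
# then set log_min_duration_statement = 100 in postgresql.conf and reload, e.g.
# pg_ctl reload -D /path/to/pgdata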

Sorry if this is a lot to digest. We could break it into multiple
threads depending on how the conversation goes.

cheers,

Gareth - ARCS Data Team

Jean-Yves Nief

May 27, 2010, 8:49:54 AM
to irod...@googlegroups.com
hello Gareth,

you raise interesting points.
On a side note, for iRODS performance another key factor is the iCAT
database performance, and in some of your tests below it has an impact.
Recomputing the indexes on a regular basis is important: here, I am
doing it once per week for each iCAT (cron task). Even for a catalog
with millions of files, if you have ingested tens of thousands of new
files you will see an improvement after reindexing (for example, an ils
that took 0.3 s will go down to less than 0.1 s once reindexing has
been done). We have Oracle here, but for Postgres the vacuuming also
has to be performed. Of course, things also depend on the hardware
configuration of the database.
Of course, this only has an impact on a few of the items that you
mentioned below.
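
For Postgres, the rough equivalent would be something like the
following - only a sketch, assuming the iCAT database is named ICAT,
and note that unlike the Oracle behaviour described above a plain
REINDEX blocks writes to the table while it runs:

psql -d ICAT -c 'REINDEX TABLE r_coll_main;'
psql -d ICAT -c 'REINDEX TABLE r_data_main;'
psql -d ICAT -c 'REINDEX TABLE r_objt_access;'
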
cheers,
JY

schr...@diceresearch.org

May 27, 2010, 12:31:05 PM
to irod...@googlegroups.com
Hello Gareth,

As Jean-Yves pointed out, the ICAT DBMS optimization is very important, a 'vacuum' can improve performance significantly.  Also,  we added another index in 2.3, 'create unique index idx_coll_main3 on R_COLL_MAIN (coll_name);' which you might make sure is in place.  See the thread http://groups.google.com/group/irod-chat/browse_thread/thread/6e2c9b3f306c041f#h for more about this; it made a very significant difference.
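
To check whether that index is already in place on Postgres, something
like this should work (just a sketch, assuming the iCAT database is
named ICAT):

psql -d ICAT -c "select indexname, indexdef from pg_indexes where tablename = 'r_coll_main';"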

If you are using GSI, you might try a test using iRODS passwords.  With GSI, each i-command takes a bit longer to authenticate than the iRODS password system does, and that would add up to quite a bit of time for a long series of i-commands. 

Yes, the general-query call will be spending most of its time waiting for the DBMS to respond.  With the indexes we have defined, it seems that the DBMSes perform pretty well (and scale well), but there may be some improvements that could be made.  But you might first look at just making sure Postgres is running well on your host.  There is information available from Postgres on performance tuning and optimization and platform requirements at various scales.
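
As a starting point, a few of the commonly tuned settings can be
inspected from psql (a sketch only; appropriate values depend on the
memory and workload of the database host, and the Postgres tuning
documentation covers them in detail):

psql -d ICAT -c 'show shared_buffers;'
psql -d ICAT -c 'show work_mem;'
psql -d ICAT -c 'show effective_cache_size;'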

I suppose the Agent could be spending a relatively large proportion of its time in strncpy/strncat, but that may not be actually slowing it down that much.  It will spend most of its time waiting on network or disk I/O and for responses from the DBMS.  The little bit of time when it is running, it will be passing buffers around and sometimes copying them.

 - Wayne -
--
"iRODS: the Integrated Rule-Oriented Data-management System; A community driven, open source, data grid software solution" https://www.irods.org

iROD-Chat: http://groups.google.com/group/iROD-Chat

Sridhar Reddapani

May 27, 2010, 7:44:10 PM
to irod...@googlegroups.com
Hello JY,

----- "Jean-Yves Nief" <ni...@cc.in2p3.fr> wrote:
> hello Gareth,
>
>              you raise interesting points.
> On a side note, for the irods performance, an other key factor is also
> the iCAT database performances and in some of your tests below, it has
> an impact. Recomputing the index on a regular basis is something
> important: here, I am doing it once per week for each iCAT (cron task).

We are already running VACUUM weekly in cron. We didn't dare to run reindexing in cron because during the reindexing process the ICAT will not respond to iRODS, which would cause problems for our users. I am wondering how you run the reindex - are you stopping iRODS before doing it?

> Even for a catalog with millions of files, if you have ingested tens of
> thousands of new files, you will see an improvement when doing
> reindexing (for example an ils performed in 0.3 s will goes down to less
> than 0.1 s when reindexing has been made). We have Oracle here, but for
> Postgres the vacuum cleaning has also to be performed. Of course, things
> also depends on the hardware setting for the database.
> Of course, it has only an impact on a few items that you mentioned below.
> cheers,
> JY
--
Thanks & Regards,
Sridhar Reddapani,
------------------------------------------------------------------------------------------------------------------------
| ARCS - Australian Research Collaboration Service | M +61 430725424 | sridhar....@arcs.org.au | www.arcs.org.au |

| Intersect | sridhar....@intersect.org.au |T +61 2 8079 2547 | www.intersect.org.au |
------------------------------------------------------------------------------------------------------------------------

Jean-Yves Nief

May 28, 2010, 5:07:31 AM
to irod...@googlegroups.com
hello Sridhar,

Sridhar Reddapani wrote:
> Hello JY,
> ----- "Jean-Yves Nief" <ni...@cc.in2p3.fr> wrote:
> > hello Gareth,
> >
> > you raise interesting points.
> > On a side note, for the irods performance, an other key factor is also
> > the iCAT database performances and in some of your tests below, it has
> > an impact. Recomputing the index on a regular basis is something
> > important: here, I am doing it once per week for each iCAT (cron task).
>
> We are already running Vaccum weekly in cron. We didn't dare to
> run RE-INDEXING in cron as during Re-Indexing process, ICAT will not
> respond to iRODS and cause problem to our users. I am wondering how
> you are running re-index? are you stopping iRODS before doing it?

No, iRODS is not stopped during reindexing. We are using Oracle here. In
Oracle, you have the "alter index" feature: it recomputes and updates
the index only if needed. On top of that, if the index you are trying to
update is in use by the iRODS servers at the time you are doing this
operation, it won't do anything, as the index will be locked. So done
that way, there is no interruption of service and performance can only
improve.
For Postgres, "alter index" has nothing to do with recomputing the
index; it just renames it. But starting with Postgres 8.2, it seems to
be possible to rebuild indexes online while the applications are still
running, though I don't know how.
cheers,
JY


>
> > Even for a catalog with millions of files, if you have ingested tens of
> > thousands of new files, you will see an improvement when doing
> > reindexing (for example an ils performed in 0.3 s will goes down to
> less
> > than 0.1 s when reindexing has been made). We have Oracle here, but for
> > Postgres the vacuum cleaning has also to be performed. Of course,
> things
> > also depends on the hardware setting for the database.
> > Of course, it has only an impact on a few items that you mentioned
> below.
> > cheers,
> > JY
> --
> Thanks & Regards,
> Sridhar Reddapani,
> ------------------------------------------------------------------------------------------------------------------------
> | ARCS - Australian Research Colaboration Service | M +61 430725424 | sridhar....@arcs.org.au | www.arcs.org.au |
> | Intersect | sridhar....@intersect.org.au |T +61 2 8079 2547 | www.intersect.org.au |
> ------------------------------------------------------------------------------------------------------------------------
>

Gareth Williams

May 31, 2010, 11:29:47 PM
to irod...@googlegroups.com
Hi JY, Mike and all,

As Sridhar said, we are vacuuming weekly - though with the following
commands to psql rather than the disruptive dboptimize (full vacuum):
VACUUM VERBOSE ANALYZE r_coll_main;
VACUUM VERBOSE ANALYZE r_data_main;
VACUUM VERBOSE ANALYZE r_objt_access;

Also I've been testing with a non-GSI user.

After the advice to vacuum and reindex, I tried a more manageably sized
test on a non-production 2.1 server (which is a virtual host with a
much smaller icat - 70MB vs 7GB). Some results are displayed at
http://projects.arcs.org.au/trac/systems/wiki/DataServices/iRODS_profiling
with comments.

In summary, on this system there was no major improvement after vacuum
or reindex, but clearly some existing indexes are being effective. I
think it follows that the indexes may well have been relatively
up-to-date, whereas on our production system they may be quite stale.

I consider the times (20 to 40 collections processed per second for
icp -r, ils -r and irm -rf) to be prohibitively slow, and a large part
of the slowness is not helped enough by the vacuum or the existing
indexes. Can anyone offer other ideas?

Would anyone be willing to run similar tests on their systems (iput -r,
icp -r, iget -r, ils -r and irm -rf on 32x32 collections) to see if our
setup is unusual?
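
For reference, the sort of test I have in mind is roughly the following
sketch (it assumes a scratch directory under the current working
directory and the current working collection, and puts one small file
in each leaf so there is something to transfer):

# build a 32x32 tree of directories with one small file per leaf
mkdir -p test32
for i in `seq 1 32`; do for j in `seq 1 32`; do mkdir -p test32/c$i/d$j; echo x > test32/c$i/d$j/f.txt; done; done
time iput -r test32
time icp -r test32 test32.copy
time ils -r test32 > /dev/null
mkdir /tmp/test32.get
time iget -r test32 /tmp/test32.get
time irm -rf test32 test32.copy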

> no iRODS is not stopped during reindexing. We are using Oracle here. In
> Oracle, you have the "alter index" feature: it recompute and update the
> index only if needed. On top of that, if the index you are trying to update
> is used by the iRODS servers at the time you are doing this operation, it
> won't do anything as the index will be locked. So doing that way, there is
> no interruption of service and the performances can only improve.
> For PostGres, "alter index" has nothing to do with recomputing the index, it
> just rename them. But starting with PostGres 8.2, it seems to be possible to
> rebuild indexes online while the applications are still running, but I don't
> know how.
> cheers,
> JY

I see that postgres now has a 'create index concurrently' syntax. It
requires you to drop the existing index first, and the documentation
caveats suggest that it can fail (but you should just try again...) and
that it is more expensive than building from scratch with locks or
reindexing with locks, but that may be a fine price to pay for the
performance gain. On the other hand, if it takes hours, performance may
suffer so much in the interim that the system is unusable. Time to
test...

Well, on the test system, drop index; create index concurrently was
fast and only 10-30% slower than without concurrently, but it is a
small db.

Testing the production system is another matter.

Otherwise, I had an idea: create the index with a new name, then drop
the existing index and rename the new one - it seems to work and would
reduce the time spent without an index to a minimum. I tried:
psql -d ICAT -c 'create unique index concurrently tmp_idx_objt_access1
on R_OBJT_ACCESS (object_id,user_id);'
psql -d ICAT -c 'drop index idx_objt_access1;'
psql -d ICAT -c 'alter index tmp_idx_objt_access1 rename to idx_objt_access1;'

maybe that is OK to test on the production system if I check that the
build of the tmp_ index works.
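
One way to make that check (a sketch only): create index concurrently
can leave an invalid index behind if it fails, and pg_index records
whether an index is valid, so something like this should show it:

psql -d ICAT -c "select c.relname, i.indisvalid from pg_index i join pg_class c on c.oid = i.indexrelid where c.relname like '%idx_objt_access1';"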

any comments?

-- Gareth

Gareth Williams

Jun 1, 2010, 5:09:20 AM
to irod...@googlegroups.com
> Otherwise I had an idea for creating the index with a new name then
> dropping the existing index and renaming the new one - and it seems to
> work and would reduce the time without an index to be minimal.  I
> tried:
> psql -d ICAT -c 'create unique index concurrently tmp_idx_objt_access1
> on R_OBJT_ACCESS (object_id,user_id);'
> psql -d ICAT -c 'drop index idx_objt_access1;'
> psql -d ICAT -c 'alter index tmp_idx_objt_access1 rename to idx_objt_access1;'
>
> maybe that is OK to test on the production system if I check that the

> build of the tmp_ index works.

for reference, I posted this part of my question to pgsql-general
here: http://archives.postgresql.org/pgsql-general/2010-06/msg00014.php

- Gareth

mw...@diceresearch.org

Jun 1, 2010, 1:01:09 PM
to irod...@googlegroups.com
Gareth,

Looks like the icommands are a lot faster on your test system than on
your production system. Are they running the same iRODS version?
If you are not running 2.3, be sure to have this index in place, as
suggested by Wayne:

>we added another index in 2.3, 'create unique index idx_coll_main3 on R_COLL_MAIN (coll_name);' which you might make sure is in place

For an iRODS system with a large collection of data, I think you'll
need to run the iCAT on the best hardware available. I think some of
our users run the iCAT on SSD disks, and maybe they can share their
experience.

Mike

-------- Original Message --------
Subject: Re: [iROD-Chat:4172] irods performance and profiling
From: Gareth Williams <gareth....@arcs.org.au>
Otherwise I had an idea for creating the index with a new name then
dropping the existing index and renaming the new one - and it seems to
work and would reduce the time without an index to be minimal. I
tried:
psql -d ICAT -c 'create unique index concurrently tmp_idx_objt_access1
on R_OBJT_ACCESS (object_id,user_id);'
psql -d ICAT -c 'drop index idx_objt_access1;'
psql -d ICAT -c 'alter index tmp_idx_objt_access1 rename to idx_objt_access1;'

maybe that is OK to test on the production system if I check that the

build of the tmp_ index works.

any comments?

-- Gareth

Gareth Williams

Jun 3, 2010, 10:26:38 PM
to irod...@googlegroups.com
On Wed, Jun 2, 2010 at 3:01 AM, <mw...@diceresearch.org> wrote:
> Gareth,
> Looks like icommands are a lot faster on your test system than on your
> production system. Are they running the same iRODS version ?
> If you are not running 2.3, be sure to have this indexing in place as
> suggested
> by Wayne:
>>we added another index in 2.3, 'create unique index idx_coll_main3 on
>> R_COLL_MAIN (coll_name);' >which you might make sure is in place
> For iRods system with large collection of data, I think you'll need to run
> the iCAT on the
> best hardware available. I think some of our users run the iCAT on SSD disks
> and maybe
> they can share their experience.
> Mike

Hi Mike,

Yes, we have that extra index, and thanks for the suggestion of SSDs.
I've given up on reindexing after figuring out why it might be needed
(see followups to
http://archives.postgresql.org/pgsql-general/2010-06/msg00014.php).
We will do a full vacuum (and reindex at least once, to test whether it
matters) when we have a planned outage. It seems to me that the regular
vacuum analyze should be sufficient given the nature of our data
(large, mostly static, growing, but with relatively little deletion or
replacement relative to total holdings).

This leaves us with our performance issues unresolved. I'll measure
again after our next outage (not scheduled yet) and report back.

In the meantime, it would be useful for other sites with large
production systems to provide some performance info, to see if we are
atypical. Any volunteers?

> Would anyone be willing to run similar tests on their systems (iput
> -r, icp -r, iget -r, ils -r and irm -rf on 32x32 collections) to see
> if our setup is unusual?

regards,

Gareth
