Slow Discovery indexing performance in DSpace 6

29 views
Skip to first unread message

Alan Orth

unread,
Feb 6, 2020, 2:47:04 AM2/6/20
to DSpace Technical Support
Dear list,

I'm testing an upgrade of a DSpace 5.8 instance to DSpace 6.3 and one of the first things I notice is that Discovery indexing is about three or four times slower than it was before. On the same hardware, my repository with ~85,000 items takes 30 minutes to index with DSpace 5 and three hours with DSpace 6.3 and DSpace 6.4-SNAPSHOT. My development environment is on Linux with a fast SSD and lots of RAM, so I fear it will be even worse on our production server.

I have read that the new Hibernate database layer in DSpace 6 involves much more complicated or time-consuming database queries. How are other people handling this? We're using PostgreSQL 9.6. Could it be time to move to something higher to hopefully gain something from PostgreSQL's own advances?

Thanks for any comments or suggestions,

Mark H. Wood

unread,
Feb 6, 2020, 9:15:15 AM2/6/20
to DSpace Technical Support
On Thu, Feb 06, 2020 at 09:46:49AM +0200, Alan Orth wrote:
> Dear list,
>
> I'm testing an upgrade of a DSpace 5.8 instance to DSpace 6.3 and one of
> the first things I notice is that Discovery indexing is about three or four
> times slower than it was before. On the same hardware, my repository with
> ~85,000 items takes 30 minutes to index with DSpace 5 and three hours with
> DSpace 6.3 and DSpace 6.4-SNAPSHOT. My development environment is on Linux
> with a fast SSD and lots of RAM, so I fear it will be even worse on our
> production server.
>
> I have read that the new Hibernate database layer in DSpace 6 involves much
> more complicated or time-consuming database queries. How are other people
> handling this? We're using PostgreSQL 9.6. Could it be time to move to
> something higher to hopefully gain something from PostgreSQL's own advances?

I don't know that upgrading PostgreSQL will help your indexing
performance all that much, but it shouldn't hurt. We run production
against Pg 10.9 and I develop DSpace 5, 6, and 7 against 12.1.

Hibernate does tend to fetch more stuff, but it also caches very
aggressively and rather well, so it's hard to say whether it is
contributing to any particular slow-down. There have been specific
DSpace operations in which Hibernate was found to be a source of
excess activity, but I think that most of them have been addressed in
patches scheduled for 6.4. I have no doubt that there are others.

Probably the most methodical approach would be to run indexing with a
profiler and find out where the time is being spent. Since
command-line indexing involves three processes (bin/dspace, Pg, and
Tomcat (running Solr)) it would be good to pay particular attention to
time spent waiting on another process.

Short of profiling, tools like 'top' and 'iotop' will give a rough
idea of whether the system is generally busier and suggest which parts
are responsible. You might be able to set up 'strace' or the like to
log mainly I/O calls and grind some statistics out of the log.

(I really should try some of these myself....)

--
Mark H. Wood
Lead Technology Analyst

University Library
Indiana University - Purdue University Indianapolis
755 W. Michigan Street
Indianapolis, IN 46202
317-274-0749
www.ulib.iupui.edu
signature.asc

Alan Orth

unread,
Feb 6, 2020, 2:08:27 PM2/6/20
to DSpace Technical Support
Dear Mark,

I think flame graphs might be what we need to look into. Brendan more or less pioneered this way of doing visualization of performance analysis while troubleshooting MySQL in 2011, and now the tooling ecosystem has really developed. We should definitely look into this. I've only ever done amateur performance analysis with strace, iotop, VisualVM, etc, but I will try to see if I can figure out how to generate a flame graph of DSpace while it is indexing.

Regarding patches pending in 6.4, I've been testing 6.4-SNAPSHOT and it's the same. :\

Cheers,


Brendan is a pioneer in this type of performance visualization that I see

--
All messages to this mailing list should adhere to the DuraSpace Code of Conduct: https://duraspace.org/about/policies/code-of-conduct/
---
You received this message because you are subscribed to the Google Groups "DSpace Technical Support" group.
To unsubscribe from this group and stop receiving emails from it, send an email to dspace-tech...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/dspace-tech/20200206141511.GE11530%40IUPUI.Edu.

Alan Orth

unread,
Feb 7, 2020, 2:22:49 AM2/7/20
to DSpace Technical Support
Following up on this, it is extremely easy to generate flame graphs on Linux using perf to collect the data and process it with Brendan's perl script. See here:


I'm attaching a flame graph generated from sixty seconds of profiling during a DSpace 6.3 Discovery indexing (view it in Firefox). The y-axis shows what is running and the x-axis shows how much. As you can see, the system profiler doesn't have any visibility into the Java method calls, so everything Java is just "libjvm.so"! Incidentally, Brendan gave a talk about doing "mixed mode" flame graphs to capture system profiling data AND Java method data, see his 2016 presentation:


I will try to see what we need to do to get visibility into the Java method calls. He has written much about this on his personal blog as well as the Netflix engineering blog:


This is definitely promising.
perf-kernel-dspace63.svg

Alan Orth

unread,
Feb 9, 2020, 8:32:26 AM2/9/20
to DSpace Technical Support
Good afternoon,

I spent some time on this and have produced flame graphs for the command line Java process (not Tomcat) during Discovery indexing for both DSpace 5.10 and DSpace 6.4-SNAPSHOT with the same exact data set and Discovery configuration:


If you view the SVG files in Firefox they are interactive. The sampling time was sixty seconds. Green is Java. The width of the stacks indicate time spent on CPU, not progress. What is immediately obvious is that we spend an obscene amount of time in Hibernate in DSpace 6. In my experience the DSpace 5.10 code indexes our 85,000 items in 2 hours, the DSpace 6.4-SNAPSHOT code does it in 3 and a half hours. I hope someone can look into this because it seems we will be using Hibernate for many years to come in DSpace 6 and 7.

To re-produce these in your own environment make sure you have the perf-map-agent¹ and FlameGraph² repositories cloned, and then the process is roughly:

$ cd perf-map-agent
$ cmake .
$ make
$ ./bin/create-links-in ~/.local/bin
$ export JAVA_HOME=/usr/lib/jvm/java-8-openjdk
$ export PERF_RECORD_SECONDS=60
# necessary for the Java process to preserve method calls
$ export JAVA_OPTS="-XX:+PreserveFramePointer"
$ time schedtool -D -e ~/dspace/bin/dspace index-discovery -b &
# process id of java indexing process (not Tomcat)
$ perf-java-record-stack 169639
$ sudo perf script -i /tmp/perf-169639.data > out.dspace510-1
$ cat out.dspace510-1 | ~/src/git/FlameGraph/stackcollapse-perf.pl | grep -E '^java' | ~/src/git/FlameGraph/flamegraph.pl --color=java --hash > out.dspace510-1.svg

Regards,

out.dspace510-3.svg
out.dspace64-3.svg
Reply all
Reply to author
Forward
0 new messages