Item Import time concerning

Stephen Brush

Apr 5, 2023, 1:23:48 PM
to DSpace Community
Hi,

We are looking at importing a large number of items as part of our launch (~200,000). Imports seemed slow from the start but were at least tolerable. As the repository has grown, import time has grown significantly along with it. I've tried different "batch" sizes to see if that had an impact, but the pattern still seems to be the same.

Currently, importing 100 items takes well over 1 hour. I should mention that the resources involved could be scaled up further -- but I assume they should be sufficient for the tasks this involves (with the possible exception of SOLR, as that's less familiar to me). Based on how fast SOLR indexes items using the "index-discovery" command, I can't see it being this slow here.

Is this a known or common problem? Is there anything others have done to speed this up?

To be clear, in this instance I am referring to Item Import via Simple Archive Format -- though I've noticed similar behaviour with the CSV import capabilities via the UI.

We are on v7.3 currently.

Thanks,

Steve

Shannon Kipphut-Smith

Apr 10, 2023, 4:11:29 PM
to DSpace Community
We're experiencing similar issues with the Simple Archive Format imports in DSpace 7. None of my imports on our test server have processed (even though Processes show that the imports are complete). I have tried with batches as large as 100 items and as small as 6 items.

Shannon

--
Shannon Kipphut-Smith
Scholarly Communications Liaison
Fondren Library, Rice University
sk...@rice.edu||(713)348-3989
Schedule a meeting or consultation: https://calendly.com/scholcomm
she|her

Tim Donohue

Apr 12, 2023, 1:27:17 PM
to DSpace Community
Hi all,

There were some major performance improvements to batch importing added in the 7.5 release. But they seem to have been more specific to CSV-based imports. We were not aware of performance issues with SAF (Simple Archive Format) imports (and I'm not seeing any bug tickets that are obviously related to that). That said, it would be worth re-testing on 7.5 if possible, simply because sometimes a performance fix for one feature may also fix performance issues of another.

I'd also recommend in this scenario providing more details about what exactly you are seeing, so that DSpace developers can try to reproduce the problem on our end (which makes it easier to find a quick solution). So, if you know it occurs even in small batches, it'd be good to have a sample batch or sample commands where you are seeing the problem (and whether it occurs just from the UI or also from the command line). Please feel free to create a ticket for this performance issue in https://github.com/DSpace/DSpace/issues and share what you've found.

Thanks,

Tim

Stephen Brush

Apr 19, 2023, 12:57:57 PM
to DSpace Community
Hi Tim,

I had a chance to come back to this issue recently. Adding relationships was definitely the most time consuming part of the Simple Archive Format import. After some logging and analysis I found two problem areas with respect to importing relationships:
  1. getEntityType() in ItemImportServiceImpl -> was making a DB call when the relevant information was already present on the Item. This was about 1s per call, but it is made several times per relationship added and really added up, as we have many relationships per item we are importing.
  2. update() in DSpaceObjectServiceImpl -> this was also called several times per Item and in some cases took only 200ms but in others up to 3s. Since I don't believe you can assign a Place in the Simple Archive Format import, I don't think this was doing anything of value, so I just commented it out for our purposes.
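
The idea behind the first fix can be illustrated with a minimal sketch. This is not the DSpace code: `Item`, `EntityTypeStore` and `get_entity_type` are hypothetical Python stand-ins, and the `dspace.entity.type` field name is an assumption. It only shows the pattern of reading a value already held on the object before falling back to a slow lookup.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Item:
    """Hypothetical stand-in for an item whose metadata is already in memory."""
    item_id: str
    metadata: Dict[str, str] = field(default_factory=dict)


class EntityTypeStore:
    """Hypothetical stand-in for the database; counts how often it is queried."""

    def __init__(self, rows: Dict[str, str]) -> None:
        self._rows = rows
        self.calls = 0

    def lookup(self, item_id: str) -> Optional[str]:
        self.calls += 1  # each call models one slow round trip
        return self._rows.get(item_id)


def get_entity_type(item: Item, store: EntityTypeStore) -> Optional[str]:
    """Return the entity type, querying the store only when the item lacks it."""
    # Assumed field name; check which field your DSpace version uses.
    label = item.metadata.get("dspace.entity.type")
    if label is not None:
        return label
    return store.lookup(item.item_id)


store = EntityTypeStore({"item-1": "Publication", "item-2": "Person"})
loaded = Item("item-1", {"dspace.entity.type": "Publication"})
bare = Item("item-2")

# Repeated calls for the loaded item never touch the store.
types = [get_entity_type(loaded, store) for _ in range(5)]
# The item without in-memory metadata falls back to one store lookup.
fallback = get_entity_type(bare, store)
print(types[0], fallback, store.calls)
```

With several calls per relationship and many relationships per item, skipping the round trip whenever the value is already loaded is where the saving would come from.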
These two tweaks took our import time from just over 1 minute per Item to just over 20s, though even 20s per Item is still far from ideal as we are looking at importing roughly 100,000 items. Without these import tweaks that is roughly 70 days of 24x7 import processing. With these tweaks it should be more like 24 days, but still a LONG time.

**Adding server/DB resources improved the times above by about 40%, so our current run time is estimated at 14 days.
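
As a rough check on the estimates above, the arithmetic can be sketched as follows. The per-item times are rounded down to exactly 60s and 20s (the measured figures were "just over" those), so the results come out slightly below the 70, 24 and 14 days quoted. The function and variable names are only illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def import_days(items: int, seconds_per_item: float) -> float:
    """Days of continuous (24x7) processing for a sequential import."""
    return items * seconds_per_item / SECONDS_PER_DAY


ITEMS = 100_000

before_tweaks = import_days(ITEMS, 60)      # about 1 minute per item
after_tweaks = import_days(ITEMS, 20)       # about 20s per item
after_scaling = after_tweaks * (1 - 0.40)   # ~40% faster with more server/DB resources

for label, days in [
    ("before tweaks", before_tweaks),
    ("after tweaks", after_tweaks),
    ("after adding resources", after_scaling),
]:
    print(f"{label}: {days:.1f} days")
```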

It looks like 40% of time is related to SOLR indexing at the end of the import process. Might it be more efficient to run the "index-discovery" command manually at the end of the entire migration process -- or would that likely take the same time regardless?

Steve

Tim Donohue

Apr 21, 2023, 10:58:36 AM
to DSpace Community
Hi Steve,

I'm not able to easily answer these questions as I don't have the full context (nor am I the expert on all the code in DSpace...I don't write much code these days. I'm more of a technical coordinator).

However, it sounds like you may have found a performance bug and *maybe* a solution?  If so, could you please create a ticket to describe the performance issues you see and possibly even send us a Pull Request with the fixes you noted can speed things up?  That way I can pass this along to other developers who *can help answer the questions*.   As you might expect with any open source project, things get better when people contribute fixes they've found.  So, if you can find time to send us more information via GitHub, it may help immensely... and it might be that you've stumbled on an undiscovered issue with this import process.

Here's where to submit a ticket (and PR): https://github.com/DSpace/DSpace/issues

Thanks in advance... if you aren't able to send a PR, just creating a ticket & linking us to the "two tweaks" you made to speed things up might be enough to get started.

Tim

Stephen Brush

Apr 26, 2023, 10:11:37 AM
to DSpace Community
One of the two fixes would be a solution. The other was just a band-aid but I will take a closer look to see if it could be done more elegantly with no impact.

Will send in a PR at some point.

Steve

Stephen Brush

Jul 5, 2023, 11:47:46 AM
to DSpace Community
Just an update to this -- we've found that, after some initial degradation as the repo grew, things have stabilized at a reasonable processing time EXCEPT for one relationship we are adding: Publication to Author. We have ~50,000 Author entities (Persons) and are matching off of a concatenated string of first/last name and a couple of IDs to generate a unique value.

This relationship takes ~10s per Author to set up (each lookup takes ~2s). The other relationships take ~1s or less as they are dealing with collections of entities which are far smaller.

Not sure if it's possible to speed this up at all from where we are at but thought I would share where things ended up.

Steve