Why does Curator do the entire run in a single transaction?


Mark H. Wood

Feb 11, 2022, 2:13:07 PM
to dspace...@googlegroups.com
The other day I had reason to run a curation task over our entire
repository. It found a large number of Items that needed
modification, and I watched as it got slower...and slower...and
s l o w e r ... until it ran out of memory and crashed, leaving no work
completed. I got a list of the Collections to be affected, and ran
the curator over each one separately, and the job was (eventually)
completed.

It seems to me that the proper unit of work for a curation run is not
the whole set of affected objects, but the task. We should be
committing work each time a task returns. I would expect that a
well-designed task can be re-run in the same scope without causing
problems.

Comments?

--
Mark H. Wood
Lead Technology Analyst

University Library
Indiana University - Purdue University Indianapolis
755 W. Michigan Street
Indianapolis, IN 46202
317-274-0749
www.ulib.iupui.edu

Kim Shepherd

Feb 12, 2022, 3:28:19 AM
to DSpace Developers
Hm, this does sound like a problem. I may not have noticed it myself, as I typically put Context commits in my performItem() implementation anyway.
Could it be as simple as doing a commit after calling performObject in distribute(), and maybe decaching the dso at the end of distribute()?

It would mean a lot of unnecessary commits, though (for objects that didn't actually have any changes made).
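
To make the idea concrete, here is a minimal sketch of a distribute()-style loop that commits only when a task reports a change and drops each object from the cache afterwards. Everything in it is a hypothetical stand-in: SketchContext, SketchTask, and their methods are not the DSpace classes, and the assumption that a task can report whether it changed anything is mine.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a database context; NOT the real DSpace Context.
class SketchContext {
    private final List<String> pending = new ArrayList<>();
    final List<String> committed = new ArrayList<>();
    int commitCount = 0;
    int cached = 0;
    int maxCached = 0;

    // Pretend to load an object into the session cache.
    void load(String itemId) { cached++; maxCached = Math.max(maxCached, cached); }
    // Record an uncommitted change.
    void modify(String itemId) { pending.add(itemId); }
    // Make all pending changes durable.
    void commit() { committed.addAll(pending); pending.clear(); commitCount++; }
    // Drop an object from the cache so memory use stays flat.
    void uncache(String itemId) { cached--; }
}

// Hypothetical task interface (an assumption, not the DSpace API).
interface SketchTask {
    // Returns true only if the task actually changed the item.
    boolean perform(SketchContext ctx, String itemId);
}

class CommitOnChangeSketch {
    static void distribute(SketchContext ctx, List<String> itemIds, SketchTask task) {
        for (String id : itemIds) {
            ctx.load(id);
            boolean changed = task.perform(ctx, id);
            if (changed) {
                ctx.commit();   // skip the commit for untouched objects
            }
            ctx.uncache(id);    // decache at the end of each object
        }
    }

    public static void main(String[] args) {
        SketchContext ctx = new SketchContext();
        List<String> items = List.of("item-1", "item-2", "item-3", "item-4");
        // Toy task: only even-numbered items "need modification".
        SketchTask task = (c, id) -> {
            boolean needsFix = Integer.parseInt(id.substring(5)) % 2 == 0;
            if (needsFix) {
                c.modify(id);
            }
            return needsFix;
        };
        distribute(ctx, items, task);
        // prints "commits: 2, max cached: 1"
        System.out.println("commits: " + ctx.commitCount + ", max cached: " + ctx.maxCached);
    }
}
```

Whether real tasks can reliably report "changed" is an open question; if they can't, committing after every object is the simpler fallback.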

0CCB D957 0C35 F5C1 497E CDCF FC4B ABA3 2A1A FAEC




Alan Orth

May 11, 2022, 2:17:01 AM
to Kim Shepherd, DSpace Developers
Hey,

I'm not an expert on curation tasks. The docs have a few options that might help here:

-l limit: maximum number of objects in Context cache. If absent, unlimited objects may be added.
-s scope: declare a scope for database transactions. Scope must be: (1) 'open' (default value) (2) 'curation' or (3) 'object'


I run a curation task over our entire repository (~96,000 items) every night like this:

$ dspace curate -t countrycodetagger -i all -s object

If I recall correctly, the scope parameter helped when I originally set this up a few years ago.

Regards,




Kim Shepherd

May 11, 2022, 2:23:42 AM
to Alan Orth, DSpace Developers
Ah! Yes, this is exactly it.
Thinking back to the *why* of Mark's question: is it possible that there's a case where, if something fails on the 96,000th item, I want the entire curation run rolled back? And that's why it's the default scope?
But perhaps we should ask around and rethink whether that is the most commonly preferred default.
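
The trade-off can be shown with a toy model. This is only a sketch under assumptions: ScopeSketch and survivors() are hypothetical, the two modes stand for "one transaction for the whole run" and "commit after each object", and the exact behaviour of each real -s value should be checked against the docs rather than read off this code.

```java
import java.util.ArrayList;
import java.util.List;

class ScopeSketch {
    /**
     * Returns the ids whose changes are durable after the run.
     * failAt is the 0-based index at which processing fails, or -1 for no failure.
     */
    static List<String> survivors(List<String> ids, boolean commitPerObject, int failAt) {
        List<String> committed = new ArrayList<>();
        List<String> pending = new ArrayList<>();
        for (int i = 0; i < ids.size(); i++) {
            if (i == failAt) {
                // Failure: anything still pending is rolled back.
                return committed;
            }
            pending.add(ids.get(i));
            if (commitPerObject) {
                committed.addAll(pending);
                pending.clear();
            }
        }
        // Run finished cleanly: a single commit at the end covers whatever is pending.
        committed.addAll(pending);
        return committed;
    }

    public static void main(String[] args) {
        List<String> ids = List.of("a", "b", "c", "d");
        // Failure on the last item, whole run in one transaction: prints "[]"
        System.out.println(survivors(ids, false, 3));
        // Same failure with a commit after each object: prints "[a, b, c]"
        System.out.println(survivors(ids, true, 3));
    }
}
```

All-or-nothing is the right answer only when a half-applied run is worse than no run at all; for a task that can safely be re-run in the same scope, keeping the completed work seems the more useful outcome.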

0CCB D957 0C35 F5C1 497E CDCF FC4B ABA3 2A1A FAEC

