Why does Curator do the entire run in a single transaction?

Mark H. Wood

Feb 11, 2022, 2:13:07 PM

to dspace...@googlegroups.com

The other day I had reason to run a curation task over our entire
repository. It found a large number of Items that needed
modification, and I watched as it got slower...and slower...and
s l o w e r ... until it ran out of memory and crashed, leaving no work
completed. I got a list of the Collections to be affected, and ran
the curator over each one separately, and the job was (eventually)
completed.

It seems to me that the proper unit of work for a curation run is not
the whole set of affected objects, but the task. We should be
committing work each time a task returns. I would expect that a
well-designed task can be re-run in the same scope without causing
problems.

Comments?

--
Mark H. Wood
Lead Technology Analyst

University Library
Indiana University - Purdue University Indianapolis
755 W. Michigan Street
Indianapolis, IN 46202
317-274-0749
www.ulib.iupui.edu

Kim Shepherd

Feb 12, 2022, 3:28:19 AM

to DSpace Developers

Hm, this does sound like a problem. I may not have noticed it myself, as I typically put Context commits in my performItem() implementation anyway.

Could it be as simple as doing a commit after calling performObject in distribute(), and maybe decaching the dso at the end of distribute()?

It would mean a lot of unnecessary commits, though (for objects that didn't actually have changes made).
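
One way to avoid the unnecessary commits would be to commit only when a task reports that it changed something. Here is a minimal sketch of that idea, not the real Curator code: ToyContext, ToyTask and this distribute() are hypothetical stand-ins written for illustration, and the boolean "changed" return value is an assumption (the actual task interface may not expose it).

```java
import java.util.ArrayList;
import java.util.List;

/** Toy stand-in for a persistence context; NOT the DSpace Context class. */
class ToyContext {
    private final List<String> pending = new ArrayList<>();
    private final List<String> committed = new ArrayList<>();
    private int commitCount = 0;

    /** Records an uncommitted change. */
    void stage(String change) { pending.add(change); }

    boolean hasPending() { return !pending.isEmpty(); }

    /** Makes all pending changes durable and empties the pending list. */
    void commit() {
        committed.addAll(pending);
        pending.clear();
        commitCount++;
    }

    int commitCount() { return commitCount; }

    List<String> committed() { return committed; }
}

/** Hypothetical task: returns true only if it modified the object. */
interface ToyTask {
    boolean perform(ToyContext ctx, String objectId);
}

class CommitPerObjectSketch {
    /** Runs the task over every object, committing after each object that changed. */
    static void distribute(ToyContext ctx, List<String> objectIds, ToyTask task) {
        for (String id : objectIds) {
            boolean changed = task.perform(ctx, id);
            if (changed) {
                ctx.commit(); // pending work never grows beyond one object
            }
        }
    }

    public static void main(String[] args) {
        ToyContext ctx = new ToyContext();
        // Example task: only even-numbered objects need a change.
        ToyTask tagEven = (c, id) -> {
            if (Integer.parseInt(id) % 2 == 0) {
                c.stage("tag " + id);
                return true;
            }
            return false;
        };
        distribute(ctx, List.of("1", "2", "3", "4"), tagEven);
        System.out.println(ctx.commitCount() + " commits, " + ctx.committed());
    }
}
```

With four objects of which two change, this commits twice rather than four times, and the uncommitted state never holds more than one object's worth of work.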

0CCB D957 0C35 F5C1 497E CDCF FC4B ABA3 2A1A FAEC

--
All messages to this mailing list should adhere to the Code of Conduct: https://www.lyrasis.org/about/Pages/Code-of-Conduct.aspx
---
You received this message because you are subscribed to the Google Groups "DSpace Developers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to dspace-devel...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/dspace-devel/Yga1QDfIQ%2B0%2BFe5m%40IUPUI.Edu.

Alan Orth

May 11, 2022, 2:17:01 AM

to Kim Shepherd, DSpace Developers

Hey,

I'm not an expert on curation tasks. The docs have a few options that might help here:

-l limit: maximum number of objects in Context cache. If absent, unlimited objects may be added.

-s scope: declare a scope for database transactions. Scope must be: (1) 'open' (default value) (2) 'curation' or (3) 'object'

See: https://wiki.lyrasis.org/display/DSDOC6x/Curation+System

I run a curation task over our entire repository (~96,000 items) every night like this:

$ dspace curate -t countrycodetagger -i all -s object

If I recall correctly, the scope parameter helped when I originally set this up a few years ago.

Regards,

--

Alan Orth
alan...@gmail.com
https://picturingjordan.com
https://englishbulgaria.net
https://mjanja.ch

Kim Shepherd

May 11, 2022, 2:23:42 AM

to Alan Orth, DSpace Developers

Ah! Yes, this is exactly it.

Is it possible, thinking back to the *why* of Mark's question, that there's a case where if something fails on the 96000th item, I want the entire curation run rolled back? And that's why it's the default scope?

But perhaps we should ask around and rethink whether that is the default most people would actually want.
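
To make the trade-off concrete, here is a toy simulation of a run that fails partway through. It is only a sketch under stated assumptions: the WHOLE_RUN and PER_OBJECT names are hypothetical, loosely modelled on "one transaction for the entire run" versus "-s object", and it says nothing definitive about how the real Curator behaves on failure.

```java
import java.util.ArrayList;
import java.util.List;

class TxScopeSketch {
    /** Hypothetical scopes for illustration; not the Curator's own enum. */
    enum Scope { WHOLE_RUN, PER_OBJECT }

    /**
     * Simulates curating items 1..total where the task throws on item failAt.
     * Returns the item numbers whose changes were durably committed.
     */
    static List<Integer> run(Scope scope, int total, int failAt) {
        List<Integer> committed = new ArrayList<>();
        List<Integer> pending = new ArrayList<>();
        try {
            for (int item = 1; item <= total; item++) {
                if (item == failAt) {
                    throw new IllegalStateException("task failed on item " + item);
                }
                pending.add(item); // change made but not yet durable
                if (scope == Scope.PER_OBJECT) {
                    committed.addAll(pending); // commit after every object
                    pending.clear();
                }
            }
            committed.addAll(pending); // single commit at the end of the run
            pending.clear();
        } catch (IllegalStateException e) {
            pending.clear(); // roll back whatever was not yet committed
        }
        return committed;
    }

    public static void main(String[] args) {
        System.out.println("whole run:  " + run(Scope.WHOLE_RUN, 5, 5));
        System.out.println("per object: " + run(Scope.PER_OBJECT, 5, 5));
    }
}
```

With a failure on the last of five items, the whole-run variant keeps nothing while the per-object variant keeps items 1 through 4. Which of those is the better default depends on whether a partially curated repository is acceptable, which is the question to ask around about.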

0CCB D957 0C35 F5C1 497E CDCF FC4B ABA3 2A1A FAEC
