Fwd: RedBox and The Mint performance Improvements


Andrew Brazzatti

Jan 9, 2013, 11:47:50 PM
to ReDBox Developer List
Hi all,

Jay Van Schyndel at JCU has spent quite a bit of time looking at the CurationManagers in both ReDBox and Mint, and has added some tweaks to reduce the number of messages that get generated, thereby improving performance when you are harvesting large numbers of records.
 
He has created a pull request in GitHub for it to be incorporated into the core but we will be reviewing and testing it out first. I encourage any of you that may be interested in this change to help us out with some testing as well :)

Thanks Jay!

Andrew

---------- Forwarded message ----------
From: van Schyndel, Jay <jay.van...@jcu.edu.au>
Date: Thu, Jan 10, 2013 at 11:44 AM
Subject: RedBox and The Mint performance Improvements
To: Andrew Brazzatti <and...@redboxresearchdata.com.au>
Cc: Duncan Dickinson <d.dic...@qcif.edu.au>, "Brown, Marianne" <mariann...@jcu.edu.au>


Hi Andrew,

Happy New Year.

The good news is that I have been able to make the CurationManager for both The Mint and ReDBox perform well enough that we can load all of our data and automatically curate it.
It now harvests and fully curates about 1500 records in 1.5 hours.

Code changes were made to the following files:
    com.googlecode.fascinator.redbox.plugins.curation.redbox.CurationManager.java
    com.googlecode.fascinator.redbox.plugins.curation.mint.CurationManager.java

The code changes can be found in these repos; I have generated pull requests for them.


Initially I ran a compare between the Mint and ReDBox versions of the CurationManager just to see the differences.

I have made some changes to the Mint CurationManager to bring the code more up to date with the code in ReDBox. I didn't make any changes to the Mint-specific code. :)

The same changes to make curation perform were applied to both curation managers.
In curation() I added the following code:

        if (curated) {

            // Happy ending
            if (task.equals("curation-response")) {
                log.info("Confirmation of curated object '{}'.", oid);

                // Send out upstream responses to objects waiting
                JSONArray responses = data.writeArray("responses");
                for (Object thisResponse : responses) {
                    JsonSimple json = new JsonSimple((JsonObject) thisResponse);
                    String broker = json.getString(brokerUrl, "broker");
                    String responseOid = json.getString(null, "oid");
                    String responseTask = json.getString(null, "task");
                    JsonObject responseObj = createTask(response, broker,
                            responseOid, responseTask);
                    // Don't forget to tell them where it came from
                    String id = json.getString(null, "quoteId");
                    if (id != null) {
                        responseObj.put("originId", id);
                    }
                    responseObj.put("originOid", oid);
                    // If NLA Integration is enabled, use the NLA ID instead
                    if (nlaIntegrationEnabled && metadata.containsKey(nlaIdProperty)) {
                        responseObj.put("curatedPid", metadata.getProperty(nlaIdProperty));
                    } else {
                        responseObj.put("curatedPid", thisPid);
                    }
                }

                // JCU: now that the responses have been sent, remove them so they
                // are not sent again. Otherwise they just keep getting resent and
                // performance suffers greatly.
                responses.clear();
                saveObjectData(data, oid);

Why did I make this change?
Before starting the investigation, I was amazed at how many messages were being generated when only 6 records were being curated: over 10,000 messages in the TransactionManager. It seemed to me that the system was doing far too much work. After working through debug a few times I found out why there were so many messages.
When processing relationships between records, a 'response' is added to the metadata. This is so that when the OID has been curated, the CurationManager knows where to send a 'response' so that the related object can then complete its curation process.
E.g. Jeremy Vanderwal is related to all the bird records being processed. For the first relationship being processed, after Jeremy is curated, the CurationManager sends a 'response' message back to the bird record in ReDBox so it can complete its curation. All is good. For the second bird record being processed, Jeremy is already curated, but this time the CurationManager sends a response back to the first bird record and then the second bird record. The first bird record has already completed curation; there is no need to tell it again that Jeremy is curated. So, for the third bird record, the system checks that Jeremy is curated and then three responses are sent back. The first two responses are not required but it sends them anyway. This is a bug. The number of messages starts to snowball very quickly and the system just doesn't cope.
The code change I have added above deletes the 'response' for each relationship once it is sent, so it is not sent again.
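To give a feel for the scale of the snowball, here is a toy simulation (plain Java, not ReDBox code; the class and method names are made up for illustration) of how the message count grows when the stored 'responses' array is never cleared, versus cleared after each send:

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Toy simulation of curation-response message growth. Each newly
 * curated relationship registers one response; on every curation
 * confirmation, all stored responses are (re)sent.
 */
public class ResponseSnowball {

    static int messagesSent(int records, boolean clearAfterSend) {
        // Stands in for the "responses" JSON array held in storage.
        List<String> responses = new ArrayList<>();
        int sent = 0;
        for (int i = 1; i <= records; i++) {
            responses.add("response-for-record-" + i);
            sent += responses.size(); // every stored response is sent again
            if (clearAfterSend) {
                responses.clear();    // the fix: don't resend old responses
            }
        }
        return sent;
    }

    public static void main(String[] args) {
        // Without the fix: 1 + 2 + ... + 6 = 21 messages for 6 records.
        System.out.println("without clear: " + messagesSent(6, false));
        // With the fix: one message per record.
        System.out.println("with clear:    " + messagesSent(6, true));
    }
}
```

The quadratic growth in the un-cleared case is why a few thousand records overwhelm the queue while 6 records merely look noisy.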

This change made a big improvement, but I still couldn't harvest and curate all records at once. 

The next change was made in publishRelations().
A similar problem exists here: messages were being generated unnecessarily. When an object is published, it processes all the child relationships, sending 'publish' messages to the children. If the object is 'published' several times (e.g. Jeremy), publish messages were sent repeatedly to all of the groups Jeremy is associated with. They only need to be sent once, not repeatedly.
Since the 'relationship' data is used to generate the curated information shown when the admin logs in, I couldn't simply remove it as I did for the responses.
I have modified the code to add a new entry, publishMsgSent. If this is true, a publish message is not created for the relationship.

    private void publishRelations(JsonSimple response, String oid) throws TransactionException {
        log.debug("Publishing Children of '{}'", oid);

        JsonSimple data = getDataFromStorage(oid);
        if (data == null) {
            log.error("Error accessing item data! '{}'", oid);
            emailObjectLink(response, oid,
                    "An error occurred publishing the related objects for this"
                    + " record. Please check the system logs.");
            return;
        }

        boolean saveData = false;

        JSONArray relations = data.writeArray("relationships");
        for (Object relation : relations) {
            JsonSimple json = new JsonSimple((JsonObject) relation);
            String broker = json.getString(brokerUrl, "broker");
            boolean localRecord = broker.equals(brokerUrl);
            String relatedId = json.getString(null, "identifier");

            // We need to find OIDs to match IDs (only for local records)
            String relatedOid = json.getString(null, "oid");
            if (relatedOid == null && localRecord) {
                String identifier = json.getString(null, "identifier");
                if (identifier == null) {
                    log.error("NULL identifier provided!");
                }
                relatedOid = idToOid(identifier);
                if (relatedOid == null) {
                    log.error("Cannot resolve identifier: '{}'", identifier);
                }
            }

            boolean authority = json.getBoolean(false, "authority");
            if (authority) {
                // Is this relationship using a curated ID?
                boolean isCurated = json.getBoolean(false, "isCurated");
                // JCU: adding check for publishMsgSent
                boolean publishMsgSent = json.getBoolean(false, "publishMsgSent");
                if (isCurated && !publishMsgSent) {
                    log.debug(" * Publishing '{}'", relatedId);
                    // It is a local object
                    if (localRecord) {
                        createTask(response, relatedOid, "publish");
                    // Or remote
                    } else {
                        JsonObject task = createTask(response, broker, relatedOid,
                                "publish");
                        // We won't know OIDs for remote systems
                        task.remove("oid");
                        task.put("identifier", relatedId);
                    }

                    // JCU: Adding tag to indicate the publish message has been sent.
                    ((JsonObject) relation).put("publishMsgSent", "true");
                    saveData = true;

                } else if (publishMsgSent) {
                    log.debug(" * Ignoring already published relationship '{}'",
                            relatedId);
                } else {
                    log.debug(" * Ignoring non-curated relationship '{}'",
                            relatedId);
                }
            }
        }

        if (saveData) {
            // Updating the relations with publishMsgSent
            saveObjectData(data, oid);
        }
    }


If you have any questions, please ask away.
I do suspect there is more room for improvement in the performance side of things but debugging is extremely time consuming due to the number of messages that are generated for just a few records.

Cheers,
                Jay. 

Jay van Schyndel
Software Engineer - eResearch Centre
Division of Research and Innovation
James Cook University, Townsville QLD 4811 AUSTRALIA
P (07) 4781 3199; I +61 7 4781 3199
jay.van...@jcu.edu.au
www.jcu.edu.au
Location: Faculty of Science & Engineering Room 141 (Building 17.141)

Grant Jackson

Jan 11, 2013, 1:04:07 AM
to redbo...@googlegroups.com
Hi Jay, Andrew, Duncan & others,

Thanks for the contribution Jay. I haven't tested your code, but in addition to the improved performance I suspect it also results in a behavioural change.

Here is an example using v1.5.1 dev-handle.

 1. I have an already published collection & related activity (or related person, group or service).

 2. I change the (handle) urlTemplate in the activity json file. (I restart Mint & reharvest activities.)

 3. I deliberately make a trivial update to the published collection (e.g. add a space) & re-save the collection. The intention is to propagate the activity change to the collection's children. (I suspect there is a clever way to do this using messages.)

 4. The children of the collection (in this case the activity record) write their new urlTemplate to the corresponding handle record.


If I've understood your description correctly then I suspect the above behaviour will no longer work (because existing child records will not be asked to re-publish).

This example is to encourage discussion. I realise my example might be poor (e.g. it does not scale well to hundreds or thousands of records), but if there is a behavioural change then I wonder what the implications might be. Perhaps others with more ReDBox-Mint experience than me might have some ideas? Thanks.

Cheers, Grant



Jay van Schyndel

Jan 11, 2013, 1:23:04 AM
to redbo...@googlegroups.com
Hi Grant,

Your comments in the example below are correct. Due to the flag I have added for 'publish' messages, they will not be resent using your scenario below.
I used the flag in this case as the 'relationship' data is used to display the curation information when logged in as admin.
I couldn't just remove the 'relationship' data like I did for the 'response' due to this.

I haven't used ReDBox in the manner you describe below.
Is it a valid scenario?
Are other people using ReDBox this way?

Might need to revisit the fix. Thanks for the feedback.

Cheers,
               Jay.

Grant Jackson

Jan 11, 2013, 1:49:44 AM
to redbo...@googlegroups.com
Hi Jay,

It just happened that I was playing with urlTemplates today for all of my data sources so that handles can redirect somewhere other than back to ReDBox-Mint (which is to be put behind a firewall). So initially for me it was an accidental scenario!

I'm not a developer, so if in future I needed to get all *existing* collection, party, activity & service handles to redirect to say a new web server, I might use the example I supplied or update the handle.net records directly (or seek help).

Cheers, Grant



Greg Pendlebury

Jan 13, 2013, 4:13:36 PM
to redbo...@googlegroups.com
I suspect any post-publishing process would cause a problem. For example, when any new relationships have been updated and the RIF-CS is altered, the Collections would need to be resubmitted to Fedora for any Vital users. This may get caught up in the same logic.

I wonder if one alternative to avoid this snowball effect is to hold a cache of 'publish' messages and de-dupe them as they are added. Then periodically send the cached messages. By making the cache flush time configurable you could adjust it to match the expected volume of ingest for the data. This would allow you to ensure that publish events are still firing for records that are in a linked network that has undergone change, but you don't have to generate 1500 in a row (or something worse, like the factorial?) if the same guy is related to 1500 newly ingested records.
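The caching idea above could be sketched roughly as follows. This is a minimal sketch, not ReDBox code: the MessageSender interface is a made-up stand-in for the real message queue, and the configurable flush timer is left to the caller.

```java
import java.util.LinkedHashSet;
import java.util.Set;

/**
 * Sketch of a de-duplicating cache for 'publish' messages.
 * Queued OIDs are merged as they arrive; a periodic flush
 * (scheduling not shown) sends each unique publish exactly once.
 */
public class PublishMessageCache {

    /** Hypothetical sink for outgoing messages (not a real ReDBox API). */
    public interface MessageSender {
        void send(String oid, String task);
    }

    // LinkedHashSet keeps insertion order while silently dropping duplicates.
    private final Set<String> pendingOids = new LinkedHashSet<>();
    private final MessageSender sender;

    public PublishMessageCache(MessageSender sender) {
        this.sender = sender;
    }

    /** Queue a publish for this OID; repeated calls for the same OID merge. */
    public synchronized void queuePublish(String oid) {
        pendingOids.add(oid);
    }

    /** Called on a configurable timer; returns how many messages went out. */
    public synchronized int flush() {
        int sent = 0;
        for (String oid : pendingOids) {
            sender.send(oid, "publish");
            sent++;
        }
        pendingOids.clear();
        return sent;
    }
}
```

With this shape, queuing 'publish' for the same OID 1500 times between flushes still produces a single outgoing message, while records that genuinely changed within the window each get their one publish on the next flush.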

A similar approach may help with the response messages too, but they carry additional payload data, which makes caching de-dupes more complicated. They'd have to be merged into the cached message, I guess.

Ta,
Greg

Andrew Brazzatti

Jan 13, 2013, 6:02:15 PM
to ReDBox Developer List
Hi,

These are the sorts of scenarios I was hoping to explore while testing these changes out :) I was thinking a cache might be a better way to prevent flooding the message queues too, but I didn't consider the need to send the messages out on flush (which is obviously needed). I also wasn't sure of the nature of all the messages, but it certainly looks like at least the publish messages can be dealt with this way.

If there are any other scenarios people have that might be impacted please post them here :)

Andrew