Harvest into 'Published' and then automatically perform curation

Jay van Schyndel

Oct 18, 2012, 1:50:21 AM
to redbo...@googlegroups.com
Hi Everyone,

I am writing to ask for some advice on curation.

At JCU, I have built a harvester that loads data into the 'Investigation' stage.
Currently in ReDBox, there are 6 stages.

The data I am harvesting is almost the same for every record, except for the bird species. Currently there are 1800 records.
Manually pushing 1800 records to 'Published' in ReDBox would be highly time consuming and inefficient.

When I harvest the data, no additional fields require completion to push the record up to the 'Published' stage.

I would like to know how to harvest directly into the "Published" stage, so that curation automatically takes place with no user intervention.

What type of functionality, if any, goes on behind the scenes when I hit the submit button each time to progress to the next stage?
Is there something specific to each stage?

Thanks for your help,
 
       Jay.

Greg Pendlebury

Oct 18, 2012, 7:28:27 PM
to redbo...@googlegroups.com
Hi Jay,

Generally speaking there isn't much more than a metadata value being changed and a re-index triggered. This means that I think everything you are looking for could be accommodated in the harvest config and rules file:
  • The rules file already writes the workflow json to storage the first time it sees an object, so you could definitely put it into the published stage straight away by tweaking this.
  • The config file can also be told to run your curation plugins as standard transformers, simply by moving them from 'transformer' > 'curation' to 'transformer' > 'metadata'. This will run them every time the object is touched, as opposed to just during curation.
  • IIRC, all of the current curation plugins would be safe to run all the time, since they know not to actually regenerate the IDs; they'll just confirm their existence. It might be a tad inefficient against Handle, which will still trigger some external calls, but shouldn't be 'wrong'.
  • I think you could set 'curation' > 'alreadyCurated' to true and the curation manager should pretty much leave it alone. Both config tweaks are sketched just below.
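
For example, something like this in the harvest config (untested, and 'handle' is just a stand-in for whichever plugins you currently have under 'transformer' > 'curation'):

    "transformer": {
        "curation": [],
        "metadata": ["jsonVelocity", "handle"]
    },
    "curation": {
        "neverPublish": false,
        "alreadyCurated": true
    }
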
I think that's about what you want... but that's just the theory. Try it out and see.

Ta,
Greg

Jay van Schyndel

Oct 18, 2012, 7:39:28 PM
to redbo...@googlegroups.com
G'day Greg,

Thanks for the help again. :)
I'll start making these changes and report back.

Cheers,
              Jay.

Jay van Schyndel

Oct 22, 2012, 6:44:50 PM
to redbo...@googlegroups.com
Hi Greg,

Making the change to harvest directly into 'Published' was very easy.

The second part is proving tricky, and I believe an existing problem in my setup is the stumbling block.

The harvester I am using is a combination of the CSV and Filesystem harvesters.
The harvester creates a payload called "metadata.json", not a '.tfpackage' as required by ReDBox.

The format of the 'metadata.json' is different from that of the '.tfpackage'. To resolve the problem, I am using the harvester rules file (directoryNames.py) to create
a 'formData.tfpackage'. I had to do this as I could not edit my harvested data in ReDBox. I used 'rda_harvest.py', supplied with the demo ReDBox, as a code example.
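
In rough outline, the payload creation in the rules file looks something like this (a trimmed, untested sketch rather than the real JCU code: __createTfpackage is a hypothetical method name, and the 'species' field mapping is purely illustrative):

    from com.googlecode.fascinator.common import JsonSimple
    from com.googlecode.fascinator.common.storage import StorageUtils
    from org.apache.commons.io import IOUtils

    def __createTfpackage(self):
        # Read the harvested metadata.json payload from storage
        payload = self.object.getPayload("metadata.json")
        harvested = JsonSimple(payload.open())
        payload.close()

        # Map the harvested fields into the tfpackage layout. The real
        # mapping is much longer; 'species' -> 'dc:title' is illustrative.
        tfpackage = JsonSimple()
        tfpackage.getJsonObject().put("dc:title",
                harvested.getString("", ["species"]))

        # Write the result back as a new payload on the same object
        StorageUtils.createOrUpdatePayload(self.object, "formData.tfpackage",
                IOUtils.toInputStream(tfpackage.toString(True), "UTF-8"))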

When running a harvest, the first entry I see in the 'transactionManager.log' is from the Curation Manager, as shown below.

2012-10-23 09:27:50,467 transactionManager DEBUG  CurationManager     
{
    "harvester": {
        "type": "directory",
        "directory": {
            "targets": [
                {
                    "baseDir": "${fascinator.home}/data/Edgar",
                    "recordIDPrefix": "jcu.edu.au/tdh/collection/"
                }
            ]
        },
        "file-system": {
            "caching": "basic",
            "cacheId": "config-caching"
        },
        "default-files": {
            "default-metadata-filename": "edgar_default_metadata.json",
            "override-metadata-filename": "metadata.json"
        },
        "metadata-types": [
            {
                "type": "occurrences"
            },
            {
                "type": "suitability"
            }
        ]
    },
    "transformer": {
        "curation": [

        ],
        "metadata": [
            "jsonVelocity"
        ]
    },
    "curation": {
        "neverPublish": false,
        "alreadyCurated": false
    },
    "transformerOverrides": {

    },
    "indexer": {
        "script": {
            "type": "python",
            "rules": "directoryNames.py"
        },
        "params": {
            "repository.name": "ReDBox",
            "repository.type": "Metadata Registry"
        }
    },
    "stages": [
        {
            "name": "inbox",
            "label": "Inbox",
            "description": "Potential records for investigation.",
            "security": [
                "guest"
            ],
            "visibility": [
                "librarian",
                "reviewer",
                "admin"
            ]
        },
        {
            "name": "investigation",
            "label": "Investigation",
            "description": "Records under investigation.",
            "security": [
                "librarian",
                "reviewer",
                "admin"
            ],
            "visibility": [
                "librarian",
                "reviewer",
                "admin"
            ],
            "template": "workflows/inbox"
        },
        {
            "name": "metadata-review",
            "label": "Metadata Review",
            "description": "Records to be reviewed by a data librarian.",
            "security": [
                "librarian",
                "reviewer",
                "admin"
            ],
            "visibility": [
                "librarian",
                "reviewer",
                "admin"
            ],
            "template": "workflows/dataset"
        },
        {
            "name": "final-review",
            "label": "Final Review",
            "description": "Completed records ready for publication and approval into the repository.",
            "security": [
                "reviewer",
                "admin"
            ],
            "visibility": [
                "librarian",
                "reviewer",
                "admin"
            ],
            "template": "workflows/dataset"
        },
        {
            "name": "live",
            "description": "Records already published in the repository.",
            "label": "Published",
            "security": [
                "reviewer",
                "admin"
            ],
            "visibility": [
                "guest"
            ],
            "template": "workflows/dataset"
        },
        {
            "name": "retired",
            "description": "Records that have been retired.",
            "label": "Retired",
            "security": [
                "admin"
            ],
            "visibility": [
                "guest"
            ],
            "template": "workflows/dataset"
        }
    ],
    "oid": "5863d7f96f78a5799ebb6f19a06e324d"
}

Next is the following:
2012-10-23 09:27:50,580 transactionManager INFO   nVelocityTransformer Transforming PID '.tfpackage' from OID '5863d7f96f78a5799ebb6f19a06e324d'
2012-10-23 09:27:50,604 transactionManager ERROR  nVelocityTransformer Error accessing payload in storage: '{}'
com.googlecode.fascinator.api.storage.StorageException: ID '.tfpackage' does not exist.

The transformer is failing because I don't have a 'formData.tfpackage' yet. This is expected: the formData.tfpackage gets created when the Indexer runs the harvester rules file, which happens next.

2012-10-23 09:27:50,610 transactionManager ERROR  ManagerQueueConsumer Error processing order from Transaction Manager:
{
    "type": "TRANSFORMER",
    "target": "jsonVelocity",
    "oid": "5863d7f96f78a5799ebb6f19a06e324d",
    "config": {

    }
}
2012-10-23 09:27:50,623 transactionManager DEBUG  SolrIndexer          First time parsing config file: '27f9c7326edfc81e240c97cd937b31ec'
2012-10-23 09:27:50,630 transactionManager DEBUG  SolrIndexer          First time parsing rules script: '7bcd41bbd55eb97d1252887657fec59b'
2012-10-23 09:27:55,569 transactionManager INFO   SolrIndexer          Creating 'formData.tfpackage' payload for object '5863d7f96f78a5799ebb6f19a06e324d'
2012-10-23 09:28:01,850 transactionManager DEBUG  SolrIndexer          Indexing has altered metadata, closing object.


How can I get the harvester rules file, processed by the Indexer, to run before the Transformer?
I'm hoping that once this is done, curation may kick in properly. Is this a correct assumption?

My last-resort fix is to go back and revisit the harvester to ensure it creates a 'formData.tfpackage' instead of a 'metadata.json'. I would then also need to revisit my harvester rules file to alter the data mapping, as the formats of the '.tfpackage' and 'metadata.json' differ. I've been trying to avoid these changes, as they involve a large amount of rework.

Thanks,
              Jay.

Greg Pendlebury

Oct 23, 2012, 2:23:36 AM
to redbo...@googlegroups.com

Hi Jay,

Sorry, I'm in training most of this week, so I can only think of a couple of ideas off the top of my head:
1) You have nothing under 'transformerOverrides' and could possibly use that to specify that the transformer should use a different payload just for that data source. Mint does this if you are looking for something to reference.
2) If you need to have the transformers run again after the indexer, you could just have the rules file post the object back into the message queue for reprocessing. Have a look at the reindex button on the details screen for some Python code that would be a good starting point.

Ta,
Greg

==========
"Sent from Ye Olde mobile device. Forgive my fat fingers and their typos."

p: +61 2 6262 1228, m: 0403 674 810, e: greg.pe...@gmail.com


Jay van Schyndel

Oct 25, 2012, 10:30:01 PM
to redbo...@googlegroups.com
Hi Greg,

I had removed the entry under the 'transformerOverrides' node, but have put it back to prevent the exceptions.

At the end of my rules file, I have added some code to post a message to the queue:
        message = JsonObject()
        message.put("task", "reharvest")
        message.put("oid", oid)
        self.messaging.queueMessage(
                TransactionManagerQueueConsumer.LISTENER_ID,
                message.toString())

This is based on the logic from the Reindex button. The button executes the script reharvest.py (yes, it is the reharvest script; I double-checked it).
When I harvest, this doesn't quite work: the harvest process goes into a loop. I can tell by looking at the log. :(

Instead of posting a 'reharvest' message to the queue, I have tried to emulate what the 'Proceed' button does on the 'submit' tab when going from 'Final Review' to 'Published'.
To do this, I set my harvested data into the 'Final Review' stage and set up the following message.
        message = JsonObject()
        message.put("oid", oid)
        message.put("eventType", "NewStep : live")
        message.put("newStep", "live")
        message.put("username", "admin")
        message.put("context", "Workflow")
        message.put("task", "workflow")
        self.messaging.queueMessage(
                TransactionManagerQueueConsumer.LISTENER_ID,
                message.toString())

This also sends the harvester into a loop.

I think the problem is a timing one. I believe the indexer from the initial harvest is putting a 'clear-render-flag' message on the queue. Please correct me if I am wrong.
I suspect that when I tell the rules file to add a new message, this message gets executed before the 'clear-render-flag', thus sending the process into a loop.

I have looked at the logs for a normal harvest which was then manually pushed into 'Published' using the screens.
Please find attached the file "Harvest Example - normal" as an example that curates properly.
There is an exception generated during the curation process; it's on my list to resolve. :)

I have also attached the file "harvest and auto curate" to show you the logs looping for the 'reharvest' message.

I think the resolution to this problem is to ensure the 'clear-render-flag' is executed prior to submitting the message from the harvester rules file.

Any suggestions would be greatly appreciated. In the meantime, I'll continue my familiarisation with CurationManager.java.

Cheers,
             Jay.
Harvest Example - normal.rtf
harvest and auto curate.rtf

Greg Pendlebury

Oct 28, 2012, 8:00:01 PM
to redbo...@googlegroups.com
Sorry Jay, I'm a bit confused now:

>> "I had removed the entry under the 'transformerOverrides' but have put it back to prevent the exceptions."

My suggestion was to add something, since the config you posted on the 23rd had nothing under that node. In any case, the two options you've posted now both have 'sourcePayload' in them, and you should be able to try altering the value to something like '*.tfpackage' if you want it to read that file instead.
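
As a rough guess at the shape (untested; check how Mint nests it, since the exact structure may differ):

    "transformerOverrides": {
        "jsonVelocity": {
            "sourcePayload": "*.tfpackage"
        }
    }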

On the issue of the endless loop, perhaps you just need to move it to a part of the rules file that ensures it only ever gets run once; for example, immediately following the creation of the payload. It sounds like right now you have it somewhere that executes every time the object goes through the indexer, and of course you are requesting another trip to the indexer as a result.
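
Off the top of my head, something along these lines inside the rules file (untested; __createPayloadOnce, __createTfpackage and __sendMessage are just stand-ins for whatever your actual methods are called):

    from com.googlecode.fascinator.api.storage import StorageException

    def __createPayloadOnce(self):
        try:
            # The payload already exists, so this object has been through
            # the indexer before: don't queue another message.
            self.object.getPayload("formData.tfpackage")
            return
        except StorageException:
            pass
        # First pass only: create the payload, then immediately queue
        # exactly one message for the workflow transition.
        self.__createTfpackage()
        self.__sendMessage(self.oid, "live")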

The 'clear-render-flag' should not have any impact on data flow/processing. It is just an indicator for the UI to know that the tool chain is finished.

I notice all those stack traces at the end, which I am presuming are a result of you manually killing the server whilst it is stuck in that loop. It may or may not be a problem for you, but if the loop reoccurs after restart, you can manually destroy the AMQ datastore in the home directory to purge all enqueued messages.

Ta,
Greg


Jay van Schyndel

Oct 28, 2012, 11:35:25 PM
to redbo...@googlegroups.com
Hi Greg,

Thanks for your help with this.

I am now harvesting into Published and the curation is working.

I moved the message so it is posted immediately after the payload, formData.tfpackage, is created.
This stopped the looping. :)

Thanks again for your help on this one.

Cheers,
              Jay.

Tim O'Connor

Dec 17, 2012, 3:27:06 AM
to redbo...@googlegroups.com
Hi guys - could you mention the files you are modifying to make the harvested records move to Published, and the actual steps required? I'm struggling with this at the moment.

Jay van Schyndel

Dec 19, 2012, 12:59:53 AM
to redbo...@googlegroups.com
G'day Tim,

To achieve this I have added code to my Python script that gets run during the harvest process. Its name is directoryNames.py, found in redbox/home/harvest.
This code can be found in the JCU repo on GitHub: https://github.com/jcu-eresearch/TDH-Research-Data-Catalogue

There are two pieces of code that do the work.
In __workflow() I am currently setting initialStep = 3, but if you make it 4, it harvests straight into Published.
The "stages" are defined in directoryNames-Edgar.json, also found in redbox/home/harvest.
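
For reference, the step number is just a zero-based index into that "stages" array (in my config, at least):

    # Zero-based index into the "stages" array in directoryNames-Edgar.json:
    #   0 = inbox, 1 = investigation, 2 = metadata-review,
    #   3 = final-review, 4 = live (Published), 5 = retired
    initialStep = 4    # harvest straight into 'Published'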

The auto-curation is the tricky bit. :)
In __updateMetadataPayload(), at the end, you will find the commented-out line self.__sendMessage(self.oid, "live").
This creates the message that kicks off the curation process.

For the moment at JCU we are not harvesting straight into Published and auto-curating.
We have a large number of records to process, approx. 1800, and ReDBox and The Mint don't cope with them at the moment. The CPU and RAM usage just escalates, resulting in Out of Memory errors. I'm investigating this issue at the moment.

Cheers,
            Jay.

Tim O'Connor

Dec 20, 2012, 7:47:19 PM
to redbo...@googlegroups.com
Thanks a lot Jay - something for later on down the track :)