Can the Dataset Json Harvester be used to harvest a dataset package (zip, tar, etc.)?


Lloyd Haris

Sep 10, 2014, 12:29:42 AM
to redbo...@googlegroups.com
I have a dataset in the form of a zip archive that I would like to ingest into ReDBox. Since this needs to be an automated process, I decided to look at the new Json Harvester for this task.

So I have the Json Harvester client polling a directory for json files; the following is a sample json file that I am using:

{
  "type": "DatasetJson",
  "harvesterId": "jsonHarvester",
  "data": {
    "data": [
      {
        "varMap": {
          "file.path": "${fascinator.home}/packages/<oid>.tfpackage"
        },
        "tfpackage": {
          "dc:created": "2014-09-08",
          "dc:title": "Test twenty two",
          "title": "Test 22",
          "metaList": [
            "dc:title"
          ],
          "redbox:newForm": "false",
          "redbox:formVersion": "1.7-SNAPSHOT",
          "repository_type": "Metadata Registry",
          "repository_name": "ReDBox",
          "redbox:submissionProcess.redbox:submitted": "null",
          "viewId": "default",
          "packageType": "dataset",
          "dc:identifier.dc:type.rdf:PlainLiteral": "handle",
          "dc:identifier.dc:type.skos:prefLabel": "HANDLE System Identifier"
        },
        "datasetId": "test twenty two",
        "owner": "admin",
        "attachmentDestination": {
          "tfpackage": [
            "<oid>.tfpackage",
            "metadata.json",
            "$file.path"
          ],
          "workflow.metadata": [
            "workflow.metadata"
          ]
        },
        "attachmentList": [
          "tfpackage",
          "workflow.metadata",
          "cratepackage"
        ],
        "customProperties": [
          "file.path"
        ],
        "workflow.metadata": {
          "id": "dataset",
          "formData": {
            "title": "",
            "description": ""
          },
          "pageTitle": "Metadata Record 22",
          "label": "Metadata Review 22",
          "step": "metadata-review"
        },
        "cratepackage": {
          "path": ""
        }
      }
    ]
  }
}

As you can see, there is an attachmentList array where you can specify your attachments. The attachments specified here are actually json objects, but my dataset is a zip archive consisting of several files (Word documents, images, etc.), and I wonder if I can attach it so that it will be ingested into ReDBox.

The attachments in the above json file are just json objects, and they look like metadata to me. So how do I attach the actual dataset, if that is possible? Does anyone have any ideas about this?

Thanks

Greg Pendlebury

Sep 10, 2014, 1:12:48 AM
to redbo...@googlegroups.com
I have some vague memories of an existing ZIP harvester in Fascinator core that does what you are looking for, or perhaps it was a transformer; I don't have the code on this PC, sorry. In any case, I suspect there isn't a complete pre-existing solution, but a couple of plugins combined come very close to what you want and would be a good starting point for a customisation.

Ta,
Greg



--
You received this message because you are subscribed to the Google Groups "ReDBox Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email to redbox-dev+...@googlegroups.com.
To post to this group, send email to redbo...@googlegroups.com.
To view this discussion on the web, visit https://groups.google.com/d/msgid/redbox-dev/0cddbab1-69b0-4b69-9aa0-4e67a5070ac5%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Lloyd Haris

Sep 10, 2014, 2:08:09 AM
to redbo...@googlegroups.com
Greg, I am actually writing a new plugin based on the Json Harvester, but I am somewhat stuck on creating a "harvester" for this. My idea is to send the zip package to a ReDBox queue as a byte stream and then harvest it.

Duncan, Peter Sefton, Andrew, Shilo and I were having a conversation about this, and based on their suggestions I wrote a polling client to poll for the zip files and harvest them using the Json Harvester. The reasoning was that a metadata json file is included in the package, so if I could map that json file to the format the ReDBox Json Harvester expects, I would be better off harvesting the package via the Json Harvester with the package attached as an attachment. Unfortunately, I can't find a way to do that with the current Json Harvester.

So instead of extracting the package on the client side, processing the json file and harvesting it as a json harvest, I thought of sending the package to a ReDBox queue and doing the processing and ingestion on the server side, because then the zip file would be accessible from the server. That of course means writing a harvester, so I decided to create one based on the Base Json Harvester. But it looks like you can't attach a file in the Base Json Harvester either; you can only create string payloads as attachments.

What I am wondering is: if the Json Harvester cannot attach datasets, what is the point of having a Json Harvester in the first place? Is it only to harvest datasets that are already in the form of json? In the initial discussions I was told that we would be able to attach files when doing json harvests, and I was working on that basis.

Cheers
Lloyd

Andrew Brazzatti

Sep 10, 2014, 2:25:12 AM
to ReDBox Developer List
Hi Lloyd,

The current JSON Harvester does not support attaching files to a dataset (or indeed any other record type) package. Its purpose is to receive a standardised json message and perform the orchestration required to create a package with the associated metadata. The Harvester client is a Spring Integration tool that can take input in a variety of formats from a variety of sources and transform it into the required message format. It is a replacement for the old alerts functionality, which was limited to harvesting xml (rif) and csv files from the file system and also could not handle attachments.

There are two reasons why it can't handle attaching (binary) files:
  1. Attaching files to dataset records puts the attachments in their own packages (rather than in the dataset package).
  2. Sending binary files over JMS is possible but can be pretty clunky, especially if the files are very large.
Of course we want to support this functionality, and we have some ideas on how to handle it. It may be best to have a discussion about this so we can help you meet your requirements.
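The clunkiness of sending binary files over JMS is easy to see in miniature: text-only message bodies usually force you to base64-encode the file, inflating it by roughly a third, and the whole message must then sit in memory on the broker and every consumer. A small Python sketch (the message fields here are illustrative only, not the real harvest message format):

```python
import base64
import json
import os

# Pretend this is a 1 MB zip archive to be shipped inside a JMS text message.
payload = os.urandom(1024 * 1024)

# Embed the binary payload in a json message body via base64 -- a common
# workaround when the transport only carries text. Field names are made up.
message = json.dumps({
    "type": "DatasetJson",
    "attachment": base64.b64encode(payload).decode("ascii"),
})

# base64 uses 4 output characters per 3 input bytes, so the message is
# about 1.33x the size of the raw file before any broker overhead.
overhead = len(message) / len(payload)
print(f"message is {overhead:.2f}x the size of the raw file")
```

For multi-gigabyte datasets that overhead, plus the buffering, is why a side channel (such as an HTTP upload) tends to be preferred over pushing the bytes through the broker.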

There is also some documentation available on a new site we are developing which may help you.

Thanks,
Andrew


Peter Sefton

Sep 10, 2014, 2:28:29 AM
to redbo...@googlegroups.com
Thanks Andrew,

Basically, we have a zip file in a known directory. The zip has some metadata in it that we know how to extract. What's the best way to get the zip file into RB with the metadata extracted and mapped, and the zip attached?


--

Peter Sefton +61410326955 p...@ptsefton.com http://ptsefton.com
Gmail, Twitter & Skype name: ptsefton

Andrew Brazzatti

Sep 10, 2014, 10:07:12 PM
to ReDBox Developer List
Hi Peter and Lloyd,

Ultimately, the plan in the RB 2 pipeline is to have a functional API that allows external systems (and a new portal) to communicate with the ReDBox core. For the time being there is an attachments script, used to handle attachment uploads from the form, that could be utilised by the harvester client to send these attachments across.

Thanks,
Andrew

Peter Sefton

Sep 11, 2014, 2:28:07 AM
to redbo...@googlegroups.com
Thanks Andrew. Having an API seems like a really good move; do you have any details on that, e.g. planning docs and when it might be delivered? Have you considered SWORD?



Shilo Banihit

Sep 23, 2014, 8:50:09 PM
to redbo...@googlegroups.com
Hi Lloyd,

As mentioned, pending the creation of an API, one possible way to attach binary data is through the attachment script that the forms use, after the successful processing of a harvest request. At a high level, this approach requires the following:

  1. When sending your JSON harvest request, ensure you set a top-level field named "harvesterId". For your project, you can make this unique to each harvest request.
  2. When the request is processed, event JSONs are published to the topic "jsonHarvester_event" (by default) within the same broker. The published information includes the "harvesterId" supplied with the request, whether the object was successfully harvested (under the field "event"), and the identifier of the newly created object.
  3. Use the object identifier obtained in the previous step when building the data for the attachment script.
The steps above can be implemented with Spring Integration, possibly using a combination of an inbound JMS adapter, a custom transformer and an HTTP outbound gateway.
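The filtering logic in steps 2 and 3 can be sketched independently of the messaging stack. This is a hypothetical Python version; the "success" value under "event" and the "oid" identifier field name are assumptions, so confirm them against a real event message before relying on them:

```python
import json

def extract_oid(event_text, expected_harvester_id):
    """Return the new object's identifier if this event belongs to our
    harvest request and the harvest succeeded, else None.

    Field names "event" and "oid" and the value "success" are guesses
    based on this thread, not a documented contract.
    """
    event = json.loads(event_text)
    if event.get("harvesterId") != expected_harvester_id:
        return None  # event from somebody else's harvest request
    if event.get("event") != "success":  # assumed success marker
        return None
    return event.get("oid")  # assumed identifier field

# Example event shaped as described above (field names hypothetical):
sample = json.dumps({
    "harvesterId": "jsonHarvester-req-42",
    "event": "success",
    "oid": "a1b2c3d4",
})
oid = extract_oid(sample, "jsonHarvester-req-42")
```

In a Spring Integration pipeline, this is roughly the job a custom transformer between the inbound JMS adapter and the HTTP outbound gateway would do.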

During development, you may also want to use ActiveMQBrowser or some other tool to inspect the broadcast event JSON.

Regards,
Shilo

Lloyd Haris

Sep 29, 2014, 11:25:56 PM
to redbo...@googlegroups.com
Hi Shilo,

I've been trying to follow the details you provided, but I have two questions that I am not clear on.

1. In order to better understand how the attachment script works, I tried to debug the new Fascinator. I built the-fascinator and the fascinator-portal, but when I try to start the Fascinator I get a "The plugin 'org.apache.maven.plugins:maven-jetty-plugin' does not exist or no valid version could be found" error. I cannot find maven-jetty-plugin in my local repository and wonder why it hasn't been downloaded from somewhere. Do you have any idea what has happened here?

2. Secondly, as I understand it, for scripts like download.py to be invoked, a user event needs to be triggered through the GUI. So how do I actually invoke the attachment script? It looks like the script expects some form data with an "attach-file" function value. How would I provide that?

Thanks
Lloyd

Shilo Banihit

Oct 6, 2014, 9:05:13 PM
to redbo...@googlegroups.com
Hi Lloyd,

Regarding your first issue, I personally haven't encountered that error before. It seems that Maven has given up searching for the artifact; perhaps tinkering with your Maven configuration will help.

For the attachment script, you can use the HTTP client libraries available in your chosen development stack. In Java, for example, you have options like Apache HttpComponents, Spring Integration's HTTP support and many others.
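As a rough sketch of what such a call might look like, here is a hand-rolled multipart/form-data POST using only the Python standard library. The endpoint URL, the "func=attach-file" field, the "oid" field and the file field name are all assumptions drawn from this thread, not the real attachment script interface, so verify them against the actual script:

```python
import urllib.request
import uuid

def build_multipart(fields, file_name, file_bytes):
    """Build a multipart/form-data body: one part per text field,
    plus one file part. Returns (body, content_type)."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; '
            f'name="{name}"\r\n\r\n{value}\r\n'.encode()
        )
    # "uploadFile" is a hypothetical field name for the attachment part.
    parts.append(
        f'--{boundary}\r\nContent-Disposition: form-data; '
        f'name="uploadFile"; filename="{file_name}"\r\n'
        f'Content-Type: application/zip\r\n\r\n'.encode()
    )
    parts.append(file_bytes)
    parts.append(f'\r\n--{boundary}--\r\n'.encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"

body, content_type = build_multipart(
    {"func": "attach-file",      # assumed function value from the thread
     "oid": "a1b2c3d4"},         # object id taken from the harvest event
    "dataset.zip", b"...zip bytes...")
req = urllib.request.Request(
    "http://localhost:9997/portal/default/attach",  # hypothetical URL
    data=body, headers={"Content-Type": content_type})
# urllib.request.urlopen(req)  # left commented: the endpoint is a guess
```

Any HTTP client (HttpComponents, a Spring Integration HTTP outbound gateway, curl) would produce an equivalent request; the point is only the shape of the form data.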

Shilo

Lloyd Haris

Oct 8, 2014, 12:06:48 AM
to redbo...@googlegroups.com
Hi Shilo,

I was able to find the cause of the first issue. There were actually a couple of problems.

After the portal source was separated from the-fascinator code base, it was renamed fascinator-portal. As a developer, I first cloned the-fascinator and then cloned the portal into the-fascinator directory; if you just clone it, it gets the name "fascinator-portal". Unfortunately, the tf.sh script has not been updated to reflect that: when you start the Fascinator, it still tries to change to the "portal" directory, which doesn't exist, hence the error I mentioned in the previous post.

cd $PROG_DIR/portal
nohup mvn -e $ARGS -P dev jetty:run &> $TF_HOME/logs/stdout.out &

The next problem is with the "FASCINATOR_HOME" directory: in the tf_env.sh script it is still TF_HOME, but in the Spring application context it is "FASCINATOR_HOME".

Once that's rectified, I now get a Spring autowire exception. It's obvious that it cannot autowire an implementation of GenericDao:

2014-10-08 12:19:57.979:INFO:/portal:Initializing Spring root WebApplicationContext
2014-10-08 12:20:08.212:WARN::Failed startup of context org.mortbay.jetty.plugin.Jetty6PluginWebAppContext@5b27f008{/portal,/opt/fascinator-clone/the-fascinator/fascinator-portal/src/main/webapp}
org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'hibernateAuthUserService': Injection of autowired dependencies failed; nested exception is org.springframework.beans.factory.BeanCreationException: Could not autowire field: private com.googlecode.fascinator.dao.GenericDao com.googlecode.fascinator.common.authentication.hibernate.HibernateUserService.hibernateAuthUserDao; nested exception is org.springframework.beans.factory.NoSuchBeanDefinitionException: No matching bean of type [com.googlecode.fascinator.dao.GenericDao] found for dependency: expected at least 1 bean which qualifies as autowire candidate for this dependency. Dependency annotations: {@org.springframework.beans.factory.annotation.Autowired(required=true)}

Looking at the applicationContext, I can see that there could be a few more configuration xmls in FASCINATOR_HOME, which I assume are to be provided by the user:

<import resource="file:${FASCINATOR_HOME}/applicationContext-*.xml" />

So my questions are: do I, as the user, need to provide the applicationContexts that are looked up there? I couldn't find any implementation of GenericDao; do I have to write one? Also, I am wondering if you have any documentation related to this.

Any suggestion to solve this would be much appreciated.

Thanks