Batch ingest clarification...??

Brandon Weigel

May 10, 2017, 2:16:33 PM
to islandora
I've got some massive zip files (200+ GB) that I'd really rather not have to upload to the server before starting the whole batch process, duplicating files and doubling up on my server disk space.

Is it possible to run islandora_batch_scan_preprocess on files stored on my PC and let the preprocess function move them to where they need to be on the server? Or do the zip files have to be in a staging area on the server in order to run the process?

Mark Jordan

May 10, 2017, 2:25:11 PM
to isla...@googlegroups.com
Hi Brandon,

During our migration, we also had massive amounts of files that we wanted to avoid duplicating or even copying. We did this by mounting the Windows filespace where input files came out of the sausage grinder (MIK), ready for QA and ingest, on the Linux Islandora server(s) and running drush batch ingests on the Linux servers that pointed to the shared/mounted files. So it is possible to do what you are asking if you can somehow share disk between your local machine and the server (and there are numerous ways of doing that, depending on your operating systems, storage infrastructure, etc.).

Mark

--
For more information about using this group, please read our Listserv Guidelines: http://islandora.ca/content/welcome-islandora-listserv
---
You received this message because you are subscribed to the Google Groups "islandora" group.
To unsubscribe from this group and stop receiving emails from it, send an email to islandora+...@googlegroups.com.
Visit this group at https://groups.google.com/group/islandora.
To view this discussion on the web visit https://groups.google.com/d/msgid/islandora/1ca90bc2-b9d6-466e-9290-3ed51ce18e2e%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

dp...@metro.org

May 10, 2017, 2:42:36 PM
to islandora

Not sure if this is relevant, but:
If you move that zip file to your server, make sure you can actually unzip 200 GB on the server. Traditionally, the "unzip" binary will only process the first few gigabytes (< 4 GB) and skip all the rest.
What I have been using, without relying on other contributed tools, is the "jar" binary (not meant for that, but a Java jar file is really just a shiny zip file).
So you run:

jar xf yourfile.zip 
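
To check whether the unzip on a given server can handle archives past the 4 GB mark, one option is to look for Zip64 support in its build options. This is only a minimal sketch, assuming Info-ZIP's unzip, which lists ZIP64_SUPPORT in the output of "unzip -v" when it was compiled with it. The zip64_status function name is made up for illustration.

```shell
#!/bin/sh
# Report whether the local unzip build advertises Zip64 support.
# Prints exactly one of: "zip64", "no-zip64", "no-unzip".
# Assumption: Info-ZIP unzip, which prints its compile options for "unzip -v".
zip64_status() {
  if ! command -v unzip >/dev/null 2>&1; then
    echo "no-unzip"
  elif unzip -v 2>/dev/null | grep -q 'ZIP64_SUPPORT'; then
    echo "zip64"
  else
    echo "no-zip64"
  fi
}

zip64_status
```

If this prints "no-zip64" or "no-unzip", a different extractor such as the jar command above is probably the safer choice for very large archives.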



Diego

Jordan Dukart

May 10, 2017, 2:47:37 PM
to isla...@googlegroups.com
The batch ingest processes typically use the ZIP stream wrapper, which allows the ZIP to be opened read-only without having to extract it (https://github.com/Islandora/islandora_batch/blob/7.x/includes/islandora_scan_batch.inc#L109). Not sure that really helps in this case for Brandon.

Jordan

Brandon Weigel

May 10, 2017, 4:40:51 PM
to islandora
It would be nice if Batch would be able to fetch zips and directories from external paths rather than only its internal filesystem. Would it be a difficult change to allow an HTTP target?

dp...@metro.org

May 10, 2017, 6:26:33 PM
to islandora
If that change were made, batch would still need to fetch the ZIP file to your server under the hood and copy it to a place Drupal can access. As for directories, well, a network mount already does that.

Jordan Dukart

May 11, 2017, 7:51:35 AM
to isla...@googlegroups.com
Yeah, what Diego said. Sometimes when testing internally we'll use https://github.com/libfuse/sshfs if we have a large amount of data that would take a long time to move off an FTP server. Keep in mind that read/write speed will then be limited by your pipe (network bandwidth).
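
For anyone who hasn't used sshfs before, a temporary mount looks roughly like the sketch below. sshfs mounts a directory from a machine you can SSH into onto the machine where you run the command, so the machine holding the files has to accept SSH logins from the Islandora server. The host, paths, and the run helper are all placeholders, not values from this thread, and the script only prints the commands unless RUN=1 is set.

```shell
#!/bin/sh
# Placeholder values (assumptions); substitute your own.
REMOTE="${REMOTE:-user@files.example.org:/data/ingest}"  # machine holding the files
MOUNT_POINT="${MOUNT_POINT:-/mnt/ingest}"                # empty directory on the Islandora server

# Dry run by default: print each command. Set RUN=1 to execute for real.
run() {
  if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "$*"; fi
}

run mkdir -p "$MOUNT_POINT"
# Mount read-only so nothing on the server can modify the source files;
# "reconnect" asks sshfs to re-establish the SSH session if it drops.
run sshfs "$REMOTE" "$MOUNT_POINT" -o ro,reconnect
# ... work with the files under $MOUNT_POINT as if they were local ...
# Unmount when finished (Linux; other platforms typically use umount).
run fusermount -u "$MOUNT_POINT"
```

Every read still travels over SSH, so a long ingest from a mount like this runs at network speed rather than disk speed.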

Rosemary Le Faive

May 11, 2017, 11:13:32 AM
to islandora
UPEI does it the way Mark and Diego suggested: with the files on network-attached storage, mounting the storage volume and using the --type=directory option. 
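
A directory ingest against a mounted volume might look something like the sketch below. The option names are as best recalled from the islandora_batch README, where Drush 7 and above use --scan_target and Drush 6 and below use --target, so please check "drush help islandora_batch_scan_preprocess" on your own install. The site URI, source path, content model, and parent collection are example values only, and the script prints the commands unless RUN=1 is set.

```shell
#!/bin/sh
# Example values (assumptions); substitute your own.
SITE_URI="${SITE_URI:-http://localhost}"
SOURCE_DIR="${SOURCE_DIR:-/mnt/ingest/batch1}"   # path on the mounted volume
CONTENT_MODEL="islandora:sp_large_image_cmodel"
PARENT="islandora:sp_large_image_collection"

# Dry run by default: print each command. Set RUN=1 to execute for real.
run() {
  if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "$*"; fi
}

# Step 1: scan the directory and queue its objects for ingest.
# On Drush 6 and below the README uses --target instead of --scan_target.
run drush -v -u 1 --uri="$SITE_URI" islandora_batch_scan_preprocess \
  --type=directory \
  --scan_target="$SOURCE_DIR" \
  --content_models="$CONTENT_MODEL" \
  --parent="$PARENT"

# Step 2: ingest whatever the preprocess step queued.
run drush -v -u 1 --uri="$SITE_URI" islandora_batch_ingest
```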

Brandon Weigel

May 11, 2017, 2:55:00 PM
to islandora
Any way to do something like that when your servers aren't local, though? I'm not sure how I would mount my external HD to my AWS-hosted site.

Jordan: documentation on sshfs isn't perfectly clear... so would this allow me to temporarily mount my volume, work with it in SSH as if it were local to the site, and disconnect afterward?