Options to improve overall ingest timeframes


er...@infogetics.com

Oct 21, 2015, 12:13:40 PM
to islandora
Greetings,

First, an overview of where we are now; then a question about how to speed up ingestion times.

We have had a successful first few months with Islandora internally and are beginning the process of mass ingesting objects. We have several terabytes of available material, although it doesn't all need to be loaded before our February go-live date. Our primary content models are large images, books, and newspapers. Our video content is hosted on Vimeo, so it does not present a significant ingest burden.

Using Python scripting and Drush batch ingest commands, we have implemented fully automated hot folders where our staff drop documents as they are scanned.
Our scripts then collate these files to create the folder structures Drush is expecting. After "collation" the files are associated with a custom ingestion manifest and FTP'd to our Islandora server, which has several TB of attached SSD storage (Amazon AWS). On the Islandora server, a cron job runs and ingests the uploaded objects according to the instructions embedded in the manifest (the manifest specifies parent PIDs, cmodel, etc.).
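
For context, the per-manifest processing boils down to drush calls roughly like the standard Islandora Batch pair below (the Drupal root, target path, and parent PID are placeholders, and exact option names depend on the islandora_batch / islandora_book_batch versions in use):

# preprocess one collated book directory into the batch queue, then ingest the queue
drush -u 1 --root=/var/www/drupal islandora_book_batch_preprocess --type=directory --target=/mnt/ssd/loadingdock/book_0001 --parent=islandora:bookCollection
drush -u 1 --root=/var/www/drupal islandora_batch_ingest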

We have timed the process at approx. 40 minutes per gigabyte. There are two bottlenecks we are seeing.

The first bottleneck: roughly 40% of the process is spent uploading to AWS. We can address this via several methods out of scope for this group (i.e., we can mail Amazon a hard drive if we are in a real hurry).

The second bottleneck is the actual ingestion via the Drush batch ingest command, which creates derivatives and performs the Fedora ingestion. We are seeing approx. 50% (20 min./GB) of our processing time in this area.

Here are my thoughts so far on how to speed things up. Can someone with more insight weigh in with suggestions?

1. I have observed that Islandora only allows a single ingest process to run at a time (i.e., an ingest lock in the Drupal semaphore table). So running Islandora on a beefier machine and doing multiple ingests doesn't seem to be an option. I also doubt that increasing the processing power (number of cores) will help much either. Am I correct in these assumptions?

2. Would creating time-consuming derivatives such as TIFF to JPG, TN creation, etc. prior to running the ingest command be feasible or time-saving (something like the sketch after this list)? The only advantage here may be that our local batch image processing tools may be more efficient...plus we have several workstations that could run at the same time.

3. If we spun up several clones of our Islandora box and ingested part of our collections into each one simultaneously, could the results be merged onto a single box by copying files and rebuilding Fedora? When I rebuilt the Fedora repo and indexes in the past I ended up damaging things, so I am pretty wary of my ignorance here. My guess is that we would end up with PID collisions if we didn't take extra precautions.

4. If we can't speed up the process, is there an established way to ingest to a single master "ingestion" box and then copy the results to a live production machine at night, so we don't drag down our main box during months and months of full-time ingestion?
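
To make item 2 concrete, the sort of pre-generation I have in mind is something like the ImageMagick sketch below (the sizes are illustrative, and the DSID-style output filenames are my assumption about what the batch tooling would pick up):

# pre-generate a thumbnail and an access JPG from the archival TIFF
convert page_0001.tif -thumbnail 200x200 TN.jpg
convert page_0001.tif -resize 1000x1000\> -quality 85 JPG.jpg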

Thanks in advance for any ideas.

Eric Koester
Andrews University
Berrien Springs, MI




Diego Pino

Oct 22, 2015, 10:14:28 AM
to islandora
Hi Eric,

Some ideas (just ideas, based on my own and other users' experience):

First: RAM. Derivative generation and ingestion of binaries are memory-consuming. The more finely tuned your Java environment is, the more speed you will get. @Brad Spry has deep knowledge of this.
Second: logging. Generating logs is good for debugging and understanding what is happening, but once you have everything tuned, working, and tested, my experience is that overly fine-grained logging for Fedora, GSearch, Solr, and Catalina also adds ingestion time.
Third: if you disable GSearch (even ActiveMQ, if needed) during massive ingestion, re-enable it afterwards, and do the reindexing manually (see the sketch below), you also gain speed. The same goes for derivatives; it's a good idea to do them offline.
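
To illustrate the manual reindexing step, something like the following (just a sketch; host, port, and credentials are placeholders, and the exact GSearch REST parameters depend on your gsearch configuration):

# ask GSearch to rebuild its index from the FOXML files once it is re-enabled
curl -u fedoraAdmin:secret "http://localhost:8080/fedoragsearch/rest?operation=updateIndex&action=fromFoxmlFiles"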

But there are also other options here:

a.1) You can batch ingest only the metadata first, then put together a script that completes the binary datastreams (keeping track of the PIDs) using a Fedora client (look at a.2 for ideas, and at the sketch after this list).
a.2) @Giancarlo Birello has some good info on batch ingesting (using tools external to Islandora): http://dev.digibess.it/doku.php?id=ingesting:ingbscript. They have a lot of books and they do derivatives outside Islandora.
a.3) They also have a Taverna workflow.
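
For a.1, the completion script can talk to Fedora's API-M (REST) directly, roughly like this with curl (host, credentials, PID, and file are placeholders; check the Fedora 3 REST API docs for the exact addDatastream parameters you need):

# attach the managed OBJ datastream to an object that was ingested metadata-only
curl -u fedoraAdmin:secret -X POST -H "Content-Type: image/tiff" --data-binary @/path/to/scan_0001.tif "http://localhost:8080/fedora/objects/demo:123/datastreams/OBJ?controlGroup=M&dsLabel=OBJ&mimeType=image/tiff"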

b) Fedora allows read-only replication. This is very useful because you can have a master that receives the ingests and some "clone" slaves that (using a journaling system) serve read-only requests to the outside world. Since the slaves get all the ActiveMQ messages, they also do GSearch indexing. https://wiki.duraspace.org/display/FEDORA37/Replication+and+Mirroring

c) You can also, easily but time-consumingly, rebuild a parallel Fedora server using only the object store (Akubra or the legacy store) by shutting down one Fedora, copying that folder to another Fedora, rebuilding, and starting it up (roughly as sketched below). You can also copy any ActiveMQ messages still waiting to be processed if you want, but I think in your case b) is more optimal.
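
A rough sketch of c) (paths and the Tomcat service name are placeholders; fedora-rebuild.sh will prompt you to choose which rebuilder to run, and if you have managed datastreams you need the datastreamStore as well):

# stop the source Fedora, copy its stores to the new box, rebuild there, start up
service tomcat6 stop
rsync -a /usr/local/fedora/data/objectStore/ newbox:/usr/local/fedora/data/objectStore/
rsync -a /usr/local/fedora/data/datastreamStore/ newbox:/usr/local/fedora/data/datastreamStore/
# then, on the new box:
$FEDORA_HOME/server/bin/fedora-rebuild.sh
service tomcat6 start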

Another way, the way we do things, is to have multiple repos acting as "one to the public" by sharing a common Solr collection using, e.g., SolrCloud. That way you can split your work across different servers and at least expose global search via a common index.

Lastly, but very important: it's a good idea to take Fedora 4 and Islandora 2 into consideration. Fedora 4 resolves a lot of the issues around distributed scenarios and concurrent ingesting, and @Daniel Lamb has come up with some very interesting implementations based on Camel and also directly in PHP (Chullo) to manage these problems. We are at a development stage where use cases and, of course, involvement (developers from the community are very much needed) are a must, so I encourage you to get involved.

Best

Diego

Brad Spry

Oct 22, 2015, 1:52:05 PM
to islandora
Eric,

Please tell us more about your server instance, number of CPU cores, RAM, etc.  

Feel free to share your JAVA_OPTS, just the performance related ones, like memory and garbage collector configurations.

Also, do you utilize OCR in your drush batch ingest workflow?

One performance tip: if you place your object upload location in close proximity to Drupal's temp and Fedora's temp directories, ingest file operations can happen on the same drive instead of having to copy files across the system bus between multiple drives (see the sketch below).
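
A minimal sketch of what that co-location could look like (the mount point is a placeholder, and whether your Fedora stack actually uses java.io.tmpdir for its scratch files is something to verify on your install):

# point Drupal's temporary directory at the same SSD volume as the uploads
drush vset file_temporary_path /mnt/ssd/tmp/drupal
# and point the JVM's temp dir there too, e.g. in Tomcat's JAVA_OPTS
export JAVA_OPTS="$JAVA_OPTS -Djava.io.tmpdir=/mnt/ssd/tmp/java"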

This particular issue has a noticeable effect on derivative generation performance:
https://jira.duraspace.org/browse/ISLANDORA-1143

...but I'm shifting my hope to Islandora 2.0 and Fedora 4.0 for ultimately resolving that issue.   Every Islandorian shares the same desire for the very best ingest and derivative generation performance! 

There are also some very advanced Islandora implementations, like Diego mentioned, which background and offload derivative generation processes.  You can read more about the characteristics of such a configuration here:
https://github.com/islandora-interest-groups/Islandora-Preservation-Interest-Group/tree/master/background_services_discussion_paper

To your question of creating a fleet of Islandora boxes for simultaneous Fedora ingest, that is an intriguing possibility...  If each system could utilize the same MySQL and same filesystem, it sounds feasible; it certainly inspires curiosity :-)    On AWS, one can use RDS for centralized MySQL and EFS for a true shared filesystem, but EFS is still in preview mode and not released for production, YET.    S3 is not appropriate for Fedora's objectStore and resourceIndex, this much I learned the hard way.    But an autoscaling fleet of ingest servers has a definite appeal, for sure :-)

It is absolutely possible to have a single master "ingestion" box and then copy the results to a live production server at night; that pretty much describes my current implementation. I have such a strategy with a built-in safety mechanism, which only allows a full sync (the Tomcat side) to happen if NO ingest or BagIt writing operations are detected:

#!/bin/bash
# Count running drush processes (grep itself may show up, hence the 0-or-1 check below)
drush_ready=$(ps aux 2>/dev/null | grep drush 2>/dev/null | wc -l)
# Count open write/update file handles on the loading dock volume
loadingdock_ready=$(/usr/bin/lsof /mnt/island1-loadingdock | grep -e "[[:digit:]]\+[wu]\{1\}" | wc -l)

if (( drush_ready == 0 || drush_ready == 1 && loadingdock_ready == 0 ))
then
    # full sync goes here
    :
fi


I came up with a "heartbeat" style strategy to communicate to the receiving system exactly what is about to happen. If a full sync is detected, the receiving system will shut down Tomcat in anticipation of full synchronization. After the full sync is complete, the receiving system rebuilds its Fedora Resource Index and starts itself up. It can be done!
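
The heartbeat can be as simple as a marker file the ingest box writes before it starts syncing; here is an illustrative sketch of the receiving side (the marker path, service name, and rebuild step are simplified, not my exact implementation):

# receiving box: if a full sync has been announced, stop Tomcat and wait for it
if [ -f /mnt/sync/FULL_SYNC_PENDING ]; then
    service tomcat6 stop
    # ...after the sync lands: rebuild the Resource Index, then start back up
    $FEDORA_HOME/server/bin/fedora-rebuild.sh
    service tomcat6 start
fi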


Sincerely,

Brad Spry
UNC Charlotte








er...@infogetics.com

Oct 22, 2015, 5:01:35 PM
to islandora
Thanks, both of you, for some excellent info. To address Diego's final point about Fedora 4/Islandora 2.x, we are planning on upgrading within the next 18 months, assuming the pieces are in place by then. This is partially why I wish to stay as close to out-of-the-box as possible with our current version of Islandora.

Here are answers to Brad's questions.

Instance type: M3 Large (2 CPUs / 7.5 GB). According to glances, during ingest the CPU maxes out at 100% and memory stays below 50%.

JAVA_OPTS: -Xms3739m -Xmx3739m -XX:MaxPermSize=512m -XX:+CMSClassUnloadingEnabled

OCR: We are including an OCR.asc object in our ingest with the intent that it replace the need to create OCR on the fly...but I just realized I am not specifying the --do_not_generate_ocr option on the drush preprocess command.
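
For reference, the corrected preprocess call would look something like this (target and parent are placeholders, and I still need to confirm the option against drush help on our install):

drush -u 1 islandora_book_batch_preprocess --type=directory --target=/mnt/ssd/loadingdock/book_0001 --parent=islandora:bookCollection --do_not_generate_ocr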

We are uploading to one volume and ingesting to another. I also just double-checked and realized we are ingesting to a low-performance magnetic drive on our dev box, where I did the performance measuring. I'm excited to see what things look like if I upgrade the volume to SSD and ensure the Drupal and Fedora temp folders map to the same drive.

Based on this discussion so far, we will move forward in two phases. First, the simple stuff: address the disk I/O issue Brad mentions, ensure that our pre-generated hOCR files are being ingested instead of requiring Islandora to OCR everything, and make sure we have enough allocated RAM. Second, Diego's master-slave suggestion (item b in his list, which Brad expands on) appears to be the most promising strategy to let us move the ingest load off of our production box.

As far as the "fleet of ingest" servers goes, I assume one would have to circumvent the ingest lock that appears in the Drupal semaphore table and address whatever concerns the locking mechanism was intended to prevent in the first place. Or is this not a concern? If each Islandora ingest box were pointed to its own local Drupal db and a shared Fedora MySQL db, would the lock matter? I have more ignorance than knowledge, so I may be spouting nonsense here...

Anyway, thanks again. I'll keep an eye out for more suggestions and work on the high points we have identified here so far.

Eric Koester
Andrews University

Diego Pino

Oct 22, 2015, 5:22:16 PM
to islandora
Great to hear that, Eric, and thank you, Brad, for sharing your experience on this (happy community). About the last point, the "fleet of ingest": the Fedora MySQL table does nothing there. Islandora has no interaction with that table at all (it's used internally by Fedora, and we only interact with Fedora through Tuque, which ultimately means sending/retrieving things via Fedora's API-M and API-A), so you only have to take care of Drupal's database to manage the batches. That said, look at this post:

Drush can be run concurrently (user scenario: Drupal multisite); my only concern is how Fedora will react to those calls (never done that before!), which in the end has nothing to do with Islandora/Drupal, because we don't store the final data on that side.

Cheers!

Brad Spry

Oct 22, 2015, 9:26:59 PM
to islandora
Eric,

To aid in further understanding and tuning, there are some ways to gather more information about what's going on inside the JVM:

The following JAVA_OPTS enable two very useful logs:
-Xloggc:/var/log/islandora/java/java_gc.log -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:ErrorFile=/var/log/islandora/java/java_error%p.log
...adjust the log path to a valid path on your system.

The java_gc.log file reveals Java garbage collection details.   Tail java_gc.log and watch it in real time, beginning with a freshly booted system state, and then run some ingests.  Observe after the ingests are completed too.

If java_gc.log reports "Full GC" nonstop, then garbage collection is probably a major contributor to your CPU chew...
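
For example (the log path matches the JAVA_OPTS above; adjust it for your system):

# watch garbage collection activity live
tail -f /var/log/islandora/java/java_gc.log
# count Full GC events after an ingest run
grep -c "Full GC" /var/log/islandora/java/java_gc.log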

The second log, the JVM error log, will reveal bigger, deeper issues, if they happen.

Speaking of logs, have you looked through Fedora's, Tomcat's, and PHP's logs for any errors, like out of memory errors, timeouts, etc?   With a whole lot of logging going on, one of them will surely reveal some tidbit of information...

More tools for your JVM toolbox:

The 'jps' command reveals JVM PIDs.   Output:

jps
5097 Bootstrap
19446 Jps

...you most probably want the PID for "Bootstrap".

Armed with the JVM PID, you can use it with the 'jmap' command to peer into the heap:

jmap -heap PID

Output:

Attaching to process ID 5097, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 20.45-b01

using thread-local object allocation.
Parallel GC with 8 thread(s)

Heap Configuration:
   MinHeapFreeRatio = 40
   MaxHeapFreeRatio = 70
   MaxHeapSize      = 37580963840 (35840.0MB)
   NewSize          = 21474836480 (20480.0MB)
   MaxNewSize       = 21474836480 (20480.0MB)
   OldSize          = 5439488 (5.1875MB)
   NewRatio         = 2
   SurvivorRatio    = 8
   PermSize         = 524288000 (500.0MB)
   MaxPermSize      = 524288000 (500.0MB)

Heap Usage:
PS Young Generation
Eden Space:
   capacity = 17074094080 (16283.125MB)
   used     = 8800584608 (8392.891510009766MB)
   free     = 8273509472 (7890.233489990234MB)
   51.543493709037826% used
From Space:
   capacity = 2200371200 (2098.4375MB)
   used     = 2200172576 (2098.248077392578MB)
   free     = 198624 (0.189422607421875MB)
   99.99097315943783% used
To Space:
   capacity = 2200371200 (2098.4375MB)
   used     = 0 (0.0MB)
   free     = 2200371200 (2098.4375MB)
   0.0% used
PS Old Generation
   capacity = 16106127360 (15360.0MB)
   used     = 14981246312 (14287.229835510254MB)
   free     = 1124881048 (1072.770164489746MB)
   93.01581924160321% used
PS Perm Generation
   capacity = 524288000 (500.0MB)
   used     = 116911768 (111.4957504272461MB)
   free     = 407376232 (388.5042495727539MB)
   22.29915008544922% used


Watch it before, during, and after ingests, to gain a lot of understanding that can be applied to JVM tuning.

Here's my tuning o' the day for a 60GB RAM system:
-Xmn20g -Xms35g -Xmx35g -XX:PermSize=500m -XX:MaxPermSize=500m -XX:+DisableExplicitGC -XX:+UseParallelGC -XX:+UseParallelOldGC

For a 7.5GB system with 2 CPUs, I'd start experimenting with a tuning something like this:
-Xmn2500m -Xms4g -Xmx4g -XX:PermSize=200m -XX:MaxPermSize=200m -XX:+DisableExplicitGC -XX:+UseParallelGC -XX:+UseParallelOldGC

-Xmn defines the Young generation.  I'm observing the system doing most of its heavy lifting in the Young generation, so I'm starting to give it more and more, and it's working better and better.

I'm also using the Parallel garbage collector, which gives you one thread of garbage collection per processor.  You have two processors, so I encourage you to experiment with the parallel collector.  The winning garbage collector will be revealed by the GC log.   Repeating PSYoungGen collections are all right; repeating Full GC collections are not...  The garbage collector type with the lowest number of Full GCs during real-world system operation wins.

<B

er...@infogetics.com

Oct 23, 2015, 12:24:33 PM
to islandora
Thanks for the seriously detailed info here. I'll go over everything next week and see where we stand and report back.

Have a great weekend!

Eric K

Nov 2, 2015, 2:42:55 PM
to islandora
An update here.

By moving my ingest "intake" folder, where I lay down the incoming files, onto the same drive as Fedora, and by making that drive an SSD like I *thought* it already was, we are seeing significantly improved ingest timeframes. I haven't had a chance to time it accurately yet, but completion is under 5 minutes vs. the 15 minutes I was seeing previously for this sample.

Eric

Brad Spry

Nov 3, 2015, 5:21:04 PM
to islandora
Eric,

So glad things are moving in a positive direction for you!

I'm writing to mention the Rules support built into Islandora, which requires the Rules module:
https://www.drupal.org/project/rules

You can utilize Rules to send an e-mail when objects and derivatives are successfully ingested.   Timestamped e-mails could help shed some light on what and especially when things are happening behind the scenes.


Brad



