What are the biggest Islandora instances out there?


Mark Jordan

Jan 23, 2014, 2:41:29 PM
to isla...@googlegroups.com
Hi all,

I would like to ask the community how large their Islandora instances are, in terms of numbers of objects. We have over 1.3 million objects in our current (non-Islandora) repo platform and I would find it useful to be able to compare our scale with that of some operational Islandora instances.

I suspect that Fedora will be the bottleneck in the Islandora stack but would also be interested in hearing perspectives on that impression as well.

Thanks,

Mark

Mark Jordan
Head of Library Systems
W.A.C. Bennett Library, Simon Fraser University
Burnaby, British Columbia, V5A 1S6, Canada
Voice: 778.782.5753 / Fax: 778.782.3023 / Skype: mark.jordan50
mjo...@sfu.ca

Nick Ruest

Jan 23, 2014, 3:06:38 PM
to isla...@googlegroups.com
Hi Mark-

We're sitting at 19k objects now, and growing daily. Not quite 1.3
million :-)

In terms of setup, everything is on the same box. I wish it were not
that way, but it is our reality right now.

Local storage is our biggest bottleneck right now. If I had more, we'd
have more in the repo, since I have more than enough stuff in the
queue.

Let me know if you want/need more info.

-nruest

Mark Jordan

Jan 23, 2014, 3:15:31 PM
to isla...@googlegroups.com
Thanks Nick. Are you using a single box for the entire stack because you don't have additional boxes or for some other reason?

Mark

Nick Ruest

Jan 23, 2014, 3:18:15 PM
to isla...@googlegroups.com
Honestly, our local IT cannot support it. We had to go with a virtual
server and attached storage from our central campus IT, which is very pricey.

-nruest

Mark Jordan

Jan 23, 2014, 3:27:16 PM
to isla...@googlegroups.com
Number of VMs is not a problem for us since we host our own VMware infrastructure; storage capacity is our biggest challenge (however, I don't think one repo platform would use much more disk than another for the same content). If we needed to horizontally scale the Islandora stack, we could. In fact, we already have a dedicated Solr VM we use for our Drupal instances and miscellaneous content indexing.

Daniel Lamb

Jan 23, 2014, 3:30:59 PM
to isla...@googlegroups.com
Hey Mark,

I’m one of the team leads here at discoverygarden. I don’t often reply to questions on the list, but this is right up my alley. I’ve worked on projects that have millions of objects in their Fedoras.

Fedora will be your bottleneck before Drupal, MySQL, or anything else.

Storage is always an issue. Ensuring the disk space you’re filling is local, as opposed to on a network share, is crucial. If you are doing a migration, do it on the same box as Fedora! If Fedora is on a network share and the share is unreliable due to bad configuration or, well, just bad networking, you’re gonna be hurting when your migrations keep failing at random times.

Also, keep in mind that when the number of objects in Fedora gets huge, the triple store becomes bloated and slow. Typically when projects get this big, we lean heavily on Solr for speedy data retrieval. There are mechanisms in Islandora, particularly Solr views, that can help with this problem. It’s a pretty good solution, but keep in mind that Solr uses flat key/value pairs to model data, which may or may not jibe with your current data model.

~Danny

Nick Ruest

Jan 23, 2014, 3:31:22 PM
to isla...@googlegroups.com
You're at a significant advantage over us!

I should have mentioned this in my original reply, but hopefully Ed
and/or Robin from the CO Alliance will catch this. I *think* they have
similar object counts.

-nruest

Robin Dean

Jan 23, 2014, 4:02:21 PM
to isla...@googlegroups.com
Our members share a Fedora installation and a multisite Drupal setup.

Drupal, Fedora, MySQL, Solr, and microservices (ffmpeg, imagemagick, etc.) each have their own VMs, and we have a separate SAN for Fedora storage.

Our Fedora is currently at about 6TB (around 74,000 objects / ~600,000 datastreams). So not at the level that Mark J. is talking about, either!

Robin Dean
Director, Alliance Digital Repository
Colorado Alliance of Research Libraries

ro...@coalliance.org
http://adr.coalliance.org


Ed @CO Alliance of Research Libraries

Jan 23, 2014, 5:24:56 PM
to isla...@googlegroups.com
Some additional thoughts about our experiences:

A couple of years back we performance-tested our 4 gigabit Fibre Channel SAN against iSCSI on a dedicated 1 gigabit VLAN. We tried small, medium, and large files and went with iSCSI because of performance. We also tested different versions of NFS and found better performance with higher NFS major version numbers, especially between version 2 and version 3.

A pair of dual quad-core servers currently hosts the production VMs, mostly for performance but also for maintaining uptime with vMotion.

We migrated from Fedora 2.x to our production Islandora over the wire, with both the old and new servers on the same network switch. We used the migrate option of Fedora export instead of the archive option for two reasons: the export files were smaller, since the datastreams would be pulled across the network during ingest; and, when needed, programmatic and manual editing of the FOXML without the datastreams was easier on the humans.

Ed  

Giancarlo Birello

Jan 24, 2014, 3:58:02 AM
to isla...@googlegroups.com
Hi Mark,

our experience at www.digibess.it:
We have around 500k objects (books and pages) with a dual-server architecture:
- front-end server with Islandora
- back-end server with Fedora, Solr, ...
- SAN for storage (iSCSI)
More info here: dev.digibess.it.

At the moment, performance is the same as it was with few objects.

All machines are VM over KVM hypervisor.

Best,
Giancarlo

Donald Moses

Jan 24, 2014, 11:42:19 AM
to isla...@googlegroups.com
Hi Mark:
A timely question. I'm trying to sort out how to provide statistics for the annual CARL stats.

I always wonder how people come up with the numbers, and it'd be great to share the how. I've put my queries below; any feedback would be appreciated.

I like how Robin reported her numbers, so I will do the same.

Numbers:
707,774 digital objects
5,272,705 datastreams
occupying 9.3 TB of storage

For counting objects in a repository I use this SPARQL query.

SELECT ?obj
FROM <#ri>
WHERE {
  ?obj <info:fedora/fedora-system:def/model#hasModel> <info:fedora/fedora-system:FedoraObject-3.0> .
}


For counting datastreams in a repository I use this SPARQL query.

SELECT ?obj ?ds
FROM <#ri>
WHERE {
  ?obj <info:fedora/fedora-system:def/view#disseminates> ?ds .
}


For storage numbers I go to the Fedora data directory ... for me that is at /usr/local/fedora/data ... and I run du -sh *, then take the numbers for the objectStore and datastreamStore.
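
A minimal sketch of that check (the paths assume a default Fedora 3 layout):

cd /usr/local/fedora/data
du -sh objectStore datastreamStore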

Thanks,
Don

Mark Jordan

Jan 24, 2014, 12:33:47 PM
to isla...@googlegroups.com
Donald, thanks very much. Maybe someone should add an option to the main Islandora module to generate these numbers on demand... if there's interest I could do that and submit a pull request.

Anyway, thanks everybody for your responses. Sounds like starting off with a distributed architecture and making sure the Fedora server is adequately spec'ed, with fast storage, would be wise.

Mark


Giancarlo Birello

Jan 24, 2014, 12:45:54 PM
to isla...@googlegroups.com
Hi Don,
very very interesting ... I like stats!

For digibess I used a collection object with a custom query and collection_view (http://dev.digibess.it/doku.php?id=model:customcoll) to show stats about collection/book/page: http://www.digibess.it/fedora/repository/openbess%3Astat.

I like your query because it is more sysadmin-oriented, while my query is librarian-oriented.

Best,
Giancarlo

Donald Moses

Jan 26, 2014, 11:04:43 AM
to isla...@googlegroups.com
Giancarlo - grazie for sharing your ITQL queries and your COLLECTION_VIEW datastreams. With some changes for my local repo that is something I could use and get up and running quickly.

Mark ... there was a discussion thread about creating an "Islandora Dashboard" that might be of interest [1], and Peter MacDonald presented at OR2013 about getting metrics from Islandora [2], but I don't see any slides. Peter, if you have any code to share, that'd be great. If you are thinking of building something, perhaps the quickest route might be a port of Will's D6 islandora_reporter module [3], which stores, runs, and displays the results of SPARQL queries.

Thanks,
Donald

[1] https://groups.google.com/forum/#!searchin/islandora/$20Islandora$20Dashboard/islandora/n9wMai1aCEc/79TIJkmJg30J
[2] http://or2013.net/sessions/developing-metrics-dashboard-islandora
[3] https://github.com/discoverygarden/islandora_reporter

Nick Ruest

Jan 26, 2014, 11:19:54 AM
to isla...@googlegroups.com
Ed created a nifty tool [1] that grabs the file size of a given collection
object; it also provides object and datastream counts.

-nruest

[1] https://github.com/edf/fcrepo-reporting-utilities

Peter MacDonald

Jan 27, 2014, 9:07:54 AM
to isla...@googlegroups.com
I just now uploaded a PDF of my slides on iBrowse to the OR2013 site. iBrowse is a Drupal 7 module I wrote to browse RELS-EXT relationships. See

http://or2013.net/sessions/developing-metrics-dashboard-islandora

However, the module itself is not ready to distribute due to lack of effort on my part to fully test it, clean it up, and generalize it properly.

I will post it on Github when ready.

It does provide some basic metrics, such as how many times each relationship, object label, MIME type, and DC field occurs, with hyperlinks to all instances of them.

Sorry for the delays.

Peter


--
Library Information Systems Specialist
Hamilton College Library
Clinton, New York

Donald Moses

Jan 27, 2014, 1:54:34 PM
to isla...@googlegroups.com
Thanks for posting your slides, Peter. You've got some good ideas there. If you push your code up to GitHub you may get some additional contributors from the community to help out :)
Best,
Don

Nick Ruest

Feb 4, 2014, 1:31:23 PM
to isla...@googlegroups.com
Hi Peter-

I'm very interested in your module, and I think a few other folks in the
community are as well. I'm totally willing to help you test, clean it
up, and generalize it if you'd like. If that sounds amenable to you, let
me know, and we can take it from there.

cheers!

-nruest

Peter MacDonald

Feb 4, 2014, 2:03:44 PM
to isla...@googlegroups.com
Thanks Nick for your offer of help.

I'll begin working on iBrowse for Islandora again as soon as I can and post it for comments, but not right away, I'm afraid.

Peter



--
Library Information Systems Specialist
Hamilton College Library
Clinton, New York


Ed Fugikawa

Feb 26, 2014, 5:47:44 PM
to isla...@googlegroups.com

Using Don's SPARQL in a script:

74,483 objects
1,152,447 datastreams

and 5.4T according to du -sh

The script follows:


#!/bin/bash

# Change the following three lines for your local setup.
FedoraUser=YourUser
FedoraUserPassword=YourPassword
FedoraHostname="http://YourServer.YourDomain.org"

# May need to change depending on your local Fedora.
FedoraPort=":8080"
FedoraContext="/fedora"

FedoraServer="$FedoraHostname$FedoraPort$FedoraContext"

# The URL-encoded query strings below are Don's object- and datastream-counting
# SPARQL, POSTed to the resource index with format=count so only the count
# comes back.
resultObject=$(curl -s -u $FedoraUser:$FedoraUserPassword -X POST "$FedoraServer/risearch?type=tuples&lang=sparql&format=count&dt=on&query=SELECT%20%3Fobj%20FROM%20%3C%23ri%3E%20WHERE%20%7B%20%3Fobj%20%3Cinfo%3Afedora%2Ffedora-system%3Adef%2Fmodel%23hasModel%3E%20%3Cinfo%3Afedora%2Ffedora-system%3AFedoraObject-3.0%3E%20.%20%20%7D")
echo "$FedoraHostname - $resultObject objects"
resultDatastream=$(curl -s -u $FedoraUser:$FedoraUserPassword -X POST "$FedoraServer/risearch?type=tuples&lang=sparql&format=count&dt=on&query=SELECT%20%3Fobj%20%3Fds%20FROM%20%3C%23ri%3E%20WHERE%20%7B%3Fobj%20%3Cinfo%3Afedora%2Ffedora-system%3Adef%2Fview%23disseminates%3E%20%3Fds%20.%20%7D")
echo "$FedoraHostname - $resultDatastream datastreams"





Mark Jordan

Mar 16, 2014, 2:31:37 PM
to isla...@googlegroups.com, Kurt Bolko
Hi,

Thanks to everyone who responded to my question and in particular for the heads up on FedoraCommons being the part of the Islandora stack that would likely be the bottleneck.

We've performed some tests to see how FC would scale for us given the number of objects we currently provide access to (over 1.3 million). Over the last several weeks we have loaded more than 3 million objects into an instance of FedoraCommons in order to test its performance serving up the sorts of files that we would host in Islandora. We performed two types of tests: 1) average time retrieving datastreams as we loaded more objects and 2) time taken to retrieve a typical object 'web page' after completing the load. The content we loaded was issues from a local newspaper we digitized from microfilm. These issues contained only about 500,000 pages so we loaded duplicate copies to make up the 3,000,000+ objects. For each page, we loaded a thumbnail image, a full-size page image (running in the 1-2 MB range), and the equivalent of OCR text (not actual page-level OCR but some randomized text).

The tests:

1) Every time we completed a load (at the start of our tests, in groups of 30k or so objects, then when we realized how long getting to 3 million would take, in groups of 100,000 or so objects), we ran a benchmark script that calculated the time it took to retrieve 100 random datastreams from a random selection of page-level objects. The script calculated the average time to retrieve a datastream as reported by PHP's curl_getinfo. As we added more objects we increased the randomization pool to include all objects in Fedora, so every subsequent run of our benchmarking script was downloading 100 random datastreams from an increasing number of potential objects. We ran the benchmark script on the same server that Fedora was installed on to eliminate network latency, and retrieved the datastreams via Fedora's native REST interface.
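
For anyone who wants to run something similar, a rough bash equivalent of the benchmark might look like this (our actual script was PHP using curl_getinfo; the PID list, credentials, and datastream ID below are placeholders):

#!/bin/bash
# Time 100 random datastream downloads via Fedora 3's REST interface.
FEDORA="http://localhost:8080/fedora"
TOTAL=0
for i in $(seq 1 100); do
  PID=$(shuf -n 1 pids.txt)   # pids.txt holds one page-object PID per line
  T=$(curl -s -o /dev/null -w '%{time_total}' -u user:pass \
    "$FEDORA/objects/$PID/datastreams/OBJ/content")
  TOTAL=$(echo "$TOTAL + $T" | bc)
done
echo "average: $(echo "scale=3; $TOTAL / 100" | bc) seconds"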

The results, which are available at https://gist.github.com/mjordan/9575829, show on the left side of the comma the number of objects in our FC instance when we ran the benchmark script, and on the right side of the comma the average time (in seconds) to retrieve a datastream (calculated from 100 datastreams). Average datastream download times range from 0.07 seconds to 1.3 seconds, with an overall average time to download a random datastream of 0.296 seconds. The results are encouraging as they show no appreciable degradation of performance as the number of objects in FedoraCommons increased.

2) The second set of tests used the Apache 'ab' command to derive performance metrics for loading a 'web page' comprised of the DC, thumbnail, text, and full-size page image for a newspaper page object. To create the web page, we wrote a simple script using the Tuque API to get all the datastreams for a random newspaper-page-level object and render them together as a single HTML page.

We configured ab to download the page 1000 times with a concurrency of 1. An example (and typical) ab result is:

Percentage of the requests served within a certain time (ms)
50% 70
66% 71
75% 72
80% 72
90% 76
95% 83
98% 110
99% 113
100% 155 (longest request)

In this case, 95% of all requests for the page in question took less than 83 milliseconds to download, which is not bad considering that the data was derived from 1000 consecutive downloads of the page.
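
For reference, the ab invocation for a test like this looks roughly like the following (the URL is a placeholder for the page-rendering script):

ab -n 1000 -c 1 http://localhost/newspaper_page_test.php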

We're pretty happy with the results from both tests.

Mark



Nick Ruest

Mar 18, 2014, 1:35:16 PM
to isla...@googlegroups.com
This is fantastic work Mark!

I'm super curious how these tests would look compared against the same
setup with Varnish. I'm alluding to the work Paul Pound and company
spoke about on the committers call a few weeks ago.

-nruest

Mark Leggott

Mar 27, 2014, 7:48:45 AM
to isla...@googlegroups.com, Kurt Bolko
This is an awesome piece of work Mark, and with a critical aspect of the stack. Did you derive any metrics on what the ingest time would be for the 1.3 million? It would also be interesting to hear how long it would take with and without processing (i.e., how long to load datastreams (TN, OCR, etc.) that had already been created vs. not). Do you have other data stemming from this analysis?

This also raises the opportunity to create a "Testing Interest Group" under the context of the Interest Group framework the Islandora Foundation is working on. Metrics are an oft requested thing and we tend to have precious little to report back, so an effort to capture and record this in a single place would be very useful.

Kudos to you and Kurt for working on this.

Mark

Mark Jordan

Mar 27, 2014, 11:57:22 AM
to isla...@googlegroups.com, Kurt Bolko
Hi Mark,

----- Original Message -----
> This is an awesome piece of work Mark, and with a critical aspect of
> the stack. Did you derive any metrics on what the ingest time would
> be for the 1.3 million?

We didn't track this in calendar time because we didn't load all 3M in one batch - we ran each of the jobs of 100k or so objects manually (actually, we ran the loader script and then the benchmark script manually). However, looking at the load logs, I can see that load performance remained pretty consistent regardless of how many objects were already in FC, on average about 8 objects/second.

> It would be interesting to hear how long it
> would take without and with processing (i.e. how long to load
> datastreams (TN, OCR, etc.) that had already been created vs not) as
> well. Do you have other data stemming from this analysis?

Our test was for pre-derived datastreams (i.e., the JPEGs, OCR, and TNs were all ready to serve). Also, this test was really only intended to gauge Fedora's performance on our specific hardware, since we know that we won't be getting any faster storage arrays during the next few years. So we didn't test derivative creation or other components of the Islandora stack, although both the loader script and the 'web page' test used Tuque. We've been using Drupal and Solr long enough that we aren't too concerned about making them performant; since we've never implemented Fedora before, we needed to be sure that it wouldn't choke on us.

>
> This also raises the opportunity to create a "Testing Interest Group"
> under the context of the Interest Group framework the Islandora
> Foundation is working on. Metrics are an oft requested thing and we
> tend to have precious little to report back, so an effort to capture
> and record this in a single place would be very useful.
>

I'd be happy to share our benchmark script in case anyone wants to hack it for their own use. As you suggest, maybe one of the things that a Testing IG could do is determine some standard tests for the Islandora stack and maintain some scripts or other resources for implementers.

Mark

tcha...@smith.edu

Apr 20, 2018, 9:42:45 AM
to islandora
Hi Mark (Jordan),

In the spirit of this original thread, it would be great to hear about the volume of your Islandora repository now that time has passed. What's the number of objects in your production system? Did you end up having to use Blazegraph? Were other modifications needed? We are looking at doing an ingest of about a million pages of documents with OCR, much like the content used in the test you ran above.

For such large ingests do you go straight into Fedora, or do you use the Islandora batch ingest tools? I am aware of your MIK project, which seems to be designed for ingesting using the Islandora batch ingest tools, so I'm assuming the latter.

PS: We met a year or two ago at Islandora Camp in BC. Thank you again for your generous participation. I got so much out of it.

Many thanks,

Tristan Chambers
Digital Library Applications Administrator
Smith College, Northampton MA USA

Giancarlo Birello

Apr 20, 2018, 10:39:33 AM
to isla...@googlegroups.com, tcha...@smith.edu

Hi Tristan,

if my experience could be useful:

- we reached more than 1 million pages (OCR, HOCR, ...) on the production site (http://www.byterfly.eu)

- 800,000 pages came from the old Islandora 6.x site, the rest from new ingesting

- time for migration: 1 year

- it would have taken three or four times longer without pre-deriving, especially the HOCR and JP2 datastreams

- we used/are using book batch ingesting + custom bash scripts


In addition, I know Diego developed an interesting tool for batch ingesting.


Have a nice day,

Giancarlo

--
For more information about using this group, please read our Listserv Guidelines: http://islandora.ca/content/welcome-islandora-listserv
---
You received this message because you are subscribed to the Google Groups "islandora" group.
To unsubscribe from this group and stop receiving emails from it, send an email to islandora+...@googlegroups.com.

Mark Jordan

Apr 20, 2018, 4:14:58 PM
to isla...@googlegroups.com
Hi Tristan,

We have four separate Islandora repos, the largest being http://newspapers.lib.sfu.ca/, which has about 720,000 objects, all newspapers, issues, and pages. This is as standard an Islandora instance as you can get - default components (no Blazegraph), all running on one large VM. Search performance isn't stellar, but it's acceptable. With a few exceptions, each newspaper is in its own collection (a pattern we carried over from CONTENTdm), and we don't see any appreciable slowdowns from Fedora. I would say that any perceived slowness is from Solr.

As for ingesting the content, we did not ingest directly into Fedora; we ingested using the Islandora Newspaper Batch module, which we hired DGI to write for us in preparation for our migration. We used MIK (which we developed in house as a more general-purpose tool that we could use post-migration for preparing batch loads) to extract each newspaper from CONTENTdm and write out packages in the format described in the Newspaper Batch README. MIK has proven to be very useful to us both during our migration and after it, but as Giancarlo points out, you can assemble your ingest packages in a variety of ways. I would also recommend performing some scripted quality assurance checks on your ingest packages using a tool like https://github.com/mjordan/iipqa to ensure that they are ready to ingest. You don't want any surprises when you're running a days-long drush ingest job. We've distilled our experience migrating from CONTENTdm to Islandora into a wiki page on the MIK GitHub repo. That guide is tuned to MIK but contains information that you might find useful even if you prepare your packages using other means.

As Giancarlo also points out, the more derivatives you can pregenerate, the faster your ingest will be. At your scale, this will be very important. In particular, ingesting prederived page-level OCR will save you months (literally months) of calendar time. If you want HOCR, you'll likely need to have Islandora generate that on ingest, which will cancel out any gains you make by not having it generate OCR. We never had image highlighting in CONTENTdm, so we didn't think generating the HOCR on migration was worth it. Other people on the list might be able to advise on strategies for generating HOCR at scale outside of your target repo.

In addition to prederiving OCR, prederiving JP2 datastreams will cut your ingest time substantially.

Great to hear from you, I'm glad you enjoyed that Camp out here!

Mark

Mark Jordan

Apr 20, 2018, 4:20:52 PM
to isla...@googlegroups.com
Sorry, I forgot to mention: apparently https://github.com/discoverygarden/islandora_gsearcher can speed up ingest substantially as well. We use it now and it seems to work as advertised. We didn't use it during our initial migration, however.

Mark

Giancarlo Birello

Apr 20, 2018, 4:32:11 PM
to isla...@googlegroups.com, Mark Jordan
About OCR and HOCR.

At first we pre-derived both with tesseract + GNU parallel (which lets
tesseract use multiple cores).
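
A minimal sketch of that pattern (the paths are hypothetical; the hocr
config file ships with tesseract):

ls pages/*.tif | parallel 'tesseract {} {.} hocr'   # writes pages/NNN.hocr for each TIFF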

We have TIFFs and searchable PDFs of every book, and I found that
converting each single-page PDF to the DJVU format and then extracting
the HOCR is faster than deriving it with tesseract.
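
A sketch of that route, assuming pdf2djvu and the ocrodjvu tools are
installed (filenames hypothetical):

pdf2djvu -o page.djvu page.pdf     # carries the PDF's text layer into the DjVu
djvu2hocr page.djvu > page.hocr    # dumps the hidden text layer as hOCR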

Just another point of view.

Giancarlo

Mark Jordan

Apr 20, 2018, 4:43:07 PM
to isla...@googlegroups.com
Thanks for the info, Giancarlo; parallelizing tesseract to prederive HOCR sounds like an excellent strategy. Tristan, if you can prederive the OCR, HOCR, JP2, and TECHMD (FITS output) datastreams and ingest them using Newspaper Batch, you'll be in great shape.

Mark

Tristan Chambers

Apr 20, 2018, 5:24:02 PM
to isla...@googlegroups.com
Hi Mark, this is my current tack, so I'm glad to hear that it sounds like the way to go. I've written a couple of scripts to take care of this on a 40-core machine that the science department has granted us time on. I've done some test runs of a few hundred pages with very promising results. As Giancarlo mentions, Tesseract doesn't do multithreading on its own, so I added some multiprocess pooling to manage batches of concurrent processes.

Here are the scripts if anyone is interested: https://github.com/TristanSmithlib/is-newspaper-batch-derivatives

(I think that repo is a couple of commits behind at the time of writing, but I'll try to remember to do a push next time I'm on that machine.)

What I haven't tried generating yet is the TECHMD. How would I go about that?

Thanks,

Tristan

Mark Jordan

Apr 20, 2018, 5:29:45 PM
to isla...@googlegroups.com
We generated page-level TECHMD during the migration using an MIK script. It's at https://github.com/MarcusBarnes/mik/blob/master/extras/scripts/postwritehooks/generate_fits.php if you want to take a look. In your case, it might make sense to build similar functionality into https://github.com/TristanSmithlib/is-newspaper-batch-derivatives/blob/master/generate-derivatives.py, writing out the TECHMD.xml file into each page's output directory.
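
If you end up scripting it outside MIK, the FITS command line itself is straightforward (paths hypothetical):

fits.sh -i page_001/OBJ.tif -o page_001/TECHMD.xml   # one FITS report per page image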

Mark

dp...@metro.org

Apr 23, 2018, 1:35:16 PM
to islandora
Tristan, 

After you decide how to generate your data, you will hit the next bottleneck, which is DTD validation. You will probably see 95-99% of the time spent on DTD validation (or waiting for some external website to respond) against external sources on HOCR and XML ingests, when GSearch processes the digital-object-to-Solr-document transformation. I know that Giancarlo++ removes the HTML 4.0 DTD declarations for that; what we do here is host all the needed DTDs locally and fake gsearch into fetching them locally (simply a host configuration + Apache alias trick). Without this you can easily hit long pauses and even a denial of service from the W3C or LoC if you hit them 700,000 times!
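
A sketch of that local-mirror trick (the paths and web root are just one way to lay it out):

# mirror the HTML 4.0 DTDs (and the entity files they reference) locally
mkdir -p /var/www/html/TR/html4
cd /var/www/html/TR/html4
wget http://www.w3.org/TR/html4/loose.dtd http://www.w3.org/TR/html4/strict.dtd \
     http://www.w3.org/TR/html4/HTMLlat1.ent http://www.w3.org/TR/html4/HTMLsymbol.ent \
     http://www.w3.org/TR/html4/HTMLspecial.ent
# then, on the gsearch box, point www.w3.org at the local web server:
echo "127.0.0.1 www.w3.org" >> /etc/hosts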

We went from a few seconds per document on HOCR indexing (not talking about generating the OCR here, just the indexing) to 100 ms. The difference is weeks when indexing books/newspapers.
The PDF-to-DJVU workflow Giancarlo (++ again) describes is also very fast; you can get vendor/commercial OCR done on PDFs (like ABBYY OCR) into DJVU and then HOCR in less time than tesseract takes. But tesseract is pretty good anyway.

Note: We use our IMI [1] (Islandora Multi Importer) with spreadsheets (one row == one object) to generate complex ingest structures. It has the benefit of template-based XML generation and multi-source datastream definition, kind of a LEGO building system with a UI for Islandora. It can handle batches of a few 10,000 in a very reliable way and does not depend on specific custom structures to get many CModels done. Also, you can have all your data in Google spreadsheets if you want (or your OBJs in Dropbox and your OCR on a local volume... pretty flexible). I can put you in contact with other IMI users if you would like to know more.

Note 2: Blazegraph. If you are using normal/vanilla content models, you are probably averaging 8-10 predicates per object, but book pages can use more of course (RELS-INT also). Mulgara starts to degrade to unusable around the 8 million mark. A lot of current SPs use Solr for a lot of things, but Blazegraph can grow sustainably past 100 million triples, IF enough food (a.k.a. memory) and fast storage is given.

Note 3: we use a lot of caching. In front of everything there is always a Varnish cache (but other people have used Nginx with success also), so serving public access is no longer a platform/ecosystem issue, just memory/resources/$. And we get pretty impressive numbers running Apache Benchmark (like 5,000 concurrent users hitting the same 50-object listings with 250 ms full response times, including thumbnails) on 16 GB Amazon m5.xlarge machines.
 
Hope this helps a bit

Best

Diego Pino
Metro.org


Lucas van Schaik

May 28, 2018, 3:24:13 AM
to islandora
Hi,

Better late than never, but here are our numbers:
currently at 1,345,388 objects totalling 39 TB. We are still ingesting and plan to grow to 3 million objects and about 70 TB.
We have a 3-server solution: 1 back-end server for Fedora, Solr, and Blazegraph, plus 1 public server and 1 ingest server. Most derivatives are created on the ingest server.
We have hit a lot of boundaries but overcame them all; for example, we have books with more than 1,000 pages (we hit a database limit), we have collections with more than 100,000 objects (the normal collection tools give problems), and ingests of more than 5,000 objects gave problems, usually with generating derivatives (Gearman to the rescue).
We use the Islandora batch ingest modules to ingest. DGI (Discovery Garden) helped out a lot with performance tuning.
If there are any questions, don't hesitate to ask them. Oh, our repository is at:

regards,
Lucas
Developer/consultant
Leiden University Library

On Thursday, January 23, 2014 at 8:41:29 PM UTC+1, Mark Jordan wrote: