Some questions about scaling issues

Darren Hardy

Dec 12, 2016, 1:53:28 PM
to openwayback-dev

We have a ~20TB (and growing) installation of cdx-server here at Stanford Library. We're running into some scaling problems that we'd like some feedback on.

  1. What is the best configuration for large (>100GB) CDX files? We're currently using a single CDX file for our instance and each time we want to add more content, we have to sort/merge the whole thing again. Is there another configuration that supports incremental indexing, like WatchedCDXSource?

  2. Does anyone have some rough performance characteristics for the CDX generation code (bin/cdx-indexer)? Is it CPU or IO intensive?

  3. What are other institutions using for their filesystem storage of WARC files? And, how are you able to grow that over time? We are limited in our options since our NetApp storage is shared by many stakeholders here. So, we're looking at having to deal with multiple NFS mounts.

Sawood Alam

Dec 12, 2016, 2:44:12 PM
to openway...@googlegroups.com
As far as point number one is concerned, I would ask: why are you forcing yourselves to use a single CDX file? OWB has supported a wildcard-like syntax for loading one or more CDX files per collection/endpoint for quite some time. It is certainly helpful to have fewer, bigger CDX files rather than a lot of small ones, but when filesystem or other limitations come into play, there is no harm in keeping several relatively large CDX files in a directory and loading them all for lookup.

Incremental merging is a fairly fast and efficient operation [linear, O(N+M)] if the incremental file is also sorted before merging and the -m flag is passed to the sort command to tell it that the input files are already sorted.
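
For illustration only, here is a minimal Python sketch of the same single linear pass that `sort -m` performs. It assumes both inputs are already sorted in byte order (LC_ALL=C), that any leading " CDX ..." header line has been stripped beforehand, and the file names are placeholders, not anything OpenWayback prescribes.

    # Minimal sketch of the linear O(N+M) merge that `sort -m` performs.
    # Assumes both inputs are already byte-order sorted (LC_ALL=C) and that
    # any leading " CDX ..." header line has been stripped beforehand.
    import heapq

    def merge_cdx(main_path, incremental_path, out_path):
        with open(main_path, "rb") as main, \
             open(incremental_path, "rb") as inc, \
             open(out_path, "wb") as out:
            # heapq.merge consumes both inputs lazily, so memory use stays
            # constant and the whole merge is a single sequential pass.
            for line in heapq.merge(main, inc):
                out.write(line)

    # Hypothetical file names:
    merge_cdx("index-main.cdx", "index-2016-12-12.cdx", "index-merged.cdx")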

I am not too sure about ZipNumCluster, but my vague understanding is that it can be used when CDX files grow beyond a certain size.

Best,

--
Sawood Alam
Department of Computer Science
Old Dominion University
Norfolk VA 23529



Darren Hardy

Dec 12, 2016, 4:13:58 PM
to openwayback-dev
So, you recommend we use the WatchedCDXSource for a collection of CDX files, rather than a single CDX file. Is there a practical limit to the number of CDX files the server can handle? We have dozens of collections. Also, we're concerned about the scalability of how the server reads these CDX files -- if we have a hundred, is that too many? My understanding is that the server does a binary search of the CDX file to locate the information it needs -- based on looking at the FlatFile class.

Thanks,
-Darren



Sawood Alam

Dec 12, 2016, 4:43:03 PM
to openway...@googlegroups.com
Based on my understanding, records are looked up in each CDX file using binary search. However, if you have a lot of CDX files to look up from, the file list is iterated over linearly and a binary search is then performed in each file. This is why fewer, larger CDX files are preferred over many small ones, but being religious about one gigantic CDX file per collection would be overkill in my opinion.

Another reason why reading from a lot of small CDX files is not recommended is that the server needs to hold references to all of those CDX files in memory. That might not be an issue initially, but if the number of CDX files grows very large, a fair amount of memory is needed just to keep the file references.
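
To make the cost model concrete, here is an illustrative Python sketch of this kind of lookup; it is not OpenWayback's actual FlatFile code, and the function name and the SURT prefix in the usage comment are made up for the example.

    # Illustrative sketch, not OpenWayback's FlatFile code: a seek-based
    # binary search for the first record matching a key prefix in one
    # byte-order-sorted CDX file. With many CDX files the server repeats
    # this once per file, so each extra file adds another O(log N) probe
    # and another open file reference.
    import os

    def cdx_lookup(cdx_path, key_prefix: bytes):
        with open(cdx_path, "rb") as f:
            lo, hi = 0, os.fstat(f.fileno()).st_size

            def line_at(pos):
                # Return (start, line) of the first full line at or after pos.
                if pos == 0:
                    f.seek(0)
                else:
                    f.seek(pos - 1)
                    f.readline()          # skip to the next line boundary
                return f.tell(), f.readline()

            # Binary search over byte offsets for the first record >= prefix.
            while lo < hi:
                mid = (lo + hi) // 2
                start, line = line_at(mid)
                if not line or line >= key_prefix:
                    hi = mid
                else:
                    lo = start + len(line)

            # Stream out every record that actually matches the prefix.
            f.seek(line_at(lo)[0])
            for record in f:
                if not record.startswith(key_prefix):
                    break
                yield record

    # Hypothetical usage (keys here are SURT-form URLs, compared as bytes):
    # for record in cdx_lookup("collection-1.cdx", b"edu,stanford)/"):
    #     print(record)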

Best,

--
Sawood Alam
Department of Computer Science
Old Dominion University
Norfolk VA 23529



Alex Osborne

Dec 12, 2016, 7:01:39 PM
to openwayback-dev
Hi Darren,


On Tuesday, December 13, 2016 at 5:53:28 AM UTC+11, Darren Hardy wrote:
  1. What is the best configuration for large (>100GB) CDX files? We're currently using a single CDX file for our instance and each time we want to add more content, we have to sort/merge the whole thing again. Is there another configuration that supports incremental indexing, like WatchedCDXSource?


One strategy is the following. Keep several CDX files of increasing size. Perhaps:

Layer 0 - today's index
Layer 1 - this month's index
Layer 2 - this year's index
Layer 3 - the rest of the index

At the end of the day, rewrite this month's index by merging today's index into it. At the end of the month, rewrite this year's index by merging this month's index into it. At the end of the year, merge this year's index into the full index.

This amortizes the amount of updating that has to be done, at the expense of needing to do four lookups for every query. To cope better with sudden spikes in input, you could base the layers on file size rather than on time.
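
A rough sketch of that size-based variant, for illustration only: the layer file names and the growth factor below are assumptions, not anything OpenWayback provides, and the merge itself would be the same sorted merge that `sort -m` does.

    # Decide which layers to compact based on file size rather than the clock.
    # Layer names and the growth factor are made up for the example.
    import os

    LAYERS = ["layer0.cdx", "layer1.cdx", "layer2.cdx", "layer3.cdx"]
    GROWTH_FACTOR = 10   # fold a layer into the next once it reaches 1/10 of it

    def layers_to_compact():
        sizes = [os.path.getsize(p) if os.path.exists(p) else 0 for p in LAYERS]
        plan = []
        for i in range(len(LAYERS) - 1):
            # When layer i has grown to a significant fraction of layer i+1,
            # merge it into layer i+1; queries still touch only four files.
            if sizes[i] > 0 and sizes[i] * GROWTH_FACTOR >= sizes[i + 1]:
                plan.append((LAYERS[i], LAYERS[i + 1]))
        return plan

    print(layers_to_compact())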

The data structure I've just described is similar to a log-structured merge (LSM) tree. There are a number of general purpose databases which implement this sort of structure. We (National Library of Australia) use RocksDB for this purpose and our CDX server is here: https://github.com/nla/outbackcdx

Another option that occurred to me more recently, and that may be a better fit than OutbackCDX when indexes are too large for a single machine, is Cassandra. It has some similarities with RocksDB (including LSM and compression support) but is more focused on being a distributed database with sharding and so on.

I've also heard about people experimenting with storing their CDX index in Solr, which uses a somewhat similar merging process for its segment files.

  2. Does anyone have some rough performance characteristics for the CDX generation code (bin/cdx-indexer)? Is it CPU or IO intensive?

It's both CPU and IO intensive for large gzipped collections. I don't have any figures, but in my experience the bottleneck is typically CPU, in the gzip code decompressing the WARC files. But if you throw, say, 32 cores at it (by indexing multiple WARCs in parallel), you might find IO becomes the bottleneck, depending on what your storage setup is. For example, we routinely saturate a 10 Gbit network link when reading WARCs for indexing.
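
A crude way to check which side your own setup is on, for illustration only: this is not the cdx-indexer, and the WARC file name is a placeholder. Run each timing against a different WARC, or drop the page cache in between, so caching does not skew the comparison.

    # Compare reading a gzipped WARC with reading *and* decompressing it.
    # If decompression dominates, indexing is CPU-bound and more parallel
    # workers help; if plain reading is already slow, storage or the
    # network is the limit.
    import gzip
    import time

    def read_only(path, chunk=1 << 20):
        start = time.monotonic()
        with open(path, "rb") as f:
            while f.read(chunk):
                pass
        return time.monotonic() - start

    def read_and_decompress(path, chunk=1 << 20):
        start = time.monotonic()
        with gzip.open(path, "rb") as f:   # handles multi-member WARC gzip
            while f.read(chunk):
                pass
        return time.monotonic() - start

    warc = "example-00000.warc.gz"         # placeholder file name
    print("read only:         %6.1f s" % read_only(warc))
    print("read + decompress: %6.1f s" % read_and_decompress(warc))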

  3. What are other institutions using for their filesystem storage of WARC files? And, how are you able to grow that over time? We are limited in our options since our NetApp storage is shared by many stakeholders here. So, we're looking at having to deal with multiple NFS mounts.


Hardware-wise, for bulk WARC storage we currently use an Isilon NAS which is used across the institution for all sorts of bulk storage purposes. This can conveniently expose petabytes of filesystem as a single NFS mount. Just about any kind of spinning-disk storage should be workable though. I know others are using local storage spread across a large cluster of servers, accessed over HTTP or HDFS.

We have two filesystems: a 'working' filesystem for new data and active crawls, and a 'preservation' filesystem for permanent bulk storage. We store the CDX index on SSDs installed directly in a server. We keep track of the current location of each WARC in an SQL database, and Wayback accesses the WARC files over HTTP rather than reading the filesystem directly. That allows us to move things from filesystem to filesystem, or to different types of storage, without interrupting service and without Wayback having to worry about where anything is stored.
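
The smallest version of that location-tracking idea, sketched for illustration only: this is not our actual schema, and the table, column, and host names below are made up.

    # A minimal filename -> current-location table; replay code resolves the
    # WARC filename recorded in the CDX line to wherever the file lives now.
    import sqlite3

    db = sqlite3.connect("warc_locations.db")
    db.execute("""
        CREATE TABLE IF NOT EXISTS warc_location (
            filename TEXT PRIMARY KEY,   -- e.g. example-00000.warc.gz
            url      TEXT NOT NULL       -- where the WARC currently lives
        )
    """)

    # Registering a WARC (or moving it later) is a single upsert.
    db.execute(
        "INSERT OR REPLACE INTO warc_location (filename, url) VALUES (?, ?)",
        ("example-00000.warc.gz",
         "http://warc-store.example.org/working/example-00000.warc.gz"),
    )
    db.commit()

    # At replay time, look up the current home of the WARC named in the index.
    (url,) = db.execute(
        "SELECT url FROM warc_location WHERE filename = ?",
        ("example-00000.warc.gz",),
    ).fetchone()
    print(url)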

Hope that helps,

Alex

Darren Hardy

Dec 12, 2016, 7:37:00 PM
to openwayback-dev
Thanks for the feedback. I like the layered index approach.

So, you're using an SQL database rather than the flat-file `path-index.txt` format for mapping to WARC file locations? What's the scale in number/size of WARC files that you're dealing with? We have ~20k files totalling ~20TB right now.

I'm a little confused about the server ecosystem. We're just running the OpenWayback `cdx-server.war`.  Are people generally running this cdx-server.war in production settings? Or for scaling purposes, should we consider switching to another implementation that uses a database for the index?

We run our service from VMs and we haven't added SSD storage to our NetApp yet (it's relatively expensive). So we're stuck with relatively slow NFS mounts in the short term.

Thanks,
-Darren