Elasticsearch indexing AIPs with many files

Andrew Berger

Aug 25, 2015, 8:17:36 PM
to archivematica
Hi all,

I've been testing Archivematica 1.4.1 using AIPs with many files (5,000–10,000) and am running into a couple of problems that all seem to be related to Elasticsearch indexing.

The first has to do with indexing large AIPs:

I tried to ingest an AIP with 10,204 files, representing a few directories pulled from a software collection from the late 80s/early 90s. The AIP stored successfully, but the indexing step that runs after AIP storage failed with multiple read timeout errors. The final traceback was:

Traceback (most recent call last):
  File "/usr/lib/archivematica/MCPClient/clientScripts/indexAIP.py", line 101, in <module>
    sys.exit(index_aip())
  File "/usr/lib/archivematica/MCPClient/clientScripts/indexAIP.py", line 76, in index_aip
    identifiers=identifiers)
  File "/usr/lib/archivematica/archivematicaCommon/elasticSearchFunctions.py", line 335, in connect_and_index_aip
    try_to_index(conn, aipData, 'aips', 'aip')
  File "/usr/lib/archivematica/archivematicaCommon/elasticSearchFunctions.py", line 343, in try_to_index
    return conn.index(body=data, index=index, doc_type=doc_type)
  File "/usr/local/lib/python2.7/dist-packages/elasticsearch/client/utils.py", line 69, in _wrapped
    return func(*args, params=params, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/elasticsearch/client/__init__.py", line 254, in index
    _make_path(index, doc_type, id), params=params, body=body)
  File "/usr/local/lib/python2.7/dist-packages/elasticsearch/transport.py", line 307, in perform_request
    status, headers, data = connection.perform_request(method, url, params, body, ignore=ignore, timeout=timeout)
  File "/usr/local/lib/python2.7/dist-packages/elasticsearch/connection/http_urllib3.py", line 82, in perform_request
    raise ConnectionTimeout('TIMEOUT', str(e), e)
elasticsearch.exceptions.ConnectionTimeout: ConnectionTimeout caused by - ReadTimeoutError(HTTPConnectionPool(host=u'localhost', port=9200): Read timed out. (read timeout=10))

The second issue is that at some point in my testing the Archival Storage tab stopped loading:

Since the ~10,000-file AIP failed to index, I dropped down to about 5,000 files per AIP. I ingested three AIPs of roughly that size: one taken from the same software collection I used for the ~10,000-file test, and the other two containing the contents of a 1992 CD, which I ingested twice using different format identification tools. Since those were successful, I also tried an AIP of ~7,500 files, which was likewise ingested and indexed successfully, at least according to the output on the ingest tab.

Unfortunately, at some point during the process of ingesting these AIPs, the Archival Storage tab stopped loading and started giving me a 500 server error. Since the rest of the dashboard seems fine, I suspect that there is a problem with Elasticsearch retrieving the data used to populate the Archival Storage tab. I'm still able to ingest new AIPs and verify their fixity using the Storage Service API.

I turned on dashboard debugging and it shows two different errors, depending on the Elasticsearch timeout settings. The first is simply a read timeout, with a traceback similar to the one above for the 10,000-file AIP. Following the traceback, I increased the timeout in

/usr/local/lib/python2.7/dist-packages/elasticsearch/transport.py
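
In hindsight, a less invasive route than patching the installed library would probably be to pass a longer timeout through the elasticsearch-py client itself, either once when the connection is created or per request. A rough sketch, using the same client API that appears in the traceback above; the 60-second value is an arbitrary choice and aip_data is a placeholder for the document being indexed:

from elasticsearch import Elasticsearch

# Client-wide default: every request made through this connection
# waits up to 60 seconds instead of the stock 10.
conn = Elasticsearch(hosts=['localhost:9200'], timeout=60)

# Per-request override: only this index call gets the longer timeout.
# 'aip_data' is a placeholder for the AIP document being indexed.
conn.index(index='aips', doc_type='aip', body=aip_data, request_timeout=60)

That would keep the change out of /usr/local/lib and let it survive package upgrades.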

Eventually, I stopped getting the read timeout and now get the following IndexError:

Environment:

Request Method: GET
Request URL: http://am1204.hq.computerhistory.org/archival-storage/

Django Version: 1.5.4
Python Version: 2.7.3
Installed Applications:
('django.contrib.auth',
 'django.contrib.contenttypes',
 'django.contrib.sessions',
 'django.contrib.messages',
 'django.contrib.staticfiles',
 'django.contrib.webdesign',
 'installer',
 'components.accounts',
 'main',
 'components.mcp',
 'components.administration',
 'fpr',
 'tastypie')
Installed Middleware:
('django.middleware.common.CommonMiddleware',
 'django.contrib.sessions.middleware.SessionMiddleware',
 'django.contrib.auth.middleware.AuthenticationMiddleware',
 'django.contrib.messages.middleware.MessageMiddleware',
 'middleware.common.AJAXSimpleExceptionResponseMiddleware',
 'installer.middleware.ConfigurationCheckMiddleware',
 'middleware.common.SpecificExceptionErrorPageResponseMiddleware')


Traceback:
File "/usr/local/lib/python2.7/dist-packages/django/core/handlers/base.py" in get_response
  115.                         response = callback(request, *callback_args, **callback_kwargs)
File "/usr/share/archivematica/dashboard/components/decorators.py" in inner
  50.                 return func(request, *args, **kwargs)
File "/usr/share/archivematica/dashboard/components/archival_storage/views.py" in overview
  59.     return list_display(request)
File "/usr/share/archivematica/dashboard/components/archival_storage/views.py" in list_display
  532.         current_page_number
File "/usr/share/archivematica/dashboard/components/helpers.py" in pager
  92.         page = paginator.page(current_page_number)
File "/usr/local/lib/python2.7/dist-packages/django/core/paginator.py" in page
  45.         return Page(self.object_list[bottom:top], number, self)
File "/usr/local/lib/python2.7/dist-packages/lazy_paged_sequence.py" in __getitem__
  57.             return [self.__getitem__(i) for i in range(index.start, index.stop, step)]
File "/usr/local/lib/python2.7/dist-packages/lazy_paged_sequence.py" in __getitem__
  65.         return self.__cache[page_number][index % self.page_size]

Exception Type: IndexError at /archival-storage/
Exception Value: list index out of range


Has anyone else seen this issue? Reading the traceback, it looks like the paginator asks lazy_paged_sequence for an item on a page whose underlying Elasticsearch query returned fewer results than expected, so the index % page_size lookup falls off the end of the cached page. Could the problem be that Elasticsearch isn't able to pull the data required to display and paginate the archival storage results list properly? Aside from the one change to the timeout variable, I have not changed any of the default Elasticsearch configuration, so I could also see that being the source of the problem. I expect that there's a point where the configuration needs to be modified to scale up the number of files indexed, and I may have reached it. Any help or advice would be appreciated.
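
For what it's worth, a quick way to check whether the aips index itself is reachable, independent of the dashboard, is to query it directly from a Python shell. A minimal sketch, using the index and document type names from the traceback in my first message:

from elasticsearch import Elasticsearch

conn = Elasticsearch(hosts=['localhost:9200'], timeout=60)

# Ask Elasticsearch how many AIP documents it thinks it holds.
print(conn.count(index='aips', doc_type='aip'))

If the count comes back promptly but the tab still returns a 500, the problem is more likely in the pagination layer than in the index itself.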

Thanks,
Andrew

Andrew Berger

Aug 26, 2015, 1:48:25 PM
to archiv...@googlegroups.com
To partly answer my own question:

After digging through the Elasticsearch logs and finding Java out-of-memory errors, I changed the value of the ES_HEAP_SIZE variable in /etc/init.d/elasticsearch to 2GB. The VM I'm testing with has 8GB of RAM total, so if I'm understanding the documentation correctly (the usual advice being to give Elasticsearch no more than half of physical memory), there's still room to increase the heap to 4GB.
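
For reference, the changed line now reads roughly as below (on this install the variable sits directly in the init script; other layouts may keep it in /etc/default/elasticsearch instead):

# excerpt from /etc/init.d/elasticsearch
ES_HEAP_SIZE=2g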

I restarted Elasticsearch with

sudo /etc/init.d/elasticsearch restart

and the Archival Storage tab came back to life. It loads a bit slowly, and the AIP that previously failed to index now shows up multiple times in the list (presumably one entry per indexing attempt), but everything appears to be there.

This does make me wonder: what common Elasticsearch configurations are people using for Archivematica? The default heap size, which appears to be 1GB, seems like it's fine for most smaller installations, but it looks like there's a point where it has to be increased to accommodate a growing index.
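
One way to tell whether the heap really is the limiting factor, rather than guessing from symptoms, might be to watch the JVM stats that Elasticsearch exposes through its node stats API. A small sketch with elasticsearch-py, assuming the 1.x node stats response format:

from elasticsearch import Elasticsearch

conn = Elasticsearch(hosts=['localhost:9200'])

# Pull JVM memory stats for every node.
stats = conn.nodes.stats(metric='jvm')
for node in stats['nodes'].values():
    mem = node['jvm']['mem']
    print('%s: heap %d%% used of %d bytes max'
          % (node['name'], mem['heap_used_percent'], mem['heap_max_in_bytes']))

If heap_used_percent sits near 100 while an AIP is being indexed, the heap is the bottleneck and raising ES_HEAP_SIZE is the right lever.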

Andrew


Genevieve HK

Apr 22, 2016, 1:46:50 PM
to archivematica
Hi Andrew, out of curiosity, when you changed the ES_HEAP_SIZE, did you make any changes to other neighboring values, such as the heap new generation (ES_HEAP_NEWSIZE) or max direct memory (ES_DIRECT_SIZE)?

Just wondering, because I made the change you described above and later experienced some very strange behavior (even though I didn't set the new heap size anywhere near 50% of the available memory).

Thanks!
-Gen

Andrew Berger

Apr 27, 2016, 7:09:19 PM
to archiv...@googlegroups.com
Hi Gen,

It's been a while since I was testing this, but I don't think I changed anything other than ES_HEAP_SIZE. What strange behavior have you been seeing? When I tested this, I didn't ingest many more AIPs after increasing the variable, so I may just have never seen what you're seeing.

Incidentally, we are still running in production with the default Elasticsearch settings. If you do resolve this, I'd be interested to know what settings you use. I have some packages in the queue that we have not ingested yet because we are concerned about what will happen with the indexing.

Best,
Andrew


Ralf Siebert

Apr 28, 2016, 4:17:38 AM
to archivematica
Good morning Andrew,

What you can do is run Elasticsearch as a cluster, so that the indexing work is shared across more than one machine; the standard single-node configuration looks adequate only for small numbers of AIPs. AWS offers some options for that cluster scenario: for example, you can have it automatically add machines to your cluster once heap usage passes a given threshold.
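
A minimal sketch of the self-managed version, assuming Elasticsearch 1.x, two nodes, and placeholder host names, in /etc/elasticsearch/elasticsearch.yml on each node:

# Same cluster.name everywhere so the nodes discover each other.
cluster.name: archivematica-aips

# Unique per machine.
node.name: es-node-1

# Unicast discovery: list every node in the cluster.
discovery.zen.ping.unicast.hosts: ["es-node-1.example.org", "es-node-2.example.org"]

# Majority of master-eligible nodes, to avoid split brain.
discovery.zen.minimum_master_nodes: 2

The dashboard and MCP client would then point at any node in the cluster instead of localhost.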


Kind Regards,
  Ralf 