Hi all,
I've been testing Archivematica 1.4.1 using AIPs with many files (5000 - 10000) and am running into a couple of problems that all seem to be related to Elasticsearch indexing.
The first has to do with indexing large AIPs:
I tried to ingest an AIP with 10204 files, which represents a few directories pulled from a software collection from the late 80s/early 90s. This stored successfully, but the post-AIP storage indexing step failed with multiple read timeout errors. The final traceback was:
Traceback (most recent call last):
  File "/usr/lib/archivematica/MCPClient/clientScripts/indexAIP.py", line 101, in <module>
    sys.exit(index_aip())
  File "/usr/lib/archivematica/MCPClient/clientScripts/indexAIP.py", line 76, in index_aip
    identifiers=identifiers)
  File "/usr/lib/archivematica/archivematicaCommon/elasticSearchFunctions.py", line 335, in connect_and_index_aip
    try_to_index(conn, aipData, 'aips', 'aip')
  File "/usr/lib/archivematica/archivematicaCommon/elasticSearchFunctions.py", line 343, in try_to_index
    return conn.index(body=data, index=index, doc_type=doc_type)
  File "/usr/local/lib/python2.7/dist-packages/elasticsearch/client/utils.py", line 69, in _wrapped
    return func(*args, params=params, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/elasticsearch/client/__init__.py", line 254, in index
    _make_path(index, doc_type, id), params=params, body=body)
  File "/usr/local/lib/python2.7/dist-packages/elasticsearch/transport.py", line 307, in perform_request
    status, headers, data = connection.perform_request(method, url, params, body, ignore=ignore, timeout=timeout)
  File "/usr/local/lib/python2.7/dist-packages/elasticsearch/connection/http_urllib3.py", line 82, in perform_request
    raise ConnectionTimeout('TIMEOUT', str(e), e)
elasticsearch.exceptions.ConnectionTimeout: ConnectionTimeout caused by - ReadTimeoutError(HTTPConnectionPool(host=u'localhost', port=9200): Read timed out. (read timeout=10))
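The "read timeout=10" at the end of that traceback appears to be the Python client's default timeout. As a sketch rather than a tested fix, the timeout could be raised per request instead of by editing the installed library. The helper below is hypothetical: index_with_longer_timeouts, index_fn and timeout_error are names made up for illustration. It assumes the callable accepts a request_timeout keyword, which elasticsearch-py client methods such as conn.index are understood to do, so this should be checked against the installed version.

```python
def index_with_longer_timeouts(index_fn, timeout_error,
                               timeouts=(10, 60, 300), **index_kwargs):
    """Call index_fn, retrying with a longer timeout after each timeout error.

    index_fn      -- callable that indexes a document (assumed to accept a
                     request_timeout keyword argument, in seconds)
    timeout_error -- exception class that signals a timed-out request
    timeouts      -- timeouts to try, in order (illustrative values only)
    """
    if not timeouts:
        raise ValueError("at least one timeout value is required")

    last_error = None
    for seconds in timeouts:
        try:
            return index_fn(request_timeout=seconds, **index_kwargs)
        except timeout_error as error:
            # Remember the failure and move on to the next, longer timeout.
            last_error = error

    # Every attempt timed out: surface the last timeout error to the caller.
    raise last_error
```

With the real client this would be called roughly as index_with_longer_timeouts(conn.index, elasticsearch.exceptions.ConnectionTimeout, body=aipData, index='aips', doc_type='aip'). One caveat: if an earlier attempt finished on the server after the client gave up, retrying without a document id may index the same AIP twice.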
The second issue is that at some point in my testing the Archival Storage tab stopped loading:
Since a ~10,000-file AIP failed to index, I dropped down to about 5000 files per AIP. I ingested three AIPs of roughly that size: one was taken from the same software collection I used for the ~10,000-file test, and the other two were the contents of a 1992 CD that I ingested twice using different format ID tools. Since those were successful, I also tried an AIP of ~7500 files, which was also ingested and indexed successfully, at least according to the output on the ingest tab.
Unfortunately, at some point during the process of ingesting these AIPs, the Archival Storage tab stopped loading and started giving me a 500 server error. Since the rest of the dashboard seems fine, I suspect that there is a problem with Elasticsearch retrieving the data used to populate the Archival Storage tab. I'm still able to ingest new AIPs and verify their fixity using the Storage Service API.
I turned on dashboard debugging and it shows two different errors, depending on the Elasticsearch timeout settings. The first error is simply a read timeout with a similar traceback to the one I got above for the 10,000 file AIP. Following the traceback output, I increased the timeout in
/usr/local/lib/python2.7/dist-packages/elasticsearch/transport.py
Eventually, I stopped getting the read timeout and now get the following IndexError:
Environment:
Request Method: GET
Request URL: http://am1204.hq.computerhistory.org/archival-storage/
Django Version: 1.5.4
Python Version: 2.7.3
Installed Applications:
('django.contrib.auth',
'django.contrib.contenttypes',
'django.contrib.sessions',
'django.contrib.messages',
'django.contrib.staticfiles',
'django.contrib.webdesign',
'installer',
'components.accounts',
'main',
'components.mcp',
'components.administration',
'fpr',
'tastypie')
Installed Middleware:
('django.middleware.common.CommonMiddleware',
'django.contrib.sessions.middleware.SessionMiddleware',
'django.contrib.auth.middleware.AuthenticationMiddleware',
'django.contrib.messages.middleware.MessageMiddleware',
'middleware.common.AJAXSimpleExceptionResponseMiddleware',
'installer.middleware.ConfigurationCheckMiddleware',
'middleware.common.SpecificExceptionErrorPageResponseMiddleware')
Traceback:
File "/usr/local/lib/python2.7/dist-packages/django/core/handlers/base.py" in get_response
115. response = callback(request, *callback_args, **callback_kwargs)
File "/usr/share/archivematica/dashboard/components/decorators.py" in inner
50. return func(request, *args, **kwargs)
File "/usr/share/archivematica/dashboard/components/archival_storage/views.py" in overview
59. return list_display(request)
File "/usr/share/archivematica/dashboard/components/archival_storage/views.py" in list_display
532. current_page_number
File "/usr/share/archivematica/dashboard/components/helpers.py" in pager
92. page = paginator.page(current_page_number)
File "/usr/local/lib/python2.7/dist-packages/django/core/paginator.py" in page
45. return Page(self.object_list[bottom:top], number, self)
File "/usr/local/lib/python2.7/dist-packages/lazy_paged_sequence.py" in __getitem__
57. return [self.__getitem__(i) for i in range(index.start, index.stop, step)]
File "/usr/local/lib/python2.7/dist-packages/lazy_paged_sequence.py" in __getitem__
65. return self.__cache[page_number][index % self.page_size]
Exception Type: IndexError at /archival-storage/
Exception Value: list index out of range
Has anyone else seen this issue? Could the problem be that Elasticsearch isn't able to pull the data required to display and paginate the archival storage results list properly? Aside from the one change to the timeout variable, I have not changed any of the default Elasticsearch configuration, so I could also see that being the source of the problem. I expect that there's a point where the configuration needs to be modified to scale up the number of files indexed, which I may have reached. Any help or advice would be appreciated.
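To make the suspicion above concrete, here is a minimal sketch of one way a lazily paged list can raise this kind of IndexError: the backend reports more results than a page fetch actually returns. The class below is a hypothetical stand-in written for illustration, not the real lazy_paged_sequence module, and the mismatch it simulates is an assumption about the cause, not a confirmed diagnosis.

```python
class LazyPagedSequence(object):
    """Hypothetical stand-in for a lazily loaded, page-cached sequence."""

    def __init__(self, fetch_page, page_size, reported_length):
        self.fetch_page = fetch_page            # callable: page number -> list
        self.page_size = page_size
        self.reported_length = reported_length  # total claimed by the backend
        self._cache = {}

    def __len__(self):
        return self.reported_length

    def __getitem__(self, index):
        if isinstance(index, slice):
            # A paginator slices using the reported length, not the real one.
            start, stop, step = index.indices(len(self))
            return [self[i] for i in range(start, stop, step)]
        page_number = index // self.page_size
        if page_number not in self._cache:
            self._cache[page_number] = self.fetch_page(page_number)
        # Fails if the fetched page is shorter than the reported length implies.
        return self._cache[page_number][index % self.page_size]


def short_page(page_number):
    # Simulated backend: claims 10 hits overall but returns only 7 items.
    return ["aip-%d" % i for i in range(7)] if page_number == 0 else []


seq = LazyPagedSequence(short_page, page_size=10, reported_length=10)
try:
    seq[0:10]
except IndexError as error:
    print("IndexError: %s" % error)
```

If something like this is happening, comparing the hit count Elasticsearch reports for the AIP index with the number of documents a page query actually returns would be a reasonable first check.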
Thanks,
Andrew