I'm currently in the process of implementing Mayan to replace our current document management system (FileBound).
Our setup currently consists of 14 document types, 64 metadata types, and 16 indexes, with over 66,000 files loaded.
Reindexing this system is... somewhat slow to say the least.
I let it crunch away for a good 16 hours, and got about halfway through.
Obviously, this isn't good enough - indexing might be slow, but it shouldn't be /this/ slow.
With a few modifications, I've sped this up by at least 8x: the old run extrapolates to roughly 32 hours for a full rebuild, and the new one takes around 4 hours, which is acceptable.
What I did was:
1. Instead of looping by document and then by index, I loop by index and then by document. This lets a single index be rebuilt at a time, rather than several being 'filled in' at once.
2. Modified the delete section to delete only the index currently being worked on. This lets you keep using the other indexes during the rebuild process.
3. Removed the 'with transaction.atomic():' line in the indexer. I'm sure this makes it 'less safe' if something were to fail, but I figure that if something fails, a reindex is needed anyway.
(By splitting the index rebuild from the single-file indexer, I can leave the atomic transaction line in place for a single file, where it makes sense.) This change easily doubled the speed, if not quadrupled it.
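To illustrate why the atomic block still makes sense for a single file, here is a minimal sketch that uses sqlite3 as a stand-in for the Django ORM. The table name `node` and the functions `make_db`, `index_document` and `entries` are made up for this example and are not part of Mayan; the only point is that a failure halfway through re-indexing one document leaves its old index entries intact.

```python
import sqlite3


def make_db():
    # In-memory database with one hypothetical table of index entries.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE node (index_name TEXT, doc_id INTEGER)")
    conn.commit()
    return conn


def index_document(conn, doc_id, index_names, fail=False):
    """Re-index one document: drop its old entries, then add the new ones.

    `with conn:` commits if the block succeeds and rolls back if it raises,
    which plays the role of transaction.atomic() in this sketch.
    """
    with conn:
        conn.execute("DELETE FROM node WHERE doc_id = ?", (doc_id,))
        for name in index_names:
            conn.execute("INSERT INTO node VALUES (?, ?)", (name, doc_id))
        if fail:
            # Simulated error partway through indexing.
            raise RuntimeError("simulated failure mid-index")


def entries(conn, doc_id):
    # Return the index names currently recorded for a document, sorted.
    rows = conn.execute(
        "SELECT index_name FROM node WHERE doc_id = ? ORDER BY index_name",
        (doc_id,),
    )
    return [row[0] for row in rows]
```

If `index_document` raises, the delete is rolled back along with the partial inserts, so the document never ends up missing from its indexes. For a full rebuild that guarantee matters much less, since everything is being regenerated anyway.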
My final code:
mayan/apps/document_indexing/managers.py:
def rebuild_all_indexes(self):
    # Document is assumed to be imported at the top of managers.py already.
    from .models import Index

    for index in Index.objects.filter(enabled=True):
        print 'indexing', index

        # Delete the nodes applicable to this index.
        # NOTE: this matches nodes whose primary key equals the index id;
        # filtering through index_template_node may be what is really
        # wanted here.
        print 'deleting nodes'
        for instance_node in self.filter(id=index.id):
            instance_node.delete()

        # Delete empty nodes
        self.delete_empty_index_nodes()

        # Add the root node for this index
        print 'adding index node'
        root_instance, created = self.get_or_create(
            index_template_node=index.template_root, parent=None
        )

        # Reindex each document whose type belongs to this index
        print 'indexing documents...'
        docs_indexed = 0
        for document in Document.objects.filter(document_type__in=index.document_types.all()):
            for template_node in index.template_root.get_children():
                self.cascade_eval(document, template_node, root_instance)
            docs_indexed += 1
            if docs_indexed % 10 == 0:
                print 'indexing document', document, docs_indexed, 'completed'
All of the 'print' lines could be removed, but they are very handy when watching it run in runserver/devel mode.
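One alternative to deleting the prints would be to route the progress messages through the standard logging module, so they can be silenced by log level instead of by editing the code. This is only a sketch: the logger name "index_rebuild" and the helper `report_progress` are hypothetical and not something Mayan provides.

```python
import logging

# Hypothetical logger name; any name would do.
logger = logging.getLogger("index_rebuild")


def report_progress(count, every=10):
    """Log a progress line when `count` is a multiple of `every`.

    Returns True when a line was logged, False otherwise.
    """
    if count % every == 0:
        logger.debug("indexed %d documents", count)
        return True
    return False
```

Inside the document loop this would replace the 'if docs_indexed % 10 == 0: print ...' pair, and the messages only show up when the logger is set to DEBUG.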
Does anyone have other ideas for improvement, or see potential pitfalls that this could cause?