Indexing speed improvements

MacRobb Simpson

Apr 28, 2017, 5:26:03 PM

to Mayan EDMS

I'm currently in the process of implementing Mayan to replace our current Document Management System (FileBound).
Our setup currently consists of 14 document types, 64 metadata types, 16 indexes and over 66,000 files loaded.

Reindexing this system is... somewhat slow to say the least.
I let it crunch away for a good 16 hours, and got about halfway through.

Obviously, this isn't good enough - indexing might be slow, but it shouldn't be /this/ slow.

With a few mods, I've sped this up by at least 8x (figure around 4 hours for a full rebuild... acceptable).
What I did was:
1. Instead of looping over documents and then indexes, I loop over indexes and then documents. This rebuilds a single index at a time, rather than 'filling in' all of them at once.
2. Modified the delete section to delete only the index currently being worked on. This lets you keep using the other indexes during the rebuild.
3. Removed the 'with transaction.atomic():' line in the indexer. I'm sure this makes it 'less safe' if something were to fail, but I figure that if something fails a reindex is needed anyway.
(By splitting the index rebuild from the single-file indexer, I can leave that atomic transaction line for a single file, where it makes sense.) This change easily doubled the speed, if not quadrupled it.
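
As a rough illustration of points 1 and 2, here is a minimal sketch in plain Python rather than Mayan code. The names `store`, `indexes`, `rebuild_index` and the sample documents are all made up for the example; the only thing it tries to show is that looping by index first lets one index be dropped and refilled while the others stay usable.

```python
# Hypothetical stand-ins, not Mayan models: `indexes` maps an index name to a
# function that picks the node for a document, and `store` maps an index name
# to the nodes built for it.

def rebuild_index(store, name, make_node, documents):
    """Rebuild one index: drop only its own nodes, then refill it."""
    store[name] = {}  # the other indexes in `store` are left untouched
    for document in documents:
        node = make_node(document)
        store[name].setdefault(node, []).append(document['label'])


def rebuild_all_indexes(store, indexes, documents):
    """Outer loop over indexes, inner loop over documents."""
    for name in sorted(indexes):
        rebuild_index(store, name, indexes[name], documents)


documents = [
    {'label': 'invoice-1.pdf', 'year': 2016, 'type': 'invoice'},
    {'label': 'invoice-2.pdf', 'year': 2017, 'type': 'invoice'},
    {'label': 'contract-1.pdf', 'year': 2017, 'type': 'contract'},
]
indexes = {
    'by_type': lambda document: document['type'],
    'by_year': lambda document: document['year'],
}
store = {'by_type': {'stale': ['old.pdf']}, 'by_year': {'stale': ['old.pdf']}}

# Rebuilding a single index leaves the other one exactly as it was.
rebuild_index(store, 'by_year', indexes['by_year'], documents)
```

After the single `rebuild_index` call, `store['by_type']` still holds its old contents, which is the 'keep using the other indexes' behaviour described in point 2.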

My final code:
mayan/apps/document_indexing/managers.py:

    def rebuild_all_indexes(self):
        # Assumes Document is already imported at the top of managers.py.
        from .models import Index

        for index in Index.objects.filter(enabled=True):
            print('indexing %s' % index)
            # Delete only this index's instance nodes: removing the root
            # instance node is expected to cascade to its children.
            print('deleting nodes')
            for instance_node in self.filter(index_template_node=index.template_root):
                instance_node.delete()
            # Delete empty nodes
            self.delete_empty_index_nodes()
            print('adding index node')
            # Recreate the root instance node for this index
            root_instance, created = self.get_or_create(
                index_template_node=index.template_root, parent=None
            )
            print('indexing documents...')
            docs_indexed = 0
            # Reindex each document whose type belongs to this index
            for document in Document.objects.filter(document_type__in=index.document_types.all()):
                # Evaluate every top-level template node for the document
                for template_node in index.template_root.get_children():
                    self.cascade_eval(document, template_node, root_instance)
                docs_indexed += 1
                if docs_indexed % 10 == 0:
                    print('indexed %s (%d completed)' % (document, docs_indexed))

All of the 'print' lines could be removed, but are very handy when watching it run from run-server/devel mode.

Anyone got any other improvement ideas or potential pitfalls that this could cause?

Roberto Rosario

May 27, 2017, 11:01:56 AM

to Mayan EDMS

That's great! Going through your changes to see how much I can move upstream.

Roberto Rosario

May 27, 2017, 2:07:31 PM

to Mayan EDMS

Doing some tests I've hit several regressions and a few race conditions (without the 'document_indexing_task_do_rebuild_all_indexes' lock, deleting a document would delete its index instance if it is empty, even while an index is being rebuilt).
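
To make that race concrete, here is a minimal sketch in plain Python. It is not Mayan's locking code: `rebuild_lock`, `delete_document` and the `deferred` list are hypothetical names, and a `threading.Lock` stands in for whatever lock backend is actually used. The idea is only that pruning an empty node should be skipped while a rebuild holds the lock.

```python
import threading

# Hypothetical stand-in for the rebuild lock; not Mayan's real lock.
rebuild_lock = threading.Lock()


def delete_document(nodes, node_name, label, deferred):
    """Remove a document from a node and prune the node only when safe."""
    nodes[node_name].remove(label)
    if nodes[node_name]:
        return 'kept'  # node still has documents, nothing to prune
    # Non-blocking acquire: returns False if a rebuild holds the lock.
    if rebuild_lock.acquire(False):
        try:
            del nodes[node_name]
            return 'pruned'
        finally:
            rebuild_lock.release()
    # A rebuild is running, so leave the empty node for later cleanup.
    deferred.append(node_name)
    return 'deferred'


nodes = {'2017': ['a.pdf'], '2016': ['b.pdf']}
deferred = []

with rebuild_lock:  # simulate a rebuild in progress
    during = delete_document(nodes, '2017', 'a.pdf', deferred)

after = delete_document(nodes, '2016', 'b.pdf', deferred)
```

Without the lock check, the first call would remove the '2017' node out from under the rebuild, which is the kind of failure described above.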

The entire indexing locking workflow will need to be remade too. This refactor is bigger than initially expected.

Roberto Rosario

May 28, 2017, 12:29:24 PM

to Mayan EDMS

I'm rewriting most of the indexing code and managed to include reindexing of individual indexes, rather than all of them at once. Commit here: https://gitlab.com/mayan-edms/mayan-edms/commit/ac6f748113932d91f23f15dffd9a2ba95b2a1b66

The rewrite uses fewer locks (just 2 now), so it is already much faster. It also opens the possibility of indexing by workflow states and tags. The code is in a separate branch off the master branch (2.2), to try and push this into the next stable release (2.2.1 or 2.3) instead of waiting for the next major version (3.0). If you have a development install of Mayan, please help test this branch to speed up its inclusion.
