Improving performance

Zahid Mahmood

Oct 19, 2018, 10:17:11 AM
to Alfresco Bulk Import Tool
Hi Peter,

I am streaming a series of large batches, some over a million nodes. Each import takes about 5 hours, but I noticed that after the correct number of nodes has been created, the status doesn't change to 'Idle' for about 30 minutes. What is the process doing during that time, and can I remove the source files?

Read throughput starts at around 2,000/sec and gradually drops to 120/sec after 5 hours.
Write throughput starts at 300/sec and drops to 70/sec.

Alfresco: 8 cores, 32 GB RAM
DB: 8 cores, 64 GB RAM (max 1,500 connections)

Allocating 45 threads, batch size 270
Utilizing 25% CPU on the Alfresco server, 45% CPU on the DB
125 DB connections on average

How can I improve throughput?
There isn't much I can do about disk i/o but I can add more cores and memory.

Thanks
Zahid

Peter Monks

Nov 21, 2018, 1:12:55 PM
to alfresco-bulk-f...@googlegroups.com
G'day Zahid,

I don't know why there'd be a 30 minute delay at the end.  Can you grab a thread dump during that time and see if there are bulk import threads still around, and if so, what they're doing (i.e. their stack traces)?  All of the threads created by the bulk import tool have pretty self-explanatory names.  And if you could share those thread dumps here, or on a new GitHub issue, I can take a look at them and see if I notice anything unusual.
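(For reference, here's a rough sketch of capturing a dump and filtering it down to just those threads.  The 'BulkImport' thread-name substring and the stand-in dump content below are placeholders of mine, not the tool's actual thread names; check what your dump really contains.  Capture with `jstack <alfresco-pid> > dump.txt` against the real JVM.)

```shell
# Filter a JVM thread dump down to just the bulk import threads.
# 'BulkImport' as a thread-name substring is an assumption -- substitute
# whatever the thread names in your dump actually look like.

# A tiny stand-in dump so the sketch runs as-is; for a real diagnosis,
# replace this with:  jstack <alfresco-pid> > dump.txt
cat > dump.txt <<'EOF'
"BulkImport-Scout-1" #42 prio=5
   java.lang.Thread.State: RUNNABLE
	at com.example.Scanner.scan(Scanner.java:10)
"http-nio-8080-exec-1" #7 prio=5
   java.lang.Thread.State: WAITING
	at java.lang.Object.wait(Native Method)
EOF

# A line starting with '"' opens a new thread section; keep printing lines
# while the most recent section header matched 'BulkImport'.
awk '/^"/ {show = /BulkImport/} show' dump.txt
```

If the filtered output is empty during that 30-minute window, the delay is probably happening outside the tool's own threads.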

Regarding throughput, the read and write statistics start off high because the tool's first phase only handles the directory structure, and since directories have no content they're much faster to create in the repository.  There can also be efficiencies in reading directory entries from the OS, though that depends on the OS and filesystem in use.  This FAQ item covers basically this situation.

Another FAQ item talks about how to improve throughput, but the tl;dr is that performance tuning should always be done empirically.  Just throwing extra resources at the problem without first measuring where the bottleneck lies is a great way to waste time.  With that said, some general observations I'd make:
  1. It's always possible to better tune the database.  Having an experienced DBA monitor the database while an import is in flight (even a test import of a subset of the real content set) will almost certainly identify database configuration improvements.
    1. Just be aware that it's possible to over-optimise the database for the bulk import case, only to penalise user usage patterns after the import is complete (of course you could reconfigure the database for online workloads after the import is complete - that's the best of both worlds).
    2. Don't let the DBA add, remove, or reconfigure the Alfresco schema at all.  DBAs love telling developers that their indexes are sub-optimal, for example, but messing with the Alfresco schema will put you in an unsupported configuration.
  2. Don't forget about the network between Alfresco and the database - it needs to be as high bandwidth, as low latency, and as transmission-error-free as possible.  Poor network connectivity to the database massively hurts all Alfresco workloads, since Alfresco is very database chatty.
  3. Reviewing and optimising I/O to the content store filesystem and (secondarily) the source filesystem is also worth some time.
    1. If either of these are remote (NAS, SAN, etc.) they'll be competing with database I/O for network resources (see previous point).  If you can afford to multi-NIC the server and segregate the database and filesystem I/O so that they're on separate networks, this is definitely worth considering.  If in doubt, give database I/O the highest priority (i.e. put that traffic on a network of its own), and lump everything else (filesystem I/O, Alfresco web UI traffic, REST API calls, etc.) together on a second network.
  4. Consider introducing an Alfresco cluster, and splitting up the source content set across the cluster, so that each subset can be imported in parallel on its own cluster node.
    1. Note 1: unless / until you've tuned the database, network, and filesystem, this won't be particularly effective, since those components don't scale out with the cluster (they're shared across an Alfresco cluster).
    2. Note 2: this trick only works with the original edition of the bulk import tool available from GitHub.  The ancient fork that's included in Alfresco v4 and up is a cluster-singleton operation; one of many reasons not to use it.
  5. Beyond meeting basic Alfresco requirements, assigning more memory to the Alfresco JVM rarely helps.  The bulk import tool is deliberately designed to have a low-memory footprint.
  6. I have never seen the CPU be a bottleneck for bulk imports (well, not since the Sun UltraSparc T1 days...) - bulk imports just don't do much computation.
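Point 4 can be sketched in a few lines of shell.  This just round-robins the top-level directories of a source set into one staging area per cluster node; the paths, the node count, and the directory names are all placeholders for illustration, and whether symlinks suit your setup is worth checking (use mv or a copy if not):

```shell
# Sketch: split the top-level directories of a source content set across
# N staging areas, one per Alfresco cluster node, so each node can import
# its own subset in parallel.  SRC, NODES, and staging paths are placeholders.
SRC=source_set
NODES=3

# Stand-in source set so the sketch runs as-is; skip this for a real run.
mkdir -p "$SRC"/dir_a "$SRC"/dir_b "$SRC"/dir_c "$SRC"/dir_d "$SRC"/dir_e "$SRC"/dir_f

i=0
for dir in "$SRC"/*/; do
  node=$(( i % NODES + 1 ))
  mkdir -p "staging_$node"
  # Symlink rather than move or copy, so the source set itself stays intact;
  # verify your import source handles symlinks, or substitute mv/cp.
  ln -s "$(cd "$dir" && pwd)" "staging_$node/$(basename "$dir")"
  i=$(( i + 1 ))
done
```

Each node then imports from its own staging_N directory.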
Finally, 120 nodes per second is actually right around average for a bulk import, and probably indicates that the environment you're using is reasonably well tuned already.  The fastest import I've personally heard of was only a couple of times faster than that - at the end of the day copying millions of files simply takes a while, and there's no easy way around the basic physics of what's happening.

Cheers,
Peter