I have started using this tool (primarily via the Maven plugin), and one complaint/issue is that it is slow.
I am certainly OK with sacrificing a few seconds for the results of this analysis, but the amount of "extra" time is often more than I would expect, and I suspect there are opportunities to improve. For example, consider the following output:
[INFO] Checking for updates
[DEBUG] Begin Engine Version Check
[DEBUG] Last checked: 1441269696330
[DEBUG] Now: 1441670763184
[DEBUG] Current version: 1.3.1-SNAPSHOT
[DEBUG] Upgrade not needed
[INFO] Check for updates complete (24492 ms)
24 seconds seems like an awfully long time just to check for an update. (One would expect that this should be closer to 1 second.)
As a result, I have started to take a closer look at the implementation to better understand the logical flow and identify areas that may be improved to increase runtime performance. Perhaps this thread can be used as a brainstorm to identify and prioritize opportunities.
(Not in a particular order yet. And forgive me if I missed something in the code -- I am just getting started...)
H2 Database Open/Close
It appears that the H2 database is opened and closed multiple times, and that just opening the connection takes a couple of seconds. This seems to happen at least twice, if not more, so a couple of wasted seconds times multiple opens starts to add up.
H2 Database Performance Tuning suggests avoiding repeatedly opening and closing database connections. I suggest either refactoring the algorithm to open the database connection once and pass it around to the Updaters, or using a connection pool (such as
HikariCP).
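The open-once refactoring could look something like the following. This is a hypothetical sketch, not the project's actual code: `Db` is a stand-in for the real H2-backed database class, and `Updater` is an assumed interface, but the shape of the change -- one open, one handle shared by all updaters -- is the point.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the real H2-backed database class.
// The expensive open (a couple of seconds for H2) happens exactly once,
// and the same handle is passed to every updater instead of each updater
// opening and closing its own connection.
class Db implements AutoCloseable {
    static int opens = 0;                 // count opens to illustrate the savings
    Db() { opens++; }                     // imagine ~2s of H2 startup cost here
    @Override public void close() { }
}

interface Updater {
    void update(Db db);                   // updaters receive the shared handle
}

public class SharedConnectionDemo {
    public static void main(String[] args) {
        List<Updater> updaters = Arrays.asList(
            db -> { /* NVD CVE update using db */ },
            db -> { /* CPE update using db */ }
        );
        try (Db db = new Db()) {          // open once...
            for (Updater u : updaters) {
                u.update(db);             // ...share with all updaters
            }
        }
        System.out.println("opens=" + Db.opens);  // prints opens=1
    }
}
```

A connection pool like HikariCP would achieve the same effect transparently, at the cost of one more dependency.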
Auto Commit
It appears that the H2 database is opened in auto-commit mode. If there are a lot of updates, it may be more efficient to disable auto-commit, open a transaction, and commit updates in large batches.
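For illustration, here is a minimal batching sketch. A plain list stands in for the JDBC statement so the batching logic itself is runnable; in the real code, `add` would map to `PreparedStatement.addBatch()`, `flush` to `executeBatch()` plus `Connection.commit()`, and the connection would have `setAutoCommit(false)`.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of batched commits. A plain list stands in for a JDBC
// PreparedStatement: add(...) mirrors addBatch(), and flush() mirrors
// executeBatch() followed by commit() on a non-auto-commit connection.
public class BatchWriter {
    private final int batchSize;
    private final List<String> pending = new ArrayList<>();
    int flushes = 0;                       // how many round-trips were made

    public BatchWriter(int batchSize) { this.batchSize = batchSize; }

    public void add(String row) {          // cf. stmt.addBatch()
        pending.add(row);
        if (pending.size() >= batchSize) {
            flush();
        }
    }

    public void flush() {                  // cf. stmt.executeBatch(); conn.commit();
        if (!pending.isEmpty()) {
            pending.clear();
            flushes++;
        }
    }
}
```

With 2,500 rows and a batch size of 1,000, this makes 3 commits instead of 2,500 auto-commits.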
Degree of Parallelism
I see that the NvdCveUpdater code has a MAX_THREAD_POOL_SIZE constant, which appears to be set to 3. I would prefer to see this value directly proportional to the number of cores (see
Runtime.availableProcessors()). Admittedly, I have not experimented with increasing it, but I can tell you that my Intel Core i7 CPU on my Windows 7 PC reports 8 logical processors, so using only 3 threads seems like a waste. Also note that on Solaris SPARC this returns the number of vCPUs (not cores); on a T5, for example, there are 8 vCPUs per core, whereas a modern Intel chip tends to have 2 vCPUs per core. Capping the value seems prudent to avoid effectively mounting a denial-of-service attack against NIST... Maybe something like: Max(Min(availableProcessors(), 7), 3). Naturally, parallelism is a general issue, not just for downloading NVD updates. (HikariCP has an excellent article about
pool sizing.)
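The proposed formula is trivial to express in code; this snippet just makes the suggested bounds concrete:

```java
public class PoolSizer {
    // Proposed formula from above: scale with the processor count, but
    // never fewer than 3 threads and never more than 7, to avoid
    // hammering NIST from machines with many cores.
    public static int poolSize(int availableProcessors) {
        return Math.max(Math.min(availableProcessors, 7), 3);
    }

    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        System.out.println("pool size = " + poolSize(cores));
    }
}
```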
Loading Web Resources in Parallel
Despite what the above says about capping parallelism, I noticed code in DownloadTask that appears to download two files serially rather than concurrently.
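A sketch of the concurrent alternative, with `downloadTo` as a placeholder for the actual HTTP fetch in DownloadTask and illustrative file names (I have not verified the exact names the updater uses):

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch: submit both downloads to an executor and wait for both,
// instead of fetching the second file only after the first completes.
// downloadTo(...) is a placeholder for the real fetch logic.
public class ParallelDownload {
    static String downloadTo(String name) {
        // placeholder for the actual HTTP download
        return "downloaded " + name;
    }

    public static void main(String[] args) throws Exception {
        List<Callable<String>> tasks = Arrays.asList(
            () -> downloadTo("nvd-modified-1.xml"),   // illustrative name
            () -> downloadTo("nvd-modified-2.xml")    // illustrative name
        );
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            for (Future<String> f : pool.invokeAll(tasks)) {  // runs both concurrently
                System.out.println(f.get());                  // propagates any failure
            }
        } finally {
            pool.shutdown();
        }
    }
}
```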
Lucene Optimizations
The Lucene wiki page
How to make searching faster suggests opening the IndexReader with readOnly=true -- that doesn't appear to be the case here. It also suggests sharing a single IndexSearcher across queries and across threads in your application. I'm not sure yet whether these suggestions apply, but if so, they are likely small adjustments.
I understand that you don't want to upgrade past Lucene 4.7.2 because that would mean sacrificing Java 6 compatibility. Unfortunately, that takes the #2 piece of advice from
How to make indexing faster off the table:
"Make sure you are using the latest version of Lucene." (You do see the irony in this, right?)
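The shared-searcher suggestion could look roughly like this against the Lucene 4.7 API. This is an untested sketch, not the project's actual code; note that in Lucene 4.x the readOnly flag from the wiki no longer exists because DirectoryReader is read-only by design, so only the searcher-sharing half of the advice still applies. IndexSearcher is documented as safe to use from multiple threads.

```java
import java.io.File;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

// Sketch (Lucene 4.7 API): open the reader once and share a single
// IndexSearcher across all queries and threads, rather than creating
// a new searcher per query.
public class SharedSearcher {
    private final DirectoryReader reader;
    private final IndexSearcher searcher;

    public SharedSearcher(File indexDir) throws Exception {
        Directory dir = FSDirectory.open(indexDir);
        this.reader = DirectoryReader.open(dir);   // read-only by design in 4.x
        this.searcher = new IndexSearcher(reader); // reused for every query
    }

    public TopDocs search(Query query, int n) throws Exception {
        return searcher.search(query, n);          // safe from multiple threads
    }

    public void close() throws Exception {
        reader.close();
    }
}
```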
Limit Updates to Daily
One trick that Maven uses for downloading artifacts from the Internet is to check only once daily. Perhaps the sensitivity to upstream changes could be made tunable here as well. Despite reading something about a 7-day period, it seems to check for updates on every build, and updates often actually are downloaded -- so perhaps NIST publishes changes frequently? When you have a project that you are building several dozen times a day, the delays from these arguably redundant checks start to add up.
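A Maven-style once-per-day guard is a one-liner around the lastChecked timestamp already visible in the debug log above. This is only a sketch of the idea, not the project's actual logic:

```java
import java.util.concurrent.TimeUnit;

// Sketch of a "check at most once per day" guard, driven by the same
// lastChecked timestamp shown in the debug log.
public class UpdateThrottle {
    static final long CHECK_INTERVAL_MS = TimeUnit.HOURS.toMillis(24);

    // Returns true only when the last check is older than the interval.
    static boolean shouldCheck(long lastCheckedMs, long nowMs) {
        return nowMs - lastCheckedMs >= CHECK_INTERVAL_MS;
    }

    public static void main(String[] args) {
        long last = 1441269696330L;  // "Last checked" from the log above
        long now  = 1441670763184L;  // "Now" from the log above
        System.out.println(shouldCheck(last, now));  // prints true: ~4.6 days apart
    }
}
```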
I certainly accept that there is a
nist-data-mirror project, and I have not yet tried it -- but I fully intend to.
Caching Vulnerability Results
Across repeated and related builds, the same artifacts are repeatedly checked for vulnerabilities. Are those results cached? For example, say the analysis identifies two CVEs for Spring-JMS-4.2.0 -- is that result cached so that time can be saved when the project is next built (or when another related project with the same transitive dependency is built)? Results could be cached until the CVE database is updated, the hash of the Spring-JMS-4.2.0 artifact changes, or a TTL expires.
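The invalidation rules above can be captured in the cache key itself. This is a hypothetical sketch (the class and method names are mine, not the project's): keying on the artifact hash plus the CVE database timestamp means a changed artifact or an updated database automatically misses, and a TTL could be layered on the same way.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of vulnerability-result caching. The key combines
// the artifact's hash with the CVE database timestamp, so entries are
// invalidated automatically when either the artifact or the database
// changes; a TTL check could be added on top.
public class VulnCache {
    private final Map<String, List<String>> cache = new HashMap<>();

    private static String key(String artifactSha1, long cveDbTimestamp) {
        return artifactSha1 + "@" + cveDbTimestamp;
    }

    public List<String> get(String artifactSha1, long cveDbTimestamp) {
        return cache.get(key(artifactSha1, cveDbTimestamp)); // null = miss
    }

    public void put(String artifactSha1, long cveDbTimestamp, List<String> cves) {
        cache.put(key(artifactSha1, cveDbTimestamp), cves);
    }
}
```

So the two CVEs found for Spring-JMS-4.2.0 would be stored under that jar's hash; the next build with the same hash and database timestamp gets a hit, while a database update changes the key and forces re-analysis.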
Any other ideas or observations?