Runtime Performance Observations

Anthony Whitford

Sep 8, 2015, 4:04:03 AM
to Dependency Check
I have started using this tool (primarily via the Maven plugin), and one complaint is that it is slow.
I am certainly OK with sacrificing a few seconds for the results of this analysis, but the amount of extra time is often more than I would expect, and I suspect there are opportunities to improve.  For example, consider the following output:

[INFO] Checking for updates
[DEBUG] Begin Engine Version Check
[DEBUG] Last checked: 1441269696330
[DEBUG] Now: 1441670763184
[DEBUG] Current version: 1.3.1-SNAPSHOT
[DEBUG] Upgrade not needed
[INFO] Check for updates complete (24492 ms)

24 seconds seems like an awfully long time just to check for an update.  (One would expect that this should be closer to 1 second.)


As a result, I have started to take a closer look at the implementation to better understand the logical flow and identify areas that may be improved to increase runtime performance.  Perhaps this thread can be used as a brainstorm to identify and prioritize opportunities.

(Not in a particular order yet.  And forgive me if I missed something in the code -- I am just getting started...)


H2 Database Open/Close

It appears that the H2 database is opened and closed multiple times, and that just opening a connection takes a couple of seconds.  It looks like this happens at least twice, if not more, so a couple of wasted seconds per open starts to add up.  The H2 Database Performance Tuning guide suggests avoiding repeatedly opening and closing database connections.  I suggest either refactoring the algorithm to open the database connection once and pass it to the Updaters, or using a connection pool (such as HikariCP).
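
To illustrate the pool idea, a minimal sketch using the standard HikariCP API -- the JDBC URL and pool size are placeholders, not values from the project:

    import java.sql.Connection;
    import java.sql.SQLException;
    import com.zaxxer.hikari.HikariConfig;
    import com.zaxxer.hikari.HikariDataSource;

    public class PooledDatabase {
        public static void main(String[] args) throws SQLException {
            HikariConfig config = new HikariConfig();
            config.setJdbcUrl("jdbc:h2:file:./data/cve"); // placeholder path
            config.setMaximumPoolSize(2); // e.g., one writer plus one reader
            HikariDataSource dataSource = new HikariDataSource(config);
            try {
                // Each Updater/Analyzer borrows a connection instead of
                // re-opening the H2 file every time.
                Connection conn = dataSource.getConnection();
                try {
                    // ... run updates/queries ...
                } finally {
                    conn.close(); // returns the connection to the pool
                }
            } finally {
                dataSource.close();
            }
        }
    }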


Auto Commit

It appears that the H2 database is opened in auto-commit mode.  If there are a lot of updates, it may be more efficient to open a transaction and commit the updates in large batches.
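
Roughly what I have in mind, as a sketch only -- the table and column names are invented:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;

    public class BatchedCommit {
        public static void main(String[] args) throws Exception {
            Connection conn = DriverManager.getConnection("jdbc:h2:file:./data/cve");
            try {
                conn.setAutoCommit(false); // one transaction, not a commit per row
                PreparedStatement ps = conn.prepareStatement(
                        "INSERT INTO cve_entry (id, description) VALUES (?, ?)");
                for (int i = 0; i < 10000; i++) {
                    ps.setString(1, "CVE-2015-" + i); // placeholder data
                    ps.setString(2, "...");
                    ps.addBatch();
                    if (i % 1000 == 999) {
                        ps.executeBatch(); // flush to the DB in large batches
                    }
                }
                ps.executeBatch();
                conn.commit();
                ps.close();
            } finally {
                conn.close();
            }
        }
    }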


Degree of Parallelism

I see the NvdCveUpdater code has a constant, MAX_THREAD_POOL_SIZE, which appears to be set to 3.  I would prefer to see this value directly proportional to the number of cores (see Runtime.availableProcessors()).  Admittedly, I have not experimented with increasing it, but the Core i7 CPU in my Windows 7 PC reports 8 logical processors, so using only 3 threads seems like a waste.  Note that on Solaris SPARC, availableProcessors() returns the number of vCPUs (not cores); a T5, for example, has 8 vCPUs per core, whereas a modern Intel chip tends to have 2 vCPUs per core.  Capping the formula seems prudent to avoid mounting a denial-of-service attack against NIST.  Maybe something like:  Max(Min(availableProcessors(), 7), 3).  Naturally, parallelism is a general issue, not just for downloading NVD updates.  (HikariCP has an excellent article about pool sizing.)
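
In code, the clamped formula is just (sketch):

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class PoolSizing {
        public static void main(String[] args) {
            // Proportional to the machine, but clamped so we never use fewer
            // than 3 or more than 7 threads against NIST.
            int cores = Runtime.getRuntime().availableProcessors();
            int threads = Math.max(Math.min(cores, 7), 3);
            ExecutorService pool = Executors.newFixedThreadPool(threads);
            // ... submit update tasks ...
            pool.shutdown();
        }
    }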


Loading Web Resources in Parallel

Despite what the above says about parallelism, I noticed code in DownloadTask that appears to load two files serially:

    Downloader.fetchFile(url1, first);
    Downloader.fetchFile(url2, second);

This makes me wonder whether the pattern should look more like Concurrent asynchronous HTTP exchanges.
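
For example, the two fetches could be submitted to a small executor and joined afterward -- a sketch, where Downloader.fetchFile is the existing method and everything else (URLs, files) is illustrative:

    import java.io.File;
    import java.net.URL;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class ParallelFetch {
        public static void main(String[] args) throws Exception {
            final URL url1 = new URL("https://example.org/nvdcve-2.0.xml.gz"); // placeholder
            final URL url2 = new URL("https://example.org/nvdcve-1.1.xml.gz"); // placeholder
            final File first = new File("first.xml.gz");
            final File second = new File("second.xml.gz");

            ExecutorService pool = Executors.newFixedThreadPool(2);
            Future<Void> f1 = pool.submit(new Callable<Void>() {
                public Void call() throws Exception {
                    Downloader.fetchFile(url1, first); // the existing call, now concurrent
                    return null;
                }
            });
            Future<Void> f2 = pool.submit(new Callable<Void>() {
                public Void call() throws Exception {
                    Downloader.fetchFile(url2, second);
                    return null;
                }
            });
            f1.get(); // wait for both downloads before processing
            f2.get();
            pool.shutdown();
        }
    }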


Lucene Optimizations

From reading How to make searching faster, it suggests opening the IndexReader with readOnly=true, which doesn't appear to be the case here.  It also suggests sharing a single IndexSearcher across queries and across threads in your application.  I'm not exactly sure yet whether these suggestions apply, but if so, they are likely small adjustments.
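
If they do apply, the shared-searcher part is small in Lucene 4.x terms -- a sketch, with the Directory being whatever the CPE index already uses:

    import java.io.IOException;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.store.Directory;

    public class SharedSearcher {
        // In Lucene 4.x a DirectoryReader is read-only by construction; the
        // remaining win is opening one searcher and sharing it across threads
        // (IndexSearcher is thread-safe for searching).
        private static IndexSearcher searcher;

        public static synchronized IndexSearcher get(Directory dir) throws IOException {
            if (searcher == null) {
                searcher = new IndexSearcher(DirectoryReader.open(dir));
            }
            return searcher;
        }
    }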

I understand that you don't want to upgrade past 4.7.2 because it would mean sacrificing Java 6 compliance.  Unfortunately, that takes the #2 piece of advice from How to make indexing faster off the table:  make sure you are using the latest version of Lucene.  (You do see some irony in this, right?)


Limit Updates to Daily

One trick Maven uses when downloading artifacts from the Internet is to check only once a day; perhaps the tool's sensitivity to upstream changes could similarly be made tunable.  Despite reading something about a 7-day period, it seems to check for updates on every build, and updates are frequently downloaded, which suggests NIST publishes changes constantly.  When you build a project several dozen times a day, the delays from these arguably redundant checks add up.
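
The gate itself is trivial; persisting the timestamp (a properties file, the H2 database, wherever) is the only real work.  A sketch:

    public class UpdateGate {
        private static final long ONE_DAY_MS = 24L * 60 * 60 * 1000;

        /** True when the last successful check is more than a day old. */
        public static boolean shouldCheckForUpdates(long lastCheckedMillis) {
            return System.currentTimeMillis() - lastCheckedMillis > ONE_DAY_MS;
        }
    }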

I certainly accept that there is a nist-data-mirror project; I have not yet tried it, but I fully intend to.


Caching Vulnerability Results

Across repeated and related builds, the same artifacts are checked for vulnerabilities over and over.  Are those results being cached?  For example, say the analysis identifies two CVEs for Spring-JMS-4.2.0 -- is that result cached so time can be saved the next time this project (or another project with the same transitive dependency) is built?  Results could be cached until the CVE database is updated, the hash of Spring-JMS-4.2.0 changes, or a TTL expires.
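
Conceptually, the cache key just combines the artifact hash with the state of the CVE data, so entries invalidate themselves when either side changes.  A sketch (all names invented, and a real version would persist to disk to survive across builds):

    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class ResultCache {
        // Key = artifact SHA-1 + NVD snapshot timestamp: a fresh data
        // download or a changed artifact automatically misses the cache.
        private final Map<String, List<String>> cache =
                new ConcurrentHashMap<String, List<String>>();

        public List<String> get(String sha1, long nvdTimestamp) {
            return cache.get(sha1 + ":" + nvdTimestamp);
        }

        public void put(String sha1, long nvdTimestamp, List<String> cves) {
            cache.put(sha1 + ":" + nvdTimestamp, cves);
        }
    }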


Any other ideas or observations?

Anthony Whitford

Sep 8, 2015, 4:31:45 AM
to Dependency Check
I should also add that I am definitely observing that NIST downloads are slow.  Downloading a 67k file took 12 seconds, for example.

It strikes me that the NIST site could use some tweaking in a big way.

Jeremy Long

Sep 8, 2015, 6:03:31 AM
to Anthony Whitford, Dependency Check
Great observations and suggestions.

1. Database connections - I agree a better approach could be used. My initial thought is that the Engine would open the DB and the Analyzer interface would be updated to accept the connection during the initialize phase. This might be done more efficiently with two database connections: one using transactions during the update phase, and a second read-only connection for the rest of the processing.
2. Autocommit - See above, completely agree.
3. Parallelism - I like your suggestion. In addition, I have been considering extending the parallelism to the actual analyzers, not just the update code.
4. Parallelism of downloads - I'm actually hoping the NVD will update their schema soon to include CVSSv3 and one of the other suggestions I've made. Specifically, the NVD 2.0 schema doesn't include the 'all previous versions' flag, so dependency-check downloads both the 1.1 and 2.0 data files - the only item used from the 1.1 data is that flag. A schema update would cut the number of downloads needed, so I might treat this as a lower priority.
5. Lucene - the index is an in-memory index, so I don't think some of the suggestions apply or help. However, I am working on a different approach that may speed this up (both the creation of the index and the subsequent queries).
6. I'm okay with daily updates - maybe this could be configurable with the default being once per day.
7. Caching results - currently this only occurs within the Maven aggregate goal. Expanding the caching and using the cache across multiple builds would definitely be a huge performance boost. I had thought of this before, but because I was allowing updates from the NVD on every run I never pursued it; converting to daily updates would solve that and allow the cache to be used.

8. Database updates - The algorithm for updating individual rows in the DB could use some rework. First, I wonder whether it would be faster to truncate the tables and rebuild once we pass the 7-day marker. Currently, we check whether the CVE exists in the H2 database; if it does, we update it, otherwise we insert it - and this is really only needed when processing the 'modified' NVD file(s). A sketch of an alternative follows.
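
For the update-else-insert dance, H2's MERGE statement could collapse the two round trips into one - a sketch with an invented table and columns:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;

    public class MergeExample {
        public static void main(String[] args) throws Exception {
            Connection conn = DriverManager.getConnection("jdbc:h2:file:./data/cve");
            try {
                // H2's MERGE updates the row when the KEY column matches and
                // inserts it otherwise, replacing the SELECT-then-UPDATE/INSERT.
                PreparedStatement ps = conn.prepareStatement(
                        "MERGE INTO cve_entry (id, description) KEY (id) VALUES (?, ?)");
                ps.setString(1, "CVE-2015-0001"); // placeholder data
                ps.setString(2, "...");
                ps.executeUpdate();
                ps.close();
            } finally {
                conn.close();
            }
        }
    }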

I think these items should be added as issues in the repo so we can better track the individual items.

--Jeremy

On Tue, Sep 8, 2015 at 4:31 AM, Anthony Whitford <ant...@whitford.com> wrote:
I should also add that I am definitely observing that NIST downloads are slow.  Downloading a 67k file took 12 seconds, for example.

It strikes me that the NIST site could use some tweaking in a big way.


colezlaw

Sep 9, 2015, 7:22:36 AM
to Dependency Check, ant...@whitford.com
Also, unarchiving archives: if an archive has no internal archives, we don't need to unpack it to analyze it - we can just read its contents in place (e.g., a normal JAR, ZIP, or tarball doesn't need to be unpacked to analyze its contents UNLESS it has embedded JARs, ZIPs, or tarballs).
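
A sketch of the in-place check using java.util.zip (the method name and extension list are invented):

    import java.io.IOException;
    import java.util.Enumeration;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipFile;

    public class ArchiveScan {
        /** True if the archive holds a nested archive that forces extraction. */
        public static boolean needsExtraction(String path) throws IOException {
            ZipFile zip = new ZipFile(path);
            try {
                Enumeration<? extends ZipEntry> entries = zip.entries();
                while (entries.hasMoreElements()) {
                    String name = entries.nextElement().getName().toLowerCase();
                    if (name.endsWith(".jar") || name.endsWith(".zip")
                            || name.endsWith(".tar") || name.endsWith(".tar.gz")) {
                        return true; // embedded archive: fall back to unpacking
                    }
                }
                return false; // flat archive: entries can be read in place
            } finally {
                zip.close();
            }
        }
    }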