Duplicate Class nodes with same FQN?

Alistair Jones

Sep 5, 2015, 5:50:09 AM
to jQAssistant
Hi,

I'm analysing the Neo4j codebase (what else?) so I did a simple scan in the repository root like this:

$ bin/jqassistant.sh scan --files ~/projects/neo4j/neo4j

This made a Neo4j database as expected, but when I looked more closely, I found many more Class nodes than I was expecting. I ran this query to see what's going on:
MATCH (n:Class) RETURN COUNT(n), n.fqn ORDER BY COUNT(n) DESC LIMIT 25
COUNT(n)  n.fqn
828       null
13        org.apache.commons.collections.FastHashMap
13        org.apache.commons.collections.FastHashMap$Values
13        org.apache.commons.collections.BufferUnderflowException
13        org.apache.commons.collections.FastHashMap$CollectionView$CollectionViewIterator
13        org.apache.commons.collections.FastHashMap$1
13        org.apache.commons.collections.FastHashMap$CollectionView
13        org.apache.commons.collections.FastHashMap$EntrySet
13        org.apache.commons.collections.ArrayStack
13        org.apache.commons.collections.FastHashMap$KeySet
11        org.neo4j.cypher.LabelScanHintException
11        org.neo4j.cypher.EntityNotFoundException
11        org.neo4j.cypher.internal.helpers.TypeSafeMathSupport$class
11        org.neo4j.cypher.internal.helpers.CastSupport$
11        org.neo4j.cypher.internal.helpers.CollectionSupport$$anonfun$castToIterable$1$$anonfun$applyOrElse$1
11        org.neo4j.cypher.QueryStatistics$
11        org.neo4j.cypher.CypherTypeException$
11        org.neo4j.cypher.InvalidSemanticsException
11        org.neo4j.cypher.ParameterNotFoundException
11        org.neo4j.cypher.internal.helpers.CollectionSupport$class
11        org.neo4j.cypher.PatternException
11        org.neo4j.cypher.internal.helpers.CollectionSupport$$anonfun$asCollectionOf$1
11        org.neo4j.cypher.IndexHintException
11        org.neo4j.cypher.internal.helpers.CollectionSupport$NoValidValuesExceptions
11        org.neo4j.cypher.internal.helpers.CollectionSupport$$anonfun$castToIterable$1


This shows that there are plenty of nodes with the same FQN. In total there are 265560 Class nodes but only 72530 unique FQNs.

This makes it really difficult to reason about how the classes are connected.

I suspect that the duplicates arise when the same class is referred to from multiple separate jars, but that's only a guess.
Any suggestions for what I'm doing wrong?

In case it's important, I'm building jqassistant from the HEAD of master. But the behaviour seems identical in 1.0.0.

thanks!

-Alistair

Michael Hunger

Sep 5, 2015, 6:09:33 AM
to jqass...@googlegroups.com
Cool stuff Alistair.

Classes in different jar artifacts can be different: imagine scanning a Maven repo with different versions, or a project which incorrectly refers to different versions of the same library.

It would probably make sense to unify them if the artifacts/classes have the same name and, for jars, the same hash.

Right now it might be a solution to limit the query to a single artifact?

Are you querying for a single module or across the project?

Otherwise Dirk, is it possible to mark the classes differently that are declared in a module directly?


Sent from my iPhone
--
You received this message because you are subscribed to the Google Groups "jQAssistant" group.
To unsubscribe from this group and stop receiving emails from it, send an email to jqassistant...@googlegroups.com.
To post to this group, send email to jqass...@googlegroups.com.
Visit this group at http://groups.google.com/group/jqassistant.
To view this discussion on the web visit https://groups.google.com/d/msgid/jqassistant/78ff3396-cc6b-4920-9e87-cbbaafb9f48a%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Dirk Mahler

Sep 5, 2015, 6:42:11 AM
to jQAssistant
Hi Alistar,

thanks for playing with jQA ;-) I think the result of the following queries will make the data structures more clear:

match (a:Artifact)-[:CONTAINS]->(t:Type) return a.fileName, t.fqn, labels(t) limit 20

and

match (a:Artifact)-[:REQUIRES]->(t:Type) return a.fileName, t.fqn, labels(t) limit 20

In your case you'll see that a specific class is contained in only one JAR file, but plenty of other JARs require it. The required type nodes are just empty templates (i.e. they carry only the Type label and the fqn property), whereas the contained type nodes also carry a label telling you which concrete type you're looking at (i.e. Class, Interface, Enum or Annotation) and all the properties and relations needed to describe the type.
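
To make the contained/required distinction concrete, here is a minimal Python sketch. It is a toy in-memory stand-in, not jQAssistant code: the artifact names (`a.jar`, `b.jar`, `c.jar`), the FQNs and both helper functions are hypothetical. It only illustrates that one contained node plus one required stub per referencing artifact leaves several type nodes sharing an FQN.

```python
from collections import Counter

# Toy stand-in for the scanned graph: (artifact, relation, fqn, labels).
# CONTAINS nodes carry a concrete label such as "Class"; REQUIRES nodes are
# bare "Type" stubs. All names here are made up for illustration.
TYPE_NODES = [
    ("a.jar", "CONTAINS", "org.example.Foo", {"Type", "Class"}),
    ("b.jar", "REQUIRES", "org.example.Foo", {"Type"}),
    ("c.jar", "REQUIRES", "org.example.Foo", {"Type"}),
    ("b.jar", "CONTAINS", "org.example.Bar", {"Type", "Class"}),
]


def count_nodes_per_fqn(nodes):
    """Count how many type nodes share each FQN, regardless of artifact."""
    return Counter(fqn for _artifact, _relation, fqn, _labels in nodes)


def containing_artifacts(nodes, fqn):
    """Return the artifacts that actually contain (define) the given FQN."""
    return sorted(a for a, rel, f, _labels in nodes if rel == "CONTAINS" and f == fqn)


if __name__ == "__main__":
    print(count_nodes_per_fqn(TYPE_NODES))
    print(containing_artifacts(TYPE_NODES, "org.example.Foo"))
```

In this toy data `org.example.Foo` appears on three nodes, but only `a.jar` contains it; the other two are stubs owned by the artifacts that reference it.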

As Michael pointed out, it is hard to decide which of the scanned JAR artifacts will be deployed together in a runtime environment; therefore the resolution is scoped to the containing artifact while scanning. If you know that you scanned a bunch of files that are part of one application (which is the case for you if you took the Neo4j JARs), then you can apply a so-called concept that will resolve the dependencies. After scanning, run the following command:

jqassistant.sh analyze -concepts classpath:resolve

As a result you'll see that now relations like DEPENDS_ON, INVOKES, READS and WRITES have been created between type nodes which are not contained in the same artifact.  
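
Conceptually, the resolution step can be pictured as follows. This is only a sketch under the assumption that all scanned artifacts share one classpath; the real concept is implemented as Cypher queries, and the function name, the dictionaries and the "first artifact in sorted order wins" tie-break are all hypothetical simplifications.

```python
def resolve_required_types(contains, requires):
    """Map each (requiring artifact, fqn) pair to the artifact containing that fqn.

    contains: dict artifact -> set of FQNs defined in it
    requires: dict artifact -> set of FQNs referenced but not defined in it
    Requirements that no scanned artifact satisfies are left out of the result.
    """
    # Index FQN -> containing artifact. If two artifacts contain the same FQN,
    # the first one in sorted order wins; a real classpath is more subtle.
    owner = {}
    for artifact in sorted(contains):
        for fqn in contains[artifact]:
            owner.setdefault(fqn, artifact)

    resolved = {}
    for artifact, fqns in requires.items():
        for fqn in fqns:
            if fqn in owner:
                resolved[(artifact, fqn)] = owner[fqn]
    return resolved


# Hypothetical example data.
EXAMPLE_CONTAINS = {"a.jar": {"org.example.Foo"}, "b.jar": {"org.example.Bar"}}
EXAMPLE_REQUIRES = {"b.jar": {"org.example.Foo", "org.example.Missing"}}

if __name__ == "__main__":
    print(resolve_required_types(EXAMPLE_CONTAINS, EXAMPLE_REQUIRES))
```

Here the stub for `org.example.Foo` in `b.jar` resolves to `a.jar`, while `org.example.Missing` stays unresolved because nothing scanned contains it.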

Note: The situation is different if you're scanning a Maven project using the jQAssistant Maven plugin - in this case the dependencies between artifacts are known thus types referenced by a class can be resolved directly.

Regards,

Dirk 

Dirk Mahler

Sep 5, 2015, 7:40:18 AM
to jQAssistant
Hi Alistair,

there's a little but important typo in my last answer, the command for resolving the classpath should be

jqassistant.sh analyze -concepts classpath:Resolve

Regards,

Dirk

Alistair Jones

Sep 6, 2015, 5:48:39 AM
to jQAssistant
Thank you Dirk (and Michael) for being so helpful.

I think I understand the model now.

I tried running `jqassistant.sh analyze -concepts classpath:Resolve`, but unfortunately the queries that run don't seem practical on a code base of this size. I left it running for several hours, using up to 8 CPUs, but it eventually exceeded the GC overhead limit while running the 4th query:

            MATCH
              (m:Method)-[i:INVOKES]->(m1:Method)-[:RESOLVES_TO]->(m2:Method)
            MERGE
              (m)-[:INVOKES{lineNumber:i.lineNumber,resolved:true}]->(m2)
            RETURN
              count(i) as ResolvedInvocations
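
For readers less familiar with Cypher, the effect of this query can be paraphrased in Python. This is a hedged sketch, not what Neo4j executes: the function name and the tuple/dict representation are hypothetical, and a set is used to mimic MERGE not creating a second identical relationship.

```python
def resolve_invocations(invokes, resolves_to):
    """Paraphrase of the MATCH ... MERGE ... RETURN count(i) query.

    invokes: list of (caller, callee, line_number) tuples, one per INVOKES edge
    resolves_to: dict mapping a stub method to the list of methods it resolves to
    Returns (merged_edges, matched_rows): the set of resolved INVOKES edges and
    the number of matched rows, i.e. what count(i) would report.
    """
    merged = set()
    matched = 0
    for caller, callee, line in invokes:
        for target in resolves_to.get(callee, ()):
            matched += 1
            # MERGE only creates the edge if an identical one is not there yet.
            merged.add((caller, target, line))
    return merged, matched


# Hypothetical example data.
EXAMPLE_INVOKES = [
    ("A.run", "stub:B.call", 10),
    ("A.run", "stub:B.call", 10),
    ("A.run", "stub:C.other", 12),
]
EXAMPLE_RESOLVES_TO = {"stub:B.call": ["B.call"]}

if __name__ == "__main__":
    print(resolve_invocations(EXAMPLE_INVOKES, EXAMPLE_RESOLVES_TO))
```

Every INVOKES edge in the store has to be joined against RESOLVES_TO, so the amount of work grows with the total number of invocations in the scanned code base.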

I also tried using the maven plugin, but that fails consistently for me in the neo4j-server module like this:

Caused by: org.apache.maven.plugin.MojoExecutionException: Cannot re-use cached store instance, switch to store life cycle MODULE
at com.buschmais.jqassistant.scm.maven.AbstractMojo.getStore(AbstractMojo.java:297)
at com.buschmais.jqassistant.scm.maven.AbstractMojo.execute(AbstractMojo.java:267)
at com.buschmais.jqassistant.scm.maven.AbstractMojo.execute(AbstractMojo.java:247)
at com.buschmais.jqassistant.scm.maven.AbstractModuleMojo.doExecute(AbstractModuleMojo.java:16)
at com.buschmais.jqassistant.scm.maven.AbstractMojo.execute(AbstractMojo.java:129)
at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:106)
at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:208)

I can see where that exception comes from, but I don't understand the code well enough to know what to do about it.

So I'm afraid I've diverged this thread into two separate problems - sorry!

Back to the original question, I do see why you chose to model class nodes the way you have. It will help with the situations Michael describes where there are potential classpath conflicts. However, the trade-off is increased complexity for simpler cases where it's safe to assume that everything will be loaded from the same classloader. I wonder whether you've found that this trade-off is worth it?

cheers,

-Alistair

Michael Hunger

Sep 6, 2015, 6:03:52 AM
to jqass...@googlegroups.com
I think it would be good to have a simpler mode for scanning that can be enabled (or could be the default), where we assume that all classes with the same fqn are actually the same and resolve to the same node.

I can look into the resolve query, it might be an eager pipe issue

Dirk Mahler

Sep 6, 2015, 6:14:42 AM
to jQAssistant
Hi Alistair,

I've tried this concept on some larger code bases (and I think this also included a Neo4j distribution) and it worked for me within an acceptable time - I will verify this during the next days and give you some feedback.

Nevertheless I see that it could be an option to provide a configuration to the command line scanner telling the Java plugin that all scanned artifacts belong to the same classpath and shall already be resolved during scan - that would make the data model easier to understand and it would be faster.

Regarding the Maven issue: by default the jQAssistant Maven plugin creates a Neo4j instance at startup and tries to reuse it for all modules that are part of the reactor build. In some projects this leads to problems, because Maven isolates plugin configurations using a quite sophisticated classloader structure, so the globally cached instance cannot always be reused. That is when this error message is reported (I confess: it could be made more comprehensible for the user). There are two solutions for that situation:

1. If you've configured the jQA Maven plugin in your (parent) pom.xml file then you might try to declare it as an extension:
  <plugin>
   <groupId>com.buschmais.jqassistant.scm</groupId>
   <artifactId>jqassistant-maven-plugin</artifactId>
   <version>1.0.0</version>
   <extensions>true</extensions>
...
  </plugin>

2. You can trigger the scan using a property which will enforce starting/stopping the Neo4j instance for each module in the reactor:

mvn jqassistant:scan -Djqassistant.store.lifecycle=MODULE


Hope one of these two options will help - if you see room for improvements please let me know.

Cheers,

Dirk

Michael Hunger

Sep 6, 2015, 6:15:43 AM
to jqass...@googlegroups.com
An idea would be to create a project with a simple pom that depends on org.neo4j:neo4j:2.2.5, run mvn dependency:copy-dependencies and scan target/dependency,
or just scan the lib and system/lib directories of Neo4j-Server.

then the jars should only occur once.

Dirk Mahler

Sep 6, 2015, 6:33:21 AM
to jQAssistant
Hi Alistar,

just scanned the Neo4j 2.2.5 community distribution (including some extra unmanaged extension stuff in the plugins directory) and applied the concepts - the latter took about 5 minutes (using jQA 1.1.0-SNAPSHOT which itself is based on Neo4j 2.2.5) - will try it with jQA 1.0.0 now (I assume you're using that version). 

It looks to me that there might be some significant differences in our system configurations - I'm on a 3-year-old notebook with an Intel i5 CPU (2 cores, 2.5 GHz), 8 GB RAM and a 512 GB hard disk, using Windows 7 and Java 8.

Cheers,

Dirk

Michael Hunger

Sep 6, 2015, 6:45:07 AM
to jqass...@googlegroups.com
Right, Alistair scanned the Neo4j build, i.e. github checkout, not server.

Which results in this:



Dirk Mahler

Sep 6, 2015, 6:51:53 AM
to jQAssistant

Hi Alistar,

Sorry, I misspelled your name for the second time... You're allowed to do the same with my name from now on ;-)
 

> just scanned the Neo4j 2.2.5 community distribution (including some extra unmanaged extension stuff in the plugins directory) and applied the concepts - the latter took about 5 minutes (using jQA 1.1.0-SNAPSHOT which itself is based on Neo4j 2.2.5) - will try it with jQA 1.0.0 now (I assume you're using that version).
>
> It looks to me that there might be some significant differences in our system configurations - I'm on a 3-year-old notebook with an Intel i5 CPU (2 cores, 2.5 GHz), 8 GB RAM and a 512 GB hard disk, using Windows 7 and Java 8.


I executed the same workflow using jQA 1.0.0/Neo4j 2.2.1 and applying the concepts took a bit less than 15 minutes -  i.e. the same time as with jQA 1.1.0 (5 minutes were a typo, should have been 15 - I should write my posts more carefully...).

Cheers,

Dirk

Alistair Jones

Sep 6, 2015, 2:48:32 PM
to jQAssistant


On Sunday, September 6, 2015 at 11:51:53 AM UTC+1, Dirk Mahler wrote:
 
> I executed the same workflow using jQA 1.0.0/Neo4j 2.2.1 and applying the concepts took a bit less than 15 minutes
 
As Michael said, I'm scanning the source repo, not a binary distribution. To be precise, I'm scanning this:
Which is just the 2.3 branch plus one commit to bring in the jQAssistant plugin.
Also note that I'm using version 1.1.0-SNAPSHOT of the plugin.

Anyway, I ran this:
mvn jqassistant:scan -Djqassistant.store.lifecycle=MODULE

Small note: I was looking for a goal like this that _just_ does a scan but I couldn't find one in the getting started docs at http://jqassistant.org/get-started/

Anyway, it's super fast - just 2 minutes 20 seconds for me.

Another small thing is that the store lifecycle problem means that `mvn jqassistant:server` doesn't work (same exception as before). I tried with the command line option and <extensions>true</extensions> but neither helped. Obviously it's easy to point any Neo4j server at the store directory, so I just did that instead.

I found the database had 452558 nodes and 2166912 relationships, which seems about right.
I ran my original query again, and it all looks great - only one node per FQN. So I'm a happy bunny - thank you!

Just one more curiosity in the model - I noticed that some :Class nodes don't have an fqn property. I ran this query to find out what they represent:

MATCH (n:Class) WHERE NOT HAS(n.fqn) RETURN COUNT(n), n.name


COUNT(n)  n.name
3         [3]
12        [0]
193       expected
1         [5]
134       value
6         [2]
3         [4]
9         [1]

Returned 8 rows in 3578 ms.

That didn't explain much for me. Maybe the class label is reused for a slightly different concept?

Dirk Mahler

Sep 7, 2015, 6:37:46 AM
to jQAssistant
Hi Alistair,



> Anyway, I ran this:
> mvn jqassistant:scan -Djqassistant.store.lifecycle=MODULE
>
> Small note: I was looking for a goal like this that _just_ does a scan but I couldn't find one in the getting started docs at http://jqassistant.org/get-started/


I'll see whether I can integrate this into the guide.
 
 
> Anyway, it's super fast - just 2 minutes 20 seconds for me.

Thanks to the makers of the database ;-)
 

> Another small thing is that the store lifecycle problem means that `mvn jqassistant:server` doesn't work (same exception as before). I tried with the command line option and <extensions>true</extensions> but neither helped. Obviously it's easy to point any Neo4j server at the store directory, so I just did that instead.

I'll try that - it should work by design but you never know
 

> I found the database had 452558 nodes and 2166912 relationships, which seems about right.
> I ran my original query again, and it all looks great - only one node per FQN. So I'm a happy bunny - thank you!

Just trying to imagine what a happy bunny looks like... ;-)
 

> Just one more curiosity in the model - I noticed that some :Class nodes don't have an fqn property. I ran this query to find out what they represent:
>
> MATCH (n:Class) WHERE NOT HAS(n.fqn) RETURN COUNT(n), n.name
>
> COUNT(n)  n.name
> 3         [3]
> 12        [0]
> 193       expected
> 1         [5]
> 134       value
> 6         [2]
> 3         [4]
> 9         [1]
>
> Returned 8 rows in 3578 ms.
>
> That didn't explain much for me. Maybe the class label is reused for a slightly different concept?

That's correct - jQA makes heavy use of label combinations. The label "Class" is currently used as a classifier for both "Type" and "Value" nodes. So what you're looking for are actually not "Class" but "Type" nodes, which may be qualified with either "Class", "Interface", "Enum" or "Annotation". In other words: each "Type" node should have the fqn property.
The "Value" label is used for annotation values, e.g. to represent @MyAnnotation(Something.class).
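
The label-combination idea can be illustrated with a small Python sketch. The node dictionaries and the `with_labels` helper are hypothetical and only mimic how a query such as MATCH (n:Type:Class) narrows the match compared with MATCH (n:Class).

```python
# Toy nodes: each has a set of labels plus some properties (names are made up).
NODES = [
    {"labels": {"Type", "Class"}, "fqn": "org.example.Foo"},
    {"labels": {"Type", "Interface"}, "fqn": "org.example.Api"},
    {"labels": {"Value", "Class"}, "name": "value"},
]


def with_labels(nodes, *labels):
    """Return the nodes that carry all of the given labels."""
    wanted = set(labels)
    return [node for node in nodes if wanted <= node["labels"]]


if __name__ == "__main__":
    # Matching on "Class" alone also returns the annotation value node,
    # which has a name but no fqn.
    print(with_labels(NODES, "Class"))
    # Requiring "Type" as well leaves only nodes that have an fqn.
    print(with_labels(NODES, "Type", "Class"))
```

In this toy data, filtering on "Class" alone returns two nodes, one of them without an fqn, whereas filtering on "Type" and "Class" together returns only the node for `org.example.Foo`.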

Cheers,

Dirk
 

Michael Hunger

Sep 11, 2015, 11:51:11 AM
to jqass...@googlegroups.com
Hey Alistair, did what Dirk said help you to resolve the issue? Otherwise we can also have a look at work.

Probably right now it's easiest to just scan a binary distribution like a snapshot build?

Until that "single-class-instance for whole project" mode is added?

Michael
