Hi Riyad,
I recommend searching for "syntactic parsing evaluation metrics" or a related query; you will probably get relevant results. I found a seemingly good answer on StackExchange:
A quick summary: in the PARSEVAL metric, a system's candidate parse is compared to a human-annotated reference parse. Each node in the tree is turned into a label of the form "node:every-terminal-below-that-node". (Note that you still need the terminals to compute the metric; constituent labels alone would be ambiguous, since a sentence tends to contain several constituents of the same type.) For example, the phrase "my great grandfather", parsed as "NP[ DET[my] NP[ JJ[great] NN[grandfather] ] ]", would yield the following labels:
det:my
jj:great
nn:grandfather
np:great,grandfather
np:my,great,grandfather
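As a minimal sketch of that transformation (the bracket format and the helper name `extract_labels` are my own assumptions, not a standard API), one could walk the bracketed string recursively and emit one label per node:

```python
import re

def extract_labels(parse):
    """Return the set of 'tag:terminal,...' labels, one per node.

    Assumes the bracket format used above, e.g.
    "NP[ DET[my] NP[ JJ[great] NN[grandfather] ] ]".
    (Hypothetical helper, for illustration only.)
    """
    # Tokens are either "TAG[", "]", or a terminal word.
    tokens = re.findall(r'\w+\[|\]|[^\s\[\]]+', parse)
    pos = 0

    def parse_node():
        nonlocal pos
        tag = tokens[pos][:-1].lower()  # strip the trailing '['
        pos += 1
        terminals, labels = [], []
        while tokens[pos] != ']':
            if tokens[pos].endswith('['):       # child constituent
                child_terms, child_labels = parse_node()
                terminals += child_terms
                labels += child_labels
            else:                               # terminal word
                terminals.append(tokens[pos].lower())
                pos += 1
        pos += 1  # consume the closing ']'
        labels.append(f"{tag}:{','.join(terminals)}")
        return terminals, labels

    return set(parse_node()[1])

labels = extract_labels("NP[ DET[my] NP[ JJ[great] NN[grandfather] ] ]")
# 'labels' is exactly the five-label set listed above.
```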
Once a set of labels of this form has been generated for both the candidate parse and the reference parse, the similarity between the two can be computed with standard Information Retrieval metrics: precision (the fraction of the candidate's labels that also appear in the reference) and recall (the fraction of the reference's labels that the candidate recovers). Any spurious labels in the candidate are penalized automatically, since they lower precision.
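The scoring step itself is then a couple of set operations. A minimal sketch, assuming each parse has already been reduced to a set of "tag:terminals" strings as above (the function name and the example sets are mine, for illustration):

```python
def parseval_scores(candidate, reference):
    """Precision, recall, and F1 between two label sets."""
    shared = len(candidate & reference)
    precision = shared / len(candidate) if candidate else 0.0
    recall = shared / len(reference) if reference else 0.0
    f1 = 2 * precision * recall / (precision + recall) if shared else 0.0
    return precision, recall, f1

# A hypothetical candidate that missed the inner NP bracketing:
cand = {"det:my", "nn:grandfather", "np:my,great,grandfather"}
ref = {"det:my", "jj:great", "nn:grandfather",
       "np:great,grandfather", "np:my,great,grandfather"}

p, r, f = parseval_scores(cand, ref)
# p = 1.0 (everything it proposed is correct), r = 0.6 (it found 3 of 5 labels)
```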
Since in your case you only need to measure similarity, without any notion of ground-truth parse quality, you don't need human-annotated parses: you can compute precision and recall between arbitrary pairs of trees, which seems to match the use case you described in your question.
Hope this helps, cheers!
Jordi