Syntactic similarity of sentences

Riyad Parvez

Nov 4, 2016, 6:16:30 PM
to nltk-...@googlegroups.com
Hi,

I'm interested in finding how syntactically similar two different sentences are. I'm looking for a metric that gives a normalized similarity score between constituency parse trees (e.g., 1: same syntactic structure, 0: totally different).

For example:
"I need to ride the bicycle."

             ROOT                        
              |                           
              S                          
  ____________|________________________   
 |            VP                       | 
 |    ________|____                    |  
 |   |             S                   | 
 |   |             |                   |  
 |   |             VP                  | 
 |   |     ________|___                |  
 |   |    |            VP              | 
 |   |    |    ________|___            |  
 NP  |    |   |            NP          | 
 |   |    |   |         ___|_____      |  
PRP VBP   TO  VB       DT        NN    . 
 |   |    |   |        |         |     |  
 I  need  to ride     the     bicycle  .


"She needs to pass the exam."

              ROOT                     
               |                        
               S                       
  _____________|_____________________   
 |             VP                    | 
 |     ________|____                 |  
 |    |             S                | 
 |    |             |                |  
 |    |             VP               | 
 |    |     ________|___             |  
 |    |    |            VP           | 
 |    |    |    ________|___         |  
 NP   |    |   |            NP       | 
 |    |    |   |         ___|___     |  
PRP  VBZ   TO  VB       DT      NN   . 
 |    |    |   |        |       |    |  
She needs  to pass     the     exam  .

The two sentences above are not semantically similar, or even lexically similar, but they are syntactically similar. Is there a metric that quantifies the similarity between sentences based on the constituency parse tree alone, ignoring the terminals?

Thanks,
Riyad

Jordi Carrera

Nov 8, 2016, 6:54:07 AM
to nltk-users
Hi Riyad,

I recommend searching for "syntactic parsing evaluation metrics" or a related query; you will probably get relevant results. I found a seemingly good answer on StackExchange:

A quick summary: in the PARSEVAL benchmark, a system's candidate parse is compared to a human-annotated parse node by node. Each node in the tree is transformed into a label of the type "node:every-terminal-below-that-node". Note that you still need the terminals to calculate the metric; using constituent labels alone would be ambiguous, because sentences tend to contain several constituents of the same type. For example, a phrase like "my great grandfather", parsed as "NP[ DET[my] NP[ JJ[great] NN[grandfather] ] ]", would be transformed into the following labels:

det:my
jj:great
nn:grandfather
np:great,grandfather
np:my,great,grandfather

Once a set of labels of this form has been generated for both the candidate parse and the reference parse, the similarity between the two can be calculated using standard Information Retrieval metrics: Precision (the share of the candidate's labels that also appear in the reference) and Recall (the share of the reference's labels that also appear in the candidate). Labels that appear in only one of the two trees lower the score.
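
To make the label idea concrete, here is a minimal Python sketch of the scheme described above. It is only an illustration under a few assumptions: trees are written as bracketed strings like "(NP (DET my) ...)", and the helper names parse, labels and precision_recall are made up for this example (they are not NLTK functions). If you already work with NLTK, nltk.Tree.fromstring can read the same bracketed notation, so the hand-written parser here is just to keep the example self-contained.

```python
from collections import Counter

def parse(s):
    """Parse a bracketed tree string into nested (label, children) tuples."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        pos += 1                       # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])   # a terminal (word)
                pos += 1
        pos += 1                       # skip ")"
        return (label, children)

    return node()

def labels(tree):
    """Count one 'label:word,word,...' entry for every node in the tree."""
    out = Counter()

    def walk(n):
        if isinstance(n, str):         # terminal: contributes its own word
            return [n]
        label, children = n
        words = [w for child in children for w in walk(child)]
        out[label.lower() + ":" + ",".join(words)] += 1
        return words

    walk(tree)
    return out

def precision_recall(candidate, reference):
    """Precision and recall of the candidate's labels against the reference's."""
    cand, ref = labels(candidate), labels(reference)
    shared = sum((cand & ref).values())        # labels present in both trees
    return shared / sum(cand.values()), shared / sum(ref.values())

reference = parse("(NP (DET my) (NP (JJ great) (NN grandfather)))")
candidate = parse("(NP (NP (DET my) (JJ great)) (NN grandfather))")

print(sorted(labels(reference)))
print(precision_recall(candidate, reference))  # (0.8, 0.8)
```

The candidate here groups "my great" instead of "great grandfather", so four of its five labels match the reference and one does not, in both directions.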

Since in your case you only need to measure similarity, without any particular notion of parse quality, you don't need human-annotated parses: you can calculate Precision and Recall between arbitrary pairs of trees, which seems to match the use case you described in your question.
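
For the "ignore the terminals" part of the question, one possible variation (an assumption on my side, not something taken from the StackExchange answer) is to identify each node by the word positions it covers rather than by the words themselves, i.e. as (label, start, end). The Python sketch below does that and combines Precision and Recall into an F1 score between 0 and 1. The names brackets, similarity and keep_pos_tags are hypothetical, and POS tags are left out by default, which as far as I know is the usual convention when counting brackets.

```python
import re
from collections import Counter

def brackets(s, keep_pos_tags=False):
    """Count (label, start, end) spans for the constituents of a bracketed parse."""
    out = Counter()
    stack = []   # open nodes as [label, start position, has child constituent]
    i = 0        # position of the next word
    for m in re.finditer(r"\(\s*([^\s()]+)|(\))|([^\s()]+)", s):
        label, close, word = m.groups()
        if label:                      # "(LABEL" opens a node
            stack.append([label, i, False])
        elif word:                     # a terminal: only its position matters
            i += 1
        else:                          # ")" closes the innermost open node
            label, start, has_child = stack.pop()
            if stack:
                stack[-1][2] = True
            # nodes without child constituents are POS tags
            if has_child or keep_pos_tags:
                out[(label, start, i)] += 1
    return out

def similarity(a, b, keep_pos_tags=False):
    """F1 over shared labelled spans: 1.0 = same brackets, 0.0 = none shared."""
    x, y = brackets(a, keep_pos_tags), brackets(b, keep_pos_tags)
    shared = sum((x & y).values())
    if shared == 0:
        return 0.0
    precision = shared / sum(x.values())
    recall = shared / sum(y.values())
    return 2 * precision * recall / (precision + recall)

# The two sentences from the original question, in bracketed form.
s1 = ("(ROOT (S (NP (PRP I)) (VP (VBP need) (S (VP (TO to) (VP (VB ride) "
      "(NP (DT the) (NN bicycle)))))) (. .)))")
s2 = ("(ROOT (S (NP (PRP She)) (VP (VBZ needs) (S (VP (TO to) (VP (VB pass) "
      "(NP (DT the) (NN exam)))))) (. .)))")

print(similarity(s1, s2))                                # 1.0
print(round(similarity(s1, s2, keep_pos_tags=True), 3))  # 0.933
```

With POS tags included, the score drops slightly because VBP and VBZ differ. One caveat of this sketch: spans are tied to word positions, so two sentences of different lengths can score low even when their structures look alike; a tree edit distance might be a better fit in that situation.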

Hope this helps, cheers!


Jordi