Machine Reading Comprehension


sudharsan vijayaraghavan

Jun 6, 2021, 6:21:23 AM
to link-grammar
I'm working on Machine Reading Comprehension.
Let me summarise: to start with, it targets
- QA
- text summarisation
I'm using the DistilBERT transformer QA model, running it per paragraph on the lesson of interest and attempting answers for the question of interest.
I had explored the CoreNLP models at length for finding parts of speech; they work decently, finding various POS tags, including Named Entity Recognition, and doing lexical tree parsing.

Then I landed at liblink-grammar, which is cool.

However, I'm looking at a constituent tree for multiple sentences, e.g.
- the ability to map pronouns (singular/plural) across the passage to the corresponding proper nouns
I can see that in almost all cases (with null links allowed), liblink-grammar fails to get this done.

I'm looking at tweaking liblink-grammar.
Am I missing anything in my understanding or usage?
Is my understanding correct that liblink-grammar cannot really map / interlink several sentences into a single constituent tree?

Also, CoreNLP/Stanford Parser does not work for mapping words (pronouns/proper nouns) across a passage.

Please give your thoughts.

Thanks

Linas Vepstas

Jun 6, 2021, 7:14:03 PM
to link-grammar, sudharsan vijayaraghavan
On Sun, Jun 6, 2021 at 5:21 AM sudharsan vijayaraghavan <sudvij...@gmail.com> wrote:
> I'm working on Machine Reading Comprehension.
> Let me summarise, to start with it targets
> - QA
> - text summary
> I'm using DistilBert transformer QA model running it per paragraph on the lesson of interest and attempting answers for question of interest
> I had explored a lot on usage of CoreNLP models for finding Parts Of Speech, it works decently finds various POS including Named Entity Recognitions and lexical tree parsing.
>
> Then landed at lib link grammar, which is cool.
>
> However I'm looking at constituent tree for multiple sentences, like
> - ability to map pronouns singular/plural) across the passage to corresponding proper nouns
> I can see in almost all cases (with null link allowed), lib link grammar fails to get this done.

Several remarks:

First: link-grammar, as implemented, is intended for single sentences, one at a time. It can, in fact, parse several at once, because it can see where the punctuation is. This can, however, degrade performance (because there are more words). 

Next: LG can output constituent parses, but constituent parsing is clunky and inelegant. It throws away a lot of important information. It's kind of a crutch: it can help you get familiar with it, and it will help you compare against other systems, but, in general, constituent parsing is just ... weak, in a theoretical sense, and ugly, in a practical sense.


> I'm looking at tweaking lib link grammar.
> Am I missing anything in my understanding or usage?

Yes. Being able to link referents to the things they refer to is an important thing to do, and I'd love to see a practical system where I can bite into the algorithms and do neat stuff with it. But ...

But modifying the current system to create those links won't work, because the current system enforces planar graphs, and creating anaphora-reference links would create links that cross over other links. (i.e. non-planar graphs).
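Concretely, "crossing" here has a simple arithmetic meaning over word positions. A minimal sketch (the function name is mine, not part of the LG API): two links cross exactly when one endpoint of one link falls strictly inside the span of the other.

```python
def links_cross(link_a, link_b):
    """Two links over word positions cross iff exactly one endpoint
    of one link lies strictly inside the span of the other."""
    i, j = sorted(link_a)
    k, l = sorted(link_b)
    if i > k:  # order the two links by their left endpoint
        i, j, k, l = k, l, i, j
    return i < k < j < l

# "Tom said he left": a hypothetical anaphora link Tom(0)-he(2)
# crosses the syntactic link said(1)-left(3), so a planar parser
# could never draw both at once.
print(links_cross((0, 2), (1, 3)))  # True: the links cross
print(links_cross((0, 3), (1, 2)))  # False: nested links do not cross
```

Any anaphora link back to an earlier sentence will almost always straddle intervening syntactic links this way, which is why the planarity constraint rules such links out.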

We've discussed, on and off, a system that would allow non-planar links, but there are several subtleties, complexities and work-arounds. Never thought about applying it for anaphora resolution.

> Is my understanding correct that lib link grammar cannot really map / inter link several sentences into a single constituent tree.

Well, constituent trees are nasty from the get-go, so ick. It almost works:

linkparser> This is a test. Here is another.
Found 80 linkages (60 had no P.P. violations)
Linkage 1, cost vector = (UNUSED=0 DIS= 1.10 LEN=10)

    +-------------Xp------------+---------Xp---------+
    +----->WV----->+---Ost--+   +--->WV-->+          |
    +-->Wd---+-Ss*b+  +Ds**c+   +>Wp>+<PFb+SIs*x+    |
    |        |     |  |     |   |    |    |     |    |
LEFT-WALL this.p is.v a  test.n .  here is.v another .

So the above looks great. But the constituent tree:

(S (S (NP this.p)
      (VP is.v
          (NP a test.n)))
   . here
   (VP is.v
       (NP another))
   .)


Meh. Not so great. Doesn't seem to realize there was a period in the middle of that. So it looks like there's a bug there. No clue how hard it would be to fix.

> Also corenlp/Stanford Parser does not work at giving mapping of words (pronouns/proper nouns) mapping across a passage

Sure. The simplest algo I know of for that mapping is called "the Hobbs algorithm". It works great for most simple grade-school reading-level sentences, maybe around 80% accuracy, and falls down badly for anything more sophisticated. 
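For the curious, the intersentential half of the Hobbs search can be sketched in a few lines. This is a drastic simplification (toy nested-list parse trees, no gender/number agreement, none of the intrasentential tree-walk constraints; all names are mine): scan preceding sentences most-recent-first, proposing NPs in breadth-first left-to-right order.

```python
def leaves(tree):
    """Flatten a nested-list tree ([label, child, ...]) into its words."""
    if isinstance(tree, str):
        return [tree]
    return [w for child in tree[1:] for w in leaves(child)]

def noun_phrases(tree):
    """Collect NPs breadth-first, left to right -- the visit order the
    Hobbs algorithm uses when proposing antecedent candidates."""
    found, queue = [], [tree]
    while queue:
        node = queue.pop(0)
        if isinstance(node, list):
            if node[0] == "NP":
                found.append(" ".join(leaves(node)))
            queue.extend(node[1:])
    return found

def resolve(trees, pronoun_sentence):
    """Intersentential step only: search preceding sentences, most
    recent first, and take the first NP in Hobbs order."""
    for idx in range(pronoun_sentence - 1, -1, -1):
        candidates = noun_phrases(trees[idx])
        if candidates:
            return candidates[0]
    return None

# "John owns a dog." / "He feeds it." -> antecedent proposed for "He"
s1 = ["S", ["NP", "John"], ["VP", "owns", ["NP", "a", "dog"]]]
s2 = ["S", ["NP", "He"], ["VP", "feeds", ["NP", "it"]]]
print(resolve([s1, s2], 1))  # John
```

The real algorithm adds the within-sentence tree walk and agreement filters; those are exactly the parts that push it to roughly the 80% figure mentioned above, and no further.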

In order to work around the places where it stinks, we implemented it in a flexible rule system that would allow more sophisticated rules to be added and triggered as appropriate.  You can find this code here:

https://github.com/opencog/opencog/tree/master/opencog/nlp/anaphora
Some background: the opencog atomspace is a (hyper-)graph database; it's ideal for working with generic graphs, transmuting them, re-writing them, hanging arbitrary information off of them, etc.  So it's a pretty good place to do reference resolution.

The opencog nlp subsystem is a tower of assorted NLP stuff built on top of link-grammar. Almost all of it is slowly bit-rotting. The anaphora resolution code is probably bit-rotted, and probably won't work out of the box, without some shock treatment.

The biggest issue with the entire approach is the need to have humans hand-craft custom rules to handle each of the exceptions and special cases.  It's like trying to build a sky-scraper out of two-by-fours: after a while, there are too many pieces, the structure is too complex, the whole thing becomes ever more fragile and unmaintainable.

Most pros have experienced the "fragile and unmaintainable" aspects of it. That's why most pros are focusing on systems that can learn what is needed (supervised or unsupervised).  I do not currently know of any system that could do unsupervised learning of anaphora, although I am working on a system that, in theory (in my imagination), might someday be able to do that. It's this one: https://github.com/opencog/learn


> Please give your thoughts
>
> Thanks

you're welcome!

--linas

--
Patrick: Are they laughing at us?
Sponge Bob: No, Patrick, they are laughing next to us.
 

Anton Kolonin @ Gmail

Jun 6, 2021, 9:59:29 PM
to link-g...@googlegroups.com, Linas Vepstas, sudharsan vijayaraghavan, Vignav Ramesh
>First: link-grammar, as implemented, is intended for single sentences, one at a time. It can, in fact, parse several at once, because it can see where the punctuation is. This can, however, degrade performance (because there are more words). 

There is research work done on Language Segmentation Based on Link Grammar:

https://ieeexplore.ieee.org/document/9303220

The code is open-source and works for the English dictionary (no morphology support):

https://github.com/aigents/aigents-java-nlp


>The biggest issue with the entire approach is the need to have humans hand-craft custom rules to handle each of the exceptions and special cases.  It's like trying to build a sky-scraper out of two-by-fours: after a while, there are too many pieces, the structure is too complex, the whole thing becomes ever more fragile and unmaintainable.

There is another research effort building Link Grammar dictionaries along with domain ontologies via unsupervised learning.

The results are kind of promising, but the quality is not as good as it needs to be for practical usage. The project is frozen for the time being; it is open-source as well:
-- 
-Anton Kolonin
telegram/skype/facebook: akolonin
mobile/WhatsApp: +79139250058
akol...@aigents.com
https://aigents.com

sudharsan vijayaraghavan

Jun 7, 2021, 1:19:09 PM
to linasv...@gmail.com, link-grammar
Hi Linas,

Thanks for the pointers and inputs.

I will go through https://github.com/opencog/opencog/tree/master/opencog/nlp/anaphora and also https://github.com/opencog/learn.
Earlier today I was taking a look at https://github.com/opencog/relex (still struggling to even have it built properly). It definitely uses liblink-grammar under the hood, so I feel it will also not generate dependency parsing across multiple sentences; I have yet to try it out. Please correct me if I am wrong.

For MRC, the DistilBERT Hugging Face model does generate reasonable answers for grade-school-level questions. It definitely needs a pre-processing setup that creates a dependency tree of all mappings of proper nouns/pronouns and subjects/objects to improve accuracy.

A lame method is to get the Euclidean distance across all pronouns/proper nouns and subjects/objects. The pronouns/proper nouns and subjects/objects obtained through

edu.stanford.nlp.trees.EnglishGrammaticalStructure

edu.stanford.nlp.parser.lexparser.LexicalizedParser

nlp = stanza.Pipeline('en', processors='tokenize,pos,lemma,depparse,ner', use_gpu=False, pos_batch_size=3000)

are quite accurate.


Then feed in the mappings to reorganize the input passage before passing it on to DistilBERT.
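That reorganization step can be sketched in a few lines, assuming the pronoun-to-antecedent mappings have already been extracted (token-level substitution only; whitespace and casing details are glossed over, and all names here are mine):

```python
def reorganize(tokens, mapping):
    """Swap each pronoun token for its mapped antecedent before the
    passage is handed to the QA model."""
    return " ".join(mapping.get(tok, tok) for tok in tokens)

tokens = "Tom and John are friends . They live together .".split()
mapping = {"They": "Tom and John"}  # produced by the coreference step
print(reorganize(tokens, mapping))
# Tom and John are friends . Tom and John live together .
```

The rewritten passage then lets the QA model match "Tom" or "John" in a question directly against the sentence that originally only contained "They".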



I agree that linkages are not allowed to intersect in liblink-grammar, which means it won't really work. The only hope I had was that, with null-links allowed, it would generate a combination of linkages; but sorting out the various linkages and inferring data from them to get the mapping is going to take forever. I agree with your comments on it.

I will analyse further with your inputs and update on progress. Once again thanks for all the pointers.

sudharsan vijayaraghavan

Jun 7, 2021, 1:24:42 PM
to Anton Kolonin @ Gmail, link-grammar, Linas Vepstas, Vignav Ramesh
Hi Anton,
Interesting research papers; I skimmed through them. I will dig deeper and hope they address my actual requirement.
With products like Lexile and Grammarly around, I'm really surprised to learn that dependency parsing for multiple sentences is still not mature.

Thanks for the pointers

Thanks,

Linas Vepstas

Jun 7, 2021, 6:28:39 PM
to sudharsan vijayaraghavan, link-grammar
On Mon, Jun 7, 2021 at 12:19 PM sudharsan vijayaraghavan <sudvij...@gmail.com> wrote:
> Hi Linas,
>
> Thanks for the pointers and inputs.
>
> I will go through the https://github.com/opencog/opencog/tree/master/opencog/nlp/anaphora and also https://github.com/opencog/learn.
> Earlier today I was taking a look at https://github.com/opencog/relex (still struggling to even have it built properly) It is definitely using liblink-grammar under the hood, so I feel it will also not generate dependency parsing across multiple sentences, yet to try it out. Please correct me if I am wrong.

Reference resolution is not the same thing as dependency parsing.  I am not aware of any concept that could be called "dependency parsing across multiple sentences", so I don't know what you mean by that, other than that you are trying to informally describe reference resolution.

Relex should build and run. However, it is semi-abandoned, because it offers no real value add over what link-grammar already provides.  LG generates a lot of linkage data; Relex basically "forgets" much of this, boils it down to a more simplistic view of sentence structure, one that is more or less compatible with what the Stanford parser does. Because the presentation is simpler, it is easier to understand -- both for you as a human, and also for other post-processing algorithms. My gut instinct is that throwing away information is not a good thing to do, and so I've lost interest in Stanford-style parsing.


> For MRC, DistilBert hugging face model does generate reasonable answers for questions out of grade-school level. It definitely needs a pre-processing setup to create a dependency tree of all mappings of proper nouns/pronouns and subjects/objects to make it improve accuracy.
>
> A lame method is to get Euclidean distance across all pronouns/proper nouns and subjects/objects

A less lame method is the Hobbs algorithm. It dates back to the 1970's. It works.


> I agree the linkages are not expected to intersect with liblink-grammar which means it won't really work. Only hope I had with null-links allowed, it would generate a combination of linkages, well sorting out the various linkages and inferring data from them to get mapping is going to take forever.

I don't understand what you are trying to say here. I don't know what you mean by null-links in this context. Why should something "take forever"?

> I agree with your comments on it.

OK. This conversation has given me a wild-and-crazy idea for an easy way of doing reference resolution, but I have to think about it a bit before I try to describe it.

--linas

sudharsan vijayaraghavan

Jun 8, 2021, 2:33:38 PM
to Linas Vepstas, link-grammar

> dependency parsing across multiple sentences.
Let us take a simple example.
Tom and John are friends. They live together. Marley is the pet of these guys.

The expected dependency relations form the adjacency matrix below. Proper nouns, pronouns, noun subjects, and noun modifiers can be the nodes of the graph:

         Tom  John  friends  They  Marley  pet  these  guys
Tom       1    0      1       0     0      0     0      0
John      0    1      1       0     0      0     0      0
friends   1    1      1       1     0      0     0      0
They      0    0      1       1     0      0     1      0
Marley    0    0      0       0     1      1     0      0
pet       0    0      0       0     1      1     0      0
these     0    0      0       1     0      0     1      1
guys      0    0      0       0     0      0     1      1


A BFS followed by a DFS on the graph can give all the relevant mappings: if x -> y then automatically y -> x (the matrix is symmetric). If the same noun is repeated, we can either collapse it to the earlier occurrence or suffix it to make it a new entry in the adjacency matrix.


The same must be built for a given passage / paragraph.
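The BFS step over such a matrix might look like this (a sketch under my own naming; the edge set is hand-copied from the matrix above with self-loops dropped, whereas in practice it would come from a parser):

```python
from collections import deque

# Symmetric adjacency copied by hand from the matrix above.
edges = {
    "Tom": {"friends"}, "John": {"friends"},
    "friends": {"Tom", "John", "They"},
    "They": {"friends", "these"},
    "Marley": {"pet"}, "pet": {"Marley"},
    "these": {"They", "guys"}, "guys": {"these"},
}

def coreference_cluster(start):
    """BFS: every node reachable from `start` belongs to the same
    mapping cluster, since x -> y implies y -> x."""
    seen, queue = {start}, deque([start])
    while queue:
        for neighbour in edges.get(queue.popleft(), ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen

print(sorted(coreference_cluster("They")))
# ['John', 'They', 'Tom', 'friends', 'guys', 'these']
print(sorted(coreference_cluster("Marley")))
# ['Marley', 'pet']
```

Since the graph is undirected, one BFS per unvisited node is enough; the DFS pass adds nothing that the BFS does not already reach.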


Hobbs' algorithm does seem to get us closer to the above adjacency matrix:

https://www.isi.edu/~hobbs/ResolvingPronounReferences.pdf



With LG, on allowing null linkages, it generates many combinations of linkages. It is still possible to generate what we want ("dependency parsing across multiple sentences") by using "corefs" + "ner" + "governorGloss" from the stanza pipeline and reinterpreting them against the LG linkages, but that will be a lot of effort and will take forever; this is exactly what I meant.

Linas Vepstas

Jun 8, 2021, 5:41:26 PM
to sudharsan vijayaraghavan, link-grammar
OK.

Please be aware that a "dependency parse" is a technical term, having a very specific meaning in the linguistics literature. That specific technical meaning dates back to the 1960's, see https://en.wikipedia.org/wiki/Dependency_grammar

What you are describing is some form of a semantic network. There are many different kinds of those things, too. See for example,

Semantic networks are the topic of active research in academia and industry. There's no magic software that will do it automatically, as the article on pragmatics should make abundantly clear.

--linas

sudharsan vijayaraghavan

Jun 9, 2021, 1:11:52 AM
to Linas Vepstas, link-grammar
Yep, the term I used is not correct; I will take a look at the links.
I have attached the lex-parser output for the example sentences as obtained by the Stanford lexical parser; clearly, linkages across sentences are missing. FYI.
Thanks
lex2.ps

sudharsan vijayaraghavan

Jun 14, 2021, 11:48:24 AM
to Linas Vepstas, link-grammar
Hi Linas,

As I was coding and modifying Hobbs' algorithm and testing it, I hit upon neuralcoref. This works with more than 95% accuracy in establishing mappings between pronouns/noun subjects and noun predicates/objects.
Here is the code which I had tried out:

import logging
import spacy
import neuralcoref

logging.basicConfig(level=logging.INFO)

# Load SpaCy
nlp = spacy.load('en_core_web_sm')

# Add neural coref to SpaCy's pipe
coref = neuralcoref.NeuralCoref(nlp.vocab, greedyness=0.5)
nlp.add_pipe(coref, name='neuralcoref')


def coref_mapping(text):
    doc = nlp(text)
    tok_list = list(token.text_with_ws for token in doc)
    for cluster in doc._.coref_clusters:
        cluster_main_words = set(cluster.main.text.split(' '))
        for mention in cluster.mentions:
            mention_words = set(mention.text.split(' '))
            # Substitute only mentions that share no words with the
            # cluster's main mention (i.e. pronoun-like references).
            if not mention_words.intersection(cluster_main_words):
                tok_list[mention.start] = cluster.main.text + doc[mention.end - 1].whitespace_
    return "".join(tok_list)


# text = "Apparently I love the camera and I it"
text = "Tom and John are friends. They live together. Marley is the pet of these guys"

print(coref_mapping(text))

Thanks