Thought Experiment: Stylometry with Link Grammar?


Calvin Irby

Jun 18, 2021, 11:31:04 PM
to link-grammar
Hello Link Grammar Community!

Well, here goes. This is something I know Linas mentioned he's working on, or at least something related, and I wanted to know whether it's possible to implement with Link Grammar.

Is it possible to replace Link Grammar's data structure of linkages with one that is tailored to a specific author? That way, one could see if a particular text can be attributed to that author. Basically, can we check that a person's writing was in fact written by that person?

My idea was to run Link Grammar over the original author's texts and somehow extract all the linkages into a completely new data structure, then upload it in place of the one that Link Grammar already has. One could then use this set of linkages to see if a text has any/all of the typical links that author would use.

Let me know what you all think!
-Calvin

Linas Vepstas

Jun 19, 2021, 7:08:29 PM
to link-grammar
Hi Calvin,

You bring up an interesting topic, but rather than responding directly to it, let me start at square one. You ask about text attribution, and indirectly, text generation.  I think that thinking about the basics makes it more fun.

Consider a police detective analyzing a threatening note. At some point, it becomes common knowledge that hand-written notes are subject to forensic analysis. Criminals switch to typewriters; alas, some famous spy cases from the 1940's are solved by linking notes to the typewriters that produced them. By the 1970's, Hollywood shows us films with the bad guy clipping words from newspapers. Aside from looking for fingerprints left on the paper, psychological profilers look for idiosyncrasies in how the criminal expresses ideas: strange wording, odd phrases, punctuation or the lack thereof.

How about computer text? It's well known that many people consistently mis-spell words (I consistently mis-spell "thier") and I think there was some murder-trial evidence that hinged on this. Moving into the PC era, 1980's onwards, we get the "bag of words" model: different texts have different ratios of words, and this is applied to a zillion and one problems in text classification. Basically, you have a vector of (word, frequency) pairs and you can take assorted distance measures to determine similarity (the dot-product is very popular and also just plain wrong, but I digress). I don't think you'd have any particular problem with using this method to attribute a text to James Joyce, for example.
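A minimal sketch of the bag-of-words comparison described above, in Python. The tokenizer, the function names and the two sample sentences are illustrative assumptions. Cosine similarity (the length-normalized dot product) is shown only because the dot product is the measure mentioned above; it is one option among many, not a recommendation.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Return a Counter mapping each lowercased word to its count."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse (word, frequency) vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is similar to nothing
    return dot / (norm_a * norm_b)

# Toy inputs; a real experiment would use whole novels or letters.
known = bag_of_words("I am proud to be an emotionalist")
unknown = bag_of_words("I was immodestly an emotionalist")
print(cosine_similarity(known, unknown))
```

On texts this short the score says very little; the method only becomes meaningful with large samples per author.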

It becomes subtle, perhaps, if the text is short: say, a letter, and you are comparing it to other letters written in the same era, written by eloquent Irishmen. The words that Joyce might use in a letter might not be the ones he'd use in a novel. It's reasonable to expect that bag-of-words will fail to provide an unambiguous signal.  How about sentence structure, then?   This is what you are asking me. Yes, I agree: that is a good way (the best way?) of doing this. One might still expect Joyce to construct his sentences in a way that is particular to his mode of thinking, irrespective of the topic that he writes on. Mood and feeling echo on in the grammar.

So, how might this work? Before I dive into that, a short digression. Besides bag-of-words, there is also a bag of word-pairs. Here, you collect not (word, frequency) pairs, but (word-pair, frequency) pairs. One collects not nearest-neighbor word-pairs, but word-pairs in some window: say, of length six. The problem is that there are vast numbers of word-pairs, like "the-is" and "you-book" - hundreds of millions. Most are junk. You can weed most of these away by focusing on only those with a high mutual information, but even so, you're left with the problem of "overfitting".
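A rough sketch of the bag of word-pairs, assuming that "window" means "at most six positions apart" and that mutual information is estimated pointwise from the pair counts alone. The function names and the sample text are made up for illustration.

```python
import math
from collections import Counter

def window_pairs(tokens, window=6):
    """Yield ordered (left, right) word pairs at most `window` positions apart."""
    for i, left in enumerate(tokens):
        for right in tokens[i + 1 : i + 1 + window]:
            yield (left, right)

def pair_mutual_information(tokens, window=6):
    """Return {pair: pointwise mutual information in bits}."""
    pair_counts = Counter(window_pairs(tokens, window))
    total = sum(pair_counts.values())
    left_counts = Counter()   # how often a word appears as the left member
    right_counts = Counter()  # how often a word appears as the right member
    for (left, right), n in pair_counts.items():
        left_counts[left] += n
        right_counts[right] += n
    # MI = log2( p(l, r) / (p(l, *) * p(*, r)) )
    return {
        (left, right): math.log2(n * total / (left_counts[left] * right_counts[right]))
        for (left, right), n in pair_counts.items()
    }

tokens = "the book is on the table and the book is old".split()
mi = pair_mutual_information(tokens)
# Keep only the highest-MI pairs; the cutoff (here: top 5) is arbitrary.
for pair, score in sorted(mi.items(), key=lambda kv: -kv[1])[:5]:
    print(pair, round(score, 3))
```

On a real corpus the pair table is huge, so the MI cutoff is what keeps the junk pairs out.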

Enter the n-gram (as in "Google n-gram viewer") or better yet, the skip-gram, which is an n-gram with some "irrelevant" words omitted. Effectively all neural-net techniques are skip-gram based. To crudely paraphrase what a neural net does: as you train it on a body of text (say, James Joyce's complete works), it develops a collection of (skip-gram, frequency) pairs, or rather, a (skip-gram, weight) vector. You can then compare this to some unknown text: the neural net will act as a discriminator or classifier, telling you if that other text is sufficiently similar (often using the dot product, which is just plain... but I digress...). The "magic" of the neural net is that it figures out *which* skip-grams are relevant, and which are noise/junk. (There are millions of trillions of skip-grams; out of these, the neural net picks out 200 to 500 of them. This is a non-trivial achievement.)

How might this work for one of James Joyce's letters? Hmm. See the problem? If the classifier is trained on his novels, the vocabulary there might be quite different than the vocabulary in his personal letters, and that difference in vocabulary will mess up the recognition.  Joyce may be using the same sentence constructions in his letters and novels, but with a different vocabulary in each. A skip-gram classifier is blind to word-classes: it's blind to the grammatical constructions.  Something as basic as a synonym trips it up. (Disclaimer: there is some emerging research into solving these kinds of problems for neural nets, and I am *not* up on the latest! Anyone who knows better is invited to amplify!)

I've said before (many many times) that skip-grams are like Link Grammar disjuncts, and it's time to make this precise. Let's try this:

    +---->WV--->+     +-----IV--->+-----Ost-----+
    +->Wd--+-SX-+--Pa-+--TO--+-Ixt+   +--Dsu*v--+
    |      |    |     |      |    |   |         |
LEFT-WALL I.p am.v proud.a to.r be.v an emotionalist.n

Here, an example skip-gram might be (I..proud..be) or (proud..be..emotionalist). A sentence like "I was immodestly an emotionalist" would be enough for a police detective to declare that Joyce wrote that. Yet there is no skip-gram match.
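To make the "no skip-gram match" point concrete, here is a small sketch that enumerates skip-grams by brute force. The definition used (three words, in order, fitting inside a span of five tokens) is an assumption chosen for illustration; a wider span or shorter skip-grams would give different results.

```python
from itertools import combinations

def skipgrams(tokens, n=3, max_span=5):
    """Return every n-word subsequence (order kept) that fits in max_span tokens."""
    grams = set()
    for start in range(len(tokens)):
        # Words that may follow tokens[start] inside the span.
        following = tokens[start + 1 : start + max_span]
        for rest in combinations(following, n - 1):
            grams.add((tokens[start],) + rest)
    return grams

joyce = skipgrams("I am proud to be an emotionalist".split())
other = skipgrams("I was immodestly an emotionalist".split())

print(("I", "proud", "be") in joyce)               # one of the examples above
print(("proud", "be", "emotionalist") in joyce)    # the other example
print(joyce & other)                               # shared skip-grams, if any
```

With these settings the two sentences share no skip-grams, even though both end in "an emotionalist".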

Consider now the Link-grammar word-disjunct pairs. For the above sentence, here's the complete list:

               I  == Wd- SX+
              am  == SX- dWV- Pa+
           proud  == Pa- TO+ IV+
              to  == TO- I*t+
              be  == Ix- dIV- O*t+
              an  == Ds**v+
    emotionalist  == D*u- Os-

You can double-check this by carefully looking at the diagram above; notice that "proud" links to the left with Pa and to the right with TO and IV.
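The double-check can also be done mechanically. The sketch below rebuilds an approximate disjunct for each word from the links in the diagram, which are typed in by hand here rather than obtained from the parser. It uses the link labels as drawn (for example `Ixt`, `Ost`), so the subscripts differ slightly from the connector names in the list, and the head/dependent marks such as the `d` in `dWV` are not reproduced.

```python
from collections import defaultdict

# Links read off the diagram: (left word index, right word index, label).
WORDS = ["LEFT-WALL", "I", "am", "proud", "to", "be", "an", "emotionalist"]
LINKS = [
    (0, 2, "WV"), (0, 1, "Wd"), (1, 2, "SX"), (2, 3, "Pa"), (3, 5, "IV"),
    (3, 4, "TO"), (4, 5, "Ixt"), (5, 7, "Ost"), (6, 7, "Dsu*v"),
]

def disjuncts_from_links(words, links):
    """Return [(word, disjunct string)], nearest link first on each side."""
    left = defaultdict(list)   # word index -> [(distance, label)] going left
    right = defaultdict(list)  # word index -> [(distance, label)] going right
    for lw, rw, label in links:
        right[lw].append((rw - lw, label))
        left[rw].append((rw - lw, label))
    result = []
    for i, word in enumerate(words):
        connectors = [label + "-" for _dist, label in sorted(left[i])]
        connectors += [label + "+" for _dist, label in sorted(right[i])]
        result.append((word, " ".join(connectors)))
    return result

for word, disjunct in disjuncts_from_links(WORDS, LINKS):
    print(f"{word:>14}  ==  {disjunct}")
```

For "proud" this yields `Pa- TO+ IV+`, the same as in the list.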

The original intent of disjuncts is to indicate grammatical structure. So, "Pa" is a "predicative adjective". "IV" links to "infinitive verb".  As a side-effect, they work with word-classes: for example, "He was happy to be an idiot" has exactly the same parse, even though the words are quite different.

To finally get back to your original question of author attribution. Well, here's an idea: "bag of disjuncts". Let's collect (disjunct, frequency) pairs from Joyce's novels, and compare them to his letters. The motivation for this idea is that perhaps the specific vocabulary words are different, but the sentence structures are similar.
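A small sketch of what a "bag of disjuncts" could look like, using the word-disjunct pairs listed earlier as the only real data. The distance measure (total variation, i.e. half the L1 distance between relative frequencies) and the two artificial comparison texts are assumptions made for illustration; nothing here has been validated on real corpora.

```python
from collections import Counter

def bag_of_disjuncts(word_disjunct_pairs):
    """Relative frequency of each disjunct, ignoring which word carried it."""
    counts = Counter(disjunct for _word, disjunct in word_disjunct_pairs)
    total = sum(counts.values())
    return {d: n / total for d, n in counts.items()}

def total_variation(p, q):
    """Half the L1 distance: 0.0 means identical bags, 1.0 means no overlap."""
    keys = p.keys() | q.keys()
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

# Word-disjunct pairs copied from the list earlier in this message.
sentence = [
    ("I", "Wd- SX+"), ("am", "SX- dWV- Pa+"), ("proud", "Pa- TO+ IV+"),
    ("to", "TO- I*t+"), ("be", "Ix- dIV- O*t+"), ("an", "Ds**v+"),
    ("emotionalist", "D*u- Os-"),
]
# Artificial text 1: the same disjuncts carried by different (made-up) words.
reworded = [("word%d" % i, d) for i, (_w, d) in enumerate(sentence)]
# Artificial text 2: only the first four word-disjunct pairs.
truncated = sentence[:4]

print(total_variation(bag_of_disjuncts(sentence), bag_of_disjuncts(reworded)))
print(total_variation(bag_of_disjuncts(sentence), bag_of_disjuncts(truncated)))
```

The first distance is zero because the bag ignores the words entirely, which is both the attraction of the idea and one of its failings.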

How well does this work? I dunno. No one has ever studied this in any quantitative, scientific setting.  Some failings are obvious: There is a 100% match to "He was happy to be an idiot" even though the word-choice might not be Joycean. There is a poor match to "I was immodestly an emotionalist" even though the word "emotionalist" is extremely rare, and a dead giveaway. There's also a problem with the correspondence "immodestly" <=> "proud to be" because "immodestly" is an adverb, not a predicative adjective, and it's a single word, not a word-phrase. Raw, naive Link Grammar is insensitive to synonymy between word-phrases.

There is a two-decade old paper that explains exactly how to solve the multi-word synonymous-phrases problem. It's been done. It's doable. I can certainly point out a half-dozen other tricks and techniques to further refine this process.  So, yes, I think that this all provides a good foundation for text attribution experiments. But I mean what I say: "experiments". I think it could work, and I think it might work quite well. But, to do better, you'd have to actually do it. Try it.  It would take a goodly amount of work before any literary critic would accept your results; and even more before a judge would accept it as admissible evidence in a court of law.

As to existing software: I have a large collection of tools for counting things and pairs of things, and comparing the similarity of vectors. Most enthusiasts would find that code unusable until it gets re-written in Python. Alas, that is not forthcoming.  If you wanted to actually do what I describe above, some very concrete plans would need to be made.

I also have this daydream about *generating text* in the style of a given author: given a corpus, create more sentences and paragraphs, in the style and vocabulary of that corpus. My ideas for this follow along similar lines of thought to the above, but this is ... a discussion for some other day.

--linas

p.s. I plan to turn this email into a blog post.


--
You received this message because you are subscribed to the Google Groups "link-grammar" group.
To unsubscribe from this group and stop receiving emails from it, send an email to link-grammar...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/link-grammar/0211b9a9-a9d5-406b-a941-56795b4f4edfn%40googlegroups.com.


--
Patrick: Are they laughing at us?
Sponge Bob: No, Patrick, they are laughing next to us.
 

Calvin Irby

Jun 20, 2021, 9:58:06 AM
to link-grammar
Hello again Linas,

And what a terrific response! I saved the blog post, because it was that good. I really enjoyed the historical aspect and how you talked about how one thing led to another. Getting back to some of your possible solutions to the problem...

I think the "bag of disjuncts" idea is an interesting one. Currently, like you said, someone could surely just look at the frequency of words in a text and see if any of the bigrams/trigrams provide any information, although using disjuncts might give a better approach to the problem. I got inspired by Chapter Two of the book Real World Python by Lee Vaughan. It uses the NLTK package, which of course is different from Link Grammar, as some assembly is required to get it to work; it also just looks at different parts of the text to determine if they are similar. If all the boxes are checked, so to speak, then one could assume that Shakespeare was the original author, or whatever the original experimenter's intent was.

Do you happen to remember the name of the paper that solves the multi-word synonymous-phrases problem? Also, let me know what "concrete plans" need to be put in place in order to make the bag of disjuncts idea work. Would this require just taking some of the existing Link Grammar source code and converting it to Python in some way? I really like coding in Python as a hobby; it's my language of choice, and sometimes just having some theoretical problems to throw myself at keeps the inspiration fire burning, so to speak.

-Calvin 

Linas Vepstas

Jun 21, 2021, 10:11:45 AM
to link-grammar
Hi Calvin,

On Sun, Jun 20, 2021 at 8:58 AM Calvin Irby <calvin...@gmail.com> wrote:
Do you happen to remember the name of the paper that solves the multi-word synonymous-phrases problem?

Hoifun Poon and Pedro Domingos,  Unsupervised Semantic Parsing, http://www.aclweb.org/anthology/D/D09/D09-1001

Also, let me know what "concrete plans" need to be put in place in order to make the bag of disjuncts idea work. Would this require just taking some of the existing Link Grammar Source code and converting it to Python in some way? I really like coding in Python as hobby nonetheless, it's my language of choice, and sometimes just having some theoretical problems to throw myself at keeps the inspiration fire burning so to speak. 

The existing Link Grammar code comes with Python bindings and an example or two.  It also has Java, JavaScript and assorted other bindings.  Just take whatever bag-of-words code the Vaughan book provides, and replace the words with disjuncts.  Shouldn't be more than a few hours of dinking around, maybe an afternoon.

If the example is literally Shakespeare, then LG will struggle. In the end, LG was built for parsing late 20th century English, and its accuracy degrades as it moves away from that.

If you do come up with that code, you should post it somewhere; I'll link the blog to it. (Or you should add a comment to the blog mentioning it.)

--linas


Calvin Irby

Jun 21, 2021, 11:43:54 AM
to link-g...@googlegroups.com
Awesome! Thank you Linas,

Also, yes, I'll definitely refer back to your blog post if I spin anything up in Python.

-Calvin


Anton Kolonin @ Gmail

Jun 26, 2021, 7:50:19 AM
to link-g...@googlegroups.com, Calvin Irby, Vignav Ramesh

Here is the algorithm:

1. Parse the text with LG

2. Count the link types from every parse

3. Consider the link count per type for a specific author to correspond to a point in a vector space where each dimension is a link type.

4. Do this for different authors and literary styles and see if they group into clusters in that space

5. Let us know if it worked ;-)

6. If it worked, you can take any text and apply the same procedure to find its point in that space, and so the corresponding original author.
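Steps 2, 3 and 6 can be sketched roughly as follows. This assumes the link labels have already been extracted from the LG parses (step 1 is not shown), that a "link type" is the uppercase prefix of a label, and that closeness is plain Euclidean distance on raw counts; the author names and counts in the usage lines are invented. Step 4, the clustering check, is left out.

```python
import math
import re
from collections import Counter

def link_type_counts(link_labels):
    """Step 2: count link types, taking the type to be the uppercase prefix."""
    counts = Counter()
    for label in link_labels:
        match = re.match(r"[A-Z]+", label)
        if match:
            counts[match.group()] += 1
    return counts

def distance(a, b):
    """Euclidean distance between two sparse count vectors (step 3's space)."""
    keys = a.keys() | b.keys()
    return math.sqrt(sum((a.get(k, 0) - b.get(k, 0)) ** 2 for k in keys))

def nearest_author(unknown, authors):
    """Step 6: the author whose point lies closest to the unknown text's point."""
    return min(authors, key=lambda name: distance(unknown, authors[name]))

# Invented counts, for illustration only.
authors = {
    "author_a": Counter({"O": 5, "D": 1}),
    "author_b": Counter({"O": 1, "D": 5}),
}
unknown = link_type_counts(["Ost", "Os", "Os", "Osm", "Dsu*v", "Ds"])
print(nearest_author(unknown, authors))
```

Raw counts grow with the amount of text parsed, so texts of different lengths are not directly comparable this way.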

Cheers,

-Anton


Paul McQuesten

Jun 26, 2021, 10:16:16 PM
to link-grammar
"3. Consider the link count per type for specific author ..."
Maybe should normalize these counts into relative frequencies?
Perhaps by dividing counts by total links per author??

Anton Kolonin @ Gmail

Jun 27, 2021, 12:30:18 AM
to link-g...@googlegroups.com, Paul McQuesten

Sure, of course, TF-IDF style ;-)
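A sketch of what "TF-IDF style" normalization might look like for link types. It uses one common variant (relative frequency times the log of inverse author frequency); the function names and the sample counts are invented for illustration, and other weighting variants exist.

```python
import math

def relative_frequencies(counts):
    """Divide each count by the total number of links for that author."""
    total = sum(counts.values())
    return {k: n / total for k, n in counts.items()} if total else {}

def tf_idf(author_counts):
    """{author: {link_type: count}} -> {author: {link_type: weight}}."""
    n_authors = len(author_counts)
    # Number of authors that use each link type at least once.
    author_freq = {}
    for counts in author_counts.values():
        for link_type, n in counts.items():
            if n > 0:
                author_freq[link_type] = author_freq.get(link_type, 0) + 1
    weights = {}
    for author, counts in author_counts.items():
        tf = relative_frequencies(counts)
        weights[author] = {
            t: f * math.log(n_authors / author_freq[t])
            for t, f in tf.items()
            if counts[t] > 0
        }
    return weights

# Invented counts, for illustration only.
sample = {
    "author_a": {"O": 3, "D": 1},
    "author_b": {"O": 2, "MV": 2},
}
for author, w in tf_idf(sample).items():
    print(author, w)
```

With this variant, a link type used by every author gets weight zero, so only the link types that distinguish authors contribute.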
