Using "Universal Sentence Encoder" for Documents Similarity

Loreto Parisi

Apr 16, 2018, 10:11:32 AM
to Discuss
I'm using the recently added Universal Sentence Encoder model, available here: https://www.tensorflow.org/hub/modules/google/universal-sentence-encoder/1, for a semantic similarity task on documents, i.e. texts with, say, 7-15 paragraphs (separated by at least two newlines, "\n\n") and 4-5 sentences per paragraph (separated by a single newline, "\n"), like a poem.
What I can see is that the similarity score falls between 0.5 and 0.9 in most cases, i.e. the documents appear to be very "close", even though the authors actually differ in writing style, topics, and so on.
You can see this in the attached example, which plots the scores.
The methodology I'm using is implemented here, with sample data rather than the actual data I used for the attached plot: https://github.com/loretoparisi/tensorflow-examples/blob/master/Tensorflow_Semantic_Tex_Similarity_Demo.ipynb
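For reference, the core of that approach looks roughly like this (a minimal sketch with placeholder documents; the final arccos step, which converts cosine similarity to the angular similarity used in the Universal Sentence Encoder paper's STS evaluation, is an extra normalization not in the notebook, and it can spread out scores that crowd into the 0.5-0.9 band):

import numpy as np
import tensorflow as tf
import tensorflow_hub as hub

# Load the Universal Sentence Encoder (TF1-style Hub API).
embed = hub.Module("https://tfhub.dev/google/universal-sentence-encoder/1")

documents = ["first document ...", "second document ..."]  # placeholder data

with tf.Session() as sess:
    sess.run([tf.global_variables_initializer(), tf.tables_initializer()])
    embeddings = sess.run(embed(documents))  # shape: (num_docs, 512)

# USE embeddings are approximately unit-length, so the inner product
# is close to the cosine similarity.
cosine_sim = np.inner(embeddings, embeddings)
angular_sim = 1 - np.arccos(np.clip(cosine_sim, -1.0, 1.0)) / np.pi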
My question is how to improve this result. The models on TensorFlow Hub can be trained, so I was wondering about a training setup like:

import tensorflow_hub as hub
embedded_text_feature_column = hub.text_embedding_column(
    key="sentence", module_spec="https://tfhub.dev/google/universal-sentence-encoder/1",
    trainable=True)  # True = fine-tune the module weights


But I'm not sure if this is the right methodology. Another option would be using additional features besides the text embedding features, like (maybe) the date/epoch of the text (say, by decade), or other info such as the genre; but how can I embed these features, in a boosting-style approach, together with the Hub modules and the Universal Sentence Encoder?
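One option along these lines, sketched below with made-up column names and vocabulary (so only an assumption, not a tested setup), would be the TF feature-columns API, where the Hub embedding column sits next to ordinary categorical and numeric columns in an estimator:

import tensorflow as tf
import tensorflow_hub as hub

# Text embedding column backed by the Universal Sentence Encoder.
text_column = hub.text_embedding_column(
    key="sentence",
    module_spec="https://tfhub.dev/google/universal-sentence-encoder/1",
    trainable=True)  # fine-tune the module weights

# Hypothetical side features: genre and decade of the text.
genre_column = tf.feature_column.indicator_column(
    tf.feature_column.categorical_column_with_vocabulary_list(
        "genre", ["metal", "pop", "hip-hop"]))
decade_column = tf.feature_column.numeric_column("decade")

estimator = tf.estimator.DNNClassifier(
    hidden_units=[256, 64],
    feature_columns=[text_column, genre_column, decade_column],
    n_classes=3)

The estimator would then learn jointly from the text embedding and the side features, rather than boosting separate models.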
[Attachment: Schermata 2018-04-16 alle 15.57.08.png]

bd.cuth...@gmail.com

Apr 16, 2018, 8:26:45 PM
to Discuss
Hi Loreto,

What kind of text are you using? Is there genuinely a lot of difference between the styles as a human would perceive them? I.e. are you trying to look at the similarity between a piece of poetry and an academic paper on machine learning, or between more similar genres of text?

Brett.

Loreto Parisi

Apr 17, 2018, 8:43:49 AM
to Discuss
Hello,
I'm using a collection of song lyrics from different artists in this case. I have intentionally added a genre bias, so that I have, say, metal, pop, and hip-hop lyrics, and the writing styles differ a lot. What I can see from the module's output is that, as-is, it does not capture the "writing style".

I'm aware of academic paper classification, but in that case an SVM will work well too, due to the technical language and the use of very specific terms; there is a good example in Andrej Karpathy's arxiv-sanity-preserver application.

But back to my dataset of songs: I have tried to decompose the documents into sentences and paragraphs, so that my feature vectors are the decomposition of each document into, say, N paragraphs of M lines each; attached is the output I get (a sketch of the decomposition follows the screenshot). As you can see, this time we are likely in the worst case: the clustering is not that good, whereas I would expect clustering at least by genre or by artist.
[Attachment: Schermata 2018-04-17 alle 14.40.49.png]
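For reference, the decomposition looks roughly like this (a minimal sketch, assuming the "\n\n" paragraph convention from my first post; the song names are placeholders):

import numpy as np
import tensorflow as tf
import tensorflow_hub as hub

def split_paragraphs(document):
    # Paragraphs are separated by blank lines ("\n\n").
    return [p.strip() for p in document.split("\n\n") if p.strip()]

songs = {"artist_a/song_1": "...", "artist_a/song_2": "..."}  # placeholder lyrics
labels, paragraphs = [], []
for name, text in songs.items():
    for p in split_paragraphs(text):
        labels.append(name)
        paragraphs.append(p)

embed = hub.Module("https://tfhub.dev/google/universal-sentence-encoder/1")
with tf.Session() as sess:
    sess.run([tf.global_variables_initializer(), tf.tables_initializer()])
    vectors = sess.run(embed(paragraphs))

# Paragraph-by-paragraph similarity; consecutive rows share a label,
# so same-song paragraphs should appear as blocks on the heatmap diagonal.
sim = np.inner(vectors, vectors)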

bd.cuth...@gmail.com

Apr 17, 2018, 1:57:58 PM
to Discuss
So yes. You've identified some hot spots. The algorithm is differentiating some features.

What does it mean?


Brett.

Jorge Muñoz

Apr 17, 2018, 7:21:37 PM
to Discuss
You are using a model for semantic similarity, but you want to detect different writing styles; that is your problem. Two texts can be written in different styles but still have the same semantic meaning. Still, the model is giving you some information, because style and semantic meaning are somewhat correlated in your case.

Try a different type of model for your problem.

Loreto Parisi

Apr 18, 2018, 3:31:50 AM
to Discuss
Thanks for your suggestion, that is a good point, but I'm not sure there is enough difference between "writing style" and "semantic meaning" in this module's embedding features. I'm not aware of any other module capable of embedding features that capture the style, rather than the semantics, of a short text. The point is that the Universal Sentence Encoder embedding, as-is (at least in my tests), seems to keep all features normalized around some value; or, we could say, its centroids sit too close to the mean of the distances used.

My expectation is that, since I have songs by artists belonging to completely different genres, this module should be able to cluster similar texts/songs from a semantic perspective. Instead, I can see that:
- using large documents (i.e. with no paragraph split), the resulting features have a small variance;
- using short documents (i.e. with a paragraph split), the features seem to have too much variance, to the point that texts belonging to the same author do not cluster together. There are just a few cases in which the heatmap shows a cluster: its labels identify paragraphs of the same author in consecutive order, so you can see a spot/cluster at the bottom right, but in most cases this information is lost.

So what could be an improvement on both the semantic side and the style side?
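One middle ground I could try (just a sketch, not something tested here) is to embed the paragraphs individually and then mean-pool them back into a single vector per document; this should damp the variance of single-paragraph embeddings while keeping more signal than one embedding of the whole text:

import numpy as np

def mean_pool(vectors, labels):
    # vectors: (num_paragraphs, 512) USE embeddings, as in the earlier sketch;
    # labels: the document each paragraph belongs to.
    # Returns one unit-length vector per document.
    pooled = {}
    for label in set(labels):
        rows = [v for v, l in zip(vectors, labels) if l == label]
        v = np.mean(rows, axis=0)
        pooled[label] = v / np.linalg.norm(v)  # re-normalize after averaging
    return pooled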

Peijin Chen

Nov 16, 2018, 8:38:28 PM
to Discuss
I've tried it on wine reviews (2-3 sentences) and on novels, and I have wondered, with the novels, how exactly things would work. Have you learned anything new since the last post?