Multi-value semantic annotations

Jan Kouřil

Dec 1, 2014, 6:47:38 AM

to mg...@googlegroups.com

Is it possible to index multi-value annotations in an MG4J semantic index? As far as I understand, semantic indexing with MG4J allows only a fixed number of annotations per token (as in WikipediaDocumentCollection).

Basically I want to do the following:

position token profession ....

1 Rembrandt painter|printmaker

2 was

...

Every person identified by NER software can have a variable number of professions, and when I query the "profession" index, I want to retrieve the document about Rembrandt when I query either "profession:painter" or "profession:printmaker".

I am afraid this can't be done easily with MG4J; it would be nice to know whether such a feature is planned.

Regards,

Jan

Sebastiano Vigna

Dec 1, 2014, 12:34:38 PM

to mg...@googlegroups.com

Good question.

Indexing multiple tokens at the same position is a very reasonable idea, if nothing else to index variants of the same word. Nothing in MG4J prevents this; the only problem is that Scan has no support for it.

There are ways to work around the problem: you can, for instance, split the multiple annotations across a small number of different indices, and then just merge them. For example, one index has

> position token profession ....
> 1 Rembrandt printmaker
> 2 was

and the other

> position token profession ....
> 1 Rembrandt painter
> 2 was

The resulting indices would have disjoint term sets, and if you merge them you'll get what you want.
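A minimal sketch of the split-and-merge idea, using a toy in-memory positional index in plain Python. Nothing here is MG4J API: `build_index`, `merge_disjoint` and `search` are hypothetical helpers, and the merge simply takes the union of two term-to-postings maps, assuming the term sets really are disjoint.

```python
from collections import defaultdict


def build_index(docs):
    """Build a toy positional index: term -> {doc_id: [positions]}.

    `docs` maps a document id to a list of (position, term) pairs;
    a term of None means there is no annotation at that position.
    """
    index = defaultdict(dict)
    for doc_id, pairs in docs.items():
        for position, term in pairs:
            if term is not None:
                index[term].setdefault(doc_id, []).append(position)
    return dict(index)


def merge_disjoint(first, second):
    """Merge two toy indices, refusing to proceed if their term sets overlap."""
    overlap = first.keys() & second.keys()
    if overlap:
        raise ValueError(f"term sets are not disjoint: {sorted(overlap)}")
    return {**first, **second}


def search(index, term):
    """Return the sorted ids of the documents containing `term`."""
    return sorted(index.get(term, {}))


# Each partial index holds at most one profession per position.
index_a = build_index({"rembrandt": [(1, "printmaker"), (2, None)]})
index_b = build_index({"rembrandt": [(1, "painter"), (2, None)]})
merged = merge_disjoint(index_a, index_b)
```

In this toy version, `search(merged, "painter")` and `search(merged, "printmaker")` both find the Rembrandt document, and both terms are recorded at position 1.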

Probably the best thing would be to have a new type of field, ANNO, that returns a position and a token (in monotonically nondecreasing order). The indexing system would then just index the token at the right position. If you're interested I can try to put it together.

Of course, you can also build your own index using a suitable IndexWriter; it's quite easy.
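To illustrate what such a field could feed to the indexer, here is a rough sketch in plain Python rather than MG4J code. It assumes the ordering requirement applies to positions, and `index_annotations` is a hypothetical helper: it consumes (position, token) pairs, rejects a position that goes backwards, and allows repeated positions so that several tokens can share one.

```python
from collections import defaultdict


def index_annotations(index, doc_id, annotations):
    """Add (position, token) pairs for one document to a toy index.

    `index` maps token -> {doc_id: [positions]}. Positions must be
    monotonically nondecreasing; equal positions are allowed, which is
    how several tokens end up at the same position.
    """
    last = None
    for position, token in annotations:
        if last is not None and position < last:
            raise ValueError(f"position {position} comes after {last}")
        last = position
        index[token].setdefault(doc_id, []).append(position)


# Hypothetical "profession" index: two annotations at position 1.
profession = defaultdict(dict)
index_annotations(profession, "rembrandt", [(1, "painter"), (1, "printmaker")])
```

Both `profession["painter"]` and `profession["printmaker"]` then point to the Rembrandt document at position 1, without building and merging separate indices.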

Ciao,

seba

Jan Kouřil

Dec 1, 2014, 5:09:46 PM

to mg...@googlegroups.com, vi...@di.unimi.it

On Monday, December 1, 2014 at 18:34:38 UTC+1, Sebastiano Vigna wrote:

Merging indices probably wouldn't work for me (I'm afraid such an approach would consume too much space). But I am able to supply information about which tokens should be indexed in which indices at given positions. So yes, I would be interested in this (although support in Scanner would be ideal). Any help with this would be greatly appreciated.