Text string(s) corresponding to an entity


marty

Apr 16, 2012, 3:45:15 PM
to Zemanta Developers
Hi there,

I was wondering whether it is possible to return the strings from the
original text that are associated with particular entities. For
example, if my text is "lorem ipsum dolor" and Zemanta finds an entity
which links to something like "Ipsum Technologies LLC", it would
also return "ipsum" (possibly with its position inside the text, or a
small context to make it uniquely identifiable) as the string that led
to that entity being returned.

Just to be clear, I'm not trying to reverse engineer what you're doing
in any way - I would just like to be able to highlight the strings
inside the text accordingly if there are entities extracted for them.
I know the OpenCalais API does return these, so I was wondering whether
you would consider offering this as a parameter as well.

Cheers,
Martin

Andraz Tori

Apr 16, 2012, 4:27:46 PM
to zemanta-d...@googlegroups.com, marty
On 04/16/2012 09:45 PM, marty wrote:
> Hi there,
>
> I was wondering if it were possible to return the strings from the
> original text that were associated with particular entities? So for
> example if my text is "lorem ipsum dolor" and Zemanta finds an entity
> which links to something like "Ipsum Technologies LLC" - that it would
> also return "ipsum" (possibly with position inside the text, or small
> context to make it uniquely identifiable) as the string that led to
> that entity being returned.

This is exactly what the "anchor" attribute of each returned entity is
for. It tells you under which "surface form" the entity was found in the
original text! :)

We've played around and found that, in the texts we are dealing with,
this basically uniquely identifies the place (knowing that there's a
word boundary before and after the "anchor").
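
As a rough illustration of the word-boundary lookup described above, here
is a minimal Python sketch. The function name find_anchor_spans is
hypothetical and not part of the Zemanta API; it only assumes you already
have the original text and an entity's "anchor" string.

```python
import re

def find_anchor_spans(text, anchor):
    """Return (start, end) offsets of each whole-word occurrence of anchor."""
    # Lookarounds approximate "a word boundary before and after the anchor"
    # and also behave sensibly when the anchor contains punctuation.
    pattern = re.compile(r"(?<!\w)" + re.escape(anchor) + r"(?!\w)")
    return [m.span() for m in pattern.finditer(text)]

# Hypothetical usage with the example text from this thread.
spans = find_anchor_spans("lorem ipsum dolor", "ipsum")
```

If the anchor occurs more than once, every occurrence is returned, so the
caller still has to decide which ones to highlight.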

bye
andraz

marty

Apr 16, 2012, 5:08:41 PM
to Zemanta Developers
Hi Andraz,

Thanks for the quick reply. What you're saying is quite true, I should
have been more precise: Would it be possible to return all (distinct)
anchors for a text?

The reason I'm asking: if you take the text from
http://www.readwriteweb.com/archives/why_facebook_terrifies_google.php
for example, Zemanta returns "Google's" as the anchor for the Google
entity. I have to say that this is extremely unhelpful, since the
string is longer than the minimal string ("Google"), which makes it
difficult to detect all corresponding anchors inside the text.

The above is actually a relatively simple example, but what if the
anchors are completely different, like "USA", "United States of
America", "the US"? Cases like that make highlighting all anchors
inside a text basically a new named entity recognition task.

Also, looking again at the text from the RWW link above, it is odd
that the system returns "Google's", since it's neither the first, nor
the last, nor the most frequent string corresponding to that entity
("Google" is all of those things). But that's beside the point.

So I guess the question is: would it be possible to have at least a
list of all distinct anchors, i.e. "Google" and "Google's" for the
text above? I'd be happy to find all the locations of those strings
inside the text myself, so I'm not asking you to return 10 locations
for the same anchor string. This would be great, and I would really
appreciate it if you could consider it! (I think many other users
would as well.)
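
To make the client-side part concrete, here is a minimal Python sketch of
what highlighting could look like once a list of distinct anchors is
available. The function highlight_anchors and the bracket markers are
hypothetical; the list of anchors is assumed to come from the API, which
does not currently return it.

```python
import re

def highlight_anchors(text, anchors, open_tag="[", close_tag="]"):
    """Wrap every whole-word occurrence of any anchor in open/close markers."""
    # Longest anchors first, so "Google's" wins over "Google" at the same spot.
    ordered = sorted(set(anchors), key=len, reverse=True)
    if not ordered:
        return text
    alternatives = "|".join(re.escape(a) for a in ordered)
    pattern = re.compile(r"(?<!\w)(?:" + alternatives + r")(?!\w)")
    return pattern.sub(lambda m: open_tag + m.group(0) + close_tag, text)

# Hypothetical usage with the two anchors mentioned above.
marked = highlight_anchors("Google's plan: Google wins", ["Google", "Google's"])
```

The same call would handle unrelated surface forms such as "USA" and
"the US", as long as both appear in the anchor list.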

Thanks again,
Martin

Andraz Tori

Apr 16, 2012, 5:38:54 PM
to zemanta-d...@googlegroups.com
On 04/16/2012 11:08 PM, marty wrote:
> Hi Andraz,
>
> Thanks for the quick reply. What you're saying is quite true, I should
> have been more precise: Would it be possible to return all (distinct)
> anchors for a text?

Ok, I now understand what you are looking for.

We do exactly what you are asking for behind the scenes, but we do not
report those anchors to API users, mainly because this hasn't yet been
requested by our commercial partners. I can say that we'll consider this
request for future improvements; however, I can't guarantee anything.
We're providing this service entirely for free, and we improve it in
the directions our commercial partners need, plus naturally what we need
it for internally.

I hope the API in its current form will be suitable for you.

I would be interested in the results of your work, though. If you are
writing a paper on these things or building a service around NLP, let
me know.

bye
andraz

marty

Apr 17, 2012, 7:26:44 AM
to Zemanta Developers
Hey again,

I actually already sent you a screenshot of what we're doing :) It was
from a different mail address than this one (webscio).

I understand that you have to keep your paying clients' interests in
mind first, but considering that you're already doing what we're
talking about, returning these details via the API shouldn't be a big
step (time-wise to develop), so I'm really hoping this would be
possible sometime in the (not so distant) future. :)

Cheers,
Martin