Unique entity identifiers

marty

Mar 1, 2012, 4:43:36 PM

to Zemanta Developers

Hi there,

I just gave the Zemanta content analysis API a go and it seems to provide
very useful results. However, one thing that I really miss is a
unique identifier for each entity, either within Zemanta itself or
externally. I know that RDF links are kind of that, but it turns out
that not all entities have an RDF link assigned to them (see the two
examples below). I was wondering if you plan on introducing something
like this, and if so, when (and if not, why not :)). An example
would be how OpenCalais does it, assigning a hash URI to every entity,
such as http://d.opencalais.com/comphash-1/c7172a98-4c8a-31a9-bfd4-ce426c8db3c0.html
or http://d.opencalais.com/er/company/ralg-tr1r/ce181d44-1915-3387-83da-0dc4ec01c6da.html

Thanks in advance for your reply and thumbs up on the great job with
the API!

Regards,
Martin

Two examples without RDF as promised (the first doesn't even have an
entity type):

{'relevance': 0.64361800000000002, 'confidence': 0.604495,
'entity_type': [], 'target': [{'url': 'http://www.theawl.com/',
'type': 'homepage', 'title': 'The Awl'}], 'anchor': 'The Awl'},

{'relevance': 0.62546400000000002, 'confidence': 0.58865100000000004,
'entity_type': ['/book/book'], 'target': [{'url':
'http://www.amazon.com/Chimpanzees-Ann-Elwood/dp/0785782966%3FSubscriptionId%3D0G81C5DAZ03ZR9WH9X82%26tag%3Dzemanta-20%26linkCode%3Dxm2%26camp%3D2025%26creative%3D165953%26creativeASIN%3D0785782966',
'type': 'amazon', 'title': 'Chimpanzees'}], 'anchor': 'Chimpanzees'}

Andraz Tori

Mar 12, 2012, 3:54:17 AM

to zemanta-d...@googlegroups.com, marty

On 03/01/2012 01:43 PM, marty wrote:
> Hi there,
>
> I just gave the Zemanta content analysis API a go and it seems to provide
> very useful results. However, one thing that I really miss is a
> unique identifier for each entity, either within Zemanta itself or
> externally. I know that RDF links are kind of that, but it turns out
> that not all entities have an RDF link assigned to them (see the two
> examples below). I was wondering if you plan on introducing something
> like this, and if so, when (and if not, why not :)). An example
> would be how OpenCalais does it, assigning a hash URI to every entity,
> such as http://d.opencalais.com/comphash-1/c7172a98-4c8a-31a9-bfd4-ce426c8db3c0.html
> or http://d.opencalais.com/er/company/ralg-tr1r/ce181d44-1915-3387-83da-0dc4ec01c6da.html
>
> Thanks in advance for your reply and thumbs up on the great job with
> the API!

Hi Martin,

Currently we don't plan to do that. Our beliefs are:
- If there is no semi-authoritative source of information
for that entity, then there is no 'meaning' in having a unique
identifier. Any URL we return can be used for exactly the same purpose:
in this case http://www.theawl.com would be an identifier. Or you can
use the Amazon URL for that purpose.

- Minting new identifiers for entities that don't have a specific,
well-defined meaning is a bit pointless. You can do it on your side too.
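
Minting an identifier on the client side could look roughly like the sketch below. It is only an illustration, not something Zemanta provides: the helper names normalize_url and mint_id are made up here, and the choice of a name-based UUID (uuid5) over the normalized URL is an assumption. The normalization step is what makes http://www.theawl.com/ and http://www.theawl.com map to the same identifier.

```python
import uuid
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Lowercase scheme and host, drop the fragment and any trailing slash.

    Hypothetical helper: a real normalizer may need more rules
    (for example, stripping tracking parameters).
    """
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/")
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


def mint_id(url):
    """Derive a stable local identifier from one of the returned URLs.

    Assumption: a name-based UUID is good enough as an opaque local key.
    """
    return str(uuid.uuid5(uuid.NAMESPACE_URL, normalize_url(url)))


# The first example entity from the original post.
entity = {
    "entity_type": [],
    "target": [
        {"url": "http://www.theawl.com/", "type": "homepage", "title": "The Awl"}
    ],
    "anchor": "The Awl",
}

# One local identifier per returned URL.
local_ids = [mint_id(t["url"]) for t in entity["target"]]
print(local_ids)
```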

Hope this helps.... and we're always open for discussion.

bye
andraz

marty

Mar 13, 2012, 11:34:42 AM

to Zemanta Developers

Hi Andraz,

Thanks for your reply. I can understand your reasoning and I guess I
can agree with it. I was just worried about matching the same
entity across different texts, considering that the "anchor" is just a
bit of the text and can differ a lot.

One thing I'm still wondering about though: if the same entity is
extracted from a different text (with a different anchor etc.), will it
always have the same set of URL identifiers? Or can it happen that one
time I get both dbpedia and wikipedia links for, say, the "zemanta" entity,
and some other text only returns a dbpedia link but not a wikipedia one?
(Basically that comes down to the question of whether those URLs are matched
by some identifier in your internal database, I guess :))

Also, the "title" returned with a "url" can change based on the source, right?
So if the Wikipedia article about something slightly adjusts its
title (or adds some qualifier after it, like "(band)"), will the title
change as well? Actually, in that case the URL will change as well,
even if it's the same entity :x *arg*

Cheers,
Martin

Andraz Tori

Mar 13, 2012, 4:52:54 PM

to zemanta-d...@googlegroups.com, marty

On 03/13/2012 08:34 AM, marty wrote:
> Hi Andraz,
>
> Thanks for your reply. I can understand your reasoning and I guess I
> can agree with it. I was just worried about matching the same
> entity across different texts, considering that the "anchor" is just a
> bit of the text and can differ a lot.

Yeah, but what I'm saying is: use URLs, not anchors, for comparisons.

>
> One thing I'm still wondering about though: if the same entity is
> extracted from a different text (with a different anchor etc.), will it
> always have the same set of URL identifiers? Or can it happen that one
> time I get both dbpedia and wikipedia links for, say, the "zemanta" entity,
> and some other text only returns a dbpedia link but not a wikipedia one?
> (Basically that comes down to the question of whether those URLs are matched
> by some identifier in your internal database, I guess :))

Yes, they will have the same URLs.
We could still add or remove some URLs from the list as we aggregate more
databases or find that one wasn't accurate enough, etc.
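
A rough sketch of comparing by URL rather than by anchor, under the assumption that the URL list for an entity can gain or lose entries over time: treat two extracted entities as the same if they share at least one normalized URL. The EntityIndex class, its resolve method, and the dbpedia/wikipedia URLs in the example are illustrative assumptions, not part of the Zemanta API.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Lowercase scheme and host, drop the fragment and any trailing slash."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/")
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


class EntityIndex:
    """Maps every URL seen so far to a locally assigned entity number."""

    def __init__(self):
        self._by_url = {}
        self._next_id = 0

    def resolve(self, entity):
        """Return a local id for the entity, or None if it has no URLs."""
        urls = {normalize_url(t["url"]) for t in entity.get("target", [])}
        if not urls:
            return None
        known = {self._by_url[u] for u in urls if u in self._by_url}
        if known:
            # Simplification: if the URLs bridge several known ids, the
            # smallest one wins; a fuller version would merge them.
            entity_id = min(known)
        else:
            entity_id = self._next_id
            self._next_id += 1
        # Remember any URLs that are new for this entity.
        for u in urls:
            self._by_url.setdefault(u, entity_id)
        return entity_id


# Hypothetical responses for the same entity from two different texts.
first = {
    "anchor": "Zemanta",
    "target": [
        {"url": "http://dbpedia.org/resource/Zemanta"},
        {"url": "http://en.wikipedia.org/wiki/Zemanta"},
    ],
}
second = {
    "anchor": "Zemanta Ltd.",
    "target": [{"url": "http://dbpedia.org/resource/Zemanta"}],
}

index = EntityIndex()
same = index.resolve(first) == index.resolve(second)
print(same)
```

Matching on any shared URL, instead of requiring the full lists to be equal, keeps the comparison stable when a URL is later added to or dropped from the list.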

> Also, the "title" returned with a "url" can change based on the source, right?
> So if the Wikipedia article about something slightly adjusts its
> title (or adds some qualifier after it, like "(band)"), will the title
> change as well? Actually, in that case the URL will change as well,
> even if it's the same entity :x *arg*

Wikipedia rarely changes the titles of established entities, so no need to
worry about that :)

bye
andraz
