Plan for the next WePS


Satoshi Sekine

Jul 12, 2007, 10:41:52 AM
to web-people-search-...@googlegroups.com
Dear all WePS-ers,


This is Satoshi, one of the organizers of WePS.

First of all, I would like to thank you all for your participation in and
cooperation on the task. I was glad to see many of you at the SemEval
workshop, and it was nice to talk with you at the workshop and the dinner.

We would like to hear your opinions about the future of WePS. We would
like to propose the following draft plan.


1. where we will be
In this WePS, we were a part of SemEval. However, the participants were
mainly restricted to the NLP area (i.e. people who attend ACL). We should
stretch out to IR (SIGIR), the Web (WWW), ML (KDD) or DM (ICDM). In
particular, we believe WWW is a good place to be, as its focus is on
applications like WePS and we could easily find industry people who are
interested in WePS. If you know of ways to advocate WePS to those areas,
please let us know.

2. tasks
As many of us discussed, it seems many have enthusiasm for a task on
extracting properties/attributes/features (we have to fix the terminology...) of
people, so we would like to have this as an additional task. We have also
discussed including organizations and locations as target types, but
we believe this task has less priority (maybe not in the second round).
If there were many names that are ambiguous between people, locations
and organizations, it could be interesting, but it seems there are not
many. For example, "Washington" sounds ambiguous, but "George
Washington", "Washington D.C.", "Washington State" or "Washington
Mutual" are not really the same name. (Finding those different entities
from the keyword "Washington" could be an interesting task, but it has a
somewhat different focus.)

3. schedule
We don't want to rush into the next event. We want to have time to
advertise the task to different areas and to build systems for the new
task. I think it is a good idea to aim at having a workshop at
WWW2009 (most likely April). Then we can have a dry run (or practice
evaluation event) in addition to the formal run.

4. relationship with other activities
There are many related activities out there. We think it is better to have
good relationships with those activities rather than trying to be
competitive. As many of you may know, Spock has a similar challenge.
I (Satoshi) have a friend there and am discussing a visit there. Also, the
JHU summer school has a similar project.
http://www.clsp.jhu.edu/ws2007/groups/elerfed/
I have contacted them and am planning a visit to talk about our
activities and discuss cooperation. (They might become one of the
strong participating systems.)

5. organization member
I think Javier, Julio and Satoshi will continue to chair the next event, but
we would like to form an organizing committee. If you are interested in
joining the committee, please let us know. We would like to discuss
issues like the above in the group. (no serious obligations or duties)


Thank you very much and I'm looking forward to hearing from you!


--
Satoshi Sekine
sek...@cs.nyu.edu

Satoshi Sekine

Jul 19, 2007, 5:02:34 PM
to web-people-search-...@googlegroups.com, els.l...@hogent.be, veroniq...@hogent.be, Timur.F...@ugent.be, y...@colorado.edu, James....@colorado.edu, andre...@dfki.de, guenter...@dfki.de, PaulK...@fairisaac.com, Matthi...@fairisaac.com, pop...@itc.it, mag...@itc.it, de...@cs.jhu.edu, nga...@cs.jhu.edu, yaro...@cs.jhu.edu, er...@psu.edu, sy...@psu.edu, don...@psu.edu, tany...@comp.nus.edu.sg, ka...@comp.nus.edu.sg, sag...@dcs.shef.ac.uk, sugi...@lr.pi.titech.ac.jp, o...@pi.titech.ac.jp, e.ag...@si.ehu.es, a.s...@si.ehu.es, dva...@inf.uc3m.es, cdep...@inf.uc3m.es, tvic...@inf.uc3m.es, Jeremy...@unn.ac.uk, Gary....@unn.ac.uk, kba...@science.uva.nl, le...@dcs.gla.ac.uk, m...@science.uva.nl, j.i...@sheffield.ac.uk, l....@sheffield.ac.uk, z.z...@sheffield.ac.uk
Dear WePS-ers,


In order to understand the participating systems, I've read all the
papers and made a summary. I enjoyed reading them, and I realized again
that all the effort you put into this task is really valuable.

I have to confess that the summary is not 100% accurate due to the limits
of my ability, and I need your help. Please look at your system's summary
and send me (sek...@cs.nyu.edu) back your corrections!!


I'm going to visit JHU's summer school, which is working on a similar task,
at the end of the month and give a talk about our efforts. So I would
appreciate it if you could send me your corrections as soon as possible.
http://www.clsp.jhu.edu/ws2007/groups/elerfed/

I attach doc and pdf files, as well as the text below. All of them
have the same contents.

Thank you very much in advance.

Satoshi

P.S. I CC-ed this to all the authors to make sure all of the
participants received this message. If you have not joined the WePS
mailing list, please visit the following URL.
http://groups.google.es/group/web-people-search-task---semeval-2007/


--------------------------------------------
* Description of systems
> HTML to text: tool/method to extract text from html file
> Pre-process: pre-process including text reformat or NLP analysis
> NE: tool/method for named entity tagger
> Features: type of features to be used in similarity calculation
> Span: span in the text from which the features are extracted (e.g. the whole document, a 20-word window around the name)
> Similarity: method to calculate similarity
> Clustering: clustering algorithm
> Threshold: How to decide the threshold in the clustering
> Duplication: How to handle multiple names in a single document
> Notes: Other notes

* CU_COMSEM - University of Colorado at Boulder, US (0.79)
> HTML to text: Beautiful Soup (http://www.crummy.com/software/BeautifulSoup)
> Pre-process: MXTERMINATOR (sentence segmenter), UPenn tokenization, syntactic phrase chunker
> NE: EXERT (http://sds.colorado.edu/EXERT) - ACE based NE & Coreference
> Features: local tokens, full tokens, URL tokens, title tokens in root page (token features), base noun phrase, NE (phrase features)
> Span: ?
> Similarity: two level SoftTFIDF
> Clustering: agglomerative clustering with single linkage
> Threshold: Fixed stop threshold acquired from the training data
> Duplication: one person per document
> Notes:

* IRST-BP - FBK-irst, Italy (0.77)
> HTML to text: ?
> Pre-process: Extract e-mail addresses, phone and fax numbers. Text is split into person-paragraphs; MiniPar
> NE: IRST’s system
> Features: NE, distinctive words, temporal expressions, N-grams of the mentioned features
> Span: ?
> Similarity: Sum of the individual scores of the common features (weight based on distance from the name of interest)
> Clustering: bottom up rule based clustering
> Threshold: hand made heuristics
> Duplication: Text is split into person-paragraph
> Notes:

* PSNUS - Pennsylvania State University, US / National University of Singapore (0.77)
> HTML to text: ?
> Pre-process: ?
> NE: Stanford Named Entity Recognizer
> Features: NE only
> Span: ?
> Similarity: cosine similarity (see the illustrative sketch after this entry)
> Clustering: Hierarchical agglomerative clustering (HAC)
> Threshold: similarity threshold value of 0.2
> Duplication: one person per document
> Notes: ?
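
Many of the entries above share the same basic recipe: build a feature
vector per document (often just named entities), compute a pairwise
similarity such as cosine, and run agglomerative clustering that stops at a
similarity threshold. The following is only a minimal illustrative sketch of
that recipe in Python (it is not any team's actual code; the named-entity
bags and the 0.2 threshold merely echo the PSNUS description above):

    from collections import Counter
    from math import sqrt

    def cosine(a, b):
        # Cosine similarity between two sparse term-frequency vectors.
        dot = sum(a[t] * b.get(t, 0) for t in a)
        norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
        return dot / norm if norm else 0.0

    def single_link_hac(docs, threshold=0.2):
        # docs: one feature dict per document (e.g. named-entity counts).
        # Start with one cluster per document and repeatedly merge the most
        # similar pair (single linkage) while some pair is above the threshold.
        clusters = [[i] for i in range(len(docs))]
        while True:
            best, pair = threshold, None
            for i in range(len(clusters)):
                for j in range(i + 1, len(clusters)):
                    sim = max(cosine(docs[a], docs[b])
                              for a in clusters[i] for b in clusters[j])
                    if sim >= best:
                        best, pair = sim, (i, j)
            if pair is None:
                break
            i, j = pair
            clusters[i] += clusters.pop(j)
        return clusters

    # Toy usage: each "document" is a bag of named-entity strings.
    docs = [Counter(["New York", "NYU", "linguistics"]),
            Counter(["NYU", "linguistics", "Prague"]),
            Counter(["baseball", "Seattle"])]
    print(single_link_hac(docs))   # -> [[0, 1], [2]]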

* UVA - University of Amsterdam, Netherlands (0.69)
> HTML to text: self developed
> Pre-process: stop word, stemming
> NE: none?
> Features: words in title, snippet and body
> Span: ?
> Similarity: log odds ratio
> Clustering: Single Pass Clustering (SPC)
> Threshold: ?
> Duplication: ?
> Notes: ?

* SHEF - University of Sheffield, UK (0.66)
> HTML to text: ?
> Pre-process: Tokenization, Sentence splitting, POS tagging, NE, NE coreference, Summarization using GATE
> NE: GATE
> Features: Mention in full document (not only in summary)
> Span: ?
> Similarity: cosine of vectors
> Clustering: agglomerative clustering, nearest distance
> Threshold: fixed by training data
> Duplication: one person per document
> Notes: ?

* FICO - Fair Isaac Corporation, US (0.67)
> HTML to text: Eliminate HTML and script and convert chunks to paragraph
> Pre-process: Local parsing around NE, coreference
> NE: Own NER system
> Features: Names (location, organization and person), title tokens, string similarity, Soundex
> Span: ?
> Similarity: Logarithm of the inverse frequency
> Clustering: ?
> Threshold: 2 x least person score
> Duplication: ?
> Notes: ?

* UNN - Northumbria University, UK (0.66)
> HTML to text: ?
> Pre-process: POS tagger to identify sequence of proper nouns
> NE: POS and simple patterns using a gazetteer
> Features: First use other person names; then, for allocated documents, use lexical chains derived from Roget's thesaurus (Ellman 2000)
> Span: ?
> Similarity: <see above>
> Clustering: single link hierarchical agglomerative clustering
> Threshold: ?
> Duplication: one person per page
> Notes: ?

* AUG - Ghent University Association (0.64)
> HTML to text: ?
> Pre-process: Memory-based shallow parser, regular expression based tokenizer, POS tagger, text chunker using memory-based tagger
> NE: none?
> Features: biographic facts (place, date of birth, death; rule-based extractor), distinctive characteristics for a given person (three names, e-mail, URL in the page), a list of weighted keywords and metadata information about the web page (URL, domain, city from IP address; http://www.maxmind.com/app/geolitecity)
> Span: ?
> Similarity: TFIDF (for keywords)
> Clustering: the eager RIPPER rule learner (Cohen 1995) for classification <I'm not sure what this is - Sekine>, single-link hierarchical agglomerative clustering. Two methods for famous and non-famous people, plus some heuristic rules for merging clusters
> Threshold: trained on the trial and training data
> Duplication: ?
> Notes: ?

* SWAT-IV - No paper - (0.62)
> HTML to text: ?
> Pre-process: ?
> NE: ?
> Features: ?
> Span: ?
> Similarity: ?
> Clustering: ?
> Threshold: ?
> Duplication: ?
> Notes: ?

* US-ZSA - No paper - (0.61)
> HTML to text: ?
> Pre-process: ?
> NE: ?
> Features: ?
> Span: ?
> Similarity: ?
> Clustering: ?
> Threshold: ?
> Duplication: ?
> Notes: ?

* TITPI - Tokyo Institute of Technology (0.60)
> HTML to text: ?
> Pre-process: ?
> NE: ?
> Features: ?
> Span: ?
> Similarity: ?
> Clustering: Semi-supervised clustering that controls the fluctuation of the centroid of a cluster
> Threshold: ?
> Duplication: ?
> Notes: ?

* JHU1-13 - Johns Hopkins University (0.58)
> HTML to text: ?
> Pre-process: POS tagger
> NE: Place names, occupation names and titles based on a gazetteer
> Features: bag-of-words over nouns and adjectives, with preferential weighting for terms around the name, place names, occupation names and titles. Snippets from the Web are also used.
> Span: ?
> Similarity: ?
> Clustering: K-means clustering
> Threshold: tuning based on the training data
> Duplication: ?
> Notes: ?

* DFKI2 - DFKI (0.53)
> HTML to text: html2text (http://www.mbayer.de/html2text/index.shtml)
> Pre-process: coreference, NE tagger, NLP tool for semantic parsing to build predicate-argument structures
> NE: LingPipe toolkit
> Features: NE and predicate-argument
> Span: ?
> Similarity: <see above>
> Clustering: hierarchical clustering
> Threshold: training based on training data (and set 12)
> Duplication: one person per file
> Notes: ?

* WIT - The University of Sheffield (0.52)
> HTML to text: own?
> Pre-process: tokenization, removal of stop words and infrequent words, and stemming with Porter’s algorithm
> NE: OpenNLP
> Features: A graph consisting of nodes of type Token, Webpage, Metadata, Title and Body
> Span: ?
> Similarity: Commute Time Distance on random walk model
> Clustering: group-average agglomerative clustering (Fleischman and Hovy 2004)
> Threshold: unsupervised heuristics, e.g. the well-known Calinski & Harabasz stopping rule (Calinski & Harabasz, 1974)
> Duplication: ?
> Notes: Random walk on graph based approach

* UC3M_13 - Universidad Carlos III de Madrid (0.51)
> HTML to text: ?
> Pre-process: ?
> NE: a robust rule-based name recognizer based on surface features and some trigger words
> Features: a) emails, b) URLs, c) proper names, d) long numbers (more than four figures), e) short numbers (up to four figures), f) title terms, g) terms of the titles of related documents, h) terms contained in the ‘meta’ tag of the documents, i) terms of emphasized text fragments (bold, italic, etc.), j) terms of the document snippet, and k) terms of the related documents snippets: filtered by stop words, name coreference is done by a similarity measure
> Span: ?
> Similarity: cosine vector similarity with TFIDF weights, but feature filtering is done based on the similarity between each pair of bags of terms
> Clustering: Agglomerative Vector Space clustering
> Threshold: a similarity threshold of 0.001; 2.5 pairs of bags of terms above the threshold are required for including two documents in the same cluster, using the following bags of terms: bags of URLs, proper names, long and short numbers, terms of titles, terms of the titles of the related documents and terms of the document snippets
> Duplication: each document refers only to one person
> Notes: ?

* UBC-AS - University of the Basque Country (0.45)
> HTML to text: home-made wrapper
> Pre-process: split into sentences and parsed using the FreeLing parser
> NE: none?
> Features: lemmas of nouns (with frequency more than 4 in the BNC) in a 9-sentence context, AND topology of pages on the web (finding pages which point to duplicated pages)
> Span: ?
> Similarity: For inducing the hubs, the HyperLex algorithm is applied. Then the MST is calculated and every context is assigned a hub score vector. The matrix M of pairwise similarities between documents is then computed and pruned with a threshold of 0.2.
> Clustering: Random walk type algorithm
> Threshold: Tuned by the training data
> Duplication: ?
> Notes: ?


--
Satoshi Sekine
sek...@cs.nyu.edu

systemWePS2007.pdf
systemWePS2007.doc

Paul Kalmar

Jul 19, 2007, 6:56:41 PM
to web-people-search-...@googlegroups.com
Hi Satoshi,

I had some thoughts on your comment that Location and Organization disambiguation is less important because of lower frequency.

First of all, Organization names are highly ambiguous, especially when using acronyms.  For example, try searching for "ACL" and see how many distinct organizations you find.

Locations are important: although, as with famous person names, a small percentage of names covers a large percentage of the entities, this makes finding the specific location that the user is searching for even more important. Although the user can get high precision by fully specifying the location, this will produce low recall. Also, even a fully specified query such as "Paris, New York" will still get many ambiguous results.

These seem like important additions, especially if we are waiting until 2009 for the next run.

Thanks,
Paul

Satoshi Sekine

Jul 20, 2007, 5:25:51 PM
to web-people-search-...@googlegroups.com
Paul and all,

Thank you for your comments.


> I had some thoughts on your comment that Location and Organization
> disambiguation is less important because of lower frequency.

"Less important" and "has less priority" are slightly different.
I think I wrote "the task has less priority". I thought it would be very
interesting if there were ambiguity among different categories (like
Washington, which can be a person or a location), but it might not occur so often.


> First of all, Organization names are highly ambiguous, especially when using
> acronyms. For example, try searching for "ACL" and see how many distinct
> organizations you find.

Organizations (in particular acronyms) can pose a similar problem to
people names. Or, in general, acronym disambiguation can be interesting
(although this might be out of our scope). "NLP" has 7 senses in Wikipedia.


> Locations are important: although, as with famous person names, a small
> percentage of names covers a large percentage of the entities, this makes
> finding the specific location that the user is searching for even more
> important. Although the user can get high precision by fully specifying the
> location, this will produce low recall. Also, even a fully specified query
> such as "Paris, New York" will still get many ambiguous results.

Basically, I agree that location could be an interesting task, too. But I'm
not sure why "Paris, New York" would still be ambiguous. There is a
gazetteer of location names, so the task would most likely be to find a
link between the mention and an entity in the gazetteer. I believe it
would be easy to find the answer for "Paris, New York".
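
(A minimal sketch of such gazetteer linking, just to illustrate the idea:
the gazetteer below is a made-up toy, and the rule simply matches an
explicit qualifier such as a state or country name in the surrounding text.)

    # Toy gazetteer: mention string -> list of (entry, disambiguating qualifier).
    GAZETTEER = {
        "Paris": [("Paris, France", "France"),
                  ("Paris, Texas, US", "Texas"),
                  ("Paris, New York, US", "New York")],
        "London": [("London, England, UK", "England"),
                   ("London, Ontario, Canada", "Ontario")],
    }

    def link_location(mention, context):
        # Return the entries whose qualifier appears in the surrounding
        # context; if none matches, return all candidates (still ambiguous).
        candidates = GAZETTEER.get(mention, [])
        matched = [entry for entry, qualifier in candidates if qualifier in context]
        return matched or [entry for entry, _ in candidates]

    print(link_location("Paris", "a small town called Paris, New York"))
    # -> ['Paris, New York, US']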


> These seem like important additions, especially if we are waiting until 2009
> for the next run.

I would like to hear from other people, too.


By the way, several people have sent me corrections to the summary.
Thank you very much!! If you are planning to do so, please send me your
corrections by Wednesday, if possible.


Satoshi

--
Satoshi Sekine
sek...@cs.nyu.edu

jeremy ellman

Jul 22, 2007, 5:30:33 AM
to web-people-search-...@googlegroups.com
Hi Satoshi,

Keep up the great work!

Your summary of UNN-WePS was pretty good. I'd put the lexical chain
similarity matching under the similarity section, as there were two
distinct passes. HTML was converted to text using the standard
component from the Perl PoS tagger Lingua::EN.

Span was the whole document.
There was a threshold similarity of 0.7 -- but our Nearest Neighbor is
somewhat idiosyncratic due to its derivation, so I would not like to
claim that our 0.7 similarity was the same as anyone else's.

jeremy ellman

Jul 22, 2007, 5:39:39 AM
to web-people-search-...@googlegroups.com
Hi All,

Place search seems to be known as toponym resolution, and there is some research interest in this area. I'm pretty sure I could find someone interested in taking that forward.

I have a student who (following up on the genealogy side) is interested in trying to identify family relations from the web. There is some research here but not a great deal. Given the range of genealogy sites there might be some interest in the wider web community.

Finally, I'd be happy to be involved in the organizing committee.

Best regards,

Jeremy Ellman

Kazunari Sugiyama

Jul 23, 2007, 3:30:31 AM
to web-people-search-...@googlegroups.com
Dear Prof. Sekine,

The following are my opinions about the future of WePS.


> 1. where we will be
> In this WePS, we were a part of SemEval. However, the participants were
> mainly restricted to the NLP area (i.e. people who attend ACL). We should
> stretch out to IR (SIGIR), the Web (WWW), ML (KDD) or DM (ICDM).

As you may know, ACM CIKM is one of the candidate venues to hold the workshop,
since the CIKM conference covers work on databases, natural language processing,
and information retrieval. Every year, CIKM is held at the end of October
or at the beginning of November.


> 2. tasks
> As many of us discussed, it seems many have enthusiasm for a task on
> extracting properties/attributes/features (we have to fix the terminology...) of
> people, so we would like to have this as an additional task. We have also
> discussed including organizations and locations as target types, but
> we believe this task has less priority (maybe not in the second round).

I think that it is also an important task to disambiguate not only personal
names but also location names. I have read a paper that addresses location
name disambiguation:

E. Amitay et al.,
"Web-a-Where: Geotagging Web Content" (SIGIR'04).

Today, when I searched for this paper on Google Scholar, it was cited by
40 papers. Thus, I guess that disambiguating location names is an
important research area.

"Property/attribute/feature" can be good cues to disambiguate personal name,
but if this point is emphasized, I think the purpose of the task can be changed to
information extraction rather than disambiguation. Is it all right?


> 3. schedule
> We don't want to rush into the next event. We want to have time to
> advertise the task to different areas and to build systems for the new
> task. I think it is a good idea to aim at having a workshop at
> WWW2009 (most likely April). Then we can have a dry run (or practice
> evaluation event) in addition to the formal run.

As I mentioned above, we can list ACM CIKM as one of the candidates.


> 5. organization member
> I think Javier, Julio and Satoshi will continue to chair the next event, but
> we would like to form an organizing committee. If you are interested in
> joining the committee, please let us know. We would like to discuss
> issues like the above in the group. (no serious obligations or duties)

I'd like to join the committee to help this research area evolve!

Best regards,
Kazunari Sugiyama

Satoshi Sekine

Aug 1, 2007, 1:31:51 PM
to web-people-search-...@googlegroups.com
Jeremy,

I'm sorry that I could not respond to you earlier. I took a vacation and
spent some time resolving all the issues that came up while I was away.

Thank you for your notes. These issues are always educational. Please
keep posting such issues. Things that are interesting and tasks suitable for
evaluation are sometimes different, but at this moment we should keep
our eyes wide open.

Also, thank you very much for your offer to be a committee member. In
the fall, we would like to re-activate these activities.

Thanks,
Satoshi

--
Satoshi Sekine
sek...@cs.nyu.edu

Satoshi Sekine

Aug 1, 2007, 1:47:52 PM
to web-people-search-...@googlegroups.com
Sugiyama-san,

Thank you very much for the comments, and I'm sorry that I could not
respond earlier. I was on vacation and have been busy cleaning up all
the issues that arose during the vacation.


> > 1. where we will be
> > In this WePS, we were a part of SemEval. However, the participants were
> > mainly restricted to the NLP area (i.e. people who attend ACL). We should
> > stretch out to IR (SIGIR), the Web (WWW), ML (KDD) or DM (ICDM).
>
> As you may know, ACM CIKM is one of the candidate venues to hold the workshop,
> since the CIKM conference covers work on databases, natural language processing,
> and information retrieval. Every year, CIKM is held at the end of October
> or at the beginning of November.

Thank you. I could not attend CIKM-06, although I was planning to go.
Compared to WWW, isn't it smaller or less focused on applications?


> > 2. tasks
> > As many of us discussed, it seems many have enthusiasm for a task on
> > extracting properties/attributes/features (we have to fix the terminology...) of
> > people, so we would like to have this as an additional task. We have also
> > discussed including organizations and locations as target types, but
> > we believe this task has less priority (maybe not in the second round).
>
> I think that it is also an important task to disambiguate not only personal
> names but also location names. I have read a paper that addresses location
> name disambiguation:
>
> E. Amitay et al.,
> "Web-a-Where: Geotagging Web Content" (SIGIR'04).
>
> Today, when I searched for this paper on Google Scholar, it was cited by
> 40 papers. Thus, I guess that disambiguating location names is an
> important research area.

Sounds interesting, but I'm not sure how much ambiguity there is for
locations. From reading the abstract of the paper, they use the example of
London in Ontario, but I'm sure that if people refer to London in Ontario,
there must be an explicit and obvious clue in the text that it is
Ontario's London rather than England's London.


> "Property/attribute/feature" can be good cues to disambiguate personal name,
> but if this point is emphasized, I think the purpose of the task can be changed to
> information extraction rather than disambiguation. Is it all right?

I felt the same way, but after the conversations in Prague, almost everybody
seems interested in this direction. But it is true that if we want to
include IR people, this should not get too much emphasis.

> > 5. organization member
> > I think Javier, Julio and Satoshi will continue to chair the next event, but
> > we would like to form an organizing committee. If you are interested in
> > joining the committee, please let us know. We would like to discuss
> > issues like the above in the group. (no serious obligations or duties)
>
> I'd like to join the committee to help this research area evolve!

Thank you very much.
We will re-activate the planning after the summer. At the moment we are
tagging people's attributes to see how realistic it is to make this a
task.


--
Satoshi Sekine
sek...@cs.nyu.edu

Paul Kalmar

Aug 2, 2007, 1:53:03 PM
to web-people-search-...@googlegroups.com
Hi Satoshi,

I have some comments on a few of your points:

On 8/1/07, Satoshi Sekine <sek...@cs.nyu.edu> wrote:

> > I think that it is also an important task to disambiguate not only personal
> > names but also location names. I have read a paper that addresses location
> > name disambiguation:
> >
> > E. Amitay et al.,
> > "Web-a-Where: Geotagging Web Content" (SIGIR'04).
> >
> > Today, when I searched for this paper on Google Scholar, it was cited by
> > 40 papers. Thus, I guess that disambiguating location names is an
> > important research area.
>
> Sounds interesting, but I'm not sure how much ambiguity there is for
> locations. From reading the abstract of the paper, they use the example of
> London in Ontario, but I'm sure that if people refer to London in Ontario,
> there must be an explicit and obvious clue in the text that it is
> Ontario's London rather than England's London.

I still think that you are missing a crucial point.  Although it is often the case that humans are able to unambiguously understand a location name, it is not always the case that a computer will be able to do so.  Furthermore, it is not always the case that there is an explicit clue which disambiguates a location.  How would a computer know to assign the default location of "London" to England?  In a search engine query, there is often no context at all with which to disambiguate -- how would a search engine know which set of results you wanted to see?  I still think that this would be a very interesting component to the Web Entity Search task.

> "Property/attribute/feature" can be good cues to disambiguate personal name,
> but if this point is emphasized, I think the purpose of the task can be changed to
> information extraction rather than disambiguation. Is it all right?

I felt the same way, but after the conversation at Prague, almost everybody
is interested in this direction. But, it is true that if we want to
include IR people, this should not get much emphasis.

If we are indeed going to include this component, I think the task should be split into two sub-tasks: Information Extraction and Entity Disambiguation.  This way teams could choose to participate in only one area, or both, encouraging participation from broader audiences.  Furthermore, if the Information Extraction subtask occurs before the Disambiguation task, the attributes that are annotated in the gold standard for the Information Extraction task can be made available to be used as features for the Entity Disambiguation task.  This would allow for the two tasks to be more independently judged, rather than biasing the Entity Disambiguation task by teams who perform better on Information Extraction.

Thanks,
Paul

Satoshi Sekine

Aug 8, 2007, 11:33:41 AM
to web-people-search-...@googlegroups.com
Hi Paul,

Thank you very much for your comments.


> > > I think that it is also an important task to disambiguate not only
> > > personal names but also location names. I have read a paper that
> > > addresses location name disambiguation:
> > >
> > > E. Amitay et al.,
> > > "Web-a-Where: Geotagging Web Content" (SIGIR'04).
> > >
> > > Today, when I searched for this paper on Google Scholar, it was cited
> > > by 40 papers. Thus, I guess that disambiguating location names is an
> > > important research area.
> >
> > Sounds interesting, but I'm not sure how much ambiguity there is for
> > locations. From reading the abstract of the paper, they use the example
> > of London in Ontario, but I'm sure that if people refer to London in
> > Ontario, there must be an explicit and obvious clue in the text that it
> > is Ontario's London rather than England's London.
>
> I still think that you are missing a crucial point. Although it is often
> the case that humans are able to unambiguously understand a location name,
> it is not always the case that a computer will be able to do so.

I understand this, but my point is that in the case of locations in
documents, it is more likely that there are very explicit clues for
disambiguation. However, in the case of people names, it is much less
likely (as people do not know there are other people who share the same
name; maybe only a person whose name is the same as a very famous
person's might say "I'm Bill Clinton, not the one who was the President
of the US"). Also, another crucial difference I found from reading the
paper is that location names have an almost complete gazetteer, and the
task is to identify the location from that list. So it is more like a
categorization task than a clustering task.

By the way, I'm surprised by the experimental result (around 80%) reported
in the paper. Also, I found that the major source of error is NE
identification rather than NE disambiguation. If there are recent
advances over this paper, please let me know. Thanks.

I still believe location disambiguation is an interesting task. If
somebody wants to run the task, I'm happy to help. But it seems
that there is enough difference from WePS.

> How would a computer know to assign the default
> location of "London" to England? In a search engine query, there is often
> no context at all with which to disambiguate -- how would a search engine
> know which set of results you wanted to see? I still think that this would
> be a very interesting component to the Web Entity Search task.

It could be an interesting topic. The above paper used population to
decide the default. But there could be other similar heuristics, I
believe.
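
(As a purely illustrative sketch of that population heuristic, with made-up
candidate lists and populations rather than the paper's actual data: when no
disambiguating context is found, pick the candidate with the largest
population as the default referent.)

    # Toy data: mention string -> list of (entry, population). Numbers are invented.
    CANDIDATES = {
        "London": [("London, England", 7500000), ("London, Ontario", 350000)],
        "Paris": [("Paris, France", 2100000), ("Paris, Texas", 25000)],
    }

    def default_location(name):
        # Fall back to the most populous candidate when context gives no clue.
        return max(CANDIDATES[name], key=lambda c: c[1])[0]

    print(default_location("London"))  # -> 'London, England'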

> > "Property/attribute/feature" can be good cues to disambiguate personal
> > name,
> > > but if this point is emphasized, I think the purpose of the task can be
> > changed to
> > > information extraction rather than disambiguation. Is it all right?
> >
> > I felt the same way, but after the conversation at Prague, almost
> > everybody
> > is interested in this direction. But, it is true that if we want to
> > include IR people, this should not get much emphasis.
>
>
> If we are indeed going to include this component, I think the task should be
> split into two sub-tasks: Information Extraction and Entity Disambiguation.

Yes, it should be.

> Furthermore, if the
> Information Extraction subtask occurs before the Disambiguation task, the
> attributes that are annotated in the gold standard for the Information
> Extraction task can be made available to be used as features for the Entity
> Disambiguation task. This would allow for the two tasks to be more
> independently judged, rather than biasing the Entity Disambiguation task by
> teams who perform better on Information Extraction.

This could be arguable.
I'm not sure it is a good idea for the disambiguation component to be
able to use the gold standard of the IE task.

Best,

--
Satoshi Sekine
sek...@cs.nyu.edu

Satoshi Sekine

Jan 16, 2008, 3:09:35 PM
to web-people-search-...@googlegroups.com
Dear all WePS mailing list members,


A Happy New Year (maybe too late... X-<)!!

I'm sorry for being quiet for a while about the next WePS task. I have
been conducting a survey for the attribute extraction task, and I have made
a proposal with 16 attributes. The survey was basically the following:
we had two annotators look at 156 WePS documents and extract
possible attributes. This yielded 123 distinct attributes. Looking at
those, which include infrequent and/or domain-dependent attributes, as
well as merge-able attributes, we made up 16 attribute classes out of
them. (You can find more detail in the attached document; an illustrative
example record follows the list below.)
1 Date of birth
2 Birth place
3 Other name
4 Occupation
5 Affiliation
6 Work
7 Award
8 Education
9 Mentor
10 Location
11 Nationality
12 Relatives
13 Phone
14 FAX
15 Email
16 Web site
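
To make the proposal more concrete, here is a purely hypothetical example of
what a filled-in record for one person cluster might look like (the person,
the values and the exact format are invented for illustration and are not
taken from the annotation guidelines):

    # Hypothetical example record: one value per attribute class.
    example_record = {
        "Date of birth": "1954-03-01",
        "Birth place": "Springfield, Illinois",
        "Other name": "J. R. Smith",
        "Occupation": "professor",
        "Affiliation": "Example University",
        "Work": "Introduction to Information Extraction (textbook)",
        "Award": "Best Paper Award 2005",
        "Education": "Ph.D., Example University",
        "Mentor": "Prof. A. Advisor",
        "Location": "New York, NY",
        "Nationality": "American",
        "Relatives": "Mary Smith (spouse)",
        "Phone": "+1-212-555-0100",
        "FAX": "+1-212-555-0101",
        "Email": "jsmith@example.edu",
        "Web site": "http://www.example.edu/~jsmith/",
    }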

This is just a proposal, and you might have some questions and comments.
I'm happy to answer them and keep the discussion lively.

I would appreciate it if you would respond to this e-mail (to the mailing
list or to me personally) if you are interested in this task, to make sure
it is still worth our while to put effort into it.


Thank you very much in advance!!

--
Satoshi Sekine
sek...@cs.nyu.edu

WePS_attribute_20080115.pdf