WePS2-AE clustering description

Javier Artiles

Feb 10, 2009, 5:12:08 PM

to web-people-...@googlegroups.com

Dear WePS participants,

We are currently writing the WePS2 clustering task description paper. The deadline for your system description papers is the 20th of February, which gives us very little time to read all the papers. To make the whole process more efficient, I would like to ask you to send us (weps-or...@lsi.uned.es) the following points describing your clustering system(s). You don't have to give a long explanation for each point, just the essentials needed to understand your approach.

Thank you for your collaboration,

All the best,

Javier Artiles.

HTML to text: tool/method used to extract text from the HTML files
Pre-process: pre-processing, including text reformatting or NLP analysis
NE: tool/method used for named entity tagging
Features: types of features used in the similarity calculation
Span: span of text from which the features are extracted (e.g. whole document, 20-word window around the name in the text)
Similarity: method used to calculate similarity
Feature weighting: measure used to weight the features that represent each document
Clustering: clustering algorithm
Clustering stop criteria: how your system decides the number of clusters in the output (e.g. similarity threshold, intrinsic measure of clustering quality, etc.)
Duplication (if contemplated by your system): how you handle multiple people referred to by the same ambiguous name in the same document
Notes: other aspects of your participation that you think are important

--
------------------------------------------------------------------------------
Javier Artiles Picón
Departamento de Lenguajes y Sistemas Informáticos
ETSI Informática, UNED

Phone: +34 91 398 8106
Fax: +34 91 398 65 35
Home page: nlp.uned.es/~javier
LinkedIn page: www.linkedin.com/in/javierartiles
------------------------------------------------------------------------------

Javier Artiles

Feb 10, 2009, 5:16:03 PM

to Web People Search Task

The subject of the previous email might have been confusing: it was supposed to be "WePS2 clustering description" instead of "WePS2-AE clustering description". Participants in the AE task have already been asked to send their system description summaries, so this request is directed only at participants in the clustering task.

Javier Artiles.

ying chen

Feb 11, 2009, 3:24:48 AM

to web-people-...@googlegroups.com

Javier Artiles,

Here is the information for the PolyUHK clustering system:

HTML to text: tool/method to extract text from html file

Beautiful Soup

Pre-process: pre-process including text reformat or NLP analysis

UPen tokenization, MXTERMINATOR

NE: tool/method for named entity tagger

None

Features: type of features to be used in similarity calculation

1. Token-based features

Queryname tokens: the tokens occurring in sentences that include a mention of the ambiguous personal name;

Full tokens: the tokens occurring in a given webpage;

URL tokens: the tokens occurring in the URL of a given webpage;

Title tokens in root page (TTRP): the tokens occurring in the title of the root page of a given webpage.

2. N-gram features

Unigram feature: re-learn the weight of each token with the help of the Web 1T 5-gram corpus.

Bigram feature: extract bigrams and learn their weights with the help of the Web 1T 5-gram corpus.

3. Snippet-based features

More information about the focus person in a webpage is collected from a search engine.

Span: span in the text where the features are extracted (e.g. all document, 20 word window of the name in the text)

whole document and other web data

Similarity: method to calculate similarity

cosine similarity

Feature weighting: measure used to weight the features that represent each document

TF-IDF

Clustering: clustering algorithm

agglomerative clustering with single linkage

Clustering stop criteria: how your system decides the number of clusters in the output (e.g. similarity threshold, intrinsic measure of the clustering quality, etc).

similarity threshold
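
The core of the configuration listed in this message (TF-IDF weights, cosine similarity, single-linkage agglomerative clustering cut at a similarity threshold) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the PolyUHK code: the function names, the plain log(N/df) IDF, and the threshold values are hypothetical, and the Web 1T re-weighting and snippet features are left out.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {token: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each token occurs.
    df = Counter(tok for doc in docs for tok in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Assumption: raw term frequency times log(N / df).
        vectors.append({t: c * math.log(n_docs / df[t]) for t, c in tf.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def single_link_clusters(vectors, threshold):
    """Single-linkage clustering stopped at a similarity threshold.

    Cutting a single-linkage dendrogram at a threshold gives the connected
    components of the graph that links every pair with similarity >= threshold,
    so a small union-find structure is enough here.
    """
    parent = list(range(len(vectors)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(vectors)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


pages = [
    ["physics", "professor", "quantum"],
    ["physics", "quantum", "lecture"],
    ["guitar", "band", "tour"],
    ["guitar", "band", "album"],
]
print(single_link_clusters(tfidf_vectors(pages), threshold=0.2))
# [[0, 1], [2, 3]]
```

Raising the threshold yields more, smaller clusters, which is how the threshold acts as the stop criterion.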

Els Lefever

Feb 11, 2009, 3:42:53 AM

to web-people-...@googlegroups.com

Hi Javier,

this is the information for the AUG submissions:

* HTML to text: cleaning script we've implemented ourselves

* Pre-process: tokenizing, PoS-tagging and chunking (MBSP: memory based shallow parser)

* NE: combination of gazetteers and PoS information

* features:

- place/date of birth/death, NE, IP address, geographical location coordinates, weighted keywords, URL/email addresses, telephone and fax numbers

* span: entire document

* similarity: cosine

* feature weighting: gain ratio

* clustering: two algorithms: (1) fuzzy ants clustering, (2) Agnes (hierarchical clustering)

* clustering stop: similarity threshold for the hierarchical clustering; not needed for fuzzy ants (although the number of runs has to be predefined)

Best regards,

Els
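
For readers unfamiliar with gain ratio as a feature weight, here is a minimal sketch of the standard definition (information gain divided by split information). How the AUG system actually applied it is not stated in the message; the setting below, where each training instance is a document pair labelled "same" or "diff" and the feature value says whether the pair shares some attribute, is an assumption made only for illustration.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def gain_ratio(feature_values, labels):
    """Gain ratio of a discrete feature with respect to class labels."""
    n = len(labels)
    # Group the labels by feature value.
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Expected entropy of the labels once the feature value is known.
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - conditional
    # Split information penalises features with many distinct values.
    split_info = entropy(feature_values)
    return info_gain / split_info if split_info > 0 else 0.0


# Hypothetical training pairs: 1 = the two pages share an email address.
shares_email = [1, 1, 0, 0]
same_person = ["same", "same", "diff", "diff"]
print(gain_ratio(shares_email, same_person))
```

A feature that separates the classes perfectly gets a weight of 1, and a feature that carries no information about the class gets 0.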

Paul Kalmar

Feb 12, 2009, 8:34:28 PM

to jav...@gmail.com, web-people-...@googlegroups.com

Hi Javier,

Here is a quick answer to your questions:

HTML to text: tool/method to extract text from html file

Convert all boundary tags to <p> </p> and eliminate all other tags.

Pre-process: pre-process including text reformat or NLP analysis

None.

NE: tool/method for named entity tagger

In house system, based on supervised training of hidden Markov models followed by name list and rule-based post-processing

Features: type of features to be used in similarity calculation

Named Entities
URL tokens
Page Title Tokens
Named Entity Lists
Name match
Gender
Title tokens

Span: span in the text where the features are extracted (e.g. all document, 20 word window of the name in the text)

Document

Similarity: method to calculate similarity

Heuristic based on the number of matching and non-matching features

Feature weighting: measure used to weight the features that represent each document

Information

Clustering: clustering algorithm

Greedy agglomeration within a block

Clustering stop criteria: how your system decides the number of clusters in the output (e.g. similarity threshold, intrinsic measure of the clustering quality, etc).

Threshold based on twice the information of the most informative name token: a person with the rarest first and last names should match without any other features

Duplication (if contemplated by your system): How do you handle multiple people referred by the same ambiguous name in the same document

Within-document disambiguation, followed by cross-document matching of the within-document entities

Notes: Other aspects of your participation that you think are important

The cross-document disambiguation process is completely unsupervised.

Thank you,
Paul Kalmar
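
The HTML-to-text step described in this message (turn boundary tags into paragraph breaks, drop every other tag) could look roughly like the sketch below, written with Python's standard `html.parser`. The message does not say which tags count as boundary tags, so the `BOUNDARY_TAGS` set, the class name, and the function name are assumptions, and paragraphs are returned as a list of strings rather than wrapped in `<p> </p>`.

```python
from html.parser import HTMLParser

# Assumption: tags treated as paragraph boundaries (not listed in the message).
BOUNDARY_TAGS = {
    "p", "div", "br", "li", "tr", "td", "th", "table", "ul", "ol",
    "h1", "h2", "h3", "h4", "h5", "h6",
}


class ParagraphExtractor(HTMLParser):
    """Collects text, starting a new paragraph at every boundary tag."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buffer = []

    def flush(self):
        # Join the buffered text and normalise whitespace.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.paragraphs.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in BOUNDARY_TAGS:
            self.flush()
        # Any other tag is simply ignored, which removes it from the output.

    def handle_endtag(self, tag):
        if tag in BOUNDARY_TAGS:
            self.flush()

    def handle_data(self, data):
        self._buffer.append(data)


def html_to_paragraphs(html):
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    parser.flush()  # emit any trailing text
    return parser.paragraphs


page = "<h1>John Smith</h1><div>Professor of <b>physics</b></div>"
print(html_to_paragraphs(page))
# ['John Smith', 'Professor of physics']
```

Note that this sketch keeps the contents of `<script>` and `<style>` elements as text; a real pipeline would probably want to skip them.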

Juan Martinez Romo

Feb 15, 2009, 10:36:32 AM

to web-people-...@googlegroups.com

Dear Javi,

The features of the UNED clustering system are the following:

HTML to text: HttpUnit and HTMLParser
Pre-process: documents indexed with Lucene, with Porter stemming.
NE: None
Features: relevant terms extracted with language model techniques
Span: Document and title
Similarity: language models and Lucene similarity (cosine)
Feature weighting: Kullback-Leibler divergence
Clustering: Heuristic based on similarity
Clustering stop criteria: Heuristic based on votes obtained from the similarity between documents and occupations

Cheers,

Juan.

--
---------------------------------------------------------
Juan Martínez Romo
Dpto. Lenguajes y Sistemas Informáticos
E.T.S.I. Informática (UNED)
C/ Juan del Rosal nº 16 - Despacho 2.03
Ciudad Universitaria - 28040 Madrid - España
Tel. +34 91 398 93 78
Fax +34 91 398 65 35

---------------------------------------------------------
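
One common way to use Kullback-Leibler divergence as a term weight is to score each term by its contribution to the divergence between the document's language model and the collection's language model. The sketch below illustrates that idea only; the message does not give the exact formulation used in the UNED system, so the unsmoothed maximum-likelihood estimates and the function name are assumptions.

```python
import math
from collections import Counter


def kl_term_weights(docs, index):
    """Per-term KL contribution for docs[index] against the whole collection.

    weight(t) = P(t | doc) * log(P(t | doc) / P(t | collection))

    The collection includes the document itself, so P(t | collection) is
    never zero for a term that occurs in the document.
    """
    doc_counts = Counter(docs[index])
    coll_counts = Counter(tok for doc in docs for tok in doc)
    doc_total = sum(doc_counts.values())
    coll_total = sum(coll_counts.values())

    weights = {}
    for term, count in doc_counts.items():
        p_doc = count / doc_total
        p_coll = coll_counts[term] / coll_total
        weights[term] = p_doc * math.log(p_doc / p_coll)
    return weights


pages = [
    ["smith", "painter", "gallery", "painter"],
    ["smith", "lawyer", "court"],
    ["smith", "court", "judge"],
]
weights = kl_term_weights(pages, 0)
for term in sorted(weights, key=weights.get, reverse=True):
    print(term, round(weights[term], 3))
```

Terms that are much more frequent in the document than in the collection get high positive weights, while terms spread evenly over the collection (here the ambiguous surname) score near or below zero.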

David Eduardo Pinto Avendaño

Feb 15, 2009, 11:45:14 PM

to web-people-...@googlegroups.com, weps-or...@lsi.uned.es

Dear Javier,

The description of the BUAP system is given as follows:

HTML to text: We evaluated two methods that we implemented ourselves
(one in Java, the other in AWK). Neither HTML tags nor URLs were
considered; we only took the text into account.

Pre-process: Removal of punctuation symbols and conversion of all words to lowercase.

NE: We used the Stanford Named Entity Recognizer

Features: Entities

Span: The whole document

Similarity: Dot product between the vectorial representation of the
document and a set of reference vectors.

Feature weighting: Term frequency

Clustering: Fingerprinting (we assigned a simple value to each
document by using hash-based functions)

Clustering stop criteria: We used a pre-defined threshold to determine
a range of hash-based values that should belong to the same cluster.

Duplication (if contemplated by your system): It was not considered.

Notes: The implementation is considered to be unsupervised, since it
does not require a training dataset.

Best regards!

David Pinto
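
The general shape of a fingerprint-based grouping like the one described in this message can be illustrated as follows. The message does not specify the hash functions or the reference vectors, so everything concrete here is hypothetical: each entity gets a pseudo-random reference weight derived from an MD5 hash, the fingerprint is the dot product of the normalised term-frequency vector with those weights, and documents whose sorted fingerprints lie within a threshold of their neighbour share a cluster.

```python
import hashlib
from collections import Counter


def reference_weight(term):
    """Hypothetical reference vector: a hash-derived weight in [0, 1) per term."""
    digest = hashlib.md5(term.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") / 2**32


def fingerprint(entities):
    """Dot product of the normalised term-frequency vector and the reference weights."""
    tf = Counter(entities)
    total = sum(tf.values())
    if total == 0:
        return 0.0
    return sum(count * reference_weight(term) for term, count in tf.items()) / total


def cluster_by_fingerprint(docs, threshold):
    """Group documents whose sorted fingerprints are at most `threshold` apart."""
    ordered = sorted((fingerprint(doc), i) for i, doc in enumerate(docs))
    clusters, current, previous = [], [], None
    for value, i in ordered:
        # A gap larger than the threshold closes the current cluster.
        if current and value - previous > threshold:
            clusters.append(sorted(current))
            current = []
        current.append(i)
        previous = value
    if current:
        clusters.append(sorted(current))
    return clusters


pages = [["MIT", "Boston"], ["Boston", "MIT"], ["Sony", "Tokyo"]]
print([round(fingerprint(p), 4) for p in pages])
print(cluster_by_fingerprint(pages, threshold=0.01))
```

Documents with the same entity distribution always receive the same fingerprint, so they end up in the same cluster; a larger threshold merges more of the value range into one cluster.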

Masaki Ikeda

Feb 16, 2009, 4:23:29 AM

to web-people-...@googlegroups.com, ik...@r.dl.itc.u-tokyo.ac.jp

Dear Javier,

The description of the ITC-UT system is given as follows:

* HTML to text: tool/method to extract text from HTML file

- lxml: We removed useless tags.
- Automatic English Sentence Segmenter: We extracted sentences from each
HTML file and wrote them to a text file (one sentence per line).

* Pre-process: pre-process including text reformat or NLP analysis
- Tokenizer: Tree Tokenizer
- POS Tagger: Tree Tagger

* NE: tool/method for named entity tagger
- Stanford Named Entity Recognizer

* Features: type of features to be used in similarity calculation
- Named entity Features
-- Person
-- Organization
-- Location

- Compound Key Word Features
-- We extracted key words based on an importance score calculated with
the method proposed by Nakagawa et al.
(http://www.r.dl.itc.u-tokyo.ac.jp/~nakagawa/resource/termext/atr-e.html)

- Link Features
-- <a href="URL">
-- page's URL

* Span: span in the text where the features are extracted (e.g. all
document, 20 word window of the name in the text)

- Named Entity: Whole document
- Compound key word: 100 word window around the name in the text

* Similarity: method to calculate similarity
- Each feature type: similarity calculated with the overlap coefficient
- Document similarity: the highest similarity among the feature types

* Feature weighting: measure used to weight the features that represent
each document
- First stage: We removed unnecessary terms based on their document
frequency. Every term has the same weight when calculating the overlap
coefficient.
- Second stage: We used compound key word ranks, which are re-calculated
within the clusters built in the first stage.

* Clustering: clustering algorithm
We used two-stage clustering.
- First-stage clustering: Agglomerative hierarchical clustering with
group-average linkage.
- Second-stage clustering: Clustering based on the result of the
first-stage clustering.
In the second-stage clustering, clusters are merged into one cluster based
on the weights re-calculated within the big clusters.

* Clustering stop criteria: how your system decides the number of
clusters in the output (e.g. similarity threshold, intrinsic measure of
the clustering quality, etc).

Similarity threshold

* Duplication (if contemplated by your system): How do you handle
multiple people referred by the same ambiguous name in the same document.

The second-stage clustering makes it possible to handle multiple people
referred to by the same ambiguous name in the same document.
In the second-stage clustering, a cluster C1 formed in the first stage
picks up other documents that are similar to C1 and merges them into C1.

* Notes: Other aspects of your participation that you think are important

Best regards!
Masaki Ikeda

---
Masaki Ikeda
ik...@r.dl.itc.u-tokyo.ac.jp

Nakagawa Laboratory
TEL: +81-3-5841-2729
Location: The University of Tokyo
General Library 4th floor
7-3-1 Hongo, Bunkyo-ku, Tokyo
113-0033
http://www.r.dl.itc.u-tokyo.ac.jp/
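
The first-stage similarity and clustering described in this message (overlap coefficient per feature type, document similarity as the maximum over feature types, group-average agglomerative clustering stopped at a similarity threshold) can be sketched as below. The feature-type names, the data layout (one set per feature type), and the threshold are assumptions for illustration; the second stage and the document-frequency filtering are not covered.

```python
# Assumption: each document is a dict mapping a feature type to a set of values.
FEATURE_TYPES = ("person", "organization", "location", "keywords", "links")


def overlap_coefficient(a, b):
    """|A ∩ B| / min(|A|, |B|), defined as 0 when either set is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def document_similarity(doc1, doc2):
    """Highest overlap coefficient over all feature types."""
    return max(
        overlap_coefficient(doc1.get(kind, set()), doc2.get(kind, set()))
        for kind in FEATURE_TYPES
    )


def group_average_clustering(docs, threshold):
    """Agglomerative clustering with group-average linkage and a threshold stop."""
    clusters = [[i] for i in range(len(docs))]

    def linkage(c1, c2):
        # Average pairwise similarity between the members of two clusters.
        total = sum(document_similarity(docs[i], docs[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        best_sim, a, b = max(
            (linkage(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        if best_sim < threshold:
            break  # stop criterion: no pair is similar enough
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return sorted(sorted(c) for c in clusters)


pages = [
    {"person": {"alice", "bob"}, "organization": {"mit"}},
    {"person": {"alice"}, "organization": {"stanford"}},
    {"person": {"carol"}, "organization": {"sony"}, "links": {"sony.com"}},
    {"person": {"dave"}, "organization": {"sony"}},
]
print(group_average_clustering(pages, threshold=0.5))
# [[0, 1], [2, 3]]
```

Because the overlap coefficient divides by the smaller set, a page with a single shared entity can already reach similarity 1.0, which is one reason a term filter of the kind described for the first stage matters.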

priya venkateshan

Feb 18, 2009, 10:48:17 PM

to web-people-...@googlegroups.com

Hi, Javier
The details for the system built by team priyaven are as follows:

HTML to text: Beautiful Soup, extracting only the text after ignoring the contents of JavaScript, forms, and other irrelevant data.
Pre-process: conversion to lowercase, Porter stemmer, stopword removal, removal of the words in the search string.
NE: none
Features: email IDs, hyperlinks,
Span: all document.
Similarity: weighted jaccard
Feature weighting: tf/idf
Clustering: fuzzy ant clustering algorithm.
Clustering stop criteria: algorithm takes care of it; we don't need to determine it by other means.
Duplication (if contemplated by your system): algorithm handles it.
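
Weighted Jaccard similarity has more than one definition in the literature; a common one is the sum of element-wise minima divided by the sum of element-wise maxima. The sketch below uses that definition over non-negative tf-idf weights; whether it matches the exact variant used by this system is an assumption, and the example weights are made up.

```python
def weighted_jaccard(u, v):
    """Weighted Jaccard similarity of two sparse, non-negative weight vectors.

    sim(u, v) = sum_k min(u_k, v_k) / sum_k max(u_k, v_k)
    """
    keys = set(u) | set(v)
    numerator = sum(min(u.get(k, 0.0), v.get(k, 0.0)) for k in keys)
    denominator = sum(max(u.get(k, 0.0), v.get(k, 0.0)) for k in keys)
    return numerator / denominator if denominator > 0 else 0.0


# Hypothetical tf-idf weights for email and hyperlink features of two pages.
page_a = {"jsmith@example.org": 2.0, "example.org/lab": 1.0}
page_b = {"jsmith@example.org": 1.0, "example.org/blog": 1.0}
print(weighted_jaccard(page_a, page_b))
# 0.25
```

With all weights equal to 1 this reduces to the ordinary Jaccard coefficient on the sets of features.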
