HTML to text: tool/method to extract text from html file
Pre-process: pre-process including text reformat or NLP analysis
NE: tool/method for named entity tagger
Features: type of features to be used in similarity calculation
Span: span in the text where the features are extracted (e.g. all document, 20 word window of the name in the text)
Similarity: method to calculate similarity
Clustering: clustering algorithm
Clustering stop criteria: how your system decides the number of clusters in the output (e.g. similarity threshold, intrinsic measure of the clustering quality, etc).
Duplication (if contemplated by your system): How do you handle multiple people referred by the same ambiguous name in the same document.
Notes: Other aspects of your participation that you think are important
HTML to text: tool/method to extract text from html file
Beautiful Soup
Pre-process: pre-process including text reformat or NLP analysis
UPenn tokenization, MXTERMINATOR
NE: tool/method for named entity tagger
None
Features: type of features to be used in similarity calculation
1 Token-based features
Queryname tokens: the tokens occurring in sentences that include a mention of the ambiguous personal name;
Full tokens: the tokens occurring in a given webpage;
URL tokens: the tokens occurring in the corresponding URL of a given webpage;
Title tokens in root page (TTRP): Tokens occurring in the title of the root page of a given webpage.
2 N-gram features
Unigram feature: re-learn the weight of each token with the help of the Web 1T 5-gram corpus.
Bigram feature: extract bigrams and learn their weights with the help of the Web 1T 5-gram corpus.
3 Snippet-based features
More information about the focus person of a webpage is collected from a search engine.
Span: span in the text where the features are extracted (e.g. all document, 20 word window of the name in the text)
all document and other web data
Similarity: method to calculate similarity
cosine similarity
Clustering: clustering algorithm
Clustering stop criteria: how your system decides the number of clusters in the output (e.g. similarity threshold, intrinsic measure of the clustering quality, etc).
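As an illustration of the cosine similarity over token and n-gram features listed above, here is a minimal Python sketch. It is not the submitted system: raw unigram and bigram counts are used as weights (an assumption, since the weights learned from the Web 1T 5-gram corpus are not given here), and the names ngram_counts, feature_vector and cosine are hypothetical.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def feature_vector(text):
    """Build a sparse vector of unigram and bigram counts.

    Raw counts are an assumption; the description above re-weights
    features with the Web 1T 5-gram corpus, which is not reproduced here.
    """
    tokens = text.lower().split()
    vec = ngram_counts(tokens, 1)
    vec.update(ngram_counts(tokens, 2))  # Counter.update adds the counts
    return vec


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(w * b[k] for k, w in a.items() if k in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Two pages would be compared with cosine(feature_vector(page_a), feature_vector(page_b)); separate vectors for URL tokens, title tokens or snippets could be built the same way.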
The description of the BUAP system is given as follows:
HTML to text: We evaluated two methods that we implemented ourselves
(one written in Java, the other in AWK). Neither HTML tags nor URLs
were considered; we only took the text into account.
Pre-process: Removal of punctuation symbols and conversion of all words to lowercase.
NE: We used the Stanford Named Entity Recognizer
Features: Entities
Span: All the document
Similarity: Dot product between the vectorial representation of the
document and a set of reference vectors.
Feature weighting: Term frequency
Clustering: Fingerprinting (we assigned a simple value to each
document by using hash-based functions)
Clustering stop criteria: We used a pre-defined threshold to determine
a range of hash-based values that should belong to the same cluster.
Duplication (if contemplated by your system): It was not considered.
Notes: The implementation is considered to be unsupervised, since it
does not require a training dataset.
Best regards!
David Pinto
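The BUAP post gives no details of the hash functions or the reference vectors, so the following Python sketch is only one possible reading of "fingerprint plus threshold", not the BUAP implementation. Assumptions: a single reference vector whose weights are derived from an MD5 hash of each term, a length-normalised dot product with the term-frequency vector as the fingerprint, and a new cluster started whenever the gap between consecutive sorted fingerprints exceeds the threshold. All function names are hypothetical.

```python
import hashlib
from collections import Counter


def term_weight(term):
    """Map a term to a deterministic pseudo-random weight in [0, 1).

    Stand-in for a reference vector; the real reference vectors are
    not described in the post.
    """
    digest = hashlib.md5(term.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") / 2**32


def fingerprint(entities):
    """Dot product of the term-frequency vector with the reference weights.

    Dividing by the document length is an assumption, made so that long
    and short pages yield comparable values.
    """
    tf = Counter(entities)
    total = sum(tf.values())
    if total == 0:
        return 0.0
    return sum(count * term_weight(term) for term, count in tf.items()) / total


def cluster_by_fingerprint(docs, threshold):
    """Sort documents by fingerprint; open a new cluster at each gap > threshold."""
    scored = sorted((fingerprint(ents), doc_id) for doc_id, ents in docs.items())
    clusters = []
    previous = None
    for value, doc_id in scored:
        if previous is None or value - previous > threshold:
            clusters.append([])
        clusters[-1].append(doc_id)
        previous = value
    return clusters
```

Here docs maps a document id to the list of entities found in that document; documents with identical entity lists always receive the same fingerprint and therefore fall into the same cluster.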
The description of the ITC-UT system is given as follows:
* HTML to text: tool/method to extract text from HTML file
- lxml: We removed unneeded tags.
- Automatic English Sentence Segmenter: We extracted sentences from an
HTML file and wrote them to a text file (one sentence per line).
* Pre-process: pre-process including text reformat or NLP analysis
- Tokenizer: Tree Tokenizer
- POS Tagger: Tree Tagger
* NE: tool/method for named entity tagger
- Stanford Named Entity Recognizer
* Features: type of features to be used in similarity calculation
- Named entity Features
-- Person
-- Organization
-- Location
- Compound Key Word Features
-- We extracted key words based on an importance score calculated with
the method proposed by Nakagawa et al.
(http://www.r.dl.itc.u-tokyo.ac.jp/~nakagawa/resource/termext/atr-e.html)
- Link Features
-- <a href="URL">
-- page's URL
* Span: span in the text where the features are extracted (e.g. all
document, 20 word window of the name in the text)
- Named Entity: Whole document
- Compound key word: 100 word window around the name in the text
* Similarity: method to calculate similarity
- Each feature type: similarity is calculated with the overlap coefficient.
- Document similarity: the highest similarity among the feature types.
* Feature weighting: measure used to weight the features that represent
each document
- First-stage: We removed unnecessary terms based on document frequency
of the terms. Every term has the same weight to calculate Overlap
coefficient.
- Second-stage: We used compound key word ranks which are re-calculated
within the clusters built at the first-stage.
* Clustering: clustering algorithm
We used two-stage clustering.
- First-stage clustering: Agglomerative hierarchical clustering with
group-average.
- Second-stage clustering: Clustering with the result of the first-stage
clustering.
In the second-stage clustering, clusters are merged into one cluster based
on the weights re-calculated within the big clusters.
* Clustering stop criteria: how your system decides the number of
clusters in the output (e.g. similarity threshold, intrinsic measure of
the clustering quality, etc).
Similarity threshold
* Duplication (if contemplated by your system): How do you handle
multiple people referred by the same ambiguous name in the same document.
The second-stage clustering makes it possible to handle multiple people
referred to by the same ambiguous name in the same document.
In the second-stage clustering, a cluster C1 formed in the first-stage
clustering picks up other documents that are similar to C1 and
merges them into C1.
* Notes: Other aspects of your participation that you think are important
Best regards!
Masaki Ikeda
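The first-stage clustering of the ITC-UT description (overlap coefficient per feature type, document similarity as the maximum over feature types, group-average agglomerative clustering stopped by a similarity threshold) can be sketched in Python as follows. This is a simplified illustration, not the ITC-UT code: the document-frequency filtering and the second stage are left out, the data layout (a dict of feature sets per document) is an assumption, and all function names are hypothetical.

```python
def overlap(a, b):
    """Overlap coefficient: |A & B| / min(|A|, |B|), or 0.0 if either set is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def doc_similarity(d1, d2):
    """Highest overlap coefficient over the feature types both documents have."""
    shared_types = d1.keys() & d2.keys()
    return max((overlap(d1[t], d2[t]) for t in shared_types), default=0.0)


def group_average(c1, c2, docs):
    """Mean pairwise document similarity between two clusters."""
    total = sum(doc_similarity(docs[i], docs[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def agglomerate(docs, threshold):
    """Merge the most similar pair of clusters until none reaches the threshold."""
    clusters = [[doc_id] for doc_id in docs]
    while len(clusters) > 1:
        best, pair = -1.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = group_average(clusters[i], clusters[j], docs)
                if sim > best:
                    best, pair = sim, (i, j)
        if best < threshold:
            break  # stop criterion: similarity threshold
        i, j = pair
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

For example, with docs = {"p1": {"person": {"A", "B"}}, "p2": {"person": {"A"}}, "p3": {"person": {"C"}}} and a threshold of 0.5, p1 and p2 are merged and p3 stays in its own cluster.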
HTML to text: Beautiful Soup, extracting only the text and ignoring the contents of JavaScript, forms, and other irrelevant data.
Pre-process: conversion to lowercase, Porter stemmer, stopword removal, removal of the words in the search string.
NE: none
Features: email IDs, hyperlinks,
Span: all document.
Similarity: weighted Jaccard
Clustering: fuzzy ant clustering algorithm.
Clustering stop criteria: algorithm takes care of it; we don't need to determine it by other means.
Duplication (if contemplated by your system): algorithm handles it.
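A weighted Jaccard similarity, as named in the description above, is commonly computed as the sum of per-feature minima divided by the sum of per-feature maxima. The Python sketch below shows that form together with part of the listed pre-processing. It is an illustration under assumptions: feature weights are plain counts, the stopword list is a tiny placeholder, Porter stemming is omitted, and the function names are hypothetical.

```python
from collections import Counter

# Tiny illustrative stopword list, not the one used by the system.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is"}


def preprocess(text, query):
    """Lowercase, then drop stopwords and the words of the search string.

    Porter stemming, which the description also lists, is omitted here.
    """
    query_words = set(query.lower().split())
    return [w for w in text.lower().split()
            if w not in STOPWORDS and w not in query_words]


def weighted_jaccard(a, b):
    """Sum of per-feature minima divided by sum of per-feature maxima."""
    keys = a.keys() | b.keys()
    numerator = sum(min(a.get(k, 0), b.get(k, 0)) for k in keys)
    denominator = sum(max(a.get(k, 0), b.get(k, 0)) for k in keys)
    return numerator / denominator if denominator else 0.0
```

The same function applies to any weighted feature dictionary, for instance one whose keys are email IDs and hyperlinks rather than word tokens.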