WePS2-AE clustering description


Javier Artiles

Feb 10, 2009, 5:12:08 PM
to web-people-...@googlegroups.com
Dear WePS participants,

We are currently in the process of writing the WePS2 clustering task description paper. The deadline for your system description papers is the 20th of February, which gives us very little time to read all the papers. In order to make the whole process more efficient, I would like to ask you to submit to us (weps-or...@lsi.uned.es) the following points describing your clustering system(s). You don't have to give a long explanation for each point, just the essentials needed to understand your approach.

Thank you for your collaboration,

All the best,

Javier Artiles.

  • HTML to text: tool/method used to extract text from the HTML files

  • Pre-process: pre-processing, including text reformatting or NLP analysis

  • NE: tool/method used for named entity tagging

  • Features: types of features used in the similarity calculation

  • Span: span of text from which the features are extracted (e.g. the whole document, a 20-word window around the name)

  • Similarity: method used to calculate similarity

  • Feature weighting: measure used to weight the features that represent each document

  • Clustering: clustering algorithm

  • Clustering stop criteria: how your system decides the number of clusters in the output (e.g. similarity threshold, an intrinsic measure of clustering quality, etc.)

  • Duplication (if contemplated by your system): how you handle multiple people referred to by the same ambiguous name in the same document

  • Notes: other aspects of your participation that you think are important




--
------------------------------------------------------------------------------
Javier Artiles Picón
Departamento de Lenguajes y Sistemas Informáticos
ETSI Informática, UNED

Phone: +34 91 398 8106
Fax:     +34 91 398 65 35
Home page:    nlp.uned.es/~javier
LinkedIn page: www.linkedin.com/in/javierartiles
------------------------------------------------------------------------------

Javier Artiles

Feb 10, 2009, 5:16:03 PM
to Web People Search Task
The subject of the previous email might have been confusing; it was
supposed to be "WePS2 clustering description" instead of "WePS2-AE
clustering description". Participants in the AE task have already been
asked to send their system description summaries, so this request is
only directed to the participants in the clustering task.

Javier Artiles.


ying chen

Feb 11, 2009, 3:24:48 AM
to web-people-...@googlegroups.com
Javier Artiles,

The PolyUHK clustering system can be described as follows:

    • HTML to text: Beautiful Soup

    • Pre-process: UPenn tokenization, MXTERMINATOR sentence splitting

    • NE: none

    • Features:

      1. Token-based features
         Query-name tokens: tokens occurring in sentences that mention the ambiguous personal name;
         Full tokens: tokens occurring in a given web page;
         URL tokens: tokens occurring in the URL of a given web page;
         Title tokens in root page (TTRP): tokens occurring in the title of the root page of a given web page.

      2. N-gram features
         Unigram feature: the weight of each token is re-learned with the help of the Web 1T 5-gram corpus.
         Bigram feature: bigrams are extracted and their weights learned with the help of the Web 1T 5-gram corpus.

      3. Snippet-based features
         Additional information about the focus person in a web page is collected from the search engine.

    • Span: the whole document, plus other web data

    • Similarity: cosine similarity

    • Feature weighting: TF-IDF

    • Clustering: agglomerative clustering with single linkage

    • Clustering stop criteria: similarity threshold
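As a rough illustration of the clustering set-up above (TF-IDF weighting, cosine similarity, single-linkage agglomerative clustering stopped at a similarity threshold), here is a minimal sketch using scikit-learn and SciPy; it is not the PolyUHK code, and the threshold value is only illustrative:

    # Sketch only: TF-IDF vectors, cosine similarity, single-linkage
    # agglomerative clustering cut at a similarity threshold.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from scipy.spatial.distance import pdist
    from scipy.cluster.hierarchy import linkage, fcluster

    def cluster_pages(texts, sim_threshold=0.2):
        """texts: one string per web page; returns a cluster label per page."""
        vectors = TfidfVectorizer().fit_transform(texts).toarray()
        distances = pdist(vectors, metric="cosine")   # cosine distance = 1 - cosine similarity
        tree = linkage(distances, method="single")    # single-linkage agglomeration
        # stop merging once the closest remaining pair is below the similarity threshold
        return fcluster(tree, t=1.0 - sim_threshold, criterion="distance")

In the actual system the vectors would of course be built from the token, n-gram and snippet features listed above rather than from plain document TF-IDF alone.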

Els Lefever

Feb 11, 2009, 3:42:53 AM
to web-people-...@googlegroups.com
Hi Javier,

this is the information for the AUG submissions:

* HTML to text: cleaning script we've implemented ourselves
* Pre-process: tokenizing, PoS-tagging and chunking (MBSP: memory-based shallow parser)
* NE: combination of gazetteers and PoS information
* Features: place/date of birth/death, NEs, IP address, geographical location coordinates, weighted keywords, URL/email addresses, telephone and fax numbers
* Span: entire document
* Similarity: cosine
* Feature weighting: gain ratio
* Clustering: two algorithms: 1. fuzzy ants clustering, 2. Agnes (hierarchical clustering)
* Clustering stop: similarity threshold for hierarchical clustering; not needed for fuzzy ants (though the number of runs has to be predefined)
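The gain ratio weighting mentioned above is the C4.5-style measure (information gain divided by split information). Here is a minimal sketch, under the assumption that it is computed against some labelled training signal such as same-person/different-person judgements; the actual AUG set-up may differ:

    # Sketch only: gain ratio of a categorical feature with respect to a label.
    from collections import Counter
    from math import log2

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    def gain_ratio(feature_values, labels):
        """feature_values, labels: parallel lists over training instances."""
        n = len(labels)
        remainder, split_info = 0.0, 0.0
        for value in set(feature_values):
            subset = [l for f, l in zip(feature_values, labels) if f == value]
            p = len(subset) / n
            remainder += p * entropy(subset)
            split_info -= p * log2(p)
        info_gain = entropy(labels) - remainder
        return info_gain / split_info if split_info > 0 else 0.0

The resulting ratio would then serve as the weight of that feature in the cosine similarity.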

Best regards,
Els

Paul Kalmar

Feb 12, 2009, 8:34:28 PM
to jav...@gmail.com, web-people-...@googlegroups.com
Hi Javier,

Here is a quick answer to your questions:

  • HTML to text: convert all boundary tags to <p> </p> and eliminate all other tags.

  • Pre-process: none.

  • NE: in-house system, based on supervised training of hidden Markov models, followed by name-list and rule-based post-processing.

  • Features: named entities, URL tokens, page title tokens, named entity lists, name match, gender, title tokens.

  • Span: document.

  • Similarity: heuristic based on the number of matching and non-matching features.

  • Feature weighting: information.

  • Clustering: greedy agglomeration within a block.

  • Clustering stop criteria: threshold based on twice the information of the most informative name token -- the person with the rarest first and last name should match without any other features.

  • Duplication (if contemplated by your system): within-document disambiguation, followed by cross-document matching of the within-document entities.

  • Notes: the cross-document disambiguation process is completely unsupervised.
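The stop criterion above can be read as self-information: a token that occurs in a fraction p of the documents carries -log2(p) bits, and the merge threshold is set to twice the information of the most informative token in the target name. A minimal sketch under that reading (the probability estimates in the actual system are not specified):

    # Sketch only: self-information of name tokens from document frequencies,
    # and a merge threshold of twice the maximum information.
    from math import log2

    def token_information(token, doc_freq, n_docs):
        """-log2 of the token's (add-one smoothed) document probability."""
        return -log2((doc_freq.get(token, 0) + 1) / (n_docs + 1))

    def merge_threshold(name_tokens, doc_freq, n_docs):
        # two documents sharing only the rarest first and last name should
        # still reach the threshold without any other matching features
        return 2 * max(token_information(t, doc_freq, n_docs) for t in name_tokens)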
Thank you,
Paul Kalmar

Juan Martinez Romo

Feb 15, 2009, 10:36:32 AM
to web-people-...@googlegroups.com
Dear Javi,

The features of the UNED clustering system are the following:

• HTML to text: HttpUnit and HTMLParser
• Pre-process: documents indexed with Lucene, with Porter stemming
• NE: none
• Features: relevant terms extracted with language model techniques
• Span: document and title
• Similarity: language models and Lucene similarity (cosine)
• Feature weighting: Kullback-Leibler divergence
• Clustering: heuristic based on similarity
• Clustering stop criteria: heuristic based on the votes obtained from the similarity between documents and occupations
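As an illustration of Kullback-Leibler term weighting as it is commonly used for relevant-term extraction with language models (the exact UNED formulation is not spelled out above), each term can be scored by how much its probability in the document diverges from its probability in the collection:

    # Sketch only: per-term KL-divergence contribution of the document
    # language model against the collection language model.
    from collections import Counter
    from math import log2

    def kl_term_weights(doc_tokens, collection_tokens):
        """Return {term: p_doc * log2(p_doc / p_coll)} for the document's terms."""
        doc_counts, coll_counts = Counter(doc_tokens), Counter(collection_tokens)
        n_doc, n_coll = len(doc_tokens), len(collection_tokens)
        weights = {}
        for term, count in doc_counts.items():
            p_doc = count / n_doc
            p_coll = (coll_counts[term] + 1) / (n_coll + len(coll_counts))  # smoothed
            weights[term] = p_doc * log2(p_doc / p_coll)
        return weights

The highest-weighted terms would then be the "relevant terms" used as features.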


Cheers,

Juan.
--
---------------------------------------------------------
Juan Martínez Romo
Dpto. Lenguajes y Sistemas Informáticos
E.T.S.I. Informática (UNED)
C/ Juan del Rosal nº 16 - Despacho 2.03
Ciudad Universitaria - 28040 Madrid - España
Tef. +34 91 398 93 78
---------------------------------------------------------

David Eduardo Pinto Avendaño

Feb 15, 2009, 11:45:14 PM
to web-people-...@googlegroups.com, weps-or...@lsi.uned.es
Dear Javier,

The description of the BUAP system is as follows:

HTML to text: Two methods that we implemented were evaluated (one
programmed in Java, the other in AWK). No HTML tags or URLs were
considered; we only took the text into account.

Pre-process: Elimination of punctuation symbols and conversion of all
words to lowercase.

NE: We used the Stanford Named Entity Recognizer.

Features: Entities.

Span: The whole document.

Similarity: Dot product between the vector representation of the
document and a set of reference vectors.

Feature weighting: Term frequency.

Clustering: Fingerprinting (we assigned a simple value to each
document by using hash-based functions).

Clustering stop criteria: We used a pre-defined threshold to determine
a range of hash-based values that should belong to the same cluster.

Duplication (if contemplated by your system): Not considered.

Notes: The implementation is considered unsupervised, since it does
not require a training dataset.
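A minimal sketch of the fingerprinting idea described above, assuming the per-document value is derived from token hashes and that documents whose values fall within a pre-defined range share a cluster (the BUAP hash functions themselves are not specified):

    # Sketch only: map each document to one hash-derived value in [0, 1),
    # then group documents whose values lie within a fixed range.
    import hashlib

    def fingerprint(tokens):
        """Average of per-token hash values, normalised to [0, 1)."""
        values = [int(hashlib.md5(t.encode("utf-8")).hexdigest(), 16) % 10**6
                  for t in tokens]
        return sum(values) / (len(values) * 10**6) if values else 0.0

    def cluster_by_fingerprint(docs_tokens, threshold=0.01):
        """docs_tokens: one token list per document; nearby fingerprints share a cluster."""
        fps = [fingerprint(tokens) for tokens in docs_tokens]
        order = sorted(range(len(fps)), key=lambda i: fps[i])
        labels, current = [0] * len(fps), 0
        for prev, cur in zip(order, order[1:]):
            if fps[cur] - fps[prev] > threshold:
                current += 1
            labels[cur] = current
        return labels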


Best regards!

David Pinto

Masaki Ikeda

Feb 16, 2009, 4:23:29 AM
to web-people-...@googlegroups.com, ik...@r.dl.itc.u-tokyo.ac.jp
Dear Javier,

The description of the ITC-UT system is as follows:

* HTML to text:
- lxml: we removed useless tags.
- Automatic English Sentence Segmenter: we extracted sentences from the
HTML file and converted it into a text file (one sentence per line).

* Pre-process:
- Tokenizer: Tree Tokenizer
- POS Tagger: TreeTagger

* NE:
- Stanford Named Entity Recognizer

* Features:
- Named entity features
-- Person
-- Organization
-- Location

- Compound keyword features
-- We extracted keywords based on an importance score calculated with
the method proposed by Nakagawa et al.
(http://www.r.dl.itc.u-tokyo.ac.jp/~nakagawa/resource/termext/atr-e.html)

- Link features
-- <a href="URL">
-- the page's URL

* Span:
- Named entities: whole document
- Compound keywords: 100-word window around the name in the text

* Similarity:
- Each feature type: overlap coefficient
- Document similarity: the highest similarity across the feature types
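A minimal sketch of the similarity just described: the overlap coefficient of two feature sets, with the document similarity taken as the maximum over the feature types (illustrative code, not the ITC-UT implementation):

    # Sketch only: overlap coefficient per feature type; document similarity
    # is the maximum over the shared feature types.
    def overlap_coefficient(a, b):
        """|A intersection B| / min(|A|, |B|) for two feature sets."""
        a, b = set(a), set(b)
        if not a or not b:
            return 0.0
        return len(a & b) / min(len(a), len(b))

    def document_similarity(doc1, doc2):
        """doc1, doc2: dicts mapping a feature type (e.g. 'person', 'organization',
        'compound_keyword', 'link') to the set of values extracted for it."""
        shared_types = set(doc1) & set(doc2)
        return max((overlap_coefficient(doc1[t], doc2[t]) for t in shared_types),
                   default=0.0)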

* Feature weighting:
- First stage: we removed unnecessary terms based on their document
frequency; every remaining term has the same weight in the overlap
coefficient.
- Second stage: we used compound keyword ranks that are re-calculated
within the clusters built in the first stage.

* Clustering:
We used two-stage clustering.
- First-stage clustering: agglomerative hierarchical clustering with
group-average linkage.
- Second-stage clustering: clustering on top of the result of the
first-stage clustering. In the second stage, clusters are merged into a
single cluster based on the weights re-calculated within the big clusters.

* Clustering stop criteria:
Similarity threshold.

* Duplication (if contemplated by your system):
The second-stage clustering makes it possible to handle multiple people
referred to by the same ambiguous name in the same document: a cluster C1
formed in the first-stage clustering picks up other documents that are
similar to C1 and merges them into C1.

* Notes:

Best regards!
Masaki Ikeda

---
Masaki Ikeda
ik...@r.dl.itc.u-tokyo.ac.jp

Nakagawa Laboratory
TEL: +81-3-5841-2729
Location: The University of Tokyo
General Library 4th floor
7-3-1 Hongo, Bunkyo-ku, Tokyo
113-0033
http://www.r.dl.itc.u-tokyo.ac.jp/

priya venkateshan

Feb 18, 2009, 10:48:17 PM
to web-people-...@googlegroups.com
Hi Javier,

The details of the system built by team priyaven are as follows:

• HTML to text: Beautiful Soup; extraction of text only, ignoring the contents of JavaScript, forms, and other irrelevant data.

• Pre-process: conversion to lowercase, Porter stemmer, stopword removal, removal of the words in the search string.

• NE: none

• Features: email IDs, hyperlinks,

• Span: whole document.

• Similarity: weighted Jaccard

• Feature weighting: TF-IDF

• Clustering: fuzzy ant clustering algorithm.

• Clustering stop criteria: the algorithm takes care of it; we don't need to determine it by other means.

• Duplication (if contemplated by your system): the algorithm handles it.
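A minimal sketch of one common reading of the combination above -- weighted Jaccard similarity over TF-IDF-weighted features; the priyaven system may define it differently:

    # Sketch only: weighted Jaccard similarity between two documents, each
    # represented as a {feature: tf-idf weight} dictionary.
    def weighted_jaccard(weights_a, weights_b):
        features = set(weights_a) | set(weights_b)
        if not features:
            return 0.0
        numerator = sum(min(weights_a.get(f, 0.0), weights_b.get(f, 0.0)) for f in features)
        denominator = sum(max(weights_a.get(f, 0.0), weights_b.get(f, 0.0)) for f in features)
        return numerator / denominator if denominator else 0.0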
