Alpino corpus is broken


Alexis Dimitriadis

Mar 27, 2015, 11:01:46 AM
to <nltk-users@googlegroups.com>
The Alpino corpus contains parsed sentences in Dutch. But there seems to be something wrong with the way the corpus has been serialized to XML, or with the assumptions made by the corpus reader. The problem is already evident in the second and third sentences. For example:

>>> print(" ".join(alpino.sents()[6]))
, , zou ik moeten Eigenlijk doen niks anders dan de hele dag vastleggen vastleggen vastleggen .

The first five tokens have somehow been lifted to the front from various places in the sentence. The file alpino.xml provides the correct text in a separate element:

<sentence>Eigenlijk zou ik niks anders moeten doen dan de hele dag vastleggen , vastleggen , vastleggen .</sentence>

Examination of alpino.xml suggests that the XML nodes are sorted numerically by tree node ID (the attribute "id"), rather than by span position (the attribute "begin"); clearly the reader's assumption that document order reflects reading order is unwarranted. But was the XML generated incorrectly, or is it the canonical Alpino dump, which the reader must simply handle better?

<alpino_ds version="1.2" id="0007">
  <node begin="0" cat="top" end="17" id="0" rel="top">
    <node begin="12" end="13" id="1" pos="punct" rel="--" root="," word=","/>
    <node begin="14" end="15" id="2" pos="punct" rel="--" root="," word=","/>
    <node begin="0" cat="smain" end="16" id="3" rel="--">
      <node begin="1" end="2" id="4" pos="verb" rel="hd" root="zal" word="zou"/>
      <node begin="2" end="3" id="5" index="1" pos="noun" rel="su" root="ik" word="ik"/>
      <node begin="0" cat="inf" end="16" id="6" rel="vc">
        <node begin="2" end="3" id="7" index="1" rel="su"/>
        <node begin="5" end="6" id="8" pos="verb" rel="hd" root="moet" word="moeten"/>
        <node begin="0" cat="inf" end="16" id="9" rel="vc">
          <node begin="0" end="1" id="10" pos="adv" rel="mod" root="eigenlijk" word="Eigenlijk"/>
          <node begin="2" end="3" id="11" index="1" rel="su"/>
          <node begin="6" end="7" id="12" pos="verb" rel="hd" root="doe" word="doen"/>
  
I don't actually need the parses so I haven't checked the trees; but this particular tree is drawn really flat, suggesting that the hierarchical structure is mishandled as well.
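
For the word order alone, one possible workaround is to ignore document order and sort the word-bearing nodes on their "begin" attribute. The following is a minimal sketch, not the NLTK reader's code: it uses only the standard library's xml.etree.ElementTree, the helper name words_in_order is made up for illustration, and SAMPLE is an abbreviated copy of the excerpt above that has been closed by hand so that it parses.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the excerpt above, closed by hand so that it parses.
# (Assumption: the real file nests further nodes here; they are omitted.)
SAMPLE = """\
<alpino_ds version="1.2" id="0007">
  <node begin="0" cat="top" end="17" id="0" rel="top">
    <node begin="12" end="13" id="1" pos="punct" rel="--" root="," word=","/>
    <node begin="0" cat="smain" end="16" id="3" rel="--">
      <node begin="1" end="2" id="4" pos="verb" rel="hd" root="zal" word="zou"/>
      <node begin="2" end="3" id="5" index="1" pos="noun" rel="su" root="ik" word="ik"/>
      <node begin="0" cat="inf" end="16" id="6" rel="vc">
        <node begin="2" end="3" id="7" index="1" rel="su"/>
        <node begin="0" end="1" id="10" pos="adv" rel="mod" root="eigenlijk" word="Eigenlijk"/>
      </node>
    </node>
  </node>
</alpino_ds>
"""


def words_in_order(alpino_ds):
    """Return the words under one <alpino_ds> element, sorted by span start.

    Hypothetical helper. Nodes without a "word" attribute (phrasal nodes and
    co-indexed placeholders such as id="7") are skipped.
    """
    leaves = [n for n in alpino_ds.iter("node") if n.get("word") is not None]
    leaves.sort(key=lambda n: int(n.get("begin")))
    return [n.get("word") for n in leaves]


sentence = words_in_order(ET.fromstring(SAMPLE))
print(" ".join(sentence))
```

On the abbreviated sample this puts "Eigenlijk" back in front of "zou ik", with the comma following at its own position. Applying the same function to each alpino_ds element of the real file should, if the "begin" values are reliable throughout, reproduce the text given in the sentence element; I have not verified that for the whole corpus.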

Is this a known problem? And has anyone successfully dealt with this before? 

Alexis


Dr. Alexis Dimitriadis | Assistant Professor and Senior Research Fellow | Utrecht Institute of Linguistics OTS | Utrecht University | Trans 10, 3512 JK Utrecht, room 2.33 | +31 30 253 65 68 | a.dimi...@uu.nl | www.hum.uu.nl/medewerkers/a.dimitriadis
