Question xpath


Rik

Jan 24, 2017, 12:30:57 PM
to CorpLing with R
Hi all,

I'm trying to use an XPath expression to extract all words marked with a <w> tag from a series of XML documents from the Dutch SONAR corpus (https://portal.clarin.nl/node/4195). So far I have been loading a file such as the one below with xmlInternalTreeParse() and then applying xpathSApply(file, "//w", xmlValue). This, however, returns an empty list, and I honestly don't have a clue why. It's probably something very obvious, but as I haven't really worked with XPath before, any pointers would be welcome!

(When I use xpathSApply(file,'//*[@pos]', xmlValue), it does work, but that seems more like a workaround.)

Thanks,

Rik

=== code ===

library(XML)
file <- xmlInternalTreeParse("example-file-dcoi.xml")
xpathSApply(file, "//w", xmlValue) # returns empty list
xpathSApply(file,'//*[@pos]', xmlValue) # does work

=== sample file ===

<?xml version="1.0" encoding="iso-8859-15"?>
  <DCOI xmlns="http://lands.let.ru.nl/projects/d-coi/ns/1.0">
  <text xml:id="WR-P-E-G-0000000008.text">
      <body>
        <div1 xml:id="WR-P-E-G-0000000008.div1.1">
          <head xml:id="WR-P-E-G-0000000008.head.1">
          <s xml:id="WR-P-E-G-0000000008.head.1.s.1">
          <w xml:id="WR-P-E-G-0000000008.head.1.s.1.w.1" pos="SPEC(vreemd)" lemma="subtitles">Subtitles</w>
          <w xml:id="WR-P-E-G-0000000008.head.1.s.1.w.2" pos="SPEC(vreemd)" lemma="for">for</w>
          <w xml:id="WR-P-E-G-0000000008.head.1.s.1.w.3" pos="SPEC(vreemd)" lemma="testuitzending">testuitzending</w>
  </s>
  </head>
   <p xml:id="WR-P-E-G-0000000008.p.1">
   <s xml:id="WR-P-E-G-0000000008.p.1.s.1">
   <w xml:id="WR-P-E-G-0000000008.p.1.s.1.w.1" pos="VNW(aanw,pron,stan,vol,3o,ev)" lemma="dit">Dit</w>
   <w xml:id="WR-P-E-G-0000000008.p.1.s.1.w.2" pos="WW(pv,tgw,mv)" lemma="zijn">zijn</w>
   <w xml:id="WR-P-E-G-0000000008.p.1.s.1.w.3" pos="VNW(onbep,det,stan,prenom,met-e,rest)" lemma="enkele">enkele</w>
   <w xml:id="WR-P-E-G-0000000008.p.1.s.1.w.4" pos="N(soort,mv,basis)" lemma="testtitel">testtitels</w>
   <w xml:id="WR-P-E-G-0000000008.p.1.s.1.w.5" pos="LET()" lemma=".">.</w>
  </s>
  </p>
   <p xml:id="WR-P-E-G-0000000008.p.2">
   <s xml:id="WR-P-E-G-0000000008.p.2.s.1">
   <w xml:id="WR-P-E-G-0000000008.p.2.s.1.w.1" pos="VG(neven)" lemma="en">En</w>
   <w xml:id="WR-P-E-G-0000000008.p.2.s.1.w.2" pos="BW()" lemma="nog">nog</w>
   <w xml:id="WR-P-E-G-0000000008.p.2.s.1.w.3" pos="LID(onbep,stan,agr)" lemma="een">een</w>
   <w xml:id="WR-P-E-G-0000000008.p.2.s.1.w.4" pos="N(soort,ev,basis,onz,stan)" lemma="paar">paar</w>
   <w xml:id="WR-P-E-G-0000000008.p.2.s.1.w.5" pos="LET()" lemma=".">.</w>
  </s>
  </p>
   <p xml:id="WR-P-E-G-0000000008.p.3">
   <s xml:id="WR-P-E-G-0000000008.p.3.s.1">
   <w xml:id="WR-P-E-G-0000000008.p.3.s.1.w.1" pos="VNW(aanw,pron,stan,vol,3o,ev)" lemma="dat">Dat</w>
   <w xml:id="WR-P-E-G-0000000008.p.3.s.1.w.2" pos="WW(pv,verl,ev)" lemma="zijn">was</w>
   <w xml:id="WR-P-E-G-0000000008.p.3.s.1.w.3" pos="VNW(pers,pron,stan,red,3,ev,onz)" lemma="het">het</w>
   <w xml:id="WR-P-E-G-0000000008.p.3.s.1.w.4" pos="LET()" lemma=".">.</w>
  </s>
  </p>
        </div1>
      </body>
    </text>
  </DCOI>

example-file-dcoi.xml

Earl Brown

Jan 24, 2017, 11:29:42 PM
to CorpLing with R
The default namespace (the one declared without a prefix, here xmlns="http://lands.let.ru.nl/projects/d-coi/ns/1.0") seems to be causing the problem. In XPath, an unprefixed name such as w only matches elements that are in no namespace, so //w finds nothing when every element sits in the default namespace. I downloaded the file you attached and got it working with the xml2 package. In that package (I don't know what the XML package does), the first default namespace is referred to internally as "d1", the second default namespace as "d2", etc. So, this works for me:
library(xml2)
library(magrittr)
current_file <- read_xml("/Users/earlbrown/Desktop/example-file-dcoi.xml")
xml_find_all(current_file, "//w")  # returns an empty nodeset, like you got
xml_find_all(current_file, "//d1:w")  # returns a nodeset of 17 <w> nodes
xml_find_all(current_file, "//d1:w") %>% xml_text()  # returns 17 words
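
For the XML package used in the question, something similar should be possible by passing a prefix-to-URI mapping through the namespaces argument of xpathSApply(). The following is only a sketch: the inline sample_xml string is a cut-down stand-in for the attached file, the <DCOI> root is assumed from the xmlns value quoted above, and the prefix "d" is an arbitrary name picked for illustration. The last query uses local-name() as a way to ignore namespaces altogether.

```r
library(XML)

# Cut-down stand-in for the corpus file (assumption: the real file declares
# the same default namespace on its root element).
sample_xml <- '<DCOI xmlns="http://lands.let.ru.nl/projects/d-coi/ns/1.0">
  <text><body><s>
    <w pos="SPEC(vreemd)" lemma="subtitles">Subtitles</w>
    <w pos="SPEC(vreemd)" lemma="for">for</w>
    <w pos="SPEC(vreemd)" lemma="testuitzending">testuitzending</w>
  </s></body></text>
</DCOI>'
doc <- xmlParse(sample_xml, asText = TRUE)

# "d" is an arbitrary prefix chosen here; only the URI has to match the file.
ns <- c(d = "http://lands.let.ru.nl/projects/d-coi/ns/1.0")

# Unprefixed query: matches nothing, as described in the question.
no_prefix <- xpathSApply(doc, "//w", xmlValue)

# Prefixed query, with the prefix bound to the default namespace URI.
with_prefix <- xpathSApply(doc, "//d:w", xmlValue, namespaces = ns)

# Namespace-agnostic query: match on the local element name only.
by_local <- xpathSApply(doc, "//*[local-name() = 'w']", xmlValue)
```

The prefixed form is usually the more precise choice, since local-name() would also match <w> elements from any other namespace that happens to occur in a document.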

Cheers.