Question xpath

Jan 24, 2017, 12:30:57 PM
to CorpLing with R
Hi all,

I'm trying to use an XPath expression to extract all words marked with a word (<w>) tag from a series of XML documents from the Dutch SONAR corpus. What I've been doing so far is loading a file such as the one below using xmlInternalTreeParse(), and then applying xpathSApply(file, "//w", xmlValue). This, however, returns an empty list -- and I honestly don't have a clue why. It's probably something very obvious, but as I haven't really worked with XPath before, any pointers would be welcome!

(When I use xpathSApply(file,'//*[@pos]', xmlValue), it does work, but that seems more like a workaround.)



=== code ===

library(XML)
file <- xmlInternalTreeParse("example-file-dcoi.xml")
xpathSApply(file, "//w", xmlValue)        # returns empty list
xpathSApply(file, "//*[@pos]", xmlValue)  # does work
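
A possible explanation, sketched with the XML package: if the root element of the real file declares a default namespace, an unprefixed "//w" will not match, because XPath 1.0 treats unprefixed names as belonging to no namespace. The inline document and the namespace URI below are invented for illustration and are not taken from the SONAR data; the prefix "d" is arbitrary.

```r
library(XML)

# Hypothetical stand-in for a corpus file. The namespace URI is made up
# for this sketch; substitute whatever the real root element declares.
sample_xml <- '<?xml version="1.0"?>
<text xmlns="http://example.org/hypothetical-ns">
  <s>
    <w pos="VG(neven)" lemma="en">En</w>
    <w pos="BW()" lemma="nog">nog</w>
  </s>
</text>'

doc <- xmlInternalTreeParse(sample_xml, asText = TRUE)

# Unprefixed element names only match elements in no namespace,
# so this finds nothing in a document with a default namespace.
no_prefix <- xpathSApply(doc, "//w", xmlValue)

# Bind an arbitrary prefix ("d") to the default namespace URI and use it.
ns <- c(d = "http://example.org/hypothetical-ns")
with_prefix <- xpathSApply(doc, "//d:w", xmlValue, namespaces = ns)

# Namespace-agnostic alternative: match on the local element name only.
by_local_name <- xpathSApply(doc, "//*[local-name() = 'w']", xmlValue)
```

This would also account for '//*[@pos]' working: the pos attribute carries no prefix and so sits in no namespace, whatever the element's namespace is.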

=== sample file ===

<?xml version="1.0" encoding="iso-8859-15"?>
  <text xml:id="WR-P-E-G-0000000008.text">
        <div1 xml:id="WR-P-E-G-0000000008.div1.1">
          <head xml:id="WR-P-E-G-0000000008.head.1">
          <s xml:id="WR-P-E-G-0000000008.head.1.s.1">
          <w xml:id="WR-P-E-G-0000000008.head.1.s.1.w.1" pos="SPEC(vreemd)" lemma="subtitles">Subtitles</w>
          <w xml:id="WR-P-E-G-0000000008.head.1.s.1.w.2" pos="SPEC(vreemd)" lemma="for">for</w>
          <w xml:id="WR-P-E-G-0000000008.head.1.s.1.w.3" pos="SPEC(vreemd)" lemma="testuitzending">testuitzending</w>
   <p xml:id="WR-P-E-G-0000000008.p.1">
   <s xml:id="WR-P-E-G-0000000008.p.1.s.1">
   <w xml:id="WR-P-E-G-0000000008.p.1.s.1.w.1" pos="VNW(aanw,pron,stan,vol,3o,ev)" lemma="dit">Dit</w>
   <w xml:id="WR-P-E-G-0000000008.p.1.s.1.w.2" pos="WW(pv,tgw,mv)" lemma="zijn">zijn</w>
   <w xml:id="WR-P-E-G-0000000008.p.1.s.1.w.3" pos="VNW(onbep,det,stan,prenom,met-e,rest)" lemma="enkele">enkele</w>
   <w xml:id="WR-P-E-G-0000000008.p.1.s.1.w.4" pos="N(soort,mv,basis)" lemma="testtitel">testtitels</w>
   <w xml:id="WR-P-E-G-0000000008.p.1.s.1.w.5" pos="LET()" lemma=".">.</w>
   <p xml:id="WR-P-E-G-0000000008.p.2">
   <s xml:id="WR-P-E-G-0000000008.p.2.s.1">
   <w xml:id="WR-P-E-G-0000000008.p.2.s.1.w.1" pos="VG(neven)" lemma="en">En</w>
   <w xml:id="WR-P-E-G-0000000008.p.2.s.1.w.2" pos="BW()" lemma="nog">nog</w>
   <w xml:id="WR-P-E-G-0000000008.p.2.s.1.w.3" pos="LID(onbep,stan,agr)" lemma="een">een</w>
   <w xml:id="WR-P-E-G-0000000008.p.2.s.1.w.4" pos="N(soort,ev,basis,onz,stan)" lemma="paar">paar</w>
   <w xml:id="WR-P-E-G-0000000008.p.2.s.1.w.5" pos="LET()" lemma=".">.</w>
   <p xml:id="WR-P-E-G-0000000008.p.3">
   <s xml:id="WR-P-E-G-0000000008.p.3.s.1">
   <w xml:id="WR-P-E-G-0000000008.p.3.s.1.w.1" pos="VNW(aanw,pron,stan,vol,3o,ev)" lemma="dat">Dat</w>
   <w xml:id="WR-P-E-G-0000000008.p.3.s.1.w.2" pos="WW(pv,verl,ev)" lemma="zijn">was</w>
   <w xml:id="WR-P-E-G-0000000008.p.3.s.1.w.3" pos="VNW(pers,pron,stan,red,3,ev,onz)" lemma="het">het</w>
   <w xml:id="WR-P-E-G-0000000008.p.3.s.1.w.4" pos="LET()" lemma=".">.</w>


Earl Brown

Jan 24, 2017, 11:29:42 PM
to CorpLing with R
The default namespace (the one declared without a prefix, that is: with a plain xmlns="..." attribute) seems to be causing the problem. I downloaded the file you attached and got it working with the xml2 package. In that package (I don't know what the XML package does), the first default namespace is referred to internally as "d1", the second default namespace as "d2", etc. So, this works for me:
library(xml2)
library(magrittr)  # provides the %>% pipe used below
current_file <- read_xml("/Users/earlbrown/Desktop/example-file-dcoi.xml")
xml_find_all(current_file, "//w")  # returns an empty nodeset, like you got
xml_find_all(current_file, "//d1:w")  # returns 17 nodes
xml_find_all(current_file, "//d1:w") %>% xml_text()  # returns 17 words
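
To check which prefixes xml2 has assigned, or to avoid prefixes altogether, something like the following sketch may help. It uses a small inline document with an invented namespace URI rather than the actual corpus file, so the names and URI here are assumptions for illustration only.

```r
library(xml2)

# Hypothetical miniature of the corpus file; the namespace URI is made up.
sample_xml <- '<text xmlns="http://example.org/hypothetical-ns">
  <s><w pos="VG(neven)" lemma="en">En</w><w pos="BW()" lemma="nog">nog</w></s>
</text>'

doc <- read_xml(sample_xml)

# Show the prefix-to-URI mapping xml2 derived from the document.
# A default namespace is listed under a generated prefix such as "d1".
ns <- xml_ns(doc)
print(ns)

# Query using the generated prefix.
words_prefixed <- xml_text(xml_find_all(doc, "//d1:w"))

# Alternative: strip default namespaces so the plain "//w" works.
# xml_ns_strip() modifies the document in place, so use a fresh copy.
doc_stripped <- read_xml(sample_xml)
xml_ns_strip(doc_stripped)
words_stripped <- xml_text(xml_find_all(doc_stripped, "//w"))
```

Stripping is convenient when looping over many files, but it discards namespace information, so the prefixed query is the safer choice if the documents mix several namespaces.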
