Hi all,
I'm trying to use an XPath expression to extract all words marked with a <w> tag from a series of XML documents from the Dutch SONAR corpus (https://portal.clarin.nl/node/4195). So far I have been loading a file such as the one below with xmlInternalTreeParse() from the XML package and then calling xpathSApply(file, "//w", xmlValue). This, however, returns an empty list, and I honestly don't have a clue why. It's probably something very obvious, but as I haven't really worked with XPath before, any pointers would be welcome!
(When I use xpathSApply(file,'//*[@pos]', xmlValue), it does work, but that seems more like a workaround.)
Thanks,
Rik
=== code ===
library(XML)  # provides xmlInternalTreeParse() and xpathSApply()
file <- xmlInternalTreeParse("example-file-dcoi.xml")
xpathSApply(file, "//w", xmlValue)        # returns an empty list
xpathSApply(file, "//*[@pos]", xmlValue)  # does work
=== sample file ===
<?xml version="1.0" encoding="iso-8859-15"?>
<text xml:id="WR-P-E-G-0000000008.text">
<body>
<div1 xml:id="WR-P-E-G-0000000008.div1.1">
<head xml:id="WR-P-E-G-0000000008.head.1">
<s xml:id="WR-P-E-G-0000000008.head.1.s.1">
<w xml:id="WR-P-E-G-0000000008.head.1.s.1.w.1" pos="SPEC(vreemd)" lemma="subtitles">Subtitles</w>
<w xml:id="WR-P-E-G-0000000008.head.1.s.1.w.2" pos="SPEC(vreemd)" lemma="for">for</w>
<w xml:id="WR-P-E-G-0000000008.head.1.s.1.w.3" pos="SPEC(vreemd)" lemma="testuitzending">testuitzending</w>
</s>
</head>
<p xml:id="WR-P-E-G-0000000008.p.1">
<s xml:id="WR-P-E-G-0000000008.p.1.s.1">
<w xml:id="WR-P-E-G-0000000008.p.1.s.1.w.1" pos="VNW(aanw,pron,stan,vol,3o,ev)" lemma="dit">Dit</w>
<w xml:id="WR-P-E-G-0000000008.p.1.s.1.w.2" pos="WW(pv,tgw,mv)" lemma="zijn">zijn</w>
<w xml:id="WR-P-E-G-0000000008.p.1.s.1.w.3" pos="VNW(onbep,det,stan,prenom,met-e,rest)" lemma="enkele">enkele</w>
<w xml:id="WR-P-E-G-0000000008.p.1.s.1.w.4" pos="N(soort,mv,basis)" lemma="testtitel">testtitels</w>
<w xml:id="WR-P-E-G-0000000008.p.1.s.1.w.5" pos="LET()" lemma=".">.</w>
</s>
</p>
<p xml:id="WR-P-E-G-0000000008.p.2">
<s xml:id="WR-P-E-G-0000000008.p.2.s.1">
<w xml:id="WR-P-E-G-0000000008.p.2.s.1.w.1" pos="VG(neven)" lemma="en">En</w>
<w xml:id="WR-P-E-G-0000000008.p.2.s.1.w.2" pos="BW()" lemma="nog">nog</w>
<w xml:id="WR-P-E-G-0000000008.p.2.s.1.w.3" pos="LID(onbep,stan,agr)" lemma="een">een</w>
<w xml:id="WR-P-E-G-0000000008.p.2.s.1.w.4" pos="N(soort,ev,basis,onz,stan)" lemma="paar">paar</w>
<w xml:id="WR-P-E-G-0000000008.p.2.s.1.w.5" pos="LET()" lemma=".">.</w>
</s>
</p>
<p xml:id="WR-P-E-G-0000000008.p.3">
<s xml:id="WR-P-E-G-0000000008.p.3.s.1">
<w xml:id="WR-P-E-G-0000000008.p.3.s.1.w.1" pos="VNW(aanw,pron,stan,vol,3o,ev)" lemma="dat">Dat</w>
<w xml:id="WR-P-E-G-0000000008.p.3.s.1.w.2" pos="WW(pv,verl,ev)" lemma="zijn">was</w>
<w xml:id="WR-P-E-G-0000000008.p.3.s.1.w.3" pos="VNW(pers,pron,stan,red,3,ev,onz)" lemma="het">het</w>
<w xml:id="WR-P-E-G-0000000008.p.3.s.1.w.4" pos="LET()" lemma=".">.</w>
</s>
</p>
</div1>
</body>
</text>
</DCOI>
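=== possible explanation (sketch) ===
The sample ends with a closing </DCOI> tag whose opening tag is not shown, so the real file may have a root element that declares a default namespace. If it does, that would explain the symptom: in XPath 1.0 an unprefixed name such as w only matches elements in no namespace, while //*[@pos] matches on an attribute and is unaffected. The snippet below is only a sketch under that assumption. The miniature document and the namespace URI "http://example.org/dcoi" are made up for illustration, so the URI from the actual file would have to be substituted.

```r
library(XML)

# Hypothetical miniature of a corpus file. The namespace URI on the root
# element is an assumption for illustration, not taken from the real corpus.
xml_text <- '<?xml version="1.0"?>
<DCOI xmlns="http://example.org/dcoi">
<text><body><p><s>
<w pos="VG(neven)" lemma="en">En</w>
<w pos="BW()" lemma="nog">nog</w>
</s></p></body></text>
</DCOI>'

doc <- xmlInternalTreeParse(xml_text, asText = TRUE)

# Unprefixed "w" refers to elements in no namespace, so the namespaced
# <w> elements are not matched.
no_ns <- xpathSApply(doc, "//w", xmlValue)

# Option 1: bind an arbitrary prefix ("d") to the namespace URI and use it
# in the path.
ns <- c(d = "http://example.org/dcoi")
with_ns <- xpathSApply(doc, "//d:w", xmlValue, namespaces = ns)

# Option 2: ignore namespaces and match on the local element name only.
by_local <- xpathSApply(doc, "//*[local-name()='w']", xmlValue)
```

Calling xmlNamespaceDefinitions(file) on the parsed corpus file should show whether a default namespace is actually declared and, if so, which URI to use in place of the made-up one.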