Parsing MASC files


Victor Yan

Mar 13, 2014, 8:26:57 AM
to poio-d...@googlegroups.com
Hello:)
I am new to graf-python and have found it a very useful tool. I am trying to parse some MASC XML files from http://www.anc.org/MASC/Download.html but have encountered some problems.

For example, when parsing the attached xml:
graf.GraphParser().parse("110CYL068-vc.xml")

I get the following errors:

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/site-packages/graf_python-0.3.0-py3.3.egg/graf/io.py", line 622, in startElement
    fn = self._start_handlers[name]
KeyError: 'header'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/hbyan2/Dropbox/Projects/newtagger/graf_reader.py", line 4, in <module>
    gp.parse("/Users/hbyan2/Downloads/Mini-MASC 2/data/written/110CYL068-vc.xml")
  File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/site-packages/graf_python-0.3.0-py3.3.egg/graf/io.py", line 912, in parse
    do_parse(stream, graph)
  File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/site-packages/graf_python-0.3.0-py3.3.egg/graf/io.py", line 853, in do_parse
    parser.parse(filename)
  File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/xml/sax/expatreader.py", line 107, in parse
    xmlreader.IncrementalParser.parse(self, source)
  File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/xml/sax/xmlreader.py", line 123, in parse
    self.feed(buffer)
  File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/xml/sax/expatreader.py", line 207, in feed
    self._parser.Parse(data, isFinal)
  File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/xml/sax/expatreader.py", line 304, in start_element
    self._cont_handler.startElement(name, AttributesImpl(attrs))
  File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/site-packages/graf_python-0.3.0-py3.3.egg/graf/io.py", line 627, in startElement
    raise SAXException('No start handler for tag {0!r}'.format(name))  # FIXME: better exception
xml.sax._exceptions.SAXException: No start handler for tag 'header'
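
The traceback shows a dictionary lookup on the element name (`self._start_handlers[name]`) failing for `header`. To check which element names a file actually uses before handing it to a parser, here is a minimal sketch using only the standard library. The function name `count_tags` and the sample document are made up for illustration and are not part of graf-python or MASC.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def count_tags(source):
    """Count element names in an XML file path or file-like object."""
    counts = Counter()
    for _event, elem in ET.iterparse(source, events=("start",)):
        # Drop a namespace prefix such as "{http://...}header" -> "header".
        counts[elem.tag.rsplit("}", 1)[-1]] += 1
    return counts


# Tiny made-up document, not a real MASC file.
sample = io.BytesIO(b"<graph><header><tagsDecl/></header><node/><node/></graph>")
tag_counts = count_tags(sample)
print(sorted(tag_counts))
```

Comparing this list against the tags the parser has handlers for should show every unsupported element at once, rather than one exception per run.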



Could anyone kindly help me out with this?
Thanks a lot in advance!

Best,
Victor

110CYL068-vc.xml

pbouda

Mar 13, 2014, 8:38:55 AM
to poio-d...@googlegroups.com
Hi Victor,

It seems the header tag changed from "graphHeader" to "header". I just changed the code on GitHub; can you try with the current master?

Best,
Peter

Victor Yan

Mar 13, 2014, 9:00:37 AM
to poio-d...@googlegroups.com
Hi Peter,
Thanks a lot for the very prompt reply (and commit)!
Unfortunately, changing the header tag doesn't solve the problem. Now the error becomes:

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/site-packages/graf_python-0.3.0-py3.3.egg/graf/io.py", line 622, in startElement
    fn = self._start_handlers[name]
KeyError: 'tagsDecl'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/hbyan2/Dropbox/Projects/newtagger/graf_reader.py", line 4, in <module>
    gp.parse("/Users/hbyan2/Downloads/Mini-MASC 2/data/written/110CYL068-vc.xml")
  File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/site-packages/graf_python-0.3.0-py3.3.egg/graf/io.py", line 912, in parse
    do_parse(stream, graph)
  File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/site-packages/graf_python-0.3.0-py3.3.egg/graf/io.py", line 853, in do_parse
    parser.parse(filename)
  File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/xml/sax/expatreader.py", line 107, in parse
    xmlreader.IncrementalParser.parse(self, source)
  File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/xml/sax/xmlreader.py", line 123, in parse
    self.feed(buffer)
  File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/xml/sax/expatreader.py", line 207, in feed
    self._parser.Parse(data, isFinal)
  File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/xml/sax/expatreader.py", line 304, in start_element
    self._cont_handler.startElement(name, AttributesImpl(attrs))
  File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/site-packages/graf_python-0.3.0-py3.3.egg/graf/io.py", line 627, in startElement
    raise SAXException('No start handler for tag {0!r}'.format(name))  # FIXME: better exception
xml.sax._exceptions.SAXException: No start handler for tag 'tagsDecl'



I guess the problem lies in the way the XML headers are defined and handled. If we simply change the header tag, the parser would no longer be able to parse other GrAF annotations, e.g. the XML files from your tutorials.

I wonder if there is an up-to-date reference of GrAF headers that defines the tags and whether we can implement a flexible way to handle varying header definitions. :)
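
One way such flexible handling could look is sketched below: a SAX handler that maps alternative tag names onto one canonical name and records tags without a handler instead of raising. This is an assumption-laden illustration, not graf-python's code. The class `TolerantHandler`, the `ALIASES` table, and the handler set are hypothetical; only the attribute name `_start_handlers` and the "header"/"graphHeader" pair come from this thread.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class TolerantHandler(ContentHandler):
    # Hypothetical alias table: alternative tag name -> name the handlers know.
    ALIASES = {"header": "graphHeader"}

    def __init__(self):
        super().__init__()
        self.seen = []     # canonical names that had a handler
        self.skipped = []  # names with no handler, recorded instead of raising
        self._start_handlers = {
            "graph": self._record,
            "graphHeader": self._record,
            "node": self._record,
        }

    def _record(self, name, attrs):
        self.seen.append(name)

    def startElement(self, name, attrs):
        canonical = self.ALIASES.get(name, name)
        fn = self._start_handlers.get(canonical)
        if fn is None:
            # Unknown tag: remember it and carry on parsing.
            self.skipped.append(name)
            return
        fn(canonical, attrs)


# Tiny made-up document, not a real MASC file.
handler = TolerantHandler()
xml.sax.parseString(b"<graph><header><tagsDecl/></header><node/></graph>", handler)
print(handler.seen, handler.skipped)
```

The trade-off is that skipped tags are silently dropped, so anything they declare is lost; logging `skipped` at the end of a parse would at least make that visible.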

Thanks,
Victor

pbouda

Mar 13, 2014, 9:43:42 AM
to poio-d...@googlegroups.com
Hi Victor,

Yes, you are right. I thought you were using the latest MASC files and that the specification had changed. The official document is here; it still has "graphHeader" instead of "header":

http://www.iso.org/iso/catalogue_detail.htm?csnumber=37326

Did you try MASC 3.0.0?

http://www.anc.org/data/masc/downloads/data-download/

I reset the git master.

Peter

Victor Yan

Mar 13, 2014, 10:36:46 AM
to poio-d...@googlegroups.com
Oh, I see what the problem is now. I am using a sub-corpus of MASC (mini-MASC), which has more layers of annotation but was only annotated with the old XML definitions. Their webpage doesn't indicate this, which caused the confusion. For now I may just try converting the old-format XML files to the new format so that graf-python can parse them.
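
A rough sketch of what the renaming part of such a conversion might look like, using only the standard library. The `RENAMES` table is hypothetical: the "header" to "graphHeader" pair is the only one mentioned in this thread, and other elements such as "tagsDecl" may differ in structure and not just in name, so a plain rename is probably not the whole conversion.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping; only this one pair is mentioned in the thread.
RENAMES = {"header": "graphHeader"}


def rename_tags(root, renames):
    """Rename elements in place and return how many were changed."""
    count = 0
    for elem in root.iter():
        # Keep a namespace prefix like "{http://...}" if there is one.
        prefix, sep, local = elem.tag.rpartition("}")
        if local in renames:
            elem.tag = prefix + sep + renames[local]
            count += 1
    return count


# Tiny made-up document, not a real MASC file.
root = ET.fromstring("<graph><header><tagsDecl/></header><node/></graph>")
changed = rename_tags(root, RENAMES)
converted = ET.tostring(root, encoding="unicode")
print(changed, converted)
```

For real files, `ET.parse(path)` followed by `tree.write(new_path)` would replace the in-memory string used here.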
Thanks a lot again Peter! 

Best,
Victor Yan
