How do I deal with XML entities in parsing an XML file?

Ben Bullock

Aug 21, 2010, 10:08:00 PM
to golang-nuts
I am trying to write an XML parser for JMdict_e as downloadable here:

http://ftp.monash.edu.au/pub/nihongo/JMdict_e.gz (5.7 megabytes)

I already have an Expat-based parser for this file in C. As an
experiment, I tried to make a Go version of it.

When I try to run the program as follows, I get this error message:

error occurred XML syntax error on line 381: invalid character entity &n;

The entity is defined in the file.

Here is the offending input:

----------

<entry>
<ent_seq>1000000</ent_seq>
<r_ele>
<reb>ヽ</reb>
</r_ele>
<r_ele>
<reb>くりかえし</reb>
</r_ele>
<sense>
<pos>&n;</pos> -------------------- line 381
<gloss>repetition mark in katakana</gloss>
</sense>
</entry>

----------

Here is the program:

-----------

package main

import (
	"fmt"
	"os"
	"xml"
)

type Entry struct {
	ent_seq string "chardata"
}

func main() {
	jmdict_file := "/share/projects/j2e/dict/JMdict_e"
	src, err := os.Open(jmdict_file, os.O_RDONLY, 0)
	if err != nil {
		fmt.Printf("cannot open %s: %s\n", jmdict_file, err)
		return
	}
	// Defer the close only once the open has succeeded.
	defer src.Close()
	var entry Entry
	for {
		err := xml.Unmarshal(src, &entry)
		if err != nil {
			fmt.Printf("error occurred %s\n", err)
			break
		}
		fmt.Printf("%s\n", entry.ent_seq)
	}
}

------------

Any suggestions about how to go about this?

jimt

Aug 21, 2010, 11:14:54 PM
to golang-nuts
You need to supply the XML parser with an entity map (its Entity field)
that defines every entity you may encounter.

Here's a function that fills a map with all the W3C-defined XML
entities:
http://github.com/jteeuwen/go-pkg-xmlx/blob/master/xmlx/entitymap.go#L58

This bit shows how to use it:

xp := xml.NewParser(strings.NewReader(s))
xp.Entity = myEntityMap

I don't recommend loading up the entire map every time, because it's
pretty huge.
Try to pick and choose only those entities you are likely to encounter
in your data.
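
For anyone reading this with a current Go release: xml.NewParser has since become xml.NewDecoder in encoding/xml, and the Entity field is still there. A minimal sketch of the same idea follows. The Sense type and the decodeSense helper are made-up names for illustration, and the one-entry map is just the single entity the sample input needs.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// Sense mirrors the <sense> element of the sample entry (hypothetical type).
type Sense struct {
	Pos   string `xml:"pos"`
	Gloss string `xml:"gloss"`
}

// decodeSense parses one <sense> element, expanding any custom entities
// found in the supplied map. With a nil map, an undefined entity such as
// &n; is reported as a syntax error because the decoder is strict by default.
func decodeSense(src string, entities map[string]string) (Sense, error) {
	d := xml.NewDecoder(strings.NewReader(src))
	d.Entity = entities
	var s Sense
	err := d.Decode(&s)
	return s, err
}

func main() {
	src := `<sense><pos>&n;</pos><gloss>repetition mark in katakana</gloss></sense>`
	entities := map[string]string{"n": "noun (common) (futsuumeishi)"}

	s, err := decodeSense(src, entities)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%s: %s\n", s.Pos, s.Gloss)
}
```

Only the entities that actually occur in the data need to be in the map; the five predefined XML entities (&lt;, &gt;, &amp;, &apos;, &quot;) are handled without it.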

Ben Bullock

Aug 22, 2010, 12:17:26 AM
to golang-nuts


On Aug 22, 12:14 pm, jimt <jimteeu...@gmail.com> wrote:
> You need to supply the XML parser with an entity map (its Entity field)
> that defines every entity you may encounter.
>
> Here's a function that fills a map with all the W3C-defined XML
> entities: http://github.com/jteeuwen/go-pkg-xmlx/blob/master/xmlx/entitymap.go#L58

Actually the entities are defined in that file like this:

<!ENTITY n "noun (common) (futsuumeishi)">

With expat it automatically decodes them so that if I ask for the text
segment I get the above rather than &n;.

> This bit shows how to use it:
>
>    xp := xml.NewParser(strings.NewReader(s))
>    xp.Entity = myEntityMap

> I don't recommend loading up the entire map every time, because it's
> pretty huge.
> Try to pick and choose only those entities you are likely to encounter
> in your data.

Thanks, but I don't need those entities; I need the ones defined in
the JMdict_e file, where &n; means "noun".

How can I read in the entities from the file into the XML parser?

Ben Bullock

Aug 23, 2010, 1:34:44 AM
to golang-nuts
On Aug 22, 1:17 pm, Ben Bullock <benkasminbull...@gmail.com> wrote:

> How can I read in the entities from the file into the XML parser?

Since there was no answer, I went and looked at the source code:

http://golang.org/src/pkg/xml/xml.go

It seems like this facility does not exist in the current parser.

pmora...@gmail.com

Feb 22, 2015, 1:58:45 PM
to golan...@googlegroups.com
Five years later I am attempting the same thing and the same error persists. Have you been able to solve it?

Tamás Gulácsi

Feb 22, 2015, 3:47:46 PM
to golan...@googlegroups.com, pmora...@gmail.com

On Sunday, February 22, 2015 at 7:58:45 PM UTC+1, pmora...@gmail.com wrote:
> Five years later I am attempting the same thing and the same error persists. Have you been able to solve it?

It is not too hard to parse those ENTITY declarations and provide the xml.Decoder with the entity map:
http://play.golang.org/p/RnCRMSoTwz
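
A rough sketch of that approach, not necessarily what the linked playground snippet does: scan the document for <!ENTITY name "value"> declarations with a regular expression, then hand the resulting map to the decoder. The names parseEntities, decodeAll, Entry, JMdict and sample are made up for illustration. It assumes every declaration uses a double-quoted value, and it holds the whole document in memory, which is a simplification for a file of this size.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"regexp"
	"strings"
)

// entityRE matches <!ENTITY name "value"> (double-quoted values only; an assumption).
var entityRE = regexp.MustCompile(`<!ENTITY\s+([^\s"]+)\s+"([^"]*)"\s*>`)

// parseEntities collects name -> replacement text from ENTITY declarations.
func parseEntities(src string) map[string]string {
	m := make(map[string]string)
	for _, g := range entityRE.FindAllStringSubmatch(src, -1) {
		m[g[1]] = g[2]
	}
	return m
}

// Entry and JMdict are hypothetical types covering only part of the format.
type Entry struct {
	Seq string   `xml:"ent_seq"`
	Pos []string `xml:"sense>pos"`
}

type JMdict struct {
	Entries []Entry `xml:"entry"`
}

// decodeAll first extracts the entity map, then decodes the document with it.
func decodeAll(src string) ([]Entry, error) {
	d := xml.NewDecoder(strings.NewReader(src))
	d.Entity = parseEntities(src)
	var doc JMdict
	if err := d.Decode(&doc); err != nil {
		return nil, err
	}
	return doc.Entries, nil
}

// sample is a cut-down stand-in for the real file.
const sample = `<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE JMdict [
<!ENTITY n "noun (common) (futsuumeishi)">
]>
<JMdict>
<entry>
<ent_seq>1000000</ent_seq>
<sense>
<pos>&n;</pos>
<gloss>repetition mark in katakana</gloss>
</sense>
</entry>
</JMdict>`

func main() {
	entries, err := decodeAll(sample)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, e := range entries {
		fmt.Println(e.Seq, e.Pos)
	}
}
```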

Matt Sanford

Feb 22, 2015, 7:04:47 PM
to golan...@googlegroups.com, pmora...@gmail.com
Maybe you could do it with a single pass by intercepting the directive token and parsing it as a sub-doc, as it seems W3C intended. Something along the lines of http://play.golang.org/p/taNuSwwGij
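
A hedged sketch of the single-pass idea, which may differ from the linked playground code: read tokens with Decoder.Token, and when the DOCTYPE arrives as an xml.Directive, fill the decoder's Entity map before any element content is read. To keep it short, this version pulls the declarations out of the directive with a regular expression instead of parsing it as a sub-document, and it assumes double-quoted entity values. The names posValues, entityDecl and doc are invented for the example.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"regexp"
	"strings"
)

// entityDecl matches <!ENTITY name "value"> inside the DOCTYPE directive.
var entityDecl = regexp.MustCompile(`<!ENTITY\s+([^\s"]+)\s+"([^"]*)"\s*>`)

// posValues streams through src once and returns the text of every <pos>
// element, expanding entities declared in the document's own DOCTYPE.
func posValues(src string) ([]string, error) {
	d := xml.NewDecoder(strings.NewReader(src))
	d.Entity = make(map[string]string)
	var out []string
	for {
		tok, err := d.Token()
		if err == io.EOF {
			return out, nil
		}
		if err != nil {
			return nil, err
		}
		switch t := tok.(type) {
		case xml.Directive:
			// The DOCTYPE, including its internal subset, arrives as one token.
			for _, m := range entityDecl.FindAllStringSubmatch(string(t), -1) {
				d.Entity[m[1]] = m[2]
			}
		case xml.StartElement:
			if t.Name.Local == "pos" {
				var s string
				if err := d.DecodeElement(&s, &t); err != nil {
					return nil, err
				}
				out = append(out, s)
			}
		}
	}
}

// doc is a cut-down stand-in for the real file.
const doc = `<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE JMdict [
<!ELEMENT pos (#PCDATA)>
<!ENTITY n "noun (common) (futsuumeishi)">
]>
<JMdict><entry><sense><pos>&n;</pos></sense></entry></JMdict>`

func main() {
	vals, err := posValues(doc)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(vals)
}
```

This works because the decoder looks up the Entity map only when it meets a reference in the text, so entries added while handling the directive are in place before the first &n; is read.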