Detecting proper nouns


Darren Cook

May 25, 2009, 9:20:49 PM
to nlp-ja...@googlegroups.com
(I asked this on edict-jmdict list originally; here it is to kick off
some discussion; next post will summarize some of the answers I received
there.)

Background: I'm trying to extract likely sentence templates from a large
corpus (*). My basic plan is to change numbers to [NUMBER] and proper
nouns to [PROPER_NOUN] (I might break them into [NAME], [PLACE],
[OTHER], etc.).

Numbers are easy; I'm now considering ways to handle proper nouns.
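For the easy half, a minimal sketch of the number step (the [NUMBER] placeholder comes from the plan above; treating full-width digits the same as ASCII digits, and absorbing separators, are my assumptions):

```python
import re

# Runs of ASCII or full-width digits, optionally continued by more digits
# or decimal/thousands separators, collapse to a single placeholder.
NUMBER_RE = re.compile(r"[0-9０-９][0-9０-９,.]*")

def template_numbers(sentence):
    return NUMBER_RE.sub("[NUMBER]", sentence)

print(template_numbers("7162円90銭"))  # → [NUMBER]円[NUMBER]銭
```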

I'm considering using chasen, but its output is rather fine-grained. Is
there an alternative where e.g. verb endings are kept with the verb and
successive verbs are merged [1], and runs of nouns are merged [2]?

I think I could write a post-parser for chasen to do this, by detecting
the main type (名詞, 動詞, etc.) and grouping words while it doesn't
change, with a few extra rules to handle things like "助詞-接続助詞"
coming after a verb. But I wondered if anyone here knows of an existing
project that has already done something similar, or if there is an
alternative to chasen that would suit my purposes better?
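For what it's worth, that grouping idea fits in a few lines of Python; this sketch assumes the chasen output has already been split into (surface, POS) pairs from its first and fourth columns, and the only extra rule included is the conjunctive-particle case from [1]:

```python
def group_tokens(tokens):
    """Merge consecutive (surface, pos) pairs while the main POS -- the
    part before the first '-' -- stays the same; additionally attach a
    conjunctive particle (助詞-接続助詞) to a preceding verb group."""
    groups = []  # each entry: [merged_surface, main_pos]
    for surface, pos in tokens:
        main = pos.split("-")[0]
        if groups and (main == groups[-1][1] or
                       (pos == "助詞-接続助詞" and groups[-1][1] == "動詞")):
            groups[-1][0] += surface
        else:
            groups.append([surface, main])
    return [tuple(g) for g in groups]
```

On the examples in [1] and [2] this yields 強まって売り as a single 動詞 group and 東京株式市場 as a single 名詞 group.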

(I also thought of using the list of Wikipedia articles as a list of
proper nouns. But it contains normal nouns too, and anyway I suspect
trying to match with it will get messy.)

Any suggestions welcome, thanks,

Darren

*: As background to the background, this "large corpus" is actually
sentences for which my experimental MT system has no translation, or a
low-confidence translation, and at this stage I am trying to get a feel
for how I can most efficiently increase its coverage.

[1]: E.g.
強まっ ツヨマッ 強まる 動詞-自立 五段・ラ行 連用タ接続
て テ て 助詞-接続助詞
売り ウリ 売る 動詞-自立 五段・ラ行 連用形
--> 強まって売り

[2]: E.g.
東京 トウキョウ 東京 名詞-固有名詞-地域-一般
株式 カブシキ 株式 名詞-一般
市場 シジョウ 市場 名詞-一般
--> 東京株式市場


--
Darren Cook, Software Researcher/Developer
http://dcook.org/mlsn/ (English-Japanese-German-Chinese-Arabic
open source dictionary/semantic network)
http://dcook.org/work/ (About me and my work)
http://dcook.org/blogs.html (My blogs and articles)

Darren Cook

May 25, 2009, 9:23:35 PM
to nlp-ja...@googlegroups.com
(Summary of thread from edict-jmdict list)

Some background reading, from Jim Breen:

Chasen/Yamcha being used:
http://www.ldc.upenn.edu/acl/N/N03/N03-1002.pdf
http://www.csse.monash.edu.au/~jwb/res/66-203.pdf
Juman+KNP being used:
http://www.aclweb.org/anthology/D08-1045

Francis Bond wrote:
For Ubuntu users, UTF-8 versions of YamCha, Kabocha and loads more stuff:
http://cl.aist-nara.ac.jp/~eric-n/ubuntu-nlp/dists/hardy/japanese/

(Eric Nichols added: "You can install both EUC and UTF-8 versions and
switch between them using ubuntu's alternatives system.")

English docs for chasen:
http://sourceforge.jp/projects/chasen-legacy/docs/chasen-2.4.0-manual-en.pdf/en/1/chasen-2.4.0-manual-en.pdf.pdf

CHASEN has a "run these POSs together" option (section 13 of the
manual; set it in your .chasenrc).

Jim wrote:
I should point out that for what Darren is doing, Kabocha is probably
more useful than YamCha. YamCha is "empty" - you have to train it for
the target language and problem. Kabocha comes with a massive training
file for Japanese (installation takes an age.)

Eric added:

Yamcha is a program that cabocha uses for chunking and named entity
recognition. It isn't interesting in and of itself unless you want to
train your own models. In cabocha 0.59 it has been deprecated in favor
of a CRF-based model.

Francis Bond

May 25, 2009, 9:43:56 PM
to nlp-ja...@googlegroups.com
G'day,

> Background: I'm trying to extract likely sentence templates from a large
> corpus (*). My basic plan is to change numbers to [NUMBER] and proper
> nouns to [PROPER_NOUN] (I might break them into [NAME], [PLACE],
> [OTHER], etc.).
>
> Numbers are easy; I'm now considering ways to handle proper nouns.
>
> I'm considering using chasen, but its output is rather fine-grained. Is
> there an alternative where e.g. verb endings are kept with the verb and
> successive verbs are merged [1], and runs of nouns are merged [2]?
>
> I think I could write a post-parser for chasen to do this, by detecting
> the main type (名詞, 動詞, etc.) and grouping words while it doesn't
> change, with a few extra rules to handle things like "助詞-接続助詞"
> coming after a verb. But I wondered if anyone here knows of an existing
> project that has already done something similar, or if there is an
> alternative to chasen that would suit my purposes better?

Yamcha, which takes Chasen output (or MeCab output) as input, tags
NEs as follows:

下水道 ゲスイドウ 下水道 名詞-一般 B-ORGANIZATION
新 シン 新 接頭詞-名詞接続 I-ORGANIZATION
技術 ギジュツ 技術 名詞-一般 I-ORGANIZATION
推進 スイシン 推進 名詞-サ変接続 I-ORGANIZATION
機構 キコウ 機構 名詞-一般 I-ORGANIZATION

2 ニ 2 名詞-数 B-DATE
0 ゼロ 0 名詞-数 I-DATE
0 ゼロ 0 名詞-数 I-DATE
4 ヨン 4 名詞-数 I-DATE
年 ネン 年 名詞-接尾-助数詞 I-DATE

B means Beginning and I means Inside, so an entity starts at the
B tag and continues until you run out of I's; that is,
下水道新技術推進機構 is an ORGANIZATION and 2004年 is a DATE.
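That walk over the tags can be sketched as follows; a minimal reading of the B-/I- scheme, assuming (surface, tag) pairs taken from the first and last columns above:

```python
def collect_entities(rows):
    """Collect (text, type) spans from B-/I- tags: a span opens at
    B-<TYPE> and absorbs the following I-<TYPE> rows."""
    entities, current = [], None
    for surface, tag in rows:
        if tag.startswith("B-"):
            if current:
                entities.append(tuple(current))
            current = [surface, tag[2:]]
        elif tag.startswith("I-") and current and tag[2:] == current[1]:
            current[0] += surface
        else:  # "O" or a stray I- tag closes any open span
            if current:
                entities.append(tuple(current))
            current = None
    if current:
        entities.append(tuple(current))
    return entities
```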

If you download Cabocha, then it includes Yamcha with a trained model
for doing NE recognition. It goes on to do dependency analysis.

Yours,

--
Francis Bond <http://www2.nict.go.jp/x/x161/en/member/bond/>
NICT Language Infrastructure Group

Jim Breen

May 25, 2009, 9:58:18 PM
to nlp-ja...@googlegroups.com
> Jim wrote:
> I should point out that for what Darren is doing, Kabocha is probably
> more useful than YamCha. YamCha is "empty" - you have to train it for
> the target language and problem. Kabocha comes with a massive training
> file for Japanese (installation takes an age.)

Sorry about the "Kabocha". I have a problem with "ca" when I'm thinking か.

> Eric added:
>
> Yamcha is a program that cabocha uses for chunking and named
> recognition. It isn't interesting in and of itself unless you want to
> train your own models. It has been deprecated in cabocha 0.59 for a
> CRF-based model instead.

cabocha is the place to start, but it's a bit spotty with name identification.
When I try it with: 和子が胸をはだけて赤ん坊に乳をふくませた。 it produces:
<PERSON>和子</PERSON>が---D
胸を-D
はだけて-----D
赤ん坊に---D
乳を-D
ふくませた。

which is great. But when I tried: 池田が姓で和子が名です。 it gives:
<PERSON>池田</PERSON>が-----D
姓で---D
和子が-D
名です。

The problem seems to lie in Chasen, which identifies 和子 as
"カズコ" and "名詞-固有名詞-人名-名" in the first example and
"ワコ" and "名詞-一般" in the second. (I see MeCab goes for ワコ
in both sentences.) The weights of カズコ and ワコ in IPADIC are
2052 and 3999 respectively.

Jim

--
Jim Breen
Adjunct Senior Research Fellow
Clayton School of Information Technology,
Monash University, VIC 3800, Australia
http://www.csse.monash.edu.au/~jwb/

Francis Bond

May 25, 2009, 10:07:00 PM
to nlp-ja...@googlegroups.com
G'day,

> cabocha is the place to start, but it's a bit spotty with name identification.

This is very true. We tried running Japanese (Cabocha) and English
(Oak) NE recognizers on both sides of the Tanaka corpus and found so
little overlap as to be useless for aligning names. However, as far as
I know, Cabocha is the best available Japanese NE recognizer --- this
is the state of the art (;_;).

Darren Cook

May 25, 2009, 10:19:02 PM
to nlp-ja...@googlegroups.com
> If you download Cabocha, then it includes Yamcha with a trained model
> for doing NE recognition. It goes on to do dependency analysis.

Hi Francis,
I was trying Cabocha yesterday (the ready-made ubuntu package is working
perfectly with UTF-8).

I like what it is doing; it is closer to what I need than raw chasen
output. It also has XML output (-f 3) making it straightforward to use
the data from my PHP script.

Sample output is included below [1]. I love the diverse date formats it
is discovering, but this one is unfortunate:
<ORGANIZATION>日経</ORGANIZATION>平均株価-------------------D

as 日経平均株価, or 日経平均株価(225種), is really the noun.

I'm also bothered by:
<LOCATION>米</LOCATION>政府
<LOCATION>米</LOCATION>株式市場

I think these bother me because 米 is part of a bigger noun, unlike the
standalone nouns in [2].

However, in the XML I can easily detect when a noun with ne="LOCATION"
is followed by another noun, or simply group consecutive nouns and only
use the ne="..." attribute if it is set for all nouns in the group.
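That second idea could look like the sketch below. The `<tok>` element with a comma-separated `feature` attribute and an `ne` attribute taking values like B-LOCATION / I-LOCATION / O is my reading of the -f 3 output, not a documented contract, so treat this as an illustration only:

```python
import xml.etree.ElementTree as ET

def noun_groups(sentence_xml):
    """Merge consecutive noun tokens in CaboCha-style XML; keep an NE
    type only when every token in the run carries it, so 米政府 does
    not count as a LOCATION while a standalone 日本 does."""
    groups, run = [], []  # run holds (surface, ne_type_or_None)

    def flush():
        if run:
            text = "".join(s for s, _ in run)
            types = {t for _, t in run}
            groups.append((text,
                           types.pop() if len(types) == 1 and types != {None}
                           else None))
            run.clear()

    for tok in ET.fromstring(sentence_xml).iter("tok"):
        pos = tok.get("feature", "").split(",")[0]
        ne = tok.get("ne", "O")
        if pos == "名詞":
            run.append((tok.text, ne[2:] if ne != "O" else None))
        else:
            flush()
    flush()
    return groups
```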

Darren


[1]:
<DATE>24日</DATE>の-D
東京株式市場は、-----------------------------------D
国内外の-D |
景気悪化が-D |
長期化する-D |
懸念が-D |
強まって---D |
売りが-D |
優勢となり、---------------------D
<ORGANIZATION>日経</ORGANIZATION>平均株価-------------------D
(225種)は-----------------D
一時、-------D |
<DATE>昨年-D | |
10月-D | |
27日</DATE>に-D |
つけた-D |
終値の-D |
バブル後最安値---D
(<ARTIFACT>7162円90銭</ARTIFACT>)を-D
下回った。
EOS
EOS
<LOCATION>米</LOCATION>政府が-----------D
<DATE>前日</DATE>、---------D
金融機関への---D |
追加的な-D |
資本注入の-D |
実施を-D
発表したにもかかわらず、-----D
<LOCATION>米</LOCATION>株式市場で---D
株価が-D
急落した-D
ことが-D
嫌気された。
EOS

[2]:
<LOCATION>イギリス</LOCATION>に-D
生まれて、---D
<LOCATION>日本</LOCATION>に-D
住んでいます。

Francis Bond

May 25, 2009, 10:29:44 PM
to nlp-ja...@googlegroups.com
G'day,

2009/5/26 Darren Cook <dar...@dcook.org>:

> Hi Francis,
> I was trying Cabocha yesterday (the ready-made ubuntu package is working
>  perfectly with UTF-8).

Great.

> I like what it is doing; it is closer to what I need than raw chasen
> output. It also has XML output (-f 3) making it straightforward to use
> the data from my PHP script.
>
> Sample output is included below [1]. I love the diverse date formats it
> is discovering. But this one is unfortunate
>  <ORGANIZATION>日経</ORGANIZATION>平均株価-------------------D
>
> as 日経平均株価, or 日経平均株価(225種), is really the noun.

I think that technically 日経平均株価 "Nikkei Stock Average (Price)" is a
noun phrase, consisting of 3 (or 4) nouns. Two of the problems with
NE recognition are that you normally want something larger than a noun,
and that there is often more than one way of naming the same entity.


> I'm also bothered by:
>  <LOCATION>米</LOCATION>政府
>  <LOCATION>米</LOCATION>株式市場
>
> I think these bother me as 米 is part of a bigger noun, compared to the
> standalone nouns as in [2].

Again, I think 米政府 "American government" is a noun phrase, not a noun.
I think the NE recognizer would be more helpful if it recognized the
whole NP as a location, so this is a bug :-).
