Detecting proper nouns


Darren Cook

May 25, 2009, 9:20:49 PM
to nlp-ja...@googlegroups.com
(I asked this on edict-jmdict list originally; here it is to kick off
some discussion; next post will summarize some of the answers I received
there.)

Background: I'm trying to extract likely sentence templates from a large
corpus (*). My basic plan is to change numbers to [NUMBER] and proper
nouns to [PROPER_NOUN] (I might break them into [NAME], [PLACE],
[OTHER], etc.).

Numbers are easy; I'm now considering ways to handle proper nouns.
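For the easy half, a minimal sketch of the number step (the [NUMBER] placeholder comes from the plan above; treating full-width digits the same as ASCII digits, and absorbing separators, are my assumptions):

```python
import re

# Runs of ASCII or full-width digits, optionally continued by more digits
# or decimal/thousands separators, collapse to a single placeholder.
NUMBER_RE = re.compile(r"[0-9０-９][0-9０-９,.]*")

def template_numbers(sentence):
    return NUMBER_RE.sub("[NUMBER]", sentence)

print(template_numbers("7162円90銭"))  # → [NUMBER]円[NUMBER]銭
```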

I'm considering using chasen, but its output is rather fine-grained. Is
there an alternative where e.g. verb endings are kept with the verb and
successive verbs are merged [1], and runs of nouns are merged [2]?

I think I could write a post-parser for chasen to do this, by detecting
the main type (名詞, 動詞, etc.) and grouping words while it doesn't
change, with a few extra rules to handle things like "助詞-接続助詞"
coming after a verb. But I wondered if anyone here knows of an existing
project that has already done something similar, or if there is an
alternative to chasen that would suit my purposes better?
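For what it's worth, that grouping idea fits in a few lines of Python; this sketch assumes the chasen output has already been split into (surface, POS) pairs from its first and fourth columns, and the only extra rule included is the conjunctive-particle case from [1]:

```python
def group_tokens(tokens):
    """Merge consecutive (surface, pos) pairs while the main POS -- the
    part before the first '-' -- stays the same; additionally attach a
    conjunctive particle (助詞-接続助詞) to a preceding verb group."""
    groups = []  # each entry: [merged_surface, main_pos]
    for surface, pos in tokens:
        main = pos.split("-")[0]
        if groups and (main == groups[-1][1] or
                       (pos == "助詞-接続助詞" and groups[-1][1] == "動詞")):
            groups[-1][0] += surface
        else:
            groups.append([surface, main])
    return [tuple(g) for g in groups]
```

On the examples in [1] and [2] this yields 強まって売り as a single 動詞 group and 東京株式市場 as a single 名詞 group.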

(I also thought of using the list of Wikipedia articles as a list of
proper nouns. But it contains normal nouns too, and anyway I suspect
trying to match with it will get messy.)

Any suggestions welcome, thanks,

Darren

*: As background to the background, this "large corpus" is actually
sentences for which my experimental MT system has no translation, or a
low-confidence translation, and at this stage I am trying to get a feel
for how I can most efficiently increase its coverage.

[1]: E.g.
強まっ ツヨマッ 強まる 動詞-自立 五段・ラ行 連用タ接続
て テ て 助詞-接続助詞
売り ウリ 売る 動詞-自立 五段・ラ行 連用形
--> 強まって売り

[2]: E.g.
東京 トウキョウ 東京 名詞-固有名詞-地域-一般
株式 カブシキ 株式 名詞-一般
市場 シジョウ 市場 名詞-一般
--> 東京株式市場


--
Darren Cook, Software Researcher/Developer
http://dcook.org/mlsn/ (English-Japanese-German-Chinese-Arabic
open source dictionary/semantic network)
http://dcook.org/work/ (About me and my work)
http://dcook.org/blogs.html (My blogs and articles)

Darren Cook

May 25, 2009, 9:23:35 PM
to nlp-ja...@googlegroups.com
(Summary of thread from edict-jmdict list)

Some background reading, from Jim Breen:

Chasen/Yamcha being used:
http://www.ldc.upenn.edu/acl/N/N03/N03-1002.pdf
http://www.csse.monash.edu.au/~jwb/res/66-203.pdf
Juman+KNP being used:
http://www.aclweb.org/anthology/D08-1045

Francis Bond wrote:
For Ubuntu users, UTF-8 versions of YamCha, Kabocha and loads more stuff:
http://cl.aist-nara.ac.jp/~eric-n/ubuntu-nlp/dists/hardy/japanese/

(Eric Nichols added: "You can install both EUC and UTF-8 versions and
switch between them using ubuntu's alternatives system.")

English docs for chasen:
http://sourceforge.jp/projects/chasen-legacy/docs/chasen-2.4.0-manual-en.pdf/en/1/chasen-2.4.0-manual-en.pdf.pdf

CHASEN has a "run these POSs together" option (section 13 of the
manual; set it in your .chasenrc).

Jim wrote:
I should point out that for what Darren is doing, Kabocha is probably
more useful than YamCha. YamCha is "empty" - you have to train it for
the target language and problem. Kabocha comes with a massive training
file for Japanese (installation takes an age.)

Eric added:

Yamcha is a program that cabocha uses for chunking and named entity
recognition. It isn't interesting in and of itself unless you want to
train your own models. In cabocha 0.59 it has been deprecated in favor
of a CRF-based model.

Francis Bond

May 25, 2009, 9:43:56 PM
to nlp-ja...@googlegroups.com
G'day,

> Background: I'm trying to extract likely sentence templates from a large
> corpus (*). My basic plan is to change numbers to [NUMBER] and proper
> nouns to [PROPER_NOUN] (I might break them into [NAME], [PLACE],
> [OTHER], etc.).
>
> Numbers are easy; I'm now considering ways to handle proper nouns.
>
> I'm considering using chasen, but its output is rather fine-grained. Is
> there an alternative where e.g. verb endings are kept with the verb and
> successive verbs are merged [1], and runs of nouns are merged [2]?
>
> I think I could write a post-parser for chasen to do this, by detecting
> the main type (名詞, 動詞, etc.) and grouping words while it doesn't
> change, with a few extra rules to handle things like "助詞-接続助詞"
> coming after a verb. But I wondered if anyone here knows of an existing
> project that has already done something similar, or if there is an
> alternative to chasen that would suit my purposes better?

Yamcha, which takes Chasen output (or MeCab output) as input, tags
NEs as follows:

下水道 ゲスイドウ 下水道 名詞-一般 B-ORGANIZATION
新 シン 新 接頭詞-名詞接続 I-ORGANIZATION
技術 ギジュツ 技術 名詞-一般 I-ORGANIZATION
推進 スイシン 推進 名詞-サ変接続 I-ORGANIZATION
機構 キコウ 機構 名詞-一般 I-ORGANIZATION

2 ニ 2 名詞-数 B-DATE
0 ゼロ 0 名詞-数 I-DATE
0 ゼロ 0 名詞-数 I-DATE
4 ヨン 4 名詞-数 I-DATE
年 ネン 年 名詞-接尾-助数詞 I-DATE

B means Beginning and I means Inside, so an entity starts at the
B tag and continues until you run out of I's; that is,
下水道新技術推進機構 is an ORGANIZATION and 2004年 is a DATE.
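That walk over the tags can be sketched as follows; a minimal reading of the B-/I- scheme, assuming (surface, tag) pairs taken from the first and last columns above:

```python
def collect_entities(rows):
    """Collect (text, type) spans from B-/I- tags: a span opens at
    B-<TYPE> and absorbs the following I-<TYPE> rows."""
    entities, current = [], None
    for surface, tag in rows:
        if tag.startswith("B-"):
            if current:
                entities.append(tuple(current))
            current = [surface, tag[2:]]
        elif tag.startswith("I-") and current and tag[2:] == current[1]:
            current[0] += surface
        else:  # "O" or a stray I- tag closes any open span
            if current:
                entities.append(tuple(current))
            current = None
    if current:
        entities.append(tuple(current))
    return entities
```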

If you download Cabocha, then it includes Yamcha with a trained model
for doing NE recognition. It goes on to do dependency analysis.

Yours,

--
Francis Bond <http://www2.nict.go.jp/x/x161/en/member/bond/>
NICT Language Infrastructure Group

Jim Breen

May 25, 2009, 9:58:18 PM
to nlp-ja...@googlegroups.com
> Jim wrote:
> I should point out that for what Darren is doing, Kabocha is probably
> more useful than YamCha. YamCha is "empty" - you have to train it for
> the target language and problem. Kabocha comes with a massive training
> file for Japanese (installation takes an age.)

Sorry about the "Kabocha". I have a problem with "ca" when I'm thinking か.

> Eric added:
>
> Yamcha is a program that cabocha uses for chunking and named
> recognition. It isn't interesting in and of itself unless you want to
> train your own models. It has been deprecated in cabocha 0.59 for a
> CRF-based model instead.

cabocha is the place to start, but it's a bit spotty with name identification.
When I try it with: 和子が胸をはだけて赤ん坊に乳をふくませた。 it produces:
<PERSON>和子</PERSON>が---D
胸を-D
はだけて-----D
赤ん坊に---D
乳を-D
ふくませた。

which is great. But when I tried: 池田が姓で和子が名です。 it gives:
<PERSON>池田</PERSON>が-----D
姓で---D
和子が-D
名です。

The problem seems to lie in Chasen, which identifies 和子 as
"カズコ" and "名詞-固有名詞-人名-名" in the first example and
"ワコ" and "名詞-一般" in the second. (I see MeCab goes for ワコ
in both sentences.) The weights of カズコ and ワコ in IPADIC are
2052 and 3999 respectively.

Jim

--
Jim Breen
Adjunct Senior Research Fellow
Clayton School of Information Technology,
Monash University, VIC 3800, Australia
http://www.csse.monash.edu.au/~jwb/

Francis Bond

May 25, 2009, 10:07:00 PM
to nlp-ja...@googlegroups.com
G'day,

> cabocha is the place to start, but it's a bit spotty with name identification.

This is very true. We tried running Japanese (Cabocha) and English
(Oak) NE recognizers on both sides of the Tanaka corpus and found so
little overlap as to be useless for aligning names. However, as far as
I know, Cabocha is the best available Japanese NE recognizer --- this
is the state of the art (;_;).

Darren Cook

May 25, 2009, 10:19:02 PM
to nlp-ja...@googlegroups.com
> If you download Cabocha, then it includes Yamcha with a trained model
> for doing NE recognition. It goes on to do dependency analysis.

Hi Francis,
I was trying Cabocha yesterday (the ready-made ubuntu package is working
perfectly with UTF-8).

I like what it is doing; it is closer to what I need than raw chasen
output. It also has XML output (-f 3) making it straightforward to use
the data from my PHP script.

Sample output is included below [1]. I love the diverse date formats it
is discovering, but this one is unfortunate:
<ORGANIZATION>日経</ORGANIZATION>平均株価-------------------D

as 日経平均株価, or 日経平均株価(225種), is really the noun.

I'm also bothered by:
<LOCATION>米</LOCATION>政府
<LOCATION>米</LOCATION>株式市場

I think these bother me because 米 is part of a bigger noun, unlike the
standalone nouns in [2].

However, in the XML I can easily detect when a noun with ne="LOCATION"
is followed by another noun, or simply group consecutive nouns and only
use the ne="..." attribute if it is set for all nouns in the group.
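That second idea could look like the sketch below. The `<tok>` element with a comma-separated `feature` attribute and an `ne` attribute taking values like B-LOCATION / I-LOCATION / O is my reading of the -f 3 output, not a documented contract, so treat this as an illustration only:

```python
import xml.etree.ElementTree as ET

def noun_groups(sentence_xml):
    """Merge consecutive noun tokens in CaboCha-style XML; keep an NE
    type only when every token in the run carries it, so 米政府 does
    not count as a LOCATION while a standalone 日本 does."""
    groups, run = [], []  # run holds (surface, ne_type_or_None)

    def flush():
        if run:
            text = "".join(s for s, _ in run)
            types = {t for _, t in run}
            groups.append((text,
                           types.pop() if len(types) == 1 and types != {None}
                           else None))
            run.clear()

    for tok in ET.fromstring(sentence_xml).iter("tok"):
        pos = tok.get("feature", "").split(",")[0]
        ne = tok.get("ne", "O")
        if pos == "名詞":
            run.append((tok.text, ne[2:] if ne != "O" else None))
        else:
            flush()
    flush()
    return groups
```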

Darren


[1]:
<DATE>24日</DATE>の-D
東京株式市場は、-----------------------------------D
国内外の-D |
景気悪化が-D |
長期化する-D |
懸念が-D |
強まって---D |
売りが-D |
優勢となり、---------------------D
<ORGANIZATION>日経</ORGANIZATION>平均株価-------------------D
(225種)は-----------------D
一時、-------D |
<DATE>昨年-D | |
10月-D | |
27日</DATE>に-D |
つけた-D |
終値の-D |
バブル後最安値---D
(<ARTIFACT>7162円90銭</ARTIFACT>)を-D
下回った。
EOS
EOS
<LOCATION>米</LOCATION>政府が-----------D
<DATE>前日</DATE>、---------D
金融機関への---D |
追加的な-D |
資本注入の-D |
実施を-D
発表したにもかかわらず、-----D
<LOCATION>米</LOCATION>株式市場で---D
株価が-D
急落した-D
ことが-D
嫌気された。
EOS

[2]:
<LOCATION>イギリス</LOCATION>に-D
生まれて、---D
<LOCATION>日本</LOCATION>に-D
住んでいます。

Francis Bond

May 25, 2009, 10:29:44 PM
to nlp-ja...@googlegroups.com
G'day,

2009/5/26 Darren Cook <dar...@dcook.org>:

> Hi Francis,
> I was trying Cabocha yesterday (the ready-made ubuntu package is working
>  perfectly with UTF-8).

Great.

> I like what it is doing; it is closer to what I need than raw chasen
> output. It also has XML output (-f 3) making it straightforward to use
> the data from my PHP script.
>
> Sample output is included below [1]. I love the diverse date formats it
> is discovering. But this one is unfortunate
>  <ORGANIZATION>日経</ORGANIZATION>平均株価-------------------D
>
> as 日経平均株価, or 日経平均株価(225種), is really the noun.

I think that technically 日経平均株価 "Nikkei Stock Average (Price)" is a
noun phrase, consisting of 3 (or 4) nouns. Two of the problems with
NE recognition are that you normally want something larger than a noun,
and that there is often more than one way of naming the same entity.


> I'm also bothered by:
>  <LOCATION>米</LOCATION>政府
>  <LOCATION>米</LOCATION>株式市場
>
> I think these bother me as 米 is part of a bigger noun, compared to the
> standalone nouns as in [2].

Again, I think 米政府 "American government" is a noun phrase, not a noun.
I think the NE recognizer would be more helpful if it recognized the
whole NP as a location, so this is a bug :-).
