Pialign Results (Chinese-English) Alignment


Lee Sapphire

May 12, 2015, 10:32:54 AM
to pialig...@googlegroups.com
Dear Professor Neubig,

I am a postgraduate student from the Graduate Program in Translation and Interpretation at National Taiwan University. My name is Sapphire Lee.

I am doing translation research [extracting noun phrases from all translation pairs] using your Pialign tool to get phrasal translation pairs (from a manually aligned Chinese-English legal corpus). When I read the results produced by Pialign, I see that there are 7 types of scores in the third column, as shown below.
Could you tell me what these scores [in red] stand for?


Chinese | English | Scores
世界 各國 | from foreign states | 0.2 0.2 0.6 0.6 0.2 0.2
世界 各國 地區 | from foreign states and regions | 0.2 0.6 0.2 0.2 0.6 0.2
世界 各國 地區 | from foreign states or regions | 0.2 0.6 0.2 0.2 0.6 0.2
中央 人民 政府 | to the central people's government | 0.2 0.2 0.6 0.6 0.2 0.2
中央 人民 政府 | to the central people's government and | 0.142857142857143 0.142857142857143 0.714285714285714 0.714285714285714 0.142857142857143 0.142857142857143
國防 外交 | defence and foreign affairs | 0.2 0.2 0.6 0.2 0.6 0.2
外國 | to foreign | 0.2 0.2 0.6 0.6 0.2 0.2
外籍 公務人員 | regarding public servants of foreign | 0.6 0.2 0.2 0.2 0.2 0.6
居民 | or other premises shall be | 0.6 0.2 0.2 0.2 0.2 0.6
居民 | residents in | 0.6 0.2 0.2 0.2 0.2 0.6
居民 | shall be | 0.111111111111111 0.555555555555556 0.333333333333333 0.111111111111111 0.555555555555556 0.333333333333333
政府 工作 | government work | 0.2 0.6 0.2 0.2 0.2 0.6
政府 工作 | on the work of the government | 0.2 0.6 0.2 0.2 0.6 0.2
政府 工作 | the government work | 0.2 0.6 0.2 0.2 0.6 0.2
政府 工作 提出 | about the government work | 0.2 0.6 0.2 0.2 0.6 0.2
本法 進行 解釋 徵詢 | before giving an interpretation of this law | 0.142857142857143 0.142857142857143 0.714285714285714 0.142857142857143 0.714285714285714 0.142857142857143
法案 | bills | 0.2 0.2 0.6 0.6 0.2 0.2
法案 議案 | bills and motions | 0.2 0.2 0.6 0.2 0.6 0.2
法院 審判權 | their jurisdiction | 0.2 0.2 0.6 0.6 0.2 0.2
所屬 | consult its | 0.142857142857143 0.142857142857143 0.714285714285714 0.142857142857143 0.142857142857143 0.714285714285714
人民 | people's | 0.658181818181818 0.00363636363636364 0.338181818181818 0.992727272727273 0.00363636363636364 0.00363636363636364
人民 | the | 0.2 0.2 0.6 0.2 0.2 0.6
人民 代表 | national | 0.2 0.2 0.6 0.6 0.2 0.2
人民 代表 | of | 0.142857142857143 0.142857142857143 0.714285714285714 0.142857142857143 0.714285714285714 0.142857142857143
人民 代表 大會 | an | 0.714285714285714 0.142857142857143 0.142857142857143 0.142857142857143 0.142857142857143 0.714285714285714
人民 代表 大會 | interpretation of | 0.142857142857143 0.142857142857143 0.714285714285714 0.714285714285714 0.142857142857143 0.142857142857143
人民 代表 大會 | national people’s | 0.2 0.6 0.2 0.2 0.6 0.2
人民 代表 大會 常務 委員會 | interpretation of the standing committee | 0.142857142857143 0.142857142857143 0.714285714285714 0.142857142857143 0.714285714285714 0.142857142857143
人民 代表 大會 常務 委員會 | the interpretation of the standing committee | 0.714285714285714 0.142857142857143 0.142857142857143 0.142857142857143 0.142857142857143 0.714285714285714
人民 共和國 | people's republic of | 0.010989010989011 0.956043956043956 0.032967032967033 0.010989010989011 0.010989010989011 0.978021978021978
人民 共和國 | people's republic | 0.0105263157894737 0.0105263157894737 0.978947368421053 0.0526315789473684 0.0105263157894737 0.936842105263158
人民 共和國 | the people's republic of | 0.010989010989011 0.956043956043956 0.032967032967033 0.010989010989011 0.472527472527473 0.516483516483517
人民 共和國 | people's republic of | 0.142857142857143 0.714285714285714 0.142857142857143 0.142857142857143 0.142857142857143 0.714285714285714


In your paper "An Unsupervised Model for Joint Phrase Alignment and Extraction" (p.5, 5.1 Heuristic Phrase Extraction), it seems that you mention only five features in the phrase table:
1. (1st & 2nd scores) The conditional probabilities of the phrases p_t(c|e), p_t(e|c)
2. (3rd score) The joint probability of the phrase pair p_t(c,e)
3. (4th score) The average posterior probability of a span containing c,e
4. (5th & 6th scores) Lexical weighting probabilities using Model 1 word probabilities
5. (7th score) The uniform phrase penalty (?)
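
For readers trying to map scores to features, the reading proposed in the list above can be sketched in Python. This is only an illustration: it assumes a Moses-style line layout (`source ||| target ||| scores`) with seven scores in the listed order, and neither the layout nor the order is confirmed in this thread. The feature names, the function name, and the sample line are all made up.

```python
from typing import NamedTuple

# Hypothetical names following the poster's proposed reading of the 7 scores.
# The order is an assumption, not confirmed for Pialign's output.
FEATURE_NAMES = [
    "p_c_given_e",
    "p_e_given_c",
    "p_joint",
    "avg_posterior",
    "lex_c_given_e",
    "lex_e_given_c",
    "phrase_penalty",
]


class PhraseEntry(NamedTuple):
    source: str
    target: str
    features: dict


def parse_phrase_line(line, names=FEATURE_NAMES):
    """Split one 'source ||| target ||| scores' line into named features."""
    fields = [field.strip() for field in line.split("|||")]
    if len(fields) < 3:
        raise ValueError("expected at least three '|||'-separated fields")
    scores = [float(value) for value in fields[2].split()]
    if len(scores) != len(names):
        raise ValueError(f"expected {len(names)} scores, got {len(scores)}")
    return PhraseEntry(fields[0], fields[1], dict(zip(names, scores)))


# Invented example line; the numbers are not real Pialign output.
entry = parse_phrase_line("法案 ||| bills ||| 0.5 0.4 0.01 0.3 0.6 0.7 2.718")
print(entry.source, entry.target, entry.features["p_joint"])
```

The length check is deliberate: a line with six scores, like the ones in the table above, raises an error instead of being silently mislabeled.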


In the first phase of my research, I did not add POS tags to either the Chinese or the English texts [just segmentation]. I am wondering whether Pialign would yield more accurate results if the texts were tagged with POS.

I would like to use the results produced by Pialign to develop my thesis. If possible, I want to extract noun phrases and collocations from the Pialign output. But as some parts of the results are noisy, I am still looking for a way to eliminate the noise and get the term/phrasal patterns that I want.
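
One common way to cut this kind of noise is to filter phrase pairs by probability and by length ratio. The Python sketch below is only an illustration of that idea: the function name, the thresholds, and the probabilities in the examples are all invented, and it assumes that two conditional translation probabilities are available for each pair.

```python
def keep_pair(source, target, p_fwd, p_bwd, min_prob=0.1, max_len_ratio=3.0):
    """Return True if a phrase pair passes two simple noise filters.

    source, target: space-separated token strings
    p_fwd, p_bwd:   conditional translation probabilities in both directions
    min_prob and max_len_ratio are illustrative thresholds, not tuned values.
    """
    src_len = len(source.split())
    tgt_len = len(target.split())
    if src_len == 0 or tgt_len == 0:
        return False
    # Ratio of the longer side to the shorter side, always >= 1.
    ratio = max(src_len, tgt_len) / min(src_len, tgt_len)
    return p_fwd >= min_prob and p_bwd >= min_prob and ratio <= max_len_ratio


# Phrase pairs taken from the table above; the probabilities are made up.
candidates = [
    ("法案 議案", "bills and motions", 0.5, 0.5),
    ("居民", "or other premises shall be", 0.5, 0.5),
    ("居民", "residents in", 0.05, 0.5),
]
kept = [pair for pair in candidates if keep_pair(*pair)]
print(kept)
```

With these invented numbers, the second pair is dropped because one Chinese token is paired with five English tokens, and the third because one probability falls below the threshold. Suitable thresholds would have to be chosen by inspecting the actual data.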

I would appreciate it very much if you could answer my questions. Thank you.

Sincerely,
Sapphire
05.12

Lee Sapphire

May 13, 2015, 7:29:18 AM
to pialig...@googlegroups.com
Dear Professor Neubig,

This is Sapphire again. I would also like to know the meanings of the different kinds of brackets shown in an XXX.samp file.
For example, [ { < (((  ?

Thank you very much for your help.

Sincerely,
Sapphire
05.13


On Tuesday, May 12, 2015 at 10:32:54 PM UTC+8, Lee Sapphire wrote:

Graham Neubig

May 13, 2015, 8:37:12 AM
to pialig...@googlegroups.com
Hi Sapphire,

Sorry about the late reply.

Regarding the scores: Perhaps you are looking at the wrong file. Are you looking at the XXX.lex file? This is for the reordering probabilities. To get the translation probabilities you need to look at the XXX.pt file.
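
A quick sanity check supports the point that the quoted numbers are not the phrase-table features. In every row of the first post, the six values split into two groups of three that each sum to one, which is what two probability distributions would look like. The Python sketch below checks this on three of the quoted rows; what each triple means is not stated in this thread, and the function names are made up.

```python
import math


def split_triples(score_line):
    """Split a line of six space-separated scores into two triples."""
    values = [float(value) for value in score_line.split()]
    if len(values) != 6:
        raise ValueError(f"expected 6 scores, got {len(values)}")
    return values[:3], values[3:]


def looks_like_two_distributions(score_line, tol=1e-6):
    """True if each triple sums to 1 within the given tolerance."""
    first, second = split_triples(score_line)
    return (math.isclose(sum(first), 1.0, abs_tol=tol)
            and math.isclose(sum(second), 1.0, abs_tol=tol))


# Score lines copied from the first post in this thread.
rows = [
    "0.2 0.2 0.6 0.6 0.2 0.2",
    "0.142857142857143 0.142857142857143 0.714285714285714 "
    "0.714285714285714 0.142857142857143 0.142857142857143",
    "0.658181818181818 0.00363636363636364 0.338181818181818 "
    "0.992727272727273 0.00363636363636364 0.00363636363636364",
]
for row in rows:
    print(looks_like_two_distributions(row))
```

The tolerance is loose on purpose, since the quoted values are rounded to about 15 digits.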

About tagging with part of speech tags, maybe the results will be better and maybe they won't. It's worth trying though I guess!

About the brackets in the XXX.samp file:

Graham

--
You received this message because you are subscribed to the Google Groups "pialign-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pialign-user...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Lee Sapphire

May 13, 2015, 8:59:20 PM
to pialig...@googlegroups.com
I see. Thank you so much for your reply!



On Wed, May 13, 2015 at 20:37, Graham Neubig <neu...@is.naist.jp> wrote:


Lee Sapphire

May 13, 2015, 9:10:55 PM
to pialig...@googlegroups.com
Dear Professor Neubig,

I have one more question: is the maximum phrase length set to 7 word units? If so, will POS tags be counted as word units?
Can I manually increase this preset limit (with Perl commands, for example)?

Thank you.

Sincerely,
Sapphire
0514

Graham Neubig

May 21, 2015, 9:50:22 PM
to pialig...@googlegroups.com
Hi Sapphire,

Sorry about the late reply.
Yes, you can set this with the -printmax option for pialign.

Graham