Question about training

张沈鹏

Feb 13, 2009, 12:14:20 PM

to nlpb...@googlegroups.com

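I just stumbled on this great tool.
I happen to be working on spammer filtering these days.
My own home-grown word segmenter + home-grown tf-idf table + home-grown keyword extraction algorithm seem to work reasonably well,
but the whole thing feels rather unprofessional :)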

So I would like to try
Bamboo 1.1.0
but I ran into a few problems and would appreciate your advice.

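I ran
./auto_build -tseg
It downloads the corpus from the web,
but every time I train I seem to get this error:
./auto_build: line 29: read: read error: 0: Bad file descriptor
I looked at ../data/people-daily-bamboo-edition.txt
and it looks quite normal;
there is just one blank line: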
"""
团结/a 一致/a ，/w 扎实/ad 工作/v ，/w 奋勇/d 前进/v ，/w 一定/d 能够/v 创造/v
出/v 更加/d 辉煌/a 的/u 业绩/n ！/w

19980101-01-003-001/m 北京/ns 举行/v 新年/t 音乐会/n
"""

Also, following the demo program on the wiki, I wrote a cpp program,
but when it reaches
const char * text = "我爱北京天安门";
bamboo_setopt(handle, BAMBOO_OPTION_TEXT, const_cast<char *>(text));
it crashes:
zuroc@aragorn ~/nlpbamboo/bamboo-python $ g++ test.cpp -lbamboo -L/home/zuroc/lib
zuroc@aragorn ~/nlpbamboo/bamboo-python $ ./a.out
Segmentation fault
What could be the cause?

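Thanks for any advice :)
I did not install to the default path, but I have already changed the paths in the configuration and in CMake accordingly.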

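Also, are there any good statistics-based algorithms for detecting new words these days?
When I have time I would like to run the posts from those groups through one and mine new words for fun :)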

Appendix:
Error output

Segmentation Training:
Normalizing /home/zuroc/nlpbamboo/bamboo/bin/../data/people-daily-bamboo-edition.txt:
22721 items processed.
Building 1-gram Lexicon from /home/zuroc/nlpbamboo/bamboo/bin/../build/normalized.txt:
57210 items generated.
making index
57210 items processed.
making index
0 items processed.
making index
29 items processed.
making index
0 items processed.

1). Training CRF Segment Model. (may take dozens of hours)
*). Do nothing.
./auto_build: line 29: read: read error: 0: Bad file descriptor
Done.

--
张沈鹏
Software Engineer
Douban Inc.

det...@gmail.com

Feb 15, 2009, 2:44:15 AM

to NlpBamboo

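Judging from the description above, the Segmentation fault is most likely caused by the missing CRF model. The error at line 29 of auto_build is exactly the statement that asks whether to build the dictionary.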

I am not sure why that line fails. Could you post the output of the following commands?
readlink -f `which bash`
bash --version

Also, a quick fix is to replace the "read choice" on line 29 of auto_build with "choice=1", and then go straight on to build the CRF model.
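The cause of the read error is not established in this thread. As a sketch of a less invasive variant of the quick fix, assuming the failure comes from standard input being closed or otherwise unreadable when the script runs, the prompt could fall back to a default instead of being removed. ask_choice is a hypothetical helper; auto_build itself is only known to contain a plain "read choice".

```shell
# ask_choice DEFAULT (hypothetical helper, not taken from auto_build):
# prints the line read from stdin, or DEFAULT when stdin is closed,
# at end-of-file, or the line is empty.
ask_choice() {
    local default="$1" choice=""
    # read returns non-zero on EOF and on a bad file descriptor;
    # its error message goes to stderr, which is silenced here.
    if read -r choice 2>/dev/null && [ -n "$choice" ]; then
        printf '%s\n' "$choice"
    else
        printf '%s\n' "$default"
    fi
}

# Demo with an empty stdin, so the default is used.
choice=$(ask_choice 1 </dev/null)
echo "choice=$choice"    # prints: choice=1
```

In a script, "read choice" would become choice=$(ask_choice 1); an interactive user can still type another answer, while a run without a usable stdin proceeds with option 1.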

张沈鹏

Feb 23, 2009, 1:30:32 PM

to NlpBamboo

zuroc@aragorn ~ $ readlink -f `which bash`
readlink -f `which bash`
which bash
/bin/bash
echo -ne "\033]0;${USER}@${HOSTNAME%%.*}:${PWD/$HOME/~}\007"

zuroc@aragorn ~ $ bash -v

# /etc/bash/bashrc
#
# This file is sourced by all *interactive* bash shells on startup,
# including some apparently interactive shells such as scp and rcp
# that can't tolerate any output. So make sure this doesn't display
# anything or bad things will happen !

# Test for an interactive shell. There is no need to set anything
# past this point for scp and rcp, and it's important to refrain from
# outputting anything in those cases.
if [[ $- != *i* ]] ; then
# Shell is non-interactive. Be done now!
return
fi

# Bash won't get SIGWINCH if another process is in the foreground.
# Enable checkwinsize so that bash will check the terminal size when
# it regains control. #65623
# http://cnswww.cns.cwru.edu/~chet/bash/FAQ (E11)
shopt -s checkwinsize

# Enable history appending instead of overwriting. #139609
shopt -s histappend

# Change the window title of X terminals
case ${TERM} in
xterm*|rxvt*|Eterm|aterm|kterm|gnome*)
PROMPT_COMMAND='echo -ne "\033]0;${USER}@${HOSTNAME%%.*}:${PWD/$HOME/~}\007"'
;;
screen)
PROMPT_COMMAND='echo -ne "\033_${USER}@${HOSTNAME%%.*}:${PWD/$HOME/~}\033\\"'
;;
esac

use_color=false

# Set colorful PS1 only on colorful terminals.
# dircolors --print-database uses its own built-in database
# instead of using /etc/DIR_COLORS. Try to use the external file
# first to take advantage of user additions. Use internal bash
# globbing instead of external grep binary.
safe_term=${TERM//[^[:alnum:]]/?} # sanitize TERM
match_lhs=""
[[ -f ~/.dir_colors ]] && match_lhs="${match_lhs}$(<~/.dir_colors)"
[[ -f /etc/DIR_COLORS ]] && match_lhs="${match_lhs}$(</etc/DIR_COLORS)"
</etc/DIR_COLORS
[[ -z ${match_lhs} ]] \
&& type -P dircolors >/dev/null \
&& match_lhs=$(dircolors --print-database)
[[ $'\n'${match_lhs} == *$'\n'"TERM "${safe_term}* ]] &&
use_color=true

if ${use_color} ; then
# Enable colors for ls, etc. Prefer ~/.dir_colors #64489
if type -P dircolors >/dev/null ; then
if [[ -f ~/.dir_colors ]] ; then
eval $(dircolors -b ~/.dir_colors)
elif [[ -f /etc/DIR_COLORS ]] ; then
eval $(dircolors -b /etc/DIR_COLORS)
fi
fi

if [[ ${EUID} == 0 ]] ; then
PS1='\[\033[01;31m\]\h\[\033[01;34m\] \W \$\[\033[00m\] '
else
PS1='\[\033[01;32m\]\u@\h\[\033[01;34m\] \w \$\[\033[00m\] '
fi

alias ls='ls --color=auto'
alias grep='grep --colour=auto'
else
if [[ ${EUID} == 0 ]] ; then
# show root@ when we don't have colors
PS1='\u@\h \W \$ '
else
PS1='\u@\h \w \$ '
fi
fi
dircolors -b /etc/DIR_COLORS
LS_COLORS='no=00:fi=00:di=01;34:ln=01;36:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=01;05;37;41:mi=01;05;37;41:su=37;41:sg=30;43:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arj=01;31:*.taz=01;31:*.lzh=01;31:*.zip=01;31:*.z=01;31:*.Z=01;31:*.gz=01;31:*.bz2=01;31:*.bz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.rar=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.
7z=01;31:*.rz=01;31:*.jpg=01;35:*.jpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.mng=01;35:*.pcx=01;35:*.yuv=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.pdf=00;32:*.ps=00;32:*.txt=00;32:*.patch=00;32:*.diff=00;32:*.log=00;32:*.tex=00;32:*.doc=00;32:*.flac=01;35:*.mp3=01;35:*.mpc=00;36:*.ogg=00;36:*.wav=00;36:*.mid=00;36:*.midi=00;36:*.au=00;36:*.flac=00;36:*.aac=00;36:*.ra=01;36:*.mka=01;36:';
export LS_COLORS

张沈鹏

Feb 23, 2009, 1:35:46 PM

to NlpBamboo

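Sorry, I was busy with other things for the past few days and could not check the replies.

After changing it to choice=1, training works.

I do not know what other problems will come up.

Off to bed for now :)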

张沈鹏

Feb 23, 2009, 10:33:05 PM

to NlpBamboo

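OK, it runs now.
The example still has a problem, though; I will look into it tonight when I get home.

One more question:

crf_seg: CRF word segmentation
crf_pos
crf_ner_nr
crf_ner_ns: CRF place-name extraction
crf_ner_nt
keyword: topic keywords

What do the ones listed above without a description do?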

jianing yang

Feb 25, 2009, 12:45:16 AM

to nlpb...@googlegroups.com

crf_pos is Part of Speech (POS) tagging, crf_ner_nr is person-name extraction, and nt is organization-name extraction.


张沈鹏

Feb 25, 2009, 12:53:16 AM

to nlpb...@googlegroups.com

Training crf_pos is really slow, much slower than training the segmenter.

After more than ten hours of training it has only reached

iter=43 terr=0.12477 serr=0.86126 act=14486516 obj=517100.03458 diff=0.01695
iter=44 terr=0.11792 serr=0.85598 act=14486516 obj=500330.10035 diff=0.03243
iter=45 terr=0.11513 serr=0.85356 act=14486516 obj=491568.99435 diff=0.01751

............
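A note on reading these lines: the numbers posted are consistent with diff being the relative change of obj between consecutive iterations, that is |obj_prev - obj| / obj_prev. The sketch below recomputes it from the three lines quoted above. recompute_diff is a hypothetical helper, and the interpretation of the fields is inferred from the numbers, not taken from Bamboo's documentation; the threshold at which training stops is not stated in this thread.

```shell
# recompute_diff (hypothetical helper): reads "iter=... obj=..." lines on
# stdin and prints the relative change of obj between consecutive iterations.
recompute_diff() {
    awk '
        {
            obj = ""
            for (i = 1; i <= NF; i++) {
                split($i, kv, "=")
                if (kv[1] == "iter") iter = kv[2]
                if (kv[1] == "obj")  obj  = kv[2]
            }
            if (obj == "") next          # skip lines without an obj field
            if (have_prev) {
                d = (prev - obj) / prev
                if (d < 0) d = -d
                printf "iter=%d diff=%.5f\n", iter, d
            }
            prev = obj; have_prev = 1
        }
    '
}

recompute_diff <<'EOF'
iter=43 terr=0.12477 serr=0.86126 act=14486516 obj=517100.03458 diff=0.01695
iter=44 terr=0.11792 serr=0.85598 act=14486516 obj=500330.10035 diff=0.03243
iter=45 terr=0.11513 serr=0.85356 act=14486516 obj=491568.99435 diff=0.01751
EOF
# prints:
# iter=44 diff=0.03243
# iter=45 diff=0.01751
```

With obj still changing by one to three percent per iteration after 45 iterations, the run is presumably far from converged, which fits the observation that it is slow.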

jianing yang

Feb 25, 2009, 12:55:51 AM

to nlpb...@googlegroups.com

There are too many features... For POS, ME (maximum entropy) would actually do better in terms of performance.
