Issue 18 in cjklib: CEDICT reading problem

cjk...@googlecode.com

Oct 3, 2012, 5:37:59 AM

to cjklib...@googlegroups.com

Status: New
Owner: ----
Labels: Type-Defect Priority-Medium

New issue 18 by caj...@gmail.com: CEDICT reading problem
http://code.google.com/p/cjklib/issues/detail?id=18

What steps will reproduce the problem?
1. import cjklib.dictionary
2. d = cjklib.dictionary.CEDICT(databaseUrl='sqlite:////path/to/your/cedict.db')
3. d.getAll()

The call in step 3 should return all entries in the CEDICT database.
Instead, an AttributeError is raised while formatting this record:
卡拉ＯＫ|卡拉ＯＫ|ka3 la1 O K|/karaoke (loanword)/

The problem is that the reading is not standard Pinyin.
SingleColumnAdapter.format therefore returns None, and
NonReadingEntityWhitespace.format then raises the exception when it tries
to call the split method on that None value.
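
The failure chain described above can be reproduced without cjklib. The
following is only a minimal sketch: every name in it (ConversionError,
convert_reading, format_reading, format_whitespace, format_entry) is a
hypothetical stand-in, not the actual cjklib code, and the "syllable must
end in a tone digit" rule is an assumption made for illustration.

```python
class ConversionError(Exception):
    """Stand-in for a reading conversion error (hypothetical)."""


def convert_reading(reading):
    # Assumed rule: every syllable must end in a tone digit.
    for syllable in reading.split():
        if not syllable[-1].isdigit():
            raise ConversionError(syllable)
    return reading


def format_reading(reading):
    # Mirrors the reported behaviour: a failed conversion yields None.
    try:
        return convert_reading(reading)
    except ConversionError:
        return None


def format_whitespace(reading):
    # The next formatter assumes it always receives a string.
    return ''.join(reading.split(' '))


def format_entry(reading):
    # Formatters are chained, so a None from the first reaches the second.
    return format_whitespace(format_reading(reading))
```

With these stand-ins, format_entry('ka3 la1') works, while
format_entry('ka3 la1 O K') fails with an AttributeError because split is
called on None.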

Problem exists in SVN trunk version (Rev: 446). I am using Ubuntu Linux
11.04.1 LTS

I suggest either fixing such records in the installcjkdict script, or fixing
the formatter in the dictionary module so that it can handle such records.
My hotfix:

(line 126):
    def format(self, string):
        toReading = self.toReading or self.fromReading
        try:
            return self._readingFactory.convert(string, self.fromReading,
                toReading, sourceOptions=self.sourceOptions,
                targetOptions=self.targetOptions)
        except (exception.DecompositionError, exception.CompositionError,
                exception.ConversionError):
            # wighack
            return string
            #return None

cjk...@googlecode.com

Oct 3, 2012, 1:57:45 PM

to cjklib...@googlegroups.com

Updates:
Status: Accepted

Comment #1 on issue 18 by christop...@gmail.com: CEDICT reading problem
http://code.google.com/p/cjklib/issues/detail?id=18

It seems that the PinyinOperator thinks that 'O' is an entity in Pinyin
and complains that no tonal information is available. This leads to a
conversion error, which results in a None value.

Your fix would be an improvement, but really we should be fixing the
conversion.

What I tried was telling the reading conversion to ignore the "invalid"
characters, which should be possible by adding 'missingToneMark': 'ignore'
to the converter settings. However, this breaks another part of the
software, because two different code paths share the same reading converter
instance. More precisely, the "search by reading" component
(TonelessWildcardReading) needs a reading conversion that supports missing
tones, which is exactly the behaviour the setting above would change by
ignoring syllables without tone marks. The solution would be to separate
the two paths, but that needs a bit more time.
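
As a rough illustration of separating the two paths, each path could own a
converter instance with its own policy for syllables that lack a tone. This
is a sketch under assumptions, not cjklib's API: the ReadingConverter class,
its missing_tone parameter and the policy names 'error', 'ignore' and
'noinfo' are hypothetical stand-ins.

```python
class ReadingConverter:
    """Hypothetical converter for numbered Pinyin (not cjklib's API)."""

    POLICIES = ('error', 'ignore', 'noinfo')

    def __init__(self, missing_tone='error'):
        if missing_tone not in self.POLICIES:
            raise ValueError('unknown policy: %r' % (missing_tone,))
        self.missing_tone = missing_tone

    def convert(self, reading):
        result = []
        for syllable in reading.split():
            if syllable[-1].isdigit():
                # Regular case: split off the trailing tone digit.
                result.append((syllable[:-1].lower(), int(syllable[-1])))
            elif self.missing_tone == 'ignore':
                # Display path: pass non-Pinyin text through untouched.
                result.append(syllable)
            elif self.missing_tone == 'noinfo':
                # Search path: accept the syllable, tone unknown.
                result.append((syllable.lower(), None))
            else:
                raise ValueError('missing tone: %r' % (syllable,))
        return result


# One instance per code path, so neither setting leaks into the other.
display_converter = ReadingConverter(missing_tone='ignore')
search_converter = ReadingConverter(missing_tone='noinfo')
```

Here display_converter leaves 'O' and 'K' alone when given 'ka3 la1 O K',
while search_converter still accepts a toneless query such as 'ka la'.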

Will keep that on my radar. Feel free to have a go at this yourself.
