Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

RFR (Request for Review): Some NLP processing

3 views
Skip to first unread message

Emre Sevinc

unread,
Oct 25, 2006, 10:11:37 AM10/25/06
to

For those Lisp masters who have time and patience to review the code below,

This is a simple and not yet complete Turkish NLP code for trying to convert
XML-like sentence structures (taken from Turkish Treebank files) into some s-exp
structure and then do some syntactical processing. Currently the focus of the
code is on converting an active sentence into passive voice, e.g.
"I have seen the blue book." --> the blue book has been seen.

The XML data I have used can be seen at

http://ileriseviye.org/cogsci/ddd/deneme1.xml
http://ileriseviye.org/cogsci/ddd/deneme2.xml
.
.
.
http://ileriseviye.org/cogsci/ddd/deneme9.xml

Since I still consider myself a novice, I expect code reviews, basically if there
are some more elegant ways for some of the data processing stuff I have done, some
more idiomatic ways that I have missed, code redundancy, bugs, etc.

I naturally don't expect any criticism about the NLP functionality
itself since that stuff is about Turkish language, however if there are
Turkish Lispers out there who can comment on that too are welcome.

(Before you review, yes I know I could have used cl-ppcre
and avoid writing some redundant and error-prone code.
That and why I have used Allegro CL 8.0 Express Edition is
another story.)

The main testing loop can be seen at the very end of the
code listing. It works partially for some sentence structures
and fails for lots of others, I'm still trying to improve it.

Here's the code:

(require :regexp2)

(defparameter vowels '(#\a #\e #\i #\ı #\ö #\o #\ü #\u))

(defun vowel-p (character)
"Returns the vowels list if character is a vowel, otherwise NIL"
(member character vowels))

(defun add-suffix (root &rest suffix-categories)
"Adds the required suffixes to the given root word."
(let ((word root))
(loop for suffix-category in suffix-categories do
(case suffix-category
(:passive
(setf word (add-passive word)))
(:past
(setf word (add-past word)))
(:future
(setf word (add-future word)))
(:present-progressive
(setf word (add-present-progressive word)))
(:negative
(setf word (add-negative word)))
(:ability
(setf word (add-ability word)))
(:present
(setf word (add-present word)))
(t
(format t "what the heck..."))
(otherwise
(format t "what do you mean otherwise...")))
finally (return word))))

(defun ends-with-vowel (word)
"Returns true if the word ends with a vowel."
(vowel-p (char word (- (length word) 1))))

(defun last-vowel (root)
"Finds the last vowel of the root word.
The root word is assumed not to end with a vowel"
(loop for char across (reverse root) do
(if (vowel-p char) (return char))))

(defun add-passive (root)
"Adds the passive suffix for the given root word."
(cond ((ends-with-vowel root)
(concatenate 'string root "n"))
((string-equal root "öl" :start1 (- (length root) 2))
(concatenate 'string root "ün"))
((string-equal root "ül" :start1 (- (length root) 2))
(concatenate 'string root "ün"))
((string-equal root "il" :start1 (- (length root) 2))
(concatenate 'string root "in"))
((string-equal root "ul" :start1 (- (length root) 2))
(concatenate 'string root "un"))
((string-equal root "el" :start1 (- (length root) 2))
(concatenate 'string root "in"))
((string-equal root "et" :start1 (- (length root) 2))
(concatenate 'string (subseq root 0 (- (length root) 1)) "dil"))
((member (last-vowel root) '(#\a #\ı))
(concatenate 'string root "ıl"))
((member (last-vowel root) '(#\e #\i))
(concatenate 'string root "il"))
((member (last-vowel root) '(#\u #\o))
(concatenate 'string root "ul"))
((member (last-vowel root) '(#\ü #\ö))
(concatenate 'string root "ül"))
(t
nil)))

(defun add-past (root)
"Adds the past suffix to the given root word."
(let ((lv (last-vowel root)))
(cond ((member (char root (1- (length root))) '(#\k #\ş))
(cond ((member lv '(#\a #\ı))
(concatenate 'string root "tı"))
((member lv '(#\e #\i))
(concatenate 'string root "ti"))
((member lv '(#\u #\o))
(concatenate 'string root "tu"))
((member lv '(#\ü #\ö))
(concatenate 'string root "tü"))))
((member lv '(#\a #\ı))
(concatenate 'string root "dı"))
((member lv '(#\e #\i))
(concatenate 'string root "di"))
((member lv '(#\u #\o))
(concatenate 'string root "du"))
((member lv '(#\ü #\ö))
(concatenate 'string root "dü"))
(t
nil))))

(defun add-future (root)
"Adds the past suffix to the given root word."
(let ((lv (last-vowel root)))
(cond ((member lv '(#\a #\ı))
(concatenate 'string root "acak"))
((member lv '(#\e #\i))
(concatenate 'string root "ecek"))
((member lv '(#\u #\o))
(concatenate 'string root "acak"))
((member lv '(#\ü #\ö))
(concatenate 'string root "ecek"))
(t
nil))))

(defun add-present-progressive (root)
"Adds the present progressive suffix to the given root word.
TODO: Does not account for all of the letter conditions, FIX THIS."
(let ((lv (last-vowel root)))
(cond ((member lv '(#\a #\ı))
(concatenate 'string root "ıyor"))
((member lv '(#\e #\i))
(concatenate 'string root "iyor"))
((member lv '(#\u #\o))
(concatenate 'string root "uyor"))
((member lv '(#\ü #\ö))
(concatenate 'string root "üyor"))
(t
nil))))

(defun add-negative (root)
"Adds the suitable negation suffix to the given root verb."
(let ((lv (last-vowel root)))
(cond ((member lv '(#\a #\ı #\u #\o))
(concatenate 'string root "ma"))
((member lv '(#\e #\i #\ö #\ü))
(concatenate 'string root "me"))
(t
nil))))

(defun add-ability (root)
"Adds the suitable ability suffix to the given root verb."
;;; Certainly not complete at all.
(let ((lv (last-vowel root)))
(cond ((member lv '(#\a #\ı #\u #\o))
(concatenate 'string root "abil"))
((member lv '(#\e #\i #\ö #\ü))
(concatenate 'string root "ebil"))
(t
nil))))

(defun add-present (root)
"Adds the suitable present tense suffix to the given root verb."
(let ((lv (last-vowel root)))
(cond ((member lv '(#\a #\ı #\u #\o))
(concatenate 'string root "ır"))
((member lv '(#\e #\i #\ö #\ü))
(concatenate 'string root "ir"))
(t
nil))))

;;;
;;; some very preliminary and basic tests for adding suffixes
;;;
(defparameter *suffix-test-data* '("oku" "koş" "bil" "tap" "yap" "gör" "toplan"
"oyna" "dur" "getir" "yaz" "seyret" "böl" "sor"
"vur" "saklan" "dolan" "sağ" "kız" "çık" "yapıştır"
"programla" "del" "sızdır" "bahset" "söyle"))

(loop for word in *suffix-test-data* do
(print (add-suffix word :passive :future :past)))

(loop for word in *suffix-test-data* do
(print (add-suffix word :passive :negative :present-progressive :past)))


(loop for word in *suffix-test-data* do
(print (add-suffix word :passive :negative :past)))


(loop for word in *suffix-test-data* do
(print (add-suffix word :passive :ability :present)))


(defun all-matches (regex string)
;; Couldn't figure out why a NIL comes at the
;; end of the start-list, currently handled by removing the last element of start-list
(let ((start-list '()))
(do ((start
(multiple-value-bind (result pos-list)
(match-re regex string :return :index :start 0)
(car pos-list))
(multiple-value-bind (result pos-list)
(match-re regex string :return :index :start start)
(if (null pos-list) (return (subseq start-list 0 (1- (length start-list))))
(1+ (cdr pos-list))))))
((> start (length string)) start-list)
(setf start-list (append start-list (list start))))))

(defun all-matches-as-strings (regex string)
(let ((start-list (all-matches regex string))
(string-list '()))
(loop for i in start-list
collect (multiple-value-bind (r m) (match-re regex string :start i)
m))))

(defun parse-ig (line)
"Parses the IG structure of a given line from a Turkish Treebank .xml file."
(let ((result (multiple-value-bind (r m)
(match-re "IG='\\[(\\([0-9]+,\"(\\w)+(\\+\\w+)+\"\\))+\\]'"
line)
m)))
(if result
(loop for item in (all-matches-as-strings "\\w+(\\+\\w+)+" result)
collect (split-re "\\+" (string item)))
;; this may be punctuation so try this
(let ((result (multiple-value-bind (r m)
(match-re "IG='\\[(\\([0-9]+,\"(\\w|\\.|\\,|\\;|\\!|(\\.\\.\\.))+(\\+\\w+)+\"\\))+\\]'"
line)
m)))
(if result
(loop for item in (all-matches-as-strings "(\\.|\\,|\\;|\\!|(\\.\\.\\.))(\\+\\w+)+" result)
collect (split-re "\\+" (string item))))))))


(defun parse-ig-final (ig-list)
"Convert the ig-list into final required s-exp format."
(let ((result '()))
(if (not (null ig-list))
(progn
(setf result (append result (list (caar ig-list))))
(setf result (append result (list (cdar ig-list))))
(setf result (append result
(loop for i in (cdr ig-list) collect i)))))))

(defun parse-surface-form (line)
"Grabs the surface form of the given line and creates a list."
(let ((s (multiple-value-bind (r m)
(match-re ">\\s+(\\w+|(\\.|\\,|\\;|\\!|(\\.\\.\\.)))\\s+<" line)
m)))
(if s
(progn
(setf s (subseq s 2 (- (length s) 2)))
(list :surface-form s)))))


(defun parse-rel (line)
"Parses the REL part of the line from one Turkish Treebank DB XML file."
(let ((m (multiple-value-bind (r m)
(match-re "REL=\"\\[(\\d\\,)+\\(\\w+(\\.\\w+)?\\)\\]" line)
m)))
(if m
(let ((m (multiple-value-bind (r m)
(match-re "(\\d\\,)+\\(\\w+(\\.\\w+)?\\)" m)
m)))
(split-re "\\," m)))))


(defun parse-sentence (file)
"Tries to return a very simple s-exp based structure for
the given Turkish Treebank XML-like file."
(let ((sentence '()))
(with-open-file (stream file)
(do ((line (read-line stream nil)
(read-line stream nil)))
((null line))
(progn
(if (parse-ig-final (parse-ig line))
(setf sentence (append sentence
(list
(list (parse-ig-final (parse-ig line))
(parse-rel line)
(parse-surface-form line)))))))))
sentence))


(defun print-sentence (sentence)
"Prints the surface form of the sentence, given the s-exp sentence list."
;;;
;;; I had to do type checking for listp because the (rplacd ...) in make-passive
;;; creates (:surface-form . "...") instead (:surface-form "...")
;;;
(loop for i in sentence
do (if (not (listp (cdaddr i)))
(format t "~a " (cdaddr i))
(format t "~a " (car (cdaddr i))))))


(defun add-suffix-list-to-root (root suffix-list)
"Adds the required suffixes to the root and computes the surface form."
(let ((word root))
(loop for suffix in suffix-list
do (progn
(cond ((string-equal suffix "passive")
(setf word (add-suffix word :passive)))
((string-equal suffix "Neg")
(setf word (add-suffix word :negative)))
((string-equal suffix "Past")
(setf word (add-suffix word :past)))
((string-equal suffix "Able")
(setf word (add-suffix word :ability)))
((string-equal suffix "Aor")
(setf word (add-suffix word :present)))
((string-equal suffix "Prog1")
(setf word (add-suffix word :present-progressive)))
(t
'nil))))
word))

(defun make-passive (sentence)
"Tries to convert an active sentence into a passive sentence."
;;; Rule 1: This is the simplest ACTIVE -> PASSIVE rule.
;;; Try to find the last verb.
;;; Remove any person suffix from the verb. This means removing the last element.
;;; Add PASSIVE to the verb.
;;; Find the direct object of the verb.
;;; Remove the ACC from the OBJECT the verb affects.
;;; Remove the SUBJECT of the VERB.
;;; Return the sentence.

;;
;; Find and modify the verb. Return the index of the verb.
;; Does not handle many cases.
;;
(let ((index-of-verb 1))
(loop for i in sentence
do (if (equal "Verb" (caar (last (cdar i))))
(progn
;;; (print (car (last i)))
;;; (print (caar i))
;;; (print (append '("passive")
;;; (subseq (car (last (cdar i)))
;;; 1 (1- (length (car (last (cdar i))))))))

(rplacd (car (last i))
(add-suffix-list-to-root (caar i)
(append '(passive)
(subseq (car (last (cdar i)))
1 (1- (length (car (last (cdar i)))))))))

;;(rplaca (car i) (add-suffix (caar i) :passive :negative :past))
(return))
(incf index-of-verb)))

;; Find if there is any SUBJECT RELated to the VERB found above and remove it.
(let ((index-of-subject 0) (my-index 1))
(loop for i in sentence
do (progn
(if (caadr i)
(progn
(if (and
(= (parse-integer (caadr i)) index-of-verb)
(string-equal (car (last (cadr i))) "(SUBJECT)"))
(progn
(setf index-of-subject my-index)
(return)))))
(incf my-index)))

;; if a SUBJECT is found then rebuild the SENTENCE list
;; by skipping the SUBJECT
(if (not (= 0 index-of-subject))
(setf sentence
(loop
for i in sentence
for my-index = 1 then (1+ my-index)
unless (= my-index index-of-subject)
collect i)))))

sentence)

;;; Testing the make-passive for different sentences
(loop for i in '("1" "2" "3" "4" "5" "6" "7" "8" "9")
do (progn
(format t "~%")
(format t "~%")
(print-sentence
(parse-sentence (concatenate 'string "~/cogsci/ddd/Turkish Treebank/deneme" i ".xml")))
(format t "~%")
(print-sentence
(make-passive
(parse-sentence
(concatenate 'string "~/cogsci/ddd/Turkish Treebank/deneme" i ".xml"))))))


Regards,

--
Emre Sevinc

eMBA Software Developer Actively engaged in:
http://emba.bilgi.edu.tr http://ileriseviye.org
http://www.bilgi.edu.tr http://fazlamesai.net
Cognitive Science Student http://cazci.com
http://www.cogsci.boun.edu.tr

0 new messages