newbie wants help: Splitting delimited lines

Robert L.

Sep 17, 2017, 1:56:49 PM

> research on chat and email within a system. The log files are character
> delimited, one message per line, but the last field on the line is
> "dirty" with unescaped delimiters. So a typical line might look
> something like this (not an actual line):
>
> userID^username^gender^site^messageDate^world^messagetext
> 2706^user^m^center^2004-03-01 09:21^chatWorld^Dirty text with ^caret.
>
> What I want is a generalizable function that allows me to do something
> like this:
>
> ;;----
> ;;;the field list
> (defparameter *field-list* '("userID"
>                              "username"
>                              "gender"

People are of the male sex or of the female sex.
Only words have gender.

>                              "site"
>                              "messageDate"
>                              "world"
>                              "messageText")
>   "the field list for splitting and identifying fields")
>
> (setf record (split-line #\^ line *field-list*))
> (get-field "userID" record)
> (get-field "gender" record)
>
> ;;---------
>
> Because I'm dealing with different log file formats, I really want to be
> able to reference field values by name rather than remember that the
> message text is (nth 4 line) in one format, and (nth 6 line) in another
> format.
>
> ;;---------
> ;;This function splits the line into count number of fields
> ;;tacking on the last field as a possibly "dirty" remainder.
> (defun pythonic-split (split-char line count)
>   "split the line using split-char to produce a maximum of count fields"
>   (multiple-value-bind (result-list place)
>       ;;subtract one from count so that you can pass the total number
>       ;;of desired fields
>       (split-sequence:split-sequence split-char line :count (- count 1))

Standard CL does not include "split-sequence"; it is a third-party library.
That code will not work under SBCL unless that library is loaded first.

>     (append result-list `(,(subseq line place)))))
>
> (defun split-line (split-char line labels)
>   "split a line into an alist of length count with labels"
>   ;;a solution for matching fields to labels, use the length of the
>   ;;labels list to get the count.
>   (pairlis labels (pythonic-split split-char line (length labels))))

(require srfi/13) ; string-tokenize
(require srfi/14) ; char sets

(define (split-line sep-char line labels)
  (define parts
    (string-tokenize line (char-set-complement (char-set sep-char))))
  (define n-clean (- (length labels) 1))
  (define fields
    (append (take parts n-clean)
            (list (string-join (drop parts n-clean) (string sep-char)))))
  (map cons labels fields))

(split-line #\^
            "2706^user^m^center^2004-03-01 09:21^chatWorld^Dirty text with ^caret."
            '("userID" "username" "sex" "site" "messageDate" "world" "messageText"))

===>
(("userID" . "2706")
 ("username" . "user")
 ("sex" . "m")
 ("site" . "center")
 ("messageDate" . "2004-03-01 09:21")
 ("world" . "chatWorld")
 ("messageText" . "Dirty text with ^caret."))
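
One caveat, offered as a sketch rather than a tested fix: string-tokenize
returns only maximal runs of non-separator characters, so an empty field
(two adjacent separators) disappears and the later fields shift left, and a
doubled caret inside the message text comes back as a single one. The
variant below assumes a #lang racket module and uses string-split with
#:trim? #f, which keeps empty strings. The names split-line/keep-empty and
get-field are made up for illustration; get-field is only a guess at the
accessor asked for in the quoted post.

```racket
#lang racket
;; Sketch only: split-line/keep-empty and get-field are illustrative
;; names, not part of the code posted above.

(define (split-line/keep-empty sep-char line labels)
  (define sep (string sep-char))
  ;; #:trim? #f keeps empty strings, so "a^^b" gives '("a" "" "b")
  ;; rather than '("a" "b").
  (define parts (string-split line sep #:trim? #f))
  (define n-clean (- (length labels) 1))
  ;; Assumes the line holds at least n-clean fields; take raises an
  ;; error on shorter lines.
  (define fields
    (append (take parts n-clean)
            (list (string-join (drop parts n-clean) sep))))
  (map cons labels fields))

;; Look a field up by name. assoc compares keys with equal?, so string
;; labels work. Returns #f when the name is unknown.
(define (get-field name record)
  (define pair (assoc name record))
  (and pair (cdr pair)))

;; Made-up sample line with an empty username and a doubled caret.
(define record
  (split-line/keep-empty #\^
                         "1^^m^text with ^^ carets"
                         '("userID" "username" "sex" "messageText")))

(get-field "username" record)     ; the empty string, nothing shifts
(get-field "messageText" record)  ; doubled caret is preserved
```

Returning #f for an unknown field name is a design choice made here for
brevity; raising an error may suit log processing better, since a
misspelled label would otherwise pass silently.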

