What I'd like to do is take a string like "I am a string." and return the words and punctuation as multiple values such as:
(my-read-from-string "I am a string.") => "I" "am" "a" "string" "."
Or have it build a list. Which ever is more efficient: (my-read-from-string "I am a string.") => ("I" "am" "a" "string" ".")
`read-from-string' uppercases everything, and returns `STRING.' as the last word. I've been trying at this for awhile now and it seems that the only function that can do what I want would have to understand the list of alphabetic characters a-z and A-Z and use `subseq'. Is there any built in function that almost does what I want, or any pointers about how to do this most efficiently, and what to name it when it's done? The iterative `subseq' version of this function would be enormous if I got it working right and make a lot of garbage.
-----------== Posted via Deja News, The Discussion Network ==---------- http://www.dejanews.com/ Search, Read, Discuss, or Start Your Own
<robert_m_mi...@yahoo.com> wrote: >What I'd like to do is take a string like "I am a string." and return the >words and punctuation as multiple values such as:
>(my-read-from-string "I am a string.") >=> "I" > "am" > "a" > "string" > "."
>Or have it build a list. Which ever is more efficient: >(my-read-from-string "I am a string.") >=> ("I" "am" "a" "string" ".")
I would imagine that you'd have to build a list ( or some sequence ) to do get to the point where you could do
(funcall #'values seq-of-values )
anyway. Since the number of values varies here. So the latter seems the better way to go.
>`read-from-string' uppercases everything, and returns `STRING.' as the last >word.
Not only that, it returns symbols, not strings.
I'm afraid you'll need to write you own lexer. (i.e. walk down the string testing to see word breaks (non alpha-char-p )
>function that almost does what I want, or any pointers about how to do this >most efficiently, and what to name it when it's done? > The iterative `subseq' >version of this function would be enormous if I got it working right and make >a lot of garbage.
subseq takes a "start" and "end" argument. If sequentially called with the proper pair of indices this should create no garbage. Just the strings you want. In this case, the lexer's "job" is to find these positions.
012345678901234 "I am a string."
USER(1): (subseq "I am a string." 0 1 ) "I" USER(2): (subseq "I am a string." 2 4 ) "am" USER(3): (subseq "I am a string." 5 6 ) "a" USER(4): (subseq "I am a string." 7 13 ) "string" USER(5): (subseq "I am a string." 13 14 ) "."
[ This is probably an O(n^2) process since subseq must skip the ever lenghtening prefix each time.... if subseq is naive about the sequence in question. Probably not worth worrying about. ]
If you're not so concerned with garbage then string-trim ( string-trim-left) could help tokenize the given string.
--
Lyman S. Taylor "I'm a Doctor! Not a commando." (ly...@cc.gatech.edu) The enhanced EMH Doctor in a ST:Voyager epidsode.
> What I'd like to do is take a string like "I am a string." and return the > words and punctuation as multiple values such as:
> (my-read-from-string "I am a string.") > => "I" > "am" > "a" > "string" > "."
> Or have it build a list. Which ever is more efficient: > (my-read-from-string "I am a string.") > => ("I" "am" "a" "string" ".")
> `read-from-string' uppercases everything, and returns `STRING.' as the last > word. I've been trying at this for awhile now and it seems that the only > function that can do what I want would have to understand the list of > alphabetic characters a-z and A-Z and use `subseq'. Is there any built in > function that almost does what I want, or any pointers about how to do this > most efficiently, and what to name it when it's done? The iterative `subseq' > version of this function would be enormous if I got it working right and make > a lot of garbage.
> -----------== Posted via Deja News, The Discussion Network ==---------- > http://www.dejanews.com/ Search, Read, Discuss, or Start Your Own
Actually, the subseq version may not be as bad as you think. Here are a couple of versions I hacked together; one uses displaced arrays and the other uses subseq. WRT memory allocation, the displaced arrays version wins for long words, while the subseq version wins for short words.
In MCL 4.2, the boundary seems to be at a "word" size of 20 characters. The actual allocation has to do with granularity of memory requests and the relative sizes of simple vs non-simple array headers.
So if I was going to use MCL 4.2 to chunk English text and was concerned about total memory allocation, the subseq version would save me about 8 bytes per word.
OTOH, if I was more concerned about stressing the GC, I'd probably preallocate a bunch of displaced arrays, and change the code slightly to use them as resources (rather than creating them for each use and then discarding them).
;; Tests of subseq version
? (let ((string "T ")) (time (read-short-words-from-string string))) (READ-SHORT-WORDS-FROM-STRING STRING) took 0 milliseconds (0.000 seconds) to run. 16 bytes of memory allocated. ("T")
? (let ((string "ThisThat ")) (time (read-short-words-from-string string))) (READ-SHORT-WORDS-FROM-STRING STRING) took 0 milliseconds (0.000 seconds) to run. 24 bytes of memory allocated. ("ThisThat")
? (let ((string "ThisThatThisThatThis ")) (time (read-short-words-from-string string))) (READ-SHORT-WORDS-FROM-STRING STRING) took 0 milliseconds (0.000 seconds) to run. 32 bytes of memory allocated. ("ThisThatThisThatThis")
? (let ((string "ThisThatThisThatThisX ")) (time (read-short-words-from-string string))) (READ-SHORT-WORDS-FROM-STRING STRING) took 0 milliseconds (0.000 seconds) to run. 40 bytes of memory allocated. ("ThisThatThisThatThisX")
;; Tests of displaced arrays version
? (let ((string "ThisThatThisThatThisX ")) (time (read-long-words-from-string string))) (READ-LONG-WORDS-FROM-STRING STRING) took 0 milliseconds (0.000 seconds) to run. 32 bytes of memory allocated. ("ThisThatThisThatThisX")
? (let ((string "ThisThatThisThatThisThisThatThisThatThisThisThatThisThat")) (time (read-long-words-from-string string))) (READ-LONG-WORDS-FROM-STRING STRING) took 0 milliseconds (0.000 seconds) to run. 32 bytes of memory allocated. ("ThisThatThisThatThisThisThatThisThatThisThisThatThisThat")
;; Note: These only recognize words composed of alpha characters, and ;; require that the last word be followed by a non-alpha character. ;; Marking the "word" boundaries properly is left as an exercise. <g>
Taylor) wrote: > In article <770c57$vr...@nnrp1.dejanews.com>, > <robert_m_mi...@yahoo.com> wrote: >>What I'd like to do is take a string like "I am a string." and return the >>words and punctuation as multiple values such as:
>>(my-read-from-string "I am a string.") >>=> "I" >> "am" >> "a" >> "string" >> "."
>>Or have it build a list. Which ever is more efficient: >>(my-read-from-string "I am a string.") >>=> ("I" "am" "a" "string" ".")
> I would imagine that you'd have to build a list ( or some sequence ) > to do get to the point where you could do
> (funcall #'values seq-of-values )
> anyway. Since the number of values varies here. > So the latter seems the better way to go.
Also, with the funcall form you can run into CALL-ARGUMENTS-LIMIT, which can be as low as 50.
[snip]
>> The iterative `subseq' >>version of this function would be enormous if I got it working right and make >>a lot of garbage.
> subseq takes a "start" and "end" argument. If sequentially called with > the proper pair of indices this should create no garbage. Just the > strings you want. In this case, the lexer's "job" is to find these > positions.
"subseq creates a sequence that is a copy of the subsequence of sequence bounded by start and end." (CLHS)
Therefore, subseq allocates additional memory (which may eventually turn into garbage).
[snip]
> [ This is probably an O(n^2) process since subseq must skip > the ever lenghtening prefix each time.... if subseq is > naive about the sequence in question. Probably not worth > worrying about. ]
Not if the argument is a string; then subseq can index directly to the character of interest in constant time.
In article <770gc1$...@pravda.cc.gatech.edu>, Lyman S. Taylor <ly...@cc.gatech.edu> wrote:
> I would imagine that you'd have to build a list ( or some sequence ) > to do get to the point where you could do
> (funcall #'values seq-of-values )
> anyway. Since the number of values varies here. > So the latter seems the better way to go.
I think you meant:
(apply #'values seq-of-values)
which is equivalent to:
(values-list seq-of-values)
-- Barry Margolin, bar...@bbnplanet.com GTE Internetworking, Powered by BBN, Burlington, MA *** DON'T SEND TECHNICAL QUESTIONS DIRECTLY TO ME, post them to newsgroups. Don't bother cc'ing followups to me.
In article <PUQk2.5258$W_.3447...@news1.teleport.com>, David B. Lamkins <dlamk...@teleport.com> wrote:
>In article <770gc1$...@pravda.cc.gatech.edu> , ly...@cc.gatech.edu (Lyman S. >Taylor) wrote: ... >> subseq takes a "start" and "end" argument. If sequentially called
>"subseq creates a sequence that is a copy of the subsequence of sequence >bounded by start and end." (CLHS)
>Therefore, subseq allocates additional memory (which may eventually turn >into garbage).
I was operating under the presumption that result consisted of individual strings. Not aliases to portions of the original. And were "useful", so not "garbage". [ In constrast to something like (subseq (subseq "A string" 2 8) 0 3 )
which does create some intermediate garabge. ]
If you later wished to flip these strings into symbols you'd need to (or should ) make a copy of the strings anyway.
Additionally if the "source string" is some reusable buffer then you can just refill the next string. As opposed to creating a new string for "line". [ e.g., using read-sequence to fill/refill the buffer in constrast to repeatively calling read-line. ]
Most everything eventually turns into garbage. Even the aliases.
Also, if your "source" string consists of lots of "white space" then if you let it go and just keep the tokens you have a net space reduction.
--
Lyman S. Taylor "Twinkie Cream; food of the Gods" (ly...@cc.gatech.edu) Jarod, "The Pretender"
"David B. Lamkins" <dlamk...@teleport.com> writes:
> "subseq creates a sequence that is a copy of the subsequence of sequence > bounded by start and end." (CLHS)
> Therefore, subseq allocates additional memory (which may eventually turn > into garbage).
I sometimes use displaced arrays where the space usage of subseq gets critical:
(defun substr (string from &optional (to (length string))) (make-array (- to from) :element-type 'character :displaced-to string :displaced-index-offset from))
However, with ACL 5, the initial overhead of this function compared to subseq seems to be 32 bytes, which means, I guess, that you'll need to have substrings longer than than 32 characters to make any sensible use of it...
robert_m_mi...@yahoo.com writes: > What I'd like to do is take a string like "I am a string." and return the > words and punctuation as multiple values such as:
> (my-read-from-string "I am a string.") > => "I" > "am" > "a" > "string" > "."
> Or have it build a list. Which ever is more efficient: > (my-read-from-string "I am a string.") > => ("I" "am" "a" "string" ".")
Here is a simplistic tokenization function most Perl users will find vaguely familiar. It currently will only tokenize along a given set of whitespace (which by definition will be elided from the result), so you will have to change it a bit for your punctuation characters, along which you'll split, but which should be _kept_ in the result sequence:
(defun split (string &optional max (ws '(#\Space #\Tab))) "Split `string' along whitespace as defined by the sequence `ws'. The whitespace is elided from the result. The whole string will be split, unless `max' is a non-negative integer, in which case the string will be split into `max' tokens at most, the last one containing the whole rest of the given `string', if any." (flet ((is-ws (char) (find char ws))) (loop for start = (position-if-not #'is-ws string) then (position-if-not #'is-ws string :start index) for index = (and start (if (and max (= (1+ word-count) max)) nil (position-if #'is-ws string :start start))) while start collect (subseq string start index) count 1 into word-count while index)))
Note:
a) This is in no way optimized. Depending on your usage patterns and your implementation this can be tuned quite a lot. Modify it first to meet your needs, use it, if it's too slow, profile it (this can have very interesting ramifications!), and then optimize it, if need be. b) The documentation string is slightly dense...
Regs, Pierre.
PS: I still think that these kind of string-processing functions should be collected somewhere in a nice library, so that new users will not find Perl more suitable for simple string-bashing things... As time permits (which it seldomly does), I'll try to put something usable together, and release it to the PD. Anyone who has something nice to contribute, feel free to contact me... ;)
Pierre Mai <p...@acm.org> writes: > robert_m_mi...@yahoo.com writes:
> > What I'd like to do is take a string like "I am a string." and return the > > words and punctuation as multiple values such as:
> > (my-read-from-string "I am a string.") > > => "I" > > "am" > > "a" > > "string" > > "."
> > Or have it build a list. Which ever is more efficient: > > (my-read-from-string "I am a string.") > > => ("I" "am" "a" "string" ".")
[tokenizing function deleted]
While looking through the HyperSpec just recently I found this TOKENIZE-SENTENCE function on the LOOP-FINISH page:
;; While LOOP-FINISH can be used can be used in a variety of ;; situations it is really most needed in a situation where a need ;; to exit is detected at other than the loop's `top level' ;; (where UNTIL or WHEN often work just as well), or where some ;; computation must occur between the point where a need to exit is ;; detected and the point where the exit actually occurs. For example: (defun tokenize-sentence (string) (macrolet ((add-word (wvar svar) `(when ,wvar (push (coerce (nreverse ,wvar) 'string) ,svar) (setq ,wvar nil)))) (loop with word = '() and sentence = '() and endpos = nil for i below (length string) do (let ((char (aref string i))) (case char (#\Space (add-word word sentence)) (#\. (setq endpos (1+ i)) (loop-finish)) (otherwise (push char word)))) finally (add-word word sentence) (return (values (nreverse sentence) endpos))))) => TOKENIZE-SENTENCE
(tokenize-sentence "this is a sentence. this is another sentence.") => ("this" "is" "a" "sentence"), 19
(tokenize-sentence "this is a sentence") => ("this" "is" "a" "sentence"), NIL
> Regs, Pierre.
> PS: I still think that these kind of string-processing functions > should be collected somewhere in a nice library, so that new users > will not find Perl more suitable for simple string-bashing things... > As time permits (which it seldomly does), I'll try to put something > usable together, and release it to the PD. Anyone who has something > nice to contribute, feel free to contact me... ;)
I think I remember this issue coming up before, and someone mentioning that at least one vendor put together a comprehensive strings package. I also think that a nice PD library of string bashing functions that is portable across implementations would be extremely useful, and I'll help out.
Does anyone know of a strings package that has already been put together by a vendor, and where API-related documentation for it might be accessed?
If we don't have to reinvent the wheel here and have two incompatible interfaces, that would be wonderful.
Pierre's `split' blows the doors off of the `tokenize-sentence' example, based one simple profiling I did. It's also more straightforward. I just think that if this will be done, then it should be well thought-out, and use of CLOS should be considered (i.e. as to whether these functions should be "extensible" in a sense, so we can "split" our own data structures, etc. Also, Lisp really doesn't limit many of these manipulations to strings, but maybe we can apply these things to sequences, more generally.
This is a very serious issue. These functions should be implementation-depenedent, and they should be completely PD, IMO. Anyway, I began working on this years ago in elisp (i.e. trying to write many Perl functions in Lisp), but I'll have to find it, and when/if I do, I'll also make it public. My `split' was almost a facsimile of Pierre's, and I think that it should not be too hard to do this (although to do it _right_ (e.g. extensible, versatile, etc) is another story altogether).
Remember, the least efficient way to read words from a string is for YOU to rewrite code for different cases. This real-world software engineering. So write one defun that handles the general case of large and small strings, make it easy to understand, and don't sweat CPU time or memory.
Here's a way to break up strings. It's not bit-fiddling, but then worrying about efficiency when using LISP is really like rearranging deck chairs on the Titanic.
For efficiency, use C/C++.
Parentheses have been known to go awry during cut and paste...
(defun string-token ;;; ;;; DESCRIPTION: Tokenize a string similar to the C string library ;;; function strtkn. Locates the position p of any member of ;;; list token within the input string and returns the ;;; substring consisting of the characters from position 0 ;;; to p-1 and a substring from p+1 to the end of the input ;;; string. If no member of token is not found, the string ;;; and nil are returned. ;;; ;;; Examples: ;;; (string-token "frog-went-a-courtin'" #\-) ;;; =>"frog" ;;; "went-a-courtin'" ;;; ;;; (string-token "frog-went-a-courtin'" #\z) ;;; =>"frog-went-a-courtin'" ;;; NIL ;;; ;;; (string-token "frog-went-a-courtin'" (list #\g #\a)) ;;; =>"fro" ;;; "-went-a-courtin'" ;;; ;;; (string-token "frog-went-a-courtin'" #\f) ;;; =>"" ;;; "rog-went-a-courtin'" ;;; ;;; PARAMETERS: string -string- ;;; token -character or list of characters- (string token ) (if (and string token) (do* ((delim-list (if (and token (listp token)) token (list token))) (char-list (coerce string 'list)) (limit (list-length char-list)) (char-count 0 (+ 1 char-count)) (char (car char-list) (nth char-count char-list)) ) ((or (member char delim-list) (= char-count limit) ) (return (values-list (if (= char-count limit) (list string nil) (list (coerce (butlast char-list (- limit char-count)) 'string) (coerce (nthcdr (+ char-count 1) char-list) 'string) ) ) ) ) ) ) ) )
SLong <stevel...@isomedia.com> writes: > Here's a way to break up strings. It's not bit-fiddling, but then > worrying > about efficiency when using LISP is really like rearranging deck chairs > on > the Titanic.
> For efficiency, use C/C++.
For efficiency, use the best functions and abstractions that are available to you.
> Parentheses have been known to go awry during cut and paste...
Parentheses have also been known to be used much more effectively than you do.
> (defun string-token > ;;; > ;;; DESCRIPTION: Tokenize a string similar to the C string library > ;;; function strtkn. Locates the position p of any > member of > ;;; list token within the input string and returns > the > ;;; substring consisting of the characters from > position 0 > ;;; to p-1 and a substring from p+1 to the end of the > input > ;;; string. If no member of token is not found, the > string > ;;; and nil are returned. > ;;; > ;;; Examples: > ;;; (string-token "frog-went-a-courtin'" #\-) > ;;; =>"frog" > ;;; "went-a-courtin'" > ;;; > ;;; (string-token "frog-went-a-courtin'" #\z) > ;;; =>"frog-went-a-courtin'" > ;;; NIL > ;;; > ;;; (string-token "frog-went-a-courtin'" (list #\g > #\a)) > ;;; =>"fro" > ;;; "-went-a-courtin'" > ;;; > ;;; (string-token "frog-went-a-courtin'" #\f) > ;;; =>"" > ;;; "rog-went-a-courtin'" > ;;; > ;;; PARAMETERS: string -string- > ;;; token -character or list of characters- > (string > token > ) > (if (and string token) > (do* ((delim-list (if (and token (listp token)) token (list > token))) > (char-list (coerce string 'list)) > (limit (list-length char-list)) > (char-count 0 (+ 1 char-count)) > (char (car char-list) (nth char-count char-list)) > ) > ((or > (member char delim-list) > (= char-count limit) > ) > (return > (values-list > (if (= char-count limit) > (list string nil) > (list > (coerce (butlast char-list (- limit char-count)) > 'string) > (coerce (nthcdr (+ char-count 1) char-list) 'string) > ) > ) > ) > ) > ) > ) > ) > )
e.g. your code could be rewritten as
(defun string-token (string token) "DESCRIPTION: Tokenize a string similar to the C string library function strtkn. Locates the position p of any member of list token within the input string and returns the substring consisting of the characters from position 0 to p-1 and a substring from p+1 to the end of the input string. If no member of token is not found, the string and nil are returned.
which is equivalent, faster and less memory hungry. No wonder your Lisp code is so slow.
Moreover, what you usually really want is
(defun string-split (str &optional (separator #\Space)) "Splits the string STR at each SEPARATOR character occurrence. The resulting substrings are collected into a list which is returned. A SEPARATOR at the beginning or at the end of the string STR results in an empty string in the first or last position of the list returned." (declare (type string str) (type character separator)) (loop for start = 0 then (1+ end) for end = (position separator str :start 0) then (position separator str :start start) for substr = (subseq str start end) then (subseq str start end) collect substr into result when (null end) do (return result) ))
However, in order to achieve the same effect in C/C++ you'd have to manually allocate (i.e. with 'malloc') some arrays. Which in *your* case would probably result in memory leaks around your application too.
Finally, what is usually very useful is something more complex that implements a tokenizer in the spirit of Java StringTokenizer class. (I have it. Do you want it?)
* SLong <stevel...@isomedia.com> | It's not bit-fiddling, but then worrying about efficiency when using LISP | is really like rearranging deck chairs on the Titanic.
well, the interesting question is how _did_ you survive? if you arranged deck chairs the way you write code, it's no wonder the whole ship sank.
Steve Long <stevel...@isomedia.com> is apparently trolling in the lisp groups. He says:
Here's a way to break up strings.
And translates his C code into lisp (badly). As others have pointed out, the whole thing could be done better and much more simply with POSITION and SUBSEQ. I want to add that I write parsers in lisp all the time (usually to process human-friendly commands to complex systems). Here's the lexer I use, which does a good bit more than Steve's tokenizer, and is quite efficient.