Hi,
Sometimes I wish for python-like string-strip to be a standard
function. For example when I want to parse a list of numbers from a
text-field, I could get the string value of the text-field, strip
space from the beginning and end, split on whitespace, map
string->number over it and check the result with `every'.
Is there a better way? If not then perhaps some form of string-strip
could be added (perhaps under regexp-strip name?) since it isn't
exactly extremely simple unless you're a regexp wizard.
Marijn
Some code (ter leering ende vermaak):
#lang racket
(require rackunit (only-in srfi/1 every))
(define (string-strip string (strip "[:space:]"))
(define rexp
(pregexp (regexp-replace*
"strip"
"^[strip]*(.*?)[strip]*$"
strip)))
(cadr (regexp-match rexp string)))
(check-equal? (string-strip " strip this ") "strip this")
(define (string-split string (split "[[:space:]]+"))
(regexp-split (pregexp split) (string-strip string)))
(check-equal? (string-split " 12. .12
34") '("12." ".12" "34"))
(define (string->number-list string)
(map string->number (string-split string)) )
(check-equal? (string->number-list " 2003 2005 2007\t ") '(2003 2005
2007))
(define (list-of-years? lst)
(every exact-positive-integer? lst))
(check-equal? (string->number "not a number") #f)
(check-equal? (list-of-years? '(2003 3004 43)) #t)
(check-equal? (list-of-years? '(45 #f 20)) #f)
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.18 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/
iEYEARECAAYFAk76Bo0ACgkQp/VmCx0OL2xxHwCgmzdAEYEIg92eigYetBw6EWB1
JeQAnRHlTA3C+ZCJnz83lj6YkGCEA3Oy
=spER
-----END PGP SIGNATURE-----
____________________
Racket Users list:
http://lists.racket-lang.org/users
string-trim-both in srfi-13 does that.
http://docs.racket-lang.org/srfi-std/srfi-13.html#string-trim
Cheers,
Sam
For example, my first attempt to implement "string->number" (after
briefly considering just using "read"), was 17 times faster in a
shootout on my computer, compared to the one that was built using
"string-strip" and other string operations. (It also helped that I used
Racket regexps rather than pregexps.)
(define (string->number-list/mk2 str)
(let loop ((start 0))
(cond ((regexp-match-positions #rx"^[ \t\r\n\f]*([0-9]+(?:[0-9]+)?)"
str
start)
=> (lambda (m)
(let* ((range (cadr m))
(end (cdr range)))
(cons (string->number (substring str
(car range)
(cdr range)))
(loop end)))))
(else (if (regexp-match? #rx"^[ \t\r\n\f]*$" str start)
'()
(error 'string->number-list
"could not parse string ~S at start position ~S"
str
start))))))
If I was writing performance-sensitive code, I would also see whether a
procedure that scanned a character at a time, and didn't need
"string->number", was faster.
As everyone knows, if the code is not performance-sensitive, do whatever
is easiest to develop in a correct and maintainable way. But, when
performance matters -- such as in code that takes a significant
percentage of time in serving a HTTP request, or is in a PLaneT package
that other people might use in performance-sensitive context -- we also
have to think about algorithmic efficiency, as well as costs specific to
the language implementation.
--
http://www.neilvandyke.org/
On 27-12-11 18:59, Sam Phillips wrote:
> On Tue, Dec 27, 2011 at 09:55, Marijn <hk...@gentoo.org> wrote:
>> Sometimes I wish for python-like string-strip to be a standard
>> function. For example when I want to parse a list of numbers from
>> a text-field, I could get the string value of the text-field,
>> strip space from the beginning and end, split on whitespace, map
>> string->number over it and check the result with `every'.
>>
>> Is there a better way? If not then perhaps some form of
>> string-strip could be added (perhaps under regexp-strip name?)
>> since it isn't exactly extremely simple unless you're a regexp
>> wizard.
>
> string-trim-both in srfi-13 does that.
Hi Sam,
now that you remind me of it I do recall that I figured this same
thing out a long time ago. Thanks for that!
Perhaps the docs could put a cross-reference under string-split for
srfi-13's string-trim-both.
> http://docs.racket-lang.org/srfi-std/srfi-13.html#string-trim
>
> Cheers, Sam
On 27-12-11 20:56, Neil Van Dyke wrote:
> Tangential comment: These string operations are convenient to have
> around, and they can be expedient, but composing with them can make
> for inefficient evaluation. For example, my first attempt to
> implement "string->number" (after briefly considering just using
> "read"), was 17 times faster in a shootout on my computer, compared
> to the one that was built using "string-strip" and other string
> operations. (It also helped that I used Racket regexps rather than
> pregexps.)
>
> (define (string->number-list/mk2 str) (let loop ((start 0)) (cond
> ((regexp-match-positions #rx"^[ \t\r\n\f]*([0-9]+(?:[0-9]+)?)" str
> start) => (lambda (m) (let* ((range (cadr m)) (end (cdr range)))
> (cons (string->number (substring str (car range) (cdr range)))
> (loop end))))) (else (if (regexp-match? #rx"^[ \t\r\n\f]*$" str
> start) '() (error 'string->number-list "could not parse string ~S
> at start position ~S" str start))))))
>
> If I was writing performance-sensitive code, I would also see
> whether a procedure that scanned a character at a time, and didn't
> need "string->number", was faster.
Thank you, Neil, that was quite inspirational. Your use of
regexp-match-positions is instructive (I was never sure how to use
that function and didn't consider it for this use case). I did not
measure a noticeable difference between pregexp and regexp btw and I
changed your first regexp to either #rx"^[ \t\r\n\f]*([1-9][0-9]*)" or
#px"^[[:space:]]*([1-9][0-9]*)" which I think is equivalent.
> As everyone knows, if the code is not performance-sensitive, do
> whatever is easiest to develop in a correct and maintainable way.
> But, when performance matters -- such as in code that takes a
> significant percentage of time in serving a HTTP request, or is in
> a PLaneT package that other people might use in
> performance-sensitive context -- we also have to think about
> algorithmic efficiency, as well as costs specific to the language
> implementation.
I don't think my use of this code is very performance, but I couldn't
help myself, so I looked into making it faster according to your
suggestion. What I found was that it is much slower to treat a string
as a port and then read-char from that then it is to directly index
the string. Also peek-char seems to take almost as much time as
read-char, so you're basically reading everything twice since there is
no such thing as putting back a char into a port (doesn't C/C++ have
that?). Also while for/fold is extremely nice, it isn't as efficient
as a named-let in the cases where both work. I found that out when my
code was still collecting chars into a list and later walking that
list to construct a number. Of course its better to construct the
number straight away, so you don't have to cons that list at all.
In the end I was able to construct code that is another factor 5.5
faster than your version:
(define (string->number-list/mk5 str)
(define length (string-length str))
(define (match-unsigned-integer str pos)
(define (match-digits result pos)
(if (< pos length)
(let* ((char (string-ref str pos)) (pos (+ pos 1)))
(case char
((#\0) (match-digits (* 10 result) pos))
((#\1) (match-digits (+ (* 10 result) 1) pos))
((#\2) (match-digits (+ (* 10 result) 2) pos))
((#\3) (match-digits (+ (* 10 result) 3) pos))
((#\4) (match-digits (+ (* 10 result) 4) pos))
((#\5) (match-digits (+ (* 10 result) 5) pos))
((#\6) (match-digits (+ (* 10 result) 6) pos))
((#\7) (match-digits (+ (* 10 result) 7) pos))
((#\8) (match-digits (+ (* 10 result) 8) pos))
((#\9) (match-digits (+ (* 10 result) 9) pos))
(else (values result pos))))
(values result pos)))
(define (match-first-digit pos)
(if (< pos length)
(let* ((char (string-ref str pos)) (pos (+ pos 1)))
(case char
((#\1) (match-digits 1 pos))
((#\2) (match-digits 2 pos))
((#\3) (match-digits 3 pos))
((#\4) (match-digits 4 pos))
((#\5) (match-digits 5 pos))
((#\6) (match-digits 6 pos))
((#\7) (match-digits 7 pos))
((#\8) (match-digits 8 pos))
((#\9) (match-digits 9 pos))
((#\space #\tab #\return #\newline #\page)
(match-first-digit pos))
(else (values #f pos))))
(values #f pos)))
(match-first-digit pos))
(let loop ((pos 0) (result '()))
(let-values (((unt pos) (match-unsigned-integer str pos)))
(if unt
(loop pos (cons unt result))
(if (< pos length)
#f
(reverse result))))) )
Marijn
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.18 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/
iEYEARECAAYFAk77S0EACgkQp/VmCx0OL2wOMwCfbEXpEd+sEXw33tCSYMRDWHii
H8EAn3WKNBjxiqBuSKFfxMhg7w1US1s2
=I4Mo
-----END PGP SIGNATURE-----
This is the best spirit. :)
> What I found was that it is much slower to treat a string
> as a port and then read-char from that then it is to directly index
> the string.
That string input ports are often noticeably slower than string indexing
is one of the banes of my existence. Most reading and parsing
operations you implement, you want to work on both ports and strings.
But, if you first write a procedure that works on a port, and then write
a wrapper procedure that works on a string (by doing an
"open-input-string" and calling your procedure that works on ports), the
string one can be noticeably slower than if you'd handwritten the string
one. But having to write two separate procedures has big development
cost, and I always just take the performance hit on strings instead, or
write a string procedure and then not have a port procedure when I need
it later. One approach that might help is to design a macro that lets
people define processing on strings and ports, and expands to produce
two closure definitions -- one that works on ports, and one on strings,
and avoids a lot of port-related overhead in the string one.
> In the end I was able to construct code that is another factor 5.5
> faster than your version:
>
Marijn, that's a great implementation.
And it's encouraging to see that forgoing regexps and "string->number",
and doing a character-by-character DFA, is faster in this case. It's
pretty common in interpreted languages to use regexps for performance
reasons; nice when we see that pure Racket code can be faster.
One tangential comment: I don't think it's significant in this
situation, but I believe that "case" is currently slower than, say,
people coming from a C background might think. So, sometimes, if one
really wants to micro-optimize, one sometimes might be better off doing,
say strategic "if"s and arithmetic, instead of "case".
--
http://www.neilvandyke.org/
On 28-12-11 19:27, Neil Van Dyke wrote:
> Marijn wrote at 12/28/2011 12:00 PM:
>> I don't think my use of this code is very performance, but I
>> couldn't help myself, so I looked into making it faster
>
> This is the best spirit. :)
>
>> What I found was that it is much slower to treat a string as a
>> port and then read-char from that then it is to directly index
>> the string.
>
> That string input ports are often noticeably slower than string
> indexing is one of the banes of my existence. Most reading and
> parsing operations you implement, you want to work on both ports
> and strings. But, if you first write a procedure that works on a
> port, and then write a wrapper procedure that works on a string (by
> doing an "open-input-string" and calling your procedure that works
> on ports), the string one can be noticeably slower than if you'd
> handwritten the string one. But having to write two separate
> procedures has big development cost, and I always just take the
> performance hit on strings instead, or write a string procedure and
> then not have a port procedure when I need it later. One approach
> that might help is to design a macro that lets people define
> processing on strings and ports, and expands to produce two closure
> definitions -- one that works on ports, and one on strings, and
> avoids a lot of port-related overhead in the string one.
Matthew, any comments on this? Is there a fundamental reason that
treating a string as a port is so much slower than direct indexing or
is there something that can be done about it? Or should we look into
automatically duplicating code with macros?
Marijn
>> In the end I was able to construct code that is another factor
>> 5.5 faster than your version:
>>
>
> Marijn, that's a great implementation.
>
> And it's encouraging to see that forgoing regexps and
> "string->number", and doing a character-by-character DFA, is faster
> in this case. It's pretty common in interpreted languages to use
> regexps for performance reasons; nice when we see that pure Racket
> code can be faster.
>
> One tangential comment: I don't think it's significant in this
> situation, but I believe that "case" is currently slower than,
> say, people coming from a C background might think. So, sometimes,
> if one really wants to micro-optimize, one sometimes might be
> better off doing, say strategic "if"s and arithmetic, instead of
> "case".
>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.18 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/
iEYEARECAAYFAk78LIsACgkQp/VmCx0OL2x7hQCfb77YNdgro1gKb3hhUxYQ+za7
hfAAnRwlJ2qdTOCZbNuyZvFZw34oDebI
=AyrQ
-----END PGP SIGNATURE-----
On 29-12-11 10:02, Marijn wrote:
>>> What I found was that it is much slower to treat a string as a
>>> port and then read-char from that then it is to directly index
>>> the string.
>
>> That string input ports are often noticeably slower than string
>> indexing is one of the banes of my existence. Most reading and
>> parsing operations you implement, you want to work on both ports
>> and strings. But, if you first write a procedure that works on a
>> port, and then write a wrapper procedure that works on a string
>> (by doing an "open-input-string" and calling your procedure that
>> works on ports), the string one can be noticeably slower than if
>> you'd handwritten the string one. But having to write two
>> separate procedures has big development cost, and I always just
>> take the performance hit on strings instead, or write a string
>> procedure and then not have a port procedure when I need it
>> later. One approach that might help is to design a macro that
>> lets people define processing on strings and ports, and expands
>> to produce two closure definitions -- one that works on ports,
>> and one on strings, and avoids a lot of port-related overhead in
>> the string one.
>
> Matthew, any comments on this? Is there a fundamental reason that
> treating a string as a port is so much slower than direct indexing
> or is there something that can be done about it? Or should we look
> into automatically duplicating code with macros?
Hi Matthew,
I would still like to hear your comments on this,
Marijn
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.18 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/
iEYEARECAAYFAk8NlIoACgkQp/VmCx0OL2yiaQCfU8Iv/8phG2PVM2Sb1eWqN+61
e0oAnRYpNh7g1NgLy9OUNMphpBBZuU3n
=0tLH
Performance of string-output-ports (especially looking at how different
``profiles'' of write activity might influence representation) has been
on my list-of-promising-things-I-may-never-get-to, and now we have have
string-input-ports that we want to look at.
One thing I think Racket could use is a bunch of people with useful TODO
lists of enhancements/additions they'd like to make to Racket, even if
each person ultimately only checks off one or two items from their list
this year.
--
http://www.neilvandyke.org/
> I'm sure that Matthew could speak to some of the finer points of I/O ports, but I think that string port performance is one of those things that some bright and motivated person not necessarily named Matthew could look into themself.
>
> Performance of string-output-ports (especially looking at how different ``profiles'' of write activity might influence representation) has been on my list-of-promising-things-I-may-never-get-to, and now we have have string-input-ports that we want to look at.
>
> One thing I think Racket could use is a bunch of people with useful TODO lists of enhancements/additions they'd like to make to Racket, even if each person ultimately only checks off one or two items from their list this year.
A list exists, Vincent and Sam are running it but I don't know where it is :-( -- Matthias
I think people it works best if people find their own itches they want
to scratch.
When you try to scratch someone else's itch, you might not be as
inspired, and you might not even know if you've scratched it well.
Besides, if you haven't yet experienced your own first itch, you're
probably not ready to be scratching other people's.
(Half-serious.)
--
http://www.neilvandyke.org/
The list is here: https://github.com/plt/racket/wiki/Intro-Projects .
Feel free to add to it -- it's a wiki. Currently, it focuses on
projects that are suitable to people relatively new to Racket hacking,
but I'm happy to expand that.
--
sam th
sa...@ccs.neu.edu
On 2012 Jan 11, at 16:03, Sam Tobin-Hochstadt wrote:
> The list is here: https://github.com/plt/racket/wiki/Intro-Projects .
> Feel free to add to it -- it's a wiki. Currently, it focuses on
> projects that are suitable to people relatively new to Racket hacking,
> but I'm happy to expand that.
Well, I've got a WikiCreole parser at <http://planet.racket-lang.org/display.ss?package=squicky.plt&owner=nxg>. I can sort-of see how I'd generalise that to cope with markdown, and have from time to time thought of implementing that.
Does that look like the sort of thing you mean? It's the first and so-far only thing I've put on PLaneT, so would be very happy to take criticism!
Best wishes,
Norman
--
Norman Gray : http://nxg.me.uk
SUPA School of Physics and Astronomy, University of Glasgow, UK
That's exactly the sort of thing I mean. You should update the wiki
page to point to your planet package, so that people who wanted to
contribute will know where to start looking.
--
sam th
sa...@ccs.neu.edu
The biggest overhead in port access compared to byte/string access
comes from maintaining the port state, an indirection to support
different kinds of ports, and, in the case of comparing strings,
potentially decoding characters from UTF-8. Another effect (equally
large for byte strings) is that `bytes-ref' and `string-ref' are
JIT-inlined, while `read-byte' and `read-char' are not.
See the example program below. The `port-like' variants demonstrate the
work that `read-byte' and `read-char' actually do in the simple case
that line-counting is disabled, no UTF-8 decoding turns out to be
needed, etc.
On large strings:
'bytes
cpu time: 78 real time: 77 gc time: 0
'bytes/no-inline
cpu time: 220 real time: 220 gc time: 0
'bytes-port-like
cpu time: 430 real time: 429 gc time: 0
'bytes-port
cpu time: 373 real time: 372 gc time: 0
'string
cpu time: 81 real time: 81 gc time: 0
'string/no-inline
cpu time: 201 real time: 201 gc time: 0
'string-port-like
cpu time: 560 real time: 560 gc time: 17
'string-port
cpu time: 524 real time: 523 gc time: 1
The difference in the last case was bigger prior to v5.2, where we cut
the overhead of maintaining port state roughly in half. That
improvement is reflected in the `simple-ok?' field of a `port-like'
structure that guards the simple case. Currently, I don't see how to
improve that further.
As strings get smaller, then the cost of creating a port becomes more
important; in the case of strings, that creation involves a conversion
of the string to UTF-8.
If I shift 3 zeros from N to M in the example program (strings 1/1000th
of the old size, 1000 times as many iterations):
'bytes
cpu time: 77 real time: 78 gc time: 0
'bytes/no-inline
cpu time: 226 real time: 225 gc time: 0
'bytes-port-like
cpu time: 551 real time: 552 gc time: 21
'bytes-port
cpu time: 670 real time: 672 gc time: 21
'string
cpu time: 101 real time: 102 gc time: 0
'string/no-inline
cpu time: 258 real time: 258 gc time: 0
'string-port-like
cpu time: 759 real time: 760 gc time: 8
'string-port
cpu time: 902 real time: 904 gc time: 22
I think the `port-like' variants fare better here because the actual
implementation has a more complex structure to set up each time. I
found one minor improvement (around 10%) that's reflected above, but I
don't see other easy improvements.
----------------------------------------
#lang racket
(define N 10000)
(define bstr (bytes->immutable-bytes (make-bytes N (char->integer #\x))))
(define str (make-string N #\x))
(define M 1000)
'bytes
(time
(for ([i (in-range M)])
(for ([j (in-range N)])
(bytes-ref bstr j))))
'bytes/no-inline
(define bytes-ref/not-inlined #f)
(set! bytes-ref/not-inlined bytes-ref)
(time
(for ([i (in-range M)])
(for ([j (in-range N)])
(bytes-ref/not-inlined bstr j))))
'bytes-port-like
(struct port-like (proc [simple-ok? #:mutable]))
(define (make-port-like-bytes bstr)
(define pos 0)
(port-like
(lambda ()
(let ([b (bytes-ref bstr pos)])
(set! pos (add1 pos))
b))
#t))
(define (like-read-byte pl)
(if (port-like? pl)
(if (port-like-simple-ok? pl)
((port-like-proc pl))
(error "not implemented"))
(error "bad")))
(time
(for ([i (in-range M)])
(define p (make-port-like-bytes bstr))
(for ([j (in-range N)])
(like-read-byte p))))
'bytes-port
(time
(for ([i (in-range M)])
(define p (open-input-bytes bstr))
(for ([j (in-range N)])
(read-byte p))))
'string
(time
(for ([i (in-range M)])
(for ([j (in-range N)])
(string-ref str j))))
'string/no-inline
(define string-ref/not-inlined #f)
(set! string-ref/not-inlined string-ref)
(time
(for ([i (in-range M)])
(for ([j (in-range N)])
(string-ref/not-inlined str j))))
'string-port-like
(define (make-port-like-string str)
(define pos 0)
(define bstr (string->bytes/utf-8 str))
(port-like
(lambda ()
(let ([b (bytes-ref bstr pos)])
(if (b . > . 127)
(error "complicated case not implemented")
(begin
(set! pos (add1 pos))
(integer->char b)))))
#t))
(define (like-read-char pl)
(if (port-like? pl)
(if (port-like-simple-ok? pl)
((port-like-proc pl))
(error "not implemented"))
(error "bad")))
(time
(for ([i (in-range M)])
(define p (make-port-like-string str))
(for ([j (in-range N)])
(like-read-char p))))
'string-port
(time
(for ([i (in-range M)])
(define p (open-input-string str))
(for ([j (in-range N)])
(read-char p))))