Getting the first N words: speed comparison

21 views
Skip to first unread message

Joel Dueck

unread,
Mar 20, 2019, 11:19:47 PM3/20/19
to Pollen
This isn’t a question, just sharing a rabbit hole I went down tonight that might be of use to other Pollenators.

I wrote a function to get the first N words out of the "visible text" of a tagged X-expression:

    (define doc '(root (p [[class "new-section"]] "She counted (" (em "one, two") "— silently, eyes unblinking")))

    (first-words (get-elements doc) 5) ; → "She counted one two silently"

In particular I wanted to a) ignore attribute strings, b) filter out some, but not all, punctuation, and c) examine only one txexpr at a time until I found enough words.

I thought about using regular expressions. But then I thought I could get it to run faster by manually examining each character one at a time, so I tried that first. Then I got curious, so I went back and wrote the regular-expression version and benchmarked them against each other.

The functions and the benchmark results are at this gist: https://gist.github.com/otherjoel/4366960058983073ce01fa27f1a4d09d

The regular-expression version is slower, much more so, it seems, the larger the first txexpr that you give it.

I am sure both functions could be made much faster. In particular the regex version matches all the words in each string, there is probably a better pattern that would stop after the first N words.

Matthew Butterick

unread,
Mar 21, 2019, 1:39:24 AM3/21/19
to Joel Dueck, Pollen


On Mar 20, 2019, at 8:19 PM, Joel Dueck <dueck...@gmail.com> wrote:

The regular-expression version is slower, much more so, it seems, the larger the first txexpr that you give it.

I am sure both functions could be made much faster. In particular the regex version matches all the words in each string, there is probably a better pattern that would stop after the first N words.

Your regexp-based function is slower because of `regexp-match*`, which eagerly finds all the matches (whether you need them or not). Whereas the port-based function is faster because it works incrementally. 

But you can do both at the same time, by passing an input port as the argument to `regexp-match`. In this example, the pattern is matched incrementally, and if we don't get enough words, we incrementally process the next txexpr.

(require racket/string)
(define (first-words-regex2 txs n)
  (define words
    (let loop ([txs txs][n n])
      (define ip (open-input-string (tx-strs (car txs))))
      (define words (for*/list ([i (in-range n)]
                                [bs (in-value (regexp-match #px"\\w+" ip))]
                                #:break (not bs))
                      (bytes->string/utf-8 (car bs))))
      (if (= (length words) n)
          words
          (append words (loop (cdr txs) (- n (length words)))))))
  (string-join words " "))

Joel Dueck

unread,
Mar 21, 2019, 3:15:02 PM3/21/19
to Pollen
Yes, first-words-regex2 is pretty much identical in performance to my longer regex-less version. Thanks for the pointer! I was not familiar with the use of regexp-match functions on an input port. It’s a little wild to me how even using a string port, a general-purpose pattern matching function can be just about as fast as a custom-built loop that knows exactly what it wants. But the regex library has probably been pretty well optimized by now.

    raco test: (submod "test-func.rkt" test)
    cpu time: 140 real time: 140 gc time: 9      ; first-words
    cpu time: 117 real time: 117 gc time: 0       ; first-words-regex2
    "She counted one two silently"
    "She counted one two silently"
    cpu time: 124 real time: 124 gc time: 0
    cpu time: 122 real time: 122 gc time: 0
    "Stop! she called. She was"
    "Stop she called. She was"
    cpu time: 345 real time: 346 gc time: 0
    cpu time: 358 real time: 358 gc time: 2
    "In a 2005 episode of"
    "In a 2005 episode of"

Matthew Butterick

unread,
Mar 21, 2019, 8:10:49 PM3/21/19
to Joel Dueck, Pollen
As a Racket rule of thumb, I find that most efforts toward "custom-built loops" end in defeat, because the Racket macro expander and JIT compiler are aware of better optimizations. If I were writing another book on Racket, it would be High-Performance Racket, which I know more about than I used to, but still not very much ;)

Joel Dueck

unread,
Mar 21, 2019, 10:24:12 PM3/21/19
to Pollen
I could have used such a book during the 2018 Advent of Code. I seem to recall I spent a lot of time scrounging for faster ways to do things after the simple, “idiomatic” Racket approach (or rather my novice-level conceptions of those approaches) resulted in extremely slow code. I was more or less successful in most cases[1] but I was never able to match the speeds that the Python users said they were achieving.

Matthew Butterick

unread,
Mar 21, 2019, 11:07:04 PM3/21/19
to Joel Dueck, Pollen
I like Eric Wastl’s work (so much that I licensed one of his puzzles for Beautiful Racket). He often designs them to foil naive algorithms. What other languages call a list is often better modeled in Racket as a vector. I did the initial 2018 puzzles specifically with the goal of making them fast. In one case I switched to mutable pairs; in another, a double linked list (rather than the usual, which is single linked) to avoid excessive traversal. In general I often try to think of algorithmic problems in terms of physical movement, and then how to minimize that movement. 

--
You received this message because you are subscribed to the Google Groups "Pollen" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pollenpub+...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Reply all
Reply to author
Forward
0 new messages