Getting the first N words: speed comparison

Joel Dueck

unread,

Mar 20, 2019, 11:19:47 PM3/20/19

to Pollen

This isn’t a question, just sharing a rabbit hole I went down tonight that might be of use to other Pollenators.

I wrote a function to get the first N words out of the "visible text" of a tagged X-expression:

(define doc '(root (p [[class "new-section"]] "She counted (" (em "one, two") "— silently, eyes unblinking")))

(first-words (get-elements doc) 5) ; → "She counted one two silently"

In particular I wanted to a) ignore attribute strings, b) filter out some, but not all, punctuation, and c) examine only one txexpr at a time until I found enough words.

I thought about using regular expressions. But then I thought I could get it to run faster by manually examining each character one at a time, so I tried that first. Then I got curious, so I went back and wrote the regular-expression version and benchmarked them against each other.

The functions and the benchmark results are at this gist: https://gist.github.com/otherjoel/4366960058983073ce01fa27f1a4d09d

The regular-expression version is slower, much more so, it seems, the larger the first txexpr that you give it.

I am sure both functions could be made much faster. In particular the regex version matches all the words in each string, there is probably a better pattern that would stop after the first N words.

Matthew Butterick

unread,

Mar 21, 2019, 1:39:24 AM3/21/19

to Joel Dueck, Pollen

On Mar 20, 2019, at 8:19 PM, Joel Dueck <dueck...@gmail.com> wrote:

The regular-expression version is slower, much more so, it seems, the larger the first txexpr that you give it.

I am sure both functions could be made much faster. In particular the regex version matches all the words in each string, there is probably a better pattern that would stop after the first N words.

Your regexp-based function is slower because of `regexp-match*`, which eagerly finds all the matches (whether you need them or not). Whereas the port-based function is faster because it works incrementally.

But you can do both at the same time, by passing an input port as the argument to `regexp-match`. In this example, the pattern is matched incrementally, and if we don't get enough words, we incrementally process the next txexpr.

(require racket/string)

(define (first-words-regex2 txs n)

(define words

(let loop ([txs txs][n n])

(define ip (open-input-string (tx-strs (car txs))))

(define words (for*/list ([i (in-range n)]

[bs (in-value (regexp-match #px"\\w+" ip))]

#:break (not bs))

(bytes->string/utf-8 (car bs))))

(if (= (length words) n)

words

(append words (loop (cdr txs) (- n (length words)))))))

(string-join words " "))

Joel Dueck

unread,

Mar 21, 2019, 3:15:02 PM3/21/19

to Pollen

Yes, first-words-regex2 is pretty much identical in performance to my longer regex-less version. Thanks for the pointer! I was not familiar with the use of regexp-match functions on an input port. It’s a little wild to me how even using a string port, a general-purpose pattern matching function can be just about as fast as a custom-built loop that knows exactly what it wants. But the regex library has probably been pretty well optimized by now.

raco test: (submod "test-func.rkt" test)

cpu time: 140 real time: 140 gc time: 9 ; first-words

cpu time: 117 real time: 117 gc time: 0 ; first-words-regex2

"She counted one two silently"

cpu time: 124 real time: 124 gc time: 0

cpu time: 122 real time: 122 gc time: 0

"Stop! she called. She was"

"Stop she called. She was"

cpu time: 345 real time: 346 gc time: 0

cpu time: 358 real time: 358 gc time: 2

"In a 2005 episode of"

Matthew Butterick

unread,

Mar 21, 2019, 8:10:49 PM3/21/19

to Joel Dueck, Pollen

As a Racket rule of thumb, I find that most efforts toward "custom-built loops" end in defeat, because the Racket macro expander and JIT compiler are aware of better optimizations. If I were writing another book on Racket, it would be High-Performance Racket, which I know more about than I used to, but still not very much ;)

Joel Dueck

unread,

Mar 21, 2019, 10:24:12 PM3/21/19

to Pollen

I could have used such a book during the 2018 Advent of Code. I seem to recall I spent a lot of time scrounging for faster ways to do things after the simple, “idiomatic” Racket approach (or rather my novice-level conceptions of those approaches) resulted in extremely slow code. I was more or less successful in most cases[1] but I was never able to match the speeds that the Python users said they were achieving.

[1]: https://thenotepad.org/repos/aoc2018/

Matthew Butterick

unread,

Mar 21, 2019, 11:07:04 PM3/21/19

to Joel Dueck, Pollen

I like Eric Wastl’s work (so much that I licensed one of his puzzles for Beautiful Racket). He often designs them to foil naive algorithms. What other languages call a list is often better modeled in Racket as a vector. I did the initial 2018 puzzles specifically with the goal of making them fast. In one case I switched to mutable pairs; in another, a double linked list (rather than the usual, which is single linked) to avoid excessive traversal. In general I often try to think of algorithmic problems in terms of physical movement, and then how to minimize that movement.

https://github.com/mbutterick/aoc-racket/tree/master/2018

--
You received this message because you are subscribed to the Google Groups "Pollen" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pollenpub+...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Reply all

Reply to author

Forward