performance, json


Brian Craft

Feb 22, 2019, 12:47:07 PM2/22/19
to Racket Users
I'm doing a few performance tests, just to get an idea of racket performance. The following result surprised me a bit. Parsing 1M strings from a json array, like

(define samples (time (read-json (open-input-file "test.json"))))

running with 'racket test.rkt'

Comparing to js, java, and clojure:

js 0.128s
java 0.130s
clojure 1.3s
racket 10s

This is pretty slow. Is this typical? Are there other steps I should be taking, for performance?
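For reference, a sketch of the benchmark setup. The post only says `test.json` holds a JSON array of 1M strings, so the exact contents here are an assumption; `sample-~a` is an illustrative string shape, not from the original post.

```racket
#lang racket/base
(require json)

;; Generate a JSON array of 1M short strings (assumed shape of test.json).
(call-with-output-file "test.json"
  #:exists 'replace
  (λ (out)
    (write-json (for/list ([i (in-range 1000000)])
                  (format "sample-~a" i))
                out)))

;; Time the parse, as in the original post.
(define samples (time (read-json (open-input-file "test.json"))))
```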

John Clements

Feb 22, 2019, 1:36:23 PM2/22/19
to Brian Craft, Racket Users
I’m not that surprised :).

My guess is that our json reader could be sped up quite a bit. This looks like the heart of the read-json implementation:

(define (read-json* who i jsnull)
  ;; Follows the specification (eg, at json.org) -- no extensions.
  ;;
  (define (err fmt . args)
    (define-values [l c p] (port-next-location i))
    (raise-read-error (format "~a: ~a" who (apply format fmt args))
                      (object-name i) l c p #f))
  (define (skip-whitespace) (regexp-match? #px#"^\\s*" i))
  ;;
  ;; Reading a string *could* have been nearly trivial using the racket
  ;; reader, except that it won't handle a "\/"...
  (define (read-string)
    (define result (open-output-bytes))
    (let loop ()
      (define esc
        (let loop ()
          (define c (read-byte i))
          (cond
            [(eof-object? c) (err "unterminated string")]
            [(= c 34) #f]               ;; 34 = "
            [(= c 92) (read-bytes 1 i)] ;; 92 = \
            [else (write-byte c result) (loop)])))
      (cond
        [(not esc) (bytes->string/utf-8 (get-output-bytes result))]
        [(case esc
           [(#"b") #"\b"]
           [(#"n") #"\n"]
           [(#"r") #"\r"]
           [(#"f") #"\f"]
           [(#"t") #"\t"]
           [(#"\\") #"\\"]
           [(#"\"") #"\""]
           [(#"/") #"/"]
           [else #f])
         => (λ (m) (write-bytes m result) (loop))]
        [(equal? esc #"u")
         (let* ([e (or (regexp-try-match #px#"^[a-fA-F0-9]{4}" i)
                       (err "bad string \\u escape"))]
                [e (string->number (bytes->string/utf-8 (car e)) 16)])
           (define e*
             (if (<= #xD800 e #xDFFF)
                 ;; it's the first part of a UTF-16 surrogate pair
                 (let* ([e2 (or (regexp-try-match #px#"^\\\\u([a-fA-F0-9]{4})" i)
                                (err "bad string \\u escape, ~a"
                                     "missing second half of a UTF16 pair"))]
                        [e2 (string->number (bytes->string/utf-8 (cadr e2)) 16)])
                   (if (<= #xDC00 e2 #xDFFF)
                       (+ (arithmetic-shift (- e #xD800) 10) (- e2 #xDC00) #x10000)
                       (err "bad string \\u escape, ~a"
                            "bad second half of a UTF16 pair")))
                 e)) ; single \u escape
           (write-string (string (integer->char e*)) result)
           (loop))]
        [else (err "bad string escape: \"~a\"" esc)])))
  ;;
  (define (read-list what end-rx read-one)
    (skip-whitespace)
    (if (regexp-try-match end-rx i)
        '()
        (let loop ([l (list (read-one))])
          (skip-whitespace)
          (cond [(regexp-try-match end-rx i) (reverse l)]
                [(regexp-try-match #rx#"^," i) (loop (cons (read-one) l))]
                [else (err "error while parsing a json ~a" what)]))))
  ;;
  (define (read-hash)
    (define (read-pair)
      (define k (read-json))
      (unless (string? k) (err "non-string value used for json object key"))
      (skip-whitespace)
      (unless (regexp-try-match #rx#"^:" i)
        (err "error while parsing a json object pair"))
      (list (string->symbol k) (read-json)))
    (apply hasheq (apply append (read-list 'object #rx#"^}" read-pair))))
  ;;
  (define (read-json [top? #f])
    (skip-whitespace)
    (cond
      [(and top? (eof-object? (peek-char i))) eof]
      [(regexp-try-match #px#"^true\\b" i) #t]
      [(regexp-try-match #px#"^false\\b" i) #f]
      [(regexp-try-match #px#"^null\\b" i) jsnull]
      [(regexp-try-match
        #rx#"^-?(?:0|[1-9][0-9]*)(?:\\.[0-9]+)?(?:[eE][+-]?[0-9]+)?" i)
       => (λ (bs) (string->number (bytes->string/utf-8 (car bs))))]
      [(regexp-try-match #rx#"^[\"[{]" i)
       => (λ (m)
            (let ([m (car m)])
              (cond [(equal? m #"\"") (read-string)]
                    [(equal? m #"[") (read-list 'array #rx#"^\\]" read-json)]
                    [(equal? m #"{") (read-hash)])))]
      [else (err (format "bad input~n ~e"
                         (peek-bytes (sub1 (error-print-width)) 0 i)))]))
  ;;
  (read-json #t))


… and my guess is that the JS performance would be similar, if the json reader in JS was written in JS. I think there are probably a lot of provably-unneeded checks, and you could probably get rid of the byte-at-a-time reading.

It would be interesting to see how much faster (if at all) it is to run the TR version of this code.

John



Greg Trzeciak

Feb 22, 2019, 1:46:56 PM2/22/19
to Racket Users

There is http://docs.racket-lang.org/tjson/index.html available (haven't checked how similar the code is though)

Matthew Flatt

Feb 22, 2019, 3:35:48 PM2/22/19
to John Clements, Brian Craft, Racket Users
I think the bigger bottleneck is the main parsing loop, which uses
`regexp-try-match` even more. Although `regexp-try-match` is
convenient, it's much slower than using `peek-char` directly to check
for one character. I'll experiment with improvements there.
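An illustrative sketch of the difference Matthew describes (not his actual patch): checking for one known character via the regexp machinery versus a plain `peek-char`. The function names are hypothetical. Note the two differ slightly in behavior: `regexp-try-match` consumes the match, so the peek-based version must `read-char` once it sees a match.

```racket
#lang racket

;; Regexp-based: spins up the regexp matcher for a one-byte check.
(define (match-colon/regexp in)
  (and (regexp-try-match #rx#"^:" in) #t))

;; peek-char-based: one cheap comparison, then consume on success.
(define (match-colon/peek in)
  (and (eqv? (peek-char in) #\:)
       (begin (read-char in) #t)))
```

Both return `#t` and consume the colon when the next character is `:`, and leave the port untouched otherwise.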

Jon Zeppieri

Feb 22, 2019, 4:34:38 PM2/22/19
to Matthew Flatt, John Clements, Brian Craft, Racket Users
On a related (but not too related) note: is there an efficient way to skip multiple bytes in an input stream? It looks like there are two choices:
  - You can read the bytes you want to skip, but that implies either allocating a useless byte array or keeping one around for this very purpose.
  - You can use (I think?) port-commit-peeked, but given the API, it seems like that was designed with a particular (and more complicated) use in mind.

WarGrey Gyoudmon Ju

Feb 22, 2019, 5:00:24 PM2/22/19
to Jon Zeppieri, Matthew Flatt, John Clements, Brian Craft, Racket Users
I have tried my best to find the "best practice" for doing IO in Racket.

Here are some tips I found while writing a CSV reader. For scale: on a MacBook Pro 15 (2013), it takes 3.5s to read a 70MB file.

I agree that `read-char` is the first choice, but `peek-char` can be slow.
Instead, read the characters you would otherwise peek and pass them as leading arguments to the parsing routine.
This strategy may require a redesign of your parsing workflow,
since every subroutine must accept an extra input argument and return one more value.
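A minimal sketch of this "read instead of peek" style: each subroutine receives the already-read leading character and returns both its result and the next unconsumed character. The names here are illustrative, not from any actual CSV reader.

```racket
#lang racket/base

;; Parse a run of decimal digits. `leading` is the character the caller
;; already read; the second return value is the lookahead character that
;; the caller threads into the next subroutine.
(define (read-unsigned-integer in leading)
  (let loop ([c leading] [acc 0])
    (if (and (char? c) (char-numeric? c))
        (loop (read-char in)
              (+ (* acc 10) (- (char->integer c) (char->integer #\0))))
        (values acc c))))

;; Usage: the caller reads the first character and threads it through.
(define in (open-input-string "123,456"))
(define-values (n next) (read-unsigned-integer in (read-char in)))
;; n = 123, next = #\,
```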

Brian Craft

Feb 28, 2019, 2:14:34 PM2/28/19
to Racket Users
I added some type annotations & tried it. The result is 16s, substantially slower.

Haven't tried tjson yet. I'm not actually that interested in json; it was just something at hand to try.

From this thread, it sounds like there's some knowledge about racket performance that isn't yet in the docs. Are there any other resources on performance?

Matthew Flatt

Mar 2, 2019, 3:49:30 PM3/2/19
to Jon Zeppieri, Racket Users
At Fri, 22 Feb 2019 16:34:24 -0500, Jon Zeppieri wrote:
> On a related (but not too related) note: is there an efficient way to skip
> multiple bytes in an input stream? It looks like there are two choices:
> - You can read the bytes you want to skip, but that implies either
> allocating a useless byte array or keeping one around for this very purpose.
> - You can use (I think?) port-commit-peeked, but given the API, it seems
> like that was designed with a particular (and more complicated) use in mind.

I've run into this a few times, too. Assuming that `file-position`
doesn't apply, having a buffer for discarded bytes is the best approach
that I know. To make it faster, I think we'd have to build a
`discard-bytes` function into the I/O layer.
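A sketch of the buffer approach Matthew describes: keep one reusable scratch buffer and read the discarded bytes into it, so skipping allocates nothing per call. `skip-bytes!` and the buffer size are hypothetical names/choices, not an existing Racket API.

```racket
#lang racket/base

;; One shared scratch buffer for all skips.
(define scratch (make-bytes 4096))

;; Discard up to n bytes from in, reading them into the scratch buffer.
;; Stops early at eof.
(define (skip-bytes! in n)
  (let loop ([remaining n])
    (when (positive? remaining)
      (define got (read-bytes! scratch in 0
                               (min remaining (bytes-length scratch))))
      (unless (eof-object? got)
        (loop (- remaining got))))))

;; Usage:
(define in (open-input-string "0123456789"))
(skip-bytes! in 4)
(read-char in) ; #\4
```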
