file/gzip trouble reading


Ben Greenman

Feb 16, 2021, 3:44:58 PM
to Racket Users
I'm trying to use `gzip-through-ports` and I haven't been able to
unzip compressed data.

Here's a tiny example. I think this should print "hello world":

```
#lang racket

(require
  (only-in file/gzip gzip-through-ports)
  (only-in file/gunzip gunzip-through-ports))

(define src "hello world")

(define (compress str)
  (call-with-output-string
    (lambda (out-port)
      (call-with-input-string str
        (lambda (in-port)
          (gzip-through-ports in-port out-port #f 0))))))

(define (decompress str)
  (call-with-output-string
    (lambda (out-port)
      (call-with-input-string str
        (lambda (in-port)
          (gunzip-through-ports in-port out-port))))))

(define tgt (decompress (compress src)))

(displayln tgt)
```

But instead, it stops with "gnu-unzip: bad header"

The source code says that the header is the first two bytes, and these
should be #o037 and #o213. But in my compressed string, the second
byte is #o357 for some reason. I'm not sure how that could have
happened ... some kind of encoding issue with string ports?

Matthew Flatt

Feb 16, 2021, 3:56:49 PM
to Ben Greenman, Racket Users
At Tue, 16 Feb 2021 15:44:54 -0500, Ben Greenman wrote:
> But in my compressed string, the second
> byte is #o357 for some reason. I'm not sure how that could have
> happened ... some kind of encoding issue with string ports?

Yes.

You want `call-with-output-bytes` on the compress side and
`call-with-input-bytes` on the decompress side. Otherwise, you'll get a
UTF-8 decoding of compressed bytes (which is unlikely to be
meaningful).
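
For reference, here is a minimal sketch of the original example with that change applied. It assumes the compressed data should be kept as a byte string end to end; the function names `compress` and `decompress` are carried over from the first message, and only the decompressed result is turned back into a string.

```racket
#lang racket

(require
  (only-in file/gzip gzip-through-ports)
  (only-in file/gunzip gunzip-through-ports))

;; Compress a string into a byte string. The output port collects raw
;; bytes, so the gzip data is never decoded as UTF-8.
(define (compress str)
  (call-with-output-bytes
    (lambda (out-port)
      (call-with-input-string str
        (lambda (in-port)
          (gzip-through-ports in-port out-port #f 0))))))

;; Decompress a byte string back into a string. Only the decompressed
;; text goes through a string port.
(define (decompress bs)
  (call-with-output-string
    (lambda (out-port)
      (call-with-input-bytes bs
        (lambda (in-port)
          (gunzip-through-ports in-port out-port))))))

(displayln (decompress (compress "hello world")))
```

With bytes ports on the compressed side, the first two bytes of `(compress "hello world")` should be the #o037 and #o213 header that `gunzip-through-ports` checks for.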

Ben Greenman

Feb 16, 2021, 4:03:33 PM
to Matthew Flatt, Racket Users
Thanks, that helps.

Sadly, I've already compressed a few files using
`call-with-output-string` ... is there an easy way to decompress those
/ undo the UTF-8 encoding?

Matthew Flatt

Feb 16, 2021, 4:27:19 PM
to Ben Greenman, Racket Users
At Tue, 16 Feb 2021 16:03:29 -0500, Ben Greenman wrote:
> Sadly, I've already compressed a few files using
> `call-with-output-string` ... is there an easy way to decompress those
> / undo the UTF-8 encoding?

Unfortunately, the underlying `get-output-string` conversion is lossy,
because bytes that don't form a UTF-8 encoding are converted to U+FFFD.

(I see that the docs say #\? instead of #\uFFFD, and I'll fix the docs.)
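
A small sketch of why the conversion cannot be undone, assuming the U+FFFD behavior described above. The helper `string-port-round-trip` is a hypothetical name introduced only for this illustration: it writes bytes to a string port and re-encodes the resulting string as UTF-8.

```racket
#lang racket

;; Hypothetical helper: write raw bytes to a string port, then encode
;; the resulting string back to UTF-8 bytes.
(define (string-port-round-trip bs)
  (string->bytes/utf-8
    (call-with-output-string
      (lambda (out)
        (write-bytes bs out)))))

;; The gzip header bytes: #o037 is plain ASCII and survives, but #o213
;; is not valid UTF-8 on its own, so it comes back as the three-byte
;; encoding of U+FFFD (EF BF BD).
(define header (string-port-round-trip (bytes #o037 #o213)))

;; Two different invalid bytes produce the same result, so the
;; original byte cannot be recovered afterwards.
(define same?
  (equal? (string-port-round-trip (bytes #x8B))
          (string-port-round-trip (bytes #x8C))))
```

Under that assumption the second byte of `header` is #xEF, which is #o357, the value seen in the first message, and `same?` is true because both inputs collapse to the same replacement character.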

Dominik Pantůček

Feb 16, 2021, 4:44:02 PM
to racket...@googlegroups.com
#\uFFFD is #\� (bytes EF BF BD in UTF-8)

For those who do not see it (I suspect encoding issues) it is a white
question mark on black vertically elongated hexagon.

And actually Racket REPL in my terminal displays it like this (7.9 BC,
8.0 CS, both on Ubuntu 20.04 in GNOME terminal).


Dominik

Ben Greenman

Feb 16, 2021, 7:18:03 PM
to racket...@googlegroups.com
Alas, I see lots of those question mark hexagons in my data. Good to know.