Hygiene and Records


Aaron W. Hsu

Apr 11, 2010, 4:22:37 PM
to scheme-re...@googlegroups.com
Much to my surprise, a hygiene issue was revealed to me the other day,
and I think it warrants attention. Let's first consider the usual
paradigm for records, and more particularly, a specific paradigm for
procedural records.

Most of us, myself included, had thought that you could reasonably
build a record system on top of procedural records using field names.

So, something like

(define-record-type a (fields b))

Would become something like

(begin
  (define a (make-rtd '#(b)))
  (define (a-b x) ((record-accessor a 'b) x)))

And moreover, one of the let-record forms specified by a member here
also bases its accessors on symbolic names.

However, this model, of having symbolic field names, breaks rather
spectacularly the hygiene of the system. That is, we expect to be
able to write macros that do their job, and preserve identifiers and
the like. Consider, therefore, the following macro:

;;; R6RS
(define-syntax build-box-type
  (syntax-rules ()
    [(_ box-type name getter)
     (define-record-type box-type
       (fields
         (immutable foo get-foo)
         (immutable name getter)))]))

;;; SRFI 99
(define-syntax build-box-type
  (syntax-rules ()
    [(_ box-type name getter)
     (define-record-type box-type #t #t
       (foo get-foo)
       (name getter))]))

Now, let's consider the following series of expressions:

(build-box-type my foo get-foo)
(define x (make-my 'foo 3))
(get-foo x)

What should this series of expressions return? We would expect that it
would return 3. Moreover, we expect that the definition of the
internal get-foo and the external get-foo do not conflict.

Fortunately, both the implementation of SRFI 99 and R6RS' record
system preserve the latter behavior, and neither makes the internal
binding to get-foo visible. However, SRFI 99 has its syntactic record
system based on symbolic field names. This means that there are now two
fields named foo in the record. Since all accessors are defined
by index in the R6RS version, this isn't a problem there. In the SRFI 99
version, however, we now have a data leakage and a hygiene breakage: we
expected that the externally defined get-foo would reference the second
field of the record, but instead it references the first.

This means that in the R6RS version (as implemented in Chez Scheme),
the expressions above give us the expected return value of 3, but with
the SRFI 99 implementation, we get 'foo instead.
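
To make the difference concrete, here is a toy model of the two lookup
strategies. It is only a sketch: field-index, accessor-by-name and
accessor-by-index are hypothetical helpers, not part of SRFI 99 or
R6RS, and a record is modeled as a plain vector of values.

```scheme
;; Toy model only; NOT how SRFI 99 or Chez Scheme implement records.
;; A record type is a vector of field names, a record a vector of values.

(define (field-index names name)
  ;; Index of the first field called NAME, or #f if there is none.
  (let loop ((i 0))
    (cond ((= i (vector-length names)) #f)
          ((eq? (vector-ref names i) name) i)
          (else (loop (+ i 1))))))

(define (accessor-by-name names name)
  ;; Name-based accessor: resolves the symbol to the first matching field.
  (let ((i (field-index names name)))
    (lambda (vals) (vector-ref vals i))))

(define (accessor-by-index i)
  ;; Index-based accessor: the field's spelling plays no role.
  (lambda (vals) (vector-ref vals i)))

;; What (build-box-type my foo get-foo) ends up declaring: the macro's
;; own field foo, followed by the user's field, also spelled foo.
(define names '#(foo foo))
(define x (vector 'foo 3))

((accessor-by-name names 'foo) x)   ; => foo  (the macro's field)
((accessor-by-index 1) x)           ; => 3    (the user's field)
```

The name-based accessor stops at the first field spelled foo, which is
the macro's own field, while the index-based accessor reaches the field
the user asked for.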

Therefore, in any record system that we define, we must be careful to
ensure that accessors do not leak or shadow each other like this. This
example, while functionally disastrous, seems somewhat benign from a
security point of view. However, if the first field were meant to be a
hidden field, or if we instead got an error saying that you can't have
two fields with the same name, the abstraction would suddenly have
leaked information, which could be very bad for the safety and security
of the program.

Aaron W. Hsu

Brian Harvey

Apr 11, 2010, 5:30:08 PM
to scheme-re...@googlegroups.com
> could be very bad for the safety and security of the program.

Can we not assume that people writing web servers will use WG2?

John Cowan

Apr 12, 2010, 12:50:56 AM
to scheme-re...@googlegroups.com
Aaron W. Hsu scripsit:

> ;;; SRFI 99
> (define-syntax build-box-type
> (syntax-rules ()
> [(_ box-type name getter)
> (define-record-type box-type #t #t
> (foo get-foo)
> (name getter))]))

What this does completely depends on how the Scheme implements syntax
definitions that expand into definitions only. I tried the following
program on my usual array of Schemes:

  > (define-syntax foo
      (syntax-rules ()
        ((foo) (define x 32))))
  > (foo)
  > x

to see whether I got 32 as the value of x or blew up on an error such
as invalid syntax (in the define-syntax) or undefined variable. PLT,
MIT, Gambit, scsh/Scheme48, Guile, SISC, Chez, Ikarus, Mosh blew up;
Gauche, Chicken, Bigloo, Kawa, SCM, Larceny (in R5RS mode), Scheme 9,
STklos, sscm, SXM, VSCM, Chibi were fine with it and returned 32.

I haven't tested it, but I would expect that the definitions that SRFI-9(9)
expands into in the first set of Schemes would be invisible outside the
define-syntax, whereas they would be visible in the second set.

> SRFI 99 has its syntactic record system based on symbolic field
> names. This means that there are two fields named foo in the record
> now. Since all accessors are defined by index in the R6RS version,
> this isn't a problem. In the second one, however, we now have a data
> leakage and a hygiene breakage, as we expected that the externally
> defined get-foo should reference the second field of the record,
> but instead, it references the first.

In SRFI-99, the last line of the procedural layer definition says:
"Fields in derived record-types shadow fields of the same name in a
parent record-type." So that's normal and expected behavior.

> This means that in the R6RS version (as implemented in Chez Scheme),
> the first gives us the expected return value of 3, but with the SRFI 99
> implementation, we get 'foo instead.
>
> Therefore, in any record system that we define, we must be careful to
> ensure that accessor do not leak or shadow each other like this. In
> this example, while functionally disastrous, from a security point of
> view, this seems somewhat benign. However, if the first field were
> meant to be a hidden field, or if we instead got an error saying that
> you can't have two fields named the same thing, you have suddenly
> leaked information out of your abstraction that could be very bad
> for the safety and security of the program.

In SRFI-99 the field names are symbols, not identifiers, so they are
exposed as part of the type.

(If I'm talking completely past your point, as may be the case,
please let me know.)

--
John Cowan http://ccil.org/~cowan co...@ccil.org
Lope de Vega: "It wonders me I can speak at all. Some caitiff rogue did
rudely yerk me on the knob, wherefrom my wits still wander."
An Englishman: "Ay, a filchman to the nab betimes 'll leave a man
crank for a spell." --Harry Turtledove, Ruled Britannia

Emmanuel Medernach

Apr 13, 2010, 5:54:09 AM
to scheme-reports-wg1
On Apr 12, 6:50 am, John Cowan <co...@ccil.org> wrote:
>
> What this does completely depends on how the Scheme implements syntax
> definitions that expand into definitions only. I tried the following
> program on my usual array of Schemes:
>
> > (define-syntax foo
>     (syntax-rules ()
>       ((foo) (define x 32))))
> > (foo)
> > x
>
> to see whether I got 32 as the value of x or blew up on an error such
> as invalid syntax (in the define-syntax) or undefined variable. PLT,
> MIT, Gambit, scsh/Scheme48, Guile, SISC, Chez, Ikarus, Mosh blew up;
> Gauche, Chicken, Bigloo, Kawa, SCM, Larceny (in R5RS mode), Scheme 9,
> STklos, sscm, SXM, VSCM, Chibi were fine with it and returned 32.
>

As specified in R5RS:

  * If a macro transformer inserts a binding for an identifier
    (variable or keyword), the identifier will in effect be renamed
    throughout its scope to avoid conflicts with other identifiers.
    Note that a `define' at top level may or may not introduce a
    binding; see the section on Definitions.

So as I understand it, an identifier bound by the macro template itself
should be renamed internally so that it cannot be accessed externally:

(define-syntax foo
  (syntax-rules ()
    ((foo) (define x 32))))

(foo)
; => (define <internal x 292> 32) therefore useless here...

This forces macros that insert top-level definitions to be explicit
and take the names as parameters, like this:

(define-syntax foo
  (syntax-rules ()
    ((foo x) (define x 32))))

I really think this is a Good Thing (I like having explicit, clear and
unambiguous expressions).
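
As a small usage sketch (y is just an illustrative name): because the
identifier now comes from the macro use rather than from the template,
renaming does not hide it, and the definition is visible to the caller.

```scheme
;; The name to define is passed in by the caller, so it is not an
;; identifier inserted by the template and is not renamed away.
(define-syntax foo
  (syntax-rules ()
    ((foo x) (define x 32))))

(foo y)   ; expands to (define y 32)
y         ; => 32
```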

If current implementations of macros make different choices here,
would it be good to reach a consensus?

Best regards,
--
Emmanuel Medernach

Aaron W. Hsu

Apr 15, 2010, 10:40:54 AM
to scheme-re...@googlegroups.com
> > could be very bad for the safety and security of the program.
>
> Can we not assume that people writing web servers will use WG2?

Can we not assume that WG2 will use WG1's language?

In all seriousness, I find the implied line of reasoning here
very...suspect. It's like saying that we don't need to do any array
bounds checking, because the programs you expect to write should be
small enough that it won't matter.

Aaron W. Hsu

Brian Harvey

Apr 15, 2010, 10:51:33 AM
to scheme-re...@googlegroups.com
> In all serious, I find the implied line of reasoning here
> very...suspect. It's like saying that we don't need to do any array
> bounds checking, because the programs you expect to write should be
> small enough that it won't matter.

Any argument can be subject to reductio ad absurdum. But there is a world
of difference between bounds checking and /security/ concerns. If we have
to stand up against willful attacks I think we can give up right now; none
of us afaik have the very special expertise (and mindset) for that.

Aaron W. Hsu

Apr 20, 2010, 8:14:24 PM
to scheme-re...@googlegroups.com
> Aaron W. Hsu scripsit:
>
> > ;;; SRFI 99
> > (define-syntax build-box-type
> > (syntax-rules ()
> > [(_ box-type name getter)
> > (define-record-type box-type #t #t
> > (foo get-foo)
> > (name getter))]))
>
> What this does completely depends on how the Scheme implements syntax
> definitions that expand into definitions only. I tried the following
> program on my usual array of Schemes:
>
> > (define-syntax foo
> (syntax-rules ()
> ((foo) (define x 32))))
> > (foo)
> > x
>
> to see whether I got 32 as the value of x or blew up on an error such
> as invalid syntax (in the define-syntax) or undefined variable. PLT,
> MIT, Gambit, scsh/Scheme48, Guile, SISC, Chez, Ikarus, Mosh blew up;
> Gauche, Chicken, Bigloo, Kawa, SCM, Larceny (in R5RS mode), Scheme 9,
> STklos, sscm, SXM, VSCM, Chibi were fine with it and returned 32.
>
> I haven't tested it, but I would expect that the definitions that SRFI-9(9)
> expands into in the first set of Schemes would be invisible outside the
> define-syntax, whereas they would be visible in the second set.

In my opinion, there is a good reason to consider the Schemes that let
this through wrong. In the general view of hygiene, as I have always
heard it, identifiers that are introduced or bound during the
expansion of a macro should be wrapped anew with their own wraps,
unique to that particular expansion instance. This has important
ramifications for things like:

(define current-index
  (let ([i 0])
    (lambda ()
      (set! i (+ i 1))
      i)))

(define-syntax bind-current-index
  (syntax-rules ()
    [(_ name)
     (begin
       (define dummy (current-index))
       (define (name) dummy))]))

If an implementation does not blow up on your example, then I expect
that it will not do what most people expect when the above macro is
used twice in a row in the same context, such as the top level. In
implementations like Chez and MIT, which do blow up on your
example, the above works correctly, as would be expected: each dummy
remains unique and doesn't "creep" out into the rest of the world. If
the implementation allows the creep, however, then dummy is redefined
every time the macro is expanded in that same scope. This leads either
to a multiple-definition error, or to the dummy binding being
overwritten, so that the bound names all return the same result. This
is not correct behavior.

However, this isn't really the point of my hygiene problem here. It
is a big hygiene problem, but the SRFI-99 code I tested above does not
--- fortunately --- suffer from it whenever the Scheme properly
implements hygiene.
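
For reference, the experiment described above can be written out as
follows. This is a sketch; index-a and index-b are illustrative names,
and the result depends on how the implementation treats the introduced
dummy, so no single output is assumed.

```scheme
;; Counter that hands out 1, 2, 3, ... on successive calls.
(define current-index
  (let ((i 0))
    (lambda ()
      (set! i (+ i 1))
      i)))

;; Binds NAME to a procedure returning the index captured at expansion.
(define-syntax bind-current-index
  (syntax-rules ()
    ((_ name)
     (begin
       (define dummy (current-index))
       (define (name) dummy)))))

(bind-current-index index-a)
(bind-current-index index-b)

;; If each expansion gets its own dummy, the two procedures return
;; different indexes (1 and 2). If dummy creeps out to the top level,
;; the second expansion redefines it (or signals an error), and both
;; procedures return the same value.
(list (index-a) (index-b))
```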

> > SRFI 99 has its syntactic record system based on symbolic field
> > names. This means that there are two fields named foo in the record
> > now. Since all accessors are defined by index in the R6RS version,
> > this isn't a problem. In the second one, however, we now have a data
> > leakage and a hygiene breakage, as we expected that the externally
> > defined get-foo should reference the second field of the record,
> > but instead, it references the first.
>
> In SRFI-99, the last line of the procedural layer definition says:
> "Fields in derived record-types shadow fields of the same name in a
> parent record-type." So that's normal and expected behavior.

That particular behavior --- which I don't like --- aside, this should
not be the expected behavior for a single record which has no parents.
The problem that I illustrated above has nothing to do with parents,
but with shadowing a field within the same level. That is, if you have
a record with a hidden field named X, and it just so happens that
someone uses your syntax to create a record, but provides another
field name that is exactly the same as your hidden field name, then
the macro will break. This *is* broken. It should not be possible to
have this sort of breakage in whatever record system we create.

> > This means that in the R6RS version (as implemented in Chez Scheme),
> > the first gives us the expected return value of 3, but with the SRFI 99
> > implementation, we get 'foo instead.
> >
> > Therefore, in any record system that we define, we must be careful to
> > ensure that accessor do not leak or shadow each other like this. In
> > this example, while functionally disastrous, from a security point of
> > view, this seems somewhat benign. However, if the first field were
> > meant to be a hidden field, or if we instead got an error saying that
> > you can't have two fields named the same thing, you have suddenly
> > leaked information out of your abstraction that could be very bad
> > for the safety and security of the program.
>
> In SRFI-99 the field names are symbols, not identifiers, so they are
> exposed as part of the type.
>
> (If I'm talking completely past your point, as may be the case,
> please let me know.)

My point is that having a one-to-one function mapping symbols to field
indexes in records is a flawed idea. The illustration above
demonstrates that even if you have a field name hidden from the
outside world when creating the record, it is possible to have very
surprising results if you just happen to choose the same field name as
the hidden field name. What happens with SRFI-99 is that the accessor
that is created will grab the first field with that name, and entirely
ignore the other field. This is akin to the problems with DEFMACRO,
where you accidentally capture a name from the outside. The way to
fix this here would be to use GENSYM to create a field name that no
one else would provide. That's totally unacceptable in my book. This
should not happen, and I think that any record system that we
standardize should *not* have this weakness built into the system.
Doing so is basically saying that we don't care about hygiene at all.

Aaron W. Hsu



Aaron W. Hsu

Apr 20, 2010, 8:19:14 PM
to scheme-re...@googlegroups.com
> This enforces "top-level definition inserting" macros to be explicit
> and take names as a parameter, like this:
>
> (define-syntax foo
> (syntax-rules ()
> ((foo x) (define x 32))))
>
> I really think this is a Good Thing (I like having explicit, clear and
> unambiguous expressions)
>
> If current implementations of macros makes different choices here,
> would it be good to set a consensus ?

I believe that R6RS already made this rather clear in its expression
of hygiene, and I would argue that no one really agrees that the other
behavior is a *good* thing, because the breaking of the hygiene only
shows up in select places. So, I would argue that there is already
consensus here, but that it's a "bug" that slips through some
implementations because it only shows up in special instances. It's a
rather important guarantee to make concerning hygiene, and I think we
most certainly should make it clear that, under the correct behavior, a
definition introduced inside of an expansion, whose name was not
supplied by the macro's caller, always gets renamed. The correct
behavior is then to blow up whenever you try to access the internal
definition from outside.

Aaron W. Hsu

Apr 20, 2010, 9:14:45 PM
to scheme-re...@googlegroups.com
If you would like, I can give you plenty of literature and recent work
in this area. However, I consider array bounds checking to be a form
of security. Hygiene is a form of security. Lexical scope is a form of
security. These are all assurances that our programming language gives
us that the program will behave in a certain way reliably. They are
also all abstractions.

Brian Harvey

Apr 21, 2010, 1:28:17 AM
to scheme-re...@googlegroups.com
> I consider array bounds checking to be a form
> of security. Hygiene is a form of security. Lexical scope is a form of
> security.

Ah, perhaps we are misunderstanding each other here. When I hear "security"
I think we are in the realm of Russian mobsters with botnets. Maybe it's
because around here the people who call themselves security researchers,
downstairs, are different from the people who call themselves programming
language researchers, down the hall. (Not that the security people don't
take an interest in programming language features, of course.)

Emmanuel Medernach

May 3, 2010, 10:37:38 AM
to scheme-reports-wg1

On Apr 21, 2:19 am, "Aaron W. Hsu" <arcf...@sacrideo.us> wrote:
> > If current implementations of macros makes different choices here,
> > would it be good to set a consensus ?
>
> I believe that R6RS already made this rather clear in its expression of hygiene,
> and I would argue that no one really agrees that the other behavior is a *good*
> thing, because the breaking of the hygiene only shows up in select places.

Sure

> So, I would argue that there is already consensus here,

Unfortunately not in reality. I mean that different implementations are
using the same name for different things, which is very confusing. As
some implementations seem to keep a different behavior, why not simply
accept that people have different points of view, identify the
differences, and propose to simply have different names?