A couple of questions about Neil's html reader/writer

Thomas Lynch

Jul 28, 2015, 8:50:36 AM
to Neil Van Dyke, racket...@googlegroups.com
Is Neil's xexp format different from the one being used in Racket because the
attribute list begins with an '@'?

(write-html
 '((html (head (title "My Title"))
         (body (@ (bgcolor "white"))
               (h1 "My Heading")
               (p "This is a paragraph.")
               (p "This is another paragraph."))))
 (current-output-port))

Here is an example xexpr from someone's homework assignment (http://web.cs.wpi.edu/~cs1102/a12/Assignments/Hwk7/html.html):
    (form ((action "http://localhost:8088/hello"))
          "What is your first name?"
          (input ((type "text") (name "firstName")))
          (input ((type "submit") (value "Click Here"))))
Here is the syntax for an xexpr, from xexpr? in the reference:

  xexpr = string
        | (list symbol (list (list symbol string) ...) xexpr ...)
        | (cons symbol (list xexpr ...))
        | symbol
        | valid-char?
        | cdata
        | misc

And in this latter syntax, how is the attribute list distinguished from a list of embedded xexprs? Is it due to the nesting in the attribute list?

Another question: the Racket manuals show xexpr->string being used to generate HTML, while Neil has a separate write-html function. Why the divergence?

Thanks in advance!

Neil Van Dyke

Jul 28, 2015, 10:51:25 AM
to Thomas Lynch, racket...@googlegroups.com
In short, it's a historical accident, but the confusion seems less
costly than a compromise would be, IMHO.

Details...

15 years ago, the famous Oleg Kiselyov defined SXML:
http://okmij.org/ftp/Scheme/xml.html

Scheme people in general saw that SXML was good. Perhaps more of a
motivation, we saw that Oleg's SSAX XML parser was solid work that
no one wanted to redo (see the paper links at the above URL). Oleg, Dmitry
Lizorkin, Jim Bender, and others ended up making other XML tools that
used SXML.

My Scheme pragmatic HTML parser used an ad hoc sexp HTML format I'd made
up, but I converted my parser to use the de facto standard, SXML. I
then wrote other tools using SXML.

In parallel, IIRC, Racket (nee PLT Scheme) already had a help/Web
browser at the time, and a continuation-based Web server, and so were
already using their own XML and HTML stuff heavily.

A few/several years ago, I put considerable work into trying to unify
SXML and Racket xexprs. In the end, I threw out that work, when it
appeared that the differences I was trying to get rid of were actually
mostly wins. (One I recall offhand: arbitrary nesting means efficient
splicing in manipulations of large XML documents, as well as not
requiring a programmer to get all their ",@" right in programmatic
constructions of SXML like in Racket xexprs.)
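To make the splicing point concrete, here is a minimal sketch. The names `items` and `item->li` are made up for illustration, and the claim that a nested list of children is acceptable rests on the description of SXML tools above rather than on any particular library's documentation.

```racket
#lang racket/base

;; Hypothetical data and helper, for illustration only.
(define items '("one" "two"))
(define (item->li s) `(li ,s))

;; xexpr style: the generated children must be spliced in with ,@
(define spliced `(ul ,@(map item->li items)))

;; Nested style: the generated list is dropped in whole, with no splice
(define nested `(ul ,(map item->li items)))
```

Here `spliced` is `(ul (li "one") (li "two"))`, while `nested` is `(ul ((li "one") (li "two")))`. The nested form saves a copy of the child list, which is the efficiency argument made above.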

As much as I wanted to resolve the accidental schism between SXML and
xexprs, the value of keeping SXML's advantages seemed greater than the
value of eliminating the recurring SXML-xexpr confusion (of which you
are the latest victim). This discomfort will keep coming up, because
most of the neat HTML and XML tools for Racket actually use SXML, but
other Racket users and documents need xexpr.

I'm currently talking with Oleg about making a couple small tweaks to
the SXML standard, to make it more convenient for manually-written
HTML. I'm trying to bring my tools into strict compliance with standard
SXML. But I have no plans to change anything else about SXML nor change
my tools to use xexpr.

As an engineer, how dumb the situation looks at first glance is
annoying. But not so annoying that I'd push Racket to go to a lot of
work to rip up its carpet, and then take a floor-sander to Racket's
xexpr users.

BTW, even some of the most fervent Racket acolytes (you will know them
by their square brackets) will sneak some SXML, from time to time.

Neil V.

Neil Van Dyke

Jul 28, 2015, 11:03:21 AM
to Thomas Lynch, racket...@googlegroups.com
Oops, I meant to mention Kirill Lisovsky as an early developer of neat
SXML tools, too.

Greg Hendershott

Jul 28, 2015, 4:31:37 PM
to Neil Van Dyke, Thomas Lynch, Racket-Users List
Maybe a dumb question, but:

Imagine conversion functions `xexpr->sxml` and `sxml->xexpr`.

Would implementing them be any easier than unifying xexprs and sxml
(or is it really just the same problem)?

If it turns out there isn't any ideal implementation, is there at
least some pragmatic implementation whose limitations could be clearly
described -- and therefore might still be useful to some people some
of the time?

Neil Van Dyke

Jul 28, 2015, 5:15:19 PM
to Greg Hendershott, Racket-Users List
Greg Hendershott wrote on 07/28/2015 04:30 PM:
> Imagine conversion functions `xexpr->sxml` and `sxml->xexpr`.
>
> Would implementing them be any easier than unifying xexprs and sxml
> (or is it really just the same problem)?

Yes, I think those procedures would be easy to implement in such a way
that they worked well. Seems like a reasonable convenience, given the
situation.

Details...

Conversion is mostly just a simple recursive tree/DAG traversal and some
consing. And you have to decide how to handle XML namespaces (see how
SSAX handles them). Fortunately, you don't have to get into DTDs in
either representation.
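
As a rough illustration of the "recursive traversal and some consing" described here, a minimal `xexpr->sxml` might look like the sketch below. It is only a sketch under assumptions: the helper name `xexpr-attr-list?` is made up, XML namespaces are ignored entirely, and non-element leaves (strings, symbols, numbers, cdata and so on) are passed through unchanged, which shares the strings rather than copying them.

```racket
#lang racket/base

;; Hypothetical helper: a (possibly empty) list whose members are all
;; pairs headed by a symbol is treated as an xexpr attribute list.
(define (xexpr-attr-list? v)
  (and (list? v)
       (andmap (lambda (a) (and (pair? a) (symbol? (car a)))) v)))

(define (xexpr->sxml x)
  (cond
    [(and (pair? x) (symbol? (car x)))
     (define tag (car x))
     (define body (cdr x))
     (define has-attrs? (and (pair? body) (xexpr-attr-list? (car body))))
     (define attrs (if has-attrs? (car body) '()))
     (define kids (map xexpr->sxml (if has-attrs? (cdr body) body)))
     ;; SXML marks the attribute list with a leading @; omit it when empty
     (if (null? attrs)
         (cons tag kids)
         (cons tag (cons (cons '@ attrs) kids)))]
    ;; leaves are shared as-is, not copied
    [else x]))
```

For example, `(xexpr->sxml '(body ((bgcolor "white")) (p "x")))` gives `(body (@ (bgcolor "white")) (p "x"))`.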

Allocations-wise, you can generally share the XML CDATA strings verbatim
between the two representations, without copying, so the allocations
probably won't be too bad.

It's probably not very expensive unless, of course, your XML is so huge
and pair-heavy that a copy pushes you into a hard GC or (shudder!)
swap. (Or if you have a lot of this conversion going on, like in
high-volume server transactions, and it's a significant part of the
time/space/garbage overhead; but if you're doing that, then you are
already solving harder problems than doing these conversions, and you
probably wouldn't need these procedures in the first place.)

For `sxml->xexpr`, note that it might not be obvious that SXML element
attributes can actually appear interspersed with element contents.
You'll have to accumulate element attributes separate from element
contents, and cons attributes onto the front as you finish constructing
a result element.
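
The accumulation described in this paragraph could be sketched as follows. This is a simplified illustration, not a complete converter: the helper name `sxml-attr-list?` is made up, and the sketch ignores namespaces, special nodes such as `*TOP*` or `*COMMENT*`, nested node lists, and attributes written without a value.

```racket
#lang racket/base
(require racket/list)

;; Hypothetical helper: is this node an SXML attribute block, (@ (name value) ...)?
(define (sxml-attr-list? node)
  (and (pair? node) (eq? (car node) '@)))

(define (sxml->xexpr node)
  (cond
    [(pair? node)
     (define tag (car node))
     ;; collect every (@ ...) block, wherever it appears among the contents
     (define attrs
       (append* (map cdr (filter sxml-attr-list? (cdr node)))))
     ;; everything else is element content, converted recursively
     (define kids
       (map sxml->xexpr
            (filter (lambda (n) (not (sxml-attr-list? n))) (cdr node))))
     ;; cons the accumulated attributes onto the front of the result
     (cons tag (cons attrs kids))]
    [else node]))
```

This version always emits an attribute list, even an empty one, so `(sxml->xexpr '(p "x"))` gives `(p () "x")`.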

In the documentation, you'd want to quickly note that the procedures
exist for "historical reasons" (or "interoperation"), because otherwise
smart people might immediately wonder why the procedures exist at all.
And maybe also tell people which representation they should be using by
default, since I think that might be the next question on people's minds.

Neil V.

Matthew Butterick

Jul 28, 2015, 7:05:06 PM
to Thomas Lynch, racket...@googlegroups.com
Yes, more or less. In an X-expression, an attribute list is the only element that's a list made of sublists. A list of embedded X-expressions, OTOH, will start with a symbol. To look at it another way,

(cons symbol (list xexpr ...))

really amounts to

(list symbol xexpr ...)

which is just

(list symbol (list (list symbol string) ...) xexpr ...)

but without the attribute list, cf.

'(p "foo" "bar")

'(p ((style "default")) "foo" "bar")

A recurring annoyance in X-expressions is distinguishing these two cases on input, because the second element can be either an attribute list or nested X-expression. You can use `xexpr-drop-empty-attributes` to force an attribute list (even empty). My `txexpr` package also has utilities for handling them.
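
The two cases can be told apart with a small helper along these lines. This is a sketch only: `attr-list?` and `split-element` are made-up names, not functions from the `xml` library or from `txexpr`.

```racket
#lang racket/base

;; Hypothetical helper: #t when v looks like an xexpr attribute list,
;; that is, a (possibly empty) list of (symbol string) pairs.
(define (attr-list? v)
  (and (list? v)
       (andmap (lambda (a)
                 (and (list? a)
                      (= (length a) 2)
                      (symbol? (car a))
                      (string? (cadr a))))
               v)))

;; Return the tag, attribute list, and children of an element xexpr
;; as three values, supplying '() when no attribute list is present.
(define (split-element x)
  (define tag (car x))
  (define body (cdr x))
  (if (and (pair? body) (attr-list? (car body)))
      (values tag (car body) (cdr body))
      (values tag '() body)))
```

With this, `(split-element '(p "foo" "bar"))` and `(split-element '(p ((style "default")) "foo" "bar"))` return the same tag and children, and differ only in the attribute list.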

Thomas Lynch

Jul 29, 2015, 7:56:43 AM
to Matthew Butterick, racket...@googlegroups.com
I wrote primitive conversion routines to bring Racket's xexpr or Neil's xexp into ... oh gosh, my parser token format, which by coincidence is very close. I'm just playing with this for now. In my target format, a token's children are always other tokens, and all values are given as attributes in value tokens. I use an empty list if there are no attributes. The conversion routines are very simple, though since I'm just playing with this, I suspect there are other cases I missed:

(define (xexpr->tok-tree an-xexpr)
  ;; an attribute list is a non-empty list whose first member is itself a pair
  (define (is-at-list e)
    (and (pair? e)
         (pair? (car e))))
  (cond
    [(null? an-xexpr) an-xexpr]
    [(not (pair? an-xexpr)) (tok-make 'tok:value `((value ,an-xexpr)))] ; actually diff toks for diff types
    [else
     (let ([tag (car an-xexpr)]
           [r1  (cdr an-xexpr)])
       (cond
         [(null? r1) an-xexpr]
         [else
          (let ([first-element (car r1)]
                [r2 (cdr r1)])
            (cond
              [(is-at-list first-element)
               (cons tag (cons first-element (map xexpr->tok-tree r2)))]
              [else
               (cons tag (cons '() (map xexpr->tok-tree r1)))]))]))]))

(define (test-xexpr-tok-tree-0)
  (equal?
    (xexpr->tok-tree
      '(html (head (title "My Title"))
         (body ((bgcolor "white"))
           (h1 "My Heading")
           (p ((style "default")) "This is a paragraph.")

           (p "This is another paragraph."))))
    '(html
       ()
       (head () (title () (tok:value ((value "My Title")))))
       (body
         ((bgcolor "white"))
         (h1 () (tok:value ((value "My Heading"))))
         (p ((style "default")) (tok:value ((value "This is a paragraph."))))
         (p () (tok:value ((value "This is another paragraph.")))))
       )))

Alexander D. Knauth

Jul 29, 2015, 10:56:07 AM
to Thomas Lynch, Matthew Butterick, racket...@googlegroups.com
Would it be easier using match?

(define (xexpr->tok-tree an-xexpr)
  (match an-xexpr
    ['()
     '()]
    [(not (cons _ _))
     (tok-make ...)]
    [(list tag)
     (list tag)]
    [(list-rest tag (? is-at-list at-list) r2)
     ....]
    ....))
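
One possible way to fill in the elided clauses is sketched below. This is a guess at a completion, not necessarily what was intended: `tok-make` here is a stand-in assumed to build `(tag attributes)`, `is-at-list` is copied from the earlier converter, and the final two clause bodies are supplied for illustration.

```racket
#lang racket/base
(require racket/match)

;; Stand-in for the thread's tok-make, assumed to build (tag attributes).
(define (tok-make tag attrs) (list tag attrs))

;; Same test as in the cond/let version: a non-empty list headed by a pair.
(define (is-at-list e)
  (and (pair? e) (pair? (car e))))

(define (xexpr->tok-tree an-xexpr)
  (match an-xexpr
    ['() '()]
    ;; anything that is not a pair becomes a value token
    [(not (cons _ _)) (tok-make 'tok:value `((value ,an-xexpr)))]
    [(list tag) (list tag)]
    ;; attribute list present: keep it, convert the remaining children
    [(list-rest tag (? is-at-list at-list) r2)
     (list* tag at-list (map xexpr->tok-tree r2))]
    ;; no attribute list: insert an empty one
    [(cons tag r1)
     (list* tag '() (map xexpr->tok-tree r1))]))
```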

Thomas Lynch

Jul 29, 2015, 10:25:24 PM
to Matthew Butterick, Neil Van Dyke, racket...@googlegroups.com
Here is the conversion for Neil's xexpr. At this point the two converters can be abstracted by passing in two lambdas, an is-at-list predicate and an extract-at-list function. Neil, can you comment on what other differences I might expect to find?

(define (neil-xexpr->tok-tree an-xexpr)
  ;; here an attribute list is a list headed by the symbol @
  (define (is-at-list e)
    (and (pair? e)
         (eqv? '@ (car e))))
  (define (extract-at-list e)
    (cdr e))
  (cond
    [(null? an-xexpr) an-xexpr]
    [(not (pair? an-xexpr)) (tok-make 'tok:value `((value ,an-xexpr)))]
    [else
     (let ([tag (car an-xexpr)]
           [r1  (cdr an-xexpr)])
       (cond
         [(null? r1) an-xexpr]
         [else
          (let ([first-element (car r1)]
                [r2 (cdr r1)])
            (cond
              [(is-at-list first-element)
               (cons tag (cons (extract-at-list first-element)
                               (map neil-xexpr->tok-tree r2)))]
              [else
               (cons tag (cons '() (map neil-xexpr->tok-tree r1)))]))]))]))

(define (test-neil-xexpr-tok-tree-0)
  (equal?
    (neil-xexpr->tok-tree
      '(html (head (title "My Title"))
               (body (@ (bgcolor "white"))
                     (h1 "My Heading")
                     (p (@ (style "default")) "This is a paragraph.")
                     (p "This is another paragraph."))))
    '(html
       ()
       (head () (title () (tok:value ((value "My Title")))))
       (body
         ((bgcolor "white"))
         (h1 () (tok:value ((value "My Heading"))))
         (p ((style "default")) (tok:value ((value "This is a paragraph."))))
         (p () (tok:value ((value "This is another paragraph.")))))
       )))
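
The abstraction over the two lambdas mentioned above might be sketched like this. The names `make-tok-tree-converter` and `sxml->tok-tree` are made up for illustration, and `tok-make` is a stand-in assumed to build `(tag attributes)`, which matches the shape of the expected trees in the tests.

```racket
#lang racket/base

;; Stand-in for the thread's tok-make, assumed to build (tag attributes).
(define (tok-make tag attrs) (list tag attrs))

;; Build a converter from a predicate that recognizes an attribute list
;; and a function that extracts the (name value) pairs from it.
(define (make-tok-tree-converter is-at-list extract-at-list)
  (define (convert an-xexpr)
    (cond
      [(null? an-xexpr) an-xexpr]
      [(not (pair? an-xexpr)) (tok-make 'tok:value `((value ,an-xexpr)))]
      [else
       (define tag (car an-xexpr))
       (define r1 (cdr an-xexpr))
       (cond
         [(null? r1) an-xexpr]
         [(is-at-list (car r1))
          (cons tag (cons (extract-at-list (car r1)) (map convert (cdr r1))))]
         [else (cons tag (cons '() (map convert r1)))])]))
  convert)

;; Racket xexpr: the attribute list is a bare list of pairs.
(define xexpr->tok-tree
  (make-tok-tree-converter
   (lambda (e) (and (pair? e) (pair? (car e))))
   (lambda (e) e)))

;; SXML style: the attribute list is headed by the symbol @.
(define sxml->tok-tree
  (make-tok-tree-converter
   (lambda (e) (and (pair? e) (eqv? '@ (car e))))
   cdr))
```

Both converters share one traversal, so any additional end conditions only need to be handled in one place.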

Thomas Lynch

Jul 29, 2015, 10:45:57 PM
to Alexander D. Knauth, Matthew Butterick, racket...@googlegroups.com
Alexander, you will notice I pulled a couple of lambdas to the top and added a test routine at the bottom. Take those out, and the code you sent isn't shorter. I am also expecting to have to add more code at the intermediate points. I like seeing those named. I think it is easy to read. What advantage do you see in using match?

So I have been thinking about this and have come to the following conclusion; what do you think? In general, grammar-driven parsing is elegant against well-formed input data. However, a complication enters because the error grammar explodes in size by comparison. Whereas if one builds a match with cascading let (naming the parts) and cond (doing the match), all end conditions are exposed and can be handled. As an example, I will have to check the tag to see if it aliases against tags the parser is using, so that people can't send in poison XML statements. The alternative would be to first run the XML through a well-formedness checker, but that would just move the complication to the checker, just so the parse can look pretty.

In addition, I'm concerned about the stream behavior of Racket's match, something I've promised to work on. Notice what happens if you try to match a pattern delineated by paren literals within nested parens: because it does a greedy match, it will wait for the end of the stream to complete that match using the outermost right paren. So I don't understand match behavior well enough yet to be completely comfortable with it. Independently of that, in this case I suspect I prefer the cascading cond/let match.

Be glad to hear your thoughts on this.  Be glad to learn more about match.

Neil Van Dyke

Jul 29, 2015, 11:02:55 PM
to Thomas Lynch, Matthew Butterick, racket...@googlegroups.com
Thomas Lynch wrote on 07/29/2015 10:25 PM:
> Neil can you comment on what other differences I might expect to find?

Are the below 2 messages to the list helpful?

* Historical background on SXML and Racket xexpr:
https://groups.google.com/d/msg/racket-users/yaOtPkd_qvs/8ruIg-Smr7cJ

* Technical notes on implementing conversion kludge functions between
SXML and Racket xexpr:
https://groups.google.com/d/msg/racket-users/yaOtPkd_qvs/vhvAi90_CgAJ

The historical one should also clarify why not to say "neil-xexpr" in
identifiers. (I'll have to earn some better claim to fame.)

Good luck. I and many others have been using SXML tools happily for
over a decade, and at this point I'm sure we all wish the xexpr people
best of luck with whatever their problem is. :)

Neil V.

Thomas Lynch

Jul 29, 2015, 11:09:30 PM
to Neil Van Dyke, Matthew Butterick, racket...@googlegroups.com
Great!  thanks Neil!

Thomas Lynch

Jul 29, 2015, 11:22:36 PM
to Neil Van Dyke, Matthew Butterick, racket...@googlegroups.com


On Thu, Jul 30, 2015 at 11:09 AM, Thomas Lynch <thomas...@reasoningtechnology.com> wrote:
> Great!  thanks Neil!

Ah spoke too soon! Those links just point back into this very same thread!

The most obvious difference between racket's xexpr and yours is the '@' as the head of the attributes list.  Any idea where else I will see divergences?

Neil Van Dyke

Jul 29, 2015, 11:57:58 PM
to Thomas Lynch, Matthew Butterick, racket...@googlegroups.com
Thomas Lynch wrote on 07/29/2015 11:22 PM:
>
> The most obvious difference between racket's xexpr and yours is the
> '@' as the head of the attributes list. Any idea where else I will
> see divergences?

You might not be able to implement a tool that works correctly with all
conforming SXML, until you read Oleg's SXML spec. It is often helpful
to assume that Oleg is smarter than all of us.

After you understand Oleg's stuff, there might be something in the
implementation notes I posted that is a small bit of additional help.

Or, consider not doing these conversion kludges, after all, and just
pick an HTML/XML representation and stick with that one exclusively.
This problem does not rate the amount of energy being flung at it.
Stupid engineering situations often beget stupid, the stupid can
compound, and pretty soon we've all degenerated to smearing out code on
the wall.
