xml library clarification - "&quot;" symbol parsing


Kira

Nov 21, 2019, 12:51:28 PM
to Racket Users
I have a few questions.

1. Consider XML like this:
<ROOT><A><B1>test</B1><B2>&quot;test qoute&quot;</B2></A></ROOT>


(string->xexpr "<ROOT><A><B1>test</B1><B2>&quot;test qoute&quot;</B2></A></ROOT>")
Will produce this:
'(ROOT () (A () (B1 () "test") (B2 () "\"" "test qoute" "\"")))

The problem is that each &quot; is parsed as a separate item. You can see that they are added to the element's content list as separate strings: "\"" "test qoute" "\""

(read-xml) does the same thing.
And somehow they are converted back to text in the right manner.

I tested sxml, and it produces the right (for me) output.

I feel that this is wrong behavior, because these &quot; symbols are a direct part of one whole element's content and should be read as "\"test qoute\"".

Can someone please explain the reasoning behind this behavior, and can we change it? Perhaps it is important for some other Racket libs?
It is just totally counterintuitive to me, and it creates huge problems with even simple XML parsing. (I have basically been battling the XML lib all day to do the simplest tasks.)

Jay McCarthy

Nov 21, 2019, 8:03:57 PM
to Kira, Racket Users
Hi Kira,

I think this is consistent with how XML is defined. There is a
sequence of character data inside of tags. Character data is
represented by strings in the `xml` library. And there is a sequence
of those.

Jay

--
Jay McCarthy
Associate Professor @ CS @ UMass Lowell
http://jeapostrophe.github.io
Vincit qui se vincit.
> --
> You received this message because you are subscribed to the Google Groups "Racket Users" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to racket-users...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/racket-users/b5047678-0d11-4d00-a1e9-1579ad745e6b%40googlegroups.com.

Matthew Butterick

Nov 21, 2019, 10:18:38 PM
to Kira, Racket Users

On Nov 21, 2019, at 9:51 AM, Kira <peacekee...@gmail.com> wrote:

> I tested sxml, and it produces the right (for me) output.
>
> I feel that this is wrong behavior, because these &quot; symbols are a direct part of one whole element's content and should be read as "\"test qoute\"".
>
> Can someone please explain the reasoning behind this behavior, and can we change it? Perhaps it is important for some other Racket libs?
> It is just totally counterintuitive to me, and it creates huge problems with even simple XML parsing. (I have basically been battling the XML lib all day to do the simplest tasks.)


If you want string elements to be concatenated where possible, you can do that after you parse:

#lang racket
(require xml txexpr rackunit)
(check-equal?
 (let loop ([x (string->xexpr "<ROOT><A><B1>test</B1><B2>&quot;test qoute&quot;</B2></A></ROOT>")])
   (match x
     [(txexpr tag attrs elements) (txexpr tag attrs (loop elements))]
     [(list (? string? strs) ..1 xs ...) (cons (string-join strs "") (loop xs))]
     [(list xs ...) (map loop xs)]
     [x x]))
 '(ROOT (A (B1 "test") (B2 "\"test qoute\""))))


But AFAIK there is nothing in the XML spec that requires XML to be parsed in your preferred style. So it is not "wrong behavior". 

Moreover, if you are writing code that treats '(tag "a" "b") and '(tag "ab") as semantically distinct, that's a code smell, because as XML, they're equivalent:

#lang racket
(require xml rackunit)
(check-equal? "<tag>ab</tag>" (xexpr->string '(tag "a" "b")))
(check-equal? "<tag>ab</tag>" (xexpr->string '(tag "ab")))

Kira

Nov 22, 2019, 12:43:35 AM
to Racket Users
I understand that there is no law that forbids particular ways of parsing XML. :)
My question is on another plane.
Perhaps some other languages have the same implementation?
I am trying to understand what purpose it serves. Was this done intentionally, or is it just a random side effect?

For example, this particular thing hit me when I tried to use (se-path*) and (se-path*/list), because these quotes are just thrown into a plain list with the text. Let's say I can filter them out, but how should I solve this if such quotes appear inside the text?

Yes, they are equivalent by this particular parser/format design, but they are semantically distinct in reality.

The same goes for escaped forms of other characters. I am trying to understand why this particular semantic distinction is taken into account.

I am trying to understand why the xml lib makes this semantic distinction when the format does not define such a thing.
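
For the se-path* concern above, one possible workaround is to join whatever strings come back for a single element. This is only a sketch: `element-text` is a hypothetical helper name, not part of the xml library, and it assumes the path selects exactly one element (with several matching elements, their text would all be merged together).

```racket
#lang racket
(require xml xml/path)

(define xexpr
  (string->xexpr
   "<ROOT><A><B2>&quot;test &quot;qoute2&quot;&quot;</B2></A></ROOT>"))

;; element-text : (Listof Symbol) XExpr -> String
;; Hypothetical helper: joins every string found under `path` into one
;; string, ignoring any non-string content.
(define (element-text path x)
  (string-append* (filter string? (se-path*/list path x))))

(element-text '(A B2) xexpr)
;; => "\"test \"qoute2\"\""
```

Quotes inside the text need no special treatment here, because nothing is filtered out by value; the pieces are simply concatenated in document order.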

Neil Van Dyke

Nov 22, 2019, 8:50:41 AM
to Kira, Racket Users
Kira wrote on 11/22/19 12:43 AM:
> I am trying to understand what purpose it serves. Was this done
> intentionally, or is it just a random side effect?

I suspect it's an implementation decision of the parser, done for
reasons of implementation ease or runtime efficiency.  It's not-unusual
in XML and HTML parsers I've seen.

For example, imagine a parser that has a fast way to scan an input
stream for the next special character (including `&`), and then take
that chunk of all the non-special characters as a string.  That string
can then be used as-is in the parsed representation.  Then a different
mode of the parser starts parsing from the `&`, and ends up adding a
new, different string for the result of that, perhaps coming from a
lookup table.
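
That hypothetical parser can be sketched in a few lines. This is a toy illustration only, not how the `xml` library is actually implemented; `entity-table` and `chunk-text` are made-up names, and a stray `&` that is not part of an entity is simply skipped.

```racket
#lang racket

;; Hypothetical lookup table for the predefined XML entities.
(define entity-table
  (hash "quot" "\"" "amp" "&" "lt" "<" "gt" ">" "apos" "'"))

;; chunk-text : String -> (Listof String)
;; Splits character data into plain runs and entity references, and
;; emits each piece as its own string. No concatenation step is done.
(define (chunk-text s)
  (for/list ([piece (regexp-match* #rx"&[a-z]+;|[^&]+" s)])
    (match piece
      ;; an entity reference: decode it through the lookup table
      [(regexp #rx"^&([a-z]+);$" (list _ name))
       (hash-ref entity-table name piece)]
      ;; a run of ordinary characters: use the chunk as-is
      [_ piece])))

(chunk-text "&quot;test qoute&quot;")
;; => '("\"" "test qoute" "\"")
```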

That hypothetical parser assembling the parsed representation *could*
then concatenate sequences of 2 or more contiguous strings representing
CDATA, but that could be expensive, and might not be needed.  Consider
how large some XML and HTML documents can be, and how little information
out of them is sometimes needed (e.g., price scraper) --
performance-wise, the concatenation might be best left up to whatever
uses that parsed representation.

If you're using a DSL for XML querying, pattern-matching, extraction,
transformation, etc., then you might have the DSL do that concatenation
when worthwhile (e.g., when extracting the content of an element, with
type-checking).  I've implemented such a DSL before.  Or you might do
that concatenation in your application code, as needed.  Or you might
not do the concatenation at all, because, even if you used query tools
to narrow in on the information you wanted, you're streaming it out to
somewhere else, or transforming it in some way that doesn't benefit from
(and even might suffer from) an intermediate concatenation.

Anyway, that's just a quick explanation in answer to your question of
*why* a parser might happen to do it the way you say.  But I agree that
it's not intuitive, and you'd also like to have better off-the-shelf
DSLs for working with that parsed representation.  XML processing is no
longer nearly as popular as in the early days of Racket (PLT Scheme),
which is when most all of the XML tools available for Racket were written.

If you wanted, you could make better tools.  Though be aware that I
think the "market" for XML tools in Racket is even smaller now than it
used to be.  So I suggest only making such for your own reasons, not out
of altruism to help solve this problem for others, nor to "promote"
Racket.  (Racket was promoted by some of the XML and HTML tools earlier,
but not anymore that I'm aware of.)

Sage Gerard

Nov 22, 2019, 10:10:19 AM
to Kira, Racket Users
I'm interested in this quote:

 [...] creates huge problems with even simple XML parsing. (I have basically been battling the XML lib all day to do the simplest tasks.)

I think that when you asked about why the xml collection behaves the way it does, the conversation turned away from your experience.

Could you elaborate on the specifics of what you are trying to do in one of your projects so that we can see your pain points in context?

~slg



Kira

Nov 22, 2019, 8:10:00 PM
to Racket Users
Thank you for your answer.
My guess from the beginning was that this is related to HTML parsing and the web server packages.



Philip McGrath

Nov 22, 2019, 10:09:48 PM
to Neil Van Dyke, Kira, Racket Users
On Fri, Nov 22, 2019 at 8:50 AM Neil Van Dyke <ne...@neilvandyke.org> wrote:
> That hypothetical parser assembling the parsed representation *could*
> then concatenate sequences of 2 or more contiguous strings representing
> CDATA, but that could be expensive, and might not be needed.  Consider
> how large some XML and HTML documents can be, and how little information
> out of them is sometimes needed (e.g., price scraper) --
> performance-wise, the concatenation might be best left up to whatever
> uses that parsed representation.

I think a key point here is that the very features that make a representation of XML ideal for some uses will be troublesome for other uses.

I parse a lot of XML in Racket, and I often wish the x-expression grammar were different in various ways, which basically amount to eliminating artifacts of the concrete syntax: turning numeric entities (`valid-char?`) and the `cdata` struct into strings, plus concatenating contiguous strings. When I wander over toward the front-end, though, I start writing HTML pages as x-expressions, and then I want adjacent strings to be allowed so I can format my code nicely (perhaps with Scribble's at-syntax). If I were writing an XML-aware text editor (at one point I took a few small steps in that direction), I would very much care about the concrete syntax and even source-location information. While I don't personally want this, some people have even wished that x-expressions supported HTML-isms like "boolean attributes."

Of course, these tensions aren't specific to XML: one could also wish for fancier representations of strings than linear (mutable!) sequences of characters, like "ropes"/"cords"/"texts" (a tree representation) or substring "views" that can share storage.
 
To me, the fact that x-expressions are a good-enough representation of XML for a lot of different uses suggests that they're in the right neighborhood for a general-purpose library representation. (As Neil knows, there are also Racket libraries that use a different representation, SXML, that's a fairly close neighbor in the design space.) I particularly like that, when I'm doing the kind of parsing where I want a more normalized representation, I can come up with a subset of the x-expression grammar that meets my needs (and enforce it with memoized contracts) and do a normalization pass: I can rely on stronger invariants internally while still taking full advantage of existing libraries (for x-expressions, lists, etc.).
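
A normalization pass of the kind described above might look roughly like the following. It is a minimal sketch rather than anyone's production code: `merge-strings` and `normalize` are hypothetical names, numeric entities are turned into one-character strings, adjacent strings are concatenated, and `cdata` structs, comments, and symbolic entities are left untouched for brevity.

```racket
#lang racket
(require xml)

;; merge-strings : (Listof XExpr) -> (Listof XExpr)
;; Collapses each run of adjacent strings into a single string.
(define (merge-strings items)
  (cond
    [(null? items) '()]
    [(string? (car items))
     (define-values (strs rest) (splitf-at items string?))
     (cons (string-append* strs) (merge-strings rest))]
    [else (cons (car items) (merge-strings (cdr items)))]))

;; normalize : XExpr -> XExpr
;; Rewrites an x-expression so that every element has an explicit
;; attribute list and no two strings are adjacent in its body.
(define (normalize x)
  (match x
    ;; numeric entity, e.g. 955 for &#955;
    [(? valid-char? n) (string (integer->char n))]
    ;; element with an attribute list (possibly empty)
    [(list (? symbol? tag)
           (list (list (? symbol? ks) (? string? vs)) ...)
           body ...)
     `(,tag ,(map list ks vs) ,@(merge-strings (map normalize body)))]
    ;; element without an attribute list
    [(list (? symbol? tag) body ...)
     `(,tag () ,@(merge-strings (map normalize body)))]
    ;; strings and anything else pass through unchanged
    [_ x]))

(normalize '(B2 () "\"" "test qoute" "\""))
;; => '(B2 () "\"test qoute\"")
```

Code downstream of such a pass can then assume each run of text is a single string, while the result is still an ordinary x-expression that the existing libraries accept.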

-Philip

Kira

Nov 22, 2019, 10:15:29 PM
to Racket Users
I just cannot understand how to solve XML-related problems using this library.
Perhaps there is a lack of examples and no description of the functions' purpose.
And from the bare description I fail to imagine practical use.

For example, why does the (source) struct exist? And how can I use it?

Why are there functions (read-xml/document [in]) and (read-xml/element [in])? And how can I use them?
As I mentioned earlier, my guess was that I could process XML sequentially by using them in tandem, but it seems this guess was wrong.

What is the pattern for navigating inside the (element) structure? For example, getting to ROOT->tagA->tagB?
And I cannot go from tagB back to tagA, right?
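
On the navigation question, one possible pattern (a sketch, not an official idiom) is to walk downward through `element-content`, filtering children by name. `child-elements` is a hypothetical helper introduced here for illustration. The structs hold no reference to their parent, so going from tagB back to tagA means keeping the parent around yourself.

```racket
#lang racket
(require xml)

;; child-elements : element? Symbol -> (Listof element?)
;; Hypothetical helper: the direct children of `parent` named `name`.
(define (child-elements parent name)
  (for/list ([c (element-content parent)]
             #:when (and (element? c) (eq? (element-name c) name)))
    c))

(define doc
  (read-xml (open-input-string "<ROOT><A><B1>test</B1></A></ROOT>")))

(define root (document-element doc))          ; ROOT
(define a    (first (child-elements root 'A))) ; ROOT -> A
(define b1   (first (child-elements a 'B1)))   ; A -> B1

;; The text inside B1 is a pcdata struct in the element's content.
(pcdata-string (first (element-content b1)))
;; => "test"
```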

Let's assume XML like this:
<ROOT>
  <A>
    <B1>test</B1>
    <B2>&quot;test qoute&quot;</B2>
  </A>

  <A>
    <B1>test2</B1>
    <B2>&quot;test &quot;qoute2&quot;&quot;</B2>
  </A>
</ROOT>

I want to transform this into a list of structs (data b1 b2).

And perhaps do this in a sequential manner, since I may need to parse 10 million A tags.

I tried to use (se-path*/list), like this:
(require xml xml/path)
(define rawxml "<ROOT><A><B1>test</B1><B2>&quot;test qoute&quot;</B2></A><A><B1>test2</B1><B2>&quot;test &quot;qoute2&quot;&quot;</B2></A></ROOT>")
(define xexpr (string->xexpr rawxml))
(se-path*/list '(A) xexpr)

But this gives me a plain list:
'((B1 () "test") (B2 () "\"" "test qoute" "\"") (B1 () "test2") (B2 () "\"" "test " "\"" "qoute2" "\"" "\""))

There is no distinction of which A tag each piece of content belongs to.
So I cannot reason about it.

And (se-path*/list '(A B2) xexpr) gives me:
'("\"" "test qoute" "\"" "\"" "test " "\"" "qoute2" "\"" "\"")

So this is even worse.

And it would be great if I could get one list from the beginning, because I have 20 million of these records.

So now I have moved to a (match) solution.
For example:
(match xexpr
  [(list 'ROOT '()
         (list 'A '()
               (list 'B1 '() b1)
               (list 'B2 '() b2 __1)) __1)
   (list b1 b2)]
  [_ 'empty])

Now I am getting somewhere, but I get 2 separate lists again, and I am not sure about the memory efficiency of match (I assume it is efficient).
Can I get one list of structs in this scenario (without manually looping over A tags)? And how efficient is nested (match)?
And now I am beginning to feel that parsing raw XML "by hand" won't be much harder than the solution I am arriving at now. Perhaps this is due to the intrinsic nature of XML itself?



On Friday, November 22, 2019 at 17:10:19 UTC+2, Sage Gerard wrote:

Neil Van Dyke

Nov 23, 2019, 2:06:09 AM
to Kira, Racket Users
Kira wrote on 11/22/19 10:15 PM:
> So now I have moved to a (match) solution.

Last I looked, `match` isn't great for XML, regardless of what
representation the XML is in.

You might want to make a DSL that does exactly what you want.  Don't
expect the off-the-shelf tools to be great -- all the neat ones I can
think of were written by people who moved on to other things over a
decade ago.

But if you're using SXML, and want to try some aging DSLs, two very
useful DSLs to use in combination are Oleg Kiselyov's SXPath, and Jim
Bender's `sxml-match`, and you could look at those for ideas.

One of my unreleased (sorry) XML DSLs was a variation on `sxml-match`,
which supported things like matching unordered sets of sub-elements.

Automatic conversion to better Racket types (e.g., string, number,
date), based on schema or DSL annotations, would've been a nice
convenience.  Racket struct type definitions derived from an XML schema,
in combination with a validating parser, would also be nice.  An
interesting language design problem was for XML transformation DSLs, but
there's less need of XML-to-XML transformation nowadays than was in the
original vision.

If you want better XML tools, it's probably up to you.  And being
empowered to make a DSL that works better for your needs than anything
that exists is half the reason to use Racket or another Lisp.  I'm not
doing any further work with Racket, and am just about to unsubscribe.

Sorawee Porncharoenwase

Nov 23, 2019, 2:43:53 AM
to Kira, Racket Users

I’d probably do this if I were you:

#lang racket

(require xml)

(struct data (b1 b2) #:transparent)

;; extract :: XExpr -> (Listof data?)
(define (extract xexpr)
  (match-define `(ROOT () ,xs ...) xexpr)
  (for/list ([e xs] #:when (list? e))
    (match-define `(A () ,b1 ,b2) e)
    (data b1 b2)))

(define raw-xml #<<EOF
<ROOT>
  <A><B1>test</B1><B2>&quot;test qoute&quot;</B2></A>
  <A><B1>test2</B1><B2>&quot;test &quot;qoute2&quot;&quot;</B2></A>
</ROOT>
EOF
  )

(extract (string->xexpr raw-xml))

Alternatively, you can use this hacky way:

;; extract :: XExpr -> (Listof data?)
;; se-path*/list comes from xml/path, so this also needs (require xml/path)
(define (extract xexpr)
  (for/list ([slice (in-slice 2 (se-path*/list '(A) xexpr))])
    (apply data slice)))

If there really are 10 million A tags, I would suggest using for/stream instead of for/list, which will produce a stream instead of a list.
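
A sketch of that for/stream variant, under the assumption that every A element holds exactly one B1 followed by one B2 (the name `extract/stream` and the shortened sample document are made up for illustration). Note that string->xexpr and se-path*/list still build the whole x-expression and the flat list in memory; only the construction of the `data` structs is deferred.

```racket
#lang racket
(require xml xml/path)

(struct data (b1 b2) #:transparent)

;; extract/stream : XExpr -> (Streamof data?)
;; Like the for/list version, but each struct is built on demand.
(define (extract/stream xexpr)
  (for/stream ([slice (in-slice 2 (se-path*/list '(A) xexpr))])
    (apply data slice)))

(define records
  (extract/stream
   (string->xexpr
    "<ROOT><A><B1>test</B1><B2>x</B2></A><A><B1>test2</B1><B2>y</B2></A></ROOT>")))

(stream-first records)
;; => (data '(B1 () "test") '(B2 () "x"))
```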
