Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

Alternatives to XML?

178 views
Skip to first unread message

Frank Millman

unread,
Aug 24, 2016, 10:59:25 AM8/24/16
to
Hi all

I have mentioned in the past that I use XML for storing certain structures
'off-line', and I got a number of comments urging me to use JSON or YAML
instead.

In fact XML has been working very well for me, but I am looking into
alternatives simply because of the issue of using '>' and '<' in attributes.
I can convert them to '&gt;' and '&lt;', but that imposes a cost in terms of
readability.

Here is a simple example -

<case>
<compare src="_param.auto_party_id" op="is_not" tgt="$None">
<case>
<on_insert>
<auto_gen args="_param.auto_party_id"/>
</on_insert>
<not_exists>
<literal value="&lt;new&gt;"/>
</not_exists>
</case>
</compare>
</case>

This is equivalent to the following python code -

if _param.auto_party_id is not None:
if on_insert:
value = auto_gen(_param.auto_party_id)
elif not_exists:
value = '<new>'

The benefit of serialising it is partly to store it in a database and read
it in at runtime, and partly to allow non-trusted users to edit it without
raising security concerns.

I ran my XML through some online converters, but I am not happy with the
results.

Here is a JSON version -

{
"case": {
"compare": {
"-src": "_param.auto_party_id",
"-op": "is_not",
"-tgt": "$None",
"case": {
"on_insert": {
"auto_gen": { "-args": "_param.auto_party_id" }
},
"not_exists": {
"literal": { "-value": "<new>" }
}
}
}
}
}

I can see how it works, but it does not seem as readable to me. It is not so
obvious that 'compare' has three arguments which are evaluated, and if true
the nested block is executed.

Here is a YAML version -

case:
compare:
case:
on_insert:
auto_gen:
_args: "_param.auto_party_id"
not_exists:
literal:
_value: "<new>"
_src: "_param.auto_party_id"
_op: is_not
_tgt: "$None"

This seems even worse from a readability point of view. The arguments to
'compare' are a long way away from the block to be executed.

Can anyone offer an alternative which is closer to my original intention?

Thanks

Frank Millman


Marko Rauhamaa

unread,
Aug 24, 2016, 11:12:16 AM8/24/16
to
"Frank Millman" <fr...@chagford.com>:
> I have mentioned in the past that I use XML for storing certain
> structures 'off-line', and I got a number of comments urging me to use
> JSON or YAML instead.

JSON is very good.

> In fact XML has been working very well for me, but I am looking into
> alternatives simply because of the issue of using '>' and '<' in
> attributes. I can convert them to '&gt;' and '&lt;', but that imposes
> a cost in terms of readability.

Precise syntax requires escaping. In fact, XML is not good enough at it.
XML is far too much and too little at the same time.

> Here is a simple example -
>
> <case>
> <compare src="_param.auto_party_id" op="is_not" tgt="$None">
> <case>
> <on_insert>
> <auto_gen args="_param.auto_party_id"/>
> </on_insert>
> <not_exists>
> <literal value="&lt;new&gt;"/>
> </not_exists>
> </case>
> </compare>
> </case>

> Can anyone offer an alternative which is closer to my original intention?

There's S-expressions:

(case
(compare #:src "_param.auto_party_id"
#:op "is_not"
#:tgt #f
(case
#:on-insert (auto-gen "_param.auto_party_id")
#:not-exists (literal "<new>"))))


Marko

alister

unread,
Aug 24, 2016, 11:21:43 AM8/24/16
to
On Wed, 24 Aug 2016 16:58:54 +0200, Frank Millman wrote:

> Hi all
>
> I have mentioned in the past that I use XML for storing certain
> structures 'off-line', and I got a number of comments urging me to use
> JSON or YAML instead.
>
> In fact XML has been working very well for me, but I am looking into
> alternatives simply because of the issue of using '>' and '<' in
> attributes.
> I can convert them to '&gt;' and '&lt;', but that imposes a cost in
> terms of readability.
>

are these files expected to be read/written by a human being or are they
for your application to save & restore its settings?

if the former then you probably need to choose a specification/format
that was designed to be human readable form the start (such as the
old .ini format)

if it is primarily for your app then you need a format that efficiently &
accurately saves the data you require, readability is a secondary (but
still desirable) requirement for debugging & the rare case where a manual
change is req.

XLM is bulky with lots of redundant information & still not readily
readable without extra tools.
Json is quite terse but I find it quite readable & is well suited for
saving most data structures
pickle can save the data efficiently but is certainly not readable


--
Even a blind pig stumbles upon a few acorns.

Peter Otten

unread,
Aug 24, 2016, 12:51:21 PM8/24/16
to
Frank Millman wrote:

> I have mentioned in the past that I use XML for storing certain structures
> 'off-line', and I got a number of comments urging me to use JSON or YAML
> instead.
>
> In fact XML has been working very well for me, but I am looking into
> alternatives simply because of the issue of using '>' and '<' in
> attributes. I can convert them to '&gt;' and '&lt;', but that imposes a
> cost in terms of readability.
>
> Here is a simple example -
>
> <case>
> <compare src="_param.auto_party_id" op="is_not" tgt="$None">
> <case>
> <on_insert>
> <auto_gen args="_param.auto_party_id"/>
> </on_insert>
> <not_exists>
> <literal value="&lt;new&gt;"/>
> </not_exists>
> </case>
> </compare>
> </case>
>
> This is equivalent to the following python code -
>
> if _param.auto_party_id is not None:
> if on_insert:
> value = auto_gen(_param.auto_party_id)
> elif not_exists:
> value = '<new>'

I think we have a winner here ;)

> The benefit of serialising it is partly to store it in a database and read
> it in at runtime, and partly to allow non-trusted users to edit it without
> raising security concerns.

If you store what is basically a script as XML or JSON, why is that safer
than Python or Javascript?



Rob Gaddi

unread,
Aug 24, 2016, 1:12:44 PM8/24/16
to
You've been staring at that XML too long; you've become familiar with
it. It's just as unreadable as the JSON or the YAML unless you already
know what it says.

That said, one core difference between XML and the other two is that XML
allows for the idea that order matters. JSON and YAML both consider
key/value pair objects in the same way Python does a dict -- there is AN
order because there has to be one, but it's not expected to be
important or preserved. And when you start talking about "close to"
you're talking about preserving ordering.

--
Rob Gaddi, Highland Technology -- www.highlandtechnology.com

Email address domain is currently out of order. See above to fix.

Thomas 'PointedEars' Lahn

unread,
Aug 24, 2016, 1:30:29 PM8/24/16
to
Frank Millman wrote:

> I have mentioned in the past that I use XML for storing certain structures
> 'off-line', and I got a number of comments urging me to use JSON or YAML
> instead.
>
> In fact XML has been working very well for me, but I am looking into
> alternatives simply because of the issue of using '>' and '<' in
> attributes. I can convert them to '&gt;' and '&lt;', but that imposes a
> cost in terms of readability.

What does this have to do with Python?

--
PointedEars

Twitter: @PointedEars2
Please do not cc me. / Bitte keine Kopien per E-Mail.

Chris Angelico

unread,
Aug 24, 2016, 9:57:13 PM8/24/16
to
On Thu, Aug 25, 2016 at 2:50 AM, Peter Otten <__pet...@web.de> wrote:
>> if _param.auto_party_id is not None:
>> if on_insert:
>> value = auto_gen(_param.auto_party_id)
>> elif not_exists:
>> value = '<new>'
>
> I think we have a winner here ;)

Agreed.

http://thedailywtf.com/articles/The_Enterprise_Rules_Engine

If you create a non-code way to do code, you either have to make it
far FAR simpler than code, or... you should just use code.

Use code.

ChrisA

Frank Millman

unread,
Aug 25, 2016, 1:34:08 AM8/25/16
to
"Frank Millman" wrote in message news:npkcnf$kq7$1...@blaine.gmane.org...

> Hi all

> I have mentioned in the past that I use XML for storing certain structures
> 'off-line', and I got a number of comments urging me to use JSON or YAML
> instead.

> Can anyone offer an alternative which is closer to my original intention?

Many thanks for the replies. I will respond to them all in one place.

@alister
"are these files expected to be read/written by a human being or are they
for your application to save & restore its settings?"

Good question.

My project is a business/accounting system. It provides a number of tables
and columns pre-defined, but it also allows users to create their own. I
allow them to define business rules to be invoked at various points.
Therefore it must be in a format which is readable/writable by humans, but
executable at runtime by my program.

.ini is an interesting idea - I will look into it

@rob
"You've been staring at that XML too long; you've become familiar with
it. It's just as unreadable as the JSON or the YAML unless you already
know what it says."

Good comment! I am sure you are right. Whichever format I settle on, I will
have to provide some sort of cheat-sheet explaining to non-technical users
what the format is.

Having said that, I do find the XML more readable in the sense that the
attributes are closely linked with their elements. I think it is easier to
explain what <compare src="_param.auto_party_id" op="is_not" tgt="$None"> is
doing than the equivalent in JSON or YAML.

@Marko
I have never heard of S-expressions before, but they look interesting. I
will investigate further.

@Peter/Chris
I don't understand - please explain.

If I store the business rule in Python code, how do I prevent untrusted
users putting malicious code in there? I presume I would have to execute the
code by calling eval(), which we all know is dangerous. Is there another way
of executing it that I am unaware of?

Frank


Chris Angelico

unread,
Aug 25, 2016, 1:46:21 AM8/25/16
to
On Thu, Aug 25, 2016 at 3:33 PM, Frank Millman <fr...@chagford.com> wrote:
> @Peter/Chris
> I don't understand - please explain.
>
> If I store the business rule in Python code, how do I prevent untrusted
> users putting malicious code in there? I presume I would have to execute the
> code by calling eval(), which we all know is dangerous. Is there another way
> of executing it that I am unaware of?

The real question is: How malicious can your users be?

If the XML file is stored adjacent to the Python script that runs it,
anyone who can edit one can edit the other. Ultimately, that means
that (a) any malicious user can simply edit the Python script, and
therefore (b) anyone who's editing the other file is not malicious.

If that's not how you're doing things, give some more details of what
you're trying to do. How are you preventing changes to the Python
script? How frequent will changes be? Can you simply put all changes
through a git repository and use a pull request workflow to ensure
that a minimum of two people eyeball every change?

ChrisA

Marko Rauhamaa

unread,
Aug 25, 2016, 1:59:59 AM8/25/16
to
"Frank Millman" <fr...@chagford.com>:
> If I store the business rule in Python code, how do I prevent
> untrusted users putting malicious code in there? I presume I would
> have to execute the code by calling eval(), which we all know is
> dangerous. Is there another way of executing it that I am unaware of?

This is a key question.

A couple of days back I stated the principle that a programming language
is better than a rule language. That principle is followed by
PostScript printers, Java applets, web pages with JavaScript, emacs
configuration files etc. The question is how do you get the desired
benefits without opening the door to sabotage. You have to shield CPU
usage, memory usage, disk access, network access etc.

You can google for solutions with search terms such as "python sandbox",
"linux sandbox" and "linux container sandbox".


Marko

Marko Rauhamaa

unread,
Aug 25, 2016, 2:05:36 AM8/25/16
to
Chris Angelico <ros...@gmail.com>:
> The real question is: How malicious can your users be?

Oh, yes, the simple way to manage the situation is for the server to
call seteuid() before executing the code after authenticating the user.


Marko

Frank Millman

unread,
Aug 25, 2016, 2:12:12 AM8/25/16
to
"Chris Angelico" wrote in message
news:CAPTjJmq2bcQPmQ9itVvZrBZJPcbYe5z6vDpKGYQj=8H+q...@mail.gmail.com...

On Thu, Aug 25, 2016 at 3:33 PM, Frank Millman <fr...@chagford.com> wrote:
> @Peter/Chris
> > I don't understand - please explain.
> >
> > If I store the business rule in Python code, how do I prevent untrusted
> > users putting malicious code in there? I presume I would have to execute
> > the
> > code by calling eval(), which we all know is dangerous. Is there another
> > way
> > of executing it that I am unaware of?

> The real question is: How malicious can your users be?

> If the XML file is stored adjacent to the Python script that runs it,
> anyone who can edit one can edit the other. Ultimately, that means that
> (a) any malicious user can simply edit the Python script, and therefore
> (b) anyone who's editing the other file is not malicious.

> If that's not how you're doing things, give some more details of what
> you're trying to do. How are you preventing changes to the Python script?
> How frequent will changes be? Can you simply put all changes through a git
> repository and use a pull request workflow to ensure that a minimum of two
> people eyeball every change?

All interaction with users is via a gui. The database contains tables that
define the database itself - tables, columns, form definitions, etc. These
are not purely descriptive, they drive the entire system. So if a user
modifies a definition, the changes are immediate.

Does that answer your question? I can go into a lot more detail, but I am
not sure where to draw the line.

Frank





Chris Angelico

unread,
Aug 25, 2016, 2:30:04 AM8/25/16
to
Sounds to me like you have two very different concerns, then. My
understanding of "GUI" is that it's a desktop app running on the
user's computer, as opposed to some sort of client/server system - am
I right?

1) Malicious users, as I describe above, can simply mess with your
code directly, or bypass it and talk to the database, or anything. So
you can ignore them.

2) Non-programmer users, without any sort of malice, want to be able
to edit these scripts but not be caught out by a tiny syntactic
problem.

Concern #2 is an important guiding principle. You need your DSL to be
easy to (a) read/write, and (b) verify/debug. That generally means
restricting functionality some, but you don't have to dig into the
nitty-gritty of security. If someone figures out a way to do something
you didn't intend to be possible, no big deal; but if someone CANNOT
figure out how to use the program normally, that's a critical failure.

So I would recommend making your config files as simple and clean as
possible. That might mean Python code; it does not mean XML, and
probably not JSON either. It might mean YAML. It might mean creating
your own DSL. It might be as simple as "Python code with a particular
style guide". There are a lot of options. Here's your original XML and
Python code:

<case>
<compare src="_param.auto_party_id" op="is_not" tgt="$None">
<case>
<on_insert>
<auto_gen args="_param.auto_party_id"/>
</on_insert>
<not_exists>
<literal value="&lt;new&gt;"/>
</not_exists>
</case>
</compare>
</case>

if _param.auto_party_id is not None:
if on_insert:
value = auto_gen(_param.auto_party_id)
elif not_exists:
value = '<new>'

Here's a very simple format, borrowing from RFC822 with a bit of Python added:

if: _param.auto_party_id != None
if: on_insert
value: =auto_gen(_param.auto_party_id)
elif: not_exists
value: <new>

"if" and "elif" expect either a single boolean value, or two values
separated by a known comparison operator (==, !=, <, >, etc). I'd
exclude "is" and "is not" for parser simplicity. Any field name that
isn't a keyword (in this case "value:") sets something to the given
value, or evals the given string if it starts with an equals sign.

You can mess around with the details all you like, but this is a
fairly simple format (every line consists of "keyword: value" at some
indentation level), and wouldn't be too hard to parse. The use of
eval() means this is assuming non-malicious usage, which is (as
mentioned above) essential anyway, as a full security check on this
code would be virtually impossible.

And, as mentioned, Python itself makes a fine DSL for this kind of thing.

ChrisA

Frank Millman

unread,
Aug 25, 2016, 3:13:40 AM8/25/16
to
"Chris Angelico" wrote in message
news:CAPTjJmof_sXqax0Ury5LsBEj7cdFv92WiWKbfvAC+bM=Hwt...@mail.gmail.com...

> Sounds to me like you have two very different concerns, then. My
> understanding of "GUI" is that it's a desktop app running on the user's
> computer, as opposed to some sort of client/server system - am I right?

Not exactly, but the difference in not important, as you have got the
essentials below spot on.

For the record, the server runs an HTTP server, and anyone on the LAN/WAN
can access the system using a browser. Because the tables that define the
database are stored in the database itself, there is no difference between a
form that allows a user to capture an invoice, and a form that allows a user
to modify a column definition. It is all controlled through permissions, but
technically they are identical.

> 1) Malicious users, as I describe above, can simply mess with your code
> directly, or bypass it and talk to the database, or anything. So you can
> ignore them.

Absolutely. If an organisation running my system wants to be secure, they
should keep the server in a locked room only accessible by a trusted system
administrator.

> 2) Non-programmer users, without any sort of malice, want to be able to
> edit these scripts but not be caught out by a tiny syntactic problem.

Now we are getting to the nitty-gritty.

[snip some good comments]

> Here's a very simple format, borrowing from RFC822 with a bit of Python
> added:

if: _param.auto_party_id != None
if: on_insert
value: =auto_gen(_param.auto_party_id)
elif: not_exists
value: <new>

Getting close, but it is not *quite* that simple.

For example, having isolated the LHS of the if clause, I process it
something like this -

if source.startswith("'"):
source_value = source[1:-1]
elif '.' in source:
source_objname, source_colname = source.split('.', 1)
source_record = caller.data_objects[source_objname]
source_value = source_record.getval(source_colname)
elif source == '$None':
source_value = None
elif source == '$True':
source_value = True
elif source == '$False':
source_value = False
elif source.startswith('int('):
source_value = int(source[4:-1])

Anyway, you have isolated the essential issue. I need a DSL which is easy
for a non-technical user to read/write, and easy to verify that it is
achieving the desired result.

I suspect that this is quite challenging whatever format I use. Up to now I
have been using XML, and it works for me. As Rob pointed out, I have become
too comfortable with it to be objective, but no-one has yet convinced me
that the alternatives are any better. I may eventually end up with an
additional layer that prompts the user through their requirement in 'wizard'
style, and generates the underlying XML (or whatever) automatically.

Frank


Peter Otten

unread,
Aug 25, 2016, 6:07:21 AM8/25/16
to
It may be a little more work, but whatever check you can think of for your
XML can also be applied to a piece of Python code. For example the above
might become

$ cat eval_source.py
import ast


def source_value_dot(source_objname, source_colname):
return "<{}.{}>".format(source_objname, source_colname)
# may become:
source_record = caller.data_objects[source_objname]
return source_record.getval(source_colname)


def get_source(node):
# Modeled after ast.literal_eval()

def _convert(node):
if isinstance(node, ast.Expression):
return _convert(node.body)
elif isinstance(node, ast.Str):
return node.s
elif isinstance(node, ast.Attribute):
if not isinstance(node.value, ast.Name):
raise ValueError("only one dot, please", node)
name = node.value.id
column = node.attr
return source_value_dot(name, column)
elif isinstance(node, ast.NameConstant):
return node.value
elif isinstance(node, ast.Num):
n = node.n
if not isinstance(n, int):
raise ValueError("only integers, please", node)
return n
else:
raise ValueError("oops", node)
return _convert(node)


EXAMPLES = """\
None
True
42
3.4
bar.baz
foo.bar.baz
'whatever.you.want'
""".splitlines()

for source in EXAMPLES:
e = ast.parse(source, mode="eval")
try:
result = get_source(e)
except ValueError as e:
result = "#error: {}".format(e.args[0])
print(source, "-->", result)

$ python3 eval_source.py
None --> None
True --> True
42 --> 42
3.4 --> #error: only integers, please
bar.baz --> <bar.baz>
foo.bar.baz --> #error: only one dot, please
'whatever.you.want' --> whatever.you.want

Of course you could also use a slightly more general approach (use a
whitelist of function names, feed dotted names to a function etc.), so that
the resulting ast can safely be fed to exec().

Whatever you do, the most likely problem you may run into is annoyed users
asking "This looks like Python, why can't I..."

Chris Angelico

unread,
Aug 25, 2016, 6:07:44 AM8/25/16
to
On Thu, Aug 25, 2016 at 5:13 PM, Frank Millman <fr...@chagford.com> wrote:
> "Chris Angelico" wrote in message
> news:CAPTjJmof_sXqax0Ury5LsBEj7cdFv92WiWKbfvAC+bM=Hwt...@mail.gmail.com...
>
>> Sounds to me like you have two very different concerns, then. My
>> understanding of "GUI" is that it's a desktop app running on the user's
>> computer, as opposed to some sort of client/server system - am I right?
>
>
> Not exactly, but the difference in not important, as you have got the
> essentials below spot on.

Cool. The essential difference, in this case, being where the edits
are happening.

> Absolutely. If an organisation running my system wants to be secure, they
> should keep the server in a locked room only accessible by a trusted system
> administrator.

Right. That saves a ton of effort for you - it's no longer about
malicious users.

>> Here's a very simple format, borrowing from RFC822 with a bit of Python
>> added:
>
>
> if: _param.auto_party_id != None
> if: on_insert
> value: =auto_gen(_param.auto_party_id)
> elif: not_exists
> value: <new>
>
> Getting close, but it is not *quite* that simple.
>
> For example, having isolated the LHS of the if clause, I process it
> something like this -
>
> if source.startswith("'"):
> source_value = source[1:-1]

I don't like this; if you're going to surround strings with quotes,
you either need some sort of escaping mechanism, or disallow the same
quote character from coming up inside. (At very least, demand that the
string ends with an apostrophe too, or you'll have a lot of VERY
confusing failures if someone omits the closing apostrophe. Make it an
instant error instead.)

> elif '.' in source:
> source_objname, source_colname = source.split('.', 1)
> source_record = caller.data_objects[source_objname]
> source_value = source_record.getval(source_colname)

Fair enough; your dot notation is restricted to "this column from this
thing", where a thing effectively identifies a table. Though I'm
beginning to get the idea that the snippet you're showing here has
elided all error checking - assuming that you actually do have that
checking in the real version.

> elif source == '$None':
> source_value = None
> elif source == '$True':
> source_value = True
> elif source == '$False':
> source_value = False

If you're looking for specific strings, you shouldn't need to adorn
them. Point of note: "None" is a peculiarly Python concept, so anyone
who's using it is going to need to understand its semantics from a
Python perspective. If None is significant to your code, it's probably
semantically important, and that might cause issues if anyone doesn't
know code. And if everyone who edits this file knows Python, well,
we're back to "why not just let them write in Python".

> elif source.startswith('int('):
> source_value = int(source[4:-1])

Hrm. This is where I'd move away from Python syntax. If you're only
allowing a scant few typecasts, I'd go instead for an SQL-like syntax
eg "some_value::int", or switch it around as "int::some_value", and
make it clearly NOT a function call. Otherwise, I'd have a very simple
"drop to Python" mode, which would be handled something like this:

elif source.startswith("="):
source_value = eval(source[1:])

> Anyway, you have isolated the essential issue. I need a DSL which is easy
> for a non-technical user to read/write, and easy to verify that it is
> achieving the desired result.

Yes; or, if not easy to verify the result, at least make it easy to
verify the syntax, even in the face of errors.

> I suspect that this is quite challenging whatever format I use. Up to now I
> have been using XML, and it works for me. As Rob pointed out, I have become
> too comfortable with it to be objective, but no-one has yet convinced me
> that the alternatives are any better. I may eventually end up with an
> additional layer that prompts the user through their requirement in 'wizard'
> style, and generates the underlying XML (or whatever) automatically.

That's one option. At that point, the storage format becomes opaque to
humans - it's just an interchange format between two programs. As
such, I would advocate the use of JSON.

Of course, you still need to build that UI, which is a lot of work.
You'll also have to have GUI controls on your wizard for every
possible option, which makes options very expensive (in contrast to a
textual language, in which options are far cheaper) - every user has
to see every option every time. So you have to strike a *very* careful
balance between expressiveness/power and simplicity, and it's entirely
possible to fall down on both fronts.

ChrisA

Frank Millman

unread,
Aug 25, 2016, 7:23:46 AM8/25/16
to
"Frank Millman" wrote in message news:nplvvl$ci2$1...@blaine.gmane.org...

> Hi all

> I have mentioned in the past that I use XML for storing certain structures
> 'off-line', and I got a number of comments urging me to use JSON or YAML
> instead.

> Can anyone offer an alternative which is closer to my original intention?

Thanks to Chris and Peter for their additional input - much appreciated.

At the risk of disappointing some of you, this is how I am going to proceed.

1. I am writing tools to make it easier to develop business systems, and at
the same time I am developing a business system. As the tools mature I am
spending more time on the system, but as I work on the system I am finding
shortcomings in the tools, so I am bouncing between the two at the moment.

2. There are many areas of the tools which other users will find confusing
at first and will require explanations and documentation. I am more than
ready to make changes based on the reactions I get. The subject of this
thread is one small part of this.

3. My immediate priority is to finish the business system, get it out there,
and get feedback. Hopefully other users will then start dabbling with the
tools and provide feedback there as well.

4. As I have said already, for good or ill, I am comfortable with my current
use of XML, so I do not have a pressing need to change to anything else. The
problem that prompted this thread was the issue of storing '<' and '>' in
attributes. I have come up with a simple workaround - pass the XML through a
function between the database and the gui, converting from '&gt;' to '>' in
one direction, and back to '&gt;' in the other. It works.

5. I have learned a lot from this thread, but for now it is staying in the
back of my mind. If I ever get my project to the point where I need to move
it to the front, I will know that I am getting somewhere!

Thanks again to all.

Frank


Chris Angelico

unread,
Aug 25, 2016, 7:40:20 AM8/25/16
to
On Thu, Aug 25, 2016 at 9:23 PM, Frank Millman <fr...@chagford.com> wrote:
>
> At the risk of disappointing some of you, this is how I am going to proceed.
>
> 4. As I have said already, for good or ill, I am comfortable with my current
> use of XML, so I do not have a pressing need to change to anything else. The
> problem that prompted this thread was the issue of storing '<' and '>' in
> attributes. I have come up with a simple workaround - pass the XML through a
> function between the database and the gui, converting from '&gt;' to '>' in
> one direction, and back to '&gt;' in the other. It works.

Should be fine, as long as the actual XML file has &lt; in it. It
won't disappoint me at all, if the XML file is (primarily) a transport
format between programs; just make sure it's always valid XML, rather
than some "XML-like" file structure. In a greenfield project, I would
advise strongly against this, but since you already have something to
work with, it's not worth changing everything up.

Know that every decision has consequences, and make your decisions
with open eyes. I'm not disappointed by someone who, with greater
knowledge of the situation and the likely consequences of various
choices, chooses something different from what I, from thousands of
miles away, recommended :)

ChrisA

Peter Otten

unread,
Aug 25, 2016, 9:58:57 AM8/25/16
to
Frank Millman wrote:

> "Frank Millman" wrote in message news:nplvvl$ci2$1...@blaine.gmane.org...
>
>> Hi all
>
>> I have mentioned in the past that I use XML for storing certain
>> structures 'off-line', and I got a number of comments urging me to use
>> JSON or YAML instead.
>
>> Can anyone offer an alternative which is closer to my original intention?
>
> Thanks to Chris and Peter for their additional input - much appreciated.
>
> At the risk of disappointing some of you, this is how I am going to
> proceed.

'Tis too late for me to stop ;)

> 1. I am writing tools to make it easier to develop business systems, and
> at the same time I am developing a business system. As the tools mature I
> am spending more time on the system, but as I work on the system I am
> finding shortcomings in the tools, so I am bouncing between the two at the
> moment.
>
> 2. There are many areas of the tools which other users will find confusing
> at first and will require explanations and documentation. I am more than
> ready to make changes based on the reactions I get. The subject of this
> thread is one small part of this.
>
> 3. My immediate priority is to finish the business system, get it out
> there, and get feedback. Hopefully other users will then start dabbling
> with the tools and provide feedback there as well.
>
> 4. As I have said already, for good or ill, I am comfortable with my
> current use of XML, so I do not have a pressing need to change to anything
> else. The problem that prompted this thread was the issue of storing '<'
> and '>' in attributes. I have come up with a simple workaround - pass the
> XML through a
> function between the database and the gui, converting from '&gt;' to '>'
> in one direction, and back to '&gt;' in the other. It works.

As you have to keep the "<", why bother?

> 5. I have learned a lot from this thread, but for now it is staying in the
> back of my mind. If I ever get my project to the point where I need to
> move it to the front, I will know that I am getting somewhere!

At that point you may also look at my messy/buggy/incomplete attempt to
convert between xml and python:

$ cat convert.py
import ast
from xml.etree import ElementTree as etree

XML = """\
<case>
<compare src="_param.auto_party_id" op="is_not" tgt="$None">
<case>
<on_insert>
<auto_gen args="_param.auto_party_id"/>
</on_insert>
<not_exists>
<literal value="&lt;new&gt;"/>
</not_exists>
</case>
</compare>
</case>
"""

tree = etree.fromstring(XML)

TARGETS = {
"$None": "None",
}
RTARGETS = {
None: "$None",
}
assert len(RTARGETS) == len(TARGETS)

OPS = {
"is_not": "is not",
}
ROPS = {ast.IsNot: "is_not"}
assert len(ROPS) == len(OPS)

NAMES = {"on_insert", "not_exists"}
FUNCS = {"auto_gen"}


def getchildren(elem):
yield from elem.getchildren()


def to_python(elem, indent=""):
# XXX build an AST rather than source code.
if elem.tag == "compare":
yield "{}if {} {} {}:".format(
indent,
elem.attrib["src"],
OPS[elem.attrib["op"]],
TARGETS[elem.attrib["tgt"]]
)
[child] = getchildren(elem)
yield from to_python(child, indent)
elif elem.tag in NAMES:
yield "{}if {}:".format(indent, elem.tag)
for child in getchildren(elem):
yield from to_python(child, indent + " ")
elif elem.tag == "case":
for child in getchildren(elem):
yield from to_python(child, indent + " ")
elif elem.tag == "literal":
yield "{}value = {!r}".format(indent, elem.attrib["value"])
elif elem.tag in FUNCS:
yield "{}auto_gen({})".format(indent, elem.attrib["args"])
else:
raise ValueError("Unknown tag {!r}".format(elem.tag))


def dotted(node):
if isinstance(node, ast.Attribute):
return dotted(node.value) + "." + node.attr
else:
return node.id


def to_xml(python):
module = ast.parse(python)
[body] = module.body
root = etree.Element("case")

def _convert(node, parent):
if isinstance(node, ast.If):
test = node.test
if isinstance(test, ast.Compare):
compare = etree.Element("compare")
[op] = test.ops
compare.attrib["src"] = dotted(test.left)
compare.attrib["op"] = ROPS[type(op)]
right = test.comparators[0].value
compare.attrib["tgt"] = RTARGETS[right]
case = etree.Element("case")
compare.append(case)
parent.append(compare)
for child in node.body:
_convert(child, case)
elif isinstance(test, ast.Name):
ename = etree.Element(test.id)
parent.append(ename)
for child in node.body:
_convert(child, ename)
elif isinstance(node, ast.Expr):
evalue = node.value
evalue.func.id
invoke = etree.Element(evalue.func.id)
invoke.attrib["args"] = dotted(evalue.args[0])
parent.append(invoke)
elif isinstance(node, ast.Assign):
assign = etree.Element("literal")
[target] = node.targets
assign.attrib["value"] = node.value.s
parent.append(assign)
else:
global x
x = node
exit("unhandled")
_convert(body, root)
return root


# http://stackoverflow.com/questions/17402323/
# use-xml-etree-elementtree-to-write-out-nicely-formatted-xml-files
ElementTree = etree
from xml.dom import minidom


def prettify(elem):
"""Return a pretty-printed XML string for the Element.
"""
rough_string = ElementTree.tostring(elem, 'utf-8')
reparsed = minidom.parseString(rough_string)
return reparsed.toprettyxml(indent=" ")

print("XML...")
print(XML)
print("\nbecomes Python...")
python = "\n".join(to_python(next(getchildren(tree))))
print(python)
print("\nbecomes XML:")
root = to_xml(python)
print(prettify(root))
$ python3 convert.py
XML...
<case>
<compare src="_param.auto_party_id" op="is_not" tgt="$None">
<case>
<on_insert>
<auto_gen args="_param.auto_party_id"/>
</on_insert>
<not_exists>
<literal value="&lt;new&gt;"/>
</not_exists>
</case>
</compare>
</case>


becomes Python...
if _param.auto_party_id is not None:
if on_insert:
auto_gen(_param.auto_party_id)
if not_exists:
value = '<new>'

becomes XML:
<?xml version="1.0" ?>
<case>
<compare op="is_not" src="_param.auto_party_id" tgt="$None">

Frank Millman

unread,
Aug 25, 2016, 10:46:23 AM8/25/16
to
"Peter Otten" wrote in message news:npmti0$qvu$1...@blaine.gmane.org...

> Frank Millman wrote:

> > At the risk of disappointing some of you, this is how I am going to
> > proceed.

> 'Tis too late for me to stop ;)

> > The problem that prompted this thread was the issue of storing '<' and
> > '>' in attributes. I have come up with a simple workaround - pass the
> > XML through a function between the database and the gui, converting from
> > '&gt;' to '>' in one direction, and back to '&gt;' in the other. It
> > works.

> As you have to keep the "<", why bother?

If you mean why don't I convert the '<' to '&lt;', the answer is that I do -
I just omitted to say so. However, explicit is better than implicit :-)

> > If I ever get my project to the point where I need to move it to the
> > front, I will know that I am getting somewhere!

> At that point you may also look at my messy/buggy/incomplete attempt to
> convert between xml and python:

[snip some impressive code]

Wow - that is impressive! I don't know if I will go that route, but I have
kept a copy and will study it closely when I have time.

Many thanks

Frank


Peter Otten

unread,
Aug 25, 2016, 11:17:35 AM8/25/16
to
Frank Millman wrote:

>> As you have to keep the "<", why bother?
>
> If you mean why don't I convert the '<' to '&lt;', the answer is that I do
> - I just omitted to say so. However, explicit is better than implicit :-)

Doesn't that make the XML document invalid or changes it in an irreversible
way? How would you know whether

"<foo><bar/></foo>"

started out as

"<foo><bar/></foo>"

or

"<foo>&lt;bar/></foo>"

?

Frank Millman

unread,
Aug 26, 2016, 1:22:32 AM8/26/16
to
"Peter Otten" wrote in message news:npn25e$s5n$1...@blaine.gmane.org...
I cheat ;-)

It is *my* XML, and I know that I only use the offending characters inside
attributes, and attributes are the only place where double-quote marks are
allowed.

So this is my conversion routine -

lines = string.split('"') # split on attributes
for pos, line in enumerate(lines):
if pos%2: # every 2nd line is an attribute
lines[pos] = line.replace('&lt;', '<').replace('&gt;', '>')
return '"'.join(lines)

Frank


Joonas Liik

unread,
Aug 26, 2016, 6:37:58 AM8/26/16
to
> --
> https://mail.python.org/mailman/listinfo/python-list

or.. you could just escape all & as &amp; before escaping the > and <,
and do the reverse on decode

Frank Millman

unread,
Aug 26, 2016, 9:10:55 AM8/26/16
to
"Joonas Liik" wrote in message
news:CAB1GNpQnJDENaA-GZgt0Tbcv...@mail.gmail.com...

> On 26 August 2016 at 08:22, Frank Millman <fr...@chagford.com> wrote:
> >
> > So this is my conversion routine -
> >
> > lines = string.split('"') # split on attributes
> > for pos, line in enumerate(lines):
> > if pos%2: # every 2nd line is an attribute
> > lines[pos] = line.replace('&lt;', '<').replace('&gt;', '>')
> > return '"'.join(lines)
> >
>
> or.. you could just escape all & as &amp; before escaping the > and <,
> and do the reverse on decode
>

Thanks, Joonas, but I have not quite grasped that.

Would you mind explaining how it would work?

Just to confirm that we are talking about the same thing -

This is not allowed - '<root><fld name="<new>"/></root>' [A]

>>> import xml.etree.ElementTree as etree
>>> x = '<root><fld name="<new>"/></root>'
>>> y = etree.fromstring(x)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File
"C:\Users\User\AppData\Local\Programs\Python\Python35\lib\xml\etree\ElementTree.py",
line 1320, in XML
parser.feed(text)
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 1,
column 17

You have to escape it like this - '<root><fld name="&lt;new&gt;"/></root>'
[B]

>>> x = '<root><fld name="&lt;new&gt;"/></root>'
>>> y = etree.fromstring(x)
>>> y.find('fld').get('name')
'<new>'
>>>

I want to convert the string from [B] to [A] for editing, and then back to
[B] before saving.

Thanks

Frank


Joonas Liik

unread,
Aug 26, 2016, 10:59:07 AM8/26/16
to
> --
> https://mail.python.org/mailman/listinfo/python-list

something like.. (untested)

def escape(untrusted_string):
''' Use on the user provided strings to render them inert for storage
escaping & ensures that the user cant type sth like '&gt;' in
source and have it magically decode as '>'
'''
return untrusted_string.replace("&","&amp;").replace("<",
"&lt;").replace(">", "&gt;")

def unescape(escaped_string):
'''Once the user string is retreived from storage use this
function to restore it to its original form'''
return escaped_string.replace("&lt;","<").replace("&gt;",
">").replace("&amp;", "&")

i should note tho that this example is very ad-hoc, i'm no xml expert
just know a bit about xml entities.
if you decide to go this route there are probably some much better
tested functions out there to escape text for storage in xml
documents.

Joonas Liik

unread,
Aug 26, 2016, 11:00:55 AM8/26/16
to
you might want to un-wrap that before testing tho.. no idea why my
messages get mutilated like that :(
(sent using gmail, maybe somebody can comment on that?)

Frank Millman

unread,
Aug 26, 2016, 11:23:44 AM8/26/16
to
"Joonas Liik" wrote in message
news:CAB1GNpTP0GD4s4kx07r1ujRN...@mail.gmail.com...

> something like.. (untested)

def escape(untrusted_string):
''' Use on the user provided strings to render them inert for storage
escaping & ensures that the user cant type sth like '&gt;' in
source and have it magically decode as '>'
'''
return untrusted_string.replace("&","&amp;").replace("<",
"&lt;").replace(">", "&gt;")

def unescape(escaped_string):
'''Once the user string is retreived from storage use this
function to restore it to its original form'''
return escaped_string.replace("&lt;","<").replace("&gt;",
">").replace("&amp;", "&")

> i should note tho that this example is very ad-hoc, i'm no xml expert just
> know a bit about xml entities. if you decide to go this route there are
> probably some much better tested functions out there to escape text for
> storage in xml documents.

Thanks very much, Joonas.

I understand now, and it seems to work fine.

As a bonus, I can now include '&' in my attributes in the future if the need
arises.

Much appreciated.

Frank


Roland Koebler

unread,
Aug 26, 2016, 11:59:59 AM8/26/16
to
Hi,

> It is *my* XML, and I know that I only use the offending characters inside
> attributes, and attributes are the only place where double-quote marks are
> allowed.
>
> So this is my conversion routine -
>
> lines = string.split('"') # split on attributes
> for pos, line in enumerate(lines):
> if pos%2: # every 2nd line is an attribute
> lines[pos] = line.replace('&lt;', '<').replace('&gt;', '>')
> return '"'.join(lines)
OMG!
So, you have a fileformat, which looks like XML, but actually isn't XML,
and will break if used with some "real" XML.

Although I don't like XML, if you want XML, you should follow Chris advice:
On Thu, Aug 25, 2016 at 09:40:03PM +1000, Chris Angelico wrote:
> just make sure it's always valid XML, rather
> than some "XML-like" file structure.

So, please:
- Don't try to write your own (not-quite-)XML-parser.
- Read how XML-files work.
- Read https://docs.python.org/3/library/xml.html
and https://pypi.python.org/pypi/defusedxml/
- Think what you have done.
- Use a sensible XML-parser/dumper. This should escape most special-
characters for you (at least: < > & " ').


Roland

Marko Rauhamaa

unread,
Aug 26, 2016, 1:48:15 PM8/26/16
to
"Frank Millman" <fr...@chagford.com>:

> "Joonas Liik" wrote in message
> news:CAB1GNpTP0GD4s4kx07r1ujRN...@mail.gmail.com...
>> i should note tho that this example is very ad-hoc, i'm no xml expert
>> just know a bit about xml entities. if you decide to go this route
>> there are probably some much better tested functions out there to
>> escape text for storage in xml documents.
>
> Thanks very much, Joonas.
>
> I understand now, and it seems to work fine.
>
> As a bonus, I can now include '&' in my attributes in the future if the
> need arises.
>
> Much appreciated.

XML attributes are ridiculously complicated. From the standard:

Before the value of an attribute is passed to the application or
checked for validity, the XML processor MUST normalize the attribute
value by applying the algorithm below, or by using some other method
such that the value passed to the application is the same as that
produced by the algorithm.

1. All line breaks MUST have been normalized on input to #xA as
described in 2.11 End-of-Line Handling, so the rest of this
algorithm operates on text normalized in this way.

2. Begin with a normalized value consisting of the empty string.

3. For each character, entity reference, or character reference in
the unnormalized attribute value, beginning with the first and
continuing to the last, do the following:

* For a character reference, append the referenced character to
the normalized value.

* For an entity reference, recursively apply step 3 of this
algorithm to the replacement text of the entity.

* For a white space character (#x20, #xD, #xA, #x9), append a
space character (#x20) to the normalized value.

* For another character, append the character to the normalized
value.

If the attribute type is not CDATA, then the XML processor MUST
further process the normalized attribute value by discarding any
leading and trailing space (#x20) characters, and by replacing
sequences of space (#x20) characters by a single space (#x20)
character.

Note that if the unnormalized attribute value contains a character
reference to a white space character other than space (#x20), the
normalized value contains the referenced character itself (#xD, #xA
or #x9). This contrasts with the case where the unnormalized value
contains a white space character (not a reference), which is replaced
with a space character (#x20) in the normalized value and also
contrasts with the case where the unnormalized value contains an
entity reference whose replacement text contains a white space
character; being recursively processed, the white space character is
replaced with a space character (#x20) in the normalized value.

All attributes for which no declaration has been read SHOULD be
treated by a non-validating processor as if declared CDATA.

It is an error if an attribute value contains a reference to an
entity for which no declaration has been read.

Following are examples of attribute normalization. Given the
following declarations:

<!ENTITY d "&#xD;">
<!ENTITY a "&#xA;">
<!ENTITY da "&#xD;&#xA;">

the attribute specifications in the left column below would be
normalized to the character sequences of the middle column if the
attribute a is declared NMTOKENS and to those of the right columns if
a is declared CDATA.

=================================================================
Attribute specification: a="

xyz"
a is NMTOKENS: x y z
a is CDATA: #x20 #x20 x y z
=================================================================
Attribute specification: a="&d;&d;A&a;&#x20;&a;B&da;"
a is NMTOKENS: A #x20 B
a is CDATA: #x20 #x20 A #x20 #x20 #x20 B #x20 #x20
=================================================================
Attribute specification: a="&#xd;&#xd;A&#xa;&#xa;B&#xd;&#xa;"
a is NMTOKENS: #xD #xD A #xA #xA B #xD #xA
a is CDATA: #xD #xD A #xA #xA B #xD #xA
=================================================================

Note that the last example is invalid (but well-formed) if a is
declared to be of type NMTOKENS.

<URL: https://www.w3.org/TR/REC-xml/#AVNormalize>


Marko

Roland Koebler

unread,
Aug 26, 2016, 3:28:08 PM8/26/16
to
Hi,

after reading the mails of this thread, I would recommend one of the
following ways:

1. Use a computer-readable format and some small editor for humans.

The file-format could then be very simple -- I would recommend JSON.
Or some kind of database (e.g. SQLite).

For humans, you would have to write a (small/nice) graphical editor,
where they can build the logic e.g. by clicking on buttons.
This can also work for non-programmers, since the graphical editor
can be adapted to the indended users, give help, run wizards etc.

or:

2. Use a human-readable format and a parser for the computer.

Then, the fileformat should be optimized for human readability.
I would recommend a restricted subset of Python. This is much more
readable/writeable for humans than any XML/JSON/YAML.
And you could even add a graphical editor to further support
non-programming-users.

The computer would then need a special parser. But by using
Python-expressions (only eval, no exec) and a parser for flow
control (if/else/for/...) and assignments, this is not too much
work and is good for many applications.

I've written such a parser incl. some kind of (pseudo-)sandbox [2]
for my template-engine "pyratemp" [1], and I've also used it for
small user-created-procedures.

[1] http://www.simple-is-better.org/template/pyratemp.html
[2] It's not a real sandbox -- it's secured only by restricting
the available commands. If you add unsafe commands to the
pseudo-sandbox (e.g. Pythons "open"), the user can do bad
things.
But without manually adding unsafe commands, I don't know
any way to get out of this pseudo-sandbox.
And if you really need a sandbox which is more powerful
than my pseudo-sandbox, you may want to have a look at
the sandbox of PyPy.


Trying to use a format which is both directly computer-readable
(without a special parser) and well human readable never really
works well in my experience. Then, you usually have to manually
read/write/edit some kind of parse-tree, which is usually much
harder to read/write than code. But if you want to do this, I
recommend LISP ;).

(By the way: If I did understand your mails correctly, your
program would probably break if someone edits the XML-files
manually, since you're using some kind of XML-like-fileformat
with many non-intuitive assumptions.)


Roland


PS:

On Wed, Aug 24, 2016 at 04:58:54PM +0200, Frank Millman wrote:
> Here is a JSON version -
>
> {
> "case": {
> "compare": {
> "-src": "_param.auto_party_id",
> "-op": "is_not",
> "-tgt": "$None",
> "case": {
> "on_insert": {
> "auto_gen": { "-args": "_param.auto_party_id" }
> },
> "not_exists": {
> "literal": { "-value": "<new>" }
> }
> }
> }
> }
> }
I think this is not really good. In JSON, you also have lists, and in this
case, it would probably be better to use some lists instead of dicts, e.g.:

[
["if", ["_param.auto_party_id", "is not", "None"],
["if", ["on_insert"], ["set", "value", ["call", "auto_gen", "_param.auto_party_id"]]],
["elif", ["not_exists"], ["set", "value", "'<new>'"]]
]
]

I think this is much more readable than your XML-code and the
auto-converted JSON.

And it's even less ambigious. (How do you distinguish between the
variable _param.auto_party_id and the string "_param.auto_party_id"
in your XML-example?)

Frank Millman

unread,
Aug 27, 2016, 1:29:39 AM8/27/16
to
"Roland Koebler" wrote in message news:20160826140213.GA17438@localhost...

> Hi,

> OMG!
> So, you have a fileformat, which looks like XML, but actually isn't XML,
> and will break if used with some "real" XML.

I don't want to pursue this too much further, but I would like to point out
that my format is genuine XML. I can back that up with two pieces of
supporting evidence -

1. After going through the unescape > gui > edit > back_to_application >
escape routine, I validate the result before saving it -

try:
return etree.fromstring(value, parser=self.parser)
except (etree.XMLSyntaxError, ValueError) as e:
raise AibError(head=self.col_defn.short_descr,
body='Xml error - {}'.format(e.args[0]))

2. This is how I process the result at run-time. I do not try to parse it
myself. I convert it into an EtreeElement, and step through it. Each 'tag'
in the XML maps to a function name in the processing module -

for xml in elem:
await globals()[xml.tag](caller, xml)

The built-in ElementTree would work for this, but I actually use lxml,
because I use a little bit of xpath in my processing, which ElementTree does
not support.

Frank


0 new messages