Writing parsetab.py data to /tmp


David Beazley

Jul 8, 2013, 7:24:19 AM
to ply-...@googlegroups.com, David Beazley
So, I was thinking about the PLY parsetab.py file this morning. What would people think if the data contained in this file was simply written somewhere in the system temporary directory (e.g., /tmp) and regenerated as needed? Under such a scheme, everything would work pretty much the way it does now except that I could deprecate that whole sea of options about parsetab files, output directories, and whatnot. The only real downside that I can think of is that the parser tables would have to be regenerated after a reboot. However, who would really care given that it only takes a few seconds?

Thoughts?

-Dave


Alex Gaynor

Jul 8, 2013, 7:26:08 AM
to ply-...@googlegroups.com, David Beazley
As a datapoint, sticking the cache in /tmp is exactly what I do in rply. I've never heard a complaint about it (admittedly I have waaaay fewer users).

Alex




--
You received this message because you are subscribed to the Google Groups "ply-hack" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ply-hack+u...@googlegroups.com.
To post to this group, send email to ply-...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/ply-hack/6733F746-12F2-4393-9269-286EF0C043A2%40dabeaz.com.
For more options, visit https://groups.google.com/groups/opt_out.





--
"I disapprove of what you say, but I will defend to the death your right to say it." -- Evelyn Beatrice Hall (summarizing Voltaire)
"The people's good is the highest law." -- Cicero
GPG Key fingerprint: 125F 5C67 DFE9 4084

John Szakmeister

Jul 8, 2013, 7:38:46 AM
to ply-...@googlegroups.com, David Beazley
On Mon, Jul 8, 2013 at 7:24 AM, David Beazley <da...@dabeaz.com> wrote:
> So, I was thinking about the PLY parsetab.py file this morning. What would people think if the data contained in this file was simply written somewhere in the system temporary directory (e.g., /tmp) and regenerated as needed? Under such a scheme, everything would work pretty much the way it does now except that I could deprecate that whole sea of options about parsetab files, output directories, and whatnot. The only real downside that I can think of is that the parser tables would have to be regenerated after a reboot. However, who would really care given that it only takes a few seconds?
>
> Thoughts?

Can you provide more information about how this would work? Say user1
has a tool installed, built on PLY, that's using version 1.0 of the
grammar, and user2 has the same tool installed, but it's using version
2.0 of the grammar. Do they both get written to the same file when
the parse tables are generated? Would the developer using PLY provide
something that's used to influence how the directory or file gets
named? Is there an issue with someone monkeying with parse tables to
make something happen that shouldn't happen--a security problem of
sorts? Could we still package the parsetab.py file with our RPM
install, if we wanted to avoid the overhead?

Sorry for the dumb questions!

-John

David Beazley

Jul 8, 2013, 7:38:55 AM
to Alex Gaynor, David Beazley, ply-...@googlegroups.com
What do you name the file? That's the only somewhat tricky part I can think of. I was thinking about keying it either off the absolute path of the file invoking ply.yacc() or maybe the signature of all of the parsing rules (currently used to determine if the tables need to be regenerated).

-Dave

Alex Gaynor

Jul 8, 2013, 7:48:54 AM
to David Beazley, ply-...@googlegroups.com
The filename is a combination of a version tag for rply itself, a cache tag supplied by a user, and a hash of the grammar: https://github.com/alex/rply/blob/master/rply/parsergenerator.py#L125

Alex
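
A minimal sketch of the naming scheme described above (version tag, user-supplied cache tag, hash of the grammar). The function name `cache_path`, the `parsetab-` prefix, and the use of SHA-1 are assumptions made for illustration; they are not taken from rply or PLY.

```python
import hashlib
import os
import tempfile


def cache_path(grammar_signature, cache_tag, version="1"):
    """Build a cache file path from a version, a user tag, and a grammar hash.

    All names here are hypothetical; only the three ingredients come from
    the description above.
    """
    # Hash the grammar text so that different grammars map to different files.
    digest = hashlib.sha1(grammar_signature.encode("utf-8")).hexdigest()
    filename = "parsetab-%s-%s-%s.json" % (version, cache_tag, digest)
    # Use the platform's temporary directory rather than a literal /tmp.
    return os.path.join(tempfile.gettempdir(), filename)
```

With this layout, two versions of the same tool that ship different grammars would end up with different cache files, because the hash part of the name changes.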

A.T.Hofkamp

Jul 8, 2013, 7:55:09 AM
to ply-...@googlegroups.com
On 07/08/2013 01:24 PM, David Beazley wrote:
> So, I was thinking about the PLY parsetab.py file this morning. What would people think if the
> data contained in this file was simply written somewhere in the system temporary directory (e.g.,

What about giving the user two extra methods

1. get_table_data()
Generates the parser data table, and returns it to the caller. Simplest would be a string of
some kind, i.e. a native data type that is easily handled if you don't care about the contents.

2. set_table_data(data)
User provides the table data obtained from 1 at some point in the past.

Also

If no table has been provided before the first parse, generate the table during the first parse
internally. Just keep the table internally in memory, during the lifetime of the parser object.



This way people have complete freedom what they want to do (or don't want to do) with the table
data. I can see options like storing the table data in a multi-line string literal in a generated
Python file, or writing it in some data file in the application, or whatever.
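
A minimal sketch of how the two methods and the in-memory fallback could fit together. `TableParser`, `_generate`, and the JSON string encoding are hypothetical stand-ins, not PLY's API; the real table construction is replaced by a trivial placeholder.

```python
import json


class TableParser:
    """Hypothetical parser wrapper; not part of PLY's API."""

    def __init__(self, grammar):
        self.grammar = grammar
        self._table = None

    def _generate(self):
        # Stand-in for the real (expensive) table construction.
        return {"rules": sorted(self.grammar)}

    def get_table_data(self):
        """Generate the table if needed and return it as a plain string."""
        if self._table is None:
            self._table = self._generate()
        return json.dumps(self._table)

    def set_table_data(self, data):
        """Install table data obtained earlier from get_table_data()."""
        self._table = json.loads(data)

    def parse(self, text):
        # No table supplied yet: build one in memory on first use.
        if self._table is None:
            self._table = self._generate()
        return (text, len(self._table["rules"]))
```

The caller decides where the string returned by `get_table_data()` lives: a data file, a generated module, or nowhere at all.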

David Beazley

Jul 8, 2013, 7:58:49 AM
to John Szakmeister, David Beazley, ply-...@googlegroups.com
On Jul 8, 2013, at 6:38 AM, John Szakmeister <jo...@szakmeister.net> wrote:

> On Mon, Jul 8, 2013 at 7:24 AM, David Beazley <da...@dabeaz.com> wrote:
>> So, I was thinking about the PLY parsetab.py file this morning. What would people think if the data contained in this file was simply written somewhere in the system temporary directory (e.g., /tmp) and regenerated as needed? Under such a scheme, everything would work pretty much the way it does now except that I could deprecate that whole sea of options about parsetab files, output directories, and whatnot. The only real downside that I can think of is that the parser tables would have to be regenerated after a reboot. However, who would really care given that it only takes a few seconds?
>>
>> Thoughts?
>
> Can you provide more information about how this would work? Say user1
> has a tool installed, built on PLY, that's using version 1.0 of the
> grammar, and user2 has the same tool installed, but it's using version
> 2.0 of the grammar. Do they both get written to the same file when
> the parse tables are generated?

I was thinking about making the name of the written file incorporate some kind of unique signature to keep it separate from other users of PLY. For example, it could be based on the absolute file path of the code that invokes PLY or maybe the parsetab signature that gets computed. So the parsing data from different grammars and users of PLY would be written to separate files.

> Would the developer using PLY provide
> something that's used to influence how the directory or file gets
> named?

No, although an option to specify the location of the temporary directory would be provided.

> Is there an issue with someone monkeying with parse tables to
> make something happen that shouldn't happen--a security problem of
> sorts?

Hard to say. The new approach wouldn't be using 'import' to load parsing data however. As such, it could be encoded in a different format such as JSON. Maybe it's safer than what's done now.

> Could we still package the parsetab.py file with our RPM
> install, if we wanted to avoid the overhead?

There wouldn't be a proper parsetab.py file anymore. I suppose the generated output file could be distributed as long as you were willing to tell PLY the directory where it was located.

-Dave
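
A minimal sketch of storing the table as JSON data instead of an importable module, with the grammar signature kept alongside it so stale data can be rejected. The functions `dump_table` and `load_table` and the payload layout are assumptions for illustration only. Note that JSON avoids executing file contents on load, but it does not by itself show that the data came from a trusted writer.

```python
import json


def dump_table(table, signature):
    """Serialize a table together with the signature of its grammar."""
    return json.dumps({"signature": signature, "table": table})


def load_table(text, expected_signature):
    """Return the table, or None if the data is malformed or stale."""
    try:
        payload = json.loads(text)
    except ValueError:
        # Not valid JSON at all (json.JSONDecodeError is a ValueError).
        return None
    if not isinstance(payload, dict):
        return None
    if payload.get("signature") != expected_signature:
        # Written for a different grammar: the caller should regenerate.
        return None
    return payload.get("table")
```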



David Beazley

Jul 8, 2013, 8:03:28 AM
to ply-...@googlegroups.com, David Beazley
I like this idea, although having some kind of default caching scheme is also useful for people who just don't want to think about it.

-Dave

John Szakmeister

Jul 8, 2013, 8:11:57 AM
to ply-...@googlegroups.com
On Mon, Jul 8, 2013 at 7:58 AM, David Beazley <da...@dabeaz.com> wrote:
[snip]
>> Can you provide more information about how this would work? Say user1
>> has a tool installed, built on PLY, that's using version 1.0 of the
>> grammar, and user2 has the same tool installed, but it's using version
>> 2.0 of the grammar. Do they both get written to the same file when
>> the parse tables are generated?
>
> I was thinking about making the name of the written file incorporate some kind of unique signature to keep it separate from other users of PLY. For example, it could be based on the absolute file path of the code that invokes PLY or maybe the parsetab signature that gets computed. So the parsing data from different grammars and users of PLY would be written to separate files.

That makes sense.

>> Would the developer using PLY provide
>> something that's used to influence how the directory or file gets
>> named?
>
> No, although an option to specify the location of the temporary directory would be provided.
>
>> Is there an issue with someone monkeying with parse tables to
>> make something happen that shouldn't happen--a security problem of
>> sorts?
>
> Hard to say. The new approach wouldn't be using 'import' to load parsing data however. As such, it could be encoded in a different format such as JSON. Maybe it's safer than what's done now.

I think there are probably a few more things to think about along this
line. The resultant file would be created with the running user's
permissions. So, without some care, one user could create the file,
and another might not be able to read it due to permissions issues.
Maybe the user's umask isn't set up to allow everyone read access, for
instance. I could definitely see some headaches in this arena.

>> Could we still package the parsetab.py file with our RPM
>> install, if we wanted to avoid the overhead?
>
> There wouldn't be a proper parsetab.py file anymore. I suppose the generated output file could be distributed as long as you were willing to tell PLY the directory where it was located.

I guess if PLY doesn't attempt to open the file with write permissions
(if it's happy with the signature), then that could work.

-John

A.T.Hofkamp

Jul 8, 2013, 8:18:39 AM
to ply-...@googlegroups.com
On 07/08/2013 02:03 PM, David Beazley wrote:
> I like this idea, although having some kind of default caching scheme is also useful for people who just don't want to think about it.

You can see writing the table to /tmp as a layer on top of this proposal.
A simple approach could be to add a method

3. setupTable()
Reads and/or writes the table data to /tmp using methods 1 & 2.
It should be able to detect whether a table has already been supplied, making it a "pass" method.

"supply a table" should also be true after calling 1, imho.


The open question is whether you have to do 3 explicitly or not.

If yes, the user calls 3 before parsing, which does all the table & file magic (unless 2 is done
first, due to the "pass" reduction mentioned above).
If he doesn't call 3 beforehand, the "internal generate fallback" is performed.

If no, then 3 is done automatically before or during the first parse if 2 is not done. The fallback
to internal generation could then be a call to 1, and never save the returned data.


I would prefer the explicit call, but both options seem viable.
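
A minimal sketch of method 3 as a layer on top of methods 1 and 2, written here as a free function. `setup_table`, `has_table`, and `DemoParser` are hypothetical names; `DemoParser` only exists so the sketch runs on its own.

```python
import os


class DemoParser:
    """Tiny stand-in exposing the hypothetical methods 1 and 2."""

    def __init__(self):
        self._data = None

    def has_table(self):
        return self._data is not None

    def get_table_data(self):
        # Calling method 1 also counts as "a table has been supplied".
        if self._data is None:
            self._data = "generated-table"
        return self._data

    def set_table_data(self, data):
        self._data = data


def setup_table(parser, path):
    """Load table data from path if present; otherwise generate and save it."""
    if parser.has_table():
        return "supplied"      # table already provided: act as a "pass" method
    if os.path.exists(path):
        with open(path) as f:
            parser.set_table_data(f.read())
        return "loaded"
    with open(path, "w") as f:
        f.write(parser.get_table_data())
    return "generated"
```

Whether this runs only on an explicit call or automatically before the first parse is the open question above; the function itself is the same either way.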

David Bliss

Jul 10, 2013, 8:26:47 AM
to ply-...@googlegroups.com
Just saw this.

I don't mind the idea of using tempfiles at all, but I noticed this discussion seems to be very *nix focused. Please consider using tempfile.gettempdir() instead of literally /tmp, for those of us stuck in Windoze land part-time.

David Bliss
1-616-284-1273
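
A minimal sketch of the portable alternative suggested above. `tempfile.gettempdir()` is a standard-library call; the helper name `default_table_dir` and the file name `parsetab.json` are assumptions for illustration.

```python
import os
import tempfile


def default_table_dir():
    """Return the platform's temporary directory instead of a literal /tmp."""
    return tempfile.gettempdir()


# Hypothetical file name; only the directory lookup is the point here.
table_file = os.path.join(default_table_dir(), "parsetab.json")
```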



David Beazley

Jul 10, 2013, 8:31:08 AM
to ply-...@googlegroups.com, David Beazley
Oh definitely.  Any solution implemented would emphasize portability.

-Dave

Andrew Dalke

Aug 5, 2013, 10:23:08 PM
to ply-...@googlegroups.com
Catching up on email ...


> On Jul 8, 2013, at 6:38 AM, John Szakmeister <jo...@szakmeister.net> wrote:
>> Is there an issue with someone monkeying with parse tables to
>> make something happen that shouldn't happen--a security problem of
>> sorts?


On Jul 8, 2013, at 1:58 PM, David Beazley wrote:
> Hard to say. The new approach wouldn't be using 'import' to load parsing data however. As such, it could be encoded in a different format such as JSON. Maybe it's safer than what's done now.


This sounds like a bad security hole.

On a multiuser system, if I know user X, grammar Y, and location Z (assuming those are the three factors that go into making the unique name), then I can use the same algorithm to determine what the filename will be. Let's suppose it's $PARSETAB.

I then write a file to $PARSETAB and make it unreadable. This prevents user X from running the program.


Or, taking Alex's code as an example implementation:

    if os.path.exists(cache_file):
        with open(cache_file) as f:
            data = json.load(f)
            if self.data_is_valid(g, data):
                table = LRTable.from_cache(g, data)
    if table is None:
        table = LRTable.from_grammar(g)
        with open(cache_file, "w") as f:
            json.dump(self.serialize_table(table), f)


I make $PARSETAB be readable, hooked to a named pipe. When someone starts to read from the pipe, I have the writer process rename $PARSETAB to $PARSETAB.old then make a symbolic link from $PARSETAB to an arbitrary file in X's account. The process feeding the named pipe then returns an invalid table, so as to trigger the cache write. Remember, the cache write occurs in a process owned by X.

The cache write occurs, saving to $PARSETAB, but because of the symlink the write actually goes to some other file of X's, of my choosing.



Even more fun, I can make $PARSETAB be a readable file, but with a different grammar than what's expected. Imagine a command-oriented language:

open "filename"
list 10-30
quit

which also has a command:

unlink "filename"


I, being who I am, might provide an alternate parsetab grammar which maps the "open" token to the unlink rule. Now when X's program reads the Y command to "open" it actually deletes the file.

Even if there's nothing so obviously security prone as that in the grammar, it's still plenty easy for me to introduce an alternate parser definition which can mess things up for X, like swapping "+" and "-".


In short, I don't see any way to get what you want with /tmp and still be secure in a multi-user system with possible malicious users. Your safer bet is to default to a $HOME/.ply-cache directory.
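
A minimal sketch of a per-user cache directory along these lines. The function name `user_cache_dir` and the choice of mode 0700 are assumptions; on non-POSIX platforms the permission bits may not have the same effect.

```python
import os


def user_cache_dir(home=None):
    """Create (if needed) and return a private cache directory under home."""
    home = home or os.path.expanduser("~")
    path = os.path.join(home, ".ply-cache")
    os.makedirs(path, mode=0o700, exist_ok=True)
    # makedirs' mode is filtered by the umask and ignored if the directory
    # already exists, so set the permissions explicitly as well.
    os.chmod(path, 0o700)
    return path
```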


Even then, with non-malicious cases, there are still timing problems. What happens when two instances start at the same time and try to cache the parsetab file? In Alex's code it may produce a ValueError if one process has only managed to write part of the cache when the other process tries to read it. So at the very least it needs to be more robust against odd sorts of timing failures.
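
A minimal sketch of one way to make the cache more robust against this kind of timing failure: write to a temporary file in the same directory, then rename it into place, and treat an unreadable cache as a miss. The function names are hypothetical, and this is not what rply does. It assumes the cache directory itself is writable only by the owner.

```python
import json
import os
import tempfile


def write_cache_atomically(path, table):
    """Write table as JSON so readers never observe a half-written file."""
    directory = os.path.dirname(path) or "."
    fd, tmp_name = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(table, f)
        # Rename within one directory; on POSIX this replaces the
        # destination name in a single step.
        os.replace(tmp_name, path)
    except BaseException:
        os.unlink(tmp_name)
        raise


def read_cache(path):
    """Return the cached table, or None if it is missing or unreadable."""
    try:
        with open(path) as f:
            return json.load(f)
    except (OSError, ValueError):
        # Missing file, permission problem, or truncated JSON: regenerate.
        return None
```

If two processes race, each writes its own temporary file and the last rename wins; a reader sees either the old complete file or the new one.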


I would still want some way to override where the parsetab comes from. I don't like assuming that I have a writeable disk. get_table_data() and set_table_data() seem like they would be fine.

Cheers,

Andrew
da...@dalkescientific.com


Andrew Dalke

Aug 6, 2013, 1:36:04 PM
to ply-...@googlegroups.com
On Aug 6, 2013, at 11:19 AM, Andrew Dalke wrote:
> In short, I don't see any way to get what you want with /tmp and still be secure in a multi-user system with possible malicious users. Your safer bet is to default to a $HOME/.ply-cache directory.

I thought of one. If there's a $HOME/.ply-secret which is r-------- (readable only by its owner), then its contents can be used as a secret key for some cryptographically strong mechanism used to generate the actual filename.

This at least would solve the garbage collection problem that a $HOME/.ply-cache directory might have.
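
A minimal sketch of deriving the filename from a secret key, using HMAC-SHA256 as one example of a "cryptographically strong mechanism". The function names and the `parsetab-` prefix are assumptions; reading the key from $HOME/.ply-secret is left out and the key is passed in as bytes.

```python
import hashlib
import hmac
import os
import tempfile


def secret_cache_name(secret, grammar_signature):
    """Derive a hard-to-predict cache filename from a per-user secret key."""
    mac = hmac.new(secret, grammar_signature.encode("utf-8"), hashlib.sha256)
    return "parsetab-%s.json" % mac.hexdigest()


def secret_cache_path(secret, grammar_signature):
    """Place the derived name in the platform's temporary directory."""
    name = secret_cache_name(secret, grammar_signature)
    return os.path.join(tempfile.gettempdir(), name)
```

This only makes the name hard to predict before the file exists; once created, the name is visible to anyone who can list the directory.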

Cheers,

Andrew
da...@dalkescientific.com

