nested key:value in config?

Krzysztof Sakrejda

May 29, 2016, 9:22:31 AM
to stan development mailing list
In the past we've had some arguments about whether to allow nesting in the configuration specification, and I wanted to see if it would fly now (in the configuration output). I'm in favor of nesting since it makes it easier to specify names (in the style of the CmdStan command line) so that they don't conflict among algorithms. For example, the style in the _current_ output is:

```
algorithm = lbfgs (Default)
  lbfgs
    init_alpha = 0.001 (Default)
    tol_obj = 9.9999999999999998e-13 (Default)
    tol_rel_obj = 10000 (Default)
    tol_grad = 1e-08 (Default)
```

Here the meaning of init_alpha is unambiguous because it's nested under lbfgs, even though another algorithm might have an init_alpha with a completely different meaning. Nesting also makes the config output order-independent, since you don't have to search for the algorithm:lbfgs key-value pair before you know what the init_alpha:0.001 key-value pair means.
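Ignoring the storage-format question for now, here's a rough sketch (purely my own illustration, not a proposed schema) of what the nested version of that same config could look like, written out with Python's standard json module:

```python
import json

# Hypothetical nested form of the lbfgs config shown above -- an
# illustration of the idea, not a settled output format.
config = {
    "algorithm": {
        "lbfgs": {
            "init_alpha": 0.001,
            "tol_obj": 9.9999999999999998e-13,
            "tol_rel_obj": 10000,
            "tol_grad": 1e-08,
        }
    }
}

print(json.dumps(config, indent=2))

# Because init_alpha sits under "lbfgs", a reader never has to hunt for a
# separate algorithm = lbfgs entry to know which init_alpha this is.
```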

We'll have another go at this when it's code rather than spec, but if you have any fundamental objections (or suggestions) I'd appreciate hearing them now.

Krzysztof

Bob Carpenter

May 29, 2016, 1:06:47 PM
to stan...@googlegroups.com
I'd like to work backward from how we're applying this.

One goal is just recordkeeping to record how a sample
was gathered. For that, human readability is a big help.

Another goal is to be able to restart. To restart,
each interface will need to read at least some of the
information back in. For that, we need parsers in each
language.

If the readers for nested structures are manageable,
I'd be OK with nesting.
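As a rough sketch of what "manageable" might mean: if the dump happened to be JSON (just one possible format), reading the nested structure back in an interface language could be as small as this, with the file name made up:

```python
import json

# Sketch only: assumes the config was dumped as JSON, which is just one
# of the formats under discussion, and that it lives in "config.json"
# (a hypothetical name).
with open("config.json") as f:
    config = json.load(f)

# Digging out a nested value is one chained lookup per level.
init_alpha = config["algorithm"]["lbfgs"]["init_alpha"]
print(init_alpha)
```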

- Bob

Krzysztof Sakrejda

May 29, 2016, 8:50:58 PM
to stan development mailing list
On Sunday, May 29, 2016 at 1:06:47 PM UTC-4, Bob Carpenter wrote:
> I'd like to work backward from how we're applying this.

Sounds good to me.

> One goal is just recordkeeping to record how a sample
> was gathered. For that, human readability is a big help.

I'm thinking there should be a function for dumping human-readable
output that we never expect to parse again, as well as one for
machine-readable output.

> Another goal is to be able to restart. To restart,
> each interface will need to read at least some of the
> information back in. For that, we need parsers in each
> language.

Yeah, I wasn't worried about exact format yet b/c we could
rely on something that's broadly used like JSON/protobuf/etc...

> If the readers for nested structures are manageable,

I don't even know how to answer that question without just
implementing it in C++ and seeing what API I can produce
for interfaces...

K

Bob Carpenter

May 29, 2016, 10:20:47 PM
to stan...@googlegroups.com

> On May 29, 2016, at 8:50 PM, Krzysztof Sakrejda <krzysztof...@gmail.com> wrote:
>
> On Sunday, May 29, 2016 at 1:06:47 PM UTC-4, Bob Carpenter wrote:
>> I'd like to work backward from how we're applying this.
>
> Sounds good to me.
>
>> One goal is just recordkeeping to record how a sample
>> was gathered. For that, human readability is a big help.
>
> I'm thinking there should be a function for dumping human-readable
> output that we do not ever expect to parse again as well as
> machine-readable output.

I'd like the output that gets dumped to be human readable
if at all possible. The reason is so that

* we don't have to maintain two tools (human readable and binary)

* the two versions (human readable and binary) won't get out of sync

* there'll never be a binary-to-human-readable translator for
the wrong version

The more data-like stuff I'm happier to put in a binary format,
but even then, there's a big advantage to something like CSV.
Rarely is the I/O the bottleneck (though it certainly can be with
really fast systems with lots of data, like big data optimization
problems).

>> Another goal is to be able to restart. To restart,
>> each interface will need to read at least some of the
>> information back in. For that, we need parsers in each
>> language.
>
> Yeah, I wasn't worried about exact format yet b/c we could
> rely on something that's broadly used like JSON/protobuf/etc...
>
>> If the readers for nested structures are manageable,
>
> I don't even know how to answer that question without just
> implementing it in C++ and seeing what API I can produce
> for interfaces...

That's the Catch-22 of designing computer programs.

- Bob

Allen B. Riddell

May 30, 2016, 7:55:22 AM
to stan...@googlegroups.com
On 05/29, Bob Carpenter wrote:
>
> > On May 29, 2016, at 8:50 PM, Krzysztof Sakrejda <krzysztof...@gmail.com> wrote:
> >
> > On Sunday, May 29, 2016 at 1:06:47 PM UTC-4, Bob Carpenter wrote:
> >> I'd like to work backward from how we're applying this.
> >
> > Sounds good to me.
> >
> >> One goal is just recordkeeping to record how a sample
> >> was gathered. For that, human readability is a big help.
> >
> > I'm thinking there should be a function for dumping human-readable
> > output that we do not ever expect to parse again as well as
> > machine-readable output.
>
> I'd like the output that gets dumped to be human readable
> if at all possible. The reason is so
>
> * we don't have to maintain two tools (human readable and binary)
>
> * the two versions (human readable and binary) won't get out of synch
>
> * there'll never be a translator of binary to human readable of
> the wrong version
>
> The more data-like stuff I'm happier to put in a binary format,
> but even then, there's a big advantage to something like CSV.
> Rarely is the I/O the bottleneck (though it certainly can be with
> really fast systems with lots of data, like big data optimization
> problems).
>

I'd like the output to be human-readable as well. One advantage of
picking JSON is that there are well-defined and familiar mappings to and
from binary formats (protobuf, msgpack, cbor, ...).

A

Krzysztof Sakrejda

May 30, 2016, 2:37:15 PM
to stan development mailing list
On Sunday, May 29, 2016 at 10:20:47 PM UTC-4, Bob Carpenter wrote:
> > On May 29, 2016, at 8:50 PM, Krzysztof Sakrejda <krzysztof...@gmail.com> wrote:
> >
> > On Sunday, May 29, 2016 at 1:06:47 PM UTC-4, Bob Carpenter wrote:
> >> I'd like to work backward from how we're applying this.
> >
> > Sounds good to me.
> >
> >> One goal is just recordkeeping to record how a sample
> >> was gathered. For that, human readability is a big help.
> >
> > I'm thinking there should be a function for dumping human-readable
> > output that we do not ever expect to parse again as well as
> > machine-readable output.
>
> I'd like the output that gets dumped to be human readable
> if at all possible. The reason is so
>
> * we don't have to maintain two tools (human readable and binary)
> * the two versions (human readable and binary) won't get out of synch
> * there'll never be a translator of binary to human readable of
> the wrong version
>
> The more data-like stuff I'm happier to put in a binary format,
> but even then, there's a big advantage to something like CSV.
> Rarely is the I/O the bottleneck (though it certainly can be with
> really fast systems with lots of data, like big data optimization
> problems).

I agree about the advantages of human-readable formats. The one
issue I have with them is that we need to record some values faithfully to
make restarts possible.

>
> >> Another goal is to be able to restart. To restart,
> >> each interface will need to read at least some of the
> >> information back in. For that, we need parsers in each
> >> language.
> >
> > Yeah, I wasn't worried about exact format yet b/c we could
> > rely on something that's broadly used like JSON/protobuf/etc...
> >
> >> If the readers for nested structures are manageable,
> >
> > I don't even know how to answer that question without just
> > implementing it in C++ and seeing what API I can produce
> > for interfaces...
>
> That's the Catch-22 of designing computer programs.

Yeah, I'll go for a round of implementation
and then see what things look like.

Krzysztof

>
> - Bob

Krzysztof Sakrejda

May 30, 2016, 3:02:25 PM
to stan development mailing list, a...@ariddell.org
On Monday, May 30, 2016 at 7:55:22 AM UTC-4, Allen B. Riddell wrote:
> On 05/29, Bob Carpenter wrote:
> >
> > > On May 29, 2016, at 8:50 PM, Krzysztof Sakrejda <krzysztof...@gmail.com> wrote:
> > >
> > > On Sunday, May 29, 2016 at 1:06:47 PM UTC-4, Bob Carpenter wrote:
[snip]

> I'd like the output to be human-readable as well. One advantage of
> picking JSON is that there are well-defined and familiar mappings to and
> from binary formats (protobuf, msgpack, cbor, ...)

I'm behind JSON as a human-readable format, but it (or any other human-readable format)
can't be the only storage format, since that precludes accurate restarts.

>
> A

Allen B. Riddell

May 30, 2016, 4:35:39 PM
to Krzysztof Sakrejda, stan development mailing list
I think it could allow accurate restarts if we used a fault-tolerant
parser. I'd rather not deal with multiple formats if JSON can do the
trick.

Krzysztof Sakrejda

May 30, 2016, 4:49:41 PM
to stan development mailing list, krzysztof...@gmail.com, a...@ariddell.org
On Monday, May 30, 2016 at 4:35:39 PM UTC-4, Allen B. Riddell wrote:
> I think it could allow accurate restarts if we used a fault tolerant
> parser. I'd rather not deal with multiple formats if JSON can do the
> trick.

"Fault-tolerant" gets used to refer to so many things that I don't know
what you mean here. K

Allen B. Riddell

May 30, 2016, 5:35:06 PM
to Krzysztof Sakrejda, stan development mailing list
On 05/30, Krzysztof Sakrejda wrote:
In this case I mean a JSON parser which could handle an incomplete JSON
file such as

```
{
"samples": [1.5, 1.2, 1.1
```
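A best-effort reader could back off until the text parses. A quick Python sketch of the idea (an illustration only, not a proposal for the parser we'd actually ship):

```python
import json

def load_partial(text):
    """Best-effort parse of a truncated JSON document (illustration only)."""
    closers = ("", "]", "}", "]}", '"]}', "}]}")
    while text:
        for suffix in closers:
            try:
                return json.loads(text + suffix)
            except json.JSONDecodeError:
                pass
        text = text[:-1]  # drop the last character and retry
    return None

print(load_partial('{"samples": [1.5, 1.2, 1.1'))
# -> {'samples': [1.5, 1.2, 1.1]} (the final value may itself be truncated)
```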

Krzysztof Sakrejda

May 30, 2016, 6:07:51 PM
to stan development mailing list, krzysztof...@gmail.com, a...@ariddell.org

I see. I meant that you can't represent floating-point numbers accurately, so you can't use the saved values to restart the sampler at exactly the same point in parameter space, with exactly the same sampler parameters, unless you save a binary version.
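To illustrate the concern (a rough Python sketch, nothing Stan-specific): a double printed with a limited number of digits doesn't read back as the same bits, which is exactly the restart problem.

```python
import struct

x = 0.1 + 0.2                      # 0.30000000000000004

short = float(f"{x:.6g}")          # printed with 6 significant digits
assert short != x                  # the reread value is a different double

# The underlying bits really do differ:
print(struct.pack(">d", x).hex())      # 3fd3333333333334
print(struct.pack(">d", short).hex())  # 3fd3333333333333
```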

Bob Carpenter

May 30, 2016, 6:41:48 PM
to stan...@googlegroups.com
Allen --- where do you see incomplete representations
coming up? The main problem I see with JSON is that
our data is essentially column-oriented, but comes in
rows. But we can't stream the output by column, so it
has to be by row. And if we mark up each row, it's too
heavy.

Not having exact restarts with a round trip through ASCII
is an issue. I'm not sure it's possible. We could
get very close, but then it's not exact.
- Bob

Allen B. Riddell

May 30, 2016, 7:00:11 PM
to Krzysztof Sakrejda, stan development mailing list
If a binary representation of doubles is required, there are several ways
of storing that in JSON. HDF5 has a specification:
https://github.com/HDFGroup/hdf5-json

They use the tag H5T_IEEE_F64BE for doubles. Here's an example:

https://hdf5-json.readthedocs.io/en/latest/examples/datatype_object.html
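As a rough illustration of the general idea (my own sketch, not the HDF5-JSON spec itself), the exact bytes of a double can ride inside ordinary JSON as a string:

```python
import base64
import json
import struct

x = 0.1 + 0.2

# Pack the exact IEEE-754 big-endian bytes and carry them as base64 text.
payload = {"x_f64be": base64.b64encode(struct.pack(">d", x)).decode("ascii")}
text = json.dumps(payload)

# The round trip is bit-exact.
y = struct.unpack(">d", base64.b64decode(json.loads(text)["x_f64be"]))[0]
assert y == x
```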

Allen B. Riddell

May 30, 2016, 7:04:22 PM
to stan...@googlegroups.com
On 05/30, Bob Carpenter wrote:
> Allen --- where do you see incomplete representations
> coming up? The main problem I see with JSON is that
> our data is essentially column oriented, but comes in
> rows. But we can't stream the output by column, so it
> has to be by row. And if we mark up each row, it's too
> heavy.
>

I was thinking we'd serialize by row. Yes, it would be very heavy but if
our choices are CSV and JSON it's all more or less the same, right?

Bob Carpenter

May 30, 2016, 7:14:48 PM
to stan...@googlegroups.com
JSON files are bigger than CSV files (how much depends on
structure); they're also going to be slower to parse given
the more involved structure. I don't think either of these
issues is a big deal.

We just can't have the rows be structured with keys or it'll
be too heavy.
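As a rough illustration of why (my own sketch, with made-up column names): a keyed JSON object repeats the names on every draw, while a JSON array or a CSV row doesn't.

```python
import csv
import io
import json

names = ["lp__", "mu", "sigma"]        # made-up header, for illustration
draw = [-7.25, 0.31, 1.42]             # one fake row of output

keyed = json.dumps(dict(zip(names, draw)))   # names repeated on every row
plain = json.dumps(draw)                     # names stored once, up front
buf = io.StringIO()
csv.writer(buf).writerow(draw)

print(len(keyed), len(plain), len(buf.getvalue().rstrip("\r\n")))
# The keyed form is the longest, and the gap grows with the number of columns.
```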

- Bob

Michael Betancourt

May 30, 2016, 8:00:57 PM
to stan...@googlegroups.com
CSV is also waaaaaay easier to use with Unix tools.
I’m fine with metadata and such being stored in JSON
but having the samples in JSON and not a straight CSV
would be very, very disruptive.

Avraham Adler

May 30, 2016, 8:58:09 PM
to stan development mailing list
On Monday, May 30, 2016 at 8:00:57 PM UTC-4, Michael Betancourt wrote:
> CSV is also waaaaaay easier to use with Unix tools.
> I’m fine with metadata and such being stored in JSON
> but having the samples in JSON and not a straight CSV
> would be very, very disruptive.

I agree with Michael in that CSV files, if set up properly, are much easier to parse and munge than JSON. As mentioned in another thread, feather may be an option, but if you're dealing with representing real numbers with long decimals, uncompressed CSV may be smaller than uncompressed feather. See Wes McKinney's last post at https://github.com/wesm/feather/issues/162.

Allen B. Riddell

May 31, 2016, 7:41:49 AM
to stan...@googlegroups.com
On 05/31, Michael Betancourt wrote:
> CSV is also waaaaaay easier to use with Unix tools.

I haven't found this to be true in my work. I always have to use Python
(pandas) or R to read CSV files because there's frequently some
idiosyncrasy in the CSV file, which we should expect since CSV isn't a
well-defined standard.

I think JSON is becoming easier to deal with on the command line. For
example, https://github.com/stedolan/jq has 6,713 stars on GitHub.

(Also, how many *Stan users are dealing with output on the Unix command
line? I don't think it's more than 5-10%.)

> I’m fine with metadata and such being stored in JSON
> but having the samples in JSON and not a straight CSV
> would be very, very disruptive.

I tend to think it's CSV that's the problem. There's no standard and we
have to flatten arrays using yet another bespoke convention that is hard
to communicate (column-major? row-major?).
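A rough sketch of the kind of convention I mean (my own illustration): the same 2 x 3 parameter flattens to six columns in two different orders, and nothing in the CSV itself tells you which one you're looking at.

```python
# Bespoke flattening: a 2 x 3 parameter "theta" becomes six named columns,
# and the reader has to know which index varies fastest.
rows, cols = 2, 3
col_major = [f"theta[{i+1},{j+1}]" for j in range(cols) for i in range(rows)]
row_major = [f"theta[{i+1},{j+1}]" for i in range(rows) for j in range(cols)]
print(col_major)  # ['theta[1,1]', 'theta[2,1]', 'theta[1,2]', ...]
print(row_major)  # ['theta[1,1]', 'theta[1,2]', 'theta[1,3]', ...]
```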

Space is certainly a consideration -- but we should solve that in a
different way. Compression maybe. JSON and some binary format would be
OK too. Certainly not JSON, CSV, *and* a binary format.

Best,

AR

Allen B. Riddell

May 31, 2016, 7:42:23 AM
to stan...@googlegroups.com
I'd love to use feather but it's far too early, I think. There hasn't
even been a beta release yet, right?

Avraham Adler

May 31, 2016, 8:03:48 AM
to stan development mailing list, a...@ariddell.org
On Tuesday, May 31, 2016 at 7:42:23 AM UTC-4, Allen B. Riddell wrote:
> I'd love to use feather but it's far too early, I think. There hasn't
> even been a beta release yet, right?

It's been released for [Python](https://github.com/wesm/feather/tree/master/python) and on [CRAN for R](https://cran.r-project.org/web/packages/feather/index.html), so it's probably beyond beta now, although there are bugs (such as reading files >2GB).

Avi

Allen B. Riddell

May 31, 2016, 8:15:10 AM
to Avraham Adler, stan development mailing list
First commit was Jan 27, 2016. I'd like to see a couple more major
adopters and a bit more of a track record.

Krzysztof Sakrejda

May 31, 2016, 9:15:23 AM
to stan development mailing list, a...@ariddell.org
On Tuesday, May 31, 2016 at 7:41:49 AM UTC-4, Allen B. Riddell wrote:
> On 05/31, Michael Betancourt wrote:
> > CSV is also waaaaaay easier to use with Unix tools.
>
> I haven't found this to be true in my work. I always have to use Python
> (pandas) or R to read CSV files because there's frequently some
> idiosyncrasy in the CSV file, which we should expect since CSV isn't a
> well-defined standard.

That's a fair criticism of CSV, but we're only writing numbers, so we can write
a widely readable .csv file.

>
> I think JSON is becoming easier to deal with on the command-line. For
> example, https://github.com/stedolan/jq has 6,713 stars on github.

jq is pretty good. I use JSON to specify config files for Stan and then
a bash script with jq to pull out the configs. That makes the configs
readable from the command line as well as from R/etc. I have had
some issues with the R parser being more robust than the jq parser,
so sometimes I have to spend time chasing down JSON corner cases
in my configs when jq doesn't throw good error messages.

>
> (Also, how many *Stan users are dealing with output on the unix command
> line? I don't think it's more than 5-10%)

Could we keep the flame-war fodder down? 5-10% is plenty of users,
and there are many models too big for rstan/PyStan at the moment. The
tools need to interoperate.

>
> > I’m fine with metadata and such being stored in JSON
> > but having the samples in JSON and not a straight CSV
> > would be very, very disruptive.
>
> I tend to think it's CSV that's the problem. There's no standard and we
> have to flatten arrays using yet another bespoke convention that is hard
> to communicate (column-major? row-major?).
>
> Space is certainly a consideration -- but we should solve that in a
> different way. Compression maybe. JSON and some binary format would be
> OK too. Certainly not JSON, CSV, *and* a binary format.

Does JSON support a lightweight way of passing arrays of numbers without
cladding them in a lot of extra text?

K