Sorting of bean-extract output

165 views
Skip to first unread message

Oon-Ee Ng

unread,
Nov 28, 2018, 10:00:43 PM11/28/18
to bean...@googlegroups.com
The output is sorted by day, but within the day outputs aren't sorted (or, rather, they're sorted by line number). Is this reliable behaviour (as-in, should I expect it to change at some point)?

The current source file for importing that I'm working on lists transaction in reverse chronological order (I want chronological, in this case) but also has a reference number that is monotonically increasing. So my current hack is to set line number for the transactions to be that reference number.

Martin Blais

unread,
Nov 28, 2018, 10:53:29 PM11/28/18
to bean...@googlegroups.com
On Wed, Nov 28, 2018 at 10:00 PM Oon-Ee Ng <ngoone...@gmail.com> wrote:
The output is sorted by day, but within the day outputs aren't sorted (or, rather, they're sorted by line number). Is this reliable behaviour (as-in, should I expect it to change at some point)?

- It's intended to be "stable," that is, between two runs, the order should be the same.
- It's not intended to be something you should rely on, not the least reason which is that you may reorder input in your file. It may change if there's a good reason for it (but frankly I don't see any reason to right now).


The current source file for importing that I'm working on lists transaction in reverse chronological order (I want chronological, in this case) but also has a reference number that is monotonically increasing. So my current hack is to set line number for the transactions to be that reference number.

That seems not unlike the idea of storing the time in metadata - discussed in a prior thread but not yet implemented - and to use that as a secondary sort key. Maybe I can generalize this idea of a secondary sort key to let users put whatever they want in there (as long as the type is comparable, for sorting).

 

--
You received this message because you are subscribed to the Google Groups "Beancount" group.
To unsubscribe from this group and stop receiving emails from it, send an email to beancount+...@googlegroups.com.
To post to this group, send email to bean...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/beancount/CAGQ70es-roVBQnWUdeDedgCSNL6PBb4Z%2Bywyo3J5LYwseAjfUA%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

Oon-Ee Ng

unread,
Nov 29, 2018, 12:48:38 AM11/29/18
to bean...@googlegroups.com
On Thu, Nov 29, 2018 at 11:53 AM Martin Blais <bl...@furius.ca> wrote:
On Wed, Nov 28, 2018 at 10:00 PM Oon-Ee Ng <ngoone...@gmail.com> wrote:
The output is sorted by day, but within the day outputs aren't sorted (or, rather, they're sorted by line number). Is this reliable behaviour (as-in, should I expect it to change at some point)?

- It's intended to be "stable," that is, between two runs, the order should be the same.
- It's not intended to be something you should rely on, not the least reason which is that you may reorder input in your file. It may change if there's a good reason for it (but frankly I don't see any reason to right now).


The current source file for importing that I'm working on lists transaction in reverse chronological order (I want chronological, in this case) but also has a reference number that is monotonically increasing. So my current hack is to set line number for the transactions to be that reference number.

That seems not unlike the idea of storing the time in metadata - discussed in a prior thread but not yet implemented - and to use that as a secondary sort key. Maybe I can generalize this idea of a secondary sort key to let users put whatever they want in there (as long as the type is comparable, for sorting).


Yes having a secondary sort would be useful. That's effectively what I'm using lineno for right now. 

Daniele Nicolodi

unread,
Nov 29, 2018, 1:06:13 AM11/29/18
to bean...@googlegroups.com
On 28/11/2018 20:53, Martin Blais wrote:
> The current source file for importing that I'm working on lists
> transaction in reverse chronological order (I want chronological, in
> this case) but also has a reference number that is monotonically
> increasing. So my current hack is to set line number for the
> transactions to be that reference number.
>
>
> That seems not unlike the idea of storing the time in metadata -
> discussed in a prior thread but not yet implemented - and to use that as
> a secondary sort key. Maybe I can generalize this idea of a secondary
> sort key to let users put whatever they want in there (as long as the
> type is comparable, for sorting).

Aren't directives written in the order in which they are returned by the
importer? I didn't see any mention of the fact that bean-extract does
any sorting on its own in the documentation, but I may have missed it.

Cheers,
Dan

Oon-Ee Ng

unread,
Nov 29, 2018, 1:14:56 AM11/29/18
to bean...@googlegroups.com
That was my impression as well, but after testing it out I found this was not the case. The returned ordered list of transactions is then sorted by date and line number. 

Martin Blais

unread,
Nov 29, 2018, 1:17:39 AM11/29/18
to bean...@googlegroups.com
Yes. Sorted by date.


Daniele Nicolodi

unread,
Nov 29, 2018, 1:22:29 AM11/29/18
to bean...@googlegroups.com
On 28/11/2018 23:17, Martin Blais wrote:
> On Thu, Nov 29, 2018 at 1:14 AM Oon-Ee Ng <ngoone...@gmail.com
>
> That was my impression as well, but after testing it out I found
> this was not the case. The returned ordered list of transactions is
> then sorted by date and line number. 
>
>
> Yes. Sorted by date.

Is that documented?

Would it be a possibility to accept a datetime object instead that a
date object for the date field (and do not serialize the time part into
the written file) be enough to obtain sorting by date and time?

Actually, datetime objects are already accepted building metadata
instances and I guess the sorting does not strip the time away, thus
only the serialization would need to be changed. I guess I can prepare a
patch for this.

Cheers,
Dan

Daniele Nicolodi

unread,
Nov 29, 2018, 2:17:41 AM11/29/18
to bean...@googlegroups.com
The patch is easy enough and attached, although it is largely untested.

Cheers,
Dan

beancount.patch

Martin Blais

unread,
Nov 29, 2018, 9:09:13 AM11/29/18
to bean...@googlegroups.com
On Thu, Nov 29, 2018 at 2:17 AM Daniele Nicolodi <dan...@grinta.net> wrote:
On 28/11/2018 23:22, Daniele Nicolodi wrote:
> On 28/11/2018 23:17, Martin Blais wrote:
>> On Thu, Nov 29, 2018 at 1:14 AM Oon-Ee Ng <ngoone...@gmail.com
>>
>>     That was my impression as well, but after testing it out I found
>>     this was not the case. The returned ordered list of transactions is
>>     then sorted by date and line number. 
>>
>>
>> Yes. Sorted by date.
>
> Is that documented?


>
> Would it be a possibility to accept a datetime object instead that a
> date object for the date field (and do not serialize the time part into
> the written file) be enough to obtain sorting by date and time?

No.
We've discussed this at length before.
This opens a can of worms, will you support all the things that will break on adding a new constraint?

 
>
> Actually, datetime objects are already accepted building metadata
> instances and I guess the sorting does not strip the time away, thus
> only the serialization would need to be changed. I guess I can prepare a
> patch for this.

The patch is easy enough and attached, although it is largely untested.

Let me get this right: you're proposing a patch that affects the parser layer, that has potential to break a bunch other people's stuff, and you're not bothering to write a test that specifically handles that case?
This is a perfect example how not to contribute to a released open source project.
I have no time for "largely untested" patches.


Daniele Nicolodi

unread,
Nov 29, 2018, 10:10:22 AM11/29/18
to bean...@googlegroups.com
Hello Maetin,

On 29/11/2018 07:08, Martin Blais wrote:
> Let me get this right: you're proposing a patch that affects the parser
> layer, that has potential to break a bunch other people's stuff, and
> you're not bothering to write a test that specifically handles that case?
> This is a perfect example how not to contribute to a released open
> source project.
> I have no time for "largely untested" patches.

I think I didn't explain well enough what my idea is (however, if you
would have looked at the patch I believe you could have understood it
rather quickly).

The problem I tried to solve is that in the ingestion process, before
being serialized into a beancount file, entries are sorted by date,
type, and origin file line number. It has ben requested if the sorting
order could take into account transaction time.

The solution I proposed is to allow the ingestion machinery to accept a
datetime object for the date field. The time would be thrown out in the
serialization, but would be used for sorting. It is similar to using
time information stored in the meta field.

The implementation is simple as the only thing that needs to change is
the serialization in minimal part. After looking at the code the only
thing that needs to be done is to replace the string formatting for date
entries from ...format('{e,date} ...', e) to
...format('{e.date:%Y-%m-%d}...', e). There is no change to the parser
layer.

The patch does not come with tests because I thought to it as a better
description of the kind of solution I'm proposing, rather than something
ready to be merges.

I interpret your answer as very hostile (please correct me if this
impression is wrong). Thus I am happy that I haven't invested more time
in my attempt to contribute an idea to your project.

Best,
Dan

Zhuoyun Wei

unread,
Nov 29, 2018, 10:13:40 AM11/29/18
to bean...@googlegroups.com
2018-11-29 14:14:44 Oon-Ee Ng <ngoone...@gmail.com>:
I was once troubled by this sorting behaviour as well. Intra-day
transactions are sorted reverse-chronologically (newer to older) in the
CSV file exported from my bank. My importer parses the entries in
reversed order, returning a correctly ordered list of transactions
(older to newer). However, Beancount will again sort the list returned
by importer, shuffling the intra-day transactions:

https://github.com/beancount/beancount/blob/fb2d29eed89a8e15f1635847beba0e5598c0ea06/beancount/ingest/extract.py#L73-L74

I had once submitted a pull request to disable this behaviour, it was
merged but later reverted by Martin.

So I came up with a workaround:

```
# Re-generate "lineno" to force sorting
for lineno, entry in enumerate(entries, start=1):
entry.meta['lineno'] = lineno
```

Just generate artificial "lineno" keys and insert them into the
transactions before returning them to Beancount. This way I can preserve
the order.

A possible approach to make both parties happy would be to add a new
parameter to "extract_from_file()" function, allowing importers to
disable the sorting behaviour in post-processings.


> --
> You received this message because you are subscribed to the Google Groups "Beancount" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to
> beancount+...@googlegroups.com.
> To post to this group, send email to bean...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/beancount/
> CAGQ70esZ%2BSFkn4VcaG1dh%2BP1i%3DNgQ9rHvSdx-_8eVKvJ1%3DcRMg%40mail.gmail.com.
> For more options, visit https://groups.google.com/d/optout.

--
Zhuoyun Wei
signature.asc

Martin Blais

unread,
Dec 1, 2018, 8:59:27 PM12/1/18
to bean...@googlegroups.com
On Thu, Nov 29, 2018 at 10:10 AM Daniele Nicolodi <dan...@grinta.net> wrote:
Hello Maetin,

On 29/11/2018 07:08, Martin Blais wrote:
> Let me get this right: you're proposing a patch that affects the parser
> layer, that has potential to break a bunch other people's stuff, and
> you're not bothering to write a test that specifically handles that case?
> This is a perfect example how not to contribute to a released open
> source project.
> I have no time for "largely untested" patches.

I think I didn't explain well enough what my idea is (however, if you
would have looked at the patch I believe you could have understood it
rather quickly).

I had looked at it actually.
The problem with adding support for this (i.e., changing the sort order rule) is that there will be a lot of other things that start relying on it.
So say, you change the schema to implicitly support date or datetime in the printer, as your patch suggests.
Now some codes use date, some codes use datetime.
Someone uses the result of some importer code that puts datetime in this field - code that they didn't write, say - and passes it to some routine which fails on datetime. Boom! A bug. (Note that date's don't have a timezone and aren't subject to some of the timezone related issues.) Or say that person does arithmetic on the dates/datetimes. And it fails. That's unpredictable behavior. Then I'll get a request to fix that bug on an unsupported data type and someone will be surprised it's not supported officially.
Or someone comes to rely on the particular ordering given by the datetime and then I refine the rules in a way that breaks their code.
Or... well, the ordering rules specify that balance check are to be sorted such that they're located before other types of directives (see link I sent). What about a Balance check with a time after a transaction directives on the same day?

That's bad. The schema needs to have a single specific, well-documented type. Python is already loose enough. The schema needs to be clear. I even considered switching to protocol buffers at some point in order to enforce correct datatypes, it's already bad enough these aren't checked.

So I took one look at your patch and I thought all of this.
(We've discussed this issue numerous times on the mailing-list.)
The right solution would be to commit to using datetime everywhere in the source code and make that the unique new datatype (probably with a bunch of instances with time = midnight). But that's not going to happen in this major version, that would be for a future major revamp. This current release is a stable release and people need it to work, it's not a place to experiment so much at the moment, this needs to remain stable, a future revamp would be the right venue for that, and it would come with many other changes. Plus, it's really much more complicated that it seems (timezones?).

I've agreed with others in previous threads to add 
1. Support for the grammar to parse a time and to insert the time portion in some metadata field
2. Adding an option to support specifying a secondary key for sorting
This wouldn't break any of the current code by default.



 

The problem I tried to solve is that in the ingestion process, before
being serialized into a beancount file, entries are sorted by date,
type, and origin file line number. It has ben requested if the sorting
order could take into account transaction time.

The solution I proposed is to allow the ingestion machinery to accept a
datetime object for the date field. The time would be thrown out in the
serialization, but would be used for sorting. It is similar to using
time information stored in the meta field.

The implementation is simple as the only thing that needs to change is
the serialization in minimal part. After looking at the code the only
thing that needs to be done is to replace the string formatting for date
entries from ...format('{e,date} ...', e) to
...format('{e.date:%Y-%m-%d}...', e). There is no change to the parser
layer.

The patch does not come with tests because I thought to it as a better
description of the kind of solution I'm proposing, rather than something
ready to be merges.

I interpret your answer as very hostile (please correct me if this
impression is wrong). Thus I am happy that I haven't invested more time
in my attempt to contribute an idea to your project.

Best,
Dan

--
You received this message because you are subscribed to the Google Groups "Beancount" group.
To unsubscribe from this group and stop receiving emails from it, send an email to beancount+...@googlegroups.com.
To post to this group, send email to bean...@googlegroups.com.

Daniele Nicolodi

unread,
Dec 1, 2018, 9:33:33 PM12/1/18
to bean...@googlegroups.com
Hi Martin,

thank you for your answer. I read the threads you referring to, and I
agree with you that having time information attached to directives
introduces a lot of complication without much benefits. I don't have an
usecase that requires it.

What I was trying to do was to have a less surprising behavior of the
ingest machinery. I have been surprised twice writing importers: the
first time when using datetime classes did not raise an error but
resulted in an invalid benacount file; the second time when I found that
entries are sorted on a key including the line number.

On 01/12/2018 18:59, Martin Blais wrote:
> The problem with adding support for this (i.e., changing the sort order
> rule) is that there will be a lot of other things that start relying on it.
> So say, you change the schema to implicitly support date or datetime in
> the printer, as your patch suggests.
> Now some codes use date, some codes use datetime.

The code already supports any subclass of the date class that prints as
a date. In my importers I just use the class below (minus typos) as a
date field, and I get the sorting I want.

class datetime(datetime.datetime):
def __str__(self):
return '{:%Y-%m-%d}'.format(self)

> Someone uses the result of some importer code that puts datetime in this
> field - code that they didn't write, say - and passes it to some routine
> which fails on datetime. Boom! A bug. (Note that date's don't have a
> timezone and aren't subject to some of the timezone related issues.) Or
> say that person does arithmetic on the dates/datetimes. And it fails.
> That's unpredictable behavior. Then I'll get a request to fix that bug
> on an unsupported data type and someone will be surprised it's not
> supported officially.

I understand your point of view, but there is not much you can do to
stop people using Python language features to interact with beancount
library functions. Other than, well, rewrite it in another language.

> I've agreed with others in previous threads to add 
> 1. Support for the grammar to parse a time and to insert the time
> portion in some metadata field

I agree that extending the grammar to understand a type with date and
time information would be useful, but I don't see any advantage in
having it as part of the directive date in the file format, if it is not
used by beancount. Having a 'time' entry in the metadata could be
otherwise useful.

> 2. Adding an option to support specifying a secondary key for sorting
> This wouldn't break any of the current code by default.

Are you referring here to the sorting that is performed before
processing or before serialization in the ingest framework?

In the first case, I don't know what use case it solves, thus I cannot
comment. In the second case, I don't know what would be a good
interface to expose it.

Cheers,
Dan

Daniele Nicolodi

unread,
Dec 1, 2018, 9:34:03 PM12/1/18
to bean...@googlegroups.com
On 29/11/2018 07:08, Martin Blais wrote:
> On Thu, Nov 29, 2018 at 2:17 AM Daniele Nicolodi <dan...@grinta.net
> <mailto:dan...@grinta.net>> wrote:
>
> On 28/11/2018 23:22, Daniele Nicolodi wrote:
> > On 28/11/2018 23:17, Martin Blais wrote:
> >> On Thu, Nov 29, 2018 at 1:14 AM Oon-Ee Ng <ngoone...@gmail.com
> <mailto:ngoone...@gmail.com>
> >>
> >>     That was my impression as well, but after testing it out I found
> >>     this was not the case. The returned ordered list of
> transactions is
> >>     then sorted by date and line number. 
> >>
> >>
> >> Yes. Sorted by date.
> >
> > Is that documented?
>
> https://docs.google.com/document/d/1wAMVrKIA2qtRGmoVDSUBJGmYZSygUaR0uOMW1GV3YE0/edit#heading=h.gni5f2cin8rd
> https://bitbucket.org/blais/beancount/src/4c79f94e085880bb1d4b469c931435eb246855bf/beancount/core/data.py#lines-568

This does not refer to how directives are serialized in the ingest
framework, but on how they are processed. I may have overlooked it, but
the documentation on ingest does not mention sorting, and if it does,
sorting on line number is not mentioned. Assuming it is not there yet, I
may be a good idea to add it.

Cheers,
Dan

Daniele Nicolodi

unread,
Dec 1, 2018, 9:38:20 PM12/1/18
to bean...@googlegroups.com
On 29/11/2018 07:08, Martin Blais wrote:
> Let me get this right: you're proposing a patch that affects the parser
> layer, that has potential to break a bunch other people's stuff, and
> you're not bothering to write a test that specifically handles that case?

In the remote case that you want to reconsider it, I added unit tests to
printer_test.py and for extract_test.py to test the change and the
behavior of the ingest.

Cheers,
Dan

Martin Blais

unread,
Dec 12, 2018, 1:24:55 AM12/12/18
to Beancount
So I came up with a workaround:

```
# Re-generate "lineno" to force sorting
for lineno, entry in enumerate(entries, start=1):
    entry.meta['lineno'] = lineno
```

Just generate artificial "lineno" keys and insert them into the
transactions before returning them to Beancount. This way I can preserve
the order.

A possible approach to make both parties happy would be to add a new
parameter to "extract_from_file()" function, allowing importers to
disable the sorting behaviour in post-processings.


> --
> You received this message because you are subscribed to the Google Groups "Beancount" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to
> beancount+...@googlegroups.com.
> To post to this group, send email to bean...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/beancount/
> CAGQ70esZ%2BSFkn4VcaG1dh%2BP1i%3DNgQ9rHvSdx-_8eVKvJ1%3DcRMg%40mail.gmail.com.
> For more options, visit https://groups.google.com/d/optout.

--
Zhuoyun Wei

--
You received this message because you are subscribed to the Google Groups "Beancount" group.
To unsubscribe from this group and stop receiving emails from it, send an email to beancount+...@googlegroups.com.
To post to this group, send email to bean...@googlegroups.com.

Martin Blais

unread,
Dec 12, 2018, 1:29:55 AM12/12/18
to Beancount
I'm planning to experiment with converting all the data structures to protobufs and rewrite beancount.core & beancount.parser entirely  in C++ at some point if the cross-language overhead is negligible, mainly for speed. This would enforce the schema; you wouldn't be able to set an invalid data type on the protobuf instance.
 

> I've agreed with others in previous threads to add 
> 1. Support for the grammar to parse a time and to insert the time
> portion in some metadata field

I agree that extending the grammar to understand a type with date and
time information would be useful, but I don't see any advantage in
having it as part of the directive date in the file format, if it is not
used by beancount.  Having a 'time' entry in the metadata could be
otherwise useful.

Indeed, after other suggestions, it seems having a secondary sort key in the metadata would solve the problem of sorting in the few necessary cases yet maintain the simplicity of civil dates (without timezones).
 

> 2. Adding an option to support specifying a secondary key for sorting
> This wouldn't break any of the current code by default.

Are you referring here to the sorting that is performed before
processing or before serialization in the ingest framework?

Everywhere.

 

In the first case, I don't know what use case it solves, thus I cannot
comment.  In the second case, I don't know what would be a good
interface to expose it.

I'd just add an option so you can specify the secondary sort key name.


 

Cheers,

Dan

--
You received this message because you are subscribed to the Google Groups "Beancount" group.
To unsubscribe from this group and stop receiving emails from it, send an email to beancount+...@googlegroups.com.
To post to this group, send email to bean...@googlegroups.com.

Martin Blais

unread,
Dec 12, 2018, 1:31:02 AM12/12/18
to Beancount
It's not perfect, the docs aren't covering everything.




 

Cheers,
Dan


--
You received this message because you are subscribed to the Google Groups "Beancount" group.
To unsubscribe from this group and stop receiving emails from it, send an email to beancount+...@googlegroups.com.
To post to this group, send email to bean...@googlegroups.com.
Reply all
Reply to author
Forward
0 new messages