FYI: next release will read only UTF8-encoded data; testing welcome


Simon Michael

May 22, 2022, 12:20:13 AM
to hledger
Hi all,

We have always encouraged hledger data files to be UTF-8-encoded, but all this time it was not enforced, and actually we try to read any text encoding that is specified by the system's locale.

The downside is that, like most GHC-compiled programs, we have always been far too sensitive to the system locale: it is routine for us to crash on encountering non-ASCII data if $LANG is unset, or set wrongly, or the right locale is not installed, or when using NixOS, etc.

So, https://github.com/simonmichael/hledger/commit/65e913b7c56a81658a5972d649d7db67508a136c (#1834) just landed in master. Now we will always try to decode text as UTF-8, ignoring the system locale. This should make hledger's behaviour more predictable, which is a good thing. 

But some of you might be relying on the old behaviour, perhaps without realising it, and the new code may fail to read your non-UTF-8-encoded files. (Gracefully? I'm not sure.) If so, you would have to convert such files with, or pipe them through, a conversion command like `iconv -t UTF8`.

So: any and all testing of the latest master by folks with real-world non-ascii data is most welcome and helpful. Or even just let us know if you think you will be affected by this at next upgrade time. On unix machines, an easy way to check your files' encoding is

$ file 2022.journal 
2022.journal: Unicode text, UTF-8 text
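
Where `file` is not available, a similar check can be approximated by attempting a strict UTF-8 decode. The following is a minimal Python sketch, not part of hledger; the helper names `is_valid_utf8` and `file_is_valid_utf8` are made up for illustration.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8 (pure ASCII also passes)."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


def file_is_valid_utf8(path: str) -> bool:
    """Read a whole file as bytes and check it with is_valid_utf8."""
    with open(path, "rb") as f:
        return is_valid_utf8(f.read())
```

For example, `file_is_valid_utf8("2022.journal")` should return `True` for a file that `file` reports as UTF-8 or plain ASCII text.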

Thanks!



Henning Thielemann

May 22, 2022, 2:44:19 AM
to hledger

On Sat, 21 May 2022, Simon Michael wrote:

> The downside is that,  like most GHC-compiled programs, we have always
> been far too sensitive to system locale: it is routine for us to crash
> on encountering non-ascii data if $LANG is unset, or set wrongly, or the
> right locale is not installed, if using nixos, etc.

That's pretty bad. I use Latin-1 for most of my Hledger files.

Simon Michael

May 22, 2022, 5:30:04 AM
to hledger

On May 21, 2022, at 20:44, Henning Thielemann <google...@henning-thielemann.de> wrote:
> That's pretty bad. I use Latin-1 for most of my Hledger files.


Thanks Henning. (May I ask why you stick with Latin-1 ?)


All: here are recent binaries that github users, at least, can download for easier testing. Do these binaries fail to read your files ?

linux-x64 (soon):

mac-x64:

windows-x64:

Simon Michael

May 22, 2022, 5:43:49 AM
to hledger

Simon Michael

May 29, 2022, 2:16:40 PM
to hledger
On May 21, 2022, at 23:30, Simon Michael <si...@joyful.com> wrote:
> On May 21, 2022, at 20:44, Henning Thielemann <google...@henning-thielemann.de> wrote:
>> That's pretty bad. I use Latin-1 for most of my Hledger files.
>
> Thanks Henning. (May I ask why you stick with Latin-1 ?)

Ping? I don't mean to imply you should change your setup, but this would be useful to understand better, if you can say more.

Henning Thielemann

May 29, 2022, 2:21:49 PM
to hledger

On Sun, 29 May 2022, Simon Michael wrote:

> On May 21, 2022, at 20:44, Henning Thielemann <google...@henning-thielemann.de> wrote:

> That's pretty bad. I use Latin-1 for most of my Hledger files.

Because I am happy with Latin-1 for most applications and never switched
to UTF-8. Especially switching encoding in the midst of a revision control
history is not good, right?

Henning Thielemann

May 29, 2022, 2:25:50 PM
to hledger

On Sat, 21 May 2022, Simon Michael wrote:

> On May 21, 2022, at 20:44, Henning Thielemann <google...@henning-thielemann.de> wrote:
> That's pretty bad. I use Latin-1 for most of my Hledger files.
>
>
>
> Thanks Henning. (May I ask why you stick with Latin-1 ?)

A work-around would be to convert latin-1 to utf-8 before calling
hledger/hledger-web. This would complicate adding transactions via
hledger-web, but I do not use this currently. I already run hledger-web on
preprocessed data (with unrolled periodic transactions), thus I cannot use
online addition of transactions, anyway.
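
The work-around described above could look roughly like this in Python. This is only a sketch under stated assumptions: the helper names `transcode` and `run_on_utf8` are invented here, and it assumes the downstream command accepts its input on stdin.

```python
from __future__ import annotations

import subprocess


def transcode(data: bytes, source_encoding: str = "latin-1") -> bytes:
    """Re-encode bytes from source_encoding to UTF-8."""
    return data.decode(source_encoding).encode("utf-8")


def run_on_utf8(path: str, command: list[str], source_encoding: str = "latin-1") -> bytes:
    """Feed the UTF-8 version of a file to command on stdin; return its stdout."""
    with open(path, "rb") as f:
        utf8_data = transcode(f.read(), source_encoding)
    # check=True raises CalledProcessError if the command exits non-zero.
    result = subprocess.run(command, input=utf8_data, capture_output=True, check=True)
    return result.stdout
```

A call might look like `run_on_utf8("2022.journal", ["hledger", "-f", "-", "balance"])`, assuming `-f -` makes hledger read the journal from stdin. The file on disk stays in Latin-1, so the revision history is not disturbed.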

Simon Michael

May 30, 2022, 5:25:52 PM
to hledger
We've had one report of someone using a non-utf-8 encoding, which means there are probably more non-utf-8 users out there. 

We shouldn't avoid doing the right thing - whatever that is - but breaking a user's long-working setup, for no clear benefit to that person, is not a good thing. So I am getting some cold feet here, at least for the imminent 1.26 release. 

Here are some options, some of them feasible for 1.26 and some maybe not:

1. Keep the status quo: read files using the encoding of the system locale.

2. Switch to reading utf-8 always, ignoring the system locale. The idea floated above.

3. 2, but include a fallback flag like --use-system-locale for people who want to keep the old behaviour. (This was always my plan, time allowing.)

4. New idea: 1, but detect the most common system locale misconfigurations ($LANG unset, C, or some value not including variations of "utf8" ?) and in that case assume utf-8. (Too much complication ?)

5. 4, but don't use $LANG heuristics, instead catch and detect all decoding-related errors and in that case try again with utf-8. (Feasible ?)

6. Something else ?
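
To make options 4 and 5 concrete, here is a rough Python sketch of each. It is illustrative only and says nothing about how hledger (a Haskell program) would implement them; the function names and the exact $LANG rules are assumptions.

```python
def lang_looks_non_utf8(lang):
    """Option 4 heuristic: True if LANG is unset, C/POSIX, or names no UTF-8 variant."""
    if not lang or lang in ("C", "POSIX"):
        return True
    # Treat "UTF-8", "utf8", "UTF8" and similar spellings alike.
    normalized = lang.lower().replace("-", "")
    return "utf8" not in normalized


def decode_with_fallback(data: bytes, locale_encoding: str) -> str:
    """Option 5: decode with the locale's encoding; on failure retry as UTF-8."""
    try:
        return data.decode(locale_encoding)
    except (UnicodeDecodeError, LookupError):
        # LookupError covers an encoding name Python does not know.
        return data.decode("utf-8")
```

In practice the inputs would come from something like `os.environ.get("LANG")` and `locale.getpreferredencoding(False)`. One limitation of the option 5 sketch: a single-byte encoding such as Latin-1 accepts every byte sequence, so under such a locale the fallback never triggers and UTF-8 data would be silently misread.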

Any thoughts ?


Henning Thielemann

May 30, 2022, 5:31:17 PM
to hledger

On Mon, 30 May 2022, Simon Michael wrote:

> Here are some options, some of them feasible for 1.26 and some maybe not:
>
> 1. Keep the status quo: read files using the encoding of the system locale.
>
> 2. Switch to reading utf-8 always, ignoring the system locale. The idea floated above.
>
> 3. 2, but include a fallback flag like --use-system-locale for people who want to keep the old behaviour. (This was always my
> plan, time allowing.)

I prefer that.

Please no heuristics. They fail too easily and then the failure is too
hard to analyze.

Simon Michael

May 31, 2022, 3:40:05 AM
to hledger

> On May 30, 2022, at 14:25, Simon Michael <si...@joyful.com> wrote:
> Here are some options, some of them feasible for 1.26 and some maybe not:
>
> 1. Keep the status quo: read files using the encoding of the system locale.
>
> 2. Switch to reading utf-8 always, ignoring the system locale. The idea floated above.

> 3. 2, but include a fallback flag like --use-system-locale for people who want to keep the old behaviour. (This was always my plan, time allowing.)

I forgot:

1.5. Keep the current behaviour (1) as default, but offer a new optional flag to force utf-8 reading (2), like --utf8.


Henning Thielemann

May 31, 2022, 3:44:28 AM
to hledger
Would be less surprising than (3).

Simon Michael

May 31, 2022, 10:26:07 AM
to hle...@googlegroups.com
>> I forgot:
>>
>> 1.5. Keep the current behaviour (1) as default, but offer a new optional flag to force utf-8 reading (2), like --utf8.
>
> Would be less surprising than (3).

But also it's hard to see the point - if somebody knows to use this flag, they could also just fix $LANG.

I have lost perspective on, or never quite understood in the first place, what the ideal behaviour is here. Other than "as a user or as a maintainer, I would like hledger not to crash because of $LANG".

Timofey Zakrevskiy

May 31, 2022, 11:10:39 AM
to hle...@googlegroups.com
Hi!

My two cents: I have only a rather vague idea of what $LANG does, yet a CLI option "treat input as utf8 regardless of my system's (mis)configuration" could not be more explicit.


I'd go even further and propose an option "--input-encoding=" together with "--output-encoding=".
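
As a rough illustration of what such a pair of options could mean, here is a Python sketch using `argparse`. The flags are hypothetical (they mirror the proposal, not any existing hledger option), and the UTF-8 defaults are an assumption.

```python
import argparse
import codecs


def encoding_name(value: str) -> str:
    """argparse type: accept only encodings Python knows, return the canonical name."""
    try:
        return codecs.lookup(value).name
    except LookupError:
        raise argparse.ArgumentTypeError(f"unknown encoding: {value}")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="example")
    # Hypothetical flags mirroring the proposal; the defaults are an assumption.
    parser.add_argument("--input-encoding", type=encoding_name, default="utf-8")
    parser.add_argument("--output-encoding", type=encoding_name, default="utf-8")
    return parser


def recode(data: bytes, args: argparse.Namespace) -> bytes:
    """Decode with the input encoding, then encode with the output encoding."""
    return data.decode(args.input_encoding).encode(args.output_encoding)
```

With `--input-encoding=latin-1`, Latin-1 bytes are decoded explicitly and written out as UTF-8, without consulting $LANG at all. An unknown encoding name is rejected when the arguments are parsed.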

Cheers,
Timofey


On Tue, May 31, 2022, at 16:26, Simon Michael <si...@joyful.com> wrote:

Simon Michael

May 31, 2022, 8:21:51 PM
to hle...@googlegroups.com
We decided to revert this for 1.26 (#1864) and contemplate it a bit more. Thanks all for your input! And keep it coming if you have more. 

My current wish is the more gentle step of detecting decoding errors and reporting them nicely with a clear explanation. 
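
A sketch of what "detect and report nicely" could look like, written in Python for brevity. The function name and the message wording are made up for illustration and are not hledger's actual output.

```python
from typing import Optional


def describe_decode_error(data: bytes, path: str, encoding: str = "utf-8") -> Optional[str]:
    """Return None if data decodes cleanly, else a human-readable error message."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError as err:
        # err.start is the byte offset of the first undecodable byte.
        line = data.count(b"\n", 0, err.start) + 1
        bad = data[err.start]
        return (
            f"{path}: line {line}: byte 0x{bad:02x} at offset {err.start} "
            f"is not valid {encoding}. Check the file's encoding, "
            f"or convert it with a tool such as iconv."
        )
    return None
```

The point is that the user sees which file and line failed and what to try next, rather than a crash.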

Henning Thielemann

Jun 1, 2022, 3:07:19 AM
to hle...@googlegroups.com

On Tue, 31 May 2022, Timofey Zakrevskiy wrote:

> I'd go even further and propose an option "--input-encoding=" together
> with "--output-encoding=".

This would be even better.

Simon Michael (sm)

Jun 22, 2022, 1:20:17 PM
to hledger
I'm happy to look at new PRs related to this issue (I think none are currently pending).

There is a new issue which is slightly analogous, this time for time locale (affecting parsing of CSV dates):
We should probably follow the same general approach for both.

 