[llvm-dev] RFC: On non 8-bit bytes and the target for it


Dmitriy Borisenkov via llvm-dev

Oct 23, 2019, 5:17:00 AM
to llvm...@lists.llvm.org, Leonid Kholodov, Dmitry Shtukenberg

This RFC is to ask whether the community is interested in further discussion of iN byte support. The issue was last on the agenda in May, when the discussion was triggered by Jesper Antonsson's patches (see https://lists.llvm.org/pipermail/llvm-dev/2019-May/132080.html).

It seems that, while some downstream projects benefit from non-8-bit-byte support, the feature is barely maintainable given the lack of upstream targets that use it. The reason I would like to raise the matter again is that we, the TON Labs team, would like to upstream our backend.

The backend generates code for the TON virtual machine (TVM), designed to run smart contracts on the TON blockchain (see the original specifications for TVM and TON at https://test.ton.org/tvm.pdf and https://test.ton.org/tblkch.pdf respectively).

The target has the following key particularities:

  • stack-based virtual machine
  • 257-bit wide integers, signed magnitude representation
  • no floating-point arithmetic support
  • persistent storage
  • no "native" memory; it can be modeled, but only at a significant cost
  • presence of custom types (this is exactly the reason for upstreaming)

Given that the TVM only operates on 257-bit-wide numbers, we changed LLVM downstream to get a 257-bit byte. At the moment, we have a hacky implementation with the new byte size hardcoded. For reference, the change touched approximately 20 files in LLVM and about a dozen in Clang. Later on, we plan to integrate the new byte size with the data layout, along the lines of https://archive.fosdem.org/2017/schedule/event/llvm_16_bit/. And if the community decides to move on, we will upstream and maintain it.

We realize that a 257-bit byte is quite unusual, but for smart contracts it is normal to have at least 256-bit numbers: the leading VM for smart contracts, the Ethereum VM, introduced this practice and other blockchain VMs followed. Thus, while TVM might be the first LLVM-based blockchain target that needs the feature, it is not necessarily the last one. We also found mentions of 12-, 16- and 24-bit-wide bytes in past non-8-bit-byte discussions (in reverse chronological order: https://lists.llvm.org/pipermail/llvm-dev/2019-May/132080.html, http://lists.llvm.org/pipermail/llvm-dev/2017-January/109335.html, http://lists.llvm.org/pipermail/llvm-dev/2017-January/108901.html, http://lists.llvm.org/pipermail/llvm-dev/2015-March/083177.html, http://lists.llvm.org/pipermail/llvm-dev/2014-September/076543.html, http://lists.llvm.org/pipermail/llvm-dev/2009-September/026027.html).

Our toolchain is going to be based only on OSS, which allows using the backend without any proprietary software. Also, we hope that implementing a target similar to TVM will help generalize some concepts in LLVM and make the whole framework better suit non-mainstream architectures.

Aside from non-i8 bytes, we would like to bring stack machine support to the target-independent code generator. The matter will be discussed at the developers' meeting, see http://llvm.org/devmtg/2019-10/talk-abstracts.html#bof2.

LLVM and Clang for TVM are available at https://github.com/tonlabs/TON-Compiler. It is currently based on LLVM 7 and can only produce assembly (we have not specified our object file format yet). Moreover, we have introduced custom IR types to model the Tuples, Slices, Builders and Cells from the specification. We are going to update to a newer LLVM and consider using opaque types before starting to upstream.

--
Kind regards, Dmitry Borisenkov

JF Bastien via llvm-dev

Oct 24, 2019, 5:21:33 PM
to Dmitriy Borisenkov, llvm...@lists.llvm.org, Leonid Kholodov, Dmitry Shtukenberg
I’d like to understand what programming model you see programmers using. You don’t need 257 bits per byte if you only offer 257 bit integers. Rather, bytes aren’t really a thing at that point. LLVM kinda handles iN already, and your backend would legalize everything to exactly this type and nothing else, right? Would it be sufficient to expose something like int<unsigned Size> with Size=257 for your programming environment?

It would also be useful to understand what other changes you’re proposing, especially your mention of Tuples, Slices, Builders, Cells.


_______________________________________________
LLVM Developers mailing list
llvm...@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

David Chisnall via llvm-dev

Oct 24, 2019, 7:02:23 PM
to JF Bastien, Dmitriy Borisenkov, llvm...@lists.llvm.org, Leonid Kholodov, Dmitry Shtukenberg
On 24/10/2019 14:21, JF Bastien via llvm-dev wrote:
> I’d like to understand what programming model you see programmers using.
> You don’t need 257 bits per byte if you only offer 257 bit integers.
> Rather, bytes aren’t really a thing at that point. LLVM kinda handles iN
> already, and your backend would legalize everything to exactly this type
> and nothing else, right? Would it be sufficient to expose something like
> int<unsigned Size> with Size=257 for your programming environment?

To add to what JF says:

Typically, a byte means some combination of:

1. The smallest unit that can be indexed in memory (irrelevant for you,
you have no memory).
2. The smallest unit that can be stored in a register in such a way that
its representation is opaque to software (i.e. you can't tell the bit
order of a byte in a multi-byte word). For you, it's not clear if this
is 257 bits or something smaller.
3. The smallest unit that is used to build complex types in software.
Since you have no memory, it's not clear that you can build structs or
arrays, and therefore this doesn't seem to apply.

From your description of your VM, it doesn't sound as if you can
translate from any language with a vaguely C-like abstract machine, so
I'm not certain why the size of a byte actually matters to you. LLVM IR
has a quite C-like abstract machine, and several of these features seem
like they will be problematic for you. There is quite a limited subset
of LLVM IR that can be expressed for your VM and it would be helpful if
you could enumerate what you expect to be able to support (and why going
via LLVM is useful, given that you are unlikely to be able to take
advantage of any existing front ends, many optimisations, or most of the
target-agnostic code generator).

David

Dmitriy Borisenkov via llvm-dev

Oct 25, 2019, 7:44:55 AM
to David Chisnall, llvm...@lists.llvm.org, Dmitry Shtukenberg, Leonid Kholodov
Just to clarify: the VM indeed doesn't have memory, but we emulate memory with dictionaries (address -> value), which are native to TVM. Thus you can work with arrays and structures in TVM.
However, access to a dictionary is very expensive in terms of gas (the fee you pay for contract execution on the blockchain). We really don't want unaligned memory accesses and things like that. Aside from that, our "ALU" only supports 257-bit operations, and handling overflows of smaller types is an additional expense for the user. So we set sizeof(char) == sizeof(short) == sizeof(int) == sizeof(long) == sizeof(long long) == 1 byte == 257 bits in C. Luckily, the C spec allows it. We do not have a specification requirement to do so, but we found it natural from an implementation and user-experience point of view.
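The dictionary-backed memory model can be sketched as follows (an illustrative C++ stand-in, not TVM code; the names are hypothetical and int64_t stands in for the 257-bit cell type):

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Sketch: emulate linear memory on a machine that only has native
// key->value dictionaries. Every cell holds one target "byte";
// int64_t stands in for the real 257-bit integer type.
struct DictMemory {
  std::map<uint64_t, int64_t> cells; // address -> cell value

  void store(uint64_t addr, int64_t v) { cells[addr] = v; } // one dict write
  int64_t load(uint64_t addr) const {                       // one dict lookup;
    auto it = cells.find(addr);                             // unmapped cells
    return it == cells.end() ? 0 : it->second;              // read as zero
  }
};
```

Each load or store is one dictionary operation, which is why unaligned or sub-byte accesses (several lookups plus masking) would be disproportionately expensive in gas.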

Our goal is to allow using general-purpose languages to develop smart contracts, since we believe it was a shortcoming of Ethereum to focus solely on Solidity. That is why we decided to use LLVM. As for LLVM IR coverage, at the moment we support operations with memory (they are probably not well tested yet, but there is a bunch of tests on arrays) and structures, all integer arithmetic and bitwise operations, control-flow instructions excluding the exception handling stuff and indirectbr, comparisons, extensions and truncations. (We do have values smaller than i257 that are stored in persistent memory, where a user pays for data storage; but persistent memory is a different story - it will likely become a different address space in the future, but for now it's only accessible through intrinsics.) We also support memcpy and memset in non-persistent memory.

As for Slices, Builders and the rest, we aren't so crazy as to propose upstreaming them - they are very specific to our VM. They are an implementation detail at the moment: we introduced these entities as types basically because of time pressure on the project. We want to switch to opaque types if that's possible without losing the correctness of our backend. If it's impossible, well, we will probably start looking for a way to change the framework so that a target could introduce its own types, but I really hope it won't come to that.

So the scope of the changes we'd like to introduce:
1. Getting rid of the byte-size assumption in LLVM and Clang (adding the byte size to the data layout, removing the magic number 8 (where it means the size of a byte) from LLVM and Clang, introducing the notion of a byte for memcpy and memset). The C spec doesn't have this constraint, so I'm not sure that LLVM should be more restrictive here.
2. Adding support for stack machines in the backend (generalizing the algorithms for converting register-based instructions to stack-based ones, a generic implementation of scheduling appropriate for a stack machine, and an implementation of stack-aware (i.e. configurable) reassociation). This was discussed during the BoF talk at the recent conference; we are going to summarize the results soon.
3. The backend itself.

So basically, we believe that (1) is beneficial for Embecosm, Ericsson and the other companies that were actively involved in previous iterations of the non-8-bit-byte discussion. (3) addresses the main concern of the community: the testability of these changes. (2) benefits WebAssembly and further stack machines implemented in LLVM.

--
Kind regards, Dmitry

Jesper Antonsson via llvm-dev

Oct 25, 2019, 11:46:45 AM
to David.C...@cl.cam.ac.uk, dmitriy.bo...@gmail.com, llvm...@lists.llvm.org, leny.k...@gmail.com, dmit...@tonlabs.io
Hi Dmitriy,

I can confirm that Ericsson remains interested in the byte-size issue.
We would be more than happy to contribute/collaborate on patches,
suggestions and reviews in that area, should your upstreaming effort
win community approval.

Best regards, Jesper



Chris Lattner via llvm-dev

Oct 25, 2019, 10:56:53 PM
to David Chisnall, llvm...@lists.llvm.org, Leonid Kholodov, Dmitry Shtukenberg

Right. A 257-bit target is a bit crazy, but there are lots of other targets that only have 16-bit or 32-bit addressable memory. I’ve heard various people say that they all have out-of-tree patches to support non-8-bit-byte targets, but because there is no in-tree target that uses them, it is very difficult to merge these patches upstream.

I for one would love to see some of these patches get upstreamed. If the only problem is one of testing, then maybe we could make a virtual target exist, or maybe we could accept the patches without test cases (so long as they don’t break 8-bit-byte targets, obviously).

-Chris

Dmitriy Borisenkov via llvm-dev

Oct 29, 2019, 3:11:44 PM
to Chris Lattner, llvm...@lists.llvm.org, Dmitry Shtukenberg, Leonid Kholodov
Thanks, Chris, for supporting the idea of having non-8-bit bytes in LLVM.

I want to clarify the scope and then analyze the options we have.

The scope:
1. A BitsPerByte or similar variable should be introduced into the data layout; include/CodeGen/ValueTypes.h and some other generic headers also need to be updated and will probably become dependent on the data layout.
2. The magic number 8 should be replaced with BitsPerByte. We found that 8 is used as "the size of a byte in bits" in SelectionDAG, the asm printer, and analysis and transformation passes. Some of these passes are currently independent of any target-specific information. Downstream, we changed about ten passes before our testing succeeded, but we might have missed some cases due to the incompleteness of our tests.
3. &255 and other bit manipulations. We didn't catch many of those with our downstream testing. But again, at the moment our tests are not sufficiently good for any claims here.
4. The concept of a byte should probably be introduced to Type.h. The assumption that Type::getInt8Ty returns the type of a byte is baked into the code generator, builtins (notably memcpy and memset) and more than ten analysis and transformation passes.
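The mechanical core of points (1) and (2) can be sketched like this (hypothetical names, not the actual LLVM API; the idea is that size computations take the byte width from the data layout instead of a hard-coded 8):

```cpp
#include <cassert>

// Hypothetical stand-in for LLVM's DataLayout carrying a byte width,
// which would come from the data layout string instead of being fixed.
struct DataLayout {
  unsigned BitsPerByte = 8;
};

// The pattern "(Bits + 7) / 8" scattered through the code becomes:
unsigned getStoreSizeInBytes(const DataLayout &DL, unsigned TypeBits) {
  // Round up to a whole number of target bytes.
  return (TypeBits + DL.BitsPerByte - 1) / DL.BitsPerByte;
}
```

With this shape, an i257 on a 257-bit-byte target occupies exactly one byte, while the 8-bit-byte behavior is unchanged.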

It is worth noting that these changes should apply to upcoming patches as well as to the existing ones: if we decide to move on, developers should no longer assume that a byte is 8 bits wide, except in target-dependent pieces of code.

The options we have.
1. Perform 1 - 4 without any testing upstream. This seems a very fragile solution to me. Without any non-8-bit target upstream, it's unlikely that contributors will differentiate between getInt8Ty() and getByteTy(). So I guess that after a couple of months we'll get a mix of 8s and BitsPerBytes in the code, and none of the tests will regress. The remedy is probably an active downstream contributor who stays on top of trunk and checks new patches against their tests daily.
2. Test with a dummy target. It might work if we have a group of contributors who are willing to rewrite and upstream some of their downstream tests, as well as to design and implement the target itself. The issue here might be with functional tests: we'd probably need to implement a dummy virtual machine to run them, because lit tests are unlikely to catch all the issues from points (2) and (3) of the scope described above.
3. TON Labs can provide its crazy target, or some lightweight version of it. From the testing point of view this works similarly to the second solution, but it doesn't require any invention. I could create a separate RFC about the target to find out whether the community thinks it's appropriate.

--
Kind regards, Dmitry.

Tim Northover via llvm-dev

Oct 29, 2019, 3:19:51 PM
to Dmitriy Borisenkov, LLVM Developers Mailing List, Leonid Kholodov, Dmitry Shtukenberg
On Tue, 29 Oct 2019 at 19:11, Dmitriy Borisenkov via llvm-dev
<llvm...@lists.llvm.org> wrote:
> 2. Test with a dummy target. It might work if we have a group of contributors who is willing to rewrite and upstream some of their downstream tests as well as to design and implement the target itself. The issue here might be in functional tests, so we'd probably need to implement a dummy virtual machine to run them because lit tests are unlikely to catch all issues from paragraphs (2) and (3) of the scope described.
> 3. TON labs can provide its crazy target or some lightweight version of it. From the testing point of view, it works similar to the second solution, but it doesn't require any inventions. I could create a separate RFC about the target to find out if the community thinks it's appropriate.

I'm not great at history - are there any historically iconic targets
that aren't 8-bit but are otherwise sane? I'd prefer to spend the
project's resources supporting something like that rather than an
invented target or a speculative crypto-currency oddity.

Cheers.

Tim.

Joerg Sonnenberger via llvm-dev

Oct 29, 2019, 6:40:08 PM
to llvm...@lists.llvm.org
On Tue, Oct 29, 2019 at 07:19:25PM +0000, Tim Northover via llvm-dev wrote:
> On Tue, 29 Oct 2019 at 19:11, Dmitriy Borisenkov via llvm-dev
> <llvm...@lists.llvm.org> wrote:
> > 2. Test with a dummy target. It might work if we have a group of contributors who is willing to rewrite and upstream some of their downstream tests as well as to design and implement the target itself. The issue here might be in functional tests, so we'd probably need to implement a dummy virtual machine to run them because lit tests are unlikely to catch all issues from paragraphs (2) and (3) of the scope described.
> > 3. TON labs can provide its crazy target or some lightweight version of it. From the testing point of view, it works similar to the second solution, but it doesn't require any inventions. I could create a separate RFC about the target to find out if the community thinks it's appropriate.
>
> I'm not great at history, are there any historically iconic targets
> that aren't 8-bit but are otherwise sane? I'd prefer to spend the
> project's resources supporting something like that than either an
> invented target or a speculative crypto-currency oddity.

PDP10 is iconic enough?

Joerg

James Y Knight via llvm-dev

Oct 29, 2019, 6:56:26 PM
to Tim Northover, LLVM Developers Mailing List, Dmitry Shtukenberg, Leonid Kholodov
I'd note that GCC removed its last upstream target with a BITS_PER_UNIT != 8 in version 4.3 in 2008 (that was TMS320C3x/C4x), and there have been none added since. AFAIK, they're in option #1 mode -- no testing upstream, but maybe with downstream forks that still use the ability to set it to other values, and besides, a constant is nicer than a magic "8" anyways.

Last time this was discussed, the LLVM project already came to a consensus that it's reasonable to remove magic "8"s from the code, at least where it arguably helps code clarity -- and if that helps downstream forks with weird byte-sizes too, that's wonderful.

But it's not at all clear to me that it's worthwhile to do more than that (e.g. changing core stuff like the datalayout, introducing weird and otherwise-irrelevant targets, or trying to figure out how to test the byte-width-changing functionality without a target).

JF Bastien via llvm-dev

Oct 29, 2019, 8:14:05 PM
to Joerg Sonnenberger, llvm...@lists.llvm.org

> On Oct 29, 2019, at 3:39 PM, Joerg Sonnenberger via llvm-dev <llvm...@lists.llvm.org> wrote:
>
> On Tue, Oct 29, 2019 at 07:19:25PM +0000, Tim Northover via llvm-dev wrote:
>> On Tue, 29 Oct 2019 at 19:11, Dmitriy Borisenkov via llvm-dev
>> <llvm...@lists.llvm.org> wrote:
>>> 2. Test with a dummy target. It might work if we have a group of contributors who is willing to rewrite and upstream some of their downstream tests as well as to design and implement the target itself. The issue here might be in functional tests, so we'd probably need to implement a dummy virtual machine to run them because lit tests are unlikely to catch all issues from paragraphs (2) and (3) of the scope described.
>>> 3. TON labs can provide its crazy target or some lightweight version of it. From the testing point of view, it works similar to the second solution, but it doesn't require any inventions. I could create a separate RFC about the target to find out if the community thinks it's appropriate.
>>
>> I'm not great at history, are there any historically iconic targets
>> that aren't 8-bit but are otherwise sane? I'd prefer to spend the
>> project's resources supporting something like that than either an
>> invented target or a speculative crypto-currency oddity.
>
> PDP10 is iconic enough?

Is it relevant to any modern compiler though?

I strongly agree with Tim. As I said in previous threads, unless people have actual testable targets for this type of thing, I don’t think we should add the maintenance burden. This isn’t really C or C++ anymore: so much code assumes CHAR_BIT == 8, or at a minimum CHAR_BIT % 8 == 0, that we’d be supporting a different language. IMO they should use a different language, and C / C++ should only allow CHAR_BIT % 8 == 0 (and only for small values of CHAR_BIT).
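As an illustration of the kind of assumption JF describes (this example is mine, not from the thread): typical code bakes 8-bit bytes into shift counts and buffer sizes.

```cpp
#include <cassert>
#include <cstdint>

// Serializing a 32-bit value "one byte at a time" with hard-coded shifts
// silently assumes CHAR_BIT == 8.
void put_u32(unsigned char out[4], uint32_t v) {
  out[0] = (unsigned char)(v >> 24);
  out[1] = (unsigned char)(v >> 16);
  out[2] = (unsigned char)(v >> 8);
  out[3] = (unsigned char)v;
}
// On a CHAR_BIT == 16 target each char would hold 16 value bits, so the
// shifts fill every element only half full, and sizeof(uint32_t) may be 2,
// making "4 bytes" the wrong count as well.
```

Code like this is correct on every mainstream platform, which is exactly why it accumulates everywhere.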

Mehdi AMINI via llvm-dev

Oct 29, 2019, 9:18:47 PM
to JF Bastien, llvm-dev
On Tue, Oct 29, 2019 at 5:14 PM JF Bastien via llvm-dev <llvm...@lists.llvm.org> wrote:


> On Oct 29, 2019, at 3:39 PM, Joerg Sonnenberger via llvm-dev <llvm...@lists.llvm.org> wrote:
>
> On Tue, Oct 29, 2019 at 07:19:25PM +0000, Tim Northover via llvm-dev wrote:
>> On Tue, 29 Oct 2019 at 19:11, Dmitriy Borisenkov via llvm-dev
>> <llvm...@lists.llvm.org> wrote:
>>> 2. Test with a dummy target. It might work if we have a group of contributors who is willing to rewrite and upstream some of their downstream tests as well as to design and implement the target itself. The issue here might be in functional tests, so we'd probably need to implement a dummy virtual machine to run them because lit tests are unlikely to catch all issues from paragraphs (2) and (3) of the scope described.
>>> 3. TON labs can provide its crazy target or some lightweight version of it. From the testing point of view, it works similar to the second solution, but it doesn't require any inventions. I could create a separate RFC about the target to find out if the community thinks it's appropriate.
>>
>> I'm not great at history, are there any historically iconic targets
>> that aren't 8-bit but are otherwise sane? I'd prefer to spend the
>> project's resources supporting something like that than either an
>> invented target or a speculative crypto-currency oddity.
>
> PDP10 is iconic enough?

Is it relevant to any modern compiler though?

I strongly agree with Tim. As I said in previous threads, unless people will have actual testable targets for this type of thing, I think we shouldn’t add maintenance burden.

+1: we should have a testable target in the first place to motivate the maintenance/support in LLVM. Someone should write a HW simulator for such an "academic" architecture and an LLVM backend for it! :)
 
This isn’t really C or C++ anymore because so much code assumes CHAR_BIT == 8, or at a minimum CHAR_BIT % 8 == 0, that we’re supporting a different language. IMO they should use a different language, and C / C++ should only allow CHAR_BIT % 8 == 0 (and only for small values of CHAR_BIT).

I'm missing the link between LLVM support for non-8-bit platforms and "they should use a different language" than C/C++ - can you clarify?

Best,

-- 
Mehdi

Philip Reames via llvm-dev

Oct 29, 2019, 9:19:29 PM
to JF Bastien, Joerg Sonnenberger, llvm...@lists.llvm.org

On 10/29/19 5:13 PM, JF Bastien via llvm-dev wrote:
>
>> On Oct 29, 2019, at 3:39 PM, Joerg Sonnenberger via llvm-dev <llvm...@lists.llvm.org> wrote:
>>
>> On Tue, Oct 29, 2019 at 07:19:25PM +0000, Tim Northover via llvm-dev wrote:
>>> On Tue, 29 Oct 2019 at 19:11, Dmitriy Borisenkov via llvm-dev
>>> <llvm...@lists.llvm.org> wrote:
>>>> 2. Test with a dummy target. It might work if we have a group of contributors who is willing to rewrite and upstream some of their downstream tests as well as to design and implement the target itself. The issue here might be in functional tests, so we'd probably need to implement a dummy virtual machine to run them because lit tests are unlikely to catch all issues from paragraphs (2) and (3) of the scope described.
>>>> 3. TON labs can provide its crazy target or some lightweight version of it. From the testing point of view, it works similar to the second solution, but it doesn't require any inventions. I could create a separate RFC about the target to find out if the community thinks it's appropriate.
>>> I'm not great at history, are there any historically iconic targets
>>> that aren't 8-bit but are otherwise sane? I'd prefer to spend the
>>> project's resources supporting something like that than either an
>>> invented target or a speculative crypto-currency oddity.
>> PDP10 is iconic enough?
> Is it relevant to any modern compiler though?
>
> I strongly agree with Tim. As I said in previous threads, unless people will have actual testable targets for this type of thing, I think we shouldn’t add maintenance burden.

Strongly agreed here as well.

Philip

JF Bastien via llvm-dev

Oct 29, 2019, 10:09:55 PM
to Mehdi AMINI, llvm-dev


On Oct 29, 2019, at 6:18 PM, Mehdi AMINI <joke...@gmail.com> wrote:



C code with 257 bits per byte is nominally C, but it’s realistically incompatible with any existing C code. It’s therefore not the C people use, it’s the C the standard says should exist because historically it might have been a good idea. 

Similarly, C with signed magnitude or ones’ complement integers isn’t the same C we all use. 



Jeroen Dobbelaere via llvm-dev

Oct 30, 2019, 6:08:01 AM
to JF Bastien, Joerg Sonnenberger, llvm...@lists.llvm.org
> From: llvm-dev <llvm-dev...@lists.llvm.org> On Behalf Of JF Bastien via
[..]
> Is it relevant to any modern compiler though?
>
> I strongly agree with Tim. As I said in previous threads, unless people will have
> actual testable targets for this type of thing, I think we shouldn’t add
> maintenance burden. This isn’t really C or C++ anymore because so much code
> assumes CHAR_BIT == 8, or at a minimum CHAR_BIT % 8 == 0, that we’re
> supporting a different language. IMO they should use a different language, and
> C / C++ should only allow CHAR_BIT % 8 == 0 (and only for small values of
> CHAR_BIT).

We (the Synopsys ASIP Designer team) and our customers tend to disagree: our customers create plenty of CPU architectures
with non-8-bit characters (and non-8-bit-addressable memories). We are able to provide them with a working C/C++ compiler solution.
Maybe some support libraries are not supported out of the box, but for these kinds of architectures that is acceptable.
(Besides that, LLVM is also more than just C/C++.)

Greetings,

Jeroen Dobbelaere

JF Bastien via llvm-dev

Oct 30, 2019, 9:36:02 AM
to Jeroen Dobbelaere, llvm...@lists.llvm.org

> On Oct 30, 2019, at 3:07 AM, Jeroen Dobbelaere <Jeroen.D...@synopsys.com> wrote:
>
> 


>>
>> From: llvm-dev <llvm-dev...@lists.llvm.org> On Behalf Of JF Bastien via
> [..]
>> Is it relevant to any modern compiler though?
>>
>> I strongly agree with Tim. As I said in previous threads, unless people will have
>> actual testable targets for this type of thing, I think we shouldn’t add
>> maintenance burden. This isn’t really C or C++ anymore because so much code
>> assumes CHAR_BIT == 8, or at a minimum CHAR_BIT % 8 == 0, that we’re
>> supporting a different language. IMO they should use a different language, and
>> C / C++ should only allow CHAR_BIT % 8 == 0 (and only for small values of
>> CHAR_BIT).
>
> We (Synopsys ASIP Designer team) and our customers tend to disagree: our customers do create plenty of cpu architectures
> with non-8-bit characters (and non-8-bit addressable memories). We are able to provide them with a working c/c++ compiler solution.
> Maybe some support libraries are not supported out of the box, but for these kind of architectures that is acceptable.

That’s the kind of use case I’d happily support if we had upstream testing, say through a backend. I’m also happy if we remove magic numbers.

Can you share the values you see for CHAR_BIT?

> (Besides that, llvm is also more than just c/c++)

Agreed, I bring up C and C++ because they were the languages discussed in the previous proposals.

David Chisnall via llvm-dev

Oct 30, 2019, 11:19:17 AM
to llvm...@lists.llvm.org
On 30/10/2019 10:07, Jeroen Dobbelaere via llvm-dev wrote:
> We (Synopsys ASIP Designer team) and our customers tend to disagree: our customers do create plenty of cpu architectures
> with non-8-bit characters (and non-8-bit addressable memories). We are able to provide them with a working c/c++ compiler solution.
> Maybe some support libraries are not supported out of the box, but for these kind of architectures that is acceptable.
> (Besides that, llvm is also more than just c/c++)

My main concern in this discussion is that we're conflating several
concepts of a 'byte':

- The smallest unit that can be loaded / stored at a time.

- The smallest unit that can be addressed with a raw pointer in a
specific address space.

- The largest unit whose encoding is opaque to anything above the ISA.

- The type used to represent `char` in C.

- The type that has a size that all other types are a multiple of.

In POSIX C (which imposes some extra constraints not found in ISO C),
when lowered to LLVM IR, all of these are the same type:

- Loads and stores of values smaller than i8 or not a multiple of i8
may be widened to a multiple of i8. Bitfield fields that are smaller
than i8 must use i8 or wider operations and masking.

- GEP indexes are not well defined for anything that is not a multiple
of i8.

- There is no defined bit order of i8 (or bit order for larger types,
only an assumption that, for example, i32 is 4 i8s in a specific order
specified by the data layout).

- char is lowered to i8.

- All ABI-visible types have a size that is a multiple of 8 bits.
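The first rule above (widening sub-byte accesses to a full byte) can be sketched with a small example of my own, not David's: storing a 3-bit bit-field at bit offset 2 of its containing i8 compiles to a load, mask, and store of the whole byte.

```cpp
#include <cassert>
#include <cstdint>

// Sketch of a sub-byte store lowered via its containing byte: the 3-bit
// field lives in bits [4:2], so the store becomes read-modify-write.
uint8_t store_bitfield(uint8_t byte, uint8_t field3) {
  const uint8_t mask = 0b111 << 2;                // bits [4:2] hold the field
  return (uint8_t)((byte & ~mask)                 // clear the field's bits
                   | ((uint8_t)(field3 << 2) & mask)); // insert the new value
}
```

This is why the byte is, in practice, the granularity below which no access is independent of its neighbors.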

It's not clear to me whether saying 'a byte is 257 bits' means changing all
of these to 257, or changing only some of them (which?). For
example, when compiling C for 16-bit-addressable historic
architectures, typically:

- char is 8 bits.

- char* and void* are represented as a pointer plus a 1-bit offset
(sometimes encoded in the low bit, so the load / store sequence is a
right shift by one, a load, and then a mask or mask-and-shift depending on
the low bit).

- Other pointer types are 16-bit aligned.
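The char* load sequence described above can be sketched as follows (illustrative only; real targets vary in which half of the word the low bit selects):

```cpp
#include <cassert>
#include <cstdint>

// Sketch: on a 16-bit-addressable machine, char* carries a 1-bit sub-word
// offset in its low bit. Loading a char is a shift (to get the word
// address), a word load, then a mask or shift on the selected half.
uint8_t load_char(const uint16_t *mem, uint32_t char_ptr) {
  uint16_t word = mem[char_ptr >> 1];            // word address = ptr >> 1
  return (char_ptr & 1) ? (uint8_t)(word >> 8)   // odd address: high half
                        : (uint8_t)(word & 0xFF); // even address: low half
}
```

Note that only char-typed accesses pay this cost; word-aligned pointer types load directly.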

IBM's 36-bit word machines use a broadly similar strategy, though with
some important differences, and I would imagine that most Synopsys cores
are going to use some variation on this approach.

This probably involves quite a different design from a model with 257-bit
registers; most of these concerns don't exist if you don't have memory
that can store byte arrays, which leads to very different design decisions.

TL;DR: A proposal for supporting non-8-bit bytes needs to explain what
their expected lowerings are and what they mean by a byte.

David

Chris Lattner via llvm-dev

Oct 30, 2019, 6:31:02 PM
to Jeroen Dobbelaere, llvm...@lists.llvm.org

> On Oct 30, 2019, at 3:07 AM, Jeroen Dobbelaere via llvm-dev <llvm...@lists.llvm.org> wrote:
>
>> From: llvm-dev <llvm-dev...@lists.llvm.org> On Behalf Of JF Bastien via
> [..]
>> Is it relevant to any modern compiler though?
>>
>> I strongly agree with Tim. As I said in previous threads, unless people will have
>> actual testable targets for this type of thing, I think we shouldn’t add
>> maintenance burden. This isn’t really C or C++ anymore because so much code
>> assumes CHAR_BIT == 8, or at a minimum CHAR_BIT % 8 == 0, that we’re
>> supporting a different language. IMO they should use a different language, and
>> C / C++ should only allow CHAR_BIT % 8 == 0 (and only for small values of
>> CHAR_BIT).
>
> We (Synopsys ASIP Designer team) and our customers tend to disagree: our customers do create plenty of cpu architectures
> with non-8-bit characters (and non-8-bit addressable memories). We are able to provide them with a working c/c++ compiler solution.
> Maybe some support libraries are not supported out of the box, but for these kind of architectures that is acceptable.
> (Besides that, llvm is also more than just c/c++)

I agree - there are a lot of weird accelerators with LLVM backends, many of which aren’t targeted by C compilers/code. The ones that do have C frontends often use weird dialects or lots of builtins, but they are still useful to support.

I find this thread a bit confusing: it seems that people are aware that such chips exist (even today), but some folks are reticent to add generic support for them. While I can see the concern about inventing new backends just for testing, I don’t see an argument against generalizing the core and leaving it untested (in master). If any bugs creep in, then people with downstream targets can fix them in core.

-Chris

Mikael Holmén via llvm-dev

unread,
Oct 31, 2019, 2:51:17 AM10/31/19
to Jeroen.D...@synopsys.com, clat...@nondot.org, llvm...@lists.llvm.org
Thanks Chris! This is what we would like to see as well!

We have a 16-bit-byte target downstream and we live pretty much on
top-of-tree since we pull from llvm every day. Every now and then we
find new 8-bit-byte assumptions in the code that break things for us,
which we fix downstream.

If we were allowed, we would be happy to upstream such fixes which
would make life easier both for us (as we would need to maintain fewer
downstream diffs) and (hopefully) for others living downstream with
other non-8bit byte targets.

Now, while we try to fix things in ways that would work for several
different byte sizes, what _we_ actually test is 16-bit bytes, so
I'm sure we fail to generalize things enough for all sizes, but at
least our contributions will make things more general than today.

And I imagine that if other downstream targets use byte sizes
different from ours, they would also notice when things break and
would pitch in and generalize further, so that in the end it works for
all users.

/Mikael

>
> -Chris
>
> _______________________________________________
> LLVM Developers mailing list
> llvm...@lists.llvm.org
>
> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Dmitriy Borisenkov via llvm-dev

unread,
Oct 31, 2019, 7:18:05 AM10/31/19
to David Chisnall, llvm...@lists.llvm.org
David, just to clarify a misconception I might have introduced: we do not have linear memory; all data is stored as a trie. We do, however, support arrays, structures, and GEPs, as well as all relevant C features, by modeling memory.

So regarding the concept of a byte, all 5 statements you gave are true for our target, either due to the specification or because of performance (gas consumption) issues. But if there are architectures that need less from the notion of a byte, we should try to figure out the common denominator. It's probably OK to be less restrictive about a byte.

--
Kind regards, Dmitry

David Chisnall via llvm-dev

unread,
Oct 31, 2019, 7:48:46 AM10/31/19
to Dmitriy Borisenkov, llvm...@lists.llvm.org
On 31/10/2019 11:17, Dmitriy Borisenkov wrote:
> David, just to clarify a misconception I might have introduced, we do
> not have linear memory in the sense that all data is stored as a trie.
> We do support arrays, structures and GEPs, however, as well as all
> relevant features in C by modeling memory.

So, if I understand correctly, your memory is a key-value store where
the keys are 257-bit values and the values are arrays of 257-bit values?
Or are the values themselves single 257-bit values? To help the
discussion, can you please explain how the following are represented:

- A pointer to an object.
- A pointer to a field in an object.
- An arbitrary void*.
- The C string "hello world"

> So regarding concepts of byte, all 5 statements you gave are true for
> our target. Either due to the specification or because of
> performance (gas consumption) issues. But if there are architectures
> that need less from the notion of byte, we should try to figure out the
> common denominator. It's probably ok to be less restrictive about a byte.

It seems odd to encode a C string as an array of 257-bit values, rather
than as an array of 8-bit values that are stored in 32-char chunks.

Dmitriy Borisenkov via llvm-dev

unread,
Oct 31, 2019, 9:41:40 AM10/31/19
to David Chisnall, llvm...@lists.llvm.org
> So, if I understand correctly, your memory is a key-value store where the keys are 257-bit values and the values are arrays of 257-bit values?
Both the keys and the values are 257 bits wide:

- A pointer to an object is a 257-bit integer.
- The same for a pointer to a field in an object.
- And an arbitrary void* is also a 257-bit-wide integer.
- "Hello, world" is an array of 257-bit characters.

It's indeed redundant for letters and pointers to occupy that much space. However, a realistic contract that is able to run on a virtual machine without exceeding gas limits can't use strings and memory extensively, so we've chosen the simplest implementation possible. If other targets with non-8-bit bytes pack multiple 8-bit characters into a single byte, and it's convenient for the community to maintain that kind of design, we can probably reimplement strings that way too.

Persistent data, which is kept in the blockchain, is more compact, but it requires explicit intrinsic calls to deserialize; the programmer can then manipulate it as 257-bit integers.

Dmitriy Borisenkov via llvm-dev

unread,
Nov 1, 2019, 6:42:18 AM11/1/19
to Mikael Holmén, llvm...@lists.llvm.org
A summary of the discussion so far.

It seems that there are two possible solutions on how to move forward with non-8-bit bytes:

1. Commit changes without tests. Chris Lattner, Mikael Holmen, Jeroen Dobbelaere, and Jesper Antonsson support this idea.
James Y Knight says that at least magic numbers should be removed, "at least where it arguably helps code clarity". This might not be exactly the scope of the changes discussed, but code clarity is probably worth discussing once there are concrete patches.
GCC (according to James Y Knight) has the same practice: non-8-bit bytes are supported, but there are no tests upstream, and downstream contributors fix the bugs if they appear in the core.
David Chisnall raised the question of what counts as a byte (which defines the scope of the changes), and we suggest using all 5 criteria he gave:
> - The smallest unit that can be loaded / stored at a time.
> - The smallest unit that can be addressed with a raw pointer in a specific address space.
> - The largest unit whose encoding is opaque to anything above the ISA.
> - The type used to represent `char` in C.
> - The type that has a size that all other types are a multiple of.
But if DSPs are less restrictive about the byte, some of the criteria could be removed.

2. Use an iconic target. The PDP-10 was suggested as a candidate. This option found support from Tim Northover, Joerg Sonnenberger, Mehdi AMINI, and Philip Reames. It's not clear, though, whether this opinion opposes upstreaming non-8-bit-byte support without tests, or only the dummy-target and TVM-target options.

So if there is no strong opposition to solution 1 from the people supporting the iconic-target option, we could probably move on to the patches.


--
Kind regards, Dmitry Borisenkov

Finkel, Hal J. via llvm-dev

unread,
Nov 1, 2019, 7:18:01 AM11/1/19
to Dmitriy Borisenkov, Mikael Holmén, llvm...@lists.llvm.org


On 11/1/19 5:41 AM, Dmitriy Borisenkov via llvm-dev wrote:
> A summary of the discussion so far.
> [..]


I also support option 1, although having an in-tree target would certainly be better. That having been said, I'm assuming that the byte size will be part of the DataLayout and, as a result, we can certainly have unit tests and tests for IR-level changes in the usual way. We should only omit tests where we can't currently have them.

 -Hal

-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory

Adrian Prantl via llvm-dev

unread,
Nov 1, 2019, 11:40:45 AM11/1/19
to Dmitriy Borisenkov, llvm...@lists.llvm.org

> On Nov 1, 2019, at 3:41 AM, Dmitriy Borisenkov via llvm-dev <llvm...@lists.llvm.org> wrote:
>
> A summary of the discussion so far.
>
> It seems that there are two possible solutions on how to move forward with non 8 bits byte:
>
> 1. Commit changes without tests. Chris Lattner, Mikael Holmen, Jeroen Dobbelaere, Jesper Antonsson support this idea.
> [..]
>
> 2. Use an iconic target. PDP10 was suggested as a candidate. This opinion found support from Tim Northover, Joerg Sonenberger, Mehdi AMINI, Philip Reames. It's not clear though does this opinion oppose upstreaming non-8-bits byte without tests or just a dummy and TVM targets options.
>
> So if there is no strong opposition to the solution 1 from the people supporting an iconic target option, we could probably move to the patches.

I'm in camp (2). Any changes that are not tested are an invitation to upstream developers to "simplify" the code, not knowing that those changes are important. Anyone who commits untested changes to LLVM will inevitably face an uphill battle against benevolent NFC refactorings that break these changes because the expectation of how the code is supposed to behave is not codified in a test. In the short term option (1) sounds more appealing because they can start right away, but I'm going to predict that it will be more expensive for the downstream maintainers of non 8-bit targets in the long term.

-- adrian

Robinson, Paul via llvm-dev

unread,
Nov 1, 2019, 11:43:24 AM11/1/19
to Dmitriy Borisenkov, Mikael Holmén, llvm...@lists.llvm.org
Did somebody say "PDP-10"? 😊

> [..]

Note that for the PDP-10, not all 5 criteria are the same thing.
It is a word-addressed machine (36-bit words) but the ISA has
instructions to handle 18-bit halfwords, and also defines a
"byte pointer" to allow load/store of arbitrary-size bytes within
a word. Byte pointers allow any size byte that fits in a word
(from 1 bit to 36 bits). So what we have is:

> - The smallest unit that can be loaded / stored at a time.

This is 1 bit, from the ISA's perspective, using byte pointers.
Obviously caches and such would be word-based, but that's not
the point of this criterion.

> - The smallest unit that can be addressed with a raw pointer
in a specific address space.

On PDP-10, the naïve interpretation of "raw pointer" would be
a simple memory address, so this is a 36-bit word. (Halfword
access uses different instructions to move the upper or lower
halfwords; it's not encoded in the address.)
Note that `char *` is not a "raw pointer" in this sense; it is
a byte pointer.

> - The largest unit whose encoding is opaque to anything above
the ISA.

I am not clear what this actually means. I could interpret it
as a double-word floating point, but I doubt that was what was
intended.

> - The type used to represent `char` in C.

tl;dr: 7-bit byte.

C is hard to map to PDP-10. DEC did not provide a compiler,
although I was aware of a third-party C compiler; it used 7-bit
ASCII for `char` which was the most typical character size on
that machine. (Sixbit was also used frequently, if you didn't
need lowercase or many special characters, e.g. for filenames.
8-bit ASCII was uncommon, unless you were forced into doing
data transfers to those newfangled PDP-11 and VAX things.)
This means that `char *` and `int *` had different formats, the
former being a byte pointer and the latter being an address;
casting was not free.

> - The type that has a size that all other types are a multiple
of.

Discounting 'char' and strings, I'd have to say this would be
the 36-bit word, i.e. 'int'.

So, in summary, on the PDP-10 a "byte" might be any of:
- one bit
- seven bits
- 18 bits
- 36 bits
depending on what you mean.

Here endeth the lesson. 😊 Let me know if you need any other
historical trivia.

--paulr
DEC employee from 1982-1992

Mehdi AMINI via llvm-dev

unread,
Nov 1, 2019, 11:21:05 PM11/1/19
to Dmitriy Borisenkov, llvm...@lists.llvm.org
On Fri, Nov 1, 2019 at 3:42 AM Dmitriy Borisenkov via llvm-dev <llvm...@lists.llvm.org> wrote:
> A summary of the discussion so far.
> [..]
> 2. Use an iconic target. PDP10 was suggested as a candidate. This opinion found support from Tim Northover, Joerg Sonenberger, Mehdi AMINI, Philip Reames. It's not clear though does this opinion oppose upstreaming non-8-bits byte without tests or just a dummy and TVM targets options.

To clarify mine: I'm not in favor of 1. It seems to me like a "short-term"-motivated approach that does not have a good testing/validation story; I'd rather see a more complete proposal that has a path toward an end-to-end solution.
It's not that I'm strongly opposed, but I see not going with 1 right now as putting the burden on (or encouraging) these downstream LLVM users to engage and create (if this is critical enough) the complete upstream story instead of stopping short.

-- 
Mehdi

Jorg Brown via llvm-dev

unread,
Nov 2, 2019, 3:23:43 AM11/2/19
to Robinson, Paul, llvm...@lists.llvm.org
Fascinating.

So, a 36-bit word could contain 6 Sixbits, 5 7-bit ASCII characters, or 4 8-bit ASCII characters for communicating with later DEC machines?

I was going to ask what the compiler does when it sees "Hello World"... but since DEC didn't provide a compiler, I suppose there can't be an answer to that...

I would say that it's critical for memcpy to work well enough that it copies all the bits, which to me means that the size of a "word" has to be a multiple of whatever 'char' is.  That rules out both 8-bit chars and 7-bit chars.  I would say your only choices are:

1 bit
6 bits
9 bits
36 bits

3, 4, 12, and 18 also evenly divide 36, but I don't see any compelling reason to want them.

A 9-bit char would have some use if 8-bit characters were only packed 4-to-a-word.

A 1-bit char would be awesome because then you might end up with the only architecture in the world where vector<bool> wasn't an abomination.  I guess then that "Hello World" might have a size of 77 chars (84 counting the NUL), assuming that the compiler treated 7-bit as the preferred encoding.

...

Thanks for the lesson.  I have a very dim recollection of programming a PDP in college... apparently blissfully unaware of word sizes... which makes me think it was probably an 11/70.

-- Jorg

Jorg Brown via llvm-dev

unread,
Nov 2, 2019, 3:45:55 AM11/2/19
to Adrian Prantl, llvm...@lists.llvm.org
I've worked on multiple codebases where an option existed in order to satisfy an extremely small userbase, with little or no testing, and as such, I'm adamantly opposed to repeating it.

In addition to what Adrian said, where the weird option exists but is constantly being incidentally broken, I've seen the opposite problem: people become afraid to refactor a section of code because it might break the weird option.  You might say "if there aren't any tests, people should feel free to refactor the code; their only responsibility is to make sure the tests will still work", but honestly, I've seen the opposite: without tests, it's just presumed that touching certain parts of code is likely to break things, and so after a time, an aura of untouchability starts to surround regions of the code.  And then, the more time goes by, the more that code becomes unfamiliar to everyone (because no one is actively maintaining it).  In the long run, the cost of an unmaintained option may be far more than the cost of a maintained one.

In short: please don't commit changes without tests.  Even if the test is nothing but making sure this works:

int main(int argc, char *argv[]) {
  return argv[argc - 1][0];
}

That at least would give some freedom from the guilt of breaking something important.

-- Jorg

David Blaikie via llvm-dev

unread,
Nov 4, 2019, 3:16:16 PM11/4/19
to Jorg Brown, llvm...@lists.llvm.org
On Sat, Nov 2, 2019 at 12:45 AM Jorg Brown via llvm-dev <llvm...@lists.llvm.org> wrote:
On Fri, Nov 1, 2019 at 8:40 AM Adrian Prantl via llvm-dev <llvm...@lists.llvm.org> wrote:
> [..]

> I've worked on multiple codebases where an option existed in order to satisfy an extremely small userbase, with little or no testing,

In those situations, were the core developers responsible for those features/users? Yeah, if I needed a certain observable feature of clang to keep working, I'd want tests (I'm pretty serious about testing, FWIW).
 
> and as such, I'm adamantly opposed to repeating it.
 
> In addition to what Adrian said, where the weird option exists but is constantly being incidentally broken, I've seen the opposite problem: people become afraid to refactor a section of code because it might break the weird option.  You might say "if there aren't any tests, people should feel free to refactor the code; their only responsibility is to make sure the tests will still work", but honestly, I've seen the opposite: without tests, it's just presumed that touching certain parts of code is likely to break things, and so after a time, an aura of untouchability starts to surround regions of the code.  And then, the more time goes by, the more that code becomes unfamiliar to everyone (because no one is actively maintaining it).  In the long run, the cost of an unmaintained option may be far more than the cost of a maintained one.


I'm not actually opposed to this situation - LLVM as a project is pretty happy about making big structural changes to the codebase & holding the test coverage and downstream users accountable for ensuring quality. We rarely avoid changes due to risk of breakage & as a community push back a fair bit on reviewers suggesting we should - if someone can't demonstrate the breakage in an upstream test (or has a pretty good track record of true positives that may take some time to investigate - and thus it might be better in the short term to revert while waiting for that evidence to be provided) the changes tend to go in and stay in.

Yeah, I think some parts of the code may become complicated enough to warrant separate testing - but most of the code that might move to constants for byte width, iterate over bits to that byte width, etc, will be tested on one value & might have bugs on other values that will be found (or not) downstream - best effort and all. But in cases where the code to handle novel byte widths becomes more complicated - some abstraction and unit testing would seem quite appropriate.

 
> In short: please don't commit changes without tests.  Even if the test is nothing but making sure this works:
>
> int main(int argc, char *argv[]) {
>   return argv[argc - 1][0];
> }
>
> That at least would give some freedom from the guilt of breaking something important.

It's hard to make sure that works in a meaningful sense in this context - without a non-8-bit-byte target in upstream LLVM, which is the point of contention/discussion. It's unclear if there's a suitable target/community to provide/maintain such a target upstream. I don't think there's a "cheap"/stub/trivial target we could create that would provide what you're suggesting without bitrotting quickly and being removed (more quickly than, I think, the sort of patches to support but not provide, non-8-bit-byte targets).

Though it's hard to guess without seeing the sort of patches that'd be needed.

- Dave
 

-- Jorg

James Molloy via llvm-dev

unread,
Nov 4, 2019, 7:00:41 PM11/4/19
to David Blaikie, llvm...@lists.llvm.org
Hi,

To throw my hat into the ring, we also maintain a downstream target that has a subset of this problem - it has non-8-bit-addressable memory. This means that %uglygeps don't work, pointers can't be cast to i8* and expected to be byte-addressed (conversions to memcpy/memset that use i8* aren't welcome either), and GEP lowering code is slightly different.

We maintain this currently as a pile of known technical debt. We have a CGP pass to decompose GEPs and custom-expand them taking our word size into account, but everything before that is just waiting for InstCombine to break something. AliasAnalysis also makes offsetting assumptions that are totally bogus because our pointers are word-addressed and therefore so are pointer offsets.

We'd be really keen to help here. We're keen to upstream anything we possibly can (and have been, over the past few months). We've had several discussions about how best to approach upstream with this, and the sticking point has always been the lack of a testing target. It's always felt to me that an addressable granule should be a fairly reasonable DataLayout addition; we can test DataLayout changes purely via opt without requiring a target that uses them. Lowering to instructions was always the testing sticking point.
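As a sketch of what such an opt-only test input might look like: the `b:16` byte-size key below is hypothetical (it is not current datalayout syntax; it is in the spirit of the FOSDEM proposal mentioned earlier in this thread), and would let an IR test exercise a 16-bit-byte layout with no backend at all:

```llvm
; Hypothetical: "b:16" declares a 16-bit byte; the remaining keys are
; ordinary datalayout syntax for a 16-bit-addressed target.
target datalayout = "e-b:16-p:16:16-i16:16:16-n16"

@c = global i16 104   ; one 'h', occupying a single 16-bit byte

define i16 @load_char() {
  %v = load i16, i16* @c
  ret i16 %v
}
```

An InstCombine or AliasAnalysis lit test over such a module would pin down the offset arithmetic without needing any lowering.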

We'd be keen to help out with whatever the community decides to do here. I personally feel it's reasonable that:
  - LangRef/DataLayout is updated with semantically coherent changes.
  - The midend optimizer is updated by someone who cares about those changes and tests are added that use the new DataLayout.
  - Developers that don't care about those changes maintain a best-effort approach, which is exactly what we do right now; there are features that are tested but are still esoteric enough that I might reasonably break them without realising (OperandBundles come to mind), so I don't think there's any change in mindset here.
  - Developers that care perform downstream testing and provide review feedback / revert if really bad / fixes. Again, this is how LLVM works right now - I'd guess that >80% of our real-world test coverage comes from downstream users deploying ToT LLVM rather than the upstream LIT tests / builders.

Cheers,

James

David Blaikie via llvm-dev

unread,
Nov 4, 2019, 7:07:45 PM11/4/19
to James Molloy, llvm...@lists.llvm.org
On Mon, Nov 4, 2019 at 4:00 PM James Molloy <ja...@jamesmolloy.co.uk> wrote:
> [..]
>   - Developers that care perform downstream testing and provide review feedback / revert if really bad / fixes. Again, this is how LLVM works right now - I'd guess that >80% of our real-world test coverage comes from downstream users deploying ToT LLVM rather than the upstream LIT tests / builders.

This last bit is the bit I'd worry about if the bug were only visible in an out-of-tree target. If it's visible with a DataLayout change that can be tested upstream, then I'd be OK with an upstream patch being reverted given a test case with custom DataLayout showing the failure.

But if the failure is for, say, non-8-bit-bytes that are only actually exercised downstream, I'm less OK with patches being reverted upstream no matter the severity of the breakage to such targets downstream.

- Dave

 

James Molloy via llvm-dev

unread,
Nov 4, 2019, 7:10:24 PM11/4/19
to David Blaikie, llvm...@lists.llvm.org
> But if the failure is for, say, non-8-bit-bytes that are only actually exercised downstream, I'm less OK with patches being reverted upstream no matter the severity of the breakage to such targets downstream.

I think that's totally reasonable and is the same yardstick applied already - reversion is only justifiable if a testcase can be provided. Anything else would be an undue burden on the original committer and would require shotgun debugging.

Sean Silva via llvm-dev

unread,
Nov 5, 2019, 7:36:15 PM11/5/19
to David Blaikie, llvm...@lists.llvm.org
On Mon, Nov 4, 2019 at 4:07 PM David Blaikie via llvm-dev <llvm...@lists.llvm.org> wrote:


This last bit is the bit I'd worry about if the bug were only visible in an out-of-tree target. If it's visible with a DataLayout change that can be tested upstream, then I'd be OK with an upstream patch being reverted given a test case with custom DataLayout showing the failure.

But if the failure is for, say, non-8-bit-bytes that are only actually exercised downstream, I'm less OK with patches being reverted upstream no matter the severity of the breakage to such targets downstream.

Suppose that the failure is build breakage for the downstream target. This could cause the entirety of the downstream LLVM/Clang build to break in a way that isn’t easy to untangle or fix. This could prevent an entire organization from integrating LLVM for an extended period of time. So it isn’t necessarily “no matter the severity of the breakage for such *targets* downstream” but could be the entire organization unless a tactical downstream revert of the offending change can be made. 

— Sean Silva

David Blaikie via llvm-dev

unread,
Nov 5, 2019, 8:02:46 PM11/5/19
to Sean Silva, llvm...@lists.llvm.org
On Tue, Nov 5, 2019 at 4:35 PM Sean Silva <chiso...@gmail.com> wrote:


Suppose that the failure is build breakage for the downstream target. This could cause the entirety of the downstream LLVM/Clang build to break in a way that isn’t easy to untangle or fix. This could prevent an entire organization from integrating LLVM for an extended period of time. So it isn’t necessarily “no matter the severity of the breakage for such *targets* downstream” but could be the entire organization unless a tactical downstream revert of the offending change can be made. 

That's already the world they live in today with out-of-tree targets, especially ones with non-8-bit bytes. I don't think upstream can sign up for any stronger support here in that situation.

James Molloy via llvm-dev

unread,
Nov 5, 2019, 8:21:14 PM11/5/19
to David Blaikie, llvm...@lists.llvm.org
> That's already the world they live in today with out-of-tree targets, especially ones with non-8-bit bytes. I don't think upstream can sign up for any stronger support here in that situation.

I absolutely agree.

Chris Lattner via llvm-dev

unread,
Nov 6, 2019, 12:53:10 AM11/6/19
to James Molloy, llvm...@lists.llvm.org
On Nov 4, 2019, at 4:00 PM, James Molloy via llvm-dev <llvm...@lists.llvm.org> wrote:
> We'd be keen to help out what the community decides to do here. I personally feel it's reasonable that:
> - LangRef/DataLayout is updated with semantically coherent changes.
> - The midend optimizer is updated by someone who cares about those changes and tests are added that use the new DataLayout.

This makes perfect sense to me - if you want the mid-level optimizer to be aware of this, then it makes sense for it to be part of DataLayout.

Question though: given that approach, wouldn’t that make this whole thing in-tree-testable?

What would not be testable given the appropriate data layout extensions?

-Chris

James Molloy via llvm-dev

unread,
Nov 6, 2019, 10:44:05 AM11/6/19
to Chris Lattner, llvm...@lists.llvm.org
Hi Chris,

SelectionDAGBuilder, DAGCombine and anything in CodeGen (though I can't think of concrete examples there offhand). I must admit that until I wrote up that email, it hadn't quite occurred to me how much of the change would be testable. After the writeup, it doesn't sound quite as contentious as I'd considered it.

SelectionDAGBuilder is a simple non-invasive change to visitGetElementPtrInst(); DAGCombiner may have some nasties lurking in it. Inability to test a DAGCombine is hardly a new problem though :)

Cheers,

James

Chris Lattner via llvm-dev

unread,
Nov 7, 2019, 1:41:14 AM11/7/19
to James Molloy, llvm...@lists.llvm.org
On Nov 6, 2019, at 7:43 AM, James Molloy <ja...@jamesmolloy.co.uk> wrote:
SelectionDAGBuilder is a simple non-invasive change to visitGetElementPtrInst(); DAGCombiner may have some nasties lurking in it. Inability to test a DAGCombine is hardly a new problem though :)

Yeah, that was my sense.  Here is another crazy question - just to probe the limits of your tolerance here :-).

Assuming we added a new field to DataLayout, what would the emergent behavior be for an “invalid” configuration given a target?

Let me consider an existing situation: what would happen if we shoved a .ll file with a datalayout specifying ‘big endian’ into X86 or some other little endian target?

The answer (assuming no hard error) is that all the target-independent logic would still key off this, and the target-specific logic wouldn’t.  Given that the testing goal is to generalize the target-independent logic and test it, this emergent behavior may be enough to test the SelectionDAGBuilder changes, for example, with some existing target.

-Chris
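A sketch of the mismatched-endianness thought experiment described above; the layout string is the standard x86-64 one with the leading "e" flipped to "E", and whether any tool rejects it up front is a separate question:

```llvm
; A module whose DataLayout claims big-endian ("E") while the triple names
; a little-endian target. Target-independent passes consult this string,
; so layout-sensitive folds in the midend would behave big-endian, while
; X86-specific lowering would follow the target's own notion.
target datalayout = "E-m:e-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"
```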