Design proposal: Custom constructors for providers

bran...@google.com

Apr 22, 2021, 2:13:38 PM

to bazel-dev

Doc here. This is a proposal to add a feature for overriding the default `MyInfo(foo=1, bar=2)` provider constructor syntax, so that you can automatically call an arbitrary Starlark function to decide the field values based on the constructor arguments. This enables validation, argument transformation, and gatekeeping the possible state any provider instance can be in. It also eases migration of native providers that already have custom constructors to Starlark.
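As a rough illustration of what "validation, argument transformation, and gatekeeping" could mean, here is a minimal Python sketch. It is not Bazel's API: `_raw_my_info` and the particular checks are hypothetical, chosen only to show a function sitting between the call syntax and the stored fields.

```python
from types import SimpleNamespace

def _raw_my_info(foo, bar):
    # Hypothetical raw constructor: stores the fields with no checks.
    return SimpleNamespace(foo=foo, bar=bar)

def MyInfo(foo, bar=()):
    # Validation: reject states the provider author never wants to exist.
    if not isinstance(foo, int) or foo < 0:
        raise ValueError("foo must be a non-negative int, got %r" % (foo,))
    # Argument transformation: accept any iterable, store a tuple.
    return _raw_my_info(foo=foo, bar=tuple(bar))

info = MyInfo(foo=1, bar=["a", "b"])
```

Callers still write `MyInfo(...)`, but every instance they obtain has passed through the checks.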

The downside is that it makes providers more complicated and implicit. I'd be interested in hearing how people weigh that readability concern against the benefits.

bran...@google.com

Apr 28, 2021, 5:12:00 PM

to bazel-dev

Here's an admittedly too-lengthy snapshot of my thinking.

tl;dr: We need to pick the exact syntax, and maybe make some decisions about ruling out future feature requests now.

Should we go through with this? (spoiler: yes)

First, we should distinguish between two goals that could possibly be in conflict: 1) improving the usage of Starlark providers in general; 2) avoiding a migration of all callers of JavaInfo(...). Let's decide what an ideal feature would look like first, then decide whether we should sacrifice anything for the sake of JavaInfo.

I had two general concerns about adding a custom constructor feature.

First, I was worried about a slippery slope of expressivity, i.e. that it would lead to requests to add more class-like features. We don't want providers to become classes. I think the rationale to continue with this design is that we're still not allowing abstraction over reading provider instance fields; we're only allowing abstraction over writing (really, initializing) them. dir/getattr/hasattr/str/repr/type etc. will all work the same on all providers. The benefit-cost ratio is better for writing than for reading, because custom construction lets you enforce invariants on the allowed states of any instance of your provider.

Second, I was worried that having construction be a non-primitive operation would complicate reading and analyzing the source code. But we already have the problem today that control flow can be obfuscated by treating functions as data (e.g. a dict of functions serving as a dispatch table). In any situation today where you can easily determine that `Foo(...)` is a provider instantiation, you'd only have to go one more step from Foo's definition to find its custom construction function. And this step should be pretty simple, since you'd expect in most cases they'd live in the same file, like a rule and its implementation function.

(Another minor expressivity concern is the possibility of recursion, but that seems easily dismissed. [1])

So for custom provider constructors, I think the good outweighs the bad. We will support migration of JavaInfo(...)'s implementation to Starlark without requiring updates to call sites. It's just a matter of choosing the exact spelling of this feature. Which brings us to the next topic.

What should the syntax / API look like?

The main proposal I offered would rely on a hidden _sentinel= argument to distinguish between an internal and external caller. This isn't ideal since a user could fake it, but it's also by no means unprecedented in Starlark code. Still, for the sake of tight invariants, it'd be nice to offer a way to actually restrict constructor access to the provider author. (This would also parallel how we currently require you to have the provider symbol available in order to read or write the provider instance on a target.)
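The thread does not show how the doc's `_sentinel=` argument works in detail, so the following Python sketch is only an assumed shape of the pattern. The names `_SENTINEL`, `java_info`, and `make_java_info` are hypothetical. It shows both the guard and why it is not airtight: anyone who can get hold of the sentinel value can pass it.

```python
_SENTINEL = object()  # hypothetical marker, private by naming convention only

def java_info(jar, deps=(), _sentinel=None):
    # Hypothetical guarded constructor: refuses callers without the sentinel.
    if _sentinel is not _SENTINEL:
        raise ValueError("construct JavaInfo through make_java_info()")
    return {"jar": jar, "deps": tuple(deps)}

def make_java_info(jar, deps=()):
    # The blessed entry point: validates, then passes the sentinel along.
    if not jar.endswith(".jar"):
        raise ValueError("expected a .jar file, got %r" % (jar,))
    return java_info(jar, deps, _sentinel=_SENTINEL)

ok = make_java_info("lib.jar")
# A caller that obtains the sentinel can skip the validation entirely.
faked = java_info("notes.txt", _sentinel=_SENTINEL)
```

The second call succeeds even though `notes.txt` would have been rejected by `make_java_info`, which is the weakness a real access restriction would close.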

Ordinarily, we accomplish this type of thing by naming an internal symbol with an underscore prefix so it can't be imported. But in this case, the thing we're trying to restrict is an operation on a public symbol, e.g. JavaInfo. So we would need the provider() function to give you back some kind of hidden token in addition to the regular public provider. Then construction can be restricted to only those who possess the token. Let's recap the possibilities (names are strawmen):

A) Create an "unfinished" provider that can be used for raw construction, then finish/export it in a way that binds the user-facing constructor to the `JavaInfo(...)` syntax. This is Ivo's alternative proposal from the doc. Questions arose as to what would happen if you attempted to export and use the unfinished provider as if it were the finished one, e.g. to retrieve instances off targets. But I think this confusion can be avoided by a renaming of concepts, such that the unfinished provider is really just the raw constructor itself:

```
_raw_constructor = make_raw_constructor()

def _my_constructor(...):
    return _raw_constructor(...)

JavaInfo = provider_with_custom_constructor(
    raw_constructor=_raw_constructor,
    public_constructor=_my_constructor)

# Calling JavaInfo(...) invokes _my_constructor(...). User code can't instantiate JavaInfo in an unsafe way because it can't load _raw_constructor.
```

This may be weird to implement, particularly when you consider the "export" hack. The process of exporting JavaInfo (such that the provider knows its own name by virtue of assignment syntax) is really also saying something about the type of object that _raw_constructor() generates.
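To make the implementation wrinkle concrete, here is a Python model of (A). It is only a sketch: the `RawConstructor` class, the explicit `name` argument (standing in for the "export" step), and the choice to fail when the raw constructor is used before binding are all assumptions, not anything the proposal specifies.

```python
class RawConstructor:
    """Hypothetical 'unfinished provider': builds instances with no checks."""

    def __init__(self):
        self.provider_name = None  # unknown until the provider is finished

    def __call__(self, **fields):
        # Assumed behavior: the raw constructor cannot tag its instances
        # until it has been bound to a finished provider.
        if self.provider_name is None:
            raise RuntimeError("raw constructor is not bound to a provider yet")
        return {"_provider": self.provider_name, **fields}

def provider_with_custom_constructor(name, raw_constructor, public_constructor):
    # Finishing the provider also fixes what the raw constructor produces.
    raw_constructor.provider_name = name
    return public_constructor

_raw_constructor = RawConstructor()

def _my_constructor(jar):
    if not jar.endswith(".jar"):
        raise ValueError("expected a .jar file")
    return _raw_constructor(jar=jar)

JavaInfo = provider_with_custom_constructor("JavaInfo", _raw_constructor, _my_constructor)
info = JavaInfo("lib.jar")
```

In this model the act of finishing `JavaInfo` mutates `_raw_constructor`, which mirrors the observation that exporting the provider also says something about the objects the raw constructor generates.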

B) Have provider() allow for returning the raw constructor. This requires either a new top-level builtin (like the above option) or a new argument that modifies the return type.

```
def my_constructor(...):
    return _raw_constructor(...)

JavaInfo, _raw_constructor = provider_with_custom_constructor(constructor=my_constructor)

# or alternatively:
JavaInfo, _raw_constructor = provider(gimme_a_raw_constructor=True, constructor=my_constructor)
```

This approach, while in some ways tidier than (A), imposes more ordering constraints on the definitions: You have to define my_constructor before it can be referenced in the provider definition. my_constructor of course relies on _raw_constructor in its body, but that's allowed in Starlark so long as it isn't called before _raw_constructor is assigned. Still, it's a pretty ugly definitional constraint IMO.

Note that in both (A) and (B), you can optionally prohibit `JavaInfo(...)` syntax altogether by passing None as the public constructor.

Interaction with existing or future features

Another concern we have is the interaction of custom constructors with existing or hypothetical Build language features.

For the "hidden token" approaches described above, the question becomes: If we have a token that controls access to construction, will we ever want to repurpose it to control access to other features too? We already decided against custom field accessors or private fields, and by the same argument of simplicity we can decide against things like custom equality relations or any form of provider subtyping. It's hard to imagine what else might remain.

Finally, since we're talking about putting constraints on provider construction, we have to ask what other ways there are to instantiate providers besides calling JavaInfo(...). Two possibilities come to mind:

1) Deserialization, e.g. a "from_proto" or "from_json" feature. But this doesn't make much sense since the Build language does not permit reading data from arbitrary inputs, and any hardcoded proto/json data could easily be translated to Starlark ahead of time.

2) Cloning, a la Python's namedtuple()'s `_replace()` method. This feature was proposed for Starlark structs/providers a long time ago but never implemented. It makes sense if you consider providers to just be Plain Old Data, but not so much if you consider them to have custom constructors. You could imagine us adding a `_replace()` feature in the future and restricting it only to provider types that don't use custom constructors. Or we could declare now that we'll never add such a feature.
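Python itself shows why cloning and custom constructors sit uneasily together. In CPython, `_replace()` builds the copy through `_make()`, which does not call a subclass's `__new__`, so an invariant enforced at construction time is not re-checked. The `Version` class below is a made-up example.

```python
from collections import namedtuple

class Version(namedtuple("Version", ["major", "minor"])):
    """Hypothetical record whose constructor enforces an invariant."""
    __slots__ = ()

    def __new__(cls, major, minor):
        if major < 0 or minor < 0:
            raise ValueError("version components must be non-negative")
        return super().__new__(cls, major, minor)

v = Version(1, 2)
# _replace() copies the tuple without going through Version.__new__,
# so the non-negativity check is skipped.
bad = v._replace(major=-1)
```

Here `bad` is a `Version` that `Version(-1, 2)` would have refused to create, which is the kind of hole a generic `_replace()` on providers with custom constructors would open.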

[1] You could have recursion of a user-defined constructor either directly, or through mutual recursion with another function or even another provider constructor. This would be handled in the same way as recursion of any other Starlark function. Today we simply prohibit it dynamically upon the first instance of a recursive call. But in the future we might relax that to a simple stack-overflow check. Regardless, since provider instances are immutable, it remains impossible to construct a cycle between provider instances (though an instance may contain other cyclic objects like lists).

ai...@google.com

Apr 29, 2021, 4:41:01 PM

to bazel-dev

C)

def _JavaInfo_ctor(...):
    return {...}

JavaInfo, rawJavaInfo = provider(constructor = _JavaInfo_ctor, gimme_a_raw=True, fields=...)

Calling JavaInfo invokes _JavaInfo_ctor. That turns args into a dict. That dict is used to initialize the fields of the returned instance.

If gimme_a_raw is True, then additionally return a passthrough that does not call the constructor.

bran...@google.com

Apr 29, 2021, 5:00:48 PM

to bazel-dev

Tony and I spoke offline and that pretty much solves it IMO, thanks.

So to be clear, this is proposal (B) except that the custom constructor returns a dict that is automatically used as if it were **kwargs for the raw constructor, rather than calling the raw constructor explicitly. Then the raw constructor itself is optional and not needed when there are no other factory methods.
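The semantics just described can be modeled in a few lines of Python. This is a sketch under assumptions, not Bazel's `provider()`: the name `provider_model`, the use of `SimpleNamespace` for instances, and the unknown-field check are all invented for illustration.

```python
from types import SimpleNamespace

def provider_model(fields, constructor=None, gimme_a_raw=False):
    """Hypothetical model of proposal (C)."""
    def raw(**kwargs):
        unknown = sorted(set(kwargs) - set(fields))
        if unknown:
            raise TypeError("unexpected fields: %s" % unknown)
        return SimpleNamespace(**kwargs)

    def public(*args, **kwargs):
        # The custom constructor returns a dict, which is applied to the
        # raw constructor as if it were **kwargs.
        return raw(**constructor(*args, **kwargs))

    ctor = public if constructor is not None else raw
    return (ctor, raw) if gimme_a_raw else ctor

def _JavaInfo_ctor(jar, deps=()):
    return {"jar": jar, "deps": tuple(deps)}

# Without gimme_a_raw, the raw constructor never escapes.
JavaInfo = provider_model(fields=["jar", "deps"], constructor=_JavaInfo_ctor)
info = JavaInfo("lib.jar", deps=["base.jar"])

# With gimme_a_raw, other factory functions can build instances directly.
JavaInfo2, rawJavaInfo = provider_model(
    fields=["jar", "deps"], constructor=_JavaInfo_ctor, gimme_a_raw=True)
direct = rawJavaInfo(jar="other.jar", deps=())
```

Because the constructor only returns a dict, it never needs to name the raw constructor, which removes the definition-ordering constraint of (B).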

Incidentally, it's still possible to construct multiple nested JavaInfo instances at once, either by returning {..., rawJavaInfo(...), ...} from _JavaInfo_ctor, or (if Bazel ever permits Starlark recursion in the future) returning {..., JavaInfo(...), ...}, which would implicitly call back into _JavaInfo_ctor. Such a thing should never happen in practice but at least the semantics make sense.

Closing the loop on my other concerns from the previous email:

I'm ready to declare that we're just supporting private constructors, and have no need for a general private token for other operations. So we can use the name "constructor" in any new API we add here.

For any other hypothetical way of constructing a provider, such as a future deserialization or `_replace()` method, we're taking the position that you can't do such a thing generically when the provider has a custom constructor. (And if we never add such methods then the concern remains hypothetical.)

I think that addresses everything so I'll type it up into the doc for another round of review.

bran...@google.com

Apr 30, 2021, 4:13:13 PM

to bazel-dev

Please see the revised doc for the new proposal and for a summary of all considerations that have come up so far.
