Inferring Absolute Source Roots

49 views
Skip to first unread message

Stu Hood

unread,
Jul 31, 2014, 8:46:16 PM7/31/14
to pants-devel
Source roots are currently a stateful part of build file parsing which occasionally (depending on parse order) mean that target restrictions aren't applied within sourceroots.

Additionally, we've recently needed to diverge from the public maven_layout in order to install additional targets into the source root definitions.

What would folks think about doing away with SourceRoot.register in BUILD files, and instead defining source roots as a set of patterns in pants.ini. The patterns would be matched against filenames to determine the absolute path to the source root. One simplifying restriction would be to ignore nested source roots.

Example:
  • for a pattern like 'src/main/java', we'd match any prefix of directories under the build root that contained `src/main/java`:
    • 'path/to/my/project/src/main/java'
    • 'src/main/java'
  • but, assuming there was also a 'src/main/scala' pattern, only the first pattern would be applied:
    • 'project/src/main/scala/src/main/java' -> 'project/src/main/scala'
----

It should be fairly straightforward implementation wise: scan all inferred source roots first, and if an existing root isn't found, find the earliest pattern match and add it to the set of inferred roots.

Thoughts?

--

John Sirois

unread,
Aug 1, 2014, 10:47:19 AM8/1/14
to Stu Hood, pants-devel
I like the idea of restricting source_root registration to exactly 1 point in the early phases of a command line invocation.

Inferring roots from patterns would be a nice bonus but its a bit slow pre-fs-watcher.
Non-scientific test for our "birdcage" maven-layout monorepo say 300ms for cold SSD (I rebooted):
>>> import os
>>> 
>>> def scan(root, patterns):
...   roots = []
...   for path, dirs, _ in os.walk(root):
...     for dir in dirs:
...       candidate = os.path.join(path, dir)
...       for pattern in patterns:
...         if candidate.endswith('src/main/scala'):
...           roots.append(candidate)
...           dirs.remove(dir)
...           break
...   return roots
... 
>>> import time
>>> 
>>> start = time.time()
>>> roots = scan(os.path.curdir, ['src/main/scala', 'src/main/java', 'src/main/python', 'src/test/scala', 'src/test/java', 'src/tests/python'])
>>> elapsed = time.time() - start
>>> print('scan of {count} roots took {time}s'.format(count=len(roots), time=elapsed))
scan of 827 roots took 0.382122039795s

300ms is a good deal to add to every pants run.b

I'd be happy though with an explicit list like we have today with the source_root calls, obviously in a new form.  I'd also be happy with a scheme to describe things like maven_layout, so maybe:
project1: maven
project2: maven
project3/sub1: maven
src/java: ['java_library', 'annotation_processor', ...]
...



Thoughts?

--



--
John Sirois
303-512-3301





Patrick Lawson

unread,
Aug 1, 2014, 11:27:15 AM8/1/14
to John Sirois, Stu Hood, pants-devel
I'm a hundred percent in favor of removing source root declarations from BUILD files.  That's another source of global state we can excise.  A while back we also discussed putting them either in pants.ini, another special INI, JSON, or YAML file, or in magically named config files throughout the source tree.

John Sirois

unread,
Aug 1, 2014, 11:34:54 AM8/1/14
to Patrick Lawson, Stu Hood, pants-devel
On Fri, Aug 1, 2014 at 9:27 AM, Patrick Lawson <p...@foursquare.com> wrote:
I'm a hundred percent in favor of removing source root declarations from BUILD files.  That's another source of global state we can excise.  A while back we also discussed putting them either in pants.ini, another special INI, JSON, or YAML file, or in magically named config files throughout the source tree.

I'm not in favor of config files throughout the source tree per fs walk considerations.  All the rest is reasonable.  A finite list of known config files would be best imo.



--
John Sirois
303-512-3301





Stu Hood

unread,
Aug 1, 2014, 12:29:26 PM8/1/14
to John Sirois, pants-devel
Inferring roots from patterns would be a nice bonus but its a bit slow pre-fs-watcher.
I might have been misunderstanding SourceRoots a bit... my understanding was that we didn't scan for source roots, but that we instead waited until a lookup for the source root of a 'Target' arrived, and then returned it. The inference I was talking about would happen lazily at sourceroot lookup time. Will double check.

John Sirois

unread,
Aug 1, 2014, 12:32:21 PM8/1/14
to Stu Hood, pants-devel
On Fri, Aug 1, 2014 at 10:29 AM, Stu Hood <stu...@twitter.com> wrote:
Inferring roots from patterns would be a nice bonus but its a bit slow pre-fs-watcher.
I might have been misunderstanding SourceRoots a bit... my understanding was that we didn't scan for source roots, but that we instead waited until a lookup for the source root of a 'Target' arrived, and then returned it. The inference I was talking about would happen lazily at sourceroot lookup time. Will double check.

But I think it would be good to kill the lazy as well.



--
John Sirois
303-512-3301





Stu Hood

unread,
Aug 1, 2014, 12:39:09 PM8/1/14
to John Sirois, pants-devel
Inferring roots from patterns would be a nice bonus but its a bit slow pre-fs-watcher.
I might have been misunderstanding SourceRoots a bit... my understanding was that we didn't scan for source roots, but that we instead waited until a lookup for the source root of a 'Target' arrived, and then returned it. The inference I was talking about would happen lazily at sourceroot lookup time. Will double check.

But I think it would be good to kill the lazy as well.
Ok... so we're actually talking about killing the SourceRoot singleton then? I agree, but it might be out of scope for this improvement... maybe as a second review? 

John Sirois

unread,
Aug 1, 2014, 12:39:37 PM8/1/14
to Stu Hood, pants-devel
On Fri, Aug 1, 2014 at 10:39 AM, Stu Hood <stu...@twitter.com> wrote:
Inferring roots from patterns would be a nice bonus but its a bit slow pre-fs-watcher.
I might have been misunderstanding SourceRoots a bit... my understanding was that we didn't scan for source roots, but that we instead waited until a lookup for the source root of a 'Target' arrived, and then returned it. The inference I was talking about would happen lazily at sourceroot lookup time. Will double check.

But I think it would be good to kill the lazy as well.
Ok... so we're actually talking about killing the SourceRoot singleton then? I agree, but it might be out of scope for this improvement... maybe as a second review? 

SGTM



--
John Sirois
303-512-3301





Eric Ayers

unread,
Aug 1, 2014, 1:51:18 PM8/1/14
to Stu Hood, pants-devel
+pants-devel

oops, accidentally replied only to Stu.


On Fri, Aug 1, 2014 at 1:29 PM, Eric Ayers <zun...@squareup.com> wrote:

On Fri, Aug 1, 2014 at 1:00 PM, Stu Hood <stu...@twitter.com> wrote:
Once you setup your BUILD files the only change to these patterns is going to be in conjunction with corresponding changes to BUILD files
I see the purpose of BUILD files to be precisely for storing targets, and nothing more:

Ok, I think I'm having a disconnect because that is not what I heard at the Pants summit.  I heard that we were moving away from that, that a target is only one kind of thing you can store in a BUILD file, and that is part of the reason why AddressMapper is being refactored right now.  From Patrick's design document:

"The proposal here is that we should generalize the notion of BUILD files to exposing Addressable objects, regardless of what the semantics of those objects are.  This separates out the notion of parsing BUILD files and mapping addresses from the notion of a Target DAG..."


Again, in this case, I'd argue that the configuration information about which paths contain targets is tightly coupled to the BUILD file contents and BUILD files don't exist independently of a directory layout.  You can't just move them around without breaking everything.

More reasons not to put them in pants.ini: I don't see any other places pants. ini where we reference target aliases like 'java_library', 'scala_library', ...   

This is a little bit off topic, but frankly, its very confusing to understand why something is being placed in a BUILD file vs. pants. ini.  Look at this from the default pants.ini:  

```
[protobuf-gen]
supportdir: bin/protobuf
version: 2.4.1
javadeps: ['3rdparty:protobuf-%(version)s']
```

IMHO, the javadeps specification is no different than just defining a BUILD target (but in some proprietary format).  Why not just depend on something well known in a BUILD file (BUILD.tools?)  

dependencies(name='protobuf-gen-deps'
  dependencies=[
    '3rdparty:protobuf-2.4.1',
  ],
]

Note that the latter option seems to be what was done for junit.

 
you can see from Patrick's reaction that he probably agrees. He wants to cache target definitions in memory via the fswatch stuff, but having SourceRoots changing in BUILD files in addition to targets would add another dimension of complexity to that.

Although they are used to manipulate filenames/Targets, SourceRoots are more closely related to repository layout. And deciding how your repository is laid out is not something you want to individual teams to be able to override. In some regards, this is a thing that maven gets very right: they chose a convention, and made it very difficult to use anything else. In this case, pants would be allowing the repo/build/test owner to set the repo layout however they choose, but closing the loophole that allows individual teams to fiddle with it.
 
and I can't see any reason you'd want to specify or override this on the command line.
Yea, I wasn't suggesting that it would be beneficial to override from the command line... just that Benjy's work will result in a large cleanup of pants.ini, as other things that have reasonable defaults move out of that file.


On Fri, Aug 1, 2014 at 9:46 AM, Eric Ayers <zun...@squareup.com> wrote:



On Fri, Aug 1, 2014 at 12:36 PM, Stu Hood <stu...@twitter.com> wrote:
(Did you mean to reply-all? can re-reply publicly if so)

What would folks think about doing away with SourceRoot.register in BUILD files, and instead defining source roots as a set of patterns in pants.ini.

Off the cuff, I'm not in favor of this so much.  Why pants.ini instead of a BUILD file?  I thought we were trying to get crap out of pants.ini.  I just went through upgrading pants in our repo and there was all kinds of cruft I had to manually merge in with my hand edited changes.
One of the strongest points of pants (in my mind), is that it encourages an org to centralize responsibility for things... including boilerplate. The fact that BUILD files all have to repeat themselves to call maven_layout (or for any other repetitive/obvious task) feels like a bug to me.

I'm not convinced that is a good reason to use pants.ini for this information.  The directory layout and target contents of this repo are tightly coupled to BUILD files. Once you setup your BUILD files the only change to these patterns is going to be in conjunction with corresponding changes to BUILD files, and I can't see any reason you'd want to specify or override this on the command line.
 

In this case, by centralizing accepted layouts, you'd also be helping keep your repo consistent. As it stands users could choose to skip using maven_layout and invent their own convention, which would make the codebase inconsistent and more difficult to grok. The fact that they don't know this is the only saving grace =)

Granted, the pants.ini contains a lot of stuff that could just be defaults... I expect Benjy's work on flags/options will shake it up quite a bit in the right direction.


On Fri, Aug 1, 2014 at 8:29 AM, Eric Ayers <zun...@squareup.com> wrote:



On Thu, Jul 31, 2014 at 8:46 PM, 'Stu Hood' via Pants Developers <pants...@googlegroups.com> wrote:
Source roots are currently a stateful part of build file parsing which occasionally (depending on parse order) mean that target restrictions aren't applied within sourceroots.

Additionally, we've recently needed to diverge from the public maven_layout in order to install additional targets into the source root definitions.

We are in the same boat here.
 
What would folks think about doing away with SourceRoot.register in BUILD files, and instead defining source roots as a set of patterns in pants.ini.

Off the cuff, I'm not in favor of this so much.  Why pants.ini instead of a BUILD file?  I thought we were trying to get crap out of pants.ini.  I just went through upgrading pants in our repo and there was all kinds of cruft I had to manually merge in with my hand edited changes.

 
The patterns would be matched against filenames to determine the absolute path to the source root. One simplifying restriction would be to ignore nested source roots.

Example:
  • for a pattern like 'src/main/java', we'd match any prefix of directories under the build root that contained `src/main/java`:
    • 'path/to/my/project/src/main/java'
    • 'src/main/java'
  • but, assuming there was also a 'src/main/scala' pattern, only the first pattern would be applied:
    • 'project/src/main/scala/src/main/java' -> 'project/src/main/scala'
----

It should be fairly straightforward implementation wise: scan all inferred source roots first, and if an existing root isn't found, find the earliest pattern match and add it to the set of inferred roots. 

Thoughts?

--

Patrick Lawson

unread,
Aug 1, 2014, 2:12:57 PM8/1/14
to Eric Ayers, Stu Hood, pants-devel
To be clear: the AddressMapper opens up the possibility of having arbitrary configuration stored in BUILD files, _including_ SourceRoots, if you so chose.  The big difference would be that you'd have to go through the AddressMapper to construct a list of your source roots, rather than consult a global singleton which is populated as a side effect of parsing BUILD files.  But you'd still be left with figuring out which BUILD files you should be parsing in order to extract those mapped objects.

I generally agree that source roots belong in their own configuration, outside of BUILD files.  At Foursquare we really don't have an issue with a huge bloat of source roots--we've gotten along fine for a long time with a single list of a dozen or so source roots in our root BUILD file, which would trivially transition over to a single root source root configuration file.  I do understand that others might have different scaling issues, though.

Regarding comment access: all of pants-devel has comment access.  Does that not implicitly include people who are members?  Eric also has full edit access.  I'll add you in as well, Stu--let me know if you need anyone else.

Benjy Weinberger

unread,
Aug 1, 2014, 2:15:49 PM8/1/14
to Patrick Lawson, John Sirois, Stu Hood, pants-devel
=all of you. 

We've discussed this in the past. It seems really unnecessary and fragile to have those in BUILD files.

Eric Ayers

unread,
Aug 1, 2014, 3:26:29 PM8/1/14
to Benjy Weinberger, Patrick Lawson, John Sirois, Stu Hood, pants-devel
Sorry Benjy, maybe I missed your previous conversations or maybe I was not paying attention.  What is fragile about putting something in a BUILD file?  Feel free to point me to a pants-devel message.

Like I said, it seems that there are at least a few inconsistencies in what goes in pants.ini vs. BUILD.  

Benjy Weinberger

unread,
Aug 1, 2014, 3:41:51 PM8/1/14
to Eric Ayers, Patrick Lawson, John Sirois, Stu Hood, pants-devel
The fragility is due to evaluation order of BUILD files. Some source_root declarations may or may not have been evaluated when trying to associate a target you're parsing with a source root.

Eric Ayers

unread,
Aug 5, 2014, 8:42:27 AM8/5/14
to Benjy Weinberger, Patrick Lawson, John Sirois, Stu Hood, pants-devel
Just to follow up, I've come around on this.  Although all of our source_root() declarations are in the root BUILD file in our repo, I started using maven_layout() for some tests I'm writing and I see what Benjy is talking about.  

Upon reflection, my main objection to putting things in pants.ini is around the unpleasantness of merging it with changes to the pants stock pants.ini when I upgrade pants.  I think that is an orthogonal issue.

Patrick Lawson

unread,
Aug 5, 2014, 10:19:29 AM8/5/14
to Eric Ayers, Benjy Weinberger, John Sirois, Stu Hood, pants-devel
I also dislike putting it in pants.ini.  It would mean that you'd have to know about the aliases of Targets whenever you registered these source roots, which means you need to parse pants.ini (for a variety of necessary config), then load your plugins, then parse your source roots based on a section in pants.ini.

I think source roots are important enough but also sufficiently different configuration that they should get their own config file.  Or they should just be registered using the plugin system.  The latter can be done right now without any changes whatsoever other than removing source_root from the exposed objects, and we can always shuffle it back into config if people dislike that approach.
Reply all
Reply to author
Forward
0 new messages