Generated data files


Dan Fabulich

Mar 4, 2016, 6:38:04 PM
to bazel-discuss
I have a binary that requires a significant number of generated data files generated from one source file. How would I declare this in Bazel?

Here are some options I've considered:

1) I could zip the runfiles up during the build and declare only that zip as a data file. At runtime, I could run a wrapper script to unzip them into a temp directory before running the real binary. But it's a lot of files; launching the binary will be unacceptably slow.

2) The build is hermetic, so the output files are technically predictable; I could run the generator once, enumerate the files, and declare them all explicitly as outputs in my BUILD file. But then the BUILD file would just duplicate the data from the source file; I'd really prefer not to do that.

3) I tried generating the data files in a directory and declaring just the directory as an output, but I got a scary warning that "dependency checking of directories is unsound."

What should I do?

P.S. I've seen the documentation say "dependency checking of directories is unsound," but I don't get it. Verifying the integrity of a zip file containing a tree of files is just as sound as verifying the integrity of the files unextracted. It's admittedly slower to stat/checksum a bunch of files than it is to stat/checksum a single zip file, but it's equally sound, right?

Dan Fabulich

Mar 8, 2016, 3:49:00 PM
to bazel-discuss
I guess this is impossible. I've filed https://github.com/bazelbuild/bazel/issues/1025 on this.

Alex Humesky

Mar 8, 2016, 5:33:15 PM
to Dan Fabulich, bazel-discuss
It would help if you could describe in more detail what you're trying to accomplish; it's not quite clear to me. It sounds like the usual way to do this is to use a genrule to invoke the tool on the source file and zip up the outputs to pass the files down the build. As Austin Schuh said, you can also just use tar to avoid the time to compress and decompress.

For 3, it sounds like you were using the directory as an input to a genrule (I believe that's the only place that warning can be generated from). Using a glob() instead for the srcs might work.

The reason that directory inputs are unsound in bazel is that bazel won't notice changes in those directories (that's what the purpose of glob() is). Specifically, verifying that a zip file was output from the genrule is easy: the genrule will either create it or not. Bazel can't verify the directory because it doesn't know what should be in there. The underlying idea at work here is that bazel wants to know the exact inputs and outputs of each action, and if the action (i.e. the genrule) outputs a directory, bazel can't know if the action generated what it should have, and furthermore can't know if downstream actions will get what they need if they take in a directory. We're working on relaxing this, but it will be a fair while before this is working (especially at the level of a genrule -- it might go to skylark first)

--
You received this message because you are subscribed to the Google Groups "bazel-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to bazel-discus...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/bazel-discuss/1c829ba8-5655-4090-bd09-60e59e86b3b0%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Dan Fabulich

Mar 8, 2016, 7:03:12 PM
to bazel-discuss, danfa...@gmail.com
For the record, I'm building Node.js modules.

node scripts look in the current working directory for a "node_modules" folder, which contains a tree of javascript files (and other data files).

It's not uncommon for node_modules folders to get pretty big; tens of thousands or even hundreds of thousands of loose files.

For a reasonably quick example, you could "brew install node", make a temp directory and "npm install babel-core". You'll then have a node_modules folder containing roughly 4,000 files. Our runtime directory has tens of thousands of files.

It is absolutely possible to tar those files up (without compressing/decompressing them). But then, each time I want to run my script, I'd have to untar 40,000 files. It's inherently slow to write that many files at launch. It would be much faster to just symlink to the directory where the files were already generated.

On Tuesday, March 8, 2016 at 2:33:15 PM UTC-8, Alex Humesky wrote:
The reason that directory inputs are unsound in bazel is that bazel won't notice changes in those directories (that's what the purpose of glob() is). Specifically, verifying that a zip file was output from the genrule is easy: the genrule will either create it or not. Bazel can't verify the directory because it doesn't know what should be in there. The underlying idea at work here is that bazel wants to know the exact inputs and outputs of each action, and if the action (i.e. the genrule) outputs a directory, bazel can't know if the action generated what it should have, and furthermore can't know if downstream actions will get what they need if they take in a directory.

With all due respect, I disagree with this paragraph strongly, and so should you.

Zip files are just compressed directories. Everything you say you "can't know" about a directory, you can't know about a zip file either, and everything you can know about a zip file, you can know about a directory, too: you can know that they exist, but you can't verify correctness, because you don't know what the contents should be. You can checksum a zip, and you can checksum a directory.

You can literally rewrite your paragraph in the reverse and it's equally true:
 
Verifying that a directory was output from the genrule is easy: the genrule will either create it or not. Bazel can't verify a zip file because it doesn't know what should be in there. The underlying idea at work here is that bazel wants to know the exact inputs and outputs of each action, and if the action (i.e. the genrule) outputs a zip file, bazel can't know if the action generated what it should have, and furthermore can't know if downstream actions will get what they need if they take in a zip file.


Alex Humesky

Mar 9, 2016, 8:35:09 AM
to Dan Fabulich, bazel-discuss
On Tue, Mar 8, 2016 at 7:03 PM Dan Fabulich <danfa...@gmail.com> wrote:
On Tuesday, March 8, 2016 at 2:33:15 PM UTC-8, Alex Humesky wrote:
The reason that directory inputs are unsound in bazel is that bazel won't notice changes in those directories (that's what the purpose of glob() is). Specifically, verifying that a zip file was output from the genrule is easy: the genrule will either create it or not. Bazel can't verify the directory because it doesn't know what should be in there. The underlying idea at work here is that bazel wants to know the exact inputs and outputs of each action, and if the action (i.e. the genrule) outputs a directory, bazel can't know if the action generated what it should have, and furthermore can't know if downstream actions will get what they need if they take in a directory.

With all due respect, I disagree with this paragraph strongly, and so should you.

Well, I was only describing how bazel works today. Like I said, we're working on changing things so that you can have an action generate a directory of files ("unpredictable action outputs"), but this will take a while (I unfortunately don't have a timeline). If you have a directory of input files, you can use a glob() to glob up the files in a directory (but that still results in a list of files as opposed to a directory).

If you have something like this:

$ tree
.
├── BUILD
├── in
│   ├── a
│   └── b
└── WORKSPACE
$ cat BUILD
genrule(
  name = "gen",
  srcs = ["in"],
  outs = ["out.txt"],
  cmd = "cat in/a >> $@; cat in/b >> $@")
$ cat in/a
input a
$ cat in/b
input b
$ bazel build :out.txt
INFO: Found 1 target...
WARNING: /tmp/test/BUILD:1:1: input 'in' to //:gen is a directory; dependency checking of directories is unsound.
Target //:out.txt up-to-date:
  bazel-genfiles/out.txt
INFO: Elapsed time: 1.879s, Critical Path: 0.08s
$ cat bazel-genfiles/out.txt
input a
input b

and you change in/a and do another build, bazel won't check whether in/a has changed, so it will do nothing to rebuild out.txt, because in/a wasn't declared as an input to the genrule. This is what that warning means by unsound.
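
The staleness problem described above can be sketched with a toy model. This is not Bazel's implementation; it only assumes that a build tool fingerprints exactly the paths it was told about, and that a directory path contributes its name but not its contents. The function names are hypothetical.

```python
import hashlib
import os
import tempfile


def fingerprint(declared_inputs):
    """Hash the declared inputs as this toy model sees them.

    A regular file contributes its bytes. A directory contributes only its
    path (an assumption of this sketch), so edits inside it go unnoticed.
    """
    digest = hashlib.sha256()
    for path in sorted(declared_inputs):
        digest.update(path.encode("utf-8") + b"\0")
        if os.path.isfile(path):
            with open(path, "rb") as f:
                digest.update(f.read())
    return digest.hexdigest()


def needs_rebuild(declared_inputs, last_fingerprint):
    """Report whether the declared inputs differ from the recorded fingerprint."""
    return fingerprint(declared_inputs) != last_fingerprint


def demo():
    with tempfile.TemporaryDirectory() as root:
        in_dir = os.path.join(root, "in")
        os.mkdir(in_dir)
        for name, text in (("a", "input a\n"), ("b", "input b\n")):
            with open(os.path.join(in_dir, name), "w") as f:
                f.write(text)

        as_directory = [in_dir]  # analogous to srcs = ["in"]
        # Analogous to enumerating the files, which is what a glob would do.
        as_files = [os.path.join(in_dir, n) for n in sorted(os.listdir(in_dir))]
        dir_fingerprint = fingerprint(as_directory)
        files_fingerprint = fingerprint(as_files)

        # Edit in/a after the "first build".
        with open(os.path.join(in_dir, "a"), "w") as f:
            f.write("changed a\n")

        return (
            needs_rebuild(as_directory, dir_fingerprint),
            needs_rebuild(as_files, files_fingerprint),
        )
```

In this model the directory-only declaration reports nothing to rebuild after the edit, while the enumerated file list does, which is the gap the warning points at.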

No other rule as far as I know allows directories in its srcs, so to be honest, this is probably some legacy use case that we neglected to clean up.


Zip files are just compressed directories. Everything you say you "can't know" about a directory, you can't know about a zip file either, and everything you can know about a zip file, you can know about a directory, too: you can know that they exist, but you can't verify correctness, because you don't know what the contents should be. You can checksum a zip, and you can checksum a directory. 

You can literally rewrite your paragraph in the reverse and it's equally true:
 
Verifying that a directory was output from the genrule is easy: the genrule will either create it or not. Bazel can't verify a zip file because it doesn't know what should be in there. The underlying idea at work here is that bazel wants to know the exact inputs and outputs of each action, and if the action (i.e. the genrule) outputs a zip file, bazel can't know if the action generated what it should have, and furthermore can't know if downstream actions will get what they need if they take in a zip file.

Put another way, bazel tracks only files, so while zip files might contain directories, they are still actually files. If you had something like this:
genrule(
  name = "gen",
  outs = ["out"],
  cmd = "mkdir out; echo 'hello a' > out/a; echo 'hello b' > out/b")
bazel will look for the file "out", and complain that it can't be found. Note that by "verify" I only meant checking whether the file exists. Bazel doesn't actually look inside the file or zip to know if it's correct; that's the responsibility of the action (e.g. if you have an invalid jar file, bazel is going to happily pass that to java and java will fail).
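
A rough sketch of the "does the declared output exist as a file" check follows. It assumes the check amounts to testing for a regular file at the declared path; the helper name `missing_outputs` is hypothetical and is not a Bazel API.

```python
import os
import tempfile
import zipfile


def missing_outputs(exec_root, declared_outs):
    """Return the declared outputs that are not regular files under exec_root."""
    return [
        out for out in declared_outs
        if not os.path.isfile(os.path.join(exec_root, out))
    ]


def demo():
    with tempfile.TemporaryDirectory() as exec_root:
        # Mirrors the genrule above: the command creates a directory named "out".
        os.mkdir(os.path.join(exec_root, "out"))
        for name in ("a", "b"):
            with open(os.path.join(exec_root, "out", name), "w") as f:
                f.write(f"hello {name}\n")
        directory_result = missing_outputs(exec_root, ["out"])

        # Packing the same tree into one archive yields a single regular file.
        with zipfile.ZipFile(os.path.join(exec_root, "out.zip"), "w") as archive:
            for name in ("a", "b"):
                archive.write(os.path.join(exec_root, "out", name), f"out/{name}")
        archive_result = missing_outputs(exec_root, ["out.zip"])
    return directory_result, archive_result
```

Under that assumption the directory "out" is reported as missing even though it exists on disk, while "out.zip" passes. Neither check says anything about whether the contents are correct.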




Dan Fabulich

Mar 18, 2016, 7:56:32 PM
to bazel-discuss, danfa...@gmail.com
I was thinking briefly today about trying to work around this issue.

If I want to generate a directory full of files, would it work for me to generate a manifest file in addition to the directory? As long as I'm careful to update the manifest file every time I modify something in the output directory, and as long as downstream rules declare both the directory and the manifest as an input, the build should remain sound, right?
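
One way to picture the manifest idea is the sketch below. It assumes the manifest is a text file kept outside the output directory, with one line per file giving a SHA-256 hash and a relative path. The format and the function names are illustrative, not anything Bazel defines.

```python
import hashlib
import os
import tempfile


def build_manifest(root):
    """Return manifest text: one 'sha256  relative/path' line per file, sorted by path."""
    entries = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            rel = os.path.relpath(path, root).replace(os.sep, "/")
            with open(path, "rb") as f:
                entries.append((rel, hashlib.sha256(f.read()).hexdigest()))
    entries.sort()  # stable order, so identical trees give identical manifests
    return "".join(f"{file_hash}  {rel}\n" for rel, file_hash in entries)


def write_manifest(root, manifest_path):
    """Regenerate the manifest; call this after every change to the directory."""
    with open(manifest_path, "w", encoding="utf-8") as f:
        f.write(build_manifest(root))


def manifest_is_current(root, manifest_path):
    """Check that the manifest still describes the directory exactly."""
    with open(manifest_path, encoding="utf-8") as f:
        return f.read() == build_manifest(root)


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        out_dir = os.path.join(tmp, "out")
        os.mkdir(out_dir)
        with open(os.path.join(out_dir, "a"), "w") as f:
            f.write("hello a\n")
        manifest = os.path.join(tmp, "out.manifest")
        write_manifest(out_dir, manifest)
        fresh = manifest_is_current(out_dir, manifest)

        # Adding a file without regenerating the manifest makes it stale.
        with open(os.path.join(out_dir, "b"), "w") as f:
            f.write("hello b\n")
        stale = manifest_is_current(out_dir, manifest)
    return fresh, stale
```

Because the manifest is a single ordinary file whose bytes change whenever any file in the tree changes, a tool that tracks only files could notice the change through the manifest. The sketch ignores symlinks, empty directories, and file permissions.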

Alex Humesky

Mar 21, 2016, 1:20:53 PM
to Dan Fabulich, bazel-discuss
This is actually part of the solution we're pursuing for unpredictable action outputs. Unfortunately, most rules still won't accept a directory in their srcs attribute (genrule might be the only exception). It's also not clear yet how we would apply unpredictable action outputs to genrules.

Alex Humesky

Mar 21, 2016, 1:26:14 PM
to Dan Fabulich, bazel-discuss
I know it's annoying, and I don't like it either, but you might still consider the tarring approach. Depending on the machine, even if you have tens of thousands of small files, it might still take less than a second to tar them up.
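
For anyone who wants to measure this on their own machine, here is a minimal timing sketch. It builds a synthetic tree of small files, which only stands in for a real node_modules folder, then times packing it into an uncompressed tar and unpacking it again. The file layout and names are made up, and the results will vary with disk and filesystem.

```python
import os
import tarfile
import tempfile
import time


def make_tree(root, file_count):
    """Create file_count small files spread over subdirectories of root."""
    for i in range(file_count):
        subdir = os.path.join(root, f"pkg{i % 50}")
        os.makedirs(subdir, exist_ok=True)
        with open(os.path.join(subdir, f"file{i}.js"), "w") as f:
            f.write(f"module.exports = {i};\n")


def pack(src_dir, tar_path):
    """Write src_dir into an uncompressed tar; return elapsed seconds."""
    start = time.perf_counter()
    with tarfile.open(tar_path, "w") as archive:  # mode "w" applies no compression
        archive.add(src_dir, arcname="node_modules")
    return time.perf_counter() - start


def unpack(tar_path, dest_dir):
    """Extract the tar into dest_dir; return elapsed seconds."""
    start = time.perf_counter()
    with tarfile.open(tar_path, "r") as archive:
        archive.extractall(dest_dir)
    return time.perf_counter() - start


def count_files(root):
    return sum(len(files) for _, _, files in os.walk(root))


def demo(file_count=2000):
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "src")
        os.mkdir(src)
        make_tree(src, file_count)
        tar_path = os.path.join(tmp, "node_modules.tar")
        pack_seconds = pack(src, tar_path)
        unpack_seconds = unpack(tar_path, os.path.join(tmp, "run"))
        return count_files(os.path.join(tmp, "run")), pack_seconds, unpack_seconds
```

Calling `demo(40000)` and printing the two timings gives a rough sense of the per-launch cost discussed earlier in the thread. Unpacking is the number that matters for launch time, since it would happen on every run.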