I'd also recommend looking at the jobtracker (JT) logs, as Ken suggested;
there's a chance you'll see a huge job.xml there.
From the logs it looks like your job is taking 15 minutes to submit.
I've seen this happen when the number of input files is large: GlobHfs
cannot submit the job. ("Large" depends on the Hadoop hardware config
and how much memory it has; the 7k files in your case may be enough.)
The reason is that GlobHfs hammers the namenode with thousands of
requests to expand the glob into each *individual* path when building
the list of inputs, which takes a long time. The input list then
contains all of these paths, hence the large job.xml, and the
jobtracker runs out of memory on it.
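To get a feel for the size difference, here's a hypothetical plain-Python sketch (not PyCascading or Hadoop code, just mimicking the effect) that contrasts shipping the expanded file list with shipping the glob pattern itself:

```python
import glob
import os
import tempfile

# Create ~7k empty "part" files, roughly matching the case above.
tmp = tempfile.mkdtemp()
for i in range(7000):
    open(os.path.join(tmp, 'part-%05d' % i), 'w').close()

pattern = os.path.join(tmp, 'part-*')

# GlobHfs-style: expand the glob client-side into every individual path,
# all of which end up in the job config (the input dirs in job.xml).
expanded = ','.join(sorted(glob.glob(pattern)))

# Hfs-style: the config only carries the short pattern and Hadoop
# expands it server-side at input-split time.
print('pattern: %d bytes, expanded list: %d bytes'
      % (len(pattern), len(expanded)))
```

With 7k files the expanded list is hundreds of kilobytes while the pattern stays a few dozen bytes, which is why job.xml balloons in the first case.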
A solution that has worked for me is to use Hfs with globs. Hadoop
should be able to handle globified input paths, and PyCascading sets
mapred.input.dir.recursive to true in the job config, so you can
specify your source tap like Hfs(TextDelimited(), 'path/to/bigones/
2011/*'). A caveat: make sure the path part doesn't contain the glob
characters {}. If it does, although that's completely legal from
Hadoop's point of view, Hfs won't pass the right glob on to Hadoop.
(To explain: Hfs converts the path to a URI first, and {} become
percent-encoded.) I believe Chris fixed this in Cascading 2, but it
may still exist in 1.2. Alternatively, you can safely use HfsDirect
in PyCascading, which subclasses Hfs to fix this.
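To illustrate the caveat, here's a rough Python analogue of that path-to-URI conversion (Cascading does this in Java, so the exact encoding rules differ, but the effect on {} is the same):

```python
from urllib.parse import quote

# Hypothetical path using {} alternation, which is legal as a Hadoop glob.
path = 'path/to/bigones/2011-{01,02}/*'

# A naive path->URI conversion percent-encodes the braces, so Hadoop
# receives %7B...%7D instead of a glob and the pattern matches nothing.
uri_path = quote(path, safe='/*')
print(uri_path)
```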
Btw, as a side note, you can see which MR steps failed to run in the
logs or the JT: a name like each/376:ad_network_perf... means that the
step comes from an Each at line 376 of your ad_net... Python file.
Gabor