query-master process not releasing file descriptors

Rory Douglas

May 12, 2015, 8:50:30 PM
to hydr...@googlegroups.com
We recently experienced a production outage where all Hydra querying became unresponsive. We traced it to the query-master process running out of file descriptors.

While this was easy to resolve (we simply restarted query-master), we've noticed that file descriptor usage for that process is growing monotonically, at around 300 per hour. At this rate we'll need to restart the process periodically, something we've not previously had to do.
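A rough way to track this growth is sketched below in Python. It is Linux-only, since it reads `/proc`, and the function names are hypothetical. The descriptor limit is not stated in this thread, so it is passed in as a parameter; the real value is whatever `ulimit -n` reports for the process.

```python
import os


def count_open_fds(pid):
    """Count descriptors currently open in a process (Linux only, reads /proc)."""
    return len(os.listdir(f"/proc/{pid}/fd"))


def hours_until_exhaustion(limit, current, rate_per_hour):
    """Estimate hours until the descriptor limit is hit at a steady growth rate.

    limit and current are descriptor counts; rate_per_hour is the observed
    growth (around 300 per hour in the report above).
    """
    if rate_per_hour <= 0:
        return float("inf")  # usage is not growing, so no exhaustion expected
    remaining = max(limit - current, 0)
    return remaining / rate_per_hour
```

Sampling `count_open_fds` at intervals and feeding two readings into `hours_until_exhaustion` gives an idea of how often a restart would be needed until the root cause is fixed.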

Running lsof on the process shows that the vast majority of descriptors are for now-deleted temporary files, e.g.

java 22440 hydra *844uW REG 202,1 0 1400858 /home/hydra/hydra/log/query/tmp/2102d3c2-0aaa-4015-9aca-9b1b3c638bd1/mff.lock (deleted)
java 22440 hydra *845uW REG 202,1 0 1400859 /home/hydra/hydra/log/query/tmp/2102d3c2-0aaa-4015-9aca-9b1b3c638bd1/mfs.lock (deleted)
java 22440 hydra *846w REG 202,1 269268 1400860 /home/hydra/hydra/log/query/tmp/2102d3c2-0aaa-4015-9aca-9b1b3c638bd1/out-00000001 (deleted)
java 22440 hydra *847r REG 202,1 269268 1400860 /home/hydra/hydra/log/query/tmp/2102d3c2-0aaa-4015-9aca-9b1b3c638bd1/out-00000001 (deleted)
java 22440 hydra *848r REG 202,1 269268 1400860 /home/hydra/hydra/log/query/tmp/2102d3c2-0aaa-4015-9aca-9b1b3c638bd1/out-00000001 (deleted)
java 22440 hydra *849r REG 202,1 269268 1400860 /home/hydra/hydra/log/query/tmp/2102d3c2-0aaa-4015-9aca-9b1b3c638bd1/out-00000001 (deleted)

That pattern of 2 lock files and 4 descriptors (1 write, 3 read) for the same out* file is repeated over and over for different UUIDs.
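To check how widespread the pattern is, the lsof output can be grouped by temporary directory. This is a minimal sketch in Python, assuming lines in the format shown above and paths without spaces; the function name is hypothetical.

```python
import re
from collections import Counter

# Matches the trailing NAME column of an lsof line for a deleted file.
DELETED = re.compile(r"(\S+) \(deleted\)$")


def count_deleted_by_dir(lsof_lines):
    """Count descriptors held on deleted files, grouped by parent directory."""
    counts = Counter()
    for line in lsof_lines:
        match = DELETED.search(line.strip())
        if match:
            # Drop the file name to leave the per-query temp directory.
            parent = match.group(1).rsplit("/", 1)[0]
            counts[parent] += 1
    return counts
```

With the pattern described above, each leaked directory would be expected to show a count of 6 (2 lock files plus 4 descriptors on the out file).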

The only infrastructure change we've recently made to Hydra is to reduce the gold backup lifetime to 20 minutes (from 90 minutes). That doesn't seem like it should have affected file descriptor usage on the query master, however.

We're not really sure where we should be looking for a culprit here. Memory and disk are not under pressure on the query master node. We haven't drastically changed our query patterns or the overall data volume. Any ideas what we can look into to get to the root cause?

Ian Barfield

May 18, 2015, 9:11:41 AM
to Rory Douglas, hydr...@googlegroups.com
This was an old bug in either the query op pipeline or muxy, where things were not closed and properly cleaned up in some (maybe even many?) cases. Those cases would probably involve only queries using "sort". Upgrading to a newer version should fix it, but I don't know exactly which one. You might be able to dig through the commits to find out.
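The general failure mode, a file that is deleted while a handle to it is still open, can be reproduced in miniature. The sketch below is plain Python on a POSIX system and is not Hydra's or muxy's actual code; the function name is hypothetical and the file name is borrowed from the lsof output only for illustration.

```python
import os
import tempfile


def deleted_but_open_demo():
    """Delete a temp file while it is still open and report what survives."""
    tmp_dir = tempfile.mkdtemp()
    path = os.path.join(tmp_dir, "out-00000001")
    handle = open(path, "w")
    try:
        os.unlink(path)  # the name is gone, but the descriptor is still held
        name_exists = os.path.exists(path)
        handle.write("still writable")  # writes still succeed on the open handle
        handle.flush()
        handle_open = not handle.closed
    finally:
        handle.close()  # only this releases the descriptor
        os.rmdir(tmp_dir)
    return name_exists, handle_open
```

Until `close()` runs, lsof would list such a file as `(deleted)`. Closing in a `finally` block (or a `with` block in Python, try-with-resources in Java) is the usual way to make sure the descriptor is released on every code path.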

