Rory Douglas
May 12, 2015, 8:50:30 PM
to hydr...@googlegroups.com
We recently experienced a production outage where all Hydra querying became unresponsive. We traced it to the query-master process running out of file descriptors.
While the outage was easy to resolve (we simply restarted query-master), we've since noticed that file descriptor usage for that process grows monotonically, at around 300 descriptors/hour. At this rate we'll need to restart the process periodically, something we've not had to do before.
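We've been tracking that growth externally with lsof; for completeness, here's a minimal sketch of how the same count could be sampled from inside the JVM (assumes a HotSpot/OpenJDK runtime where com.sun.management.UnixOperatingSystemMXBean is available; this is our own monitoring idea, not Hydra code):

import java.lang.management.ManagementFactory;
import com.sun.management.UnixOperatingSystemMXBean;

public class FdSampler {
    public static void main(String[] args) throws InterruptedException {
        // HotSpot-specific MXBean that reports the process's own descriptor usage
        UnixOperatingSystemMXBean os =
                (UnixOperatingSystemMXBean) ManagementFactory.getOperatingSystemMXBean();
        while (true) {
            System.out.printf("open fds: %d (max %d)%n",
                    os.getOpenFileDescriptorCount(),
                    os.getMaxFileDescriptorCount());
            Thread.sleep(60_000L); // one sample per minute; ~5/minute matches the ~300/hour we see
        }
    }
}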
Running lsof on the process shows that the vast majority of descriptors are for now-deleted temporary files, e.g.:
java 22440 hydra *844uW REG 202,1 0 1400858 /home/hydra/hydra/log/query/tmp/2102d3c2-0aaa-4015-9aca-9b1b3c638bd1/mff.lock (deleted)
java 22440 hydra *845uW REG 202,1 0 1400859 /home/hydra/hydra/log/query/tmp/2102d3c2-0aaa-4015-9aca-9b1b3c638bd1/mfs.lock (deleted)
java 22440 hydra *846w REG 202,1 269268 1400860 /home/hydra/hydra/log/query/tmp/2102d3c2-0aaa-4015-9aca-9b1b3c638bd1/out-00000001 (deleted)
java 22440 hydra *847r REG 202,1 269268 1400860 /home/hydra/hydra/log/query/tmp/2102d3c2-0aaa-4015-9aca-9b1b3c638bd1/out-00000001 (deleted)
java 22440 hydra *848r REG 202,1 269268 1400860 /home/hydra/hydra/log/query/tmp/2102d3c2-0aaa-4015-9aca-9b1b3c638bd1/out-00000001 (deleted)
java 22440 hydra *849r REG 202,1 269268 1400860 /home/hydra/hydra/log/query/tmp/2102d3c2-0aaa-4015-9aca-9b1b3c638bd1/out-00000001 (deleted)
That pattern of two lock files plus four descriptors (one write, three read) on the same out-* file repeats over and over for different UUIDs.
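If it helps frame the question: the "(deleted)" entries mean the files have been unlinked while query-master still holds streams open on them, so the descriptors are never released. A generic Java illustration of that behaviour (hypothetical temp file, not Hydra code):

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

public class DeletedFdDemo {
    public static void main(String[] args) throws IOException, InterruptedException {
        File tmp = File.createTempFile("out-", ".tmp"); // stand-in for an out-00000001 file
        FileInputStream in = new FileInputStream(tmp);   // descriptor allocated here
        tmp.delete();                                     // file disappears from the directory...
        Thread.sleep(60_000L);                            // ...but lsof still shows the fd as (deleted)
        in.close();                                       // only now is the descriptor released
    }
}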
The only recent infrastructure change we've made to Hydra is reducing the gold backup lifetime from 90 minutes to 20 minutes; however, that doesn't seem like it should affect file descriptor usage on the query master.
We're not really sure where to look for a culprit. Memory and disk are not under pressure on the query-master node, and we haven't drastically changed our query patterns or the overall data volume. Any ideas what we can look into to get to the root cause?
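In the meantime, here's the throwaway check we can keep running to watch the leak without parsing lsof output: it reads the /proc/<pid>/fd symlinks directly and counts how many point at deleted files under the query temp directory (Linux-specific; the PID and path are just the ones from the output above):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class DeletedQueryFdCount {
    public static void main(String[] args) throws IOException {
        // PID of the query-master JVM (22440 in the lsof output above)
        String pid = args.length > 0 ? args[0] : "22440";
        try (Stream<Path> fds = Files.list(Paths.get("/proc", pid, "fd"))) {
            long deleted = fds.filter(fd -> {
                try {
                    // /proc/<pid>/fd entries are symlinks; deleted targets end in " (deleted)"
                    String target = Files.readSymbolicLink(fd).toString();
                    return target.contains("/log/query/tmp/") && target.endsWith(" (deleted)");
                } catch (IOException e) {
                    return false; // descriptor closed between list() and readlink()
                }
            }).count();
            System.out.println("descriptors on deleted query temp files: " + deleted);
        }
    }
}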