Starting on June 17, 2017, we have observed a general increase in batch jobs that run for more than 4 hours.
Some context: we run 6K - 10K batch jobs with 30M - 50M operations per day. Before June 17, on most days all batch jobs either completed or explicitly failed within 4 hours. Since June 17, 200 - 500 batch jobs each day stay in the processing state for multiple hours (we eventually stop polling).
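
To make "stop polling eventually" concrete, here is a minimal sketch of polling with a deadline. It is illustrative only and not our production code: `get_status`, the status strings, the 60-second interval, and the `TIMED_OUT` result are all assumptions made for the example; only the 4-hour window comes from the numbers above.

```python
import time
from typing import Callable

# Assumed names for the terminal states; the real service may use different ones.
TERMINAL_STATES = {"COMPLETED", "FAILED"}


def poll_until_done(
    get_status: Callable[[], str],   # hypothetical callable returning the job status
    timeout_s: float = 4 * 60 * 60,  # 4-hour window mentioned above
    interval_s: float = 60.0,        # assumed polling interval
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll a job until it reaches a terminal state or the deadline passes."""
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        if clock() >= deadline:
            # The job is still processing; give up and report it as stuck.
            return "TIMED_OUT"
        sleep(interval_s)
```

In this sketch, the jobs described above would be the ones for which the function returns `TIMED_OUT` rather than a terminal state.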
Please let me know if I can provide more details to solve this issue.