> On Mar 21, 2016, at 11:44 AM, smittal via Presto <presto...@googlegroups.com> wrote:
>
> So here is an update. The issue was indeed the network. The announcements were not going through because the network was saturated by tasks reading data from HDFS. We have a 1GE network and are working on getting 10GE. But in the meantime, is there a way to limit network usage per worker?
>
> We tried reducing "task.max-worker-threads" from 96 to 32 to limit it, and it seemed to work in most cases, but we saw a large query that ran for more than 30 minutes and ultimately caused this "No worker nodes available" error when a few more queries were issued.
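
A minimal sketch of the change described above. The property name and value come from the message; the file location (each worker's etc/config.properties) is an assumption:

```properties
# etc/config.properties on each worker (file location assumed)
# Fewer worker threads means fewer splits processed concurrently, which
# indirectly reduces HDFS read traffic. It is not a hard bandwidth cap.
task.max-worker-threads=32
```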
>
> Are there any other configs that might help in this case?
>
> Thanks
>
> On Wednesday, March 16, 2016 at 2:54:36 PM UTC-7, smi...@twitter.com wrote:
> Hi Kamil, we are still seeing the "No worker nodes available" error. Increasing cache-ttl didn't seem to help.
>
> On Tuesday, March 15, 2016 at 8:55:57 AM UTC-7, Sailesh Mittal wrote:
> I updated the coordinator yesterday evening and there hasn't been much traffic yet. I will update this thread this evening on how it went.
>
> On Tue, Mar 15, 2016 at 7:14 AM, Kamil Bajda-Pawlikowski <kba...@gmail.com> wrote:
> Do you see any improvement?
>
>
> On Monday, March 14, 2016 at 3:43:24 PM UTC-4, Sailesh Mittal wrote:
> @Kamil, the coordinator does not drop off the network. All HTTP requests have 2xx return codes, and the network graph does not show any strange behavior.
>
> @Rebecca, I didn't know that the TTL was just 1 second. I will try increasing it to 10 seconds to match the announcement frequency.
>
> On Mon, Mar 14, 2016 at 12:31 PM, Schlussel, Rebecca T <Rebecca....@teradata.com> wrote:
> I agree that regular periods where the worker nodes aren't registering themselves at the appropriate frequency seem like momentary network outages. There are some discovery server properties about how long the discovery server keeps cached node information:
https://github.com/airlift/discovery/blob/master/discovery-server/src/main/java/io/airlift/discovery/server/DiscoveryConfig.java. You could try setting discovery.store-cache-ttl to a higher value so that the coordinator can use the cached worker information for longer when assigning work. However, if the coordinator and worker nodes are still disconnected, you'll see an error when it actually tries to send the work.
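
As a rough illustration of the suggestion above, the cache TTL could be raised as follows. The 10s value is only an example, and putting the setting in the coordinator's etc/config.properties assumes the discovery server is embedded in the coordinator:

```properties
# etc/config.properties on the coordinator (assumes an embedded discovery server)
# How long the discovery server keeps serving cached node information.
discovery.store-cache-ttl=10s
```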
>