Hello fellow Briskers,
My team has been struggling with hostname-to-IP resolution in our Brisk cluster. (A single-node cluster seems to run fine.)
We have worked through various difficulties with listen_address, jobtrackers, and InetAddress.getLocalHost, and we are down to one major blocker in this area:
When we run a MapReduce job generated from a Hive query, the reduce phase fails. Specifically, we get a FileNotFoundException when the reducer on node-1 (the seed/jobtracker) attempts to fetch the mapOutput for a task that completed on node-2 (and is visible in the tasktracker web UI there).
The problem seems to be that node-1 looks for http://localhost.localdomain:50060/mapOutput?...
instead of http://node-2:50060. (Note that jobtracker assignment and map task distribution and execution seem to mostly complete OK; it is only fetching the map output that fails.)
I suspect that the root cause is that InetAddress.getLocalHost returns 127.0.0.1, which maps back to localhost.localdomain.
(My DataCenter team assures me this is the correct network interface configuration: that the IP address of $(hostname) should always be 127.0.0.1, even when $(hostname) is a public name like "node-2.cloud.redfin.com".)
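For reference, here is a minimal sketch of the check I have in mind for each node, to see what the JVM reports for the local host. The class name HostCheck and the helper resolvesToLoopback are my own, made up for illustration; only the java.net.InetAddress calls are standard. I am assuming (not certain) that the tasktracker derives its advertised name from something like this lookup.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical diagnostic class; not part of Brisk or Hadoop.
public class HostCheck {

    // Returns true if the given host name or IP literal resolves to a
    // loopback address (127.0.0.0/8 or ::1). An unresolvable name
    // returns false, which is a separate problem worth checking.
    static boolean resolvesToLoopback(String host) {
        try {
            return InetAddress.getByName(host).isLoopbackAddress();
        } catch (UnknownHostException e) {
            return false;
        }
    }

    public static void main(String[] args) throws UnknownHostException {
        // The same lookup I suspect is behind the bad mapOutput URL.
        InetAddress local = InetAddress.getLocalHost();

        System.out.println("hostname:  " + local.getHostName());
        System.out.println("address:   " + local.getHostAddress());
        System.out.println("canonical: " + local.getCanonicalHostName());

        if (local.isLoopbackAddress()) {
            System.out.println("WARNING: local hostname resolves to loopback; "
                    + "other nodes cannot reach this address.");
        }
    }
}
```

Compile with javac HostCheck.java and run with java HostCheck on each node. If the warning prints on node-2, that would be consistent with node-1 being handed a localhost.localdomain URL.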
Because of this, I have already set each node's listen_address in cassandra.yaml to an explicit IP such as 10.13.0.102, since leaving it blank or using localhost leads to 127.0.0.1 being stored in Cassandra as the jobtracker.
Could someone post a sample /etc/hosts file that shows a recommended DNS configuration for a Brisk node?
Or does anyone have other suggestions for fixing the hostnames exposed to the reducers?
Thank you,
-Mike Brauwerman