not able to read HDFS from remote machine

Bharani Ramu

Jan 9, 2018, 10:31:43 PM
to Google Cloud Dataproc Discussions
I am able to copy local files to HDFS from within the cluster. However, reading those HDFS files from a remote machine fails with the following stack trace:
[UNK_66009] File [hdfs://35.199.186.207/tmp/sample.csv] could not be read because of the following error: [org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-1033207192-10.138.0.2-1513236990327:blk_1073741825_1001 file=/tmp/sample.csv
        at org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:1022)
        at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:641)
        at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:920)
        at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:976)
        at java.io.DataInputStream.read(DataInputStream.java:149)

Dennis Huo

Jan 10, 2018, 5:21:38 PM
to Google Cloud Dataproc Discussions
How are you configuring access to the Dataproc cluster from your external machine? Note that the namenode reports datanode locations as internal IP addresses, which are only routable within the GCE network. In general you almost certainly don't want to configure access directly over external IP addresses, even with careful firewall rules permitting access only from your known external IPs (one reason being that datanode traffic is not encrypted).
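For illustration, this is the symptom in config terms: the HDFS client can sometimes be told to prefer datanode hostnames over the internal IPs the namenode hands back, via a real Hadoop client-side property. This only helps if those hostnames actually resolve and route from your machine, so it is a diagnostic sketch rather than a recommended fix:

```xml
<!-- Client-side hdfs-site.xml sketch (not a recommended production setup).
     Makes the DFS client connect to datanodes by hostname instead of the
     internal IP the namenode reports. Useful only if the cluster's
     hostnames resolve to reachable addresses from your machine. -->
<configuration>
  <property>
    <name>dfs.client.use.datanode.hostname</name>
    <value>true</value>
  </property>
</configuration>
```

If the hostnames don't resolve externally, reads will still fail with the same BlockMissingException, which is why the VPN or tunnel approaches below are preferable.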

A recommended way to configure HDFS access is to set up a Cloud VPN: https://cloud.google.com/vpn/docs/concepts/overview

You could also potentially configure a more lightweight SSH tunnel, something like https://rainerpeter.wordpress.com/2014/02/12/connect-to-hdfs-using-a-proxy/
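A minimal sketch of that tunnel approach, along the lines of the linked post (the cluster name, zone, and port here are examples, not values from this thread):

```shell
# Open a SOCKS proxy (port 1080) through the Dataproc master node via SSH.
# "my-cluster-m" and the zone are placeholders for your actual cluster.
gcloud compute ssh my-cluster-m --zone=us-west1-a -- -D 1080 -N &

# Point the Hadoop client at the SOCKS proxy using the standard
# socket-factory properties, then talk to HDFS by the master's hostname.
hadoop fs \
  -D hadoop.rpc.socket.factory.class.default=org.apache.hadoop.net.SocksSocketFactory \
  -D hadoop.socks.server=localhost:1080 \
  -ls hdfs://my-cluster-m/tmp/
```

Note that with this setup you address the namenode by its internal hostname through the tunnel, rather than by external IP, which sidesteps the internal-address problem described above.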

The best approach overall, however, is to avoid direct HDFS access where possible; have you tried putting those files directly into GCS instead of HDFS? You could then install the GCS connector on your remote machines as well and use it as an HDFS replacement: https://cloud.google.com/dataproc/docs/concepts/connectors/install-storage-connector
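Concretely, that workflow looks something like the following (the bucket name is a placeholder; the cluster side assumes the GCS connector, which Dataproc installs by default, and the remote side assumes you've installed it per the link above):

```shell
# On the cluster: copy the file out of HDFS into a GCS bucket.
# "my-bucket" is an example bucket name.
hadoop distcp hdfs:///tmp/sample.csv gs://my-bucket/tmp/sample.csv

# On the remote machine (with the GCS connector configured):
# read it directly via the gs:// scheme, no HDFS connectivity needed.
hadoop fs -cat gs://my-bucket/tmp/sample.csv
```

Since both machines then read through GCS, there is no dependency on datanode reachability at all.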