Transferring files to and from HDFS on a Hadoop Cluster


wvau...@vols.utk.edu

Mar 13, 2018, 10:21:43 AM
to cloudlab-users
I was wondering if anyone would be able to help me figure out how to transfer files to and from HDFS so that I can use my data files in an experiment.  I have spent a lot of time trying to figure this out, but I am simply not very familiar with the environment.  For my experiment, I am using the profile “hadoop” by gary, which is a Hadoop cluster with 3 slave nodes, 1 name node, and 1 resource manager node.

I am able to log in to the resource manager node and to copy my files over to my user directory on the resource manager node.  However, I cannot figure out how to get my files into HDFS so that I can run my experiments with my data.  When I try to use “hadoop fs -copyFromLocal”, it says that I do not have permission to access HDFS.

The general form of the command that I am trying to use to run my experiment is below:

/usr/local/hadoop-2.7.3/bin/hadoop jar /usr/local/hadoop-2.7.3/share/hadoop/tools/lib/hadoop-streaming-2.7.3.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input PATH_TO_INPUT_DATA -output PATH_TO_OUTPUT 
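
(As a side note: since streaming runs the scripts as commands, mapper.py and reducer.py need to be executable and start with a Python shebang; the quick check I do for that is roughly:

chmod +x mapper.py reducer.py        # streaming invokes them directly, so they must be executable
head -1 mapper.py                    # first line should be a shebang, e.g. #!/usr/bin/env python
)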

If anyone could help me figure out how to transfer my files to HDFS or even reference them locally using the command above, I would be grateful.  Thank you for your time, and I look forward to any feedback!

Jeff Ballard

Mar 13, 2018, 12:57:50 PM
to cloudlab-users
In general, this is what I do to load files into HDFS using the exact same profile you are using:

# first, make my directory (**NOTE** you'll need to change the username)
sudo /usr/local/hadoop/bin/hdfs dfs -mkdir /user/ballard
sudo /usr/local/hadoop/bin/hdfs dfs -chown ballard /user/ballard
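# (optional sanity check: the new directory should now show up, owned by your user)
sudo /usr/local/hadoop/bin/hdfs dfs -ls /user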

# second, EVERY TIME I make a shell I issue the following 5 commands:
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:/usr/local/hadoop/bin:/usr/local/spark/bin
. /usr/local/hadoop/etc/hadoop/hadoop-env.sh
. /usr/local/hadoop/etc/hadoop/yarn-env.sh
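
# (optional: if retyping those gets old, appending the same lines to ~/.bashrc on the
#  node should make new shells pick them up automatically)
cat >> ~/.bashrc <<'EOF'
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:/usr/local/hadoop/bin:/usr/local/spark/bin
. /usr/local/hadoop/etc/hadoop/hadoop-env.sh
. /usr/local/hadoop/etc/hadoop/yarn-env.sh
EOF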

# then, after that load some files
hdfs dfs -copyFromLocal some_filename_on_my_local_directory filename_in_hdfs

By default the file will be put in your HDFS home directory created above (mine is /user/ballard).
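
You can double-check that the file landed with "hdfs dfs -ls" (relative paths in HDFS resolve under /user/<your username>), and then point the streaming job's -input at either the bare filename or the full path, e.g. in my case:

hdfs dfs -ls
hdfs dfs -ls /user/ballard/filename_in_hdfs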

Hope this helps,

-Jeff

wvau...@vols.utk.edu

Mar 13, 2018, 2:06:13 PM
to cloudlab-users
Hey Jeff, 

Thank you so much for your help!  I was able to transfer my files over and queue the MapReduce tasks.  

Out of curiosity, is there a certain node that the MapReduce task needs to be run on?  I tried running it from the resource manager, but I get an exception saying that the resource manager node can't connect to the slave nodes.  I'm currently trying it on the other nodes to see what works.

- Ty

Jeff Ballard

Mar 13, 2018, 2:18:02 PM
to cloudlab-users
In general I've submitted everything from the resource manager.  Make sure you have the environment set up correctly (the 5 commands I posted before).
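
One quick thing worth checking from the resource manager (with that environment loaded) is whether the workers actually registered with YARN:

yarn node -list

If the slave nodes don't show up there, the job has nowhere to run, which would point back at the environment or at connectivity between the nodes.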

-Jeff

wvau...@vols.utk.edu

Mar 13, 2018, 4:46:04 PM
to cloudlab-users
Thank you again for the response.  I ran those five commands and moved my files over to HDFS, and then ran my MapReduce task with:

/usr/local/hadoop-2.7.3/bin/hadoop jar /usr/local/hadoop-2.7.3/share/hadoop/tools/lib/hadoop-streaming-2.7.3.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input DATA_FILES -output OUTPUT

However, I still get an error along the lines of "No Route to Host from resourcemanager to slavenode failed on socket timeout exception: java.net.NoRouteToHostException: No route to host".  I'm still trying to figure this one out.
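
In case it helps with diagnosis, this is roughly what I've been poking at from the resource manager (SLAVE_HOSTNAME below is a placeholder for whatever the profile actually names the slave nodes):

ping -c 3 SLAVE_HOSTNAME
sudo iptables -L -n     # checking for REJECT/DROP rules that could explain "no route to host"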

Jeff Ballard

Mar 13, 2018, 7:51:35 PM
to cloudlab-users
Do your worker nodes show up in your Hadoop "All Applications" list?
(NB: firewalls might be in the way.  I happen to be at U-Wisc, and using my campus VPN I have no problems getting to the wisc.cloudlab.us machines... but I do not know how that might work for whatever network/VPN you are using or which machines you are on.)

If yes, then I don't know what might be causing your problem.

If the machines aren't showing up in Hadoop, can you ssh into the other nodes (from your computer)?  If you can't log in to the other nodes, then I'll note that this sounds kinda like a problem I faced with the Hadoop 2.7.3 profile some time ago.

That problem was so much of a pain that ever since then, when I use the profile, I make sure to update the entire OS on the machines before I start an experiment.  The process is to first ssh to all the nodes[1] (including the namenode and resourcemanager), update the OS (using sudo apt-get or aptitude[2]), and then reboot the nodes.  After rebooting the nodes, restart Hadoop.  I haven't had problems since I've been doing that... naturally this could totally be unnecessary, but it makes me feel safe. :)
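
Roughly, the sequence looks something like this (exact paths and scripts may differ depending on how the profile launches Hadoop, so treat this as a sketch):

# on every node (clusterssh, see [1]):
sudo apt-get update && sudo apt-get dist-upgrade
sudo reboot

# once everything is back up, restart HDFS and YARN, e.g. with the stock
# sbin scripts from the namenode/resourcemanager if that's what the profile uses:
/usr/local/hadoop/sbin/start-dfs.sh
/usr/local/hadoop/sbin/start-yarn.sh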

-Jeff

[1] clusterssh is a wonderful thing.

[2] One fine detail to note is that when updating everything, make sure you have GRUB install itself on both sda and sda1 (or hda and hda1, I forget since I'm not doing it at the moment).  All other questions should be answered with "keep the version of the configs installed" or whatever it says.