I'm trying to specify the base directory for HDFS files in my hdfs-site.xml under Windows 7 (Hadoop 2.7.1 that I built from source, using Java SDK 1.8.0_45 and Windows SDK 7.1). I can't figure out how to provide a path that specifies a drive.
You can specify a drive spec in hadoop.tmp.dir in core-site.xml by prepending a '/' to the absolute path and using '/' as the path separator instead of '\' for all path elements. For example, if the desired absolute path is D:\tmp\hdp, then it would look like this:
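Based on the description above, the core-site.xml entry would be:

    <property>
      <name>hadoop.tmp.dir</name>
      <value>/D:/tmp/hdp</value>
    </property>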
The reason this works is that the default values for many of the HDFS directories are configured as file://${hadoop.tmp.dir}/<suffix>. See the default definitions of dfs.namenode.name.dir, dfs.datanode.data.dir, and dfs.namenode.checkpoint.dir in hdfs-default.xml:
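Quoted from memory of the 2.7.x defaults (verify against your distribution's hdfs-default.xml):

    dfs.namenode.name.dir        file://${hadoop.tmp.dir}/dfs/name
    dfs.datanode.data.dir        file://${hadoop.tmp.dir}/dfs/data
    dfs.namenode.checkpoint.dir  file://${hadoop.tmp.dir}/dfs/namesecondary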
Substituting the above value for hadoop.tmp.dir yields a valid file: URI with a drive spec and no authority, which satisfies the requirements for the HDFS configuration; dfs.namenode.name.dir, for example, resolves to file:///D:/tmp/hdp/dfs/name. It's important to use '/' instead of '\', because a bare unencoded '\' character is not valid in URL syntax.
If you prefer not to rely on this substitution behavior, then it's also valid to override all configuration properties that make use of hadoop.tmp.dir within your hdfs-site.xml file. Each value must be a full file: URI. For example:
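A sketch of what that hdfs-site.xml could look like, assuming the same D:\tmp\hdp base path as above:

    <property>
      <name>dfs.namenode.name.dir</name>
      <value>file:///D:/tmp/hdp/dfs/name</value>
    </property>
    <property>
      <name>dfs.datanode.data.dir</name>
      <value>file:///D:/tmp/hdp/dfs/data</value>
    </property>
    <property>
      <name>dfs.namenode.checkpoint.dir</name>
      <value>file:///D:/tmp/hdp/dfs/namesecondary</value>
    </property>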
Recently a client asked how we would go about connecting a Windows share, via NiFi, to HDFS, or whether it was even possible. This is how you build a working proof of concept to demo the capabilities!
Next you need to set up a Windows share of some kind. This can be combined with Active Directory, but I personally just enabled guest accounts and made an account called Nifi_Test. These instructions were the basis for creating the Windows share: -how-to-make-unc-folder-shares/ Keep in mind that network user permissions may get funky, and the example above will enforce read-only permission unless you do additional work.
Thanks for sharing this article. I was working with one of our clients but I couldn't achieve this. They gave me a DFS path and I mounted it in Linux using 'mount -t cifs //mydfsdomain.lan/namespaceroot/sharedfolder /mnt -o username=windowsuser'.
Then I created a directory in HDFS as you do, and created the processors in NiFi. But when I start the flow, nothing happens. After waiting 2-3 minutes, GetFile throws the error 'GC overhead limit exceeded'.
It's probably reading the same file repeatedly without permission to delete it. On the GetFile processor, configure it to run only every 5 seconds. Then, in the flow view, right-click and refresh the page, and you will probably see the outbound queue with a file in it. If you don't refresh the view, you may not see the flow files building up; then enough builds up and you run out of memory.
I set the input directory to the inner path within the mounted directory. Then I started the flow, and this time it wrote the XMLs to HDFS. So it seems that the Recurse Subdirectories property is not working. Were you able to use the Recurse Subdirectories property correctly? I still can't manage to write all the XMLs in all subdirectories automatically!
We recently got a big new server at work to run Hadoop and Spark (H/S) for a proof-of-concept test of some software we're writing for the biopharmaceutical industry, and I hit a few snags while trying to get H/S up and running on Windows Server 2016 / Windows 10. Here I've documented, step by step, how I managed to install and run this pair of Apache products directly in the Windows cmd prompt, without any need for Linux emulation.
I can't guarantee that this guide works with newer versions of Java. Please try with Java 8 if you're having issues. Also, with the new Oracle licensing structure (2019+), you may need to create an Oracle account to download Java 8. To avoid this, simply download from AdoptOpenJDK instead.
Even though newer versions of Hadoop and Spark are currently available, there is a bug in Hadoop 3.2.1 on Windows that causes installation to fail. Until a patched version (3.1.4, 3.2.2, or 3.3.0) is available, you must use an earlier version of Hadoop on Windows.
Next, download 7-Zip to extract the *.gz archives. Note that you may need to extract twice (once to go from the *.gz to the *.tar files, then a second time to "untar" them). Once they're extracted (Hadoop takes a while), you can delete all of the *.tar and *.gz files. You should now have two directories and the JDK installer in your Downloads directory:
Note that -- as shown above -- the "Hadoop" directory and "Spark" directory each contain a LICENSE, NOTICE, and README file. With particular versions of Hadoop, you may extract and get a directory structure like
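For example, you might see something like this (with <version> standing in for the actual version number):

    Downloads\
      hadoop-<version>\
        hadoop-<version>\
          bin\
          etc\
          LICENSE.txt
          ...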
...if this is the case, move the contents of the inner hadoop- directory to the outer hadoop- directory by copying-and-pasting, then delete the inner hadoop- directory. The path to the LICENSE file, for example, should then be:
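Assuming the archive was extracted into your Downloads folder (with <username> and <version> as placeholders):

    C:\Users\<username>\Downloads\hadoop-<version>\LICENSE.txt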
WARNING: If you see a message like "Can not create symbolic link : A required privilege is not held by the client" in 7-Zip, you MUST run 7-Zip in Administrator Mode, then unzip the directories. If you skip these files, you may end up with a broken Hadoop installation.
Move the Spark and Hadoop directories into the C:\ directory (you may need administrator privileges on your machine to do this). Then, run the Java installer but change the destination folder from the default C:\Program Files\AdoptOpenJDK\jdk-\ to just C:\Java. (H/S can have trouble with directories with spaces in their names.)
Once the installation is finished, you can delete the Java *.msi installer. Make two new directories called C:\Hadoop and C:\Spark and copy the hadoop- and spark- directories into those directories, respectively:
If you echo %PATH% in cmd, you should now see these three directories somewhere in the middle of the path, because the User Path is appended to the System Path to build the %PATH% variable. Now check that java -version, hdfs -version, and spark-shell --version each return a version number, as shown below. This means that they were correctly installed and added to your %PATH%:
Please note that if you try to run the above commands from a location with any spaces in the path, the commands may fail. For example, if your username is "Firstname Lastname" and you try to check the Hadoop version, you may see an error message like:
...yes, they should be forward slashes, even though Windows uses backslashes. This is due to the way that Hadoop interprets these file paths. Also, be sure to replace the placeholder with the appropriate Hadoop version number. Finally, edit yarn-site.xml so it reads:
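A typical minimal yarn-site.xml for a single-node setup looks like the following (a common baseline, not necessarily the exact file from this walkthrough):

    <configuration>
      <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
      </property>
      <property>
        <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
      </property>
    </configuration>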
Now, you need to apply a patch posted to GitHub by user cdarlint. (Note that this patch is specific to the version of Hadoop you're installing, but if the exact version isn't available, try the one just before the desired version... that sometimes works.)
Make a backup of your %HADOOP_HOME%\bin directory (copy it to \bin.old or similar), then copy the patched files (specific to your Hadoop version, downloaded from the above git repo) to the old %HADOOP_HOME%\bin directory, replacing the old files with the new ones.
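In cmd, that backup-and-replace step might look like this (a sketch; <patch-dir> is a placeholder for wherever you put the downloaded files):

    rem back up the original bin directory
    xcopy "%HADOOP_HOME%\bin" "%HADOOP_HOME%\bin.old\" /E /I

    rem overwrite with the patched files for your Hadoop version
    copy /Y "<patch-dir>\*" "%HADOOP_HOME%\bin\"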
We now know how to create directories (fs -mkdir) and list their contents (fs -ls) in HDFS, but what about creating and editing files? Well, files can be copied from the local file system to HDFS with fs -put, and we can then read them in the spark-shell with sc.textFile(...):
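For example (the file name and HDFS paths here are illustrative, and hdfs://localhost:9000 assumes the usual single-node fs.defaultFS):

    hadoop fs -mkdir /example
    hadoop fs -put C:\Users\<username>\sample.txt /example/sample.txt
    hadoop fs -ls /example

Then, inside spark-shell:

    val lines = sc.textFile("hdfs://localhost:9000/example/sample.txt")
    lines.count()   // number of lines in the file
    lines.first()   // the first line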
So there you have it! Spark running on Windows, reading files stored in HDFS. This took a bit of work to get going, and I owe a lot to the people who previously encountered the same bugs as I did, or wrote the tutorials I used as a framework for this walkthrough. Here are the blogs, GitHub repos, and SO posts I used to build this tutorial:
These error messages are giving you hints about what's going wrong. It looks like your %PATH% is set up correctly and hadoop is on it, but you can't run the hadoop command by itself. That's what the error message is telling you. You need to include additional command-line arguments.
So, I looked at the contents of the start-yarn.cmd file, and it calls the yarn command. I tried calling yarn in an independent console and got the same error. That is why I think the problem is yarn, the command itself.
If all of that checks out, and the %PATH% is correct, and all of the .cmd files are on the path, I'm not sure what else I would do. There's no reason why those commands shouldn't work if they're on the %PATH%.
Hi Andrew,
It's me again. Now I am testing on my personal machine, but I'm having another problem. On my local machine my user is "David Serrano"; as you can see, it has a space in it. When I try to format the namenode with "hdfs namenode -format", I get this error:
I think I can do something similar to the advice in the blog above. However, I need to know which variable Hadoop uses to call Java, in order to change it in the config files.
If you have any info about it, please post it here so we can try to solve the problem.
Thanks in advance.
Hadoop uses JAVA_HOME to determine where your Java distribution is installed. In a Linux installation, there's a file called hadoop/etc/hadoop/hadoop-env.sh. It might be .cmd instead of .sh on Windows, but I'm not sure.
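For reference, Hadoop's etc\hadoop directory does ship a hadoop-env.cmd on Windows, i.e. %HADOOP_HOME%\etc\hadoop\hadoop-env.cmd, and JAVA_HOME is set there like this (C:\Java being the install location suggested earlier in the post):

    set JAVA_HOME=C:\Java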
Yes, the JAVA_HOME variable is fine on my laptop. However, Hadoop must use the %USERNAME% or %USERPROFILE% variable somewhere else in its code; those variables are the problematic thing. I need to locate that part of Hadoop and try to change it in some config file (if that's possible). Actually, I have another machine with Ubuntu, and Hadoop works normally there. The idea was to install on Windows to do some specific work on both systems.