
Tutorial #2


A2 AK

Oct 19, 2024, 2:44:19 PM
to Sparkour
I am new to Linux, EC2, and Spark. I tried to follow the instructions in Tutorial #2 to install Spark so I can run some machine learning models.

Things appeared to be OK until I tried to execute the command spark-submit --version. I got the error message "JAVA_HOME is not set".

Any assistance would be appreciated. Thanks.

Brian Uri!

Oct 19, 2024, 3:17:30 PM
to Sparkour
Hi,

Tutorial #2 is showing its age -- it was originally written and tested against the 2018.03 edition of the Amazon Linux OS. If you're getting a Java error in 2024, it's likely that Amazon Linux no longer includes Java by default.

Please try this:

1) Check whether Java is available on the EC2 instance:

which java

2) If Java has not been installed, use yum to install one. You can install OpenJDK (which is what I had running on the 2018.03 OS) or a more modern Amazon Corretto JDK (choose one):

sudo yum install java-1.8.0-openjdk
sudo yum install java-23-amazon-corretto

3) Check again to confirm that Java is available:

which java

4) Try continuing with the Sparkour tutorial (you may need to re-"source" your ~/.bash_profile -- see the note after this list):

spark-submit --version
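
If JAVA_HOME is still unset after installing, you can set it yourself in ~/.bash_profile. A minimal sketch -- the JDK path below is an assumption, so verify yours first with readlink -f $(which java):

export JAVA_HOME=/usr/lib/jvm/java-23-amazon-corretto    # assumed path; adjust to your actual JDK
export PATH=$JAVA_HOME/bin:$PATH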

Lots of fiddly configuration if you are learning Linux, EC2, and Spark all at the same time! Good luck!

Regards,
BU

A2 AK

Oct 19, 2024, 6:29:14 PM
to Sparkour
Great. Your suggestion worked. I used "yum install java" instead, as "yum install java-1.8.0-openjdk" gave me an error message.

Now I need to figure out how to load my input file into HDFS and set up at least one slave VM for Spark. I will study your other tutorials to see if I can find answers.

Thank you so much, Brian.

Regards,
A2 AK


A2 AK

Oct 21, 2024, 12:51:34 AM
to Sparkour
I couldn't figure out how to place a file into HDFS so that Spark can use it as input. I thought there would be a Hadoop folder under /usr/local, but I did not see one.

Any assistance would be greatly appreciated.

Brian Uri!

Oct 21, 2024, 7:10:12 AM
to Sparkour
You should not need a specific directory on the machine running Spark. When you set up an org.apache.hadoop.fs.FileSystem, you can pass in a URI that maps to the file wherever you've put it. For example, these two URIs would point to a file in your home directory:

file:///home/dotnetdot/test.txt
hdfs:///home/dotnetdot/test.txt
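
For example, in the pyspark shell (where sc is predefined), the URI scheme selects the filesystem -- a minimal sketch reusing the hypothetical paths above:

# local filesystem on the node -- no Hadoop services required
local_rdd = sc.textFile('file:///home/dotnetdot/test.txt')
# HDFS -- requires a running HDFS namenode that actually holds the file
hdfs_rdd = sc.textFile('hdfs:///home/dotnetdot/test.txt')
print(local_rdd.count())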

A2 AK

Oct 21, 2024, 2:19:12 PM
to Sparkour
Thanks again, Brian.

I was following another Spark tutorial to do a mkdir for the input files using "bin/hadoop fs -mkdir /...".

I followed your advice and accessed the input file directly using "sc.textFile('file:///home/ec2-user/...')". It worked.

However, the hdfs: option did not work. Out of curiosity, what is the difference between the file: and hdfs: options, and how would I "set up" my file so that the hdfs: option works?

Note that numpy does not appear to be included in the Python that comes with the Amazon Linux AMI on EC2, as I got an error on the statement "import numpy as np". I suppose I have to install numpy?

Appreciate greatly your kind assistance.

Regards,
A2 AK

Brian Uri!

Oct 21, 2024, 4:39:30 PM
to Sparkour
Yes, numpy isn't always in default installations. You can try installing it either with yum or with pip/pip3.
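
For example -- note that the yum package name below is a guess and varies across Amazon Linux versions, while the pip3 route is more portable:

sudo yum install python3-numpy    # assumed package name; try 'yum search numpy' to confirm
pip3 install --user numpy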

Regards,
BU

A2 AK

Oct 21, 2024, 6:21:10 PM
to Sparkour
Thanks, again. I installed numpy with yum. My Spark program successfully ran in the Spark shell.

Now I've run into a problem submitting the same Spark program using spark-submit. I followed the instructions in https://spark.apache.org/docs/latest/submitting-applications.html, but I got a connection-refused error to the Spark master ... (172.31.27.208 is the private IP address of my EC2 VM instance).

    $SPARK_HOME/bin/spark-submit --master spark://172.31.27.208:7077 /home/ec2-user/final.py

So I thought port 7077 was blocked. I added it to the security group and restarted the VM, but I still ran into the same error.

Any advice?

Brian Uri!

Oct 21, 2024, 6:31:01 PM
to Sparkour
A couple of things to troubleshoot:

1) Make sure a Spark master is actually running on the VM (using the start-master.sh script) -- see the commands after this list. When you run in the interactive shell, a temporary local master spins up and runs only until you quit the shell. (More info at https://sparkour.urizone.net/recipes/managing-clusters/ )
2) If you're running spark-submit from a dev environment outside the EC2 instance and you're sure the Security Group has your IPv4 address and the right port, things should be working. Any chance your dev environment is trying to communicate over IPv6 instead? (It's unlikely that you could SSH into the box but not connect via spark-submit, so #1 is the more likely culprit.)
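
For reference, starting a standalone master and a worker on the same box looks roughly like this (start-worker.sh is the script name in recent Spark releases; older releases call it start-slave.sh, and the master URL here is your instance's hostname from your earlier message):

$SPARK_HOME/sbin/start-master.sh
$SPARK_HOME/sbin/start-worker.sh spark://ip-172-31-27-208:7077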

Side note: AWS Security Group changes are instantaneous. When you add/modify/delete a rule, you do not need to restart the EC2 instance.

Regards,
BU

A2 AK

Oct 21, 2024, 11:16:40 PM
to Sparkour
Thanks, again, Brian. You are right, again.

The tutorial I used did not mention the need to start the master. Maybe it assumes a version that starts it automatically.

Following your advice, I ran start-master.sh. The previous error is gone now, yet after quite a few INFO messages, it just hangs without proceeding. The last INFO message is ...

     INFO StandaloneSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0

Any advice?

Brian Uri!

Oct 22, 2024, 6:23:40 AM
to Sparkour
Try these steps to see if the master is running and accepting connections:

1) From the same EC2 instance where you ran start-master.sh, find the master log file. When you start the master, it should report where it is logging. You can use stop-master.sh then start-master.sh again to see this info. For my master, I see:

starting org.apache.spark.deploy.master.Master, logging to /opt/spark/logs/spark-root-org.apache.spark.deploy.master.Master-1-ip-172-31-24-101.out

2) Tail this log in a terminal window.

tail -f <thePathToTheLogFile>

The last thing I see in the log before the master does anything useful is:

24/10/22 10:16:58 INFO Master: I have been elected leader! New state: ALIVE

3) In a second terminal window on the same EC2 instance, try starting up an interactive Spark shell pointing at the master. For my master hostname, I run:

$SPARK_HOME/bin/pyspark --master spark://ip-172-31-24-101:7077

4) Observe the log and confirm that the master accepted the connection from the interactive shell. In my log, I see:

24/10/22 10:17:42 INFO Master: Registering app PySparkShell
24/10/22 10:17:42 INFO Master: Registered app PySparkShell with ID app-20241022101742-0000

If all 4 steps worked, your master is running, so the issue might be somewhere between the master and wherever you're running spark-submit from.

Regards,
BU

A2 AK

Oct 22, 2024, 11:17:44 PM
to Sparkour
1) The output I got is ... starting org.apache.spark.deploy.master.Master, logging to /opt/spark/logs/spark-ec2-user-org.apache.spark.deploy.master.Master-1-ip-172-31-27-208.us-east-2.compute.internal.out

2) The last line output I got is ... 24/10/23 02:59:53 INFO Master: I have been elected leader! New state: ALIVE

3) Started the interactive Spark shell ... $SPARK_HOME/bin/pyspark --master spark://ip-172-31-27-208:7077

4) The log has these 2 lines ...
        24/10/23 03:12:17 INFO Master: Registering app PySparkShell
        24/10/23 03:12:17 INFO Master: Registered app PySparkShell with ID app-20241023031217-0000

I ran all the commands from MobaXterm, including the successful run of the interactive Spark shell.

A2 AK

Oct 23, 2024, 2:04:05 AM
to Sparkour
I sensed that there was a connectivity issue between the master and the worker, even though they are on the same VM instance. So I edited the hosts file to map the private IP address to its ip-x-x-x hostname (entry shown below). After this, the Spark code started running. It runs awfully slowly, though; the interactive shell run was much faster. I suppose I need to add worker nodes to the Spark cluster.
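
Something like this, with my private IP and its EC2 hostnames:

172.31.27.208   ip-172-31-27-208   ip-172-31-27-208.us-east-2.compute.internal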

A couple of things I still can't figure out ...

(1) I can't access the master UI using http://3.129.90.70:8080 (3.129.90.70 is the public IP address of the master VM). I could ping that IP address after I added ICMP to the security group. Ports 80, 443, 8080, and 8081 are open in my security group.

(2) How do I stop the spark-submit run? I tried Ctrl-Z and Ctrl-C but neither appears to work.

(3) How can I list the running Java processes on each VM instance? The list should include the master and worker processes. I tried jps after reading some articles, but the JDK I installed does not appear to include jps.

Thanks again.

Brian Uri!

Oct 23, 2024, 7:06:30 AM
to Sparkour
There's some overhead in involving masters/workers in your Spark jobs, made worse if you're hosting everything on one VM or using one of the budget EC2 sizes. I get reasonable performance running my Cluster tutorial #3 on a "large" instance type.

1) Not sure why you can't view the master UI in a browser if the Security Group is configured for your IP. My Security Group rule is "Custom TCP / TCP / ports 8080 - 8082" and I can get in via browser. If you can't solve this immediately, you should still be able to get a lot of the info you want via logs.

2) There is a --kill parameter on the spark-submit script that lets you kill a running app based on its ID (see the sketch after this list). You can get the ID from the Master UI or logs.

3) Try just using ps:
ps -fC java
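
For (2), a rough sketch of the --kill usage -- per the Spark docs it takes the submission ID of an app submitted to a standalone master in cluster mode, so substitute the ID you see in the Master UI or logs:

$SPARK_HOME/bin/spark-submit --kill <submissionId> --master spark://ip-172-31-27-208:7077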

Regards,
BU

A2 AK

Oct 23, 2024, 3:21:06 PM
to Sparkour
Thanks for the guidance, again, Brian.

Last night, the spark-submit job ran so slowly that I killed it and went to bed. I got some output from the program before killing it. Today, when I resubmitted, it did not even run. The error was ...

24/10/23 18:35:29 ERROR TaskSchedulerImpl: Lost executor 0 on 172.31.27.208: worker lost: Not receiving heartbeat for 60 seconds

Could it be due to running out of CPU cycles?

(1) The next step for me is to set up a cluster of 2 or 3 Spark-Hadoop VM instances. I saw your tutorial on the spark-ec2 script and plan to follow the steps there. I assume the spark-ec2 script will work on my free-tier EC2?

(2) I did not see a tutorial on how to upload a data file (CSV in my case) to HDFS, or on how to have the Spark code access that HDFS file. Any suggestions on where I can find this?

(3) Since I am using MobaXterm to log into my master VM to do work, I suppose the master VM is also my "development environment" per your tutorial. How can I then access spark://ip-x-x-x-x:7077? I still could not access my VM using HTTP on port 8080 for some reason.

Brian Uri!

Oct 24, 2024, 12:08:36 PM
to Sparkour
If the master cannot connect to the worker, it might be best to stop all of the workers and restart them fresh -- not sure why the worker might have gotten lost.

(1) The spark-ec2 script will work for your free-tier EC2 instance. However, be aware that the script was split away from the main Spark distro around 2020 and is no longer maintained. I have heard good things about Flintrock as an alternative but have no direct experience with it:
https://github.com/nchammas/flintrock

(2) I never did get around to writing an HDFS tutorial, opting instead to focus on using S3 as storage. There are 3 S3-related recipes listed under the "Spark Integration" tab.

(3) I'm not familiar with MobaXterm or what sorts of SSH tunneling it might be doing that might make connectivity messy. While logged into the VM through MobaXterm, you could try doing a "wget" on the Master UI 8080 URL to see if you can get to it without involving a web browser. That might narrow down the troubleshooting.

Regards,
BU

A2 AK

Oct 24, 2024, 10:58:03 PM
to Sparkour
The heartbeat probably did not get through because of high CPU usage. Before I submitted the Spark code, I stopped and restarted the master and the worker, so they were fresh. I tested with a more trivial Spark program and it ran fine, but when I submitted my machine learning Spark code, the EC2 CPU monitor jumped to 100% quickly. It looks like the connectivity issue I chased with the hosts file edit was a red herring: the slowness is due to high CPU, not the network.

Thanks for the comments in (1)-(2).

(3) The wget got an error ... Connection refused.

Regards,
A2 AK

Brian Uri!

Oct 25, 2024, 6:14:08 AM
to Sparkour
Interesting -- here's what I see when I run the commands through PuTTY. (My Security Group allows Custom TCP / TCP 8080-8082 for my IPv4 address, with a /32 CIDR suffix, but that shouldn't come into play when running these commands on the same EC2 instance as the master.)

[root@ip-172-31-24-101 sbin]# $SPARK_HOME/sbin/start-master.sh

starting org.apache.spark.deploy.master.Master, logging to /opt/spark/logs/spark-root-org.apache.spark.deploy.master.Master-1-ip-172-31-24-101.out

[root@ip-172-31-24-101 sbin]# vi /opt/spark/logs/spark-root-org.apache.spark.deploy.master.Master-1-ip-172-31-24-101.out

24/10/25 10:07:59 INFO MasterWebUI: Bound MasterWebUI to 0.0.0.0, and started at http://ip-172-31-24-101.ec2.internal:8080

[root@ip-172-31-24-101 ec2-user]# wget http://ip-172-31-24-101.ec2.internal:8080

--2024-10-25 10:08:28--  http://ip-172-31-24-101.ec2.internal:8080/
Resolving ip-172-31-24-101.ec2.internal (ip-172-31-24-101.ec2.internal)... 172.31.24.101
Connecting to ip-172-31-24-101.ec2.internal (ip-172-31-24-101.ec2.internal)|172.31.24.101|:8080... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5343 (5.2K) [text/html]
Saving to: ‘index.html’

index.html                        100%[===========================================================>]   5.22K  --.-KB/s    in 0s

2024-10-25 10:08:28 (589 MB/s) - ‘index.html’ saved [5343/5343]

[root@ip-172-31-24-101 ec2-user]# vi index.html
(this shows a well-formed HTML file with title tag "Spark Master at spark://ip-172-31-24-101.ec2.internal:7077")

Hope it helps!
BU

A2 AK

Oct 25, 2024, 10:01:16 PM
to Sparkour
This is what I got; I had to revise the URL to get this far ...

[ec2-user@ip-172-31-27-208 /]$ wget http://ip-172-31-27-208.us-east-2.compute.internal:8080
--2024-10-26 01:57:46--  http://ip-172-31-27-208.us-east-2.compute.internal:8080/
Resolving ip-172-31-27-208.us-east-2.compute.internal (ip-172-31-27-208.us-east-2.compute.internal)... 172.31.27.208
Connecting to ip-172-31-27-208.us-east-2.compute.internal (ip-172-31-27-208.us-east-2.compute.internal)|172.31.27.208|:8080... connected.

HTTP request sent, awaiting response... 200 OK
Length: 5701 (5.6K) [text/html]
index.html: Permission denied

Cannot write to ‘index.html’ (Permission denied).

Brian Uri!

Oct 26, 2024, 6:12:42 AM
to Sparkour
Your error may be unrelated to connectivity. It looks like ec2-user is sitting in the / directory and doesn't have write privileges there. Try adding wget's directory-prefix flag:

wget http://ip-172-31-27-208.us-east-2.compute.internal:8080 -P /home/ec2-user
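
Or, if you only care about connectivity and don't need the file at all, discard the output instead of saving it:

wget -O /dev/null http://ip-172-31-27-208.us-east-2.compute.internal:8080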

A2 AK

Oct 26, 2024, 11:36:17 PM
to Sparkour
It works. Thanks.

This is my first attempt at EC2. I saw the Connect option to get terminal access to a VM instance. Is there a browser access option on the EC2 console to access spark://ip-172-31-24-101.ec2.internal:7077?

Brian Uri!

Oct 27, 2024, 12:52:12 PM
to Sparkour
Not that I know of. AWS is unaware of the spark:// protocol.

Regards,
BU

A2 AK

Oct 28, 2024, 1:45:44 AM
to Sparkour
I'm still puzzled by the issues I have been having accessing the HTML pages. Anyway, I need to move on to the Spark cluster and HDFS.

Thanks so much for your help, time, and guidance so far.
