Question about Amazon EC2 instances that don't "work"

Fantin Charles

May 20, 2022, 3:50:49 AM
to Web Data Commons

Hi all,

I am following this tutorial for the WDC Framework. I followed it step by step and everything seems to work, except that the instances seem to do nothing:

I have objects in the queue, and the instances are started and functional (I can see them in the AWS EC2 instances panel), but nothing changes when I monitor the process, even when I wait 15 minutes or more.
Nothing comes back either when I try to retrieve the data/stats.

Notes:
- I don't use the same data file as the tutorial: CC-MAIN-2022-05/segments/1642320299852.23/warc
- I don't use the same instance type (for now): t2.micro (ami-0022f774911c1d690)
- What I have in the terminal: Q: 1440 (0), N: 10/10
- I haven't modified any Java file
- Only one warning appears when I start the instances (and I failed to fix it): WARN 14:57:16 Value not found in configuration for key ec2subnetid (ProcessingNode.java:77)

Am I doing something wrong? Any ideas?

Thanks in advance,
Fantin

Anna Primpeli

unread,
May 20, 2022, 7:37:07 AM5/20/22
to web-data...@googlegroups.com
Hello Fantin,

Thank you for your message.

For the EC2 settings of our account, it was required to set a subnet ID. More information on EC2 subnets can be found here:
The subnet ID is requested in the Master.java class (src/main/java/org/webdatacommons/framework/cli/Master.java, line 987).

I guess this is the reason that your instances are started but then remain idle.

You can try the following fixes:
1. Create a subnet for your instances and add the subnet id in the dpef.properties file as "ec2subnetid =<subnet-id-value>"
2. Comment out line 987 - but I doubt this will work.

Please also note that the script run on each instance has been tested on Ubuntu 16.04, and therefore your AMI needs to support that.

In case my suggestions (1 or 2 above) do not work, please try to log in to your running instances to see what happens and whether any errors are written to the log file.

Then we can do some further troubleshooting if necessary.

Best regards,
Anna
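
Before recompiling and redeploying, it may help to confirm that the key from fix 1 is actually present and non-empty in the properties file. The following is only a minimal sketch: the helper name get_property is hypothetical and not part of the WDC framework, and it assumes a plain "key = value" layout in dpef.properties.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the WDC framework).
# Prints the value of a key from a Java-style .properties file.
# Returns non-zero when the key is missing, commented out, or empty.
get_property() {
  local file="$1" key="$2" value
  # Match "key=value" or "key = value" at the start of a line;
  # lines starting with "#" do not match. The last match wins.
  value="$(sed -n -E "s/^[[:space:]]*${key}[[:space:]]*[=:][[:space:]]*(.*)$/\1/p" "$file" | tail -n 1)"
  # Strip trailing whitespace (including a stray carriage return).
  value="${value%"${value##*[![:space:]]}"}"
  [ -n "$value" ] && printf '%s\n' "$value"
}
```

For example, running get_property dpef.properties ec2subnetid from the directory that holds the file should print the subnet ID, and print nothing (with a non-zero exit status) if the key is absent or empty.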




--
You received this message because you are subscribed to the Google Groups "Web Data Commons" group.
To unsubscribe from this group and stop receiving emails from it, send an email to web-data-commo...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/web-data-commons/f2ec7806-a59c-4d54-ac7f-ace5c689d756n%40googlegroups.com.


Fantin Charles

May 24, 2022, 9:15:54 AM
to Web Data Commons
Hi Anna,

Thanks for your answer.

Following your suggestion, I successfully created and used a subnet, but now I have another issue that I don't understand:

   - I can queue the data and start the instance without error (from the program's point of view)
   - I don't get the previous warning message anymore
   - In the Amazon instance panel I see no instance running at all
   - Here is what I get from the monitor command: Q: 720 (0), N: 0/1  (I asked for only one worker)
   - I haven't modified anything in the properties file (except for the TODO fields)
   - I waited more than 10 minutes to see if the instance would start.

Any ideas?

Thanks again,
Fantin

apri...@gmail.com

May 25, 2022, 3:47:48 AM
to Web Data Commons
Hi Fantin,

You are welcome!

If you cannot see the instance being initialized in the AWS monitor panel, the most likely cause is that the price per worker you provided is too low (parameter [pricelimit] or [p] when you run the start command from Master.java). You can find pricing details about the EC2 instances here: https://aws.amazon.com/ec2/pricing/.

Once you see the instance being initialized in the AWS monitor panel, you can log in to the instance using its IP address from a command-line tool or, if you prefer, something like WinSCP, and check whether the process is running inside the instance. Here you can find some info on how to connect to your instance via PuTTY: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/putty.html.

If the process is running, you should be able to see new files being created in its storage.

Additionally, please note that processing a single WARC file from CC usually takes 40-45 minutes on a c3.2xlarge instance. If you use an instance with fewer CPUs, you should increase the job time limit parameter, which you can find in the dpef.properties file.

Finally, please note that the extraction is bound to the configuration of your instances: if your instance does not have enough storage for the intermediate files and results, an exception will be thrown. To avoid that, either change the instance type or configure the framework to flush the intermediate data at an earlier stage.

I hope this helps!

Best regards,
Anna
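
The two on-instance checks described above (is the process running, and is there storage left) can be sketched in a few lines of shell once you are logged in. This is only an illustration under stated assumptions: it assumes the worker shows up as a process named "java", and it looks at the root filesystem, which may not be where the framework writes its intermediate files. The function names are hypothetical.

```shell
#!/usr/bin/env bash
# Assumption: the worker appears as a process whose name is exactly "java".
worker_running() {
  pgrep -x java > /dev/null 2>&1
}

# Percentage of used space on the filesystem that holds the given path,
# taken from the "Capacity" column of POSIX-format df output.
used_percent() {
  df -P "$1" | awk 'NR == 2 { gsub("%", "", $5); print $5 }'
}

if worker_running; then
  echo "a java process is running"
else
  echo "no java process found"
fi
echo "disk used on /: $(used_percent /)%"
```

If no java process is found shortly after the instance starts, the startup script itself is the next thing to look at rather than the job time limit.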

Fantin Charles

May 31, 2022, 9:38:38 AM
to Web Data Commons
Hi,

Thanks again for your help.

I followed your advice and connected to my instances via ssh in order to monitor what is happening inside.
My issue is: no process is running and I can't figure out why.

I used the command "top" to monitor and saw no process running. I also saw no files being created.


Note: I made sure to use the right AMI and a large enough instance. I also tried to increase the price limit and the "jobTimeLimit".
Any ideas?

Best regards,
Fantin

Anna Primpeli

Jun 1, 2022, 8:27:20 AM
to web-data...@googlegroups.com
Hello Fantin,

Does this mean that the .jar file is not copied to your running instances either?
Every initialized instance makes a copy of the project .jar file, which should be deployed in an S3 bucket. The .jar file should be found under tmp/start.jar within your instance.

Could you please check if this is the case?

In case the .jar file is not there, you can try the following steps:
1. Make sure you deploy the correct jar file to S3 (deploy command), i.e. the jar file with dependencies. You should compile the framework with your updated dpef.properties config file before deploying.
2. Check that the .jar file can be found in the correct S3 bucket, which is the one you set as deploybucket in the dpef.properties file.
3. If points 1 and 2 are covered and the problem still remains, you can try to manually execute the startup script, which should normally start automatically in each initialized instance. You can find the startup script in the Master class, in the variable private final String startupScript. Please run the startup script within your instance and let us know if you receive any error messages. This can help us to assist you further if necessary.

Best regards,
Anna
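
The check on the copied .jar file can be made a little stricter than "the file exists": a jar is a ZIP archive, so a usable copy should be non-empty and start with the bytes "PK". The sketch below is a minimal example with a hypothetical function name; it does not verify that the archive is complete or that it is the jar with dependencies.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds only if the file exists, is non-empty,
# and begins with the ZIP magic bytes "PK" (jar files are ZIP archives).
jar_looks_valid() {
  local jar="$1"
  [ -s "$jar" ] || return 1          # missing or zero bytes
  [ "$(head -c 2 "$jar")" = "PK" ]   # first two bytes of any ZIP archive
}
```

On the instance this could be called as jar_looks_valid /tmp/start.jar && echo "jar looks usable", assuming the tmp folder mentioned above is /tmp. A zero-byte start.jar fails the first test, which points at the copy from S3 rather than at the framework code.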



Fantin Charles

Jun 2, 2022, 4:00:02 AM
to Web Data Commons
Hello Anna,

After verification:
- dpef.jar is present in my deploy bucket (and it is not empty: ~200 MB)
- tmp/start.jar is present too, but it is empty
- tmp/start_errors.log contains "/var/lib/cloud/instance/scripts/part-001: line 9: java: command not found"
- tmp/start.log is empty

So I tried to execute the startup script within my instances:
- When I run: sudo apt-get install openjdk-8-jdk
  I get: E: Unable to locate package openjdk-8-jdk
- When I run: sudo add-apt-repository ppa:openjdk-r/ppa
  I get: Please check that the PPA name or format is correct.

Best regards,
Fantin

Anna Primpeli

Jun 2, 2022, 4:33:49 AM
to web-data...@googlegroups.com
Hello Fantin,

OK, so it seems that the problem is the empty start.jar in the tmp folder of your instances, which indicates that the file cannot be copied correctly from your S3 bucket to the instance. Is your S3 bucket configured to be private or public?

Also, could you please run the complete startup script, not just the individual commands as you did, and tell me which error you get there?

Best,
Anna

Fantin Charles

Jun 3, 2022, 3:50:53 AM
to Web Data Commons
Hi Anna,
After verification, my S3 bucket is configured to be public.

I ran the complete startup script. It runs to the end and puts some data in tmp/start.jar (21.5 MB).
I get two errors in the process:

Cannot add PPA: 'ppa:openjdk-r/ppa'.
Please check that the PPA name or format is correct.

and

E: Unable to locate package openjdk-8-jdk

Best,
Fantin
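
Both errors say that apt could not find the package or the PPA in the sources it knows about, so one thing worth ruling out is the operating system release of the instance, given that the startup script was tested on Ubuntu 16.04. The sketch below is a hedged example, not a fix: it prints the distribution ID and version from /etc/os-release and reports whether java is on the PATH. The function name is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints "<ID> <VERSION_ID>" from an os-release file.
# The file is sourced in a subshell so its variables do not leak out.
os_id_version() {
  local file="${1:-/etc/os-release}"
  ( . "$file" && printf '%s %s\n' "${ID:-unknown}" "${VERSION_ID:-unknown}" )
}

if [ -r /etc/os-release ]; then
  echo "OS release: $(os_id_version)"
fi

if command -v java > /dev/null 2>&1; then
  echo "java found at: $(command -v java)"
else
  echo "java is not on PATH"
fi
```

If the reported release is not Ubuntu 16.04, the package names used by the startup script may simply not exist for that system; if it is, the package index or network access of the instance would be the next thing to check.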