wdc extraction framework question


che...@vsri.biz

Mar 17, 2015, 2:40:38 AM
to web-data...@googlegroups.com
hi all,

Thanks for setting this framework up and the detailed writeup at 

(Within the description of what I did below, my questions are marked as Question1 and Question2.)

I followed the steps there and was able to get the spot instances running on EC2. Before making any changes for my own extraction, I simply tried to run the code using the WAT processor. However, the queue doesn't seem to start getting processed.

./bin/master queue --bucket-prefix CC-MAIN-2013-48/segments/1386163041297/wat/
added 100 objects to the queue

Then I started 10 workers (with instance type c3.xlarge). One of the changes I made in dpef.properties is the following, since the AMI that was already listed there gave an error.

## AMI which will be launched (Make sure the AMI you select has e.g. the write system language, which can influence your reading and writing of files.)
ec2ami = ami-01940631

Question1: 
I am not sure how to select the AMI as per the comment above. It says 'make sure the AMI you select has e.g. the write system language'. How is that done? Is the AMI I have chosen okay for that?

 ./bin/master monitor
Monitoring job queue, extraction rate and running instances.
Q: 100 (0), N: 10/10                          

Even after 10-15 minutes, it was still showing this same state.

Question2: 
I would appreciate any hints/pointers on how to check whether the WAT files are being processed. For example, where on EC2 should I look for log files for the jobs?

Thanks in advance for your help,
Cheenu

Robert Meusel

Mar 17, 2015, 3:56:03 AM
to web-data...@googlegroups.com
Hi Cheenu,

Great that you find the framework helpful.

Regarding your questions:

Q1: You can test your AMI manually by launching it via the AWS console, running the .jar by hand, and seeing whether it works. You can also log into the EC2 instance and check for the right language settings using the Linux command locale. If you do not have any special requirements, you can just leave the AMI as it is.
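A rough sketch of that manual check (key file, host, and jar name are placeholders for whatever your setup uses):

$ ssh -i my-key.pem ec2-user@<instance-public-dns>        # log into the freshly launched AMI
$ locale                                                  # should show the expected language, e.g. en_US.UTF-8
$ java -version                                           # verify Java is available at all
$ java -cp pdef.jar org.webdatacommons.framework.Worker   # run the deployed jar by hand (see Q2 below)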


Q2: Recommendation: start small. Run the framework with one machine and log into that machine. You can also check the system log (/tmp/start.log) of the EC2 instance in the AWS console to see if there was a problem (e.g. installing Java, which can fail because Java needs to be installed and, depending on the AMI, the package might not be available). For testing the whole framework you can also start it locally on your server/computer to debug it, e.g. from an IDE or with java -cp framework.jar org.webdatacommons.framework.Worker.
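If you do not want to log into the machine, you can also pull the instance's system log with the AWS CLI (the instance id below is just a placeholder, and the CLI must be configured with your credentials):

$ aws ec2 get-console-output --instance-id i-0123456789 --output text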

Hope this helps. Let me know if you need further support.

Robert

Srinivasan Venkatachary

Mar 17, 2015, 3:20:56 PM
to web-data...@googlegroups.com
Thanks for the pointers, Robert. I made some progress, but I am still not getting the MR to process the queue.

- Used the Amazon default ami-dfc39aef. Created an instance and ran java -version to check that Java is installed. $ locale returned a bunch of en_US.utf8 entries.

- I ran the .jar on the machine where I built it (which is also an EC2 instance). It started processing the queue and processed several files one after another. 

- Set up the queue again and ran master start with 10 instances. The instances start up, but the queue is not getting processed. (The rough command sequence is sketched below.)
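Roughly the sequence I ran (I am paraphrasing from memory; the number of instances comes from my configuration rather than from an extra flag):

$ ./bin/master queue --bucket-prefix CC-MAIN-2013-48/segments/1386163041297/wat/
$ ./bin/master start     # brings up the workers; 10 c3.xlarge spot instances in my case
$ ./bin/master monitor   # stays at Q: 100 (0), N: 10/10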

More Questions:
- I logged into one of the instances that started up, as ec2-user. How do I check whether it is running the worker? Is there any log file I should look for? /tmp/start.log was empty. (What I poked at is sketched after these questions. Sorry, I am a total noob to Hadoop. I used to work at Google and did a bunch of web aggregation there, but it was all C++ and Google MR; I am trying to pick up how to do it with Java and AWS and use WDC.)

- How do these worker machines get the deployed .jar files? Is that done by master start? If so, is there any way to check where these jar files were put on the instances?
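For reference, this is what I poked at on the worker (nothing framework-specific, just standard commands):

$ ps aux | grep java     # is any worker JVM running at all?
$ ls -l /tmp             # looking for start.log or a downloaded jar
$ cat /tmp/start.log     # empty in my case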

thanks,
Cheenu

Srinivasan Venkatachary

Mar 17, 2015, 5:55:39 PM
to web-data...@googlegroups.com
From the Master.java code that sends a startup script, I see it is supposed to install Java and fetch the jar file. Since the comment says Ubuntu AMI, I tried with an Ubuntu AMI.

Logging into an instance and looking at /tmp gives the following. So somehow start.jar is 0 bytes, even though the pdef.jar deployed in S3 is ~100MB. Any thoughts on why this could be?
 
$ ls -l /tmp
total 8
drwxr-xr-x 2 root   root   4096 Mar 17 21:39 hsperfdata_root
drwxr-xr-x 2 ubuntu ubuntu 4096 Mar 17 21:46 hsperfdata_ubuntu
-rw-r--r-- 1 root   root      0 Mar 17 21:40 start.jar
-rw-r--r-- 1 root   root      0 Mar 17 21:40 start.log

thanks,

Cheenu

Srinivasan Venkatachary

Mar 17, 2015, 6:55:13 PM
to web-data...@googlegroups.com
Ok, I figured out how to get the MR moving. Not sure if this is something specific to my setup:

- Following your advice, I ran it on one machine, including manually running each of the commands in the "startup script". I found that the wget from S3 was failing.

- Master.java was generating a wget from http://s3.amazonaws.com/bucketname/pdef.jar. I changed that to http://bucketname.s3.amazonaws.com/pdef.jar and it worked! (I found this wget URL form on some forums.)
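In case it helps anyone else, the two URL styles side by side (bucketname is a placeholder for the actual deploy bucket):

$ wget http://s3.amazonaws.com/bucketname/pdef.jar     # path-style URL from the startup script; failed for me
$ wget http://bucketname.s3.amazonaws.com/pdef.jar     # virtual-hosted-style URL; this one worked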

Now retrievedata is failing as follows and I need to debug it. If you are able to provide any suggestions, that would be great:
./bin/master  retrievedata --destination output
Exception in thread "main" com.martiansoftware.jsap.UnspecifiedParameterException: Parameter 'multiThreadMode' has no associated value.
at com.martiansoftware.jsap.JSAPResult.getInt(Unknown Source)
at org.webdatacommons.framework.cli.Master.main(Master.java:568)


thanks,
Cheenu

Srinivasan Venkatachary

Mar 17, 2015, 10:41:54 PM
to web-data...@googlegroups.com
(Continuing to report progress on this thread, in case it is useful for someone trying this in the future.)

- Specifying --multiThreadMode <numberofthreads> with retrievedata makes the exception go away (see the command sketched below). Another way is to change NullPointerException in the 'catch' clause in Master.java to UnspecifiedParameterException.

- With that, retrievedata proceeds, prints the names of the .gz files in the S3 results bucket, and prints "retrieving file xxx" many times. But the output directory is still empty. Haven't figured out why yet.
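The workaround invocation, for reference (the thread count of 4 is just an arbitrary choice on my part):

$ ./bin/master retrievedata --destination output --multiThreadMode 4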

thanks,
Cheenu

Srinivasan Venkatachary

Mar 18, 2015, 12:33:12 AM
to web-data...@googlegroups.com
Ok, so reading the readme.txt, it says retrievedata is only applicable if you implement your own DataThread, so I guess I can ignore it. retrievestats worked without any issue.

Now I will attempt to write an extractor of my own (and will possibly start a new email thread with any questions that come up).

thanks,
Cheenu

Robert Meusel

Mar 18, 2015, 10:21:47 AM
to web-data...@googlegroups.com
Hi,

Good to know - I need to figure out why this is not working, as I ran an extraction two weeks ago.

Cheers,
Robert