Queue command


John Berrie · Apr 7, 2017, 9:36:02 AM · to Web Data Commons
I am having trouble getting the processor to run.  I think I am not queuing documents.  This is the command I am running and the output I get for the queue command:

./bin/master queue --bucket-prefix CC-MAIN-2016-44/segments/1476988717783.68/warc/ --file-number-limit 1000

INFO 09:11:36 Queuing all keys from bucket aws-publicdatasets with prefix segments/1476988717783.68/warc/ (Master.java:1165)
 INFO 09:11:36 Setting limit of files to: 1000 (Master.java:1195)
 INFO 09:11:38 Queued 0 objects for prefix /common-crawl/crawl-data/CC-MAIN-2016-44/segments/1476988717783.68/warc/ (Master.java:1243)
 INFO 09:11:38 Queued 0 objects for all given prefixes. (Master.java:1250)

Should I be getting "Queued 0 objects"?

Thanks,
John

Anna Primpeli · Apr 7, 2017, 9:49:53 AM · to Web Data Commons
Hello John,

please make sure that the following properties are correctly set in your configuration file:
dataBucket = commoncrawl
dataPrefix = crawl-data

Then you may run the queue command as you already did. This should get you the following info:

 INFO 15:42:21 Queuing all keys from bucket commoncrawl with prefix CC-MAIN-2016-44/segments/1476988717783.68/warc/ 
  INFO 15:42:21 Setting limit of files to: 1000 
  INFO 15:42:32 Queued 567 objects for prefix crawl-data/CC-MAIN-2016-44/segments/1476988717783.68/warc/
  INFO 15:42:32 Queued 567 objects for all given prefixes.
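
For context, judging from the log lines, the queue command simply lists the S3 keys under dataPrefix plus your --bucket-prefix argument, keeps those ending in dataSuffix, and pushes them to the SQS job queue; with the wrong dataBucket/dataPrefix the listing is empty, hence "Queued 0 objects". Below is only a minimal sketch of that idea, assuming the AWS SDK for Java 1.x, not the framework's actual code:

    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3ClientBuilder;
    import com.amazonaws.services.s3.model.ObjectListing;
    import com.amazonaws.services.s3.model.S3ObjectSummary;

    public class QueueSketch {
        public static void main(String[] args) {
            AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
            // dataBucket and dataPrefix as in dpef.properties; the --bucket-prefix argument is appended.
            String bucket = "commoncrawl";
            String prefix = "crawl-data/" + "CC-MAIN-2016-44/segments/1476988717783.68/warc/";
            ObjectListing listing = s3.listObjects(bucket, prefix);
            int queued = 0;
            while (true) {
                for (S3ObjectSummary obj : listing.getObjectSummaries()) {
                    if (obj.getKey().endsWith(".warc.gz")) { // dataSuffix filter
                        queued++; // the real framework sends the key to the SQS job queue here
                    }
                }
                if (!listing.isTruncated()) break;
                listing = s3.listNextBatchOfObjects(listing);
            }
            System.out.println("Queued " + queued + " objects for prefix " + prefix);
        }
    }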

Let us know if you have any further issues.

Best,
Anna

John Berrie · Apr 7, 2017, 10:21:01 AM · to Web Data Commons
Anna,
Thank you, that fixed the queue problem.  I have a couple of quick questions.
Do I need to rebuild (mvn package) every time I change the dpef.properties file?
When I run the monitor command, what do the status symbols mean?
Q: 567 (0), N: 0/1

Thanks,
John

Anna Primpeli · Apr 7, 2017, 10:59:40 AM · to Web Data Commons
Hello John,

yes, you would need to do so.
The monitor symbols have the following meaning:
Q: QueueSize (InflightSize*), N: runningInstances/requestedInstances

*A message is "in flight" if:

1. you have received it, and
2. the visibility timeout has not expired, and
3. you have not deleted it.

Source: http://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-visibility-timeout.html#sqs-inflight-messages
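
If you want to read those numbers yourself, they correspond to two standard SQS queue attributes. Here is a minimal sketch of querying them, assuming the AWS SDK for Java 1.x and the default jobQueueName = jobs from dpef.properties; this is not the framework's monitor code:

    import java.util.Map;
    import com.amazonaws.services.sqs.AmazonSQS;
    import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
    import com.amazonaws.services.sqs.model.GetQueueAttributesRequest;

    public class MonitorSketch {
        public static void main(String[] args) {
            AmazonSQS sqs = AmazonSQSClientBuilder.defaultClient();
            String queueUrl = sqs.getQueueUrl("jobs").getQueueUrl();
            Map<String, String> attrs = sqs.getQueueAttributes(
                    new GetQueueAttributesRequest(queueUrl).withAttributeNames(
                            "ApproximateNumberOfMessages",            // Q: waiting in the queue
                            "ApproximateNumberOfMessagesNotVisible")) // (n): in flight
                    .getAttributes();
            System.out.printf("Q: %s (%s)%n",
                    attrs.get("ApproximateNumberOfMessages"),
                    attrs.get("ApproximateNumberOfMessagesNotVisible"));
        }
    }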


Best,

Anna

John Berrie · Apr 7, 2017, 2:06:17 PM · to Web Data Commons
Anna,
I have not had any luck getting my queue to run.  After the start and monitor commands, the monitor status never changes.  I have given a price limit well above the spot price.  Below are the commands I have issued, followed by a copy of my dpef.properties file.

If you can see anything that is hindering the queue from being processed, please let me know.
I am running this as a test before building my extractor; I am not particularly wed to any of the data or parameters.  If you have a recently tested set of commands and dpef.properties file, I would be happy to try that.

v/r

John


Commands issued:

xxxxxx@xxxxxx:~/framework/trunk$ ./bin/master deploy --jarfile target/dpef-*.jar

Deploying JAR file at target/dpef-1.0.4.jar
 INFO 12:20:27 File target/dpef-1.0.4.jar now accessible at http://s3.amazonaws.com/LoreleiDeployBucket/pdef.jar (Master.java:902)

xxxxxx@xxxxxx:~/framework/trunk$ ./bin/master queue --bucket-prefix CC-MAIN-2016-44/segments/1476988717783.68/warc/ --file-number-limit 1000

 INFO 12:21:09 Queuing all keys from bucket commoncrawl with prefix CC-MAIN-2016-44/segments/1476988717783.68/warc/ (Master.java:1165)
 INFO 12:21:09 Setting limit of files to: 1000 (Master.java:1195)
 INFO 12:21:19 Queued 567 objects for prefix crawl-data/CC-MAIN-2016-44/segments/1476988717783.68/warc/ (Master.java:1243)
 INFO 12:21:19 Queued 567 objects for all given prefixes. (Master.java:1250)

xxxxxx@xxxxxx:~/framework/trunk$ ./bin/master start --worker-amount 1 --pricelimit 0.60

 INFO 13:23:17 Requesting 1 instances of type m3.medium with price limit of 0.6 US$ (Master.java:848)
 INFO 13:23:18 Request placed, now use 'monitor' to check how many instances are running. Use 'shutdown' to cancel the request and terminate the corresponding instances. (Master.java:879)
 done.

xxxxxx@xxxxxx:~/framework/trunk$ ./bin/master monitor

Monitoring job queue, extraction rate and running instances. AutoShutdown is: off

Q: 567 (0), N: 0/1
Q: 567 (0), N: 0/1

The monitor numbers never change; nothing seems to happen.

 

Below are contents of my dpef.properties file:

# AWS Access Properties
## Your AWS Access Key (Update with your key)
awsAccessKey = xxx
## Your AWS Secret Key (Update with your key)
awsSecretKey = xxx
## Name of the key pair you can use to access the instances (Update with your key)
ec2keypair = xxx

# AWS S3 Properties
## Your S3 bucket name for results (create with the s3cmd command line tool)
resultBucket = LoreleiResultBucket
## Your S3 bucket for the code to be deployed on the EC2 instances
deployBucket = LoreleiDeployBucket
## AWS S3 data bucket for public datasets (No need to change, unless you want to process data other than CC)
dataBucket = commoncrawl
## Common Crawl data bucket prefix (Change depending on the dataset you want to process)
dataPrefix = crawl-data
## Name of the jar of the WDC Framework, after uploading to S3 (No need to change)
deployFilename = pdef.jar

# AWS EC2 Properties
## Endpoint of the EC2 API (No need to change, unless you want to launch your instances within another region)
ec2endpoint = ec2.us-east-1.amazonaws.com
## AMI which will be launched (Make sure the AMI you select has e.g. the right system language, which can influence your reading and writing of files.)
ec2ami = ami-018c9568
## Please check the available instance descriptions for the right instance type for your process. (Make sure #CPU, #RAM and #DISC are enough for your job!)
## Pricing: https://aws.amazon.com/ec2/pricing/
## EC2 Instance Types: https://aws.amazon.com/ec2/instance-types/
ec2instancetype = m3.medium

# AWS SQS Properties
## Name of the SQS with AWS (No need to change, unless you are running another SQS with a similar name)
jobQueueName = jobs
## AWS Queue endpoint (No need to change)
queueEndpoint = https://queue.amazonaws.com/
## Data suffix for file processing and filtering (Change according to the files you want to put into the queue, e.g. .warc.gz, .arc.gz, ...)
dataSuffix = .warc.gz
## Batch size for filling the queue (No need to change)
batchsize = 10
## Time the SQS waits for a message - an object taken from the queue - to be returned, whether successfully processed or not (Change according to your average processing time of one file. Good results with 3x the average processing time)
jobTimeLimit = 900
## Number of times a message is retried before it is left out and an error is written to the SDB (No need to change, unless you know that some messages will cause errors and you cannot process them)
jobRetryLimit = 3

# AWS SimpleDB Properties
## Name of the SDB domain for data written per file (No need to change, unless you already have a SimpleDB with this name)
sdbdatadomain = data
## Name of the SDB domain for errors occurring while processing a file (No need to change, unless you already have a SimpleDB with this name)
sdberrordomain = failures
## In case one of the domains above has fewer than this number of entries, statistics will not be written (Change according to your preferences)
minResults = 5

# WDC Extraction Framework Properties
## The class you want to use to process your files. This class needs to implement org.webdatacommons.framework.processor.FileProcessor
processorClass = org.webdatacommons.structureddata.processor.WarcProcessor
## Memory which will be given to the Java process when executing the .jar on each machine (java -Xmx)
javamemory = 5G

# WDC Extraction Framework Processor Specific Properties
## Log regex failures (structured data extraction)
logRegexFailures = false

# WDC WebTables Specific Properties
## Extraction of top n terms
extractTopNTerms = true
## In case you want mh, use mh; otherwise basic is used
extractionAlgorithm = org.webdatacommons.webtables.extraction.BasicExtraction
## Selected model for phase 1
phase1ModelPath = /RandomForest_P1.mdl
## Selected model for phase 2
phase2ModelPath = /RandomForest_P2.md

Anna Primpeli · Apr 10, 2017, 3:40:34 AM · to Web Data Commons
Hello John,

the logging information seems normal: your queue is filled with messages and your instance is active. The problem probably occurs in the startup script.
It would be helpful to monitor your workers closely by connecting to your instance and checking whether any exceptions occur.
You can find more information about that here: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/putty.html

Let me know if this solves your issue.

Best,
Anna



John Berrie · Apr 10, 2017, 3:14:56 PM · to Web Data Commons
Anna,
I have changed the AMI, the instance type, and the spot price, and now the monitor shows the instance as running.
Q: 567 (0), N: 1/1

My console gives a status of "fulfilled" for the instance.

I logged into the instance through PuTTY SSH and looked at all of the /var/log files.  The messages file has some information, but I did not see any exceptions, except possibly:

udevadm[1870]: --type failed is deprecated will be removed from a future udev release.

It has been running for three hours with no changes to status.

Am I looking in the wrong place for the worker events?

v/r

John

Anna Primpeli · Apr 11, 2017, 4:02:26 AM · to Web Data Commons
Hello John,

It seems like your instance is running, but the files in the queue are not being processed by the workers.
What about the info from the Amazon monitoring console, i.e. CPU usage, disk reads, and status check failures?

I would propose the following:
1. Check whether your worker runs smoothly locally, without using EC2 instances.
2. Change your startup script so that it writes errors to a file. You can find the script inside the Master class (org.webdatacommons.framework.cli) and change the java invocation to something like -jar /tmp/start.jar > /tmp/start.log 2> /tmp/start_errors.log so that your errors are written to a file.
In addition, make sure your instances are running Java 8; otherwise you need to install it. For this you can use the first script defined in the class.

I hope this helps.

Best,
Anna

John Berrie · Apr 11, 2017, 9:16:09 AM · to Web Data Commons
Anna,
The status checks for system and instance reachability pass.  Yesterday I was seeing some CPU usage and disk reads, which then decreased to 0.  Today, after 45 minutes, I am seeing no activity on CPU usage, disk reads, or other metrics.

How do I run the worker locally?

I added the /tmp/start_errors.log redirection.  The file now appears in the /tmp directory along with start.log.  After 45 minutes both files are empty.

v/r

John

Anna Primpeli · Apr 11, 2017, 9:48:01 AM · to Web Data Commons
Hello John,

to run the extraction locally (no EC2 instances involved), you need to send some files to your SQS and then run the Worker class (package: org.webdatacommons.framework). In that case you may want to set threadLimit = 1 so that only one core is used.
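
Assuming the jar you deployed earlier and that the Worker class has a main method, the local invocation would look roughly like this (hypothetical; adjust the fully qualified class name to whatever your build actually contains):

    java -cp target/dpef-1.0.4.jar org.webdatacommons.framework.Worker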

Best,
Anna

John Berrie · Apr 11, 2017, 2:20:35 PM · to Web Data Commons
Anna,
Can you send me a recently tested dpef.properties file and the CLI commands used with it?  I don't care if it is WARC or WET.

v/r

John

Anna Primpeli · Apr 12, 2017, 3:37:08 AM · to web-data...@googlegroups.com
Hello John,

the properties file you can find in the repo is a workable one, as long as you fill in the correct data for your queue, EC2 instances, etc.
Make sure the dataSuffix field matches the suffix of your files.
If you still have problems, please run the local test as proposed.

Best,
Anna


John Berrie · Apr 13, 2017, 9:58:54 AM · to Web Data Commons
Anna,
I changed AMIs and I am now getting CPU usage on the monitors.  After two hours I have yet to see any disk reads or writes.

Q: 559 (8), N: 4/4

Is there a nuance in the AMI pertaining to storage reads and writes, or should I let it run longer?

v/r

John

John Berrie · Apr 13, 2017, 11:09:36 PM · to Web Data Commons
Anna,
I ran the processor for five hours, seeing CPU activity, and then shut it down.  I was able to pull files to my destination directory.  My issue was with the AMI; once I changed it, I was able to get output.  Thank you for your help.  I was close to abandoning the effort.

My goal is to process WET files to find specific topics and copy those files back to my data bucket.  Is there a processor that I could build off of to do that?

v/r

John

Anna Primpeli · Apr 24, 2017, 7:19:50 AM · to Web Data Commons
Hello John,

sorry for the late answer.
I am glad to know you managed it!

I am not sure how you define the topics you want to find, but I would suggest you have a look at the WarcProcessor class (org.webdatacommons.structureddata.processor package). You could implement a method to filter your extraction result based on your needs.
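
Since WET files are gzipped plain-text conversion records, the heart of such a processor is streaming the text and matching your topic terms. The exact FileProcessor method signature is not quoted in this thread, so the sketch below, with hypothetical topic terms, only shows the filtering core you would wrap inside your own processorClass; a real processor would copy matching files to your result bucket instead of just printing:

    import java.io.BufferedReader;
    import java.io.InputStream;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;
    import java.util.zip.GZIPInputStream;

    public class WetTopicFilterSketch {
        // Hypothetical topic terms; replace with your own.
        private static final Set<String> TOPICS =
                new HashSet<>(Arrays.asList("solar", "photovoltaic"));

        public static void main(String[] args) throws Exception {
            boolean match = false;
            // A WET file is a gzipped series of WARC 'conversion' records holding plain text.
            try (InputStream in = new GZIPInputStream(Files.newInputStream(Paths.get(args[0])));
                 BufferedReader reader = new BufferedReader(
                         new InputStreamReader(in, StandardCharsets.UTF_8))) {
                String line;
                while (!match && (line = reader.readLine()) != null) {
                    String lower = line.toLowerCase();
                    for (String topic : TOPICS) {
                        if (lower.contains(topic)) { match = true; break; }
                    }
                }
            }
            System.out.println(args[0] + (match ? " matches" : " does not match"));
        }
    }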