Hi Basil,
FYI, we’re developing a system in Spark that uses Spark SQL to query the Common Crawl index. The results are then mapped to a task that performs an HTTP range request to retrieve the WARC record from S3 and parses the actual text with the jwarc library. This is all in Scala.
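[Editor's note: the same pipeline can be sketched in PySpark. Below is a minimal, hypothetical version that swaps boto3 and warcio in for the Scala/jwarc stack described above. The column names come from the public cc-index table; the crawl ID and domain are placeholders, and AWS credentials are assumed to be configured.]

import boto3
from io import BytesIO
from warcio.archiveiterator import ArchiveIterator
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cc-index-fetch").getOrCreate()

# The columnar URL index, the same Parquet table Athena queries.
df = spark.read.parquet("s3a://commoncrawl/cc-index/table/cc-main/warc/")
df.createOrReplaceTempView("ccindex")

rows = spark.sql("""
    SELECT warc_filename, warc_record_offset, warc_record_length
    FROM ccindex
    WHERE crawl = 'CC-MAIN-2020-24' AND subset = 'warc'
      AND url_host_registered_domain = 'example.com'
""")

def fetch_record(row):
    # One HTTP range request per record; a real job would reuse one
    # client per partition (mapPartitions) rather than per row.
    s3 = boto3.client("s3")
    start = row.warc_record_offset
    end = start + row.warc_record_length - 1
    body = s3.get_object(Bucket="commoncrawl", Key=row.warc_filename,
                         Range="bytes=%d-%d" % (start, end))["Body"].read()
    # Each slice is a self-contained gzipped WARC record.
    for record in ArchiveIterator(BytesIO(body)):
        return record.content_stream().read()

texts = rows.rdd.map(fetch_record)
print(texts.take(1))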
>> How exactly is the sparkcc.py file listed in step 4 used in the Spark deployment?
> The easiest way is to add it directly via `--py-files`:
$SPARK_HOME/bin/spark-submit \
--conf ... \
... (other Spark options) \
--py-files sparkcc.py \
script.py \
... (script-specific options)
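[Editor's note: script.py stands for one of the job scripts in the cc-pyspark repo (server_count.py, for example); each of them imports the CCSparkJob base class from sparkcc.py, which is why that file has to be shipped to the executors. A rough, hypothetical outline of such a script follows; method names mirror cc-pyspark, but check the repo for the real signatures.]

# Hypothetical outline of a cc-pyspark job script; the real base
# class lives in sparkcc.py in the cc-pyspark repository.
from sparkcc import CCSparkJob

class ResponseCountJob(CCSparkJob):
    """Count WARC response records (illustrative only)."""
    name = "ResponseCount"

    def process_record(self, record):
        # Called once per WARC record; yielded (key, count) pairs
        # are reduced by the base class and written as a table.
        if record.rec_type == 'response':
            yield ('response_records', 1)

if __name__ == '__main__':
    job = ResponseCountJob()
    job.run()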
Is anything else needed in order to run a Spark job with this file? What is the data source for this Spark job?
>> These readings also suggest that the crawl archives can be accessed using Spark in a similar way to how Athena runs SQL queries on the crawl data in Parquet format.
> The columnar index can be queried using SQL from Athena, Spark, or Hive.
> It's not possible to run SQL queries directly on the WARC files.
How would I load s3://commoncrawl/cc-index/table/cc-main/warc/ in a Spark shell?
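[Editor's note: a minimal sketch of doing exactly that from a PySpark shell, assuming S3A access is configured; url_host_tld, crawl, and subset are real columns of the cc-index table, and the crawl ID is just an example.]

df = spark.read.parquet("s3a://commoncrawl/cc-index/table/cc-main/warc/")
df.createOrReplaceTempView("ccindex")
spark.sql("""
    SELECT url_host_tld, COUNT(*) AS n
    FROM ccindex
    WHERE crawl = 'CC-MAIN-2020-24' AND subset = 'warc'
    GROUP BY url_host_tld
    ORDER BY n DESC
""").show()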
Regards,
Basil L
Sebastian,
Sorry, it’s not currently open source, but I’d like it to be if my employer allows. We’re also investigating a Google Cloud Dataflow implementation, but we’re not sure about all the S3/AWS egress costs. I’ll keep the group posted.
From: <common...@googlegroups.com> on behalf of Sebastian Nagel <seba...@commoncrawl.org>
Reply-To: "common...@googlegroups.com" <common...@googlegroups.com>
Date: Tuesday, May 26, 2020 at 12:11 PM
To: "common...@googlegroups.com" <common...@googlegroups.com>
To view this discussion on the web visit https://groups.google.com/d/msgid/common-crawl/7a58f263-5cb3-39a7-61dc-ce04e39ad2de%40commoncrawl.org.
aws emr add-steps --cluster-id j-2AXXXXXXGAPLF \
  --steps Type=Spark,Name="Spark Program",ActionOnFailure=CONTINUE,Args=[--class,org.apache.spark.examples.SparkPi,/usr/lib/spark/examples/jars/spark-examples.jar,10]
What is the "spark-examples.jar" file that I need here and how do I generate this .jar file?
On the cc-pyspark GitHub page, it says I can submit a job with this command. Do I just include the server_count.py file with the job?
$SPARK_HOME/bin/spark-submit ./server_count.py --help
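[Editor's note: if I read the cc-pyspark README correctly, a complete local invocation passes an input file listing WARC paths plus a name for the output table (the paths below are illustrative); on a cluster you would add --py-files sparkcc.py as shown earlier in the thread.]

$SPARK_HOME/bin/spark-submit ./server_count.py \
  --num_output_partitions 1 --log_level WARN \
  ./input/test_warc.txt servernames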
Under "Steps" it says that the spark application failed.
I am also having
trouble ssh'ing into the cluster. I am figuring out how to overcome
that.
What is the next step? How do I run a spark job on the Common Crawl data?