Hi Alex,
> are you interested in a git pull?
Yes. Why not?
> A little suggestion for improving the document regarding fastwarc:
> job warcio python file
But for some jobs you need to pass more files because of an inheritance
chain / tree. A zip file might be easier, because some jobs inherit from
further classes. See
https://spark.apache.org/docs/latest/submitting-applications.html#advanced-dependency-management
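For example, something along these lines (the zip name and the master URL
are placeholders, not part of the repository):

```shell
# Bundle the shared modules the job imports into a single zip
zip cc-pyspark-deps.zip sparkcc.py sparkcc_fastwarc.py

# Ship the zip to all executors via --py-files
$SPARK_HOME/bin/spark-submit \
    --master spark://your-spark-master:7077 \
    --py-files cc-pyspark-deps.zip \
    ./server_count_fastwarc.py \
    --num_output_partitions 1 \
    ./input/test_warc.txt servernames
```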
But agreed, there should be an example, either in the FastWARC section or
in
https://github.com/commoncrawl/cc-pyspark#running-in-spark-cluster-over-large-amounts-of-data
Feel free to add an example in the PR.
Best,
Sebastian
On 5/28/23 21:41, 'AlexAR' via Common Crawl wrote:
> Hi Sebastian,
>
> Thanks for the accurate answer, finally I was very lucky:
> server_ip_address.py
> <https://github.com/commoncrawl/cc-pyspark/blob/main/server_ip_address.py>
> already exists!
> I wrote the fastwarc server_ip_address_fastwarc.py
> <https://github.com/commoncrawl/cc-pyspark/blob/main/server_ip_address.py>
> in the same way as server_count_fastwarc.py
> <https://github.com/commoncrawl/cc-pyspark/blob/main/server_count_fastwarc.py>,
> are you interested in a git pull?
>
> A little suggestion for improving the document regarding fastwarc:
> when running the job in a Spark cluster, sparkcc_fastwarc.py and job
> warcio python file must be passed via --py-files in addition to
> sparkcc.py [...] example below:
> $SPARK_HOME/bin/spark-submit --deploy-mode client --master
> spark://spark_cluster_to_be_replaced.:7077 --py-files
> sparkcc.py,server_count.py,sparkcc_fastwarc.py
> ./server_count_fastwarc.py --num_output_partitions 1 --log_level WARN
> ./input/test_warc.txt servernames
>
> Cheers,
> Alex
> On Saturday, May 27, 2023 at 19:57:49 UTC+2, Sebastian Nagel wrote:
>
> Hi Alex,
>
> > Sadly, IP address isn't indexed in ccindex. Bad luck, Athena would
> > have been appropriate for that.
>
> Agreed.
>
> > Rebuilding the index on my own with WARC-IP-Address value seems a
> bit
> > overkill even if the robotstxt crawl is way smaller.
>
> This is indeed the most efficient way to go. It's comparably small,
> 150 GiB per crawl dataset. There's already code for the task:
> https://github.com/commoncrawl/cc-pyspark/blob/main/server_ip_address.py
>
> > Then I'm wondering what would be the best approach that won't take
> > more than 1 or 2 days to do so?
>
> Definitely not. Even with a single-node Spark setup (4-8 vCPUs, 16 GiB
> RAM) this should take less than a day, assuming it's running in the AWS
> region us-east-1 so that data transfer times are negligible.
>
> Note: you might want to try FastWARC to speed up the processing, see
> https://github.com/commoncrawl/cc-pyspark/#using-fastwarc-to-parse-warc-files