Best approach to get IP address and hostname from robotstxt


AlexAR

May 27, 2023, 1:27:19 PM
to Common Crawl
Hi there,

I'm trying to extract the IP addresses resolved from hostnames. Sadly, the IP address isn't indexed in ccindex; bad luck, as Athena would have been a good fit for that.
So I'm wondering what the best approach would be that won't take more than one or two days.

Rebuilding the index on my own from the WARC-IP-Address header seems a bit of an overkill, even though the robots.txt crawl is much smaller.

Cheers,
Alex

Sebastian Nagel

May 27, 2023, 1:57:49 PM
to common...@googlegroups.com
Hi Alex,

> Sadly, IP address isn't indexed in ccindex. Bad luck, Athena would
> have been appropriate for that.

Agreed.

> Rebuilding the index on my own with WARC-IP-Address value seems a bit
> overkill even if the robotstxt crawl is way smaller.

This is indeed the most efficient way to go. The dataset is comparably
small, about 150 GiB per crawl. There's already code for the task:
https://github.com/commoncrawl/cc-pyspark/blob/main/server_ip_address.py
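For reference, the core of that job is pairing each record's WARC-Target-URI hostname with its WARC-IP-Address header. A minimal, hand-rolled sketch of just that step (the real job uses a proper WARC parser and Spark; the header parsing here is simplified for illustration):

```python
# Sketch: pull the hostname-to-IP mapping out of a WARC record's header
# block. server_ip_address.py does this with warcio/FastWARC at scale;
# this simplified parser only shows which headers carry the data.
from urllib.parse import urlsplit

def parse_warc_headers(header_block: str) -> dict:
    """Parse the 'Name: value' lines of a WARC record header into a dict."""
    headers = {}
    for line in header_block.splitlines():
        if ":" in line and not line.startswith("WARC/"):
            name, _, value = line.partition(":")
            headers[name.strip()] = value.strip()
    return headers

def host_ip_pair(headers: dict):
    """Return (hostname, IP) for a record, or None if either is missing."""
    uri = headers.get("WARC-Target-URI")
    ip = headers.get("WARC-IP-Address")
    if uri and ip:
        return urlsplit(uri).hostname, ip
    return None

# Illustrative record header (values are made up):
example_record = """WARC/1.0
WARC-Type: response
WARC-Target-URI: https://example.com/robots.txt
WARC-IP-Address: 93.184.216.34
Content-Length: 0"""

print(host_ip_pair(parse_warc_headers(example_record)))
# -> ('example.com', '93.184.216.34')
```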

> Then I'm wondering what would be the best approach that won't take
> more than 1 or 2 days to do so?

Definitely not. Even with a single-node Spark setup (4-8 vCPUs, 16 GiB
RAM) this should take less than a day, assuming it runs in the AWS
region us-east-1, so that data transfer times are negligible.

Note: you might want to try FastWARC to speed up the processing, see

https://github.com/commoncrawl/cc-pyspark/#using-fastwarc-to-parse-warc-files

Best,
Sebastian

AlexAR

May 28, 2023, 3:41:58 PM
to Common Crawl
Hi Sebastian,

Thanks for the precise answer; luckily for me, server_ip_address.py already exists!
I wrote a FastWARC variant, server_ip_address_fastwarc.py, in the same way as server_count_fastwarc.py, are you interested in a git pull?

A little suggestion for improving the documentation regarding FastWARC:
when running a job on a Spark cluster, sparkcc_fastwarc.py and the job's warcio-based Python file must be passed via --py-files in addition to sparkcc.py [...] example below:
$SPARK_HOME/bin/spark-submit \
    --deploy-mode client \
    --master spark://spark_cluster_to_be_replaced.:7077 \
    --py-files sparkcc.py,server_count.py,sparkcc_fastwarc.py \
    ./server_count_fastwarc.py \
    --num_output_partitions 1 --log_level WARN \
    ./input/test_warc.txt servernames

Cheers,
Alex

Sebastian Nagel

May 29, 2023, 8:36:47 AM
to common...@googlegroups.com
Hi Alex,

> are you interested in a git pull?

Yes. Why not?

> A little suggestion for improving the document regarding fastwarc:

> job warcio python file

But for some jobs you need to pass more files, because of the
inheritance chain / tree: some jobs inherit from further classes.
A zip file might be easier. See

https://spark.apache.org/docs/latest/submitting-applications.html#advanced-dependency-management
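A sketch of that zip-based approach, assuming the module names from this thread. Stand-in empty files are created here only so the snippet runs on its own; in practice you would zip the real cc-pyspark files and pass the archive once via --py-files instead of listing each .py file:

```python
# Bundle the cc-pyspark modules a FastWARC job inherits from into one
# zip archive suitable for spark-submit --py-files.
import os
import tempfile
import zipfile

# Module names taken from the thread's example command.
modules = ["sparkcc.py", "sparkcc_fastwarc.py", "server_count.py"]

workdir = tempfile.mkdtemp()
for name in modules:
    # Empty stand-in files so this sketch is self-contained;
    # replace with the real repository files.
    open(os.path.join(workdir, name), "w").close()

archive = os.path.join(workdir, "cc-pyspark-deps.zip")
with zipfile.ZipFile(archive, "w") as bundle:
    for name in modules:
        # Store each module at the archive root so `import sparkcc`
        # etc. still resolve on the Spark workers.
        bundle.write(os.path.join(workdir, name), arcname=name)

print(sorted(zipfile.ZipFile(archive).namelist()))
# -> ['server_count.py', 'sparkcc.py', 'sparkcc_fastwarc.py']
```

The job is then submitted with `--py-files cc-pyspark-deps.zip` instead of the comma-separated list of individual files.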

But agreed there should be an example, either in the FastWARC section or
in
https://github.com/commoncrawl/cc-pyspark#running-in-spark-cluster-over-large-amounts-of-data

Feel free to add an example in the PR.

Best,
Sebastian
