How to download specific fields of tabular/columnar index


Krishna

Nov 18, 2018, 7:04:22 AM
to Common Crawl
Folks,

I was excited to see the tabular/columnar index fields. However, I'm unable to figure out how to download just the couple of columns I care about, such as "url". I tried to list the parquet files:

aws --no-sign-request s3 ls --recursive s3://commoncrawl/cc-index/table/cc-main/warc/crawl=CC-MAIN-2018-43

The above command only showed the partitions subset=warc|robotstxt|crawldiagnostics. I downloaded one parquet file from the warc subset, and it has all the columns.

My question: is there a dump that only exports individual columns such as url? Or do I have to get all the warc subsets?

Thanks,
Krishna

Sebastian Nagel

Nov 19, 2018, 7:51:47 AM
to common...@googlegroups.com
Hi Krishna,

the columnar index indeed allows you to access columns separately. However, you need a reader
that understands the Parquet file format [1] and takes advantage of what it provides, which
includes quick filtering on columns, selecting column chunks based on filters on other
columns, etc. These optimizations are used by Spark, Athena/Presto, Hive and other big data and
NoSQL tools.

In any case, you can read the Parquet files directly on S3, but you need the right tools.
Access to S3 also works remotely from outside the AWS cloud. However, as you may guess, you
gain a lot of speed if the data is read where it lives, i.e. in the AWS us-east-1 region.

Here is just one way to go. I think there are more, e.g. Python + s3fs + pandas + pyarrow, but I
don't have the time right now to try them; a rough sketch of that route is included right below,
and the Hadoop-based steps follow after it:
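
A minimal, untested sketch of that Python route, assuming s3fs and pyarrow support anonymous
access and column-selective reads (the part file is the same robotstxt file used in step 4 below):

# untested sketch: read only the "url" column of one part file directly from S3
import s3fs
import pyarrow.parquet as pq

fs = s3fs.S3FileSystem(anon=True)  # s3://commoncrawl/ is a public bucket
path = ("commoncrawl/cc-index/table/cc-main/warc/crawl=CC-MAIN-2018-43/"
        "subset=robotstxt/part-00007-f47c372a-e3d4-4f2b-b7a0-a939c04fd01e.c000.gz.parquet")

with fs.open(path, "rb") as f:
    # pyarrow reads the footer first, then fetches only the "url" column chunks
    table = pq.read_table(f, columns=["url"])

urls = table.to_pandas()["url"]
print(urls.head())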

1. download and install a recent Hadoop [2] (I've tried 2.8.4 but 2.8.5 or newer should
work as well)

2. read about the S3A file system [3] and how to configure it. I've chosen to set the AWS
credentials as environment variables. You may use the
org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider
if you do not have an AWS account, or simply as the default for reading from s3://commoncrawl/:
<property>
  <name>fs.s3a.bucket.commoncrawl.aws.credentials.provider</name>
  <value>org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider</value>
</property>
NOTE: the parquet-tools command used below does not read the Hadoop config
(parquet-cli does but has no command to dump columns)

3. clone and compile the Java Parquet library and tools; see the instructions at [4]

4. parquet-tools allows you to dump a single column:

cd .../parquet-mr/parquet-tools/

# setup classpath to contain all required Hadoop libs + parquet-tools jar
export CLASSPATH="/opt/hadoop/2.8.4/etc/hadoop:/opt/hadoop/2.8.4/share/hadoop/common/lib/*:/opt/hadoop/2.8.4/share/hadoop/common/*:/opt/hadoop/2.8.4/share/hadoop/hdfs:/opt/hadoop/2.8.4/share/hadoop/hdfs/lib/*:/opt/hadoop/2.8.4/share/hadoop/hdfs/*:/opt/hadoop/2.8.4/share/hadoop/yarn/lib/*:/opt/hadoop/2.8.4/share/hadoop/yarn/*:/opt/hadoop/2.8.4/share/hadoop/mapreduce/lib/*:/opt/hadoop/2.8.4/share/hadoop/mapreduce/*:/opt/hadoop/2.8.4/contrib/capacity-scheduler/*.jar:/opt/hadoop/2.8.4/share/hadoop/tools/lib/*:$PWD/target/parquet-tools-1.10.1-SNAPSHOT.jar"

# let the "dump" command print the content of the "url" column
java org.apache.parquet.tools.Main dump -c url \
  s3a://commoncrawl/cc-index/table/cc-main/warc/crawl=CC-MAIN-2018-43/subset=robotstxt/part-00007-f47c372a-e3d4-4f2b-b7a0-a939c04fd01e.c000.gz.parquet \
  > urls.dump

The dump contains the URLs and some information about the data format:

BINARY url
--------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 554664 ***
value 1: R:0 D:0 V:https://cinehype.com.br/robots.txt
value 2: R:0 D:0 V:http://alineriscadooficial.com.br/robots.txt
...

As said, that's one way to go. It limits the amount of downloaded data to only 20-30% of the entire
files: roughly 20% for the URL column plus some overhead for the Parquet metadata and column/offset
indexes. If you need more than a plain list of URLs, or you want to filter the URLs anyway, I
recommend using Spark, Athena, Hive, etc.
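
For instance, a minimal PySpark sketch (untested; it assumes Spark has the hadoop-aws/S3A connector
on its classpath and anonymous credentials configured as in step 2) to pull only the URLs of the
warc subset:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cc-index-urls").getOrCreate()

# Spark discovers the subset=... partitions; only the "url" column is read from S3
df = spark.read.parquet(
    "s3a://commoncrawl/cc-index/table/cc-main/warc/crawl=CC-MAIN-2018-43/")

df.filter(df.subset == "warc").select("url") \
  .write.text("urls-CC-MAIN-2018-43")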

Best,
Sebastian

[1] http://parquet.apache.org/documentation/latest/
[2] https://hadoop.apache.org/releases.html
[3] https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html#S3A
[4] https://github.com/apache/parquet-mr


Sebastian Nagel

Nov 19, 2018, 10:52:53 AM
to Common Crawl
Hi Krishna,

correction: parquet-tools does read the Hadoop config after all; you only need to add
<property>
  <name>fs.s3a.bucket.commoncrawl.aws.credentials.provider</name>
  <value>org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider</value>
</property>
to $HADOOP_HOME/etc/hadoop/core-site.xml

Best,
Sebastian

Krishna

Nov 19, 2018, 3:01:56 PM
to Common Crawl
Hi Sebastian,

Thank you so much for the detailed reply. After I posted the question, I figured I might as well get all 300 parquet partitions and party on them. I was able to use parquet-dotnet and rip through all the data on my workstation.

I have also checked out the parquet-tools project, part of parquet-mr, but was unable to build it on my Ubuntu 18.04 setup. It looks like a known problem, so nothing for you to do; I just thought I would share this in case someone struggles with the same issue.
