Hi Krishna,
the columnar index indeed allows you to access columns separately. However, you need a reader
that understands the Parquet file format [1] and takes advantage of what it provides, which
also includes quick filtering on columns, selecting column chunks based on filters on other
columns, etc. These optimizations are used by Spark, Athena/Presto, Hive and other big data and
NoSQL tools.
Anyway, you can even read the Parquet files directly on S3, but you need the right tools.
Access to S3 also works remotely from outside the AWS cloud. However, as you may guess, you
gain a lot of speed if the data is read where it lives, i.e. in the AWS us-east-1 region.
Here is just one way to go (I think there are more, e.g. python + s3fs + pandas + pyarrow,
sketched below after the dump output, but I don't have the time right now to try them):
1. download and install a recent Hadoop [2] (I've tried 2.8.4 but 2.8.5 or newer should
work as well)
2. read about the S3A file system [3] and how to configure it. I've chosen to set the AWS
credentials as environment variables (AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY). You may use
the org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider if you do not have an AWS account,
or just make it the default for reading from s3://commoncrawl/:
<property>
  <name>fs.s3a.bucket.commoncrawl.aws.credentials.provider</name>
  <value>org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider</value>
</property>
NOTE: the parquet-tools command used below does not read the Hadoop config
(parquet-cli does but has no command to dump columns)
3. clone and compile the Java Parquet library and tools, see the instructions at [4]
4. the parquet-tools allow you to dump a single column:
cd .../parquet-mr/parquet-tools/
# setup classpath to contain all required Hadoop libs + parquet-tools jar
export CLASSPATH="/opt/hadoop/2.8.4/etc/hadoop:/opt/hadoop/2.8.4/share/hadoop/common/lib/*:/opt/hadoop/2.8.4/share/hadoop/common/*:/opt/hadoop/2.8.4/share/hadoop/hdfs:/opt/hadoop/2.8.4/share/hadoop/hdfs/lib/*:/opt/hadoop/2.8.4/share/hadoop/hdfs/*:/opt/hadoop/2.8.4/share/hadoop/yarn/lib/*:/opt/hadoop/2.8.4/share/hadoop/yarn/*:/opt/hadoop/2.8.4/share/hadoop/mapreduce/lib/*:/opt/hadoop/2.8.4/share/hadoop/mapreduce/*:/opt/hadoop/2.8.4/contrib/capacity-scheduler/*.jar:/opt/hadoop/2.8.4/share/hadoop/tools/lib/*:$PWD/target/parquet-tools-1.10.1-SNAPSHOT.jar"
# let the "dump" command print the content of the "url" column
java org.apache.parquet.tools.Main dump -c url \
  s3a://commoncrawl/cc-index/table/cc-main/warc/crawl=CC-MAIN-2018-43/subset=robotstxt/part-00007-f47c372a-e3d4-4f2b-b7a0-a939c04fd01e.c000.gz.parquet \
  > urls.dump
The dump contains the URLs and some information about the data format:
BINARY url
--------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 554664 ***
value 1: R:0 D:0 V:https://cinehype.com.br/robots.txt
value 2: R:0 D:0 V:http://alineriscadooficial.com.br/robots.txt
...
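For comparison, the python + s3fs + pandas + pyarrow route mentioned above might look roughly
like the sketch below. As said, I haven't tried it, so take it as an untested starting point
(it assumes recent versions of s3fs and pyarrow and reads the same robotstxt part as the dump
example):

import s3fs
import pyarrow.parquet as pq

# anonymous (unsigned) access to the public s3://commoncrawl/ bucket
fs = s3fs.S3FileSystem(anon=True)
path = ('commoncrawl/cc-index/table/cc-main/warc/'
        'crawl=CC-MAIN-2018-43/subset=robotstxt/'
        'part-00007-f47c372a-e3d4-4f2b-b7a0-a939c04fd01e.c000.gz.parquet')

with fs.open(path, 'rb') as f:
    # read only the "url" column chunks, not the entire file
    table = pq.read_table(f, columns=['url'])

urls = table.to_pandas()['url']
print(urls.head())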
Whichever way you choose, reading only the URL column limits the amount of downloaded data to
20-30% of the entire files: roughly 20% for the URL column itself plus some overhead for the
Parquet metadata and the column/offset indexes. If you need more than a plain list of URLs, or
if you want to filter the URLs anyway, I recommend using one of Spark, Athena, Hive, etc.
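With PySpark, for example, such a filtered extraction could look roughly like the following
(again an untested sketch: it assumes a Spark installation with the hadoop-aws / S3A libraries
on the classpath and the credentials configured as described above; "urls-robotstxt" is just an
arbitrary output directory):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cc-index-urls").getOrCreate()

# crawl and subset are partition columns of the table, so these filters
# restrict the read to the matching Parquet files, and the select()
# makes Spark fetch only the "url" column chunks from those files
df = spark.read.parquet("s3a://commoncrawl/cc-index/table/cc-main/warc/")
urls = (df.filter(df.crawl == "CC-MAIN-2018-43")
          .filter(df.subset == "robotstxt")
          .select("url"))
urls.write.csv("urls-robotstxt")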
Best,
Sebastian
[1] http://parquet.apache.org/documentation/latest/
[2] https://hadoop.apache.org/releases.html
[3] https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html#S3A
[4] https://github.com/apache/parquet-mr
On 11/18/18 1:04 PM, Krishna wrote:
> Folks,
>
> I was excited to see the tabular/columnar fields
> <https://groups.google.com/forum/#!forum/common-crawl>. However, unable to figure out how to