Hi,
thanks for the really nice example! I'll add it to our list of examples.
DuckDB is a great tool; I've used it a couple of times to inspect and query
Parquet files on local disk. It's good to know that it can also be used
for data on S3.
> > so presumably it's a full read of that one column.
>
> Yes. It is a full read of the column.
That's not necessarily the case. The column `content_languages`
should be backed by a dictionary in almost all row groups.
The Parquet writer falls back from dictionary to plain encoding
only if the number of unique combinations of content languages
gets too high to be stored efficiently using a dictionary.
If there is a dictionary, only the dictionary needs to be read to
determine whether the row group can contain rows with Telugu content.
If it cannot, the row group is skipped entirely.
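You can verify this with DuckDB itself: its parquet_metadata() table
function lists the encodings used per column chunk. A rough sketch on a
local copy of one index file (the file name is just a placeholder):

```sql
-- Show which encodings back the content_languages column chunks
-- ('part-00000.parquet' is a placeholder file name).
SELECT row_group_id, path_in_schema, encodings
FROM parquet_metadata('part-00000.parquet')
WHERE path_in_schema = 'content_languages';
-- A dictionary encoding in the result (e.g. PLAIN_DICTIONARY or
-- RLE_DICTIONARY) means the column chunk is backed by a dictionary.
```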
To speed up the query I would avoid extracting all columns into
the result. Instead of
select * from ...
list only the columns needed for further processing steps, e.g.
select url, content_languages,
warc_filename, warc_record_offset, warc_record_length
from ...
If possible, avoid columns which make up a significant part of the
storage; see the column sizes listed in
http://data.commoncrawl.org/cc-index/table/cc-main/index.html
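Putting it together, a projected query over the columnar index on S3
could look roughly like the sketch below. The S3 path (crawl and subset
partitions) and the predicate on content_languages are illustrative
assumptions, not your exact query; DuckDB needs the httpfs extension
loaded to read from S3.

```sql
SELECT url, content_languages,
       warc_filename, warc_record_offset, warc_record_length
FROM read_parquet(
  -- example partition; substitute the crawl you are interested in
  's3://commoncrawl/cc-index/table/cc-main/warc/crawl=CC-MAIN-2023-50/subset=warc/*.parquet')
-- 'tel' is the ISO-639-3 code for Telugu; content_languages holds a
-- comma-separated list of codes, hence the substring match
WHERE content_languages LIKE '%tel%';
```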
Best,
Sebastian