Hi Ala,
please start a new thread for a new question. Thanks!
Could you also share the column layout of "tableone" and
how you loaded the WET files into the table? WET isn't a format
supported by Athena. I also doubt that Athena is the right
tool to produce a word count over billions or trillions of words.
You might have a look at
https://github.com/commoncrawl/cc-pyspark
https://github.com/commoncrawl/cc-warc-examples
https://github.com/commoncrawl/cc-mrjob
which include a word count job on WET files using MapReduce or Spark.
Best,
Sebastian
On 6/8/20 4:27 PM, Ala Anvari wrote:
> Hi all,
>
> I'm trying to get an athena query to word count the segments in a wet folder. I'm pretty sure my 'database' in Athena is pointing at the
> right folder. Completely new to presto.
>
> This is what my SQL looks like - what have I done wrong?
>
> select (unnest( split("line", ' ') ) ) as words, COUNT from "default"."tableone" GROUP BY(words)
>
> On Wed, Apr 15, 2020 at 9:40 AM Sebastian Nagel <seba...@commoncrawl.org> wrote:
>
> Hi all,
>
> the crawl archives of March/April 2020 are now available. The crawl was run from March 28
> to April 10. It covers 2.85 billion web pages or 280 TiB of uncompressed content. As usual,
> more details about the crawl and information how to access and use the data can be found
> on the Common Crawl blog [1].
>
> Please note that we will also merge the next two monthly crawls as a joint May/June crawl
> which is planned to start in the last week of May and to be released between June 10 and 15.
>
> Best,
> Sebastian
>
> [1] https://commoncrawl.org/2020/04/march-april-2020-crawl-archive-now-available/
>
> --
> You received this message because you are subscribed to the Google Groups "Common Crawl" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to common-crawl...@googlegroups.com.