A null somewhere causes an error with Apache Drill SQL query

62 views
Skip to first unread message

Jason Boxman

unread,
Mar 5, 2024, 11:58:55 PMMar 5
to Common Crawl
Hi,

I've run this query successfully on different sets of Parquet index files, but I'm coming up short on this. Sadly Apache Drill has limited error output, so it only tells me there's an internal error because of a null value. Are there possible nulls in these columns that somehow I never hit before, but I'm running into now? I'm not sure how to workaround this? I succeeded with 2023-23, 2023-40, and 2023-50.

```
CREATE TABLE dfs.tmp.com_tld AS
SELECT url_host_registered_domain, warc_filename, warc_record_offset, warc_record_length
FROM dfs.`CC-MAIN-2023-14-idx`
WHERE url_host_tld = 'com' AND content_languages LIKE '%eng%' AND url_path = '/' AND url_query IS null AND fetch_status = 200 AND
 content_mime_type = 'text/html'
```

Drill merely reports:

```
apache drill Fragment: 5:3 Error: INTERNAL_ERROR ERROR: null Please, refer to logs for more information.
```

The logs have no further actionable information.

Thanks!


Sebastian Nagel

unread,
Mar 6, 2024, 11:10:43 AMMar 6
to common...@googlegroups.com
Hi Jason,

yes, some of the columns allow null as values. The table schema
indicates for every column whether it's nullable or not. See


https://github.com/commoncrawl/cc-index-table/blob/main/src/main/resources/schema/cc-index-schema-flat.json

That's for the query below:
url_host_tld -- no TLD for IP addresses
url_host_registered_domain -- IP addresses, URLs which fail to parse
(yes, those exist because the HTTP protocol implementation is more
permissive than the Java core URL class), host names identical to a
public suffix, eg. "gov.il". There may be more such cases.
content_mime_type -- the server may not sent the Content-Type header
content_languages
-- language detection only run for successfully fetched documents

Best,
Sebastian
> --
> You received this message because you are subscribed to the Google
> Groups "Common Crawl" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to common-crawl...@googlegroups.com
> <mailto:common-crawl...@googlegroups.com>.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/common-crawl/3749c78c-b8ea-440e-a4e7-d7fef81818a6n%40googlegroups.com <https://groups.google.com/d/msgid/common-crawl/3749c78c-b8ea-440e-a4e7-d7fef81818a6n%40googlegroups.com?utm_medium=email&utm_source=footer>.

Greg Lindahl

unread,
Mar 6, 2024, 2:24:57 PMMar 6
to common...@googlegroups.com
Thank you Sebastian, for posting the schema.

A little googling says that Dremel can throw that error because a column is 100% NULLs. https://www.mail-archive.com/iss...@drill.apache.org/msg70809.html

Here's an official troubleshooting guide: https://drill.apache.org/docs/troubleshooting/


I hope this helps you, Jason! 

To unsubscribe from this group and stop receiving emails from it, send an email to common-crawl...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/common-crawl/a0bb4bc9-a7e9-458b-b92b-4879b117a736%40commoncrawl.org.

Jason Boxman

unread,
Mar 6, 2024, 7:24:08 PMMar 6
to Common Crawl
Thanks all!

I hopped on the Apache Drill Slack and it seems like this might actually be a bug in Apache Drill.

Reply all
Reply to author
Forward
0 new messages