Understand how I can filter certain information


הדסה אהרוניאן

Jan 15, 2023, 8:24:15 AM
to Common Crawl
Hi,

I would like to understand how I can filter certain information through a Java written process.
Maybe filter according to specific values.
Is there a way of seeing the data through a JSON viewer?

Thanks,
Hadassa

Sebastian Nagel

Jan 15, 2023, 8:48:49 AM
to common...@googlegroups.com
Hi Hadassa,

could you share more details on what "certain information" or "specific
values" means? The way to go depends mostly on whether the filters are
applied to the URL (or any part of it), capture metadata (fetch time,
status), some HTML metadata, hyperlinks, or the content itself.

The various data formats and subsets are described here:
https://commoncrawl.org/the-data/get-started/
For efficient filtering, the URL indexes should be used where possible,
but whether they help depends on the use case.

> a Java written process.

Java and Python are the most common programming languages to process
Common Crawl data.

> Is there a way of seeing the data through a JSON viewer?

The URL index (CDX) and the WAT files are JSON derivatives.
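
As a rough illustration of filtering on the index, here is a minimal Java sketch. It assumes each CDX line has the shape "urlkey timestamp {json}" and that the JSON object is flat with string values such as "status" and "mime". The class and method names (CdxLineFilter, jsonPart, field, isSuccessfulHtml) are made up for this example, and the regex-based field lookup is a shortcut; a proper JSON parser would be the safer choice in a real process.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CdxLineFilter {

    /** Returns the JSON part of a CDX line, i.e. everything from the first '{'. */
    static String jsonPart(String cdxLine) {
        int start = cdxLine.indexOf('{');
        return start < 0 ? "" : cdxLine.substring(start);
    }

    /**
     * Naive lookup of a flat string field such as "status": "200".
     * Assumption: values are quoted strings without escaped quotes.
     */
    static Optional<String> field(String json, String name) {
        Pattern p = Pattern.compile(
                "\"" + Pattern.quote(name) + "\"\\s*:\\s*\"([^\"]*)\"");
        Matcher m = p.matcher(json);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    /** Example filter: keep only captures with status 200 and MIME type text/html. */
    static boolean isSuccessfulHtml(String cdxLine) {
        String json = jsonPart(cdxLine);
        return field(json, "status").filter("200"::equals).isPresent()
                && field(json, "mime").filter("text/html"::equals).isPresent();
    }
}
```

The same JSON part of a line can also be pasted into any JSON viewer to inspect the available fields.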

Best,
Sebastian

הדסה אהרוניאן

Jan 15, 2023, 8:54:22 AM
to common...@googlegroups.com
Hi,
I would like to filter values from within the content itself. 
Maybe values like "smart watch" or "handbag".

Thanks, 
Hadassa


Sebastian Nagel

Jan 17, 2023, 4:12:50 AM
to common...@googlegroups.com
Hi Hadassa,

unfortunately, Common Crawl does not provide a full-text index which
would make it easy to search the content for keywords. You'd need to
search through the WET files (plain text extracts of HTML pages),
7-10 TiB per main crawl. This is doable but requires some experience
working with big data tools.
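
To give an idea of what such a search involves for a single file, here is a minimal Java sketch that scans the text of one WET file for keywords and collects the URLs of matching pages. It assumes the file has already been opened as decompressed text, and it uses a simplified line-based reading of the WARC record layout (a "WARC/1.0" line starts a record, a "WARC-Target-URI:" header names the page). The names WetKeywordScan and matchingUris are hypothetical. A real job would use a WARC parser and run over many files in parallel.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class WetKeywordScan {

    private static final String URI_HEADER = "WARC-Target-URI:";

    /**
     * Returns the target URIs of all records containing at least one keyword
     * (case-insensitive). Simplification: lines are not split into header and
     * body by Content-Length, so the remaining header lines are scanned too.
     */
    static Set<String> matchingUris(BufferedReader wet, List<String> keywords) {
        Set<String> hits = new LinkedHashSet<>();
        String currentUri = null;
        try {
            String line;
            while ((line = wet.readLine()) != null) {
                if (line.startsWith("WARC/1.0")) {
                    currentUri = null;          // a new record begins
                    continue;
                }
                if (line.startsWith(URI_HEADER)) {
                    currentUri = line.substring(URI_HEADER.length()).trim();
                    continue;
                }
                if (currentUri == null) {
                    continue;                   // record without a target URI
                }
                String lower = line.toLowerCase(Locale.ROOT);
                for (String keyword : keywords) {
                    if (lower.contains(keyword.toLowerCase(Locale.ROOT))) {
                        hits.add(currentUri);
                        break;
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return hits;
    }
}
```

For a downloaded WET file, the reader could be built by wrapping a java.util.zip.GZIPInputStream in an InputStreamReader with UTF-8 and passing that to a BufferedReader; the keywords would be something like List.of("smart watch", "handbag").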

Alternatively, have a look at
https://www.alexandria.org/
a search engine which uses Common Crawl to feed the index.

Best,
Sebastian
