Hi Hadassa,
Could you share more details on what "certain information" or "specific
values" means? The way to go depends mostly on whether the filters are
applied to the URL (or any part of it), capture metadata (fetch time,
status), some HTML metadata, hyperlinks, or the content itself.
The various data formats and subsets are described here:
https://commoncrawl.org/the-data/get-started/
For efficient filtering, the URL indexes should be used where possible,
though this depends on the use case.
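As a rough sketch of what filtering on the URL index could look like in
Python: it assumes each index line consists of a SURT key, a timestamp,
and a JSON object, separated by single spaces. The sample lines, file
names, and function names below are made up for illustration and are not
taken from a real crawl.

```python
import json


def parse_index_line(line):
    """Split one index line into (SURT key, timestamp, JSON fields)."""
    surt_key, timestamp, payload = line.rstrip("\n").split(" ", 2)
    return surt_key, timestamp, json.loads(payload)


def filter_captures(lines, url_prefix, status="200"):
    """Yield the JSON fields of captures whose URL starts with url_prefix
    and whose HTTP status equals the given value (stored as a string)."""
    for line in lines:
        _, _, fields = parse_index_line(line)
        if (fields.get("url", "").startswith(url_prefix)
                and fields.get("status") == status):
            yield fields


# Hypothetical sample lines; all values are invented for this example.
SAMPLE_LINES = [
    'org,example)/ 20240101000000 {"url": "https://example.org/", '
    '"status": "200", "mime": "text/html", '
    '"filename": "example-1.warc.gz", "offset": "100", "length": "200"}',
    'org,example)/missing 20240101000500 {"url": "https://example.org/missing", '
    '"status": "404", "mime": "text/html", '
    '"filename": "example-2.warc.gz", "offset": "300", "length": "150"}',
]

matches = list(filter_captures(SAMPLE_LINES, "https://example.org/"))
for fields in matches:
    # filename, offset and length tell where the capture is stored
    print(fields["url"], fields["filename"], fields["offset"], fields["length"])
```

Only the captures that pass the filter then need to be fetched from the
WARC files, which is what makes the index-based approach efficient.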
> a Java written process.
Java and Python are the most common programming languages to process
Common Crawl data.
> Is there a way of seeing the data through a JSON viewer?
The URL index (CDX) and the WAT files are JSON derivatives.
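If no dedicated viewer is at hand, a few lines of Python can make such a
record readable. This is only a minimal sketch: it assumes an index line
of the form "SURT key, timestamp, JSON object", and the sample line and
the function name `pretty_index_record` are invented for illustration.

```python
import json


def pretty_index_record(line):
    """Return the JSON part of one index line as indented, sorted text."""
    # The first two space-separated tokens are the SURT key and timestamp;
    # everything after them is assumed to be a single JSON object.
    _surt_key, _timestamp, payload = line.rstrip("\n").split(" ", 2)
    return json.dumps(json.loads(payload), indent=2, sort_keys=True)


# Hypothetical sample line; all values are invented for this example.
SAMPLE_LINE = (
    'org,example)/ 20240101000000 {"url": "https://example.org/", '
    '"status": "200", "mime": "text/html"}'
)

formatted = pretty_index_record(SAMPLE_LINE)
print(formatted)
```

The indented output can also be saved to a file and opened in any JSON
viewer or editor.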
Best,
Sebastian