How to build anchor's related data from CommonCrawl?


Amir H. Jadidinejad

Jun 1, 2016, 3:09:54 AM
to Common Crawl
Hi,

I do appreciate all of you for such a great resource. It is very interesting.
I'm interested to know whether anchor-related data is stored during the crawl process.
Specifically, I'm looking for a list of <ANCHOR> <URL> <FREQ> entries that maps each anchor text to its corresponding URLs, something like the ClueWeb09 Anchor Log.

I do appreciate any help.
Kind regards

Tom Morris

Jun 1, 2016, 10:28:06 AM
to common...@googlegroups.com
On Wed, Jun 1, 2016 at 3:09 AM, Amir H. Jadidinejad <jadid...@gmail.com> wrote:

I'm interested to know whether anchor-related data is stored during the crawl process.
Specifically, I'm looking for a list of <ANCHOR> <URL> <FREQ> entries that maps each anchor text to its corresponding URLs, something like the ClueWeb09 Anchor Log.

The crawl includes the raw HTML of each page, so both anchor text and URLs are included for all embedded links. You would need to extract, clean, and count them yourself.
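As a minimal sketch of the extraction step (assuming Python; `AnchorExtractor` is an illustrative name, and real crawl HTML would need more robust handling), pulling (anchor text, href) pairs out of raw HTML with the standard library might look like:

```python
from html.parser import HTMLParser


class AnchorExtractor(HTMLParser):
    """Collects (anchor_text, href) pairs from raw HTML (illustrative sketch)."""

    def __init__(self):
        super().__init__()
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>
        self.links = []     # collected (anchor_text, href) pairs

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


parser = AnchorExtractor()
parser.feed('<p><a href="/index.html">Home</a> and '
            '<a href="http://example.com/">Example</a></p>')
print(parser.links)
# → [('Home', '/index.html'), ('Example', 'http://example.com/')]
```

The cleaning and counting steps would then run over the collected pairs.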

Tom

Sebastian Nagel

Jun 1, 2016, 10:55:39 AM
to common...@googlegroups.com
Hi,

alternatively, the WAT files can be used. They should be more efficient to process
because links (<a>, but also <img> and others) are already extracted into JSON (see [1]):

{
"text":"Home",
"path":"A@/href",
"url":"/index.html"
}

No need to extract the links from raw HTML. However, two steps remain:
- make links absolute
- transform them into a frequency list
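A rough sketch of those two steps (assuming Python; the WAT record below is inlined as a string for illustration, and the field layout follows the WAT specification [1] but should be verified against real files):

```python
import json
from collections import Counter
from urllib.parse import urljoin

# One WAT record's metadata, abbreviated to the fields used here. The path
# Envelope -> Payload-Metadata -> HTTP-Response-Metadata -> HTML-Metadata
# -> Links is assumed per the WAT specification.
wat_json = """{
  "Envelope": {
    "WARC-Header-Metadata": {"WARC-Target-URI": "http://example.com/page.html"},
    "Payload-Metadata": {
      "HTTP-Response-Metadata": {
        "HTML-Metadata": {
          "Links": [
            {"text": "Home", "path": "A@/href", "url": "/index.html"},
            {"text": "Home", "path": "A@/href", "url": "/index.html"},
            {"text": "Logo", "path": "IMG@/src", "url": "/logo.png"}
          ]
        }
      }
    }
  }
}"""

record = json.loads(wat_json)
envelope = record["Envelope"]
base = envelope["WARC-Header-Metadata"]["WARC-Target-URI"]
links = (envelope["Payload-Metadata"]["HTTP-Response-Metadata"]
                 ["HTML-Metadata"]["Links"])

freq = Counter(
    (link["text"], urljoin(base, link["url"]))   # make the link absolute
    for link in links
    if link.get("path", "").startswith("A@")     # keep only <a href> links
    and link.get("text")                         # drop empty anchor texts
)
for (anchor, url), count in freq.items():
    print(anchor, url, count)
# → Home http://example.com/index.html 2
```

In practice one would iterate over the gzipped WAT files record by record (e.g. with a WARC-reading library) instead of a single inlined JSON string.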

[1] https://webarchive.jira.com/wiki/display/Iresearch/Web+Archive+Metadata+File+Specification

Sebastian

On 06/01/2016 04:28 PM, Tom Morris wrote:
> On Wed, Jun 1, 2016 at 3:09 AM, Amir H. Jadidinejad <jadid...@gmail.com
> <mailto:jadid...@gmail.com>> wrote:
>
>
> I'm interested to know whether anchor-related data is stored during the crawl process.
> Specifically, I'm looking for a list of <ANCHOR> <URL> <FREQ> entries that maps each anchor text to
> its corresponding URLs, something like the ClueWeb09 Anchor Log
> <http://lemurproject.org/clueweb09/anchortext-querylog/>.
>
>
> The crawl includes the raw HTML of each page, so both anchor text and URLs are included for all
> embedded links. You would need to extract, clean, count, etc them yourself.
>
> Tom
>

Amir H. Jadidinejad

Jun 1, 2016, 11:56:25 PM
to common...@googlegroups.com
Dear Sebastian,

Thank you for your help. This is a better solution than processing the original data.

Kind regards,
Amir

Tom Morris

Jun 2, 2016, 12:58:01 AM
to common...@googlegroups.com
On Wed, Jun 1, 2016 at 10:55 AM, 'Sebastian Nagel' via Common Crawl <common...@googlegroups.com> wrote:

alternatively, the WAT files can be used. They should be more efficient to process
because links (<a>, but also <img> and others) are already extracted into JSON (see [1]):

Excellent point! That's a lot less data to munge through -- although I'm always a little nervous not having control of the total pipeline. Does the WAT extractor use the character encoding declared in the header, use a sniffer like ICU's character set detector, or something else? 

No need to extract the links from raw HTML. However, two steps remain:
 - make links absolute
 - transform them into a frequency list

Plus any of the relevant/desired cleaning steps from the original ClueWeb protocol to eliminate generic "PDF"/"download"/etc. links.
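As a hypothetical illustration of that cleaning step (the stop-list below is made up for the example; the actual ClueWeb protocol defines its own filters):

```python
# Illustrative stop-list of generic/navigational anchor texts;
# a real pipeline would use the ClueWeb protocol's own filter rules.
GENERIC_ANCHORS = {"pdf", "download", "click here", "here", "home", "link"}


def keep_anchor(text: str) -> bool:
    """Drop empty and generic anchor texts before counting."""
    t = text.strip().lower()
    return bool(t) and t not in GENERIC_ANCHORS


print([a for a in ["Download", "Common Crawl", "here"] if keep_anchor(a)])
# → ['Common Crawl']
```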

Tom

Sebastian Nagel

Jun 2, 2016, 4:39:08 AM
to common...@googlegroups.com
Hi Amir, hi Tom,

> ... although I'm always a little nervous not having control of the total pipeline. Does the WAT
> extractor use the character encoding declared in the header, use a sniffer like ICU's character
> set detector, or something else?

Good question; I would also have to check the source code in:
https://github.com/commoncrawl/ia-hadoop-tools/
https://github.com/commoncrawl/ia-web-commons/
The WAT and WET files are generated from the WARC files by
https://github.com/commoncrawl/ia-hadoop-tools/blob/master/src/main/java/org/archive/hadoop/jobs/WEATGenerator.java

A look at the pom.xml of ia-web-commons suggests that the charset detection is based
on Mozilla's juniversalchardet package.


@Amir: Let us know if you manage to extract the anchor frequency list.
This could be of interest to other people and may be worth sharing
(code and/or resulting data).

Sebastian