Hi,
alternatively, the WAT files can be used. They should be more efficient to process
because the links (<a>, but also <img> and others) are wrapped in JSON (see [1]):
{
  "text": "Home",
  "path": "A@/href",
  "url": "/index.html"
}
So there is no need to extract the links from raw HTML. However, two steps remain:
- make the links absolute
- transform them into a frequency list
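A rough sketch of both steps in Python, assuming the WAT key paths from the
spec in [1] (Envelope > Payload-Metadata > HTTP-Response-Metadata >
HTML-Metadata > Links); verify these against real records before relying on it:

```python
# Hedged sketch: build an <anchor, url, freq> list from a WAT file.
# The key paths below are assumptions based on the WAT metadata spec;
# record layout may differ, so check against actual WAT records.
import gzip
import json
from collections import Counter
from urllib.parse import urljoin

def anchor_counts(wat_path):
    counts = Counter()
    with gzip.open(wat_path, "rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            line = line.strip()
            if not line.startswith("{"):
                continue  # skip WARC headers between the JSON payloads
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue
            envelope = record.get("Envelope", {})
            # the page URL, used as the base for absolutizing relative links
            base = envelope.get("WARC-Header-Metadata", {}) \
                           .get("WARC-Target-URI", "")
            links = (envelope.get("Payload-Metadata", {})
                             .get("HTTP-Response-Metadata", {})
                             .get("HTML-Metadata", {})
                             .get("Links", []))
            for link in links:
                if link.get("path") == "A@/href" and "text" in link:
                    # step 1: make the link absolute; step 2: count it
                    counts[(link["text"], urljoin(base, link["url"]))] += 1
    return counts
```

Sorting `anchor_counts(path).most_common()` then gives the
<ANCHOR> <URL> <FREQ> list in descending frequency order.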
[1]
https://webarchive.jira.com/wiki/display/Iresearch/Web+Archive+Metadata+File+Specification
Sebastian
On 06/01/2016 04:28 PM, Tom Morris wrote:
> On Wed, Jun 1, 2016 at 3:09 AM, Amir H. Jadidinejad <jadid...@gmail.com> wrote:
>
>
> I'm interested to know whether anchor-related data has been stored during the crawl process.
> Specifically, I'm looking for a list of <ANCHOR> <URL> <FREQ> entries, which represent the
> URLs corresponding to a specific ANCHOR, something like the ClueWeb09 Anchor Log
> <http://lemurproject.org/clueweb09/anchortext-querylog/>.
>
>
> The crawl includes the raw HTML of each page, so both anchor text and URLs are included for all
> embedded links. You would need to extract, clean, and count them yourself.
>
> Tom
>