common crawl dataset


srb...@gmail.com

Jul 6, 2017, 5:51:07 AM
to Common Crawl
Hi, I need a dataset of URLs of pages that have a search interface. Can anyone help?

Sebastian Nagel

Jul 6, 2017, 11:58:31 AM
to common...@googlegroups.com
Hi,

Does this mean you want to get only those pages which contain a search box, i.e.
have an HTML form triggering a search? It's a good question whether there is a way to
universally detect such forms.

Please share an example of what exactly you have in mind as a "search interface".

Thanks,
Sebastian


> --
> You received this message because you are subscribed to the Google Groups "Common Crawl" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to
> common-crawl...@googlegroups.com <mailto:common-crawl...@googlegroups.com>.
> To post to this group, send email to common...@googlegroups.com
> <mailto:common...@googlegroups.com>.
> Visit this group at https://groups.google.com/group/common-crawl.
> For more options, visit https://groups.google.com/d/optout.

sawroop bal

Jul 7, 2017, 5:15:25 AM
to common...@googlegroups.com
Hi,
No, I need only the URLs. Whether a page has a search interface or not is something the crawler has to detect.

Thanks
Sawroop

sawroop bal

Jul 7, 2017, 5:18:07 AM
to common...@googlegroups.com
Hi,
I need a dataset of URLs. For each of those URLs, the crawler will use the HTML form tag to detect whether a search interface is present or not. If the page has a search interface, the crawler will then learn its features.
Thanks
Sawroop


Sebastian Nagel

Jul 7, 2017, 5:34:32 AM
to common...@googlegroups.com
Hi Sawroop,

Unfortunately, there is no index which would allow you to quickly pick out all pages/URLs containing an
HTML form element.

The easiest way is to process the WAT files and extract URLs of pages containing, e.g.,

...
"HTML-Metadata": {
  "Links":[
    {"path":"FORM@/action","url":"/search"},


Just a quick estimate (based on a single WAT file only): approx. 40% of the pages contain a
"FORM@/action". So it may be more efficient to directly process the WARC files (they contain the
raw HTML), look for a form there, and decide whether it's a search box or something else.
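
Both approaches could be sketched roughly as follows, in Python with the standard library only. This is an illustration under assumptions, not a tested pipeline: the nesting of the WAT record (Envelope, Payload-Metadata, HTTP-Response-Metadata) is assumed from the usual WAT layout, the names form_actions_from_wat, has_search_form and SEARCH_HINTS are made up for this example, and the keyword heuristic for "search box" is only a guess that would need tuning. Reading the WAT/WARC files themselves is left out.

```python
import json
from html.parser import HTMLParser

# Hypothetical keyword list; what counts as a "search" form is a guess.
SEARCH_HINTS = ("search", "query")


def form_actions_from_wat(wat_json):
    """Return (page_url, [form action URLs]) for one WAT JSON record.

    Assumes the layout Envelope -> Payload-Metadata ->
    HTTP-Response-Metadata -> HTML-Metadata -> Links.
    """
    envelope = json.loads(wat_json).get("Envelope", {})
    page_url = envelope.get("WARC-Header-Metadata", {}).get("WARC-Target-URI")
    links = (envelope.get("Payload-Metadata", {})
                     .get("HTTP-Response-Metadata", {})
                     .get("HTML-Metadata", {})
                     .get("Links", []))
    actions = [link.get("url") for link in links
               if link.get("path") == "FORM@/action"]
    return page_url, actions


class SearchFormFinder(HTMLParser):
    """Sets self.found if a <form> looks like a search box."""

    def __init__(self):
        super().__init__()
        self.in_form = False
        self.found = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self.in_form = True
            action = (attrs.get("action") or "").lower()
            role = (attrs.get("role") or "").lower()
            if role == "search" or any(h in action for h in SEARCH_HINTS):
                self.found = True
        elif tag == "input" and self.in_form:
            itype = (attrs.get("type") or "").lower()
            name = (attrs.get("name") or "").lower()
            if (itype == "search" or name == "q"
                    or any(h in name for h in SEARCH_HINTS)):
                self.found = True

    def handle_endtag(self, tag):
        if tag == "form":
            self.in_form = False


def has_search_form(html):
    """Heuristic check on raw HTML (e.g. a WARC response payload)."""
    finder = SearchFormFinder()
    finder.feed(html)
    return finder.found
```

The first function only tells you that a page has some form; the second one looks at the form itself, which is why working on the raw HTML may be the better starting point for learning features.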

Best,
Sebastian

sawroop bal

Jul 7, 2017, 6:58:27 AM
to common...@googlegroups.com
Hi,
Thank you for your help.
Thanks
Sawroop

