Hi Sawroop,
unfortunately there is no index that would let you quickly filter for all pages/URLs containing an
HTML form element.
The easiest way is to process the WAT files and extract URLs of pages containing, e.g.,
...
"HTML-Metadata": {
"Links":[
{"path":"FORM@/action","url":"/search"},
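A minimal sketch of that check in Python, assuming each WAT metadata record has already been read as a JSON string (e.g., with a WARC reader library; that part is not shown). The function name `form_actions` is hypothetical, and the nesting below "Envelope" is how I understand the WAT layout, so please verify it against a real record. Forms without an action attribute probably do not show up as "FORM@/action" at all.

```python
import json

def form_actions(wat_json):
    """Return (page URL, list of form action URLs) for one WAT record.

    Assumes the usual WAT nesting:
    Envelope > Payload-Metadata > HTTP-Response-Metadata > HTML-Metadata > Links
    """
    envelope = json.loads(wat_json).get("Envelope", {})
    page_url = envelope.get("WARC-Header-Metadata", {}).get("WARC-Target-URI")
    links = (
        envelope.get("Payload-Metadata", {})
        .get("HTTP-Response-Metadata", {})
        .get("HTML-Metadata", {})
        .get("Links", [])
    )
    # Keep only the links that were extracted from <form action="...">.
    actions = [link.get("url") for link in links if link.get("path") == "FORM@/action"]
    return page_url, actions
```

A page is a candidate if the returned list is non-empty; records without HTML metadata (requests, non-HTML responses) simply yield an empty list.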
Just a quick estimate (based on a single WAT file only): approx. 40% of the pages contain a
"FORM@/action". Since the WAT step would filter out only about 60% of the pages, it may be more
efficient to process the WARC files directly (they contain the raw HTML), look for a form there,
and decide on the spot whether it's a search box or something else.
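To illustrate the "search box or something else" decision, here is a rough sketch using only Python's standard html.parser, assuming the HTML payload has already been extracted from the WARC record and decoded to a string. The names `FormFinder`, `find_forms`, `looks_like_search` and `SEARCH_NAMES` are made up for this example, and the heuristic (role="search", "search" in the action, an input of type "search" or with a typical query name) is only a guess that would need tuning on real data.

```python
from html.parser import HTMLParser

# Input names that often indicate a search field (an assumption, not a standard).
SEARCH_NAMES = {"q", "query", "search", "s"}

class FormFinder(HTMLParser):
    """Collect each <form> with its action, role and (type, name) of its inputs."""

    def __init__(self):
        super().__init__()
        self.forms = []
        self._current = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = {
                "action": attrs.get("action") or "",
                "role": attrs.get("role") or "",
                "inputs": [],
            }
            self.forms.append(self._current)
        elif tag == "input" and self._current is not None:
            input_type = (attrs.get("type") or "text").lower()
            input_name = (attrs.get("name") or "").lower()
            self._current["inputs"].append((input_type, input_name))

    def handle_endtag(self, tag):
        if tag == "form":
            self._current = None

def find_forms(html):
    parser = FormFinder()
    parser.feed(html)
    parser.close()
    return parser.forms

def looks_like_search(form):
    if form["role"].lower() == "search" or "search" in form["action"].lower():
        return True
    return any(t == "search" or n in SEARCH_NAMES for t, n in form["inputs"])
```

Usage would be along the lines of `[f for f in find_forms(html) if looks_like_search(f)]` per page. Inputs placed outside the form element (attached via the form attribute) are not covered by this sketch.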
Best,
Sebastian