Get title field from CDX API

Neil

Sep 8, 2016, 3:55:55 AM

to Common Crawl

Hi all, quick question: when I use the CDX API to search for pages crawled for a particular domain, the result includes the URL among other fields. I would also like to get the 'title' tag for each URL in the JSON response.

I tried adding title to the fields option, but it's not included. What would be the best way to pull the 'title' along with the other fields?

Thanks a lot :)

zbagz

Sep 8, 2016, 12:46:21 PM

to Common Crawl

The title tag is not indexed by the index server. You will have to extract it yourself.

You can query the CDX API for your domain, download the byte range of the page's record from the WARC file (using the offset and length fields), and then do the parsing.
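For the first step, here is a minimal Python sketch of building the index query and reading its JSON-lines response. The endpoint `https://index.commoncrawl.org/CC-MAIN-2015-06-index` and the helper names `cdx_query_url` and `parse_cdx_lines` are assumptions for illustration; the field list is simply the subset needed to locate a record.

```python
import json
from urllib.parse import urlencode

# Assumption: public index endpoint for the CC-MAIN-2015-06 crawl.
INDEX_URL = "https://index.commoncrawl.org/CC-MAIN-2015-06-index"


def cdx_query_url(domain, fields=("url", "offset", "length", "filename")):
    """Build a CDX query URL that asks for JSON output and selected fields."""
    params = {"url": domain, "output": "json", "fl": ",".join(fields)}
    return INDEX_URL + "?" + urlencode(params)


def parse_cdx_lines(text):
    """Parse a JSON-lines response body: one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]
```

Fetching `cdx_query_url("www.ipc.com")` and passing the response text to `parse_cdx_lines` should give one dict per capture, each carrying the offset, length and filename needed for the next step.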

Let's say you have this for "www.ipc.com":

com,ipc)/ 20150127054500 {"url": "http://www.ipc.com/", "digest": "2WIVV4MGIEL27MAOOREEEKCIATEK43GM", "length": "9953", "offset": "768421563", "filename": "crawl-data/CC-MAIN-2015-06/segments/1422115861027.55/warc/CC-MAIN-20150124161101-00006-ip-10-180-212-252.ec2.internal.warc.gz"}

curl -s -r 768421563-768431515 "https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2015-06/segments/1422115861027.55/warc/CC-MAIN-20150124161101-00006-ip-10-180-212-252.ec2.internal.warc.gz" | zgrep -oP '<title>(.*)</title>' -m 1 | sed -e "s#^<title>##" -e "s#</title>\$##"

I don't know why grep -o is not working on my machine, but anyway, after the two ugly sed replacements that strip the tags, it gets you the content of the first <title> tag in the document just fine.
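The same download step could be sketched in Python roughly as follows. The end of the range is offset + length - 1, which is where the 768431515 in the curl command comes from. The helper names (`range_header`, `fetch_record`, `split_record`) are made up for this example, the host is taken from the curl command above, and the splitting assumes a plain WARC response record whose header blocks are separated by blank lines.

```python
import gzip
import json
import urllib.request

# Assumption: same S3 host as in the curl example.
BASE_URL = "https://commoncrawl.s3.amazonaws.com/"

# The CDX record from the example, as returned by the index server.
CDX_JSON = (
    '{"url": "http://www.ipc.com/", '
    '"digest": "2WIVV4MGIEL27MAOOREEEKCIATEK43GM", '
    '"length": "9953", "offset": "768421563", '
    '"filename": "crawl-data/CC-MAIN-2015-06/segments/1422115861027.55/warc/'
    'CC-MAIN-20150124161101-00006-ip-10-180-212-252.ec2.internal.warc.gz"}'
)


def range_header(record):
    """Build the HTTP Range value for one CDX record (end byte is inclusive)."""
    start = int(record["offset"])
    end = start + int(record["length"]) - 1
    return f"bytes={start}-{end}"


def fetch_record(record, base_url=BASE_URL):
    """Download and gunzip the single WARC record the CDX entry points to."""
    request = urllib.request.Request(
        base_url + record["filename"],
        headers={"Range": range_header(record)},
    )
    with urllib.request.urlopen(request) as response:
        return gzip.decompress(response.read())


def split_record(raw):
    """Split a WARC response record into WARC headers, HTTP headers and body."""
    warc_headers, _, rest = raw.partition(b"\r\n\r\n")
    http_headers, _, body = rest.partition(b"\r\n\r\n")
    return warc_headers, http_headers, body
```

Calling `split_record(fetch_record(json.loads(CDX_JSON)))` should leave the page's HTML in the third element, ready for title extraction.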

zbagz

Sep 8, 2016, 12:50:20 PM

to Common Crawl

Note that you should use an HTML parser to cope with badly formatted HTML. Regex is not a reliable way to parse HTML; the command above is just for demo purposes.
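As one possible way to do that, here is a small sketch using `html.parser` from the Python standard library. The `TitleParser` class and `first_title` function are names invented for this example; it returns the text of the first <title> element with whitespace collapsed, or None when there is none.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element only."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.in_title = False
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased, so <TITLE> matches as well.
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True  # ignore any later <title> elements

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)


def first_title(html):
    """Return the first title's text with whitespace collapsed, or None."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    text = " ".join("".join(parser.parts).split())
    return text or None
```

Unlike the regex, this copes with uppercase tags, attributes on the tag, and titles that span several lines. The page bytes still have to be decoded to a string first, for example with `body.decode("utf-8", errors="replace")`, which is itself a guess at the page's encoding.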
