Get title field from CDX API

Neil

Sep 8, 2016, 3:55:55 AM
to Common Crawl
Hi all, quick question: when I use the CDX API to search for pages crawled from a particular
domain, the result includes the URL among other fields. I would also like to get the 'title'
tag for each URL in the JSON response.

I tried adding title to the fields option, but it's not included. What would be the best way to
pull the 'title' along with the other fields?

Thanks a lot :)

zbagz

Sep 8, 2016, 12:46:21 PM
to Common Crawl
The title tag is not indexed by the index server. You will have to extract it manually.

You can query the CDX API for your domain, download the byte range of the page's record from the WARC file, and then do the parsing.

Let's say you have this for "www.ipc.com":

com,ipc)/ 20150127054500 {"url": "http://www.ipc.com/", "digest": "2WIVV4MGIEL27MAOOREEEKCIATEK43GM", "length": "9953", "offset": "768421563", "filename": "crawl-data/CC-MAIN-2015-06/segments/1422115861027.55/warc/CC-MAIN-20150124161101-00006-ip-10-180-212-252.ec2.internal.warc.gz"}

curl -s -r 768421563-768431515 "https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2015-06/segments/1422115861027.55/warc/CC-MAIN-20150124161101-00006-ip-10-180-212-252.ec2.internal.warc.gz" | zgrep -oP '<title>(.*)</title>' -m 1 | sed -e "s#^<title>##" -e "s#</title>\$##"

I don't know why grep -o is not working on my machine, but anyway, after the two ugly sed replacements to strip the tags, it gets you the content of the first <title> tag that appears in the document just fine.
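For anyone who prefers a script over the shell pipeline, the same steps can be sketched in Python using only the standard library. This is a rough sketch, not a tested client: the function names (parse_cdx_line, byte_range, fetch_record, split_record) are made up for illustration, and it assumes the record sits in its own gzip member, that the server honours HTTP Range requests, and that the record is a WARC response whose WARC headers, HTTP headers and body are separated by blank lines.

```python
import gzip
import json
import urllib.request

# The CDX result line quoted in the post above.
line = ('com,ipc)/ 20150127054500 {"url": "http://www.ipc.com/", '
        '"digest": "2WIVV4MGIEL27MAOOREEEKCIATEK43GM", "length": "9953", '
        '"offset": "768421563", "filename": "crawl-data/CC-MAIN-2015-06/'
        'segments/1422115861027.55/warc/CC-MAIN-20150124161101-00006-'
        'ip-10-180-212-252.ec2.internal.warc.gz"}')


def parse_cdx_line(cdx_line):
    """Split '<urlkey> <timestamp> <json>' into its three parts."""
    # The JSON part contains spaces, so split at most twice.
    urlkey, timestamp, payload = cdx_line.split(" ", 2)
    return urlkey, timestamp, json.loads(payload)


def byte_range(fields):
    """Return the inclusive (first, last) byte positions of the record."""
    # offset and length are strings in the index output.
    first = int(fields["offset"])
    last = first + int(fields["length"]) - 1
    return first, last


def fetch_record(url, first, last):
    """Download bytes first..last of the WARC file and gunzip them."""
    request = urllib.request.Request(
        url, headers={"Range": f"bytes={first}-{last}"})
    with urllib.request.urlopen(request) as response:
        return gzip.decompress(response.read())


def split_record(record):
    """Split a WARC response record into (warc_headers, http_headers, body)."""
    warc_headers, _, rest = record.partition(b"\r\n\r\n")
    http_headers, _, body = rest.partition(b"\r\n\r\n")
    return warc_headers, http_headers, body


# Usage (needs network access, so it is left commented out; base_url is
# whatever prefix currently serves the crawl data, e.g. the one in the
# curl command above):
#   _, _, fields = parse_cdx_line(line)
#   first, last = byte_range(fields)
#   record = fetch_record(base_url + "/" + fields["filename"], first, last)
#   _, _, body = split_record(record)
```

For the example line, byte_range gives 768421563 to 768431515, which is the same range passed to curl -r: offset plus length minus one, because HTTP ranges are inclusive at both ends.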

zbagz

Sep 8, 2016, 12:50:20 PM
to Common Crawl
Note that you should use an HTML parser to cope with badly formatted HTML. Regex is not a reliable way to parse HTML; the command above is just for demo purposes.
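As one possible illustration of the parser approach, here is a minimal sketch built on Python's standard html.parser module. The TitleParser class and extract_title helper are hypothetical names, and the input is assumed to be the HTML body already decoded to a string (for example with body.decode("utf-8", errors="replace")). Unlike the regex, it copes with a title tag that carries attributes or spans several lines.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text inside the first <title> element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.in_title = False
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lower-cased; ignore any title after the first.
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)


def extract_title(html):
    """Return the first title with whitespace collapsed, or None."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    title = " ".join("".join(parser.parts).split())
    return title or None


page = '<html><head><title id="t">\n  IPC Home\n</title></head><body><p>unclosed'
print(extract_title(page))
```

A third-party parser such as lxml or BeautifulSoup would likely be more forgiving still; the standard library is used here only to keep the sketch dependency-free.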