I understand that for each record in cluster.idx I need to find the matching cdx-XXXXX.gz file and decompress it, and I found that the first line of the decompressed block contains the result:
com,tripadvisor)/alllocations-g297914-c3-o100-restaurants-khao_lak_phang_nga_province.html 20170624092242 {"url": "
", "mime": "text/plain", "mime-detected": "text/html", "status": "301", "digest": "3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ", "length": "1505", "offset": "27705127", "filename": "crawl-data/CC-MAIN-2017-26/segments/1498128320243.11/crawldiagnostics/CC-MAIN-20170624082900-20170624102900-00271.warc.gz"}
However, I am still a little confused. When I save the unzipped result to a file by doing
and pick a line from there, e.g.
I cannot find it in cluster.idx; it is not present there.
grep -P 'com,tripadvisor\)/alllocations-g298053' cluster.idx
returns empty results.
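For reference, here is a minimal Python sketch of the per-block lookup described at the top. It assumes each cluster.idx line has the layout `urlkey timestamp<TAB>cdx file<TAB>offset<TAB>length<TAB>sequence number`, and that every block inside a cdx-XXXXX.gz file is an independent gzip member. The names `parse_cluster_line` and `read_block`, the demo file, and the `com,example)` keys are hypothetical and only serve the illustration.

```python
import gzip
import os
import tempfile

def parse_cluster_line(line):
    """Split one cluster.idx line into its fields (assumed layout, see lead-in)."""
    key_part, cdx_file, offset, length, seq = line.rstrip("\n").split("\t")
    urlkey, timestamp = key_part.split(" ", 1)
    return urlkey, timestamp, cdx_file, int(offset), int(length)

def read_block(cdx_path, offset, length):
    """Read and decompress a single gzip member from a cdx file."""
    with open(cdx_path, "rb") as f:
        f.seek(offset)
        compressed = f.read(length)
    return gzip.decompress(compressed).decode("utf-8").splitlines()

# Tiny stand-in for a cdx file: two independently gzipped blocks, concatenated.
block_a = gzip.compress(b"com,example)/a 20170101000000 {}\n"
                        b"com,example)/b 20170101000000 {}\n")
block_b = gzip.compress(b"com,example)/c 20170101000000 {}\n"
                        b"com,example)/d 20170101000000 {}\n")
with tempfile.NamedTemporaryFile(delete=False, suffix=".gz") as tmp:
    tmp.write(block_a + block_b)
    demo_path = tmp.name

# Hypothetical cluster.idx entry pointing at the second block.
idx_line = "com,example)/c 20170101000000\tcdx-demo.gz\t%d\t%d\t2" % (
    len(block_a), len(block_b))
urlkey, timestamp, cdx_file, offset, length = parse_cluster_line(idx_line)
lines = read_block(demo_path, offset, length)
os.remove(demo_path)

print(lines[0])  # first line of the block, same key as the cluster.idx entry
print(lines[1])  # another line from the same block
```

Reading only `length` bytes starting at `offset` avoids decompressing the whole cdx file when just one block is needed.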
So, if I understand correctly, all the lines in, for example, cdx-00165.gz match the common prefix com,tripadvisor)/*. Now I am confused again: how did the index server produce this output for the request
http://index.commoncrawl.org/CC-MAIN-2017-26-index?url=tripadvisor.com%2F&output=json&fl=url,urlkey,filename
{"url": "http://tripadvisor.com/", "urlkey": "com,tripadvisor)/", "filename": "crawl-data/CC-MAIN-2017-26/segments/1498128319912.4/crawldiagnostics/CC-MAIN-20170622220117-20170623000117-00482.warc.gz"}
{"url": "https://www.tripadvisor.com/", "urlkey": "com,tripadvisor)/", "filename": "crawl-data/CC-MAIN-2017-26/segments/1498128319912.4/warc/CC-MAIN-20170622220117-20170623000117-00008.warc.gz"}
{"url": "https://www.tripadvisor.com", "urlkey": "com,tripadvisor)/", "filename": "crawl-data/CC-MAIN-2017-26/segments/1498128319943.55/robotstxt/CC-MAIN-20170623012730-20170623032730-00551.warc.gz"}
{"url": "https://www.tripadvisor.com/", "urlkey": "com,tripadvisor)/", "filename": "crawl-data/CC-MAIN-2017-26/segments/1498128319943.55/warc/CC-MAIN-20170623012730-20170623032730-00008.warc.gz"}
{"url": "http://www.tripadvisor.com/", "urlkey": "com,tripadvisor)/", "filename": "crawl-data/CC-MAIN-2017-26/segments/1498128320023.23/crawldiagnostics/CC-MAIN-20170623063716-20170623083716-00585.warc.gz"}
{"url": "https://www.tripadvisor.com/", "urlkey": "com,tripadvisor)/", "filename": "crawl-data/CC-MAIN-2017-26/segments/1498128320023.23/warc/CC-MAIN-20170623063716-20170623083716-00008.warc.gz"}
{"url": "https://www.tripadvisor.com/", "urlkey": "com,tripadvisor)/", "filename": "crawl-data/CC-MAIN-2017-26/segments/1498128320040.36/warc/CC-MAIN-20170623082050-20170623102050-00008.warc.gz"}
{"url": "https://www.tripadvisor.com/", "urlkey": "com,tripadvisor)/", "filename": "crawl-data/CC-MAIN-2017-26/segments/1498128320077.32/warc/CC-MAIN-20170623170148-20170623190148-00008.warc.gz"}
{"url": "https://www.tripadvisor.com/", "urlkey": "com,tripadvisor)/", "filename": "crawl-data/CC-MAIN-2017-26/segments/1498128320209.66/warc/CC-MAIN-20170624013626-20170624033626-00008.warc.gz"}
{"url": "https://www.tripadvisor.com", "urlkey": "com,tripadvisor)/", "filename": "crawl-data/CC-MAIN-2017-26/segments/1498128320261.6/robotstxt/CC-MAIN-20170624115542-20170624135542-00551.warc.gz"}
{"url": "https://www.tripadvisor.com/", "urlkey": "com,tripadvisor)/", "filename": "crawl-data/CC-MAIN-2017-26/segments/1498128320261.6/warc/CC-MAIN-20170624115542-20170624135542-00008.warc.gz"}
{"url": "http://www.tripadvisor.com/", "urlkey": "com,tripadvisor)/", "filename": "crawl-data/CC-MAIN-2017-26/segments/1498128320264.42/robotstxt/CC-MAIN-20170624152159-20170624172159-00585.warc.gz"}
{"url": "https://www.tripadvisor.com/", "urlkey": "com,tripadvisor)/", "filename": "crawl-data/CC-MAIN-2017-26/segments/1498128320264.42/warc/CC-MAIN-20170624152159-20170624172159-00008.warc.gz"}
{"url": "http://www.tripadvisor.com/", "urlkey": "com,tripadvisor)/", "filename": "crawl-data/CC-MAIN-2017-26/segments/1498128320338.89/crawldiagnostics/CC-MAIN-20170624203022-20170624223022-00585.warc.gz"}
{"url": "https://www.tripadvisor.com/", "urlkey": "com,tripadvisor)/", "filename": "crawl-data/CC-MAIN-2017-26/segments/1498128320338.89/warc/CC-MAIN-20170624203022-20170624223022-00008.warc.gz"}
{"url": "https://www.tripadvisor.com/", "urlkey": "com,tripadvisor)/", "filename": "crawl-data/CC-MAIN-2017-26/segments/1498128320362.97/warc/CC-MAIN-20170624221310-20170625001310-00008.warc.gz"}
{"url": "https://www.tripadvisor.com/", "urlkey": "com,tripadvisor)/", "filename": "crawl-data/CC-MAIN-2017-26/segments/1498128320395.62/warc/CC-MAIN-20170625032210-20170625052210-00008.warc.gz"}
{"url": "https://www.tripadvisor.com/", "urlkey": "com,tripadvisor)/", "filename": "crawl-data/CC-MAIN-2017-26/segments/1498128320595.24/warc/CC-MAIN-20170625235624-20170626015624-00008.warc.gz"}
{"url": "https://www.tripadvisor.com/", "urlkey": "com,tripadvisor)/", "filename": "crawl-data/CC-MAIN-2017-26/segments/1498128320669.83/warc/CC-MAIN-20170626032235-20170626052235-00008.warc.gz"}
{"url": "https://www.tripadvisor.com/", "urlkey": "com,tripadvisor)/", "filename": "crawl-data/CC-MAIN-2017-26/segments/1498128320679.64/warc/CC-MAIN-20170626050425-20170626070425-00008.warc.gz"}
{"url": "https://www.tripadvisor.com/", "urlkey": "com,tripadvisor)/", "filename": "crawl-data/CC-MAIN-2017-26/segments/1498128320887.15/warc/CC-MAIN-20170627013832-20170627033832-00008.warc.gz"}
{"url": "http://www.tripadvisor.com/", "urlkey": "com,tripadvisor)/", "filename": "crawl-data/CC-MAIN-2017-26/segments/1498128321025.86/crawldiagnostics/CC-MAIN-20170627064714-20170627084714-00585.warc.gz"}
{"url": "https://www.tripadvisor.com/", "urlkey": "com,tripadvisor)/", "filename": "crawl-data/CC-MAIN-2017-26/segments/1498128321025.86/warc/CC-MAIN-20170627064714-20170627084714-00008.warc.gz"}
{"url": "https://www.tripadvisor.com./", "urlkey": "com,tripadvisor)/", "filename": "crawl-data/CC-MAIN-2017-26/segments/1498128322275.28/warc/CC-MAIN-20170628014207-20170628034207-00200.warc.gz"}
{"url": "https://www.tripadvisor.com/", "urlkey": "com,tripadvisor)/", "filename": "crawl-data/CC-MAIN-2017-26/segments/1498128322873.10/warc/CC-MAIN-20170628065139-20170628085139-00008.warc.gz"}
{"url": "https://www.tripadvisor.com/", "urlkey": "com,tripadvisor)/", "filename": "crawl-data/CC-MAIN-2017-26/segments/1498128323680.18/warc/CC-MAIN-20170628120308-20170628140308-00008.warc.gz"}
{"url": "https://www.tripadvisor.com/", "urlkey": "com,tripadvisor)/", "filename": "crawl-data/CC-MAIN-2017-26/segments/1498128323870.46/warc/CC-MAIN-20170629051817-20170629071817-00008.warc.gz"}
{"url": "http://www.tripadvisor.com/", "urlkey": "com,tripadvisor)/", "filename": "crawl-data/CC-MAIN-2017-26/segments/1498128323970.81/crawldiagnostics/CC-MAIN-20170629121355-20170629141355-00585.warc.gz"}
{"url": "https://www.tripadvisor.com/", "urlkey": "com,tripadvisor)/", "filename": "crawl-data/CC-MAIN-2017-26/segments/1498128323970.81/warc/CC-MAIN-20170629121355-20170629141355-00008.warc.gz"}
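To double-check that every returned record really shares the same key, the output can be parsed line by line. This is a minimal sketch, assuming the response was saved as text with one JSON object per line. The helper name `distinct_values` is hypothetical, and the three sample lines are copied from the output above.

```python
import json

def distinct_values(json_lines, field):
    """Collect the distinct values of one field from JSON-lines text."""
    values = set()
    for line in json_lines.splitlines():
        line = line.strip()
        if line:  # skip blank lines
            values.add(json.loads(line)[field])
    return values

sample = """\
{"url": "http://tripadvisor.com/", "urlkey": "com,tripadvisor)/", "filename": "crawl-data/CC-MAIN-2017-26/segments/1498128319912.4/crawldiagnostics/CC-MAIN-20170622220117-20170623000117-00482.warc.gz"}
{"url": "https://www.tripadvisor.com", "urlkey": "com,tripadvisor)/", "filename": "crawl-data/CC-MAIN-2017-26/segments/1498128319943.55/robotstxt/CC-MAIN-20170623012730-20170623032730-00551.warc.gz"}
{"url": "https://www.tripadvisor.com./", "urlkey": "com,tripadvisor)/", "filename": "crawl-data/CC-MAIN-2017-26/segments/1498128322275.28/warc/CC-MAIN-20170628014207-20170628034207-00200.warc.gz"}
"""

print(distinct_values(sample, "urlkey"))       # one key shared by all records
print(sorted(distinct_values(sample, "url")))  # several URL spellings
```

In these sample lines, different spellings of the URL (with or without `www.`, with or without a trailing slash or dot) all map to the single key `com,tripadvisor)/`.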
There is literally nothing after the "/" in the urlkey, and I can't find a record like that in cluster.idx, i.e. one with whitespace directly after the slash. I have tried, and it is just not present.
grep -P '^com,tripadvisor\)/\s' cluster.idx
returns no match. The regexp itself is correct: it returns the expected result for a line that I know is present.
grep -P '^com,tripadvisor\)/alllocations-g297914-c3-o100-restaurants-khao_lak_phang_nga_province\.html\s' cluster.idx
returns