How to read raw index files


brano199

Jul 31, 2017, 4:33:01 PM
to Common Crawl
Hello,

I am having trouble understanding how to read the cluster.idx file. I understand that each line contains information in the format

url search key, url timestamp, original url, archive file, archive offset, archive length

However, when I compared it with the output of http://index.commoncrawl.org/CC-MAIN-2017-26-index?url=tripadvisor.com%2F&output=json

there is no line which starts with exactly "com,tripadvisor)/". If there were such a line, I would understand that the data is located in the given .gz file at the given offset and length, but there is no such reference, so how did the Python index server get that output? I can't use the Python web server because it takes too long to download the results into a buffer; I need direct access, so I want to implement it myself.

Another question is how do I build the prefix B-tree from those URLs? Does com,tripadvisor)/Hotel create a tree like this?

    com
     |
    tripadvisor
     |
    hotel

Can you help me, please?

Tom Morris

Jul 31, 2017, 7:15:10 PM
to common...@googlegroups.com
I see plenty of lines that begin with that string. The first few are:

com,tripadvisor)/alllocations-g297914-c3-o100-restaurants-khao_lak_phang_nga_province.html 20170624092242       cdx-00165.gz    556156506       108645  725890
com,tripadvisor)/alllocations-g652202-c1-hotels-oudemirdum_friesland_province.html 20170627220256       cdx-00165.gz    556265151       211894  725891
com,tripadvisor)/attraction_review-g186460-d187982-reviews-welsh_national_opera-cardiff_southern_wales_wales.html 20170626231242        cdx-00165.gz    556477045       208522  725892
com,tripadvisor)/attraction_review-g274887-d1909678-reviews-or420-budapest_city_tour_hop_on_hop_off_giraffe_red_bus-budapest_central_hungary.html 20170627005439        cdx-00165.gz    556685567       203087  725893

cluster.idx is the index to the indexes, so you'll need to look in the index file cdx-00165.gz for the actual index data with the pointers to the crawl files themselves.
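
For illustration (an addition, not part of the original reply), here is a minimal Python sketch of pulling one such cdx block straight from S3 with an HTTP Range request and decompressing it. The S3 URL layout for the cc-index collection is an assumption; the offset and length are taken from the first cluster.idx line quoted above.

import gzip
import urllib.request

# Assumed public URL of one cdx shard of the CC-MAIN-2017-26 index
CDX_URL = ("https://commoncrawl.s3.amazonaws.com/cc-index/collections/"
           "CC-MAIN-2017-26/indexes/cdx-00165.gz")
offset, length = 556156506, 108645          # from the cluster.idx line above

req = urllib.request.Request(CDX_URL)
req.add_header("Range", "bytes=%d-%d" % (offset, offset + length - 1))
with urllib.request.urlopen(req) as resp:
    block = gzip.decompress(resp.read())    # each block is a standalone gzip member

for line in block.decode("utf-8").splitlines()[:3]:
    print(line)                             # first few CDX records of the block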

Tom


brano199

Aug 1, 2017, 5:01:43 AM
to Common Crawl
I understand that for each record in cluster.idx I need to find the matching cdx-XXXXX.gz file and extract the gzipped block, and I actually found out that the first line contains the result.
Running
dd bs=1 skip=556156506 count=108645 if=cdx-00165.gz | gunzip -c | head -n 1
outputs the expected result
com,tripadvisor)/alllocations-g297914-c3-o100-restaurants-khao_lak_phang_nga_province.html 20170624092242 {"url": "https://www.tripadvisor.com/AllLocations-g297914-c3-o100-Restaurants-Khao_Lak_Phang_Nga_Province.html", "mime": "text/plain", "mime-detected": "text/html", "status": "301", "digest": "3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ", "length": "1505", "offset": "27705127", "filename": "crawl-data/CC-MAIN-2017-26/segments/1498128320243.11/crawldiagnostics/CC-MAIN-20170624082900-20170624102900-00271.warc.gz"}

However, I am still a little bit confused. When I save the unzipped result to a file by doing
dd bs=1 skip=556156506 count=108645 if=cdx-00165.gz | gunzip -c > file.txt
and pick a line from it, e.g.
com,tripadvisor)/alllocations-g298053-c1-hotels-kryvyy_rih_dnipropetrovsk_oblast.html 20170625053924 {"url": "https://www.tripadvisor.com/AllLocations-g298053-c1-Hotels-Kryvyy_Rih_Dnipropetrovsk_Oblast.html", "mime": "text/plain", "mime-detected": "text/html", "status": "301", "digest": "3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ", "length": "1236", "offset": "26668329", "filename": "crawl-data/CC-MAIN-2017-26/segments/1498128320438.53/crawldiagnostics/CC-MAIN-20170625050430-20170625070430-00042.warc.gz"}
I cannot find it in cluster.idx; it is not present there.
grep -P 'com,tripadvisor\)/alllocations-g298053' cluster.idx
returns no results.

So, if I understand it correctly, all the entries in, for example, cdx-00165.gz match the common prefix com,tripadvisor)/*. Now I am really confused again: how did the index server figure out this output for the request http://index.commoncrawl.org/CC-MAIN-2017-26-index?url=tripadvisor.com%2F&output=json&fl=url,urlkey,filename
{"url": "http://tripadvisor.com/", "urlkey": "com,tripadvisor)/", "filename": "crawl-data/CC-MAIN-2017-26/segments/1498128319912.4/crawldiagnostics/CC-MAIN-20170622220117-20170623000117-00482.warc.gz"}
{"url": "https://www.tripadvisor.com/", "urlkey": "com,tripadvisor)/", "filename": "crawl-data/CC-MAIN-2017-26/segments/1498128319912.4/warc/CC-MAIN-20170622220117-20170623000117-00008.warc.gz"}
{"url": "https://www.tripadvisor.com", "urlkey": "com,tripadvisor)/", "filename": "crawl-data/CC-MAIN-2017-26/segments/1498128319943.55/robotstxt/CC-MAIN-20170623012730-20170623032730-00551.warc.gz"}
{"url": "https://www.tripadvisor.com/", "urlkey": "com,tripadvisor)/", "filename": "crawl-data/CC-MAIN-2017-26/segments/1498128319943.55/warc/CC-MAIN-20170623012730-20170623032730-00008.warc.gz"}
{"url": "http://www.tripadvisor.com/", "urlkey": "com,tripadvisor)/", "filename": "crawl-data/CC-MAIN-2017-26/segments/1498128320023.23/crawldiagnostics/CC-MAIN-20170623063716-20170623083716-00585.warc.gz"}
{"url": "https://www.tripadvisor.com/", "urlkey": "com,tripadvisor)/", "filename": "crawl-data/CC-MAIN-2017-26/segments/1498128320023.23/warc/CC-MAIN-20170623063716-20170623083716-00008.warc.gz"}
{"url": "https://www.tripadvisor.com/", "urlkey": "com,tripadvisor)/", "filename": "crawl-data/CC-MAIN-2017-26/segments/1498128320040.36/warc/CC-MAIN-20170623082050-20170623102050-00008.warc.gz"}
{"url": "https://www.tripadvisor.com/", "urlkey": "com,tripadvisor)/", "filename": "crawl-data/CC-MAIN-2017-26/segments/1498128320077.32/warc/CC-MAIN-20170623170148-20170623190148-00008.warc.gz"}
{"url": "https://www.tripadvisor.com/", "urlkey": "com,tripadvisor)/", "filename": "crawl-data/CC-MAIN-2017-26/segments/1498128320209.66/warc/CC-MAIN-20170624013626-20170624033626-00008.warc.gz"}
{"url": "https://www.tripadvisor.com", "urlkey": "com,tripadvisor)/", "filename": "crawl-data/CC-MAIN-2017-26/segments/1498128320261.6/robotstxt/CC-MAIN-20170624115542-20170624135542-00551.warc.gz"}
{"url": "https://www.tripadvisor.com/", "urlkey": "com,tripadvisor)/", "filename": "crawl-data/CC-MAIN-2017-26/segments/1498128320261.6/warc/CC-MAIN-20170624115542-20170624135542-00008.warc.gz"}
{"url": "http://www.tripadvisor.com/", "urlkey": "com,tripadvisor)/", "filename": "crawl-data/CC-MAIN-2017-26/segments/1498128320264.42/robotstxt/CC-MAIN-20170624152159-20170624172159-00585.warc.gz"}
{"url": "https://www.tripadvisor.com/", "urlkey": "com,tripadvisor)/", "filename": "crawl-data/CC-MAIN-2017-26/segments/1498128320264.42/warc/CC-MAIN-20170624152159-20170624172159-00008.warc.gz"}
{"url": "http://www.tripadvisor.com/", "urlkey": "com,tripadvisor)/", "filename": "crawl-data/CC-MAIN-2017-26/segments/1498128320338.89/crawldiagnostics/CC-MAIN-20170624203022-20170624223022-00585.warc.gz"}
{"url": "https://www.tripadvisor.com/", "urlkey": "com,tripadvisor)/", "filename": "crawl-data/CC-MAIN-2017-26/segments/1498128320338.89/warc/CC-MAIN-20170624203022-20170624223022-00008.warc.gz"}
{"url": "https://www.tripadvisor.com/", "urlkey": "com,tripadvisor)/", "filename": "crawl-data/CC-MAIN-2017-26/segments/1498128320362.97/warc/CC-MAIN-20170624221310-20170625001310-00008.warc.gz"}
{"url": "https://www.tripadvisor.com/", "urlkey": "com,tripadvisor)/", "filename": "crawl-data/CC-MAIN-2017-26/segments/1498128320395.62/warc/CC-MAIN-20170625032210-20170625052210-00008.warc.gz"}
{"url": "https://www.tripadvisor.com/", "urlkey": "com,tripadvisor)/", "filename": "crawl-data/CC-MAIN-2017-26/segments/1498128320595.24/warc/CC-MAIN-20170625235624-20170626015624-00008.warc.gz"}
{"url": "https://www.tripadvisor.com/", "urlkey": "com,tripadvisor)/", "filename": "crawl-data/CC-MAIN-2017-26/segments/1498128320669.83/warc/CC-MAIN-20170626032235-20170626052235-00008.warc.gz"}
{"url": "https://www.tripadvisor.com/", "urlkey": "com,tripadvisor)/", "filename": "crawl-data/CC-MAIN-2017-26/segments/1498128320679.64/warc/CC-MAIN-20170626050425-20170626070425-00008.warc.gz"}
{"url": "https://www.tripadvisor.com/", "urlkey": "com,tripadvisor)/", "filename": "crawl-data/CC-MAIN-2017-26/segments/1498128320887.15/warc/CC-MAIN-20170627013832-20170627033832-00008.warc.gz"}
{"url": "http://www.tripadvisor.com/", "urlkey": "com,tripadvisor)/", "filename": "crawl-data/CC-MAIN-2017-26/segments/1498128321025.86/crawldiagnostics/CC-MAIN-20170627064714-20170627084714-00585.warc.gz"}
{"url": "https://www.tripadvisor.com/", "urlkey": "com,tripadvisor)/", "filename": "crawl-data/CC-MAIN-2017-26/segments/1498128321025.86/warc/CC-MAIN-20170627064714-20170627084714-00008.warc.gz"}
{"url": "https://www.tripadvisor.com./", "urlkey": "com,tripadvisor)/", "filename": "crawl-data/CC-MAIN-2017-26/segments/1498128322275.28/warc/CC-MAIN-20170628014207-20170628034207-00200.warc.gz"}
{"url": "https://www.tripadvisor.com/", "urlkey": "com,tripadvisor)/", "filename": "crawl-data/CC-MAIN-2017-26/segments/1498128322873.10/warc/CC-MAIN-20170628065139-20170628085139-00008.warc.gz"}
{"url": "https://www.tripadvisor.com/", "urlkey": "com,tripadvisor)/", "filename": "crawl-data/CC-MAIN-2017-26/segments/1498128323680.18/warc/CC-MAIN-20170628120308-20170628140308-00008.warc.gz"}
{"url": "https://www.tripadvisor.com/", "urlkey": "com,tripadvisor)/", "filename": "crawl-data/CC-MAIN-2017-26/segments/1498128323870.46/warc/CC-MAIN-20170629051817-20170629071817-00008.warc.gz"}
{"url": "http://www.tripadvisor.com/", "urlkey": "com,tripadvisor)/", "filename": "crawl-data/CC-MAIN-2017-26/segments/1498128323970.81/crawldiagnostics/CC-MAIN-20170629121355-20170629141355-00585.warc.gz"}
{"url": "https://www.tripadvisor.com/", "urlkey": "com,tripadvisor)/", "filename": "crawl-data/CC-MAIN-2017-26/segments/1498128323970.81/warc/CC-MAIN-20170629121355-20170629141355-00008.warc.gz"}


There is literally nothing after the "/", and I can't find any record in cluster.idx which has a "\t" right after the slash. I have tried, and it is just not present:
grep -P '^com,tripadvisor\)/\s' cluster.idx
returns no match. The regexp itself is correct; it returns the expected result for a line that I know is present:
grep -P '^com,tripadvisor\)/alllocations-g297914-c3-o100-restaurants-khao_lak_phang_nga_province\.html\s' cluster.idx
returns a match.

Sebastian Nagel

Aug 1, 2017, 5:33:56 AM
to common...@googlegroups.com
Hi,

> i cannot find it in the cluster.idx,it is not present there.

The cdx-*.gz files are organized in blocks of 3000 lines/records.
The cluster.idx contains only the first line of each block.
Because the index is sorted by SURT URL, a binary search over
cluster.idx lets you quickly find in which block(s) of the
cdx-*.gz files the requested URL or domain is located.
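
A rough Python sketch of that lookup (an illustration, not part of the original reply; the cluster.idx field layout of "<SURT url> <timestamp>\t<cdx file>\t<offset>\t<length>\t<block id>" is assumed from the lines quoted earlier):

import bisect

def load_cluster_idx(path="cluster.idx"):
    keys, blocks = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            key, cdx_file, offset, length, _ = line.rstrip("\n").split("\t")
            keys.append(key)                       # key = "<surt url> <timestamp>"
            blocks.append((cdx_file, int(offset), int(length)))
    return keys, blocks

def blocks_for_prefix(keys, blocks, prefix):
    # The block that may hold the first matching record starts one entry *before*
    # the first key >= prefix, because a block covers everything up to the next key.
    lo = max(bisect.bisect_left(keys, prefix) - 1, 0)
    # Past the last key that still starts with the prefix:
    hi = bisect.bisect_right(keys, prefix + "\xff")
    return blocks[lo:hi]

# keys, blocks = load_cluster_idx()
# for cdx_file, offset, length in blocks_for_prefix(keys, blocks, "com,tripadvisor)/"):
#     ...  # fetch/decompress each block and keep only the lines starting with the prefix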

The CDX format is explained here:
https://github.com/ikreymer/pywb/wiki/CDX-Index-Format

For the "Sort-friendly URI Reordering Transform (SURT)":
https://github.com/internetarchive/surt

Best,
Sebastian

brano199

Aug 1, 2017, 7:12:09 AM
to Common Crawl
Just to be sure I understand correctly: given these two lines taken from cluster.idx

com,tripadvisor)/alllocations-g297914-c3-o100-restaurants-khao_lak_phang_nga_province.html 20170624092242       cdx-00165.gz    556156506       108645  725890
com,tripadvisor)/alllocations-g652202-c1-hotels-oudemirdum_friesland_province.html 20170627220256       cdx-00165.gz    556265151       211894  725891

then the file cdx-00165.gz contains, in its byte range 556156506-556265151, records which are lexicographically
>= com,tripadvisor)/alllocations-g297914-c3-o100-restaurants-khao_lak_phang_nga_province.html 20170624092242
and
< com,tripadvisor)/alllocations-g652202-c1-hotels-oudemirdum_friesland_province.html 20170627220256

Is that correct?

If I understand correctly, then to efficiently implement binary search on those strings I would need a prefix tree; I have found one implementation here: https://github.com/Tessil/hat-trie.

Sebastian Nagel

Aug 1, 2017, 8:02:20 AM
to common...@googlegroups.com
Hi,

see remarks inline.

On 08/01/2017 01:12 PM, brano199 wrote:
> Just to be sure if i understand correctly. So given these two lines taken from cluster.idx
> com,tripadvisor)/alllocations-g297914-c3-o100-restaurants-khao_lak_phang_nga_province.html
> 20170624092242 cdx-00165.gz 556156506 108645 725890
> com,tripadvisor)/alllocations-g652202-c1-hotels-oudemirdum_friesland_province.html 20170627220256
> cdx-00165.gz 556265151 211894 725891
>
> then file cdx-00165.gz contains in its byte range 556156506-556265151 records which are lexicographically
> >= com,tripadvisor)/alllocations-g297914-c3-o100-restaurants-khao_lak_phang_nga_province.html 20170624092242
> and
> < com,tripadvisor)/alllocations-g652202-c1-hotels-oudemirdum_friesland_province.html 20170627220256
>
> Is that correct?

Yes, that's correct.

>
> If i correctly understand, then to efficiently implement binary search on those strings, i would
> need a prefix tree, I have found one of the implementations here https://github.com/Tessil/hat-trie.

A trie is just one way to perform look-ups in a set of strings. Binary search [1] in a sorted list
is another way. An efficient trie implementation would be even faster. However: the cluster.idx
contains 1 million lines for a crawl archive of 3 billion pages. A binary search requires only
ld(1,000,000) ≈ 20 string comparisons. Intuitively, the time required for 20 string comparisons is
negligible compared to the time spent fetching and decompressing the gzipped block from the cdx-*.gz.

My recommendation is to use cc-index-server / pywb.
If you cannot run it on AWS in the us-east-1 region:
- fetch the 300 cdx-*.gz files,
- set up a local web server which supports range requests so that it serves these files (a minimal sketch of such a server follows below),
- let the URL in cc-index-server/config.yaml point to your local server.
This should take only a couple of hours.
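
As an illustration of the "local web server which supports range requests" step, here is a minimal Python sketch (an addition, not Common Crawl or pywb code; the directory, port, and the missing error handling / path sanitisation are all assumptions):

import os
import re
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

INDEX_DIR = "/data/cc-index/CC-MAIN-2017-26"    # assumed local location of the cdx-*.gz files

class RangeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        path = os.path.join(INDEX_DIR, self.path.lstrip("/"))
        if not os.path.isfile(path):
            self.send_error(404)
            return
        size = os.path.getsize(path)
        start, end = 0, size - 1
        m = re.match(r"bytes=(\d+)-(\d*)", self.headers.get("Range", ""))
        if m:                                    # serve only the requested byte range
            start = int(m.group(1))
            if m.group(2):
                end = min(int(m.group(2)), size - 1)
            self.send_response(206)
            self.send_header("Content-Range", "bytes %d-%d/%d" % (start, end, size))
        else:
            self.send_response(200)
        self.send_header("Accept-Ranges", "bytes")
        self.send_header("Content-Length", str(end - start + 1))
        self.end_headers()
        # Fine for small cdx blocks; a full-file GET would load everything into memory.
        with open(path, "rb") as f:
            f.seek(start)
            self.wfile.write(f.read(end - start + 1))

if __name__ == "__main__":
    ThreadingHTTPServer(("127.0.0.1", 8080), RangeHandler).serve_forever()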

Of course, if it's about learning how to implement quick string lookups, then go ahead.
We (or the author of pywb, Ilya Kreymer) will be happy to integrate your improvements.
But my estimate is that it'll take days or even weeks to find a solution which significantly
outperforms the current state.

Best,
Sebastian


[1] https://en.wikipedia.org/wiki/Binary_search_algorithm

brano199

Aug 1, 2017, 8:17:15 AM
to Common Crawl
I am not saying you implemented it wrong; I am trying to find the bottleneck on my side. Here is what I have discovered so far: even when I download all those 300 cdx files and run the Python server, a request to get just one of the 81 pages for a given query takes 1.5 s.
But when I download it using curl into my program, actually reading those few-MB files takes just a few milliseconds, so that's why I need to skip the communication between Python and C++. This is just the situation from my point of view.

I will update this post if using custom traversal solves my problem.

Sebastian Nagel

Aug 1, 2017, 8:22:03 AM
to common...@googlegroups.com
> I will update this post if using custom traversal solves my problem.

Thanks. We're eager to hear from you!

brano199

Aug 11, 2017, 3:30:29 PM
to Common Crawl
Hello Sebastian,

I have successfully managed to read the index files on my own. It is an order of magnitude faster, and it can also use multiple threads for opening and reading the files.

I have made some assumptions to simplify the code for the SURT transformation. I basically followed these guidelines: http://crawler.archive.org/articles/user_manual/glossary.html#surt
1) I assume there is never a trailing comma and no scheme, so http://www.archive.org = org,archive) in SURT form.
2) IP addresses are always reversed.
3) If the original URL contains a slash, it is also present in the final SURT: http://www.archive.org/ = org,archive)/.

Changes were also made when searching the index, because most of the time you just want to match all subdomains, so for each SURT a SURT prefix is built as follows (a small sketch follows below):
1) if there are at least 2 slashes, remove everything after the last slash;
2) if the resulting form ends in an off-parenthesis (')'), remove it.

For instance, for the original SURT org,archive) the corresponding prefix would be "org,archive", which also matches subdomains like home.archive.org.
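
A small Python sketch of that prefix derivation, based on one reading of the two rules above (an illustration, not the poster's actual code):

def surt_prefix(surt):
    if surt.count("/") >= 2:
        surt = surt.rsplit("/", 1)[0] + "/"   # 1) drop everything after the last slash
    if surt.endswith(")"):
        surt = surt[:-1]                      # 2) drop a trailing off-parenthesis
    return surt

print(surt_prefix("org,archive)"))              # -> "org,archive"  (also matches home.archive.org)
print(surt_prefix("org,archive)/about/terms"))  # -> "org,archive)/about/"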


Unfortunately, I am still facing one difficulty when the prefix can't be found in the tree. For example, consider these two cases:
1) SURT prefix = com,tripadvisor
By checking with grep you can find that cluster.idx contains 3941 matching records for the CC-MAIN-2017-26 index, and everything works fine.

2) SURT prefix = hk,com,tripadvisor is not present in cluster.idx, so it yields 0 results.
Your implementation, however, yields results for this type of domain query: http://test-index.commoncrawl.org/CC-MAIN-2017-26-index?url=*.tripadvisor.com.hk&output=json&showNumPages=true
How do you deal with these cases? Do you just first try the whole match and then try to find a shorter prefix? I can confirm that trying
hk,com,tri
instead of the whole prefix
hk,com,tripadvisor
yields one result in cluster.idx, and we can find the records in the block of 3000 lines that it points to.


What also bothers me is that sometimes it is necessary to look at the line in cluster.idx before the first prefix match. Is that always necessary, or do you somehow compare the prefixes and determine whether matches can lie in the previous range of 3000 records with a different prefix?
What I mean can be illustrated by the following commands:

grep -P "com,tripadvisor" cluster.idx | head -n 1
ar,com,tripadvisor)/alllocations-g183461-c1-hotels-sussex_new_brunswick.html 20170626194209    cdx-00000.gz    551318036    122856    2789

sed -n '2788p' < cluster.idx
ar,com,tribunahacker)/2016/10/genial-administrador-de-archivos-para-android 20170628172951    cdx-00000.gz    551161655    156381    2788

dd if=cdx-00000.gz ibs=1 skip=551161655 count=156381 | gunzip -c | grep -Pc "com,tripadvisor"
156381+0 records in
305+1 records out
156381 bytes (156 kB, 153 KiB) copied, 0,0489068 s, 3,2 MB/s
1995

 

Tom Morris

Aug 11, 2017, 5:12:50 PM
to common...@googlegroups.com
cluster.idx has the first entry in each chunk, not necessarily every prefix that you might want to search for. If you're looking for something without many entries, it could be in the middle of a chunk. Similarly, if you're searching for something with many entries, there's no guarantee that there won't be entries in the chunk before the chunk with a matching prefix in cluster.idx.
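
To make that concrete with the numbers from the previous post, a small Python sketch (an illustration, not from the original reply; it mirrors the grep/sed/dd commands above and assumes cdx-00000.gz is stored locally). Like the unanchored grep, it counts substring matches, so keys such as ar,com,tripadvisor) are counted too:

import gzip

def count_matches(cdx_path, offset, length, pattern):
    with open(cdx_path, "rb") as f:
        f.seek(offset)
        block = gzip.decompress(f.read(length))    # each 3000-record block is its own gzip member
    return sum(1 for line in block.decode("utf-8").splitlines() if pattern in line)

# Block 2788 (starts with ar,com,tribunahacker), one block *before* the first prefix match:
print(count_matches("cdx-00000.gz", 551161655, 156381, "com,tripadvisor"))   # should reproduce the 1995 above
# Block 2789 (the first cluster.idx line matching "com,tripadvisor"):
print(count_matches("cdx-00000.gz", 551318036, 122856, "com,tripadvisor"))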

Tom


brano199

Aug 14, 2017, 2:36:20 PM
to Common Crawl
Thank you, Tom. Once I distinguished all the cases you mentioned:
1) cluster.idx may contain the whole prefix
   -> then there exists a range of cluster.idx lines matching the prefix
   -> the line before the first match may also point to a block containing results
2) cluster.idx may not contain the prefix itself, only two keys prefix1 < prefix < prefix2 around it
   -> then the matching records are contained in the chunk that starts at some other key in cluster.idx

it is working as expected and provides the same results as your index server.

However, I still need help with one last big problem. When the index gives me the name of a file hosted on Amazon plus a length and offset, I request the byte range over HTTP. Unfortunately, downloading only a 2 kB range takes 300 ms. I need to process roughly 11 million entries like this just for one crawl. Can you help me figure out what I am doing wrong? I am using curl for the requests. Why is it so slow for a 2 kB range? :(

I have created a file to show me the curl download timing details, as follows:
curl-format.txt
    time_namelookup:  %{time_namelookup}\n
       time_connect:  %{time_connect}\n
    time_appconnect:  %{time_appconnect}\n
   time_pretransfer:  %{time_pretransfer}\n
      time_redirect:  %{time_redirect}\n
 time_starttransfer:  %{time_starttransfer}\n
                    ----------\n
         time_total:  %{time_total}\n

and when I run
curl -w "@curl-format.txt" -r 822555329-822557378 -O http://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2017-04/segments/1484560279657.18/warc/CC-MAIN-20170116095119-00156-ip-10-171-10-70.ec2.internal.warc.gz
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  2050  100  2050    0     0   5451      0 --:--:-- --:--:-- --:--:--  5452
    time_namelookup:  0,013
       time_connect:  0,134
    time_appconnect:  0,000
   time_pretransfer:  0,134
      time_redirect:  0,000
 time_starttransfer:  0,376
                    ----------
         time_total:  0,376



Sebastian Nagel

Aug 14, 2017, 3:54:10 PM
to common...@googlegroups.com
Hi,

> However guys i still require assistance with last big problem. When the index gives me file name
> hosted on Amazon, file length and offset, i request the byte range using the HTTP. Unfortunately
> when downloading only 2 kB range it takes 300 ms to do so. I need to analyze cca. 11 million of
> entries like this just for one crawl. Can you help me figure out what am i doing wrong? I am using
> Curl for the requests. Why is it so slow for 2 kB file? :(

From Germany I get similar results: around 300 ms, min. 275, sometimes up to 600.
When I sent the same request from a machine running in the AWS us-east-1 region,
the measured time ranged from 36 ms to 300 ms. I would explain the difference by
the time it takes the request and response to "travel". It's close to 10,000 km each way;
even under ideal conditions the round trip will take 20,000 km / 300,000 km/s ≈ 66 ms,
but here is a more detailed explanation:
http://royal.pingdom.com/2007/06/01/theoretical-vs-real-world-speed-limit-of-ping/
The baseline (min. 36 ms) is the time needed to process the request and fetch the data from disk
on one of S3's front-end servers.

If you really want to be faster: download the entire index (all cdx-*.gz files, around 250 GB)
and serve them from a local server. Of course, downloading will also take some time.
Alternatively, move the computation to the data - the big data principle of "data locality"
isn't wrong.

Best,
Sebastian

brano199

Aug 14, 2017, 4:32:26 PM
to Common Crawl
I actually did download the 250 GB of index files, but now I am talking about the actual WARC files. 36 ms would be OK, but at 300 ms per 2 kB request it would take 11,000,000 requests / 3 requests per second ≈ 42 days to process. So, if I am not wrong, the problem is that it doesn't matter whether I request 2 kB or 200 MB; each time I am going to suffer that latency. This means that making a lot of small requests is going to hurt a lot.

However, I have an idea; please tell me your opinion. We know that the server supports byte-range requests, but does it also support multiple byte ranges per request? I mean, for a file XX.warc.gz we would request range1, range2, etc. in one request.

brano199

Aug 14, 2017, 4:48:40 PM
to Common Crawl
Hmm, according to this (I don't know if it is still valid), those kinds of requests are not supported: http://docs.aws.amazon.com/AmazonS3/latest/dev/GettingObjectsUsingAPIs.html
"Amazon S3 doesn't support retrieving multiple ranges of data per GET request."

OK, so let's forget about that; we can only use a single byte range. Never mind, I think there can still be a solution, as follows:
1) for all matches in the index, remember the file name and the byte ranges we need;
2) for each XX.warc.gz, try to build as few larger blocks as possible that contain all the ranges we need;
3) request the larger gzipped chunks and locally extract the parts we need (see the sketch below).

This way we should not suffer as much from latency, because once we establish a connection the transfer gets faster over time; at least I noticed this behaviour when downloading the 1 GB parts of the 250 GB index, where it reached speeds of even 6 MB/s when dealing with larger chunks.
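
A small Python sketch of step 2 (an illustration of the idea, with a made-up gap threshold; the caller still has to slice the original sub-ranges out of each downloaded block):

def coalesce(ranges, max_gap=512 * 1024):
    """Merge (offset, length) ranges of one warc.gz file into fewer, larger blocks
    whenever the gap to the previous block is at most max_gap bytes."""
    merged = []
    for off, length in sorted(ranges):
        if merged and off - (merged[-1][0] + merged[-1][1]) <= max_gap:
            last_off, _ = merged[-1]
            merged[-1] = (last_off, off + length - last_off)   # extend the previous block
        else:
            merged.append((off, length))
    return merged

print(coalesce([(100, 50), (200, 50), (10000000, 50)]))
# -> [(100, 150), (10000000, 50)]   (first two merged, the distant one kept separate)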

Hugh Tomkins

Sep 1, 2017, 10:19:28 AM
to Common Crawl
Hey Brano,
Any way you'd consider releasing a gist of your code? I'm struggling with this myself!

brano199

Sep 1, 2017, 2:35:36 PM
to Common Crawl
Hello Hugh,

it took me about 14 days to figure this out, and I still don't have all the answers I was waiting for. Please also check my other thread https://groups.google.com/forum/#!topic/common-crawl/DPYMbMRavkI for my struggles with trying HTTP/2.

To summarize:
1) The S3 bucket seems to support up to 300 socket connections from my side. When I tried a bigger number, I got errors when processing requests; the curl error says: Couldn't connect to server.

2) Creating bigger chunks seemed like a good idea, but when I tried to create chunks of at least 5 MB, the chunks ended up being 1-3 MB anyway and the total download size increased about 5 times. That's why I decided to just issue the exact, unmodified range requests that I need: those cca. 5 million requests.

The real question here is whether it is somehow possible to increase the number of socket connections (e.g. by retrying the connection) or whether that is a limitation of S3. According to their (unclear, to me) documentation here http://docs.aws.amazon.com/AmazonS3/latest/dev/request-rate-perf-considerations.html it says:
"However, if you expect a rapid increase in the request rate for a bucket to more than 300 PUT/LIST/DELETE requests per second or more than 800 GET requests per second, we recommend that you open a support case to prepare for the workload and avoid any temporary limits on your request rate."

The average size of one crawled HTML page is about 100 kB and, as I mentioned before, let's assume each connection can make 2-3 requests per second, since a small 2 kB request takes around 350-550 ms. With 300 connections open, that is maybe at the edge of 600-800 requests per second. However, I don't think the problem is with the request rate, but with the number of open connections.

I will need to test whether there are any problems downloading all the HTML files I need this way; I haven't tested it on the whole dataset yet. In case it won't work for some reason, I will try a compromise between the two: make 1 MB chunks and use 50 or 100 socket connections instead of 300. I hope this approach will also saturate all of my 33 MB/s of available bandwidth.

This is the code I ended up using: https://curl.haxx.se/libcurl/c/multi-uv.html.
I have also modified it a little bit so that it re-uses handles, as advised here: https://www.onlineaspect.com/2009/01/26/how-to-use-curl_multi-without-blocking/.
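
For anyone following along who prefers Python, here is a rough equivalent of the bounded-connection idea (an illustrative addition only; the poster's actual code is the C++/libcurl/libuv variant linked above):

import concurrent.futures
import threading
import requests

_local = threading.local()

def _session():
    # One keep-alive session (and thus one reusable connection) per worker thread.
    if not hasattr(_local, "s"):
        _local.s = requests.Session()
    return _local.s

def fetch_range(url, offset, length):
    headers = {"Range": "bytes=%d-%d" % (offset, offset + length - 1)}
    r = _session().get(url, headers=headers, timeout=60)
    r.raise_for_status()
    return r.content                      # the gzipped WARC record(s) for this range

def fetch_all(jobs, max_workers=100):
    # jobs: iterable of (url, offset, length); max_workers bounds the number of open connections.
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(fetch_range, *job) for job in jobs]
        return [f.result() for f in futures]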

Please let me know your progress. Maybe we can figure this out together :)


brano199

Sep 1, 2017, 3:17:31 PM
to Common Crawl
It just crossed my mind that maybe it correlates with the actual bandwidth I have. I have 250 Mbit/s ~ 31.25 MB/s, and one request is cca. 100 kB; maybe that's why I can only use 300 sockets at most. If someone has more bandwidth, I would be really happy if they could test whether they can use (bandwidth in MB/s) / 0.1 MB = number of simultaneous socket connections.

Anyway, I have checked that PHP code again, and they seem to do just fine with a window of only 10 sockets open at a time. So maybe the right way to go is the hybrid solution mentioned above: a relatively small number of simultaneous socket connections (50-100) combined with 500 kB - 1 MB requests. I will test this tomorrow and report whether it also uses all the bandwidth, because it should. Maybe the only reason I was getting just 20 MB/s with 200 sockets was the fact that the chunks were only 100 kB in length.