UUIDs in warc and wat files

Common Screens

Jan 19, 2023, 11:01:14 AM
to Common Crawl
Are the UUIDs for URLs maintained across crawls, i.e. will the same URL have the same UUID in the next crawl, or a different one?
I need to understand this because I am building a search engine at https://visualsearch.org and would like to add only new URLs to subsequent indexes and update the existing ones.
Please advise.

Sebastian Nagel

Jan 19, 2023, 11:41:52 AM
to common...@googlegroups.com
Hi,

WARC record IDs are unique to each WARC record. Records can be linked
via "WARC-Concurrent-To" or "WARC-Refers-To". For example, a response
record is linked to a request, a metadata record to a response, etc.
See the WARC format specification [1].

If the same URL is fetched twice in one crawl, the two WARC records will
have different UUIDs. The same applies across crawls.
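
To illustrate the point above, here is a minimal sketch of why the record ID cannot serve as a stable key. It assumes the ID is a freshly generated random UUID in the `<urn:uuid:...>` form; `new_warc_record_id` and `make_record` are hypothetical helpers written for this example and are not part of any WARC library.

```python
import uuid


def new_warc_record_id() -> str:
    """Return a fresh ID in the <urn:uuid:...> form (assumed: random UUID per record)."""
    return f"<urn:uuid:{uuid.uuid4()}>"


def make_record(url: str) -> dict:
    # Hypothetical, heavily simplified stand-in for the headers of a WARC response record.
    return {
        "WARC-Type": "response",
        "WARC-Target-URI": url,
        "WARC-Record-ID": new_warc_record_id(),
    }


url = "https://example.com/"
first_fetch = make_record(url)
second_fetch = make_record(url)

# The target URI is identical for both fetches, but each record gets its own ID.
print(first_fetch["WARC-Target-URI"] == second_fetch["WARC-Target-URI"])
print(first_fetch["WARC-Record-ID"] == second_fetch["WARC-Record-ID"])
```

The only field that stays the same between the two fetches is the target URI, which is why the URL is the natural key.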


> add only new URLs in the subsequent indexes and update the existing
> ones

You need to use the URL as the document ID or, alternatively, a digest of the URL.
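
A minimal sketch of that approach, assuming a SHA-1 hex digest of the UTF-8 encoded URL as the document ID and a plain dict standing in for the search index. The names `doc_id_for_url` and `upsert` are hypothetical, and no URL normalization is done here.

```python
import hashlib


def doc_id_for_url(url: str) -> str:
    """Derive a stable document ID from the URL itself (SHA-1 hex digest)."""
    return hashlib.sha1(url.encode("utf-8")).hexdigest()


def upsert(index: dict, url: str, content: str) -> str:
    """Add a new document or overwrite the existing one; report which happened."""
    doc_id = doc_id_for_url(url)
    action = "updated" if doc_id in index else "added"
    index[doc_id] = {"url": url, "content": content}
    return action


index = {}  # stand-in for a real search index
print(upsert(index, "https://example.com/", "text from crawl A"))     # added
print(upsert(index, "https://example.com/", "text from crawl B"))     # updated
print(upsert(index, "https://example.com/new", "text from crawl B"))  # added
```

Because the ID depends only on the URL string, the same URL maps to the same ID in every crawl. URLs that differ only in small details (for example a trailing slash) still produce different IDs, so you may want to normalize URLs before hashing.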

Best,
Sebastian

[1] https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/