Hi Everyone,
I have been experimenting with the Common Crawl URL index by querying various domains and reviewing the results in JSON format. From my research I learned that I can access the archived copy of a specific article by combining the index URL with the url value found in each of the JSON objects.
Is there a better way to retrieve the HTML for each of the URLs of a specific domain? I appreciate your feedback.
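For context, here is roughly what I have so far (a sketch, assuming the filename, offset, and length fields that appear in each JSON record; the crawl id below is just a placeholder for whichever crawl you query):

```python
import gzip
import json
import urllib.parse
import urllib.request

# Placeholder crawl id; substitute the crawl you are actually querying.
INDEX_API = "https://index.commoncrawl.org/CC-MAIN-2024-10-index"

def query_index(url_pattern):
    """Yield one JSON record per capture from the CDX index API."""
    query = urllib.parse.urlencode({"url": url_pattern, "output": "json"})
    with urllib.request.urlopen(f"{INDEX_API}?{query}") as resp:
        for line in resp:
            yield json.loads(line)

def byte_range(offset, length):
    """Range header covering one WARC record; offset/length come back as strings."""
    start = int(offset)
    return f"bytes={start}-{start + int(length) - 1}"

def fetch_record(record):
    """Fetch one gzipped WARC record by byte range and decompress it."""
    req = urllib.request.Request(
        "https://data.commoncrawl.org/" + record["filename"],
        headers={"Range": byte_range(record["offset"], record["length"])},
    )
    with urllib.request.urlopen(req) as resp:
        return gzip.decompress(resp.read()).decode("utf-8", errors="replace")
```

The idea is to avoid stitching URLs together at all: each index record points at the exact byte span of its WARC record, so one Range request fetches just that page.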
Thanks

Hi Sebastian,
Thank you for your excellent feedback! I am currently experimenting with the methods you suggested.
Does selecting JSON metadata with "mime-detected": "text/html" ensure that I am always processing HTML content?
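For reference, this is the client-side filter I am testing (a sketch with made-up sample data; I also check status, on the assumption that non-200 captures may not carry a page body):

```python
import json

def html_records(ndjson_lines):
    """Keep only captures whose detected MIME type is text/html and whose status is 200."""
    for line in ndjson_lines:
        rec = json.loads(line)
        if rec.get("mime-detected") == "text/html" and rec.get("status") == "200":
            yield rec

# Hypothetical sample lines in the index's NDJSON output format.
sample = [
    '{"url": "https://example.com/", "mime-detected": "text/html", "status": "200"}',
    '{"url": "https://example.com/logo.png", "mime-detected": "image/png", "status": "200"}',
]
kept = list(html_records(sample))
```

After filtering, only the first sample record survives, since the PNG capture is dropped by the MIME check.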
Thanks,
Lewis

To unsubscribe from this group and stop receiving emails from it, send an email to common-crawl...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/common-crawl/61cd3fcd-4356-299d-826e-d9ff582c6b16%40commoncrawl.org.
Hi Sebastian,
Is there a way to get JSON objects with unique digest values? Also, does Common Crawl capture the publication date of each captured webpage?
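In case it is useful, here is how I am deduplicating on my side for now (a sketch; I am assuming digest is a hash of the page content, so identical payloads share a digest even when captured under different URLs):

```python
def dedupe_by_digest(records):
    """Keep the first capture seen for each content digest."""
    seen = set()
    unique = []
    for rec in records:
        if rec["digest"] not in seen:
            seen.add(rec["digest"])
            unique.append(rec)
    return unique

# Hypothetical records: the first two carry identical content under different URLs.
records = [
    {"url": "https://example.com/a", "digest": "AAA"},
    {"url": "https://example.com/a?utm=1", "digest": "AAA"},
    {"url": "https://example.com/b", "digest": "BBB"},
]
unique = dedupe_by_digest(records)
```

This keeps one capture per digest. As far as I can tell, the index's timestamp field is the capture time, not the article's publication date, so a publication date would have to be parsed out of the HTML itself.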
Thanks,
Lewis
Hi Everyone,
I need to extract the HTML for each captured page (URL) directly from WARC files, and I have specific domains that I need to extract the text from. Is there a way to download all the WARC files for a specific domain and then extract the text? Can you provide an example?
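For context, here is the direction I have been trying. My understanding (possibly wrong) is that WARC files are organized by crawl rather than by domain, so I first list a domain's captures via the index and fetch each record by byte range; the sketch below only shows splitting a decompressed response record into its headers and HTML body:

```python
def extract_html(warc_record_text):
    """Split a decompressed WARC response record into (WARC headers, HTTP headers, body)."""
    # A response record is: WARC headers, blank line, HTTP headers, blank line, payload.
    warc_headers, _, rest = warc_record_text.partition("\r\n\r\n")
    http_headers, _, body = rest.partition("\r\n\r\n")
    return warc_headers, http_headers, body

# Hypothetical miniature record, just to show the layout.
sample_record = (
    "WARC/1.0\r\n"
    "WARC-Type: response\r\n"
    "WARC-Target-URI: https://example.com/\r\n"
    "\r\n"
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html\r\n"
    "\r\n"
    "<html><body>Hello</body></html>"
)
warc_h, http_h, body = extract_html(sample_record)
```

I have also seen the warcio library (its ArchiveIterator) recommended for iterating over whole WARC files, which is probably more robust than hand-splitting like this.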
Thanks,