Dear All, Dear Sebastian
- Is it just a question of scaling up the cluster instances? If so, what would you recommend?
- Is it the notebook environment making it slow? Would a script be much faster?
Thanks Colin for the reply, I am stuck and anxious to get some help here :=)

So I must admit, as a newbie, some of the details of your feedback are difficult for me to follow, but I get the gist of it: the bottleneck is clearly in process_warc / process. I will adapt it to only parse a record if the regex finds keywords directly in record.content_stream().read().

If accessing individual ranges is slower (as I understand it, S3 cannot fetch multiple ranges in one request), then maybe a better option would be to go through entire WARC files instead: check for WARC-Type: response, check Content-Length > min_size, and check the content type (like the is_html function). But I am not sure how to check for the English language. Maybe it is not needed, and I can simply rely on the regex search as a first collection pass, and then refine further by analyzing the fetched and saved pages.
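For what it is worth, here is a rough sketch of that full-file pass. The prefilter helpers are plain Python; the keyword pattern, the MIN_SIZE threshold, and the `is_html` rule are my own placeholder choices, and the loop assumes the third-party warcio library (which is why its import sits inside the function):

```python
import re

MIN_SIZE = 1024          # assumed minimum Content-Length, in bytes
KEYWORDS = re.compile(rb"climate|warming", re.IGNORECASE)  # placeholder keywords

def is_html(content_type):
    """Cheap check on the HTTP Content-Type header value."""
    return content_type is not None and "html" in content_type.lower()

def looks_relevant(payload):
    """Regex prefilter on the raw bytes, before any HTML parsing."""
    return KEYWORDS.search(payload) is not None

def process_warc(path):
    """Iterate a whole .warc.gz file and yield payloads worth parsing.

    Requires the third-party warcio package (pip install warcio).
    """
    from warcio.archiveiterator import ArchiveIterator  # third-party
    with open(path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            length = int(record.rec_headers.get_header("Content-Length") or 0)
            if length < MIN_SIZE:
                continue
            ctype = (record.http_headers.get_header("Content-Type")
                     if record.http_headers else None)
            if not is_html(ctype):
                continue
            payload = record.content_stream().read()
            if looks_relevant(payload):
                yield record.rec_headers.get_header("WARC-Target-URI"), payload
```

The point of the ordering is that the header checks are nearly free, so the regex only runs on records that already look like large HTML responses.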
languages-cld2: {"reliable":true,"text-bytes":2464,"languages":[{"code":"en","code-iso-639-3":"eng","text-covered":0.99,"score":952.0,"name":"ENGLISH"}]}
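If a record does carry a languages-cld2 field like the one above, a small stdlib-only filter could keep only records that are reliably English with good text coverage. This is just a sketch; the 0.9 coverage threshold is an arbitrary choice of mine:

```python
import json

def is_english(cld2_json, min_covered=0.9):
    """Return True if a "languages-cld2" JSON string is reliable, mostly English.

    The min_covered threshold of 0.9 is an assumption for this sketch.
    """
    try:
        info = json.loads(cld2_json)
    except json.JSONDecodeError:
        return False
    if not info.get("reliable"):
        return False
    return any(
        lang.get("code") == "en" and lang.get("text-covered", 0) >= min_covered
        for lang in info.get("languages", [])
    )

sample = ('{"reliable":true,"text-bytes":2464,"languages":'
          '[{"code":"en","code-iso-639-3":"eng","text-covered":0.99,'
          '"score":952.0,"name":"ENGLISH"}]}')
print(is_english(sample))  # → True
```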
The other option you mention is fetching pages in parallel. That is the part I do not know how to do, e.g. fetching 100 at a time using an asynchronous HTTP client...
Thanks
A
--
You received this message because you are subscribed to the Google Groups "Common Crawl" group.
To unsubscribe from this group and stop receiving emails from it, send an email to common-crawl...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/common-crawl/d8b05ec2-176c-4812-9949-41ea849eabedo%40googlegroups.com.
Yes, it definitely sounds easier to work with and ingest WET files.
Though I was wondering whether it is easy to clean and extract relevant, topic-specific text from plain text without any HTML info.
For example, with HTML info you could search for keywords in the header tags, then extract the text of all child elements and omit the rest of the page that might not be relevant.
With plain text, that seems way more challenging?
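To make the HTML-side idea concrete, here is a minimal sketch using only the standard library's html.parser: it searches header tags for keywords and collects the text that follows a matching header until the next header. The tag set and the substring-matching rule are simplifications of my own:

```python
from html.parser import HTMLParser

HEADER_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

class HeaderSectionExtractor(HTMLParser):
    """Collect text that follows a header whose text contains a keyword."""

    def __init__(self, keywords):
        super().__init__()
        self.keywords = [k.lower() for k in keywords]
        self.in_header = False       # currently inside an <hN> tag?
        self.current_header = []     # text fragments of the current header
        self.capturing = False       # last seen header matched a keyword?
        self.sections = []           # one list of text chunks per matching header

    def handle_starttag(self, tag, attrs):
        if tag in HEADER_TAGS:
            self.in_header = True
            self.current_header = []

    def handle_endtag(self, tag):
        if tag in HEADER_TAGS and self.in_header:
            self.in_header = False
            header_text = "".join(self.current_header).lower()
            if any(k in header_text for k in self.keywords):
                self.capturing = True
                self.sections.append([])
            else:
                self.capturing = False  # a non-matching header ends the section

    def handle_data(self, data):
        if self.in_header:
            self.current_header.append(data)
        elif self.capturing and data.strip():
            self.sections[-1].append(data.strip())
```

With WET (plain-text) records there are no such anchors, so the equivalent step would have to fall back to heuristics like keyword proximity windows.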
Thanks
So basic that I could not find the solution on SO!
Thanks again
Dear Sebastian and All,
Dear all