Including CSS files in the crawl?

Kenneth C. Arnold

May 24, 2012, 2:49:54 PM

to common...@googlegroups.com

I'm working on mining styles from websites. CommonCrawl looked like
just what I needed, until I saw that the crawl didn't include CSS
files. (I'd be happy to be proven wrong on that point!)

Could the crawler be told to include CSS files as well? If you need
code for extracting the references from the HTML, I can help with
that.
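
A minimal sketch of what that reference extraction might look like, using only the Python standard library. The class and function names (`ResourceRefParser`, `extract_refs`) are hypothetical, not part of any CommonCrawl code, and the sketch only handles `<link rel="stylesheet">` and `<script src>`; it ignores `@import` rules and inline styles.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ResourceRefParser(HTMLParser):
    """Collects stylesheet and script URLs referenced by an HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.css = []
        self.js = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "link":
            # rel may hold several space-separated tokens, e.g. "alternate stylesheet"
            rel = (attrs.get("rel") or "").lower().split()
            href = attrs.get("href")
            if "stylesheet" in rel and href:
                self.css.append(urljoin(self.base_url, href))
        elif tag == "script":
            src = attrs.get("src")
            if src:
                self.js.append(urljoin(self.base_url, src))


def extract_refs(html, base_url):
    """Return (css_urls, js_urls), resolved against base_url."""
    parser = ResourceRefParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.css, parser.js
```

Relative references are resolved against the page URL, so the crawler would get absolute URLs it can queue directly.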

"In return", I'll write up an example of working with the new archive
format in Python -- once I figure it out myself :)

Thanks for getting the crawl started back up,
-Ken

Ahad Rana

May 24, 2012, 3:03:52 PM

to common...@googlegroups.com

Hi Ken,

Yes, sorry, the new crawl is ignoring CSS files as of now :-( The same goes for JavaScript files. But I don't see why we couldn't add them back in the near future. Does anybody else have any comments on this issue? And, of course, we would welcome any examples of people making use of the crawl :-)

Ahad.

Jason Duke

May 24, 2012, 3:08:18 PM

to common...@googlegroups.com

I'd personally love for JS and CSS files to be included as the ability to create the rendered DOM, as viewed by a person, is an extremely important part of the process for me.

Having said that, the sky won't fall in on me if you decide I'm in the minority :)

---
Jason Duke

Email: ja...@strangelogic.com
Mob: +44 (0)7595 924 934

Twitter: @JasonD

LinkedIn: http://uk.linkedin.com/in/jasonduke1

Jake Quist

May 24, 2012, 3:15:50 PM

to common...@googlegroups.com

+1 for CSS/JS

Jake Quist | (650) 485-3427 | Zillabyte

Dave Lester

May 24, 2012, 3:56:00 PM

to common...@googlegroups.com

What about commonly used JS and CSS files? There is probably a short list of files that could be excluded from the crawl because they are used everywhere, including jQuery and reset.css.

Dave

Kenneth C. Arnold

May 24, 2012, 4:11:05 PM

to common...@googlegroups.com

A simple solution: with some small probability, add any new CSS/JS
file to a central collection (another S3 bucket?). Check each new file
against the contents of that collection; if it's there, store a
reference to it instead of the contents. After processing lots of
sites, the commonly used files will end up there with high
probability. Off-site references may want to get higher inclusion
probabilities.
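
A rough sketch of that sampling idea, assuming an in-memory dict stands in for the central collection (in practice it might be another S3 bucket). The class name `SharedResourceStore`, the default probability, and the 10x boost for off-site references are all made-up values for illustration.

```python
import hashlib
import random


class SharedResourceStore:
    """Approximate deduplication of CSS/JS bodies by random sampling."""

    def __init__(self, include_prob=0.01, rng=None):
        self.include_prob = include_prob
        self.rng = rng or random.Random()
        self.shared = {}  # content digest -> body; stands in for the central collection

    def store(self, content, offsite=False):
        """Return ("ref", digest) if the body is in the collection, else ("inline", content)."""
        digest = hashlib.sha1(content).hexdigest()
        # Assumption: off-site references get a 10x higher inclusion probability.
        prob = min(1.0, self.include_prob * 10) if offsite else self.include_prob
        if digest not in self.shared and self.rng.random() < prob:
            self.shared[digest] = content
        if digest in self.shared:
            return ("ref", digest)
        return ("inline", content)
```

A file seen n times stays out of the collection with probability (1 - p)^n, so something like jQuery gets in almost surely, while a one-off stylesheet usually stays inline.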

One important property is that you should be able to get all of the
dependent files for a page without buffering too much or making
multiple passes. A basic rule that ensures this: for all pages within
a domain, the files a page depends on come before the page itself.
To bound the buffer size, keep page domains contiguous and reset
the buffer for each new domain. Off-site references could just be
pulled into the stream right before the file that first references
them on a site; the above suggestion should help ensure that this
doesn't cause massive duplication.
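
To illustrate the ordering rule from the consumer's side, here is a sketch of a single-pass stream processor. The `Record` layout (a `site` field naming the site a record was emitted for, plus `url`, `kind`, `body`, `deps`) is invented for this example and is not the actual archive format.

```python
from collections import namedtuple

# Hypothetical record layout: kind is "resource" or "page";
# deps lists the URLs of the resources a page references.
Record = namedtuple("Record", "site url kind body deps")


def resolve_stream(records):
    """Yield (page_url, {dep_url: body}) in one pass over an ordered stream."""
    buffer = {}
    current_site = None
    for rec in records:
        if rec.site != current_site:
            # Sites are contiguous, so the buffer can be reset at each boundary.
            buffer = {}
            current_site = rec.site
        if rec.kind == "page":
            # A KeyError here means the stream violated the ordering rule.
            yield rec.url, {dep: buffer[dep] for dep in rec.deps}
        else:
            buffer[rec.url] = rec.body
```

The buffer never holds more than one site's resources, and an off-site file is simply emitted under the referencing site just before the first page that needs it.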

Hopefully this is all just a few lines of code.
-Ken

Ben Nagy

May 24, 2012, 10:25:40 PM

to common...@googlegroups.com

On Fri, May 25, 2012 at 12:53 AM, Jason Duke <ja...@strangelogic.com> wrote:
> I'd personally love for JS and CSS files to be included as the ability to
> create the rendered DOM, as viewed by a person, is an extremely important
> part of the process for me.

If the crawl doesn't contain pretty much exactly what was sent by the
server then it's almost useless for my purposes. I plan to hunt
'unusual' or possibly even malicious pages, and CSS and JS are both
absolutely within scope for that.

I don't want to get on a high horse, but doing _any_ kind of
preprocessing / omission etc is almost certain to make the data
unsuitable for at least some people. You may as well change the name
from 'common' to 'handy for NLP and SEO geeks' crawl. Not that there's
anything wrong with that.

Cheers,

ben

Kenneth C. Arnold

May 30, 2012, 4:35:07 PM

to common...@googlegroups.com

Though Ben's tone is somewhat harsh, I think he's vehemently agreeing
with this request.

My previous message was written backwards. What I meant to provide was:

(1) a specification for ordering of resources in the stream that
ensures that a stream processor can use them, and
(2) a trick for approximately deduplicating external resources so you
don't end up with 100k copies of jquery.

The GitHub commits suggest that there's a big code push upcoming.
Which code, if any, should I look into hacking to make this work?

-Ken
