CommonCrawl Wix sites and/or Wordpress sites


Tyler King

Feb 23, 2021, 12:18:36 PM
to Common Crawl
Hi all,

It's great to be a part of your group!  I am a Digital Product Designer student from Canada taking a class in Data-Driven Design. 
I was wondering if it's possible to crawl only WordPress sites or Wix sites. I'd like to analyze the progression of design and layouts based on the popularity of their themes and templates. Is it possible to narrow it down to just creative/portfolio sites, or to UI elements such as gradients, round buttons, etc.? Or is that data too specific?

Thanks in advance!

Tyler

Jay Patel

Feb 23, 2021, 8:35:34 PM
to common...@googlegroups.com
You can do something like this by first creating a list of domains running on WordPress or Wix from builtwith.com and then querying for those domains using a Common Crawl index. The downside is that builtwith.com plans are super expensive ($295 for looking up 2 technologies, $495 for unlimited lookups).

Alternatively, you can write your own technology analyzer (using regex-based pattern matching) and use it to filter pages once you fetch the raw HTML from WARC files. This is of course more time-consuming, but lots of public REST APIs are out there (https://algorithmia.com/algorithms/specrom/technology_analyzer), so it's definitely an option if builtwith.com is too expensive.
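
A rough sketch of what such a regex-based analyzer might look like in Python. The patterns are illustrative assumptions (the "generator" meta tag, `/wp-content/` paths, and the `static.parastorage.com` asset host), not a complete rule set, and `detect_technologies` is a made-up name:

```python
import re

# Hypothetical signature patterns (assumptions for illustration only);
# real-world detection needs a much broader and better-tested rule set.
SIGNATURES = {
    "wordpress": [
        re.compile(r'<meta[^>]+name=["\']generator["\'][^>]+content=["\']WordPress', re.I),
        re.compile(r'/wp-(?:content|includes)/', re.I),
    ],
    "wix": [
        re.compile(r'<meta[^>]+name=["\']generator["\'][^>]+content=["\']Wix\.com', re.I),
        re.compile(r'static\.parastorage\.com', re.I),
    ],
}


def detect_technologies(html):
    """Return the set of technology names with at least one matching pattern."""
    return {
        name
        for name, patterns in SIGNATURES.items()
        if any(pattern.search(html) for pattern in patterns)
    }
```

Pages for which `detect_technologies` returns an empty set can simply be skipped. Regex matching on raw HTML produces false positives (for example, a page that merely mentions one of these strings), so the output is best treated as a candidate list.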





--
Jay M. Patel
Cofounder and Principal Data Scientist
Specrom Analytics
+1-678-740-7834 (US)
+91-9323998105 (India)

Sebastian Nagel

Feb 24, 2021, 4:13:25 AM
to common...@googlegroups.com
Hi Tyler,

> analyze the progression of design and layouts based
> on the popularity of their themes and templates

If it's about a quantitative analysis of many web sites
and a larger time span (Common Crawl has data back to 2008)
this looks like a big data project. But it's nothing magic,
see this nice example analyzing usage of "z-index" in CSS:
https://psuter.net/2019/07/07/z-index

> crawl only Wordpress sites or Wix sites

If you have a list of host or domain names using these technologies,
you could use one of the URL indexes to pick single pages from those sites.
Otherwise, just analyze a random sample of WARC files and skip over
pages not matching your criteria.
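
A minimal sketch of such a URL index lookup, assuming the public
index server at index.commoncrawl.org and the crawl ID CC-MAIN-2021-04
(both are assumptions, as are the function names and the "limit"
parameter; pick any crawl the index server lists):

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

INDEX_SERVER = "https://index.commoncrawl.org"  # assumed public index endpoint
CRAWL_ID = "CC-MAIN-2021-04"                    # assumed crawl; any listed crawl works


def build_index_query(domain, crawl_id=CRAWL_ID, limit=None):
    """Build a query URL asking for all captured pages under one domain."""
    params = {"url": f"{domain}/*", "output": "json"}
    if limit is not None:
        params["limit"] = limit
    return f"{INDEX_SERVER}/{crawl_id}-index?{urlencode(params)}"


def parse_index_response(body):
    """The server answers with one JSON object per line; skip blank lines."""
    return [json.loads(line) for line in body.splitlines() if line.strip()]


def lookup_domain(domain, **kwargs):
    """Fetch and parse the index records for a domain (needs network access)."""
    with urlopen(build_index_query(domain, **kwargs)) as resp:
        return parse_index_response(resp.read().decode("utf-8"))
```

Each record should carry, among other fields, the WARC file name plus
the offset and length of the capture, which is enough to fetch just
that one page.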

Best,
Sebastian


Tyler King

Feb 24, 2021, 11:57:54 AM
to Common Crawl

Hey guys,

Thanks for all this help! It's definitely pointing me in the right direction for this assignment, although a lot of this is going over my head since I'm still learning about data analysis (I'm a creative guy). I didn't know about builtwith.com. It looks like there are some interesting datasets on there, but wow, I had no idea data could be so pricey! 60k for all the WordPress sites... yikes.

One question (which may seem obvious), does Common Crawl offer datasets for download, or do you typically generate the data yourself?

Regards,

Tyler

Sebastian Nagel

Feb 25, 2021, 5:54:07 AM
to common...@googlegroups.com
Hi Tyler,

> does Common Crawl offer datasets for download, or do you
> typically generate the data yourself?

All Common Crawl data sets are available both for download and for
processing in the Amazon cloud. There are multiple data sets and data formats, see
https://commoncrawl.org/the-data/get-started/

However, there is no dedicated data set listing sites by technology.
You'd need to generate it by processing the WARC files or WARC records
selected via the URL indexes.
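
A sketch of fetching one WARC record, assuming you already have the
"filename", "offset" and "length" of a capture from a URL index lookup.
The download host and the helper names are assumptions for illustration:

```python
import gzip
from urllib.request import Request, urlopen

DATA_HOST = "https://data.commoncrawl.org"  # assumed public download host


def byte_range(offset, length):
    """HTTP Range header value covering exactly one record (end is inclusive)."""
    start = int(offset)
    end = start + int(length) - 1
    return f"bytes={start}-{end}"


def fetch_warc_record(filename, offset, length):
    """Download and decompress a single record (needs network access)."""
    request = Request(
        f"{DATA_HOST}/{filename}",
        headers={"Range": byte_range(offset, length)},
    )
    with urlopen(request) as resp:
        # Each record is stored as its own gzip member, so the slice
        # can be decompressed without the rest of the file.
        return gzip.decompress(resp.read())


def split_warc_record(raw):
    """Split a response record into WARC headers, HTTP headers and payload."""
    warc_headers, _, rest = raw.partition(b"\r\n\r\n")
    http_headers, _, payload = rest.partition(b"\r\n\r\n")
    return warc_headers, http_headers, payload
```

The payload returned by `split_warc_record` is the raw HTML (plus the
trailing record delimiter), which can then be checked for the
technology you are after.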

Best,
Sebastian


Tom Morris

Feb 25, 2021, 1:02:55 PM
to common...@googlegroups.com
On Thu, Feb 25, 2021 at 5:54 AM Sebastian Nagel
<seba...@commoncrawl.org> wrote:

> However, there is no dedicated data set listing sites by technology.
> You'd need to generate it by processing the WARC files or WARC records
> selected via the URL indexes.

I haven't tried this, so it might not work, but one thought that
occurred to me is that you might be able to leverage the fact that
these kinds of sites often include links back to a few common sites,
e.g. wordpress.com or the source of their theme, to interrogate the
host/domain graph for hosts that link back to them.

Tom

Sebastian Nagel

Feb 25, 2021, 4:38:06 PM
to common...@googlegroups.com
Hi Tom,

> these kinds of sites often include links back to a few common sites
> e.g. wordpress.com or the source of their theme to interrogate the
> host/domain graph for hosts that link back to them.

Excellent idea!

It actually seems to work. So far, I've tried it only with the
latest domain-level graph [1] (the domain-level graph is smaller
than the host-level graph):

- 12.4 million domains link to "w.org" which is an indicator
that WordPress is used

- "wix.com" is linked from 570k domains and 2 million link to
"parastorage.com" used to host static content of wix-built sites

Of course, using the host-level graph you could refine the candidate list by
searching for links to "s.w.org" or "static.parastorage.com". In any case,
the results need to be verified by looking at other indicators, e.g., the
"generator" meta element.
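
A sketch of the graph lookup in Python, under the assumption that the
graph comes as two tab-separated text files: a vertices file with
"id <TAB> reversed domain name" (e.g. "org.w", possibly followed by
more columns) and an edges file with "from_id <TAB> to_id". The
function names are made up for illustration:

```python
def reverse_domain(name):
    """Convert between 'w.org' and the reversed notation 'org.w'."""
    return ".".join(reversed(name.split(".")))


def find_vertex_id(domain, vertex_lines):
    """Return the vertex id of a domain, or None if it is not in the graph."""
    wanted = reverse_domain(domain)
    for line in vertex_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2 and fields[1] == wanted:
            return fields[0]
    return None


def sources_of(target_id, edge_lines):
    """Collect the ids of all vertices that have an edge to target_id."""
    sources = set()
    for line in edge_lines:
        fields = line.split()
        if len(fields) == 2 and fields[1] == target_id:
            sources.add(fields[0])
    return sources


def resolve_names(vertex_ids, vertex_lines):
    """Translate vertex ids back into ordinary (non-reversed) domain names."""
    names = []
    for line in vertex_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2 and fields[0] in vertex_ids:
            names.append(reverse_domain(fields[1]))
    return sorted(names)
```

All three functions stream over their input, so the downloaded files
can be passed in directly, for example opened with
gzip.open(path, "rt"); the vertices file is read twice, once to find
the target and once to resolve the names.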

Sebastian

[1] https://commoncrawl.org/2021/02/host-and-domain-level-web-graphs-oct-nov-jan-2020-2021/