Hi Tyler,
> does Common Crawl offer datasets for download, or do you
> typically generate the data yourself?
All Common Crawl data sets are available both for download and for
processing in the Amazon cloud. There are multiple data sets and data formats, see
https://commoncrawl.org/the-data/get-started/
However, there is no dedicated data set listing sites by technology.
You'd need to generate it by processing the WARC files or WARC records
selected via the URL indexes.
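For illustration, here is a minimal Python sketch of the second route (look up a site in a URL index, then fetch only the matching WARC records). The crawl ID CC-MAIN-2021-04, the helper names and the WordPress markers are assumptions chosen for the example, not an official recipe, and the detection heuristic will miss some sites:

```python
import json
import re
from urllib.parse import urlencode

# Assumption: CC-MAIN-2021-04 is just an example crawl; any crawl ID works.
INDEX_ENDPOINT = "https://index.commoncrawl.org/CC-MAIN-2021-04-index"


def index_query_url(domain, endpoint=INDEX_ENDPOINT):
    """Build a URL index query listing all captures under a domain."""
    params = {"url": f"{domain}/*", "output": "json"}
    return f"{endpoint}?{urlencode(params)}"


def byte_range(index_line):
    """Map one JSON line of index output to (WARC file name, HTTP Range value)."""
    rec = json.loads(index_line)
    start = int(rec["offset"])
    end = start + int(rec["length"]) - 1  # Range end is inclusive
    return rec["filename"], f"bytes={start}-{end}"


# Hypothetical heuristic: typical WordPress asset paths or a generator meta tag.
WORDPRESS_MARKERS = re.compile(
    r"/wp-content/|/wp-includes/|<meta[^>]+generator[^>]+WordPress",
    re.IGNORECASE,
)


def looks_like_wordpress(html):
    """Return True if the page source carries common WordPress markers."""
    return bool(WORDPRESS_MARKERS.search(html))
```

Each line returned by the index query is a JSON object. Requesting the listed file with that Range header should return a single gzipped WARC record, whose HTML payload can then be passed to looks_like_wordpress().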
Best,
Sebastian
On 2/24/21 5:57 PM, Tyler King wrote:
>
> Hey guys,
>
> Thanks for all this help! It's definitely putting me in the right direction for this assignment. Although, a lot of this is going over my
> head since I'm still learning about data analysis - I'm a creative guy. I didn't know about builtwith.com. It looks like there are some
> interesting datasets on there.. but wow, I had no idea data could be so pricey! 60k for all the WordPress sites..yikes
>
> One question (which may seem obvious), does Common Crawl offer datasets for download, or do you typically generate the data yourself?
>
> Regards,
>
> Tyler
>
> On Wednesday, 24 February 2021 at 04:13:25 UTC-5 Sebastian Nagel wrote:
>
> Hi Tyler,
>
> > analyze the progression of design and layouts based
> > on the popularity of their themes and templates
>
> If it's about a quantitative analysis of many web sites
> and a larger time span (Common Crawl has data back to 2008)
> this looks like a big data project. But it's nothing magic,
> see this nice example analyzing usage of "z-index" in CSS:
>
> https://psuter.net/2019/07/07/z-index
>
> > crawl only Wordpress sites or Wix sites
>
> If you have a list of host or domain names using these technologies,
> you could use one of the URL indexes to pick single pages from those sites.
> Otherwise, just analyze a random sample of WARC files and skip over
> pages not matching your criteria.
>
> Best,
> Sebastian
>
> On 2/24/21 2:35 AM, Jay Patel wrote:
> > You can do something like this by first creating a list of domains running on wordpress or wix from builtwith.com and
> > [...] builtwith.com plans are super
> > [...] definitely an option if builtwith.com is too expensive.
> >
> >
> > On Tue, Feb 23, 2021 at 10:48 PM Tyler King <tyler...@gmail.com> wrote:
> >
> > Hi all,
> >
> > It's great to be a part of your group! I am a Digital Product Designer student from Canada taking a class in Data-Driven Design.
> > I was wondering if it's possible to crawl only Wordpress sites or Wix sites. I'd like to analyze the progression of design and layouts
> > based on the popularity of their themes and templates. Is it possible to narrow it down to just creative/portfolio sites or UI elements
> > such as gradients/round buttons etc.? Or is that data too specific?
> >
> > Thanks in advance!
> >
> > Tyler
> >
> > --
> > Jay M. Patel
> > Cofounder and Principal Data Scientist
> > Specrom Analytics
> > +1-678-740-7834 (US)
> > +91-9323998105 (India)
> > www.specrom.com | www.jaympatel.com