Websites behind login?

Florian Hantke

Nov 6, 2021, 7:32:25 AM
to Common Crawl
Hi everyone,

thank you for the great project.
I have one question about your crawling engine:
Do you crawl all websites as an anonymous user or do you have accounts for some websites?
For instance, when crawling Facebook, I assume that most websites are behind a login page.

I know that this is a tough problem as there is no simple approach.
I'm only wondering whether you address it somehow or ignore it?

Thank you and have a great weekend,
Florian

Sebastian Nagel

Nov 7, 2021, 9:09:03 AM
to common...@googlegroups.com
Hi Florian,

> Do you crawl all websites as an anonymous user or do you have accounts
> for some websites?

We do not use logins and the crawler does not fill any (login) forms.

> For instance, when crawling Facebook, I assume that most websites are
> behind a login page.

Our crawler respects the robots.txt rules. Facebook's robots.txt
excludes most of the content, regardless of any login. A few subdomains
(e.g. ai.facebook.com) allow crawling in their robots.txt, and for these
there is content in our web archives.
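
To illustrate how a robots.txt check of this kind works in general, here is a
minimal sketch using Python's standard urllib.robotparser module. It is not
the crawler's actual code. The two rule sets, the example.com URLs, the
"ExampleBot" user agent and the is_allowed helper are all made up for
illustration and are not copied from any real site's robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt bodies, invented for this example.
# The first one excludes everything for every crawler.
RESTRICTIVE_RULES = """\
User-agent: *
Disallow: /
"""

# The second one only excludes a single path prefix.
PERMISSIVE_RULES = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines, so no network
    # request is needed for this offline illustration.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    agent = "ExampleBot"  # hypothetical user agent name
    print(is_allowed(RESTRICTIVE_RULES, agent, "https://www.example.com/profile"))
    print(is_allowed(PERMISSIVE_RULES, agent, "https://docs.example.com/page"))
    print(is_allowed(PERMISSIVE_RULES, agent, "https://docs.example.com/private/x"))
```

With rules like the first set, nothing on that host is fetched, whether or not
a login would be required. Because robots.txt is read per host, a subdomain
can publish a more permissive file of its own, as in the second set.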

> I'm only wondering whether you address it somehow or ignore it?

We even try not to include any sensitive or personal content.
Of course, this is also a tough problem.

Best,
Sebastian

Florian Hantke

Nov 7, 2021, 9:19:24 AM
to Common Crawl
Hi Sebastian,

thank you for the quick answer.

Best,
Florian
