Noisy text classification


ajay kumar

Aug 16, 2017, 9:04:38 AM
to Common Crawl
Hello,

I wanted to know how noisy texts such as chats and emails are treated. Are there any pre-processing stages that remove them, or is it too common for a Common Crawl dataset to have them?

Thank you

Sebastian Nagel

Aug 16, 2017, 10:30:23 AM
to common...@googlegroups.com
Hi,

It depends only on the protocol (URL scheme) and the format.
Chats and emails are crawled if they are
- reachable via HTTP(S)
- in HTML
- not excluded by robots.txt

Mail archives (e.g., mail-archives.apache.org) are included, for example.
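
To make the three conditions above concrete, here is a minimal Python sketch. It is not the crawler's actual code: the function name `is_crawl_candidate`, its parameters, and the choice to also accept `application/xhtml+xml` as HTML are assumptions made for illustration. `CCBot` is the user agent Common Crawl's crawler identifies itself with. The robots.txt content is passed in as a string so the check runs without network access.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


def is_crawl_candidate(url, content_type, robots_txt, user_agent="CCBot"):
    """Hypothetical helper: apply the three conditions to a single page."""
    # 1. Reachable via HTTP(S): only the URL scheme is checked here.
    if urlsplit(url).scheme.lower() not in ("http", "https"):
        return False

    # 2. In HTML: compare the media type and ignore parameters such as charset.
    #    Accepting XHTML as well is an assumption of this sketch.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type not in ("text/html", "application/xhtml+xml"):
        return False

    # 3. Not excluded by robots.txt for the given user agent.
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Example robots.txt that blocks one directory for all crawlers.
robots = "User-agent: *\nDisallow: /private/\n"
print(is_crawl_candidate("https://example.org/list/msg1.html",
                         "text/html; charset=utf-8", robots))
print(is_crawl_candidate("https://example.org/private/chat.html",
                         "text/html", robots))
```

Note that nothing in this check looks at whether the page is a chat log or an email; the content type of the text itself plays no role.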

Best,
Sebastian