Extracted text in WET files

Lukas Michelbacher

Mar 19, 2015, 6:06:31 AM

to common...@googlegroups.com

Hi,

I'm wondering how WET file content is created exactly. The WET description says:

WET files [...] only contain extracted plaintext

Which file types exactly are converted into plain text? I'm asking as I'm interested in all text including text from binary formats (PDF, DOC, etc.).

I understand that extracting PDFs is a separate topic ([1]). I'd like to make sure that PDF text is not already contained in WET data before starting my own PDF extraction.

Thanks,

Lukas

talliso...@gmail.com

Apr 6, 2015, 10:41:14 AM

to common...@googlegroups.com

Thanks to https://groups.google.com/d/msg/common-crawl/FD5uwzKWjig/wHv5_cDDX7QJ, I just started digging through the code. I think the answer is at line 834 of ParserMapper.java:

if (MimeTypeFilter.isTextType(normalizedMimeType)) {
    // ...then do charset id
    // ...if html, parse the HTML with a ParseWorker
}

In short, I think CC is pulling text out of documents that are of text type.
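If that reading is right, binary formats such as PDF or DOC would never get past the filter. To illustrate the idea, here is a minimal sketch of what a text-type check could look like. It is not Common Crawl's MimeTypeFilter: the class name TextTypeSketch, the method looksLikeTextType, and the small allow-list are all hypothetical, and the real filter may accept a different set of types.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only; not the actual MimeTypeFilter from the CC code.
final class TextTypeSketch {
    // Assumed allow-list of non-"text/" types that are still textual.
    private static final Set<String> TEXT_LIKE = new HashSet<>(Arrays.asList(
            "application/xhtml+xml", "application/xml", "application/json"));

    static boolean looksLikeTextType(String mimeType) {
        if (mimeType == null) {
            return false;
        }
        // Drop parameters such as "; charset=utf-8" and normalize case.
        String base = mimeType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        return base.startsWith("text/") || TEXT_LIKE.contains(base);
    }

    public static void main(String[] args) {
        System.out.println(looksLikeTextType("text/html; charset=UTF-8"));
        System.out.println(looksLikeTextType("application/pdf"));
    }
}
```

Under these assumptions, text/html passes the check while application/pdf and application/msword do not, which would be consistent with PDF text being absent from the WET files.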

Slightly off topic, but thought I'd document this while it is fresh.

For charset detection (see bestEffortDecodeBytes() in org.commoncrawl.util.CharsetUtils), CC appears to take the first non-null value that passes Charset.forName, trying these sources in order:

HTTP headers
HTML meta http-equiv tag
Mozilla detector
ICU detector

They also try calling Charset.forName on the alias of the charsetName.
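The "first non-null value that passes Charset.forName" logic can be sketched roughly as follows. This is my own illustration, not the code in CharsetUtils: the class CharsetFallback and its methods are hypothetical names, the candidate strings stand in for whatever each source (headers, meta tag, detectors) reports, and the alias lookup is left out.

```java
import java.nio.charset.Charset;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch of a "first usable charset wins" fallback chain.
final class CharsetFallback {
    /** Returns the charset for a name, or empty if the name is null, blank, malformed, or unsupported. */
    static Optional<Charset> tryCharset(String name) {
        if (name == null || name.trim().isEmpty()) {
            return Optional.empty();
        }
        try {
            return Optional.of(Charset.forName(name.trim()));
        } catch (IllegalArgumentException e) {
            // Covers both IllegalCharsetNameException and UnsupportedCharsetException.
            return Optional.empty();
        }
    }

    /** Walks the candidates in priority order and returns the first one the JVM can resolve. */
    static Optional<Charset> firstUsable(List<String> candidates) {
        for (String candidate : candidates) {
            Optional<Charset> charset = tryCharset(candidate);
            if (charset.isPresent()) {
                return charset;
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Assumed order: HTTP header, meta tag, Mozilla detector, ICU detector.
        List<String> candidates = Arrays.asList((String) null, "not-a-real-charset", "ISO-8859-1", "UTF-8");
        System.out.println(firstUsable(candidates));
    }
}
```

In this example the HTTP header gave nothing and the meta tag named an unknown charset, so the third candidate is the one that gets used.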

CC appears to be using com.dappit.Dapper.parser.MozillaParser to parse the HTML.