Extracted text in WET files


Lukas Michelbacher

Mar 19, 2015, 6:06:31 AM
to common...@googlegroups.com
Hi,

I'm wondering how WET file content is created exactly. The WET description says:

WET files [...] only contain extracted plaintext

Which file types exactly are converted into plain text? I'm asking as I'm interested in all text including text from binary formats (PDF, DOC, etc.).

I understand that extracting PDFs is a separate topic ([1]). I'd like to make sure that PDF text is not already contained in WET data before starting my own PDF extraction.

Thanks,
Lukas

 

talliso...@gmail.com

Apr 6, 2015, 10:41:14 AM
to common...@googlegroups.com
Thanks to https://groups.google.com/d/msg/common-crawl/FD5uwzKWjig/wHv5_cDDX7QJ, I just started digging through the code.  I think the answer is at line 834 of ParserMapper.java:

if (MimeTypeFilter.isTextType(normalizedMimeType)) {
    // ...then do charset id
    // ...if html, parse the HTML with a ParseWorker
}

In short, I think CC is pulling text out of documents that are of text type.
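
To illustrate what that means for the original question, here is a minimal sketch of a text-type gate. It is not the actual MimeTypeFilter code: the class TextTypeCheck, its methods, and the rule "anything under text/ counts as text" are assumptions made up for illustration, and the real filter may accept more types. The point is only that a binary type such as application/pdf would never enter the extraction branch.

```java
import java.util.Locale;

// Hypothetical stand-in for a MIME text-type check; NOT Common Crawl's MimeTypeFilter.
class TextTypeCheck {

    // Strip parameters such as "; charset=UTF-8" and lower-case the base type.
    static String normalize(String contentType) {
        if (contentType == null) {
            return "";
        }
        int semi = contentType.indexOf(';');
        String base = semi >= 0 ? contentType.substring(0, semi) : contentType;
        return base.trim().toLowerCase(Locale.ROOT);
    }

    // Assumption: only types under "text/" are treated as text.
    static boolean looksLikeText(String contentType) {
        return normalize(contentType).startsWith("text/");
    }

    public static void main(String[] args) {
        System.out.println(looksLikeText("text/html; charset=UTF-8")); // true
        System.out.println(looksLikeText("application/pdf"));          // false
    }
}
```

If the real filter behaves roughly like this, PDF and DOC records would not have their text in the WET files.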


Slightly off topic, but thought I'd document this while it is fresh. 
For charset detection (see org.commoncrawl.util.CharsetUtils.bestEffortDecodeBytes()), CC appears to take the first non-null value that passes Charset.forName from:
  1. HTTP headers
  2. the http-equiv meta tag in the HTML
  3. Mozilla detector
  4. ICU detector
They also try to get a Charset.forName on the alias of the charsetName.
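
The "first non-null value that passes Charset.forName" idea can be sketched as below. This is not the CharsetUtils code: the class CharsetFallback and the method firstUsable are hypothetical names, and the candidate list simply stands in for whatever the four sources above report, in priority order.

```java
import java.nio.charset.Charset;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of a charset fallback chain; not Common Crawl's implementation.
class CharsetFallback {

    // Return the first candidate name that Charset.forName accepts, or null if none does.
    static Charset firstUsable(List<String> candidates) {
        for (String name : candidates) {
            if (name == null) {
                continue; // this source reported nothing
            }
            try {
                return Charset.forName(name.trim());
            } catch (IllegalArgumentException e) {
                // Covers both IllegalCharsetNameException and UnsupportedCharsetException;
                // fall through to the next source.
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Order mirrors the list above: HTTP header, meta tag, then the two detectors.
        List<String> candidates = Arrays.asList(null, "no-such-charset", "UTF-8", "ISO-8859-1");
        System.out.println(firstUsable(candidates)); // UTF-8
    }
}
```

Charset.forName also resolves aliases that the JVM knows about, which may be related to the alias lookup mentioned above.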

CC appears to be using com.dappit.Dapper.parser.MozillaParser to parse the html.

This is my first time looking at the code, and I may be misreading it.