if (MimeTypeFilter.isTextType(normalizedMimeType)) {
...then do charset id
...if html, parse the HTML with a ParseWorker
}
In short, I think CC only extracts text from documents whose normalized MIME type is a text type.
Slightly off topic, but thought I'd document this while it is fresh.
For charset detection (see bestEffortDecodeBytes() in org.commoncrawl.util.CharsetUtils), CC appears to take the first non-null value that passes Charset.forName from the following sources, in this order:
- HTTP headers
- HTML meta http-equiv tag
- Mozilla detector
- ICU detector
They also try Charset.forName on an alias of the charsetName.
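
To make the fallback order above concrete, here is a minimal sketch of "take the first non-null candidate that passes Charset.forName". This is not CC's code: the class name CharsetFallback, the methods tryForName and firstUsableCharset, and the idea of modelling each source (HTTP header, meta tag, Mozilla detector, ICU detector) as a Supplier<String> are all my own assumptions for illustration.

```java
import java.nio.charset.Charset;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.function.Supplier;

// Hypothetical illustration only; none of these names come from Common Crawl.
public class CharsetFallback {

    /** Returns the Charset for the name, or null if it is missing, malformed, or unsupported. */
    public static Charset tryForName(String name) {
        if (name == null || name.trim().isEmpty()) {
            return null;
        }
        try {
            return Charset.forName(name.trim());
        } catch (IllegalArgumentException e) {
            // IllegalCharsetNameException and UnsupportedCharsetException
            // are both subclasses of IllegalArgumentException.
            return null;
        }
    }

    /** Walks the candidate sources in order and returns the first usable charset. */
    public static Optional<Charset> firstUsableCharset(List<Supplier<String>> candidates) {
        for (Supplier<String> candidate : candidates) {
            Charset cs = tryForName(candidate.get());
            if (cs != null) {
                return Optional.of(cs);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Stand-ins for the four sources; real detectors would inspect the bytes.
        List<Supplier<String>> candidates = Arrays.asList(
                () -> null,                  // HTTP header carried no charset
                () -> "not-a-real-charset",  // meta tag value the JVM does not know
                () -> "ISO-8859-1",          // first detector's guess
                () -> "UTF-8");              // second detector's guess, never reached
        System.out.println(firstUsableCharset(candidates));
    }
}
```

Because the sources are suppliers, a later (and presumably more expensive) detector only runs when every earlier source failed. Whether CC evaluates its detectors lazily like this is something I have not checked.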
CC appears to be using com.dappit.Dapper.parser.MozillaParser to parse the HTML.
This is my first time looking at the code, and I may be misreading it.