Hi Tim,
> How much unhappiness in the world would it create not to have the
> fallback?
For Common Crawl: not much because the MIME type from the Content-Type
HTTP header is also in the index in a separate field / column.
For Nutch it would be more painful because the MIME type is required
to select the parser. If there is none, or if it is
"application/octet-stream", the document cannot be parsed.
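
To illustrate the fallback, here is a minimal Python sketch of the logic
as described above. It is not the Nutch MimeUtil code: the function name
resolve_mime_type, its parameters, and the stripping of Content-Type
parameters are assumptions made for illustration only.

```python
from typing import Optional

OCTET_STREAM = "application/octet-stream"


def resolve_mime_type(detected: Optional[str],
                      content_type: Optional[str]) -> Optional[str]:
    """Prefer the detected type; fall back to the HTTP Content-Type.

    Hypothetical helper, not part of Nutch or Tika.
    """
    # A specific detection result wins over the header.
    if detected and detected != OCTET_STREAM:
        return detected
    if content_type:
        # Assumption: drop parameters such as "; charset=UTF-8".
        base = content_type.split(";", 1)[0].strip().lower()
        if base:
            return base
    # Nothing better available: keep whatever the detector returned.
    return detected
```

Without the fallback branch, every document the detector gives up on
would stay at "application/octet-stream" and no parser would be selected.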
Because Common Crawl's crawler is just a fork of Nutch, we'd need to keep
the purely detected MIME type separately. Note: the URL (or rather its
file suffix) is also passed to Tika's MIME detector as an additional hint.
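
As a rough illustration of what such a hint looks like, the following
Python sketch derives the file suffix from a URL. It only shows the idea;
url_suffix is a hypothetical helper and not how Nutch or Tika extract the
resource name.

```python
import posixpath
from typing import Optional
from urllib.parse import urlsplit


def url_suffix(url: str) -> Optional[str]:
    """Return the lower-cased file suffix of the URL path, or None."""
    # Only the path matters; query string and fragment are ignored.
    path = urlsplit(url).path
    _, ext = posixpath.splitext(path)
    return ext.lower() or None
```

For example, url_suffix("https://example.org/docs/Report.PDF?download=1")
gives ".pdf", while a URL ending in "/" gives None.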
Of course, improving the MIME detection would be the best solution.
A MIME type spellchecker or cleanup utility would also be good to
fix "appliation/octet-stream" and similar. This isn't something Tika
already provides, is it?
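
A minimal sketch of what such a cleanup could look like, in Python using
fuzzy matching from the standard library. The list of known types, the
cutoff of 0.85, and the name clean_mime_type are all assumptions for
illustration; a real utility would need a full MIME type registry and a
tuned threshold.

```python
import difflib

# Assumption: a tiny illustrative list, not a complete registry.
KNOWN_TYPES = [
    "application/octet-stream",
    "application/pdf",
    "application/xhtml+xml",
    "application/xml",
    "image/jpeg",
    "image/png",
    "text/html",
    "text/plain",
]


def clean_mime_type(raw: str, cutoff: float = 0.85) -> str:
    """Normalize a Content-Type value and fix near-miss spellings."""
    # Drop parameters such as "; charset=UTF-8" and normalize case.
    value = raw.split(";", 1)[0].strip().lower()
    if value in KNOWN_TYPES:
        return value
    # Pick the closest known type, if any is similar enough.
    matches = difflib.get_close_matches(value, KNOWN_TYPES, n=1, cutoff=cutoff)
    return matches[0] if matches else value
```

With this list, "appliation/octet-stream" is mapped to
"application/octet-stream", and values without a close match are returned
normalized but otherwise unchanged.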
> get actual files that Tika wasn't able to identify
In addition, there are those files which are identified but with a wrong
MIME type. There was one example recently on the Nutch user mailing
list: a text/html by Content-Type which was detected as application/xml.
It's an HTML snippet (an XML/HTML comment and one iframe element), not a
trivial document to identify.
I remember that years ago I prepared a confusion matrix of the identified
MIME type and the Content-Type, see [1]. I'm happy to repeat this for a
recent crawl and a more recent Tika version. Currently, 2.9.1.0 is used.
Maybe I'll also add the file suffix from the URL (.html, .pdf, .docx) if
there is one.
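
For reference, the counting behind such a confusion matrix can be
sketched in a few lines of Python. The sample pairs below are made up and
are not real crawl data; confusion_counts and disagreements are
hypothetical helper names.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# (Content-Type from the HTTP header, MIME type identified by the detector)
Pair = Tuple[str, str]


def confusion_counts(pairs: Iterable[Pair]) -> Counter:
    """Count how often each (header type, detected type) pair occurs."""
    return Counter(pairs)


def disagreements(counts: Counter) -> List[Tuple[Pair, int]]:
    """Return the off-diagonal cells, most frequent first."""
    rows = [(pair, n) for pair, n in counts.items() if pair[0] != pair[1]]
    return sorted(rows, key=lambda row: (-row[1], row[0]))


# Made-up sample records for illustration only.
sample = [
    ("text/html", "text/html"),
    ("text/html", "text/html"),
    ("text/html", "application/xml"),
    ("application/pdf", "application/octet-stream"),
    ("application/pdf", "application/octet-stream"),
]

counts = confusion_counts(sample)
for (header_type, detected_type), n in disagreements(counts):
    print(f"{header_type}\t{detected_type}\t{n}")
```

Adding the URL file suffix would simply mean counting triples instead of
pairs.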
Best,
Sebastian
[1] https://lists.apache.org/thread/fhhp1p6y4ttxmplvz1ohk3wwjz25ozbc
On 11/11/24 21:50, Tim Allison wrote:
> Thank you, Sebastian!
>
> How much unhappiness in the world would it create not to have the fallback?
>
> From a Tika perspective, we'd like to be able to process all the
> "application/octet-stream"s in CC and see where we need to improve
> detection. We can (and are!) certainly working with what we have, but
> there's no way -- short of reprocessing -- for us to get actual files
> that Tika wasn't able to identify, right?
>
> Any recommendations?
>
> Thank you, again.
>
> Best,
>
> Tim
>
> On Friday, November 8, 2024 at 6:01:38 PM UTC-5 Sebastian Nagel wrote:
>
> Hi Tim,
>
> > Does Nutch or CC's implementation have a fallback
> > option if Tika returns application/octet-stream or otherwise fails?
>
> Yes. The MIME type from the HTTP header "Content-Type" is used as
> fall-back. See
>
> https://github.com/apache/nutch/blob/e1b8dbe909b0f8c181dcb5ee0e7e072f27f82cbb/src/java/org/apache/nutch/util/MimeUtil.java#L185