Weird mimetypes in mime_detected?

Tim Allison

Nov 8, 2024, 2:30:19 PM
to Common Crawl
In looking through the mime_detected mime types that I extracted from CC-MAIN-2024-38 indices, there are some weird ones that I hope didn't come from Tika. :D

The numbers are really small, but I'd like to try to figure out if Tika is trusting user-generated data and passing it through as a mime type or if something else is going on. Does Nutch or CC's implementation have a fallback option if Tika returns application/octet-stream or otherwise fails?

Any ideas what might be going on? Thank you!

Some examples:

application/.octet-stream
text/x-unknown-content-type
metahtml/interpreted
application/.
text/meme
appliation/octet-stream
applications/octet-strea
application/stream-octet

Sebastian Nagel

Nov 8, 2024, 6:01:38 PM
to common...@googlegroups.com
Hi Tim,

> Does Nutch or CC's implementation have a fallback
> option if Tika returns application/octet-stream or otherwise fails?

Yes. The MIME type from the HTTP header "Content-Type" is used as
fall-back. See

https://github.com/apache/nutch/blob/e1b8dbe909b0f8c181dcb5ee0e7e072f27f82cbb/src/java/org/apache/nutch/util/MimeUtil.java#L185
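
A rough sketch of that fallback logic, written in Python for brevity. It is not the Nutch code linked above: the function name, the constant, and the lowercasing are assumptions made for illustration. It shows why a misspelled Content-Type can end up in the detected field when the detector only returns application/octet-stream.

```python
from typing import Optional

OCTET_STREAM = "application/octet-stream"


def resolve_mime(detected: Optional[str], content_type_header: Optional[str]) -> str:
    """Return the detector's result, or fall back to the HTTP Content-Type.

    Hypothetical helper: a simplified model of the fallback, not Nutch's API.
    """
    # Trust the detector whenever it found something more specific.
    if detected and detected != OCTET_STREAM:
        return detected
    if content_type_header:
        # Drop parameters such as "; charset=utf-8". The remaining value is
        # server-supplied and passed through unchecked, typos included.
        header_type = content_type_header.split(";", 1)[0].strip().lower()
        if header_type:
            return header_type
    return OCTET_STREAM
```

With this model, a response whose content is not recognized and whose header says "appliation/octet-stream" is recorded with exactly that misspelled type.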

Best,
Sebastian

Tim Allison

Nov 11, 2024, 3:50:41 PM
to Common Crawl
Thank you, Sebastian!

How much unhappiness in the world would it create not to have the fallback? 

From a Tika perspective, we'd like to be able to process all the "application/octet-stream"s in CC and see where we need to improve detection. We certainly can work (and are working!) with what we have, but there's no way -- short of reprocessing -- for us to get the actual files that Tika wasn't able to identify, right?

Any recommendations?

Thank you, again.

Best,

         Tim

Sebastian Nagel

Nov 11, 2024, 5:21:29 PM
to common...@googlegroups.com
Hi Tim,

> How much unhappiness in the world would it create not to have the
> fallback?

For Common Crawl: not much because the MIME type from the Content-Type
HTTP header is also in the index in a separate field / column.

For Nutch it would be more painful because the MIME type is required
to select the parser. If there is none or "application/octet-stream"
the document cannot be parsed.

Because the Common Crawler is just a fork of Nutch, we'd need to keep
the purely identified MIME type separately. Note that the URL (or rather,
its file suffix) is also passed to Tika's MIME detector as an additional hint.

Of course, improving the MIME detection would be the best solution.
A MIME type spellchecker or cleanup utility would also be good to
fix "appliation/octet-stream" and similar. Is this something Tika already
provides?
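
A minimal sketch of what such a cleanup utility could look like, using fuzzy matching from Python's standard library. Everything here is an assumption for illustration: the short list of known types, the function name, and the 0.9 similarity cutoff. A real list might come from the IANA registry or from Tika's own type definitions.

```python
import difflib
from typing import Optional

# Illustrative list only; a real utility would load a full registry.
KNOWN_TYPES = [
    "application/octet-stream",
    "application/pdf",
    "application/xml",
    "text/html",
    "text/plain",
]


def clean_mime(raw: str, cutoff: float = 0.9) -> Optional[str]:
    """Map a possibly misspelled MIME type to a known one, or return None."""
    # Strip parameters and normalize case before comparing.
    candidate = raw.split(";", 1)[0].strip().lower()
    if candidate in KNOWN_TYPES:
        return candidate
    # Pick the single closest known type, if it is similar enough.
    matches = difflib.get_close_matches(candidate, KNOWN_TYPES, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

With this list, clean_mime("appliation/octet-stream") yields "application/octet-stream", while "text/meme" has no close match and yields None. The cutoff is kept high on purpose: a looser threshold starts mapping garbage such as "application/." onto real types.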


> get actual files that Tika wasn't able to identify

In addition, there are those files which are identified but with a wrong
MIME type. There was one example recently on the Nutch user mailing
list: a text/html by Content-Type which was detected as application/xml.
It's an HTML snippet (an XML/HTML comment and one iframe element), not a
trivial document to identify.

I remember that years ago I prepared a confusion matrix of the identified
MIME type versus the Content-Type, see [1]. I'm happy to repeat this for a
recent crawl and a more recent Tika version. Currently, 2.9.1.0 is used.
Maybe I'll also add the file suffix from the URL (.html, .pdf, .docx) if
there is one.
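
For illustration, a confusion matrix of this kind can be sketched as a counter over (Content-Type, detected type) pairs. This is not the script behind [1]; the function names are hypothetical, and the input is assumed to be pairs already extracted from the index.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Pair = Tuple[str, str]


def confusion_counts(pairs: Iterable[Pair]) -> Counter:
    """Count how often each (Content-Type, detected type) pair occurs."""
    counts: Counter = Counter()
    for header_type, detected_type in pairs:
        # Normalize the header: drop parameters, trim, lowercase.
        header = header_type.split(";", 1)[0].strip().lower()
        detected = detected_type.strip().lower()
        counts[(header, detected)] += 1
    return counts


def disagreements(counts: Counter, top: int = 10) -> List[Tuple[Pair, int]]:
    """Return the most frequent pairs where header and detection differ."""
    differing = [(pair, n) for pair, n in counts.most_common() if pair[0] != pair[1]]
    return differing[:top]
```

The off-diagonal cells returned by disagreements() are the interesting ones, for example text/html by Content-Type but application/xml by detection.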

Best,
Sebastian

[1] https://lists.apache.org/thread/fhhp1p6y4ttxmplvz1ohk3wwjz25ozbc



Tim Allison

Nov 13, 2024, 11:57:31 AM
to Common Crawl
> Because the Common Crawler is just a fork of Nutch, we'd need to keep the purely identified MIME type separately.
Can I open a ticket? How do requests work? Does this seem like something that would be useful to CC?

> A MIME type spellchecker or cleanup utility would also be good to fix "appliation/octet-stream" and similar. Is this something Tika already provides?
Right. Not something that Tika does.

> In addition, there are those files which are identified but with a wrong MIME type.
Yes. We now have an integration with Google's Magika... it might be interesting to run a bake-off or simply trust it for text-based files. :D

> Note that the URL (or rather, its file suffix) is also passed to Tika's MIME detector as an additional hint.
Yes.

Thank you, again, Sebastian!

Sebastian Nagel

Nov 14, 2024, 8:12:15 AM
to common...@googlegroups.com
Hi Tim,

> Can I open a ticket? How do requests work? Does this seem like
> something that would be useful to CC?

Definitely. Simply because improvements in Tika's MIME detector will
improve our data (annotations) in the mid/long-term.

Issues for our fork of Nutch are tracked at
https://github.com/commoncrawl/nutch/issues/

However, passing a second MIME type forward in the pipeline and exposing
it in the index might be challenging.

Primarily I'd focus on improving the detection, see
https://issues.apache.org/jira/browse/NUTCH-3089
https://issues.apache.org/jira/browse/TIKA-4351

Best,
Sebastian