MIME Magic

robertburrelldonkin

Jul 11, 2007, 5:24:26 PM

to Discuss A Release Audit Tool

ATM RAT uses a poor man's MIME typing system. it guesses based on file
name and some of the initial bytes to try to determine whether the
document should be ignored (binary), read as text (standard) or as an
archive. this doesn't work all that well. really should be able to use
some form of MIME magic.

tika is currently in the incubator but there are plans to implement
MIME typing libraries. i think it's probably a good idea to
collaborate with them rather than push the RAT heuristics any further.

opinions?

- robert

Jochen Wiedmann

Jul 12, 2007, 1:36:26 AM

to discuss-a-rele...@googlegroups.com

I have no problems with Tika or RAT depending on external libraries,
but I wonder whether this doesn't extend the scope of RAT.

A message depends on a certain format. Whoever wants to apply license
headers to the message, must decide between two possible options:

- Add the message to the body. Fine with me, but I certainly wouldn't
  expect that RAT is clever enough to extract bodies from the message
  and inspect them. If the user wants to inspect the bodies, then he
  or she should store the bodies and build the message on-the-fly.
- Violate the format, add the license to the top and have a mechanism
  to skip the license header. In which case RAT should have no problem.

I'd suggest configuring RAT to ignore the messages, if I had one in
my project that RAT fails to handle right. IMO, RAT should concentrate
on the obvious use cases, which make up about 99% of a typical project.
If there are exceptions with good reason, then there's no harm in
configuring an exclusion.

Jochen

--
"Besides, manipulating elections is under penalty of law, resulting in
a preventative effect against manipulating elections."

The German government justifying the use of electronic voting machines
and obviously believing that we don't need a police, because all
illegal actions are forbidden.

http://dip.bundestag.de/btd/16/051/1605194.pdf

robert burrell donkin

Jul 12, 2007, 8:19:44 AM

to discuss-a-rele...@googlegroups.com

sorry for causing confusion. the context is general document meta-data
rather than parsing mail.

(MIME typing was originally developed for mail but the meta-data
definitions can be used as a way of defining content meta-data for
documents in general. the idea is that you take a look at a document
and then guess - from the file name and a sample of the byte stream -
the type of document it is. then describe it using the standard MIME
system of meta-data.)
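As a rough illustration of the guessing step described above, here is a minimal Java sketch. It is not RAT's or Tika's code: the class name `MimeSniffer`, the order of the checks and the small set of signatures and extensions are all assumptions chosen for the example. It checks a few well-known magic numbers in the byte sample first, then falls back to the file name, then to a crude binary/text test.

```java
import java.util.Locale;

/** Hypothetical helper, for illustration only; not part of RAT or Tika. */
final class MimeSniffer {

    /** Guesses a MIME type from a file name and the first bytes of the content. */
    static String guess(String fileName, byte[] sample) {
        // 1. Magic numbers: well-known signatures at the start of the stream.
        if (startsWith(sample, 0x50, 0x4B, 0x03, 0x04)) return "application/zip"; // zip, jar, war
        if (startsWith(sample, 0x1F, 0x8B)) return "application/gzip";
        if (startsWith(sample, 0x89, 'P', 'N', 'G')) return "image/png";
        if (startsWith(sample, 0xCA, 0xFE, 0xBA, 0xBE)) return "application/java-vm"; // class file
        if (startsWith(sample, '<', '?', 'x', 'm', 'l')) return "application/xml";

        // 2. File name extension (assumed, deliberately short list).
        String name = fileName.toLowerCase(Locale.ROOT);
        if (name.endsWith(".java")) return "text/x-java-source";
        if (name.endsWith(".html") || name.endsWith(".htm")) return "text/html";
        if (name.endsWith(".txt")) return "text/plain";

        // 3. Last resort: a NUL byte in the sample suggests binary content.
        for (byte b : sample) {
            if (b == 0) return "application/octet-stream";
        }
        return "text/plain";
    }

    /** True if the data begins with the given unsigned byte values. */
    private static boolean startsWith(byte[] data, int... magic) {
        if (data.length < magic.length) return false;
        for (int i = 0; i < magic.length; i++) {
            if ((data[i] & 0xFF) != magic[i]) return false;
        }
        return true;
    }
}
```

A caller could then map the result onto the three cases mentioned at the start of the thread: archive types get unpacked, `text/*` gets read, and everything else is ignored as binary.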

- robert

Matthieu

Aug 4, 2007, 9:08:03 PM

to Discuss A Release Audit Tool

I think a good starter would be to group the different guessers behind
a single interface that would find the mime type of a document. The
relevant analyzers would then be used depending on the found mime
type. Then when Tika is good enough (or even right away if it is
already, I haven't checked it much) it should be pretty
straightforward to use it as an implementation for this single
interface.

We could then either change the ConditionalAnalyser to take a mime
type or have something that directly selects the right set of
analyzers given a mime type. Does that make sense? I'm just getting
into the code so I could be saying something stupid :-)
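To make the proposal concrete, here is a minimal sketch of what a single guessing interface plus MIME-based analyser selection might look like. Every name in it (`MimeTypeGuesser`, `CompositeGuesser`, `DocumentAnalyser`, `AnalyserRegistry`) is hypothetical and invented for this example; none of them is an existing RAT or Tika type.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical single entry point for MIME detection. */
interface MimeTypeGuesser {
    /** Returns a MIME type, or null if this guesser cannot decide. */
    String guess(String fileName, byte[] sample);
}

/** Groups several guessers behind the one interface; the first answer wins. */
final class CompositeGuesser implements MimeTypeGuesser {
    private final List<MimeTypeGuesser> delegates;

    CompositeGuesser(List<MimeTypeGuesser> delegates) {
        this.delegates = new ArrayList<>(delegates);
    }

    @Override
    public String guess(String fileName, byte[] sample) {
        for (MimeTypeGuesser guesser : delegates) {
            String type = guesser.guess(fileName, sample);
            if (type != null) return type;
        }
        return "application/octet-stream"; // nobody knew: treat as binary
    }
}

/** Hypothetical analyser contract; returns a short finding for the report. */
interface DocumentAnalyser {
    String analyse(String fileName, byte[] sample);
}

/** Selects the set of analysers registered for a given MIME type. */
final class AnalyserRegistry {
    private final Map<String, List<DocumentAnalyser>> byType = new LinkedHashMap<>();

    void register(String mimeType, DocumentAnalyser analyser) {
        byType.computeIfAbsent(mimeType, k -> new ArrayList<>()).add(analyser);
    }

    List<DocumentAnalyser> analysersFor(String mimeType) {
        return byType.getOrDefault(mimeType, Collections.<DocumentAnalyser>emptyList());
    }
}
```

With this shape, a Tika-backed detector would just be one more `MimeTypeGuesser` placed at the front of the list, and the existing heuristics could stay behind it as fallbacks.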


robertburrelldonkin

Aug 6, 2007, 6:07:07 PM

to Discuss A Release Audit Tool

On Aug 5, 1:08 am, Matthieu <matthieu.r...@gmail.com> wrote:
> I think a good starter would be to group the different guessers behind
> a single interface that would find the mime type of a document. The
> relevant analyzers would then be used depending on the found mime
> type. Then when Tika is good enough (or even right away if it is
> already, I haven't checked it much) it should be pretty
> straightforward to use it as an implementation for this single
> interface.

+1

RAT needs to be modularised into libraries

it was convenient for me to have all the source together but it's not
a good idea going forward

this is probably in danger of moving into the vexed area of IoC
containers so i'll stay clear of that :-/

> We could then either change the ConditionalAnalyser to take a mime
> type or have something that directly selects the right set of
> analyzers given a mime type. Does that make sense? I'm just getting
> into the code so I could be saying something stupid :-)

ConditionalAnalyser is just a wiring class. (RAT suffers from
well-intentioned over-engineering.) IMHO more elegant to implement in
a matcher.

the original intention was to use a streaming pipelined event-driven
design (resource->object->RDF->xml->html). this type of design
typically scales ok. also tends toward neatness. disadvantage is that
it also tends towards being impossible to understand :-/

need to break out more code into useful libraries

i'm not sure that using all those readers is a good idea any more.
typically, i'm applying a lot of different analysers which means the
reading is becoming inefficient. probably better with nio.

- robert

Matthieu Riou

Aug 7, 2007, 12:21:55 AM

to discuss-a-rele...@googlegroups.com

On 8/6/07, robertburrelldonkin <robertbur...@gmail.com> wrote:
> On Aug 5, 1:08 am, Matthieu <matthieu.r...@gmail.com> wrote:
> > I think a good starter would be to group the different guessers behind
> > a single interface that would find the mime type of a document. The
> > relevant analyzers would then be used depending on the found mime
> > type. Then when Tika is good enough (or even right away if it is
> > already, I haven't checked it much) it should be pretty
> > straightforward to use it as an implementation for this single
> > interface.
>
> +1
>
> RAT needs to be modularised into libraries

Would it really be needed to have separate modules? I was actually just
thinking of clearer APIs isolating each functionality but all within a
single jar. Maybe that's actually what you meant. Or are you thinking
of a need to reuse some specific pieces outside of rat?

> it was convenient for me to have all the source together but it's not
> a good idea going forward
>
> this is probably in danger of moving into the vexed area of IoC
> containers so i'll stay clear of that :-/

Can't say I'm a great IoC fan either :-)

> > We could then either change the ConditionalAnalyser to take a mime
> > type or have something that directly selects the right set of
> > analyzers given a mime type. Does that make sense? I'm just getting
> > into the code so I could be saying something stupid :-)
>
> ConditionalAnalyser is just a wiring class. (RAT suffers from
> well-intentioned over-engineering.) IMHO more elegant to implement in
> a matcher.
>
> the original intention was to use a streaming pipelined event-driven
> design (resource->object->RDF->xml->html). this type of design
> typically scales ok. also tends toward neatness. disadvantage is that
> it also tends towards being impossible to understand :-/

Yeah I've noticed a slightly over-engineered streak ;-) I concur with
the intent though, it's definitely a more scalable architecture.
However, as you're mentioning, I'd say RAT should be optimized for
making it easy for people to contribute patches more than execution
performance (I don't mind waiting a couple more seconds for a release
report).

> need to break out more code into useful libraries
>
> i'm not sure that using all those readers is a good idea any more.
> typically, i'm applying a lot of different analysers which means the
> reading is becoming inefficient. probably better with nio.

I was actually wondering if it wouldn't make sense to read a fixed
quantity of bytes (say a few kb) and have the analysers work in memory.
I haven't seen a lot of readers usage though, I probably missed
something...

Cheers,
Matthieu

> - robert

Robert Burrell Donkin

Aug 7, 2007, 5:42:10 PM

to discuss-a-rele...@googlegroups.com

On 8/7/07, Matthieu Riou <matthi...@gmail.com> wrote:
> On 8/6/07, robertburrelldonkin <robertbur...@gmail.com> wrote:
> >
> > On Aug 5, 1:08 am, Matthieu <matthieu.r...@gmail.com> wrote:
> > > I think a good starter would be to group the different guessers behind
> > > a single interface that would find the mime type of a document. The
> > > relevant analyzers would then be used depending on the found mime
> > > type. Then when Tika is good enough (or even right away if it is
> > > already, I haven't checked it much) it should be pretty
> > > straightforward to use it as an implementation for this single
> > > interface.
> >
> > +1
> >
> > RAT needs to be modularised into libraries
>
> Would it really be needed to have separate modules? I was actually just
> thinking of clearer APIs isolating each functionality but all within a
> single jar. Maybe that's actually what you meant. Or are you thinking of a
> need to reuse some specific pieces outside of rat?

there's interest in reusing bits and pieces of RAT outside

for example, the license-artifact index we were talking about would be
useful outside

> > it was convenient for me to have all the source together but it's not
> > a good idea going forward
> >
> > this is probably in danger of moving into the vexed area of IoC
> > containers so i'll stay clear of that :-/
>
> Can't say I'm a great IoC fan either :-)

as you can probably tell from the design i'm very IoC, but it's wise to
keep away since containers tend to be a bit of a religious subject...

> > > We could then either change the ConditionalAnalyser to take a mime
> > > type or have something that directly selects the right set of
> > > analyzers given a mime type. Does that make sense? I'm just getting
> > > into the code so I could be saying something stupid :-)
> >
> > ConditionalAnalyser is just a wiring class. (RAT suffers from
> > well-intentioned over-engineering.) IMHO more elegant to implement in
> > a matcher.
> >
> > the original intention was to use a streaming pipelined event-driven
> > design (resource->object->RDF->xml->html). this type of design
> > typically scales ok. also tends toward neatness. disadvantage is that
> > it also tends towards being impossible to understand :-/
>
> Yeah I've noticed a slightly over-engineered streak ;-) I concur with the
> intent though, it's definitely a more scalable architecture. However, as
> you're mentioning, I'd say RAT should be optimized for making it easy for
> people to contribute patches more than execution performance (I don't mind
> waiting a couple more seconds for a release report).

(i have a personal interest in performance but this should only really
affect the wiring)

RDF pipelines are interesting but have some tricky wrinkles. (talked
to leo and stefano about this at apachecon.) probably easier to
completely analyse the document rather than try to stream the results.

> > need to break out more code into useful libraries
> >
> > i'm not sure that using all those readers is a good idea any more.
> > typically, i'm applying a lot of different analysers which means the
> > reading is becoming inefficient. probably better with nio.
>
> I was actually wondering if it wouldn't make sense to read a fixed quantity
> of bytes (say a few kb) and have the analysers work in memory. I haven't
> seen a lot of readers usage though, I probably missed something...

not at all

originally i thought that a line-by-line approach using state machines
would be good enough (the idea is that you read line-by-line and then
throw each line to every analyser) but it's not. nio should be an
efficient approach to read just the start of the file but need to
understand the mime type first. so mime-type guessing is probably the
first step.

if it's a character format then use nio to load up a sample from the
file. the header analysers would run better against charbuffers with
white space removed.
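A minimal sketch of that idea, assuming the MIME type has already been found to be a character format and the charset is known. The class name `HeaderSampler` and its methods are hypothetical, and the code uses `java.nio.file`, which arrived in Java 7 and so postdates this thread. It reads a bounded sample from the start of the file through a `FileChannel`, decodes it, and produces a second `CharBuffer` with the white space dropped.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.Charset;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical helper, for illustration only; not part of RAT. */
final class HeaderSampler {

    /** Reads at most maxBytes from the start of the file and decodes them. */
    static CharBuffer readSample(Path file, int maxBytes, Charset charset) throws IOException {
        ByteBuffer bytes = ByteBuffer.allocate(maxBytes);
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            // A single read() may return fewer bytes than requested,
            // so keep reading until the buffer is full or the file ends.
            while (bytes.hasRemaining() && channel.read(bytes) != -1) {
                // nothing else to do per iteration
            }
        }
        bytes.flip();
        // Note: a multi-byte character cut off at the sample boundary is
        // decoded as a replacement character rather than raising an error.
        return charset.decode(bytes);
    }

    /** Returns a new buffer holding the input's characters minus white space. */
    static CharBuffer stripWhitespace(CharBuffer in) {
        CharBuffer out = CharBuffer.allocate(in.remaining());
        for (int i = in.position(); i < in.limit(); i++) {
            char c = in.get(i); // absolute get: does not move the position
            if (!Character.isWhitespace(c)) out.put(c);
        }
        out.flip();
        return out;
    }
}
```

Each analyser can then be handed the same in-memory buffer (for example via `duplicate()`), so the file is read once no matter how many analysers run.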

ATM there's too much hassle involved in creating matchers for new
headers. would be better using meta-data as input to a single analyser

(may be worth taking a look at nutch and lucene to see how they
approach similar problems)

- robert
