Development Directions

Matthieu Riou

Aug 7, 2007, 7:05:34 PM

to discuss-a-rele...@googlegroups.com

Starting a new subject as the magic mime one is getting a bit off topic :)

So if I sum up what we've discussed so far and the directions you're seeing for RAT, there are basically 4 main development tasks:

* An artifact / license index. This would have an API that the RAT release checker could use, a back-end to store the data and a web front-end for people to add more artifact licenses.
* A mime type analyzer. Ideally based on Tika (which seems to have been very quiet recently), but in any case something that reuses an existing mime magic numbers database.
* RAT refactoring. Basically changing the current design a bit to make it more modular and easier to grasp for potential newcomers.
* Tooling integration. Maven and Ant plugins. IIUC the Maven one could live inside Maven. I'm also interested in adding a buildr one (http://buildr.rubyforge.org).

Is this accurate? I'd be willing to get started on the artifact/license index thing but if you think that another area should be explored first, I'm pretty flexible.

Robert Burrell Donkin

Aug 8, 2007, 5:36:31 PM

to discuss-a-rele...@googlegroups.com

On 8/7/07, Matthieu Riou <matthi...@gmail.com> wrote:
> Starting a new subject as the magic mime one is getting a bit off topic :)

cool

> So if I sum up what we've discussed so far and the directions you're seeing
> for RAT, there are basically 4 main development tasks:
>
> * An artifact / license index. This would have an API that the RAT release
> checker could use, a back-end to store the data and a web front-end for
> people to add more artifact licenses.

there are various complementary approaches to this problem

the static approach means building up a large quantity of data that
can quickly be queried. the output library only needs to read data
in a suitable format or formats then answer questions. input libraries
are then needed to allow data to be easily added.

it should be possible to mine a large quantity of data from the maven
repository by using an RDFizer then stripping out just the information
of interest. this would give a good starting data set.

in terms of a web application, i think that a pure javascript solution
(possibly built using XAP) would be the best approach. this would
allow the application to be used from a static page.

the other approach is dynamic. even given an unknown artifact it
should also be possible to do quite a lot by dynamically guessing
based on artifact content, for example when the artifact contains a
pom or a LICENSE file.
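
A minimal sketch of the dynamic approach, in Python: open the artifact as a zip archive and list the entries that might reveal its license. The hint names in `LICENSE_HINTS` and the helper names are assumptions made for illustration; none of this is existing RAT code.

```python
import io
import zipfile

# Hypothetical file-name hints; a real checker would need a longer list.
LICENSE_HINTS = {"LICENSE", "LICENSE.TXT", "COPYING", "NOTICE"}


def find_license_clues(archive_bytes):
    """Return the archive entries that may reveal the artifact's license."""
    clues = []
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        for name in archive.namelist():
            base = name.rsplit("/", 1)[-1]
            # An embedded pom or a license-like file is treated as a clue.
            if base == "pom.xml" or base.upper() in LICENSE_HINTS:
                clues.append(name)
    return sorted(clues)


def build_demo_jar():
    """Build a tiny in-memory jar so the sketch runs without a real artifact."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("META-INF/LICENSE", "Apache License, Version 2.0")
        archive.writestr("META-INF/maven/org.example/demo/pom.xml", "<project/>")
        archive.writestr("org/example/Demo.class", b"\xca\xfe\xba\xbe")
    return buffer.getvalue()


if __name__ == "__main__":
    for clue in find_license_clues(build_demo_jar()):
        print(clue)
```

This only finds candidate files. Working out which license they actually contain would be a separate step.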

> * A mime type analyzer. Ideally based on Tika (which seems to have been
> very quiet recently) but anyway something that would reuse an existing mime
> magic numbers database.

pretty much

> * RAT refactoring. Basically changing the current design a bit to make it
> more modular and easier to grasp for eventual newcomers.

that's about right

the header recognition code may well be useful for maven so i'd like
to factor that out

it's probably more convenient to build up meta-data for a resource
rather than stream it upon creation. this will probably make analytics
easier. i think that UIMA has a very similar architecture so that's
probably worth looking into.
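
To make the build-up-then-analyze idea concrete, here is a rough Python sketch of an in-memory store that collects statements about each resource first and answers questions afterwards. The class name, the predicate names and the sample paths are all hypothetical. This is not the RAT or UIMA API, just an illustration of the shape.

```python
from collections import defaultdict


class ResourceMetadata:
    """Collects (subject, predicate, object) statements for later analysis."""

    def __init__(self):
        self._by_subject = defaultdict(list)

    def add(self, subject, predicate, obj):
        """Record one statement about a resource."""
        self._by_subject[subject].append((predicate, obj))

    def values(self, subject, predicate):
        """Return every object recorded for this subject and predicate."""
        return [o for p, o in self._by_subject.get(subject, []) if p == predicate]

    def subjects_with(self, predicate, obj):
        """Return the subjects that carry the given statement, sorted by name."""
        return sorted(
            subject
            for subject, statements in self._by_subject.items()
            if (predicate, obj) in statements
        )


if __name__ == "__main__":
    store = ResourceMetadata()
    store.add("src/Foo.java", "header", "apache-2.0")
    store.add("src/Bar.java", "header", "unknown")
    # Analytics run after everything has been collected, not while streaming.
    print(store.subjects_with("header", "unknown"))
```

Because nothing is reported until all statements are in, aggregate questions (such as which resources have an unknown header) become simple queries.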

i am still convinced that RDF is the right structure, though

> * Tooling integration. Maven and Ant plugins. IIUC the Maven one could live
> inside Maven. I'm also interested in adding a buildr one (
> http://buildr.rubyforge.org)

sounds good

> Is this accurate?

reasonably so

> I'd be willing to get started on the artifact/license
> index thing but if you think that another area should be explored first, I'm
> pretty flexible.

if that's your itch, scratch it :-)

i plan to check an RDFised pom from the maven repository for
dependencies on bouncy castle (and other open source libraries) to
report on projects that may need to think about filing ECCNs.
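
As a rough illustration of that check, the Python sketch below looks for bouncy castle among the dependencies of a pom. Note that it parses the plain pom XML rather than an RDFised version, which is a simplification. The sample pom, its version numbers, the `crypto_dependencies` helper and the idea of matching on a group id containing "bouncycastle" are assumptions for the example.

```python
import xml.etree.ElementTree as ET

# Namespace used by Maven 2 poms; poms without it would need a different lookup.
POM_NS = {"m": "http://maven.apache.org/POM/4.0.0"}

# Made-up pom for demonstration only.
SAMPLE_POM = """
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <groupId>org.example</groupId>
  <artifactId>demo</artifactId>
  <version>1.0</version>
  <dependencies>
    <dependency>
      <groupId>bouncycastle</groupId>
      <artifactId>bcprov-jdk14</artifactId>
      <version>0.0</version>
    </dependency>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>0.0</version>
    </dependency>
  </dependencies>
</project>
"""


def crypto_dependencies(pom_text, markers=("bouncycastle",)):
    """Return 'groupId:artifactId' for dependencies whose group id matches a marker."""
    root = ET.fromstring(pom_text.strip())
    hits = []
    for dep in root.findall("m:dependencies/m:dependency", POM_NS):
        group = dep.findtext("m:groupId", default="", namespaces=POM_NS)
        artifact = dep.findtext("m:artifactId", default="", namespaces=POM_NS)
        if any(marker in group for marker in markers):
            hits.append(f"{group}:{artifact}")
    return hits


if __name__ == "__main__":
    print(crypto_dependencies(SAMPLE_POM))
```

Other crypto libraries could be covered by passing more entries in `markers`.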

- robert

robertburrelldonkin

Aug 10, 2007, 7:10:43 AM

to Discuss A Release Audit Tool

On Aug 7, 11:05 pm, "Matthieu Riou" <matthieu.r...@gmail.com> wrote:
> Starting a new subject as the magic mime one is getting a bit off topic :)

<snip>

> * Tooling integration. Maven and Ant plugins. IIUC the Maven one could live
> inside Maven. I'm also interested in adding a buildr one (http://buildr.rubyforge.org)

buildr sounds good

i'm starting to get annoyed by the plain text reports so probably we
should decide upon an output format and start producing html

the recursive RAT python script which i use to verify trees of
releases ATM produces a concatenated output. i'm starting to get annoyed
by this so i may find time to improve this soon.

- robert

robertburrelldonkin

Aug 10, 2007, 7:15:01 AM

to Discuss A Release Audit Tool

On Aug 8, 9:36 pm, "Robert Burrell Donkin"
<robertburrelldon...@gmail.com> wrote:

> On 8/7/07, Matthieu Riou <matthieu.r...@gmail.com> wrote:

> > I'd be willing to get started on the artifact/license
> > index thing but if you think that another area should be explored first, I'm
> > pretty flexible.
>
> if that's your itch, scratch it :-)
>
> i plan to check an RDFised pom from the maven repository for
> dependencies on bouncy castle (and other open source libraries) to
> report on projects that may need to think about filing ECCNs.

but don't let this put you off starting to look at the artifact/
license format etc

the only wrinkle is that you don't really want to know about licenses
but about the characteristics of that license. for example, apache
license 1.1 is a license family (each license is different but fits a
common template). this license family is OSI compatible but GPL
incompatible. the good news is that these analytics logically belong
to a separate component. so, it's just the license (or family) that
needs to be stored.
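
A small Python sketch of that separation: the index stores only the license family for each artifact, and a separate analytics table holds the characteristics of each family. The artifact name and the identifiers are hypothetical. The two traits given for the apache 1.1 family are the ones stated above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FamilyTraits:
    """Characteristics of a license family, owned by the analytics component."""
    osi_compatible: bool
    gpl_compatible: bool


# Analytics side: what is known about each family.
FAMILY_TRAITS = {
    "apache-1.1": FamilyTraits(osi_compatible=True, gpl_compatible=False),
}

# Index side: only the family is stored per artifact (hypothetical entry).
ARTIFACT_FAMILY = {
    "example-artifact-1.0.jar": "apache-1.1",
}


def traits_for(artifact):
    """Look up the family in the index, then its traits in the analytics table."""
    family = ARTIFACT_FAMILY.get(artifact)
    if family is None:
        return None
    return FAMILY_TRAITS.get(family)


if __name__ == "__main__":
    print(traits_for("example-artifact-1.0.jar"))
```

Keeping the two tables apart means new characteristics can be added without touching the stored artifact data.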

- robert

Matthieu Riou

Aug 10, 2007, 9:56:07 AM

to discuss-a-rele...@googlegroups.com

On 8/10/07, robertburrelldonkin <robertbur...@gmail.com> wrote:

> On Aug 7, 11:05 pm, "Matthieu Riou" <matthieu.r...@gmail.com> wrote:
> > Starting a new subject as the magic mime one is getting a bit off topic :)
>
> <snip>
>
> > * Tooling integration. Maven and Ant plugins. IIUC the Maven one could live
> > inside Maven. I'm also interested in adding a buildr one (http://buildr.rubyforge.org)
>
> buildr sounds good

It's a great tool, simply saves a lot of time.

> i'm starting to get annoyed by the plain text reports so probably we
> should decide upon an output format and start producing html
>
> the recursive RAT python script which i use to verify trees of
> releases ATM produces a concatenated output. i'm starting to get annoyed
> by this so i may find time to improve this soon.

Yeah the textual output can be long and you always have to redirect to a file if you want to analyze it. And then including the full content of the files that have a problem also kind of dilutes the useful information. Maybe a line number pointing to the problem (or to the absence of something needed) would tighten that a bit.

As for the output format we could go for simple XML, an XSL sheet would give us HTML. Another alternative (I'm not so big on XML these days) could be JSON. It could be read directly by a small Javascript and rendered easily, eliminating the need for an additional report generation tool.
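
To give an idea of what a JSON report could look like, here is a minimal Python sketch. The field names (`summary`, `problems`, `file`, `line`, `problem`) and the sample findings are invented for illustration. RAT does not define such a format today.

```python
import json


def build_report(findings):
    """Turn (path, line, problem) tuples into a JSON-friendly report.

    `line` is None when the problem is the absence of something, such as a
    missing header, rather than a specific line.
    """
    return {
        "summary": {
            "files_with_problems": len({path for path, _, _ in findings}),
        },
        "problems": [
            {"file": path, "line": line, "problem": problem}
            for path, line, problem in findings
        ],
    }


def to_json(report):
    """Serialize the report so a small script in a static page could load it."""
    return json.dumps(report, indent=2)


if __name__ == "__main__":
    sample = [
        ("src/Foo.java", None, "missing license header"),
        ("src/Bar.java", 12, "unknown license"),
    ]
    print(to_json(build_report(sample)))
```

Pointing at a file and line instead of embedding the whole file keeps the report short.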

Matthieu

> - robert

Matthieu Riou

Aug 10, 2007, 10:36:31 AM

to discuss-a-rele...@googlegroups.com

On 8/10/07, robertburrelldonkin <robertbur...@gmail.com> wrote:

> On Aug 8, 9:36 pm, "Robert Burrell Donkin"
> <robertburrelldon...@gmail.com> wrote:
> > On 8/7/07, Matthieu Riou <matthieu.r...@gmail.com> wrote:
>
> > > I'd be willing to get started on the artifact/license
> > > index thing but if you think that another area should be explored first, I'm
> > > pretty flexible.
> >
> > if that's your itch, scratch it :-)
> >
> > i plan to check an RDFised pom from the maven repository for
> > dependencies on bouncy castle (and other open source libraries) to
> > report on projects that may need to think about filing ECCNs.
>
> but don't let this put you off starting to look at the artifact/
> license format etc

Nope, it's actually good to know. ODE will probably need WSS4J at some point.

> the only wrinkle is that you don't really want to know about licenses
> but about the characteristics of that license. for example, apache
> license 1.1 is a license family (each license is different but fits a
> common template). this license family is OSI compatible but GPL
> incompatible. the good news is that these analytics logically belong
> to a separate component. so, it's just the license (or family) that
> needs to be stored.

Yep, my plan was more or less to list all OSI licenses adding a "family" checkbox when the license is slightly changed (my understanding is that it's usually for copyright). So the complete set of information would be something like:

Organization: Apache, Codehaus, Sourceforge, Google Code, ..., Unknown/Self-Hosted
Artifact Name (would need to be more specific than the project for multi module projects, like "Axis2 CodeGen" for example)
Version
Maybe a link to the project homepage
Released artifact name
License: with a combo of all OSI approved licenses
Family: for those having slight modifications
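
The same set of fields written down as a Python record, purely as a sketch. The field names are my own guesses at reasonable identifiers, and the sample values (including the file name and the "x.y" version placeholder) are illustrative only.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class ArtifactRecord:
    """One entry of the artifact/license index, following the list above."""
    organization: str            # Apache, Codehaus, ..., Unknown/Self-Hosted
    artifact_name: str           # more specific than the project name
    version: str
    released_artifact: str       # name of the released file
    license_name: str            # picked from the OSI approved licenses
    family: bool = False         # True when the license is slightly modified
    homepage: Optional[str] = None


if __name__ == "__main__":
    record = ArtifactRecord(
        organization="Apache",
        artifact_name="Axis2 CodeGen",
        version="x.y",
        released_artifact="axis2-codegen-x.y.jar",
        license_name="Apache License 2.0",
    )
    print(asdict(record))
```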

Also, I think we'd want a way for people to easily add artifacts with their license to the license storage. There are *a lot* of smaller projects out there that don't even bother including a license in their artifact (even more popular ones like Saxon). So we could bootstrap the database with the licenses contained in a Maven repo but even then that wouldn't be much compared to what people usually use.

Some sort of basic querying would also be very useful. Artifacts tend to get renamed so having to guess the exact name isn't so nice. We'd need at least some sort of index on the side. Which gets kind of problematic if you go for Javascript, unless the index is small enough to be downloaded every time (which is still possible).
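
For the querying part, a very rough Python sketch of approximate name lookup using the standard library's `difflib`. The artifact names and the `suggest` helper are made up for the example, and a real index would likely need something smarter than plain string similarity.

```python
import difflib


def suggest(query, known_names, limit=3):
    """Return known artifact names that look similar to the query."""
    # Compare in lower case but return the names as they were stored.
    lowered = {name.lower(): name for name in known_names}
    matches = difflib.get_close_matches(
        query.lower(), list(lowered), n=limit, cutoff=0.6
    )
    return [lowered[match] for match in matches]


if __name__ == "__main__":
    names = ["axis2-codegen", "axis2-kernel", "saxon"]
    # A misspelled query still finds the intended artifact.
    print(suggest("axis2-codgen", names))
```

With a list of names this small the whole thing could be downloaded and searched on the client, which is the case where a Javascript front-end stays workable.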

Matthieu

> - robert

Robert Burrell Donkin

Aug 10, 2007, 2:17:38 PM

to discuss-a-rele...@googlegroups.com

On 8/10/07, Matthieu Riou <matthi...@gmail.com> wrote:
> On 8/10/07, robertburrelldonkin <robertbur...@gmail.com> wrote:

<snip>

> > i'm starting to get annoyed by the plain text reports so probably we
> > should decide upon an output format and start producing html
> >
> > the recursive RAT python script which i use to verify trees of
> > releases ATM produces a concatenated output. i'm starting to get annoyed
> > by this so i may find time to improve this soon.
>
> Yeah the textual output can be long and you always have to redirect to a
> file if you want to analyze it. And then including the full content of the
> files that have a problem also kind of dilutes the useful information. Maybe
> a line number pointing to the problem (or to the absence of something
> needed) would tighten that a bit.

yeh

the information is just there to save me time but should be replaced
by decent reports

getting the analytics working would make things easier since there
would be definite pass and fail

stefano recommended going for a data store approach rather than
streaming for the RDF. that will probably make the analytics easier.
these could well just output RDF as well but at the aggregate level
for the complete tree/artifact.

> As for the output format we could go for simple XML, an XSL sheet would give
> us HTML.

the pipeline is resource->rdf->xml->txt ATM

what's needed is to:

1. fix a format for the xml or just use the xml serialization for the RDF
2. write a more complete reporting tool so that it's not necessary to
include snippets of the text in the xml

> Another alternative (I'm not so big on XML these days) could be
> JSON. It could be read directly by a small Javascript and rendered easily,
> eliminating the need for an additional report generation tool.

JSON sounds interesting especially when mixed with RDF

- robert

Matthieu Riou

Aug 10, 2007, 2:27:52 PM

to discuss-a-rele...@googlegroups.com

On 8/10/07, Robert Burrell Donkin <robertbur...@gmail.com> wrote:

> <snip>
>
> > Another alternative (I'm not so big on XML these days) could be
> > JSON. It could be read directly by a small Javascript and rendered easily,
> > eliminating the need for an additional report generation tool.
>
> JSON sounds interesting especially when mixed with RDF

I tend to like JSON a lot these days. XML is good when you need to interchange things, but for writing sort-of-human-readable structured data, JSON is much sweeter.

Matthieu

> - robert

Robert Burrell Donkin

Aug 10, 2007, 2:56:53 PM

to discuss-a-rele...@googlegroups.com

On 8/10/07, Matthieu Riou <matthi...@gmail.com> wrote:

> On 8/10/07, robertburrelldonkin <robertbur...@gmail.com> wrote:
>
> > On Aug 8, 9:36 pm, "Robert Burrell Donkin"
> > <robertburrelldon...@gmail.com> wrote:
> > > On 8/7/07, Matthieu Riou < matthieu.r...@gmail.com> wrote:
> >
> > > > I'd be willing to get started on the artifact/license
> > > > index thing but if you think that another area should be explored first, I'm
> > > > pretty flexible.
> > >
> > > if that's your itch, scratch it :-)
> > >
> > > i plan to check an RDFised pom from the maven repository for
> > > dependencies on bouncy castle (and other open source libraries) to
> > > report on projects that may need to think about filing ECCNs.
> >
> > but don't let this put you off starting to look at the artifact/
> > license format etc
>
> Nope, it's actually good to know. ODE will probably need WSS4J at some
> point.

it's part of the analytics stuff. deduce from the presence of
bouncycastle that the release depends on crypto so add meta-data to
the aggregate level. if the notice contains the standard ECCN
statement then add meta-data. the top level analytics check the
meta-data. if the crypto is flagged but not the ECCN then deduce that
the release has issues.
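
a rough sketch of that rule in Python. the meta-data labels, the helper name and the `eccn_marker` used to recognise the statement in the notice are placeholders made up for this example. the real wording of the ECCN statement would need to be plugged in.

```python
def check_crypto_notice(dependencies, notice_text, eccn_marker="ECCN"):
    """Derive aggregate meta-data, then decide whether the release passes.

    Returns (metadata, passed). The release fails only when crypto is
    flagged but no ECCN statement was found in the notice.
    """
    metadata = set()
    # Step 1: the presence of bouncycastle implies a dependency on crypto.
    if any("bouncycastle" in dep.lower() for dep in dependencies):
        metadata.add("uses-crypto")
    # Step 2: look for the ECCN statement in the notice (placeholder match).
    if eccn_marker in notice_text:
        metadata.add("has-eccn-statement")
    # Step 3: the top level check only looks at the meta-data.
    passed = not ("uses-crypto" in metadata and "has-eccn-statement" not in metadata)
    return metadata, passed


if __name__ == "__main__":
    deps = ["bouncycastle:bcprov-jdk14", "junit:junit"]
    print(check_crypto_notice(deps, "This product includes software ..."))
```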

> > the only wrinkle is that you don't really want to know about licenses
> > but about the characteristics of that license. for example, apache
> > license 1.1 is a license family (each license is different but fits a
> > common template). this license family is OSI compatible but GPL
> > incompatible. the good news is that these analytics logically belong
> > to a separate component. so, it's just the license (or family) that
> > needs to be stored.
>
> Yep, my plan was more or less to list all OSI licenses adding a "family"
> checkbox when the license is slightly changed (my understanding is that it's
> usually for copyright). So the complete set of information would be
> something like:
>
> Organization: Apache, Codehaus, Sourceforge, Google Code, ...,
> Unknown/Self-Hosted

probably this should be the licensor. the copyright holder may be different.

> Artifact Name (would need to be more specific than the project for multi
> module projects, like "Axis2 CodeGen" for example)
> Version
> Maybe a link to the project homepage
> Released artifact name

MD5 sum would be useful
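
computing the sum is cheap. a small Python sketch using the standard `hashlib` module, reading in chunks so large artifacts do not have to fit in memory (the helper name and chunk size are arbitrary choices):

```python
import hashlib
import io


def md5_of_stream(stream, chunk_size=65536):
    """Return the hex MD5 digest of a binary stream, read chunk by chunk."""
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    # In practice the stream would come from open(path, "rb").
    print(md5_of_stream(io.BytesIO(b"example artifact bytes")))
```

the sum identifies the exact released file, so it would still match an artifact that has been renamed.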

> License: with a combo of all OSI approved licenses
> Family: for those having slight modifications

OSI approves license families (not licenses).

> Also, I think we'd want a way for people to easily add artifacts with their
> license to the license storage. There are *a lot* of smaller projects out
> there that don't even bother including a license in their artifact (even
> more popular ones like Saxon). So we could bootstrap the database with the
> licenses contained in a Maven repo but even then that wouldn't be much
> compared to what people usually use.

it'd be a start

> Some sort of basic querying would also be very useful. Artifacts tend to get
> renamed so having to guess the exact name isn't so nice. We'd need at least
> some sort of index on the side. Which gets kind of problematic if you go for
> Javascript, unless the index is small enough to be downloaded every time
> (which is still possible).