using or extending or forking+renaming github.com/google/licensecheck to provide similar functionality

fge...@gmail.com

Nov 13, 2019, 3:06:20 AM
to golang-nuts
Hi,

"licensecheck classifies license files and heuristically determines
how well they correspond to known open source licenses."

I'd like to identify license references in the file system. If I
understand correctly, package licensecheck in its current form is not
useful for this.
If it is still possible, could you please share a hint on how to do it?
(input: byte array, output: license references in the byte array)
If I understand correctly and I can't use licensecheck in its current
form, which is preferred:
extend the current API (maybe: func Refers(input []byte) (References,
bool)), or fork+rename the package? (References{...} would be similar to
Coverage{...}.)

thanks,
Gergely Födémesi

Rob Pike

Nov 13, 2019, 3:54:12 PM
to fge...@gmail.com, golang-nuts
Can you please explain in more detail what you're asking for? I don't understand the problem you have or why the current package cannot handle it.

-rob



fge...@gmail.com

Nov 14, 2019, 2:14:30 AM
to Rob Pike, golang-nuts
func Cover(input []byte, opts Options) (Coverage, bool) in
licensecheck currently reports, for each known license,
len(matching license text)/len(input). For all known licenses I'd need
len(license reference in input)/len(known license).

I'd like to scan >100000 files (possibly a lot more), where some of
them (<0.1%) contain full or partial known license texts.

An example scenario for an example /src, containing >100000 files:
$ listlicenses /src # to get an overview of 100% matching license references
LGPL-2.1
MIT
$ listlicenses -details /src # same tree, more detailed output
/src/license refers 100% MIT # the bytes in /src/license correspond one for one to the MIT license
/src/fonts/LICENSE refers 100% MIT # the bytes in /src/fonts/LICENSE correspond one for one to the MIT license
/src/a/Notice refers 100% LGPL-2.1 # same as above with LGPL-2.1
/src/a/b/whatever.go refers 94% GPL2 # most probably a broken license reference in whatever.go; maybe someone inadvertently deleted the last word from the lines containing the GPL2 license text. Needs human inspection to check the license situation of whatever.go
/src/c/ConfusingLicenseReferences.c refers 7% ZLIB # most probably a false positive report of a reference to ZLIB
/src/c/ConfusingLicenseReferences.c refers 65% MIT # only 65% of MIT is present; the author intended to refer to MIT, but some later inadvertent edit broke the license reference

Command listlicenses iterates over all files in the subtree, gathering
all full or partial (broken) license references. It uses functionality
similar to github.com/google/licensecheck to check the files in the
file system.



thanks!

Rob Pike

Nov 14, 2019, 3:49:20 AM
to fge...@gmail.com, golang-nuts
As I understand what you're trying to do, you just need to write a tree walker, perhaps using filepath.Walk, that opens each file and calls Cover on it. You can set the Options field to control the threshold for reporting, and use the result of that to choose which licenses to report.

I don't believe an API change is called for.

-rob
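
A minimal sketch of the tree walker described above, using only the standard library. It uses io/fs.WalkDir (Go 1.16 or later) over an in-memory tree rather than filepath.Walk so that it runs without touching the disk; os.DirFS("/src") could be passed instead to walk a real directory. The names finding, classifier, scanTree, toyClassifier and demoTree are hypothetical, and toyClassifier is only a placeholder for a real call such as licensecheck.Cover, not a license detector.

```go
package main

import (
	"bytes"
	"fmt"
	"io/fs"
	"testing/fstest"
)

// finding is a hypothetical stand-in for one reported license match.
type finding struct {
	Name    string
	Percent float64
}

// classifier stands in for a call such as licensecheck.Cover(data, opts).
// It is a placeholder so this sketch depends only on the standard library.
type classifier func(data []byte) []finding

// report pairs a file path with what the classifier found in it.
type report struct {
	Path     string
	Findings []finding
}

// scanTree walks every regular file under root in fsys, runs classify on its
// contents, and collects the files for which something was reported.
func scanTree(fsys fs.FS, root string, classify classifier) ([]report, error) {
	var out []report
	err := fs.WalkDir(fsys, root, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if !d.Type().IsRegular() {
			return nil // skip directories, symlinks and other non-regular entries
		}
		data, err := fs.ReadFile(fsys, path)
		if err != nil {
			return err
		}
		if found := classify(data); len(found) > 0 {
			out = append(out, report{Path: path, Findings: found})
		}
		return nil
	})
	return out, err
}

// toyClassifier is NOT a license detector: it only looks for a marker phrase.
func toyClassifier(data []byte) []finding {
	if bytes.Contains(data, []byte("Permission is hereby granted")) {
		return []finding{{Name: "MIT", Percent: 100}}
	}
	return nil
}

// demoTree returns a small in-memory tree used for the demonstration.
func demoTree() fs.FS {
	return fstest.MapFS{
		"src/license":   {Data: []byte("Permission is hereby granted, free of charge...")},
		"src/a/main.go": {Data: []byte("package main\n")},
	}
}

func main() {
	reports, err := scanTree(demoTree(), ".", toyClassifier)
	if err != nil {
		fmt.Println("walk failed:", err)
		return
	}
	for _, r := range reports {
		for _, f := range r.Findings {
			fmt.Printf("%s refers %.0f%% %s\n", r.Path, f.Percent, f.Name)
		}
	}
}
```

Passing the classifier as a function value keeps the walking logic separate from the matching logic, so the reporting threshold can be chosen inside whatever function wraps the real matcher.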

fge...@gmail.com

Nov 14, 2019, 4:25:31 AM
to Rob Pike, golang-nuts
Sorry if I was not clear: on walking the file system, that's clear, I
did not intend to talk about that, only about matching and reporting
on matching. The example I gave was just to put in context why I
believe I'd need a different api.

Using the Options field is good enough in the first example. (That's
how I used licensecheck first.)
For the second example, though, Cover() does not report what I need.

As far as I've seen, func Cover(INPUT []byte, opts Options) (Coverage,
bool) currently reports 100% MIT if INPUT matches the MIT license text
byte for byte. If INPUT has more text than the complete MIT license,
for example the MIT license is only at the beginning of INPUT and the
rest of INPUT is Go code, then Coverage will report
len(MIT license)/len(INPUT), which is less than 100%.

In this case, the new api would report 100%, since input contains 100%
MIT license text (and some programming code, which is not relevant
here).

If I understand correctly the current api is for checking _already_
identified license files, which contain _only_ the license text.
I believe that looking for files containing complete or possibly broken
license references needs a different kind of matching.
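
To make the distinction above concrete, here is a small worked example of the two ratios. The byte counts are invented for illustration, and measuring in bytes is a simplification; nothing here is taken from the licensecheck implementation. The function names inputCoverage and licenseCoverage are hypothetical.

```go
package main

import "fmt"

// percent returns 100*part/whole, or 0 when whole is zero.
func percent(part, whole int) float64 {
	if whole == 0 {
		return 0
	}
	return 100 * float64(part) / float64(whole)
}

// inputCoverage is the share of the whole input occupied by license text.
func inputCoverage(matchedBytes, inputBytes int) float64 {
	return percent(matchedBytes, inputBytes)
}

// licenseCoverage is the share of the reference license text found in the input.
func licenseCoverage(matchedBytes, licenseBytes int) float64 {
	return percent(matchedBytes, licenseBytes)
}

func main() {
	const (
		licenseBytes = 1000 // invented length of the reference license text
		codeBytes    = 3000 // invented length of the Go code that follows it
	)
	inputBytes := licenseBytes + codeBytes
	matched := licenseBytes // the complete license text is present in the input

	fmt.Printf("share of input that is license text: %.0f%%\n", inputCoverage(matched, inputBytes))
	fmt.Printf("share of license text present:       %.0f%%\n", licenseCoverage(matched, licenseBytes))
}
```

With these numbers the first ratio is 25% and the second is 100%, which is the gap between checking a dedicated license file and looking for a license reference embedded in a source file.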

Dan Kortschak

Nov 14, 2019, 4:42:43 AM
to fge...@gmail.com, Rob Pike, golang-nuts
The licensecheck.Match type holds the start and end offsets in the
file. Can't you use that to extract the license portion and either
check its length against the length of the license or repeat the check
with only that portion of the file?
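
A rough sketch of that suggestion, assuming a match carries byte offsets into the input as described above. The span type and the extract helper are hypothetical stand-ins, and the offsets are computed by hand here; in real use they would come from the matcher.

```go
package main

import (
	"fmt"
	"strings"
)

// span is a hypothetical stand-in for the match fields used here.
type span struct {
	Name       string
	Start, End int // byte offsets into the input
}

// extract returns the part of input covered by s,
// or nil if the offsets do not fit the input.
func extract(input []byte, s span) []byte {
	if s.Start < 0 || s.End > len(input) || s.Start > s.End {
		return nil
	}
	return input[s.Start:s.End]
}

func main() {
	// A shortened phrase standing in for a full license text.
	license := "Permission is hereby granted, free of charge, to any person"
	input := []byte("// " + license + "\npackage main\n")

	// Offsets computed by hand for the demonstration.
	start := strings.Index(string(input), license)
	s := span{Name: "MIT", Start: start, End: start + len(license)}

	portion := extract(input, s)
	fmt.Printf("extracted: %q\n", portion)

	// The portion can now be compared with the reference text,
	// or handed to the matcher again on its own.
	fmt.Printf("%d of %d reference bytes present\n", len(portion), len(license))
}
```

Running the matcher again on the extracted portion alone means the surrounding code no longer dilutes the reported percentage.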

fge...@gmail.com

Nov 14, 2019, 4:57:32 AM
to Dan Kortschak, Rob Pike, golang-nuts
Thanks, I did not realize that Coverage -> Match[n] could be that useful!
Though the field Match.Name is not a file name I can os.Open().
How can I directly access the known license texts?

fge...@gmail.com

Nov 14, 2019, 5:00:06 AM
to Dan Kortschak, Rob Pike, golang-nuts
Sorry I did not read your response fully. Repeating the matching is just fine.
Thanks again!