Exact matching of tagsets?


D Gowers

Jul 6, 2015, 10:49:16 PM
to tm...@googlegroups.com
Hi tmsuers,

I recently implemented a feature in my 'tagfilt'[2] program that I thought we might want in TMSU, so I want to bring it up here:

Exact tagset matches.

To give a little necessary background: tagfilt accepts a list of files on stdin, checks their tags against criteria passed on the command line, and outputs those files that qualify.

So for example if I invoke it with 'tagfilt -e %uploaded% -e %wont_upload%' [1], it will return only those files that have exactly one tag and that tag is either uploaded or wont_upload.

Whereas if I invoke it with 'tagfilt -e "%wont_upload duplicate_scan%"', it will return only those files that have exactly two tags and those tags are uploaded duplicate_scan.
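
To make the intended semantics concrete, here is a minimal Python sketch of exact tagset matching. It is not the tagfilt source; the reading of the '%...%' syntax (a space-separated tagset between percent signs) is inferred from the two examples above, and the file names and the helper names `parse_tagset` and `matches_exactly` are hypothetical.

```python
def parse_tagset(spec):
    """Turn a spec such as '%wont_upload duplicate_scan%' into a frozenset of tag names.

    Assumption: a spec is a space-separated tag list wrapped in '%' characters.
    """
    return frozenset(spec.strip("%").split())


def matches_exactly(file_tags, specs):
    """Return True if the file's tags equal one of the given tagsets, with no extra tags."""
    return frozenset(file_tags) in {parse_tagset(s) for s in specs}


# Hypothetical files and their tags, for illustration only.
files = {
    "a.jpg": {"uploaded"},
    "b.jpg": {"wont_upload", "duplicate_scan"},
    "c.jpg": {"uploaded", "tree"},
}

# Equivalent in spirit to: tagfilt -e %uploaded% -e %wont_upload%
specs = ["%uploaded%", "%wont_upload%"]
exact = sorted(path for path, tags in files.items() if matches_exactly(tags, specs))
print(exact)  # ['a.jpg']
```

Comparing whole sets, rather than checking each tag and then counting, means the order in which tags are listed does not matter.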


Do we want this function, to match exact tagsets, in the TMSU querying system itself? And if so, what kind of syntax would be sensible?


----

[1] this is exactly the use case I wrote the feature for, BTW.
[2] Current version of tagfilt: https://bpaste.net/show/d3a2ff9ef50f . Requires Python3, xargs, and Plumbum.

D Gowers

Jul 6, 2015, 10:50:37 PM
to tm...@googlegroups.com
Oops -- in the second case I meant 'and those tags are wont_upload duplicate_scan'.

Paul Ruane

Jul 7, 2015, 4:51:03 PM
to tm...@googlegroups.com
Sorry to have not got back to you sooner. I've been waiting to get
some time to think about your suggestion properly. Hopefully I'll have
time to do so tomorrow, otherwise I'll get back to you later in the week.

D Gowers

Jul 9, 2015, 9:09:19 PM
to tm...@googlegroups.com
Okay, thanks for the update.
I noticed myself that the tagfilt code was a little crufty with old unused code and had one obvious bug, so here is an update ( https://bpaste.net/show/b18dd8023b5e ) that removes the bug and the cruft, FWIW.

Paul Ruane

Jul 10, 2015, 3:50:02 PM
to tm...@googlegroups.com
Hi,

Have had a think about this. As you know there's currently no way to
do the count guard with TMSU. One thought I've had for a while is
automatic tags based upon file meta-data so, for example, all files
might gain an automatic 'size' tag with the file size as its value,
images may gain 'width' and 'height' tags, &c. Along those lines it
would be possible to have a dynamic 'tagCount' tag which describes how
many tags a file has that could then be used in queries.

I do worry this sort of stuff would put a large processing overhead on
queries, especially on large databases. For example, a query of
'tagCount == 3' would require either calculating, on the fly, the tag
count for every file or would have to be converted to the equivalent
SQL.
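
As a rough illustration of what "the equivalent SQL" could look like, here is a sketch using Python's sqlite3 module. The single `file_tag(file, tag)` table is a deliberately simplified, hypothetical schema and not TMSU's actual database layout; the function names and sample rows are likewise made up.

```python
import sqlite3

# Hypothetical, simplified schema: one row per (file, tag) pair.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE file_tag (file TEXT, tag TEXT);
    INSERT INTO file_tag VALUES
        ('a.jpg', 'uploaded'),
        ('b.jpg', 'wont_upload'), ('b.jpg', 'duplicate_scan'),
        ('c.jpg', 'uploaded'), ('c.jpg', 'tree'), ('c.jpg', 'photo');
""")


def files_with_tag_count(conn, n):
    """Files carrying exactly n tags (the 'tagCount == n' idea)."""
    rows = conn.execute(
        "SELECT file FROM file_tag GROUP BY file HAVING COUNT(*) = ? ORDER BY file",
        (n,),
    )
    return [row[0] for row in rows]


def files_with_exact_tags(conn, tags):
    """Files whose tagset equals `tags`: same count, and every tag is in the list."""
    tags = sorted(set(tags))
    placeholders = ",".join("?" * len(tags))
    sql = f"""
        SELECT file FROM file_tag
        GROUP BY file
        HAVING COUNT(*) = ?
           AND SUM(tag IN ({placeholders})) = ?
        ORDER BY file
    """
    rows = conn.execute(sql, (len(tags), *tags, len(tags)))
    return [row[0] for row in rows]


print(files_with_tag_count(conn, 3))                               # ['c.jpg']
print(files_with_exact_tags(conn, ["wont_upload", "duplicate_scan"]))  # ['b.jpg']
```

Both queries aggregate over every file's tag rows, which is the kind of per-file work the overhead concern above refers to; whether that cost matters in practice would depend on the database size and indexing.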

Regarding exact tag set matching, that sounds possible with the
addition of an '--exact' option to the 'files' subcommand, which
would instruct it only to match files that have the exact tags
specified. Is this a common scenario though? In what sort of
situations would one care that a file doesn't have, additionally,
other tags?

Finally limiting the queries to a set of specific files. This can be
done already using the --path option but it would have to be done on a
single path at a time (fine if your files are all under a directory
but not possible in a single invocation otherwise). I guess it would
be straightforward to add an option to match against a set of paths.
Again, in what situation is this useful?

Thanks,
Paul

D Gowers

Jul 10, 2015, 8:37:50 PM
to tm...@googlegroups.com
Thanks for your time.

> What situations are there in which this is useful?
Well, my use cases were as I have outlined so far. The main unique application was to find files in a specific category that were inadequately tagged (for archival purposes: files that have only decision information ('wont_upload' etc.) and no content information ('tree' etc.) will probably never be found again). I wasn't sure what other use cases people might have, or whether it was a worthwhile feature for TMSU, which is one reason why I posted my message.

There is probably a recipe involving egrep and xargs that could also be used to get exact match sets: `tmsu files -0 "$TAGS" | xargs -0 tmsu tags | egrep -e ": $TAGS\$" | sed -E 's/: .+$//g'`? (Not tested, but I've done things like this before and it is reasonably fast. Probably an even faster recipe is possible with awk, but I don't know awk well enough.)
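
The same second-pass filter can be sketched in Python. This assumes, as the recipe above does, that each line of `tmsu tags` output has the form `path: tag1 tag2 ...`; the function name `exact_matches` and the sample text are hypothetical, and tags containing escaped spaces or `=value` parts are not handled.

```python
def exact_matches(tags_output, wanted):
    """Return the paths whose listed tags equal `wanted` exactly.

    Assumption: every line looks like 'path: tag1 tag2 ...'.
    Lines without a ': ' separator (for example untagged files) are skipped.
    """
    wanted = set(wanted)
    matches = []
    for line in tags_output.splitlines():
        # Split on the last ': ' so that a path containing ': ' stays intact.
        path, sep, tag_text = line.rpartition(": ")
        if not sep:
            continue
        if set(tag_text.split()) == wanted:
            matches.append(path)
    return matches


# Hypothetical captured output of `tmsu tags`, for illustration only.
sample = (
    "a.jpg: uploaded\n"
    "b.jpg: duplicate_scan wont_upload\n"
    "c.jpg: uploaded tree\n"
)
print(exact_matches(sample, ["wont_upload", "duplicate_scan"]))  # ['b.jpg']
```

Unlike the egrep pattern, which matches the tag list as literal text, the set comparison does not depend on the order in which the tags are printed.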

A few things seem to me to fit into a 'second pass' model, and items such as 'tagCount' fit it well IMO (so do other meta-tags like 'height' and 'width', depending on how expensive it is to extract the particular information versus the memory requirement). From my moderate experience with SQL, I consider this type of thing to be better handled outside of SQL querying.

> Finally limiting the queries to a set of specific files. This can be
> done already using the --path option but it would have to be done on a
> single path at a time (fine if your files are all under a directory
> but not possible in a single invocation otherwise). I guess it would
> be straightforward to add an option to match against a set of paths.
> Again, in what situation is this useful?

I don't know. I'm not sure how you interpreted my message -- did you go and read the source of tagfilt for the 'multiple paths' idea? Tagfilt doesn't do that per se (although the 'glob pattern' is moderately more flexible than TMSU's --path option).
I guess you could probably do decent multi-path filtering by using `fgrep` on the output of `tmsu tags`, or even set mathematics (union, intersection) etc using `comm`.
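
For what it's worth, the set mathematics is also easy to sketch in Python once the path lists are in hand. The two path lists below are hypothetical stand-ins for, say, the output of a TMSU query and a list of paths of interest; `read_paths` is a made-up helper.

```python
def read_paths(text):
    """Turn newline-separated paths into a set, ignoring blank lines."""
    return {line for line in text.splitlines() if line}


# Hypothetical inputs, for illustration only.
tagged = read_paths("a.jpg\nb.jpg\nc.jpg\n")
wanted = read_paths("b.jpg\nc.jpg\nd.jpg\n")

both = sorted(tagged & wanted)         # intersection, like `comm -12` on sorted input
only_tagged = sorted(tagged - wanted)  # difference, like `comm -23` on sorted input
either = sorted(tagged | wanted)       # union

print(both)         # ['b.jpg', 'c.jpg']
print(only_tagged)  # ['a.jpg']
print(either)       # ['a.jpg', 'b.jpg', 'c.jpg', 'd.jpg']
```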

So to be clear on that one, I don't have a use case for multiple-path filtering, and I am not sure whether or how you arrived at the conclusion that I was suggesting we should be able to do that particular thing. Possibly I worded something poorly?


Anyway I'll try to sum up, for the sake of improved communication:

* I was proposing the ability to do matches on exact tagsets (which is of course functionally the same as matching those tags and then checking the count of the tags on the returned files). I was myself uncertain of how much utility this would have to TMSUers in general, though I knew it had value to me in getting my archives in order; therefore I posted to the mailing list rather than filing an issue.
* You replied that this could be possible via metatags, but could well be slow, and what was the application?
* You commented on matching precise filesets/pathsets. My impression was that you believed I had proposed this in some way, but I am not sure why.


I hope that the above summary is correct.


Additionally, I'll add that for me, tagfilt currently performs this task adequately and reasonably efficiently, so it's certainly not something that I would personally benefit much from -- a little convenience, at most. If I were in your position as the author of TMSU then I certainly would not choose to add such a feature without information suggesting that other users also had a use for it.