On horses in MIR


bobl...@gmail.com

Oct 30, 2014, 2:14:14 AM
to ismir2014-unco...@ismir.net
During my Tuesday presentation, I suddenly began talking about "horses", and cited a recently published paper:

B. L. Sturm, “A simple method to determine if a music information retrieval system is a ‘horse’,” IEEE Trans. Multimedia 16(6): 1636-1644, 2014.

Admittedly, this was confusing for many people. :)

- "What is a 'horse'?"
From my article: "[A horse is] a system appearing capable of a remarkable human feat, e.g., music genre recognition [from an audio signal], but actually working by using irrelevant characteristics (confounds)."

- "Did he just call the result of all my research a 'horse'?"
From my article: "[Calling] an MIR system a 'horse' is not meant to be an aspersion. As an intentional nod to Clever Hans, a 'horse' is just a system that is not actually addressing the problem it appears to be [or is claimed to be] solving. The judgment of whether a 'horse' is useful or not completely revolves around a use case of a system [1], [2]: can the requirements demanded of a use case be satisfied by a system that relies on characteristics confounded with the 'ground truth'?"

- "Why does this matter?"
From my article: "By explaining why an MIR system produces the [figure of merit] it does [from an evaluation], our method provides a sanity test of an MIR system, suggests ways to improve it, and thus ultimately provides a way to complete the 'IR research and development cycle' for which MIR has been described as falling short [3]–[5]."

- "Let's break his face!"
From my article: "!?!?!?"

So, I propose a session to have a conversation about these questions and many more, providing my face for breaking --- metaphorically speaking. For the first 10 minutes I will talk about "horses", real and virtual, with several fun examples. Then I will try to facilitate a discussion/brawl on the topic.
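As a warm-up for that discussion, the "horse" phenomenon is easy to reproduce in a toy experiment. This sketch is not from the article; the genres, the "loudness" confound, and all numbers are invented purely for illustration. A classifier that keys only on a recording-level confound scores perfectly on a test set that shares the confound, and collapses the moment the confound is broken:

```python
import random

random.seed(0)

# Invented toy data: each "track" is (loudness, genre). In this contrived
# collection, mastering loudness is confounded with genre: every "metal"
# track happens to be loud, every "folk" track quiet.
def confounded_tracks(n):
    tracks = []
    for _ in range(n):
        genre = random.choice(["metal", "folk"])
        loud = random.uniform(0.7, 1.0) if genre == "metal" else random.uniform(0.0, 0.3)
        tracks.append((loud, genre))
    return tracks

# The "horse": it never looks at anything musically relevant,
# only at the confounded loudness.
def horse_classify(loudness):
    return "metal" if loudness > 0.5 else "folk"

def accuracy(tracks):
    return sum(horse_classify(loud) == genre for loud, genre in tracks) / len(tracks)

# A test set drawn the same way shares the confound: a perfect FoM.
print(accuracy(confounded_tracks(1000)))   # 1.0

# Break the confound (quiet metal, loud folk) and the horse collapses.
def deconfounded_tracks(n):
    tracks = []
    for _ in range(n):
        genre = random.choice(["metal", "folk"])
        loud = random.uniform(0.0, 0.3) if genre == "metal" else random.uniform(0.7, 1.0)
        tracks.append((loud, genre))
    return tracks

print(accuracy(deconfounded_tracks(1000)))  # 0.0
```

The point of the toy: nothing in the evaluation itself distinguishes the horse from a genuine genre recognizer; only an intervention that breaks the confound does.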

---
References:

[1] C. C. S. Liem, M. Müller, D. Eck, G. Tzanetakis, and A. Hanjalic, “The need for music information retrieval with user-centered and multimodal strategies,” in Proc. Int. ACM Workshop on Music Information Retrieval with User-Centered and Multimodal Strategies (MIRUM), pp. 1-6, 2011.

[2] M. Schedl, A. Flexer, and J. Urbano, “The neglected user in music information retrieval research,” J. Intell. Info. Systems, vol. 41, no. 3, pp. 523-539, Dec. 2013.

[3] J. Urbano, M. Schedl, and X. Serra, “Evaluation in music information retrieval,” J. Intell. Info. Systems, vol. 41, no. 3, pp. 345-369, Dec. 2013.

[4] J. Urbano, “Evaluation in audio music similarity,” Ph.D. dissertation, University Carlos III of Madrid, 2013.

[5] X. Serra, M. Magas, E. Benetos, M. Chudy, S. Dixon, A. Flexer, E. Gómez, F. Gouyon, P. Herrera, S. Jordà, O. Paytuvi, G. Peeters, J. Schlüter, H. Vinet, and G. Widmer, “Roadmap for Music Information ReSearch,” G. Peeters, Ed. Creative Commons, 2013.

Julián Urbano

Oct 30, 2014, 10:20:55 PM
to ismir2014-unco...@ismir.net
Sadly I didn't make it to ISMIR this year either, but I'd like to second this topic from here.
It appears that there has been quite a bit of interest in evaluation this year (at least in the original submissions), so please gather around and keep the topic alive.

PS: it's better to talk about horses over beer than over coffee.

Cheers

Thibault Langlois

Oct 31, 2014, 11:55:52 AM
to ismir2014-unco...@ismir.net
Hi all. I did not attend ISMIR either, but I think this topic deserves attention beyond the context of the conference.
In another field, a paper (Ioannidis, "Why Most Published Research Findings Are False") is making quite a lot of buzz: http://www.plosmedicine.org/article/info%3Adoi%2F10.1371%2Fjournal.pmed.0020124

What Bob calls a "horse" is what this paper calls a "false positive". Since I was not there, I do not understand what makes Bob think he may be punched (metaphorically speaking); it is a bit like blaming the messenger for bad news.

Calling someone's work a horse may not be the most diplomatic way of saying such things, but sometimes a slightly provocative posture helps attract attention.

In my opinion the problem is mostly with the datasets that are used. If the goal is to build a music genre classifier and you use the ISMIR2004 dataset for training, it will be very hard not to make a horse. If ISMIR2004 were _so_ large and _so_ hard to classify that the best system reached only a 35% F-score, maybe we would not be talking about horses today.

But *if* you use some extremely large dataset (maybe your own music collection) and produce a system that, when evaluated on ISMIR2004 (used only as a test set), gives a good FoM, then, since the training and testing datasets are completely distinct, it is less likely that the system is a horse.
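That cross-collection protocol can be sketched as follows. Everything here is a hypothetical stand-in (a scalar feature per track, a nearest-centroid classifier, invented numbers); the one property the sketch is meant to show is that the test collection is never touched during training, so a confound specific to the training collection cannot inflate the test FoM:

```python
from collections import defaultdict

def train_nearest_centroid(train_set):
    """Fit one feature centroid per genre, using the training collection only."""
    sums, counts = defaultdict(float), defaultdict(int)
    for feature, genre in train_set:
        sums[genre] += feature
        counts[genre] += 1
    return {g: sums[g] / counts[g] for g in sums}

def classify(centroids, feature):
    # Assign the genre whose centroid is nearest to the track's feature.
    return min(centroids, key=lambda g: abs(centroids[g] - feature))

def evaluate(centroids, test_set):
    """FoM (accuracy) on a collection completely disjoint from training."""
    return sum(classify(centroids, f) == g for f, g in test_set) / len(test_set)

# Invented stand-in data: one scalar feature per track.
my_collection = [(0.9, "metal"), (0.8, "metal"), (0.1, "folk"), (0.2, "folk")]
held_out_test = [(0.7, "metal"), (0.3, "folk")]   # used ONLY as a test set

centroids = train_nearest_centroid(my_collection)
print(evaluate(centroids, held_out_test))  # 1.0
```

Disjoint collections are no guarantee, of course: a confound shared by both collections (codec, mastering era, label) can still carry a horse across the split.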