Through a collaboration between MusicBrainz (http://musicbrainz.org) and the Music Technology Group of the Universitat Pompeu Fabra (http://mtg.upf.edu) we are starting the AcousticBrainz.org project to crowdsource data automatically extracted from audio recordings. This information will allow us to create open datasets of audio descriptors + metadata that will be very useful to the MIR community.
We have put together an open source Essentia extractor (https://github.com/MTG/essentia/tree/master/src/examples/extractor_music) that people can download and run on their music collections. The audio recordings must have a MusicBrainz ID and the resulting analysis data is then automatically uploaded to AcousticBrainz.org. In the two weeks since we started, we have already collected close to 500,000 description files.
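To make the pipeline concrete, here is a minimal sketch of reading one of these description files. The key layout ("metadata" -> "tags" -> "musicbrainz_trackid", with the ID stored as a list) is an assumption based on typical Essentia music-extractor JSON output, and the embedded file excerpt and MBID are hypothetical examples, not real data:

```python
import json

# Hypothetical excerpt of an extractor description file. The exact key
# layout is an assumption based on typical Essentia music-extractor
# output; the MBID below is a made-up example.
sample = """
{
  "metadata": {
    "tags": {
      "musicbrainz_trackid": ["8f3471b5-7e6e-48da-b79c-a29ecd1d8e10"]
    }
  },
  "lowlevel": {
    "average_loudness": 0.92
  }
}
"""

data = json.loads(sample)
# The tag value is a list; take its first element as the recording MBID.
mbid = data["metadata"]["tags"]["musicbrainz_trackid"][0]
print(mbid)
```

Because every uploaded file carries an MBID like this, the analysis data stays linked to the recording it describes.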
The main information obtained from each recording is a set of low-level audio descriptors. We will soon start running a second extractor that computes high-level descriptors from the low-level data. From the MBIDs (https://musicbrainz.org/doc/MusicBrainz_Identifier) included in each description file you can access all the metadata available in MusicBrainz and in any other resources that use MBIDs.
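As an illustration of that last point, a description file's MBID can be turned into a MusicBrainz web-service lookup. This is only a sketch: the /ws/2/ recording-lookup endpoint and the inc/fmt parameters come from the public MusicBrainz API, while the MBID used here is a hypothetical example:

```python
# Hedged sketch: build a MusicBrainz web-service (ws/2) lookup URL for a
# recording, given an MBID taken from a description file. The "inc"
# parameter controls which linked entities are returned; "fmt=json"
# requests a JSON response instead of the default XML.
mbid = "8f3471b5-7e6e-48da-b79c-a29ecd1d8e10"  # hypothetical example MBID
url = (
    "https://musicbrainz.org/ws/2/recording/"
    f"{mbid}?inc=artists+releases&fmt=json"
)
print(url)
```

Fetching that URL (with an appropriate User-Agent, per the MusicBrainz rate-limiting guidelines) returns the recording's metadata, which can then be joined with the AcousticBrainz descriptors.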
We are still in a testing phase, and the information and explanations on http://acousticbrainz.org are not yet complete; they will be updated soon. Before making the final decision on how to develop this project further, I would like to have an open discussion on what the most relevant things to do could be.
Possible topics for discussion:
- How could we develop useful datasets from the gathered data?
- How should we distribute the data? Currently we are planning on zip file dumps and an API.
- What type of research problems can be tackled with this data?
- ....
I am looking forward to your feedback.
...xavier