Dataset is all you need


Nathan Libermann

Mar 14, 2018, 2:31:25 PM
to Magenta Discuss
The quality and quantity of data are known to be really important for training deep learning models, and for research coherence and reproducibility it's also important to have shared datasets.

When I read a state-of-the-art paper like 'Hierarchical Variational Autoencoders for Music', Adam Roberts et al. claim to use 1.5 million unique MIDI files but don't explain how to get the data. A big part of recent papers on music generation with deep neural networks just say they scraped the web for publicly available MIDI files.

How can we say one model is better than another if the training data are not comparable?

Maybe I missed something... Do you know how to get a large amount of 'good' music/melody MIDI files? Maybe those from the paper I cited above?


Thx,
Nathan



 


Jesse Engel

Mar 14, 2018, 2:48:31 PM
to Nathan Libermann, Magenta Discuss
For the paper (and forthcoming ones), there are lots of comparisons provided to other architectures trained on the same data, so you can get a sense of the value of the architectural contributions.

--
Magenta project: magenta.tensorflow.org
To post to this group, send email to magenta...@tensorflow.org
To unsubscribe from this group, send email to magenta-discuss+unsubscribe@tensorflow.org
---
You received this message because you are subscribed to the Google Groups "Magenta Discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to magenta-discuss+unsubscribe@tensorflow.org.

Bruno Afonso

Mar 14, 2018, 5:13:52 PM
to Jesse Engel, Magenta Discuss, Nathan Libermann
It is up to the journals to enforce dataset availability. If the authors do not make the data available, beware of the results or claims...


Adam Roberts

Mar 14, 2018, 7:44:55 PM
to Bruno Afonso, Jesse Engel, Magenta Discuss, Nathan Libermann
All of the code is available on our GitHub at https://goog.gl/magenta/musicvae-code. Please train it on any dataset you prefer!


GANESH S

Mar 16, 2018, 3:45:30 AM
to magenta...@tensorflow.org, Bruno Afonso, Jesse Engel, Nathan Libermann, ada...@google.com
>>> Do you know how to get a large amount of 'good' music/melody MIDI files?

Have you looked at the Lakh MIDI Dataset created by Colin Raffel as part of his PhD thesis?
The Lakh MIDI dataset is a collection of 176,581 unique MIDI files, 45,129 of which have been matched and aligned to entries in the Million Song Dataset. Its goal is to facilitate large-scale music information retrieval, both symbolic (using the MIDI files alone) and audio content-based (using information extracted from the MIDI files as annotations for the matched audio files).
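When sifting through a scrape that size, it can be worth rejecting malformed files cheaply before doing a full parse. Here is a minimal header check using only the Python standard library (a sketch; the function name is my own, not part of any dataset tooling):

```python
import struct

def read_midi_header(data: bytes):
    """Parse the 14-byte header chunk of a Standard MIDI File.

    Returns (format, num_tracks, division). Cheap enough to run over
    hundreds of thousands of files before committing to a full parse.
    """
    # Every Standard MIDI File starts with the chunk type "MThd".
    if data[:4] != b"MThd":
        raise ValueError("not a Standard MIDI File")
    # Big-endian: 32-bit chunk length, then three 16-bit fields.
    length, fmt, ntrks, division = struct.unpack(">IHHH", data[4:14])
    if length != 6:
        raise ValueError("unexpected MThd length: %d" % length)
    return fmt, ntrks, division
```

Format 0 files hold a single track, while format 1 files, which dominate web scrapes, carry one track per instrument.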
 

Best wishes,
Ganesh Srinivas

GANESH S

Mar 16, 2018, 3:54:37 AM
to magenta...@tensorflow.org, Colin Raffel, Bruno Afonso, Jesse Engel, Nathan Libermann, ada...@google.com
I just realized that Colin Raffel IS one of the authors of the paper! :-)


Adam Roberts

Mar 17, 2018, 10:36:49 AM
to Nathan Libermann, Magenta Discuss, Colin Raffel, baf...@gmail.com, jesse...@google.com
Nathan, you don't need to prefilter the data. The preprocessing will extract monophonic tracks from multi-track files, too.

There isn't a script to dump the extracted examples, but we should probably add one.
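The overlap test behind that kind of monophonic extraction can be sketched in a few lines (my own simplification, not Magenta's actual pipeline; the note representation is hypothetical):

```python
def is_monophonic(notes, tolerance=0.0):
    """Return True if no two notes in the track sound at the same time.

    `notes` is a list of (start, end, pitch) tuples in seconds.
    `tolerance` forgives small legato overlaps, which are common in
    human-performed MIDI.
    """
    ordered = sorted(notes, key=lambda n: n[0])
    # Compare each note's end against the next note's start.
    for (_, end, _), (start, _, _) in zip(ordered, ordered[1:]):
        if start < end - tolerance:
            return False
    return True
```

A multi-track file would then contribute one candidate melody per instrument track that passes this test, rather than being discarded outright.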

On Sat, Mar 17, 2018, 4:00 AM Nathan Libermann <n.lib...@gmail.com> wrote:
Yes, I tried to use the Lakh MIDI Dataset (the clean_midi subset), but since I want to work on monophonic melody, after removing the noisy files (and given that the automatic melody detection may miss the melody track) I only managed to get 1,000 or 2,000 MIDI files with good monophonic melodies. I don't know how to increase this volume of good data.

I didn't realize Colin Raffel is in the acknowledgments of the paper!




Nathan Libermann

Mar 20, 2018, 9:13:08 AM
to Magenta Discuss, n.lib...@gmail.com, cra...@gmail.com, baf...@gmail.com, jesse...@google.com
Thank you Adam, I've taken a deeper look at the MIDI extraction proposed by Magenta; it may be good for selecting data, even though, since I work on structure, I would like to get well-structured melodies.