Creating my own dataset: How large should my SequenceExamples tfrecord be?

Sean Farrell

Feb 22, 2018, 9:06:23 PM
to Magenta Discuss
Hey all!

I am trying to make my own TF model from my own MIDI data.  I can successfully create the initial dataset tfrecord (https://github.com/tensorflow/magenta/blob/master/magenta/scripts/README.md), but when I go to the next step to create SequenceExamples, the tfrecord balloons in size to well over 100GB.

I figured that this is not normal.  Could someone confirm that this is off, and perhaps offer some suggestions on what might be causing the issue?

I also repeated these steps with the recommended data set (Lakh Midi Dataset) with the same results.

Any help is appreciated!

Thanks and y'all rock,

Sean

Curtis "Fjord" Hawthorne

Mar 12, 2018, 8:52:52 PM
to Sean Farrell, Magenta Discuss
Hi Sean,

Unfortunately, this is normal (or at least expected) for large datasets. Our initial data conversion pipeline does the one-hot conversion before saving the record instead of doing that in the graph, so every step is written to disk as a full one-hot vector rather than a single index. Some of our later models do the correct thing and wait to expand to the one-hot encoding until the data is actually being used. If you'd like to send a PR our way for updating the models that do the one-hot conversion in the wrong place, we'd be happy to look at it.
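To get a rough feel for why the file grows so much, here is a minimal sketch in plain Python. It is not Magenta's actual pipeline: the vocabulary size (NUM_CLASSES = 128), the float32 and int64 storage widths, and all function names are assumptions made for illustration, and a real tfrecord adds its own protobuf overhead on top.

```python
import struct

def one_hot(index, num_classes):
    """Expand one integer class index into a one-hot list of floats."""
    vec = [0.0] * num_classes
    vec[index] = 1.0
    return vec

def packed_size_one_hot(indices, num_classes):
    """Bytes needed if every step is stored as a float32 one-hot vector."""
    payload = b"".join(
        struct.pack(f"<{num_classes}f", *one_hot(i, num_classes))
        for i in indices
    )
    return len(payload)

def packed_size_indices(indices):
    """Bytes needed if every step is stored as a single int64 index."""
    return len(struct.pack(f"<{len(indices)}q", *indices))

def expand_on_read(indices, num_classes):
    """Yield one-hot vectors lazily, so only the indices need to be stored."""
    for i in indices:
        yield one_hot(i, num_classes)

NUM_CLASSES = 128  # hypothetical vocabulary size, not taken from any Magenta config
indices = [i % NUM_CLASSES for i in range(1000)]  # a made-up 1000-step sequence

dense_bytes = packed_size_one_hot(indices, NUM_CLASSES)
index_bytes = packed_size_indices(indices)
ratio = dense_bytes / index_bytes

print(f"one-hot: {dense_bytes} bytes, indices: {index_bytes} bytes, ratio: {ratio:.0f}x")
```

Under these assumptions the blow-up factor is simply the number of classes times the float width divided by the index width, so a larger event vocabulary makes the stored file proportionally bigger. Expanding at read time, as expand_on_read does here, keeps only the indices on disk.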

-Fjord

--
Magenta project: magenta.tensorflow.org
To post to this group, send email to magenta...@tensorflow.org
To unsubscribe from this group, send email to magenta-discu...@tensorflow.org
---
You received this message because you are subscribed to the Google Groups "Magenta Discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to magenta-discu...@tensorflow.org.

Sean Farrell

Mar 13, 2018, 2:17:22 PM
to Curtis Fjord Hawthorne, Magenta Discuss
Hey, thanks for getting back to me, Curtis.

I did mount a large external disk to hold that big ol' tfrecord (almost 700 GB!), and the rest of the training went really smoothly!

I am very pleased with the results so far.  I am "sampling" the outputs I like to make new tracks, and it's pretty awesome to have that performance feel straight from the model.

Great work again,

Sean

Evan Templeton

May 16, 2021, 10:38:33 AM
to Magenta Discuss, Sean Farrell, Curtis Hawthorne
Hey Sean, could you tell us a little bit more about your model? Is it all piano, or does it contain any new instruments/voices/labels?

Thanks!
Evan
