I next want to add a cross-attention component so the model can attend to a note sequence. The ultimate goal is a model that can take arbitrary MIDI input and output coherent sheet music, without needing a tempo map or a quantized performance. My hunch is that this is tractable for single-staff or piano music. But I'm a hacker-type, not an ML expert, so it may just be naive optimism.
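For anyone curious what I mean by "cross-attention component," here's a rough sketch of the idea in PyTorch. This is just an illustration, not the actual model: all the names, dimensions, and the `CrossAttnBlock`/`note_seq_memory` pieces are placeholders I made up. The gist is that the decoder's hidden states (producing the sheet-music tokens) attend over an encoded representation of the raw note sequence:

```python
# Minimal sketch of cross-attention over a note sequence.
# Assumes the note events have already been embedded/encoded
# into `note_seq_memory` by some encoder (not shown).
import torch
import torch.nn as nn

class CrossAttnBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(
            d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, decoder_states, note_seq_memory, note_seq_mask=None):
        # Queries come from the sheet-music decoder; keys/values come
        # from the encoded (unquantized) performance note sequence.
        attended, _ = self.cross_attn(
            query=decoder_states,
            key=note_seq_memory,
            value=note_seq_memory,
            key_padding_mask=note_seq_mask,  # True where padding
        )
        # Residual connection + layer norm, as in a standard
        # transformer decoder block.
        return self.norm(decoder_states + attended)

# Toy usage: batch of 2, 16 decoder steps, 128 note events.
block = CrossAttnBlock()
dec = torch.randn(2, 16, 512)
mem = torch.randn(2, 128, 512)
out = block(dec, mem)  # -> shape (2, 16, 512)
```

In practice this block would sit inside each decoder layer, between self-attention and the feed-forward sublayer, but again, that's the textbook transformer arrangement rather than anything specific to my setup.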
Anyway, I'm curious whether anyone else in the Magenta community/team has experimented with this. If so, what were your results? I'd also be curious to know what you used as training data. It'd be nice to maybe one day get a good-quality MusicXML dataset going.
Also, if people are interested, I can probably find a way to share the code -- I'd just have to clean it up a bit and get it cleared by Google.