This could be an important paper, and it is therefore important that the presentation be watertight. I provide the comments below fully aware of current constraints on daily life, and in the spirit of minimizing additional work for the authors. Overall, I believe the data needed to improve the manuscript already exist (or at least they should). But there is a fundamental question of whether the data are over-interpreted, and whether the effect of Clamp is as significant as the authors claim, at least within the framework of the process of ZGA.
Many AI governance proposals rest on the idea that there is a significant gap between what you can do on a dense, exquisitely optimized cluster and what you can do on a much scrappier system. While it will always be more efficient to train on a bunch of computers sitting next to one another, the technical frontier of training in other, more distributed ways is evolving very quickly, and I think no one has much of a sense of how far it can go. There are also interesting startups like together.xyz making a serious bet on this technical direction as a business.