Over the last few months, my Twitter timeline has been taken over by CLIP-generated art. A growing community of artists, researchers, and hackers has been experimenting with these models and sharing their outputs. People have also been sharing code and various tricks and methods for modifying the quality or artistic style of the images produced. It all feels a bit like an emerging art scene.
On January 5th, 2021, OpenAI released the model weights and code for CLIP: a model trained to determine which caption, from a set of candidate captions, best fits a given image. After learning from hundreds of millions of images in this way, CLIP not only became quite proficient at picking out the best caption for a given image, but it also learned some surprisingly abstract and general representations for vision (see the multimodal neuron work from Goh et al. on Distill).
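For a concrete sense of what that task looks like, here is a minimal sketch of using the released CLIP package to rank a few candidate captions against an image, loosely following the usage example in OpenAI's CLIP repo. The photo.jpg path and the candidate captions are just placeholders.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Placeholder image and candidate captions to rank against it.
image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)
captions = ["a photo of a dog", "a trippy psychedelic landscape", "a diagram of a neural network"]
text = clip.tokenize(captions).to(device)

with torch.no_grad():
    # CLIP scores every (image, caption) pair; softmax turns the scores into a ranking.
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1)

for caption, p in zip(captions, probs[0].tolist()):
    print(f"{p:.3f}  {caption}")
```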
Image representations at this level of abstraction were something of a first of their kind. On top of all this, the model also demonstrated greater classification robustness than prior work.
Nonetheless, it only took a day for various hackers, researchers, and artists (most notably @advadnoun and @quasimondo on Twitter) to figure out that, with a simple trick, CLIP can actually be used to guide existing image-generating models (like GANs, autoencoders, or implicit neural representations like SIREN) to produce original images that fit a given caption.
DeepDream was an incredibly popular AI art technique from a previous generation (2015). The technique essentially takes in an image and modifies it slightly (or dramatically) such that the image maximally activates certain neurons in a neural network trained to classify images. The results are usually very psychedelic and trippy, like the image below.
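To make that concrete, here is a rough PyTorch sketch of the DeepDream idea: gradient ascent on the pixels of an image to amplify the activations of some layer in a pretrained classifier. This is not the original implementation; the choice of GoogLeNet, the inception4c layer, the step size, and the photo.jpg input are all illustrative assumptions, and details like input normalization and multi-scale "octaves" are omitted.

```python
import torch
from PIL import Image
from torchvision import models, transforms

# Any pretrained classifier can be "dreamed" on; DeepDream famously used GoogLeNet.
model = models.googlenet(weights="DEFAULT").eval()

# Capture the activations of one layer (an arbitrary choice here) with a forward hook.
activations = {}
model.inception4c.register_forward_hook(
    lambda module, inputs, output: activations.update(value=output)
)

# Start from an existing photo (placeholder path) and treat its pixels as the parameters we optimize.
to_tensor = transforms.Compose(
    [transforms.Resize(256), transforms.CenterCrop(224), transforms.ToTensor()]
)
img = to_tensor(Image.open("photo.jpg")).unsqueeze(0).requires_grad_(True)

# Gradient ascent: nudge the pixels so the chosen layer activates more strongly.
for step in range(50):
    model(img)
    loss = activations["value"].norm()  # roughly "how excited is this layer by the image?"
    loss.backward()
    with torch.no_grad():
        img += 0.01 * img.grad / (img.grad.abs().mean() + 1e-8)
        img.grad.zero_()
        img.clamp_(0, 1)
```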
Although aesthetically DeepDream is quite different from The Big Sleep, both of these techniques share a similar vision: they both aim to extract art from neural networks that were not necessarily meant to generate art. They dive inside the network and pull out beautiful images. These art techniques feel like deep learning interpretability tools that accidentally produced art along the way.
Around early April, @advadnoun and @RiversHaveWings started doing some experiments combining VQ-GAN and CLIP to generate images from a text prompt. At a high level, the method they used is mostly identical to The Big Sleep. The main difference is really just that instead of using BigGAN as the generative model, this system used VQ-GAN.
CLIP learned representations general enough that, in order to induce desired behavior from the model, all we need to do is ask for it in the prompt. Of course, finding the right words to get the best outputs can be quite a challenge; after all, it did take several months to discover the "unreal engine" trick.
Ever since OpenAI released the weights and code for their CLIP model, various hackers, artists, researchers, and deep learning enthusiasts have figured out how to utilize CLIP as an effective "natural language steering wheel" for various generative models, allowing artists to create all sorts of interesting visual art merely by inputting some text – a caption, a poem, a lyric, a word – to one of these models.
You can even mention specific cultural references and it'll usually come up with something sort of accurate. Querying the model for a "studio ghibli landscape" produces a reasonably convincing result:
These models have so much creative power: just input some words and the system does its best to render them in its own uncanny, abstract style. It's really fun and surprising to play with: I never really know what's going to come out; it might be a trippy pseudo-realistic landscape or something more abstract and minimal.
And despite the fact that the model does most of the work in actually generating the image, I still feel creative – I feel like an artist – when working with these models. There's a real element of creativity to figuring out what to prompt the model for. The natural language input is a total open sandbox, and if you can wield words to the model's liking, you can create almost anything.
In concept, this idea of generating images from a text description is incredibly similar to OpenAI's DALL-E model (if you've seen my previous blog posts, I covered both the technical inner workings and philosophical ideas behind DALL-E in great detail). But in fact, the method here is quite different. DALL-E is trained end-to-end for the sole purpose of producing high-quality images directly from language, whereas this CLIP method is more like a beautifully hacked-together trick for using language to steer existing unconditional image-generating models.
Since the CLIP-based approach is a little more hacky, the outputs are not quite as high quality and precise as what's been demonstrated with DALL-E. Instead, the images produced by these systems are weird, trippy, and abstract. The outputs are grounded in our world for sure, but it's like they were produced by an alien that sees things a little bit differently.
I'm not going to go in-depth on the technical details of how this system generates art. Instead, I'm going to document the unexpected origins and evolution of this art scene, and along the way I'll also present some of my own thoughts and some cool artwork.
Of course, I am not able to cover every aspect of this art scene in a single blog post. But I think this post hits most of the big points and big ideas, and if there's anything important that you think I might have missed, feel free to comment below or tweet at me.
For instance, CLIP learned a neuron that activates specifically for images and concepts relating to Spider-Man. There are also other neurons that activate for images relating to emotions, geographic locations, or even famous individuals (you can explore these neuron activations yourself with OpenAI's Microscope tool).
So from a research perspective, CLIP was an incredibly exciting and powerful model. But nothing here clearly suggests that it would be helpful with generating art – let alone spawning the art scene that it did.
In this method, CLIP acts as something like a "natural language steering wheel" for generative models. CLIP essentially guides a search through the latent space of a given generative model to find latents that map to images which fit with a given sequence of words.
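A stripped-down sketch of that search might look like the following. Here, load_generator() and its latent_dim attribute are placeholders for whatever frozen generator gets plugged in (BigGAN in The Big Sleep, VQ-GAN in the later notebooks), and real implementations add augmentations, random crops, and CLIP's input normalization, all of which are omitted for brevity.

```python
import torch
import torch.nn.functional as F
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)

# Placeholder: any frozen generator that maps a latent vector to an image tensor.
generator = load_generator().to(device).eval()

# Encode the prompt once; the text embedding stays fixed throughout the search.
prompt = "a studio ghibli landscape"
text_features = clip_model.encode_text(clip.tokenize([prompt]).to(device))
text_features = text_features / text_features.norm(dim=-1, keepdim=True)

# The latent is the only thing being optimized; CLIP and the generator stay frozen.
latent = torch.randn(1, generator.latent_dim, device=device, requires_grad=True)
optimizer = torch.optim.Adam([latent], lr=0.05)

for step in range(300):
    image = generator(latent)               # latent -> image
    image = F.interpolate(image, size=224)  # resize to CLIP's input resolution
    image_features = clip_model.encode_image(image)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)

    # Push the image's CLIP embedding toward the prompt's CLIP embedding.
    loss = -(image_features * text_features).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The same loop covers both The Big Sleep and VQ-GAN + CLIP: only the generator, and the shape of the latent it expects, changes.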
Of course, the outputs from The Big Sleep are maybe not everyone's cup of tea. They're weird and abstract, and while they are usually globally coherent, sometimes they don't make much sense. There is definitely a unique style to artworks produced by The Big Sleep, and I personally find it to be aesthetically pleasing.
But the main wonder and enchantment that I get from The Big Sleep does not necessarily come from its aesthetics; rather, it's a bit more meta. The Big Sleep's optimization objective when generating images is to find a point in GAN latent space that maximally corresponds to a given sequence of words under CLIP. So when looking at outputs from The Big Sleep, we are literally seeing how CLIP interprets words and how it "thinks" they correspond to our visual world.
To really appreciate this, you can think of CLIP as being either statistical or alien. I prefer the latter. I like to think of CLIP as something like an alien brain that we're able to unlock and peer into with the help of techniques like The Big Sleep. Neural networks are very different from human brains, so thinking of CLIP as some kind of alien brain is not actually that crazy. Of course CLIP is not truly "intelligent", but it's still showing us a different view of things, and I find that idea quite enchanting.
The alternative perspective/philosophy on CLIP is a little more statistical and cold. You could think of CLIP's outputs as the product of mere statistical averages: the result of computing the correlations between language and vision as they exist on the internet. And so with this perspective, the outputs from CLIP are more akin to peering into the zeitgeist (at least the zeitgeist at the time that CLIP's training data was scraped) and seeing things as something like a "statistical average of the internet" (of course, this assumes minimal approximation error with respect to the true distribution of data, which is probably an unreasonable assumption).
Since CLIP's outputs are so weird, the alien viewpoint makes a lot more sense to me. I think the statistical zeitgeist perspective applies more to situations like GPT-3, where the approximation error is presumably quite low.