Alien Experiment Scene

Heli Whetzel

Aug 4, 2024, 7:48:06 PM

to pifapifelt

This is proved to be undeniably true in the 1993 alien abduction movie Fire in the Sky, following an Arizona logger who mysteriously disappears for five days in an alleged encounter with a flying saucer in 1975. Based on a true story, the film deals with the abduction itself as well as the reaction of his friends and family, who struggle to bring him back to the reality of life when he regains consciousness on Earth.

Ebert does a good job of trying to describe the nightmarish scenes, though it is truly difficult to do the extraordinary moment justice. Director Robert Lieberman does a remarkable job of putting the audience on the operating table on which the protagonist suffers.

Forced onto a raised stone platform with a bright white light hanging high above him, the protagonist, Travis Walton, is surrounded by three fleshy alien beings overseeing the operation, which look far more realistic than the extraterrestrials Hollywood usually offers. He screams so much that it almost sounds like droning stock footage, and the entirely convincing set design quickly establishes the sheer terror of the situation.

Wrapped in a tactile rubber lining that mimics cling film and pins him tightly to the surface, Walton awaits his treatment, with total fear present in his darting eyes, which the filmmaker takes pleasure in focusing on. Around him, the aliens look on in bewildered curiosity, like a child peering at a helpless animal through the glass in a zoo. Totally devoid of mercy and emotion, they offer no chance of aid.

The whole moment is a masterful example of how makeup, special effects and set design can elevate a moment that largely comes out of nowhere, with little leadup and zero dialogue to heighten the drama. As far as alien abduction scenes go in cinema, there are none that even come close.

Over the last few months, my Twitter timeline has been taken over by CLIP-generated art. A growing community of artists, researchers, and hackers has been experimenting with these models and sharing their outputs. People have also been sharing code and various tricks and methods for modifying the quality or artistic style of the images produced. It all feels a bit like an emerging art scene.

On January 5th, 2021, OpenAI released the model weights and code for CLIP, a model trained to determine which caption from a set of captions best fits a given image. After learning from hundreds of millions of images in this way, CLIP not only became quite proficient at picking out the best caption for a given image, but it also learned some surprisingly abstract and general representations for vision (see the multimodal neuron work from Goh et al. on Distill).
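To make the caption-matching idea concrete, here is a minimal sketch in plain Python. It is not OpenAI's code: the three-number vectors are made-up stand-ins for the embeddings that CLIP's image and text encoders would produce, and the names `cosine`, `softmax`, and `best_caption`, as well as the temperature value, are assumptions chosen for illustration. The sketch only shows the ranking step: score each caption against the image by cosine similarity, then turn the scores into probabilities with a softmax.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def softmax(scores, temperature=1.0):
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def best_caption(image_vec, caption_vecs, temperature=0.1):
    """Rank captions by how well their embedding matches the image embedding."""
    captions = list(caption_vecs)
    sims = [cosine(image_vec, caption_vecs[c]) for c in captions]
    probs = softmax(sims, temperature)
    return sorted(zip(captions, probs), key=lambda pair: pair[1], reverse=True)

# Toy embeddings (assumed values, not real CLIP outputs).
IMAGE = [0.9, 0.1, 0.2]
CAPTIONS = {
    "a photo of a dog": [0.8, 0.2, 0.1],
    "a photo of a cat": [0.1, 0.9, 0.2],
    "a diagram of a circuit": [0.1, 0.2, 0.9],
}

ranked = best_caption(IMAGE, CAPTIONS)
for caption, prob in ranked:
    print(f"{prob:.3f}  {caption}")
```

With these toy numbers the dog caption ranks first, because its vector points in nearly the same direction as the image vector.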

Image representations at this level of abstraction were something of a first. In addition, the model demonstrated greater classification robustness than any prior work.

Nonetheless, it only took a day for various hackers, researchers, and artists (most notably @advadnoun and @quasimondo on Twitter) to figure out that with a simple trick CLIP can actually be used to guide existing image generating models (like GANs, Autoencoders, or Implicit Neural Representations like SIREN) to produce original images that fit with a given caption.

DeepDream was an incredibly popular AI art technique from a previous generation (2015). The technique essentially takes in an image and modifies it slightly (or dramatically) such that the image maximally activates certain neurons in a neural network trained to classify images. The results are usually very psychedelic and trippy, like the image below.
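The core loop behind this kind of technique is gradient ascent on the input rather than on the network's weights. The sketch below is a toy version under loud assumptions: the "network" is a single tanh neuron with hand-picked weights, the "image" is four numbers, and `activation`, `activation_grad`, and `dream` are hypothetical names. Real DeepDream runs the same idea through a deep image classifier with automatic differentiation.

```python
import math

def activation(weights, pixels):
    """Toy 'neuron': tanh of a weighted sum of the pixel values."""
    return math.tanh(sum(w * p for w, p in zip(weights, pixels)))

def activation_grad(weights, pixels):
    """Gradient of the activation with respect to each pixel."""
    a = activation(weights, pixels)
    return [(1.0 - a * a) * w for w in weights]

def dream(weights, pixels, steps=50, lr=0.1):
    """Nudge the pixels uphill so the neuron fires more strongly."""
    pixels = list(pixels)
    history = [activation(weights, pixels)]
    for _ in range(steps):
        grad = activation_grad(weights, pixels)
        # Gradient *ascent*: move the image, not the weights.
        pixels = [p + lr * g for p, g in zip(pixels, grad)]
        history.append(activation(weights, pixels))
    return pixels, history

# Assumed toy values for illustration only.
WEIGHTS = [0.5, -0.3, 0.8, 0.1]
IMAGE = [0.2, 0.4, 0.1, 0.3]

dreamed, history = dream(WEIGHTS, IMAGE)
print("activation before:", round(history[0], 4))
print("activation after: ", round(history[-1], 4))
```

The weights stay frozen throughout; only the image changes, which is why the result looks like the network's own preferences painted onto the input.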

Although aesthetically DeepDream is quite different from The Big Sleep, both of these techniques share a similar vision: they both aim to extract art from neural networks that were not necessarily meant to generate art. They dive inside the network and pull out beautiful images. These art techniques feel like deep learning interpretability tools that accidentally produced art along the way.

Around early April @advadnoun and @RiversHaveWings started doing some experiments combining VQ-GAN and CLIP to generate images from a text prompt. On a high level, the method they used is mostly identical to The Big Sleep. The main difference is really just that instead of using Big-GAN as the generative model, this system used VQ-GAN.

CLIP learned general enough representations that in order to induce desired behavior from the model, all we need to do is to ask for it in the prompt. Of course, finding the right words to get the best outputs can be quite a challenge; after all, it did take several months to discover the unreal engine trick.

Ever since OpenAI released the weights and code for their CLIP model, various hackers, artists, researchers, and deep learning enthusiasts have figured out how to utilize CLIP as an effective “natural language steering wheel” for various generative models, allowing artists to create all sorts of interesting visual art merely by inputting some text – a caption, a poem, a lyric, a word – to one of these models.

You can even mention specific cultural references and it’ll usually come up with something sort of accurate. Querying the model for a “studio ghibli landscape” produces a reasonably convincing result:

These models have so much creative power: just input some words and the system does its best to render them in its own uncanny, abstract style. It’s really fun and surprising to play with: I never really know what’s going to come out; it might be a trippy pseudo-realistic landscape or something more abstract and minimal.

And despite the fact that the model does most of the work in actually generating the image, I still feel creative – I feel like an artist – when working with these models. There’s a real element of creativity to figuring out what to prompt the model for. The natural language input is a total open sandbox, and if you can wield words to the model’s liking, you can create almost anything.

In concept, this idea of generating images from a text description is incredibly similar to OpenAI’s DALL-E model (if you’ve seen my previous blog posts, I covered both the technical inner workings and philosophical ideas behind DALL-E in great detail). But in fact, the method here is quite different. DALL-E is trained end-to-end for the sole purpose of producing high-quality images directly from language, whereas this CLIP method is more like a beautifully hacked-together trick for using language to steer existing unconditional image-generating models.

Since the CLIP-based approach is a little more hacky, the outputs are not quite as high quality and precise as what’s been demonstrated with DALL-E. Instead, the images produced by these systems are weird, trippy, and abstract. The outputs are grounded in our world for sure, but it’s like they were produced by an alien that sees things a little bit differently.

I’m not going to go in-depth on the technical details of how this system generates art. Instead, I’m going to document the unexpected origins and evolution of this art scene, and along the way I’ll also present some of my own thoughts and some cool artwork.

Of course I am not able to cover every aspect of this art scene in a single blog post. But I think this blog hits most of the big points and big ideas, and if there’s anything important that you think I might have missed, feel free to comment below or tweet at me.

For instance, CLIP learned to represent a neuron that activates specifically for images and concepts relating to Spider-Man. There are also other neurons that activate for images relating to emotions, geographic locations, or even famous individuals (you can explore these neuron activations yourself with OpenAI’s microscope tool).

So from a research perspective, CLIP was an incredibly exciting and powerful model. But nothing here clearly suggests that it would be helpful with generating art – let alone spawning the art scene that it did.

In this method, CLIP acts as something like a “natural language steering wheel” for generative models. CLIP essentially guides a search through the latent space of a given generative model to find latents that map to images which fit with a given sequence of words.
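Here is a rough sketch of what that search can look like, in plain Python with heavy simplifications. Everything in it is an assumption made for illustration: `generator` is a tiny hand-written function standing in for a pretrained GAN, `clip_score` is plain cosine similarity standing in for CLIP, the text embedding is made up, and finite differences replace the backpropagation that real notebooks use. The shape of the loop is the point: both models stay frozen, and only the latent is updated so that the generated image scores higher against the text.

```python
import math

def generator(z):
    """Hypothetical stand-in for a frozen generator: latent -> image features."""
    return [z[0] + 0.5 * z[1], z[1] - 0.2 * z[0], 0.3 * (z[0] + z[1])]

def clip_score(image_features, text_features):
    """Stand-in for CLIP: cosine similarity between image and text features."""
    dot = sum(a * b for a, b in zip(image_features, text_features))
    norm_i = math.sqrt(sum(a * a for a in image_features))
    norm_t = math.sqrt(sum(b * b for b in text_features))
    return dot / (norm_i * norm_t + 1e-12)  # epsilon guards against zero norm

def numeric_grad(f, z, eps=1e-6):
    """Central finite differences (real systems use autograd instead)."""
    grad = []
    for i in range(len(z)):
        up, down = list(z), list(z)
        up[i] += eps
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad

def steer(text_features, z, steps=100, lr=0.5):
    """Search latent space for a latent whose image best matches the text."""
    def score(latent):
        return clip_score(generator(latent), text_features)

    z = list(z)
    best = score(z)
    for _ in range(steps):
        grad = numeric_grad(score, z)
        candidate = [zi + lr * gi for zi, gi in zip(z, grad)]
        candidate_score = score(candidate)
        if candidate_score > best:
            z, best = candidate, candidate_score  # keep the improvement
        else:
            lr *= 0.5  # step was too large; try a smaller one next time
    return z, best

# Assumed toy values: a made-up "text embedding" and a starting latent.
TEXT = [0.2, 1.0, 0.4]
START = [1.0, 0.0]

start_score = clip_score(generator(START), TEXT)
final_z, final_score = steer(TEXT, START)
print("score before:", round(start_score, 4))
print("score after: ", round(final_score, 4))
```

Swapping in a different generator only means replacing `generator`; the steering loop itself does not change, which is part of why the trick spread across so many models.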

Of course, the outputs from The Big Sleep are maybe not everyone’s cup of tea. They’re weird and abstract, and while they are usually globally coherent, sometimes they don’t make much sense. There is definitely a unique style to artworks produced by The Big Sleep, and I personally find it to be aesthetically pleasing.