Stability AI Img2img


Bowie Maur

Aug 4, 2024, 7:29:36 PM
to rousspagcentsu
Image-to-image (img2img for short) is a method to generate new AI images from an input image and a text prompt. The output image will follow the color and composition of the input image.

The prompt requirement is the same as text-to-image. You can view image-to-image as a generalization of text-to-image: Text-to-image starts with an image of random noise. Image-to-image starts with an image you specify and then adds noise.
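Conceptually, that "start with your image and add noise" step can be sketched in plain PyTorch. These are illustrative names, not the actual diffusers API, and alpha_bar stands in for the scheduler's cumulative noise schedule:

```python
import torch

def add_noise(latents, noise, alpha_bar):
    # Forward diffusion: blend the clean latents with Gaussian noise
    # according to the cumulative schedule value alpha_bar in [0, 1].
    return alpha_bar.sqrt() * latents + (1.0 - alpha_bar).sqrt() * noise

def denoise_steps(num_steps, strength):
    # img2img skips the early denoising steps: strength=1.0 behaves
    # like txt2img (pure noise), strength=0.0 returns the input as-is.
    return min(int(num_steps * strength), num_steps)
```

With strength 0.5 and 20 steps, only the last 10 denoising steps actually run, which is why low strengths preserve the input's color and composition.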


Hey, I love your blog; the instructions are amazingly simple and yet very informative.

I am trying to find a way to transform lineart into a photorealistic image, while following the outlines, similar to what Vizcom does. Img2img does not give good results, no matter the prompt used. Is there a way SD can do this?

Thanks, Vesna


In Stable Diffusion there are many parameters you can change, which gives you a lot of flexibility but also takes a while to understand. Generally speaking, you enter a prompt describing what you want the image to look like, and an optional negative prompt (what the image should exclude).


The base resolution at which SD renders images is generally low, since the images SD 1.5 was trained on are only 512×512 pixels, so it does not make sense to go much higher in the initial round. Also, of course, the bigger the image, the longer it takes. Users with powerful rigs can use Highres Fix to get around this issue, but most (like me) will have to do it differently.
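As a rough rule of thumb, you can keep the total pixel count near the training resolution while matching your aspect ratio, and snap the dimensions to a multiple the model is comfortable with (64 is a safe choice for SD 1.5). A small hypothetical helper:

```python
import math

def sd_dims(aspect, base=512, multiple=64):
    # Pick a width/height pair whose area is close to base*base,
    # preserving the aspect ratio and snapping both dimensions to
    # SD-friendly multiples.
    height = math.sqrt(base * base / aspect)
    width = aspect * height
    snap = lambda v: max(multiple, round(v / multiple) * multiple)
    return snap(width), snap(height)
```

For example, sd_dims(16 / 9) gives (704, 384), roughly the same pixel budget as 512×512.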


How the image looks depends not only on the parameters but also on which checkpoint (or model/diffusor) is used. SD initially started with its own model, but since it is open source, other creative minds have used it to train models on more specific images. Those models can in turn be merged, so it is possible to have a rather unique blend.


When I find a prompt that looks promising, I usually let the computer run overnight and generate 100 variations. For this quick example I generated 10 images, which took around 20 minutes or so. Here are the results:
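Batch runs like this are just a loop over seeds. A hedged sketch, where generate_variations is a hypothetical wrapper and pipe is assumed to be an already-loaded diffusers text-to-image pipeline (e.g. StableDiffusionPipeline); the generator/seed pattern is how diffusers runs are made reproducible:

```python
import torch

def variation_seeds(base_seed, count):
    # Deterministic, reproducible seed list for a batch run.
    return [base_seed + i for i in range(count)]

def generate_variations(pipe, prompt, count=10, base_seed=0):
    # Each seed yields a different, but reproducible, variation
    # of the same prompt.
    images = []
    for seed in variation_seeds(base_seed, count):
        generator = torch.Generator("cpu").manual_seed(seed)
        images.append(pipe(prompt, generator=generator).images[0])
    return images
```

Recording the seed alongside each output makes it easy to re-run only the promising ones later at higher quality.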


The checkpoint (diffusor) that I created is more on the creative/illustrative side, so it's a good starting point; however, I like it even more realistic. This I do in the next step, with img2img. This is the tool I fell in love with when I first heard about SD. You basically take an image as a base and interpolate it again with another prompt. You can modify the prompt, or you can modify the diffusor. Some diffusors also need trigger words to unleash their training; for instance, Inkpunk Diffusion uses nvinkpunk as a trigger, which you put into your prompt.


This fine-tuning step I combine with the next one: upscaling the image via img2img. Basically, it slices the image into smaller parts, which run through the diffusion again. This way a lot of detail can be added.
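The slicing described here (often called "SD upscale") boils down to: cut the upscaled image into overlapping tiles, run each tile through img2img at low strength, and paste the results back. A hypothetical helper for just the tile geometry:

```python
def tile_boxes(width, height, tile=512, overlap=64):
    # Overlapping (left, top, right, bottom) boxes covering the image;
    # the overlap exists so seams can be blended when pasting back.
    stride = tile - overlap
    boxes = []
    for top in range(0, max(height - overlap, 1), stride):
        for left in range(0, max(width - overlap, 1), stride):
            boxes.append((left, top,
                          min(left + tile, width),
                          min(top + tile, height)))
    return boxes
```

For a 1024×1024 image this yields a 3×3 grid of 512-pixel tiles; each box can then be cropped with PIL's Image.crop, refined via img2img, and pasted back.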


For LMS, I see that in txt2img the output is exactly the same (20 steps, 7.5 CFG, 0 seed), but when moving to img2img, the output is very different, and most notably, it seems there is some smoothing or something happening in diffusers that causes the output to lose crispness. (Images to follow in posts)


I would like to blend the latent representations of two Stable Diffusion img2img-pipeline images;

I think the blending part will be a trivial task, but I have extreme difficulties accessing the latents.




The leftmost image is the initial image.

The second from the left is the SD img2img result for "japanese wood painting"; it seems about fine.

The third is the reverse decoded image representation of the VAE encoded result, which I understand to be the latent representation of the img2img result.

The last is the result of SD txt2img pipeline of the latent representation and the prompt.




What am I doing wrong? Also, it is clear that I don't understand the stable diffusion process concerning the latents. I have gone through a multitude of general descriptions of the SD process and huggingface docs, if someone could point me in the direction of good resources, that would be appreciated as well. If it is pertinent, my final goal would be blending latents of video sequence frames for better continuity.
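For what it's worth, the usual pitfall when round-tripping through the VAE is the latent scaling factor: SD 1.x pipelines multiply the VAE encoder output by roughly 0.18215 before denoising and divide by it again before decoding, and skipping that step gives washed-out or wildly wrong results. The blending itself can be a plain linear or spherical interpolation of the latent tensors. A sketch in PyTorch (slerp here is a generic implementation, not a diffusers function):

```python
import torch

def lerp(t, a, b):
    # Straight linear blend of two latent tensors.
    return (1.0 - t) * a + t * b

def slerp(t, a, b, eps=1e-8):
    # Spherical interpolation; often preferred for Gaussian-like
    # latents because it preserves the norm better than lerp.
    a_flat, b_flat = a.flatten(), b.flatten()
    a_n = a_flat / (a_flat.norm() + eps)
    b_n = b_flat / (b_flat.norm() + eps)
    omega = torch.acos(torch.dot(a_n, b_n).clamp(-1.0, 1.0))
    if omega.abs() < 1e-4:  # nearly parallel: fall back to lerp
        return lerp(t, a, b)
    s = torch.sin(omega)
    return (torch.sin((1.0 - t) * omega) * a
            + torch.sin(t * omega) * b) / s
```

Whichever blend you use, feed the result back in as latents (the pipeline's latents argument) rather than decoding and re-encoding it, and keep the scaling factor consistent on both sides.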


Stable Diffusion is known for its ability to generate images from prompts with its txt2img model. However, it goes further by taking an image as a reference and generating a new image through its img2img model, making it handy for various purposes.


Stable Diffusion generates reliable results if you get the prompt and other factors right, but inconsistencies are commonly found. For instance, Stable Diffusion generated an image with an apple that had two stems and required editing.


We want to run the image through a low step, low denoising pass on img2img to make it come together a bit more. An optional step after this is to do a masking pass to only grab the parts of each image you want (original txt2img, the inpainting final, and the img2img final).
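That pass translates to an img2img call with a low strength and a modest step count. A hedged sketch, where refine is a hypothetical wrapper and pipe is assumed to be a loaded diffusers StableDiffusionImg2ImgPipeline:

```python
import torch

def refine(pipe, image, prompt, strength=0.25, steps=20, seed=0):
    # Low strength keeps the composition of `image`; the pass runs
    # only roughly int(steps * strength) denoising steps, so it is
    # cheap and mainly smooths the composited pieces together.
    generator = torch.Generator("cpu").manual_seed(seed)
    result = pipe(prompt=prompt, image=image, strength=strength,
                  num_inference_steps=steps, generator=generator)
    return result.images[0]
```

The masking pass afterwards can be done in any image editor, or programmatically with PIL's Image.composite and a hand-drawn mask.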


my day one notebook:

-text2image

which runs at full speed with all parameters exposed. This is great for learning how the new system works, since you can see all of the code needed to run inference, but it requires an A100 to run. This is probably more than most people will need, so while the Gradio version at the top might be slower, it is running on a T4.




Official info here - Stable Cascade has arrived. -AI/StableCascade

This revolutionary 3-stage model generates 1024x1024 images faster than SDXL.



There is a Hugging Face space for simple inference; however, I put together a Colab (requires an A100) if people want to play with all the parameters in each stage. You can use a text prompt with txt2img; all four parameters are exposed for the B and C stages. There is an image-variation mode, which only needs an image URL with no prompt. Finally, there is an img2img mode, which can take an image URL and a prompt, with some parameters scaling automatically, which you can see in the code.
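In diffusers the two stages map to separate pipelines. A hedged sketch of how a txt2img call chains them; pipeline loading is omitted, and the class and field names follow diffusers' Stable Cascade support (StableCascadePriorPipeline / StableCascadeDecoderPipeline), so treat them as assumptions to verify against the current docs:

```python
import torch

def cascade_txt2img(prior, decoder, prompt, seed=0):
    # Stage C (`prior`) turns the prompt into compact image
    # embeddings; Stage B (`decoder`) expands them into the final
    # 1024x1024 image. Guidance/step parameters can differ per stage.
    generator = torch.Generator("cpu").manual_seed(seed)
    prior_out = prior(prompt=prompt, generator=generator)
    decoded = decoder(image_embeddings=prior_out.image_embeddings,
                      prompt=prompt, generator=generator)
    return decoded.images[0]
```

The image-variation and img2img modes mentioned above differ mainly in how the prior is conditioned, while the decoder stage stays the same.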







each video has a walkthrough if you need it :)




In today's digital age, images play a vital role in various fields, including art, entertainment, advertising, and more. Advancements in artificial intelligence (AI) have revolutionized image processing, giving rise to cutting-edge techniques such as image-to-image transformation. Among these techniques, the Stable Diffusion img2img model has emerged as a groundbreaking solution that enables powerful and seamless image manipulation. Whether you're an artist, designer, or simply an enthusiast, Stable Diffusion img2img opens up a world of possibilities for transforming images in captivating ways.


Stable Diffusion img2img is an advanced AI model designed to perform image-to-image transformation. Developed using state-of-the-art machine learning techniques, this model leverages the concept of diffusion processes to achieve remarkable results in various image manipulation tasks. It offers an intuitive online platform where users can effortlessly transform their images with a few clicks, making it accessible to professionals and beginners alike.


Stable Diffusion img2img has emerged as a game-changer in the field of image-to-image transformation. Its powerful capabilities, combined with its accessible online platform, make it a valuable tool for artists, designers, photographers, and anyone interested in unlocking the creative potential of their images. With Stable Diffusion img2img, the boundaries of visual expression are pushed further, enabling users to effortlessly transform and enhance their images, bringing their visions to life like never before.


At the moment, it is a bit of a hassle to get it to run. The immediate next step is to move the tokenizer off Python, which will make the whole thing "Swift-only". After that will come various performance optimizations, memory-usage optimizations, and CPU / Metal work to finally make this usable on mobile. I estimate at least a month of work ahead to make it mobile-friendly.


For some fun, here is the logo for this project, generated with the prompt: "a logo for swift diffusion, with a carton animal and a text diffusion underneath". It seems to have trouble understanding exactly what "diffusion" is, though.


There is no comparison. swift-diffusion in its current form only supports Linux + CUDA. CPU support will come later, and at that point it can run on Mac / iOS. But to run it efficiently, some ops need to leverage the hardware, either as Metal compute kernels or via the Neural Engine (see tinygrad/accel/ane in geohot/tinygrad on GitHub). DiffusionBee currently uses the MPS backend implemented in PyTorch to run efficiently on M1. I haven't looked too deeply into how the MPS backend is implemented, but I would imagine some Metal kernels plus ANE there.


@liuliu note that on the M1 Max at least, the GPU F16 processing power is higher than the ANE's. Also, the GPU is more programmable and can be accessed with lower latency. Try MPS and MPSGraph first; otherwise try simdgroup_matrix in the Metal Shading Language. This will provide the highest matrix multiplication performance.


Apple went this route with MetalFX temporal upscaling. Most people suspect that it runs on the ANE, but it actually runs entirely on the GPU. It's also restricted to only M1/Pro/Max and doesn't run on A14/A15, probably because it needs sufficient GPU F16 TFLOPS.


MPSGraph should be pleasant to use for making neural networks, so I advise trying exclusively MPSGraph at first. But measure the CPU-side overhead, which is massive with MPSGraph, before shipping the final product.


I've been updating the repo over the past few days. Now img2img should work, as well as inpainting (or you can call it outpainting; it really depends on where the mask is). Both require a text prompt to work (which is weird for inpainting, but I haven't figured out a way to avoid that).
