So if you have the Hugging Face Stable Diffusion model at /Users/myuser/stable-diffusion-v1-4/, then you can simply switch to the folder containing the Jupyter notebooks and run the following (on Linux/macOS; the Windows command is slightly different):
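If it helps, the same thing can also be done directly in Python by pointing diffusers at the local folder; a minimal sketch (using the example path above), assuming the notebook loads the pipeline with from_pretrained:

```python
from diffusers import StableDiffusionPipeline

# Load the pipeline from the local copy of the checkpoint instead of
# downloading it from the Hugging Face Hub; adjust the path to your setup.
pipe = StableDiffusionPipeline.from_pretrained("/Users/myuser/stable-diffusion-v1-4")
```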
Hi folks, I have started looking into the Lesson 9 video. Where is the main notebook that @jeremy showed in the video? Specifically the part where he talks about guidance scale, negative prompts, init image, textual inversion, Dreambooth, etc.
Is there a specific reason the UNet predicts the noise, instead of predicting the image with some of the noise removed? Is it because we want control over what percentage of the predicted noise we actually subtract from the image?
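For context, here is a rough sketch of a DDPM-style reverse step (my own illustration, not code from the lesson); it shows that the predicted noise is never subtracted wholesale, only a scheduler-determined fraction of it at each step, which is exactly where that control lives:

```python
import torch

def ddpm_step(x_t, eps_pred, t, alphas, alphas_cumprod):
    """One reverse-diffusion step: remove only part of the predicted noise.

    Illustrative sketch: alphas / alphas_cumprod are the standard DDPM
    schedules, eps_pred is the UNet's noise prediction for timestep t.
    """
    alpha_t = alphas[t]
    alpha_bar_t = alphas_cumprod[t]
    # The mean of x_{t-1} subtracts a *scaled* portion of the predicted noise.
    mean = (x_t - (1 - alpha_t) / torch.sqrt(1 - alpha_bar_t) * eps_pred) / torch.sqrt(alpha_t)
    if t == 0:
        return mean
    sigma_t = torch.sqrt(1 - alpha_t)  # one common choice: sigma_t^2 = beta_t
    return mean + sigma_t * torch.randn_like(x_t)
```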
Does anyone have experience with text embeddings having more than 77 tokens (CLIP)?
I am working on an exciting dataset where the text for each associated image is more extensive, around 150 tokens on average.
Alternatively:
Is there an LLM that directly outputs a prompt for Stable Diffusion? I could not find anything on the Hugging Face Hub, only GPT-2 prompt generation (could I misuse such a model to summarize a longer text into a prompt?).
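Not a full answer, but a starting point: CLIP's text encoder has a hard context length of 77 tokens, so longer captions have to be truncated, summarized, or split into chunks whose embeddings are then combined (e.g. averaged). A rough sketch of the chunking idea, assuming the standard openai/clip-vit-large-patch14 tokenizer:

```python
from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

def chunk_caption(caption, max_length=77):
    """Split a long caption into pieces that fit CLIP's 77-token window.

    Illustrative only: each chunk can be encoded separately and the resulting
    embeddings combined (e.g. averaged) before the UNet's cross-attention.
    """
    ids = tokenizer(caption, add_special_tokens=False)["input_ids"]
    step = max_length - 2  # leave room for the BOS/EOS tokens
    chunks = [ids[i:i + step] for i in range(0, len(ids), step)]
    return [tokenizer.decode(chunk) for chunk in chunks]
```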
Under WSL, at least with my configuration, I have to run export LD_LIBRARY_PATH=/usr/lib/wsl/lib, otherwise I get this error: Could not load library libcudnn_cnn_infer.so.8. Error: libcuda.so: cannot open shared object file: No such file or directory
Hi, I watched lesson 9A by @johnowhitaker and was very impressed by the details and the clear explanations of abstract concepts. I especially liked the paper drawings and all the comments about bringing more control to the generation process.
I have a follow-up question about the generation process. Can I use a very low-resolution image, like 32x32, as guidance to create an artistically stylized image with the same core object? I wonder if anyone is willing to work on it or give me some guidance on how to create that flow. Is that technically possible, and how can the upscaling process be made generative?
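It should be technically possible with plain img2img as a first approximation: upscale the 32x32 image with a simple resize and let the diffusion process invent the missing detail, steered by a style prompt. A sketch under that assumption (file name and prompt are placeholders):

```python
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

# Naively upscale the 32x32 guide image to the model's working resolution;
# the diffusion process then generates the missing detail.
init_image = Image.open("low_res.png").convert("RGB").resize((512, 512))

image = pipe(
    prompt="an oil painting of the same object, highly stylized",
    image=init_image,     # older diffusers versions call this argument init_image
    strength=0.75,        # higher = more freedom to deviate from the guide image
    guidance_scale=7.5,
).images[0]
```

A dedicated super-resolution or latent-upscaler model would likely preserve the core object better, but the above is the simplest flow to try.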
Final note: a new paper using a GAN-based approach came out yesterday with some very impressive super-resolution results. Not diffusion, so a bit off-topic, but interesting as a comparison: GigaGAN: Scaling up GANs for Text-to-Image Synthesis
I am training a model using stable-diffusion-v1-4 with a training dataset of around 4,900 examples. I use the following to create a pipe. It runs on CUDA with an 8GB Nvidia GPU. Even when I tried it on Google Colab, the time to train a batch of size 16 with the number of inference steps = 25 is around 7 minutes, and much higher if I increase this number, so one full training epoch takes around 34 hours, which seems very excessive. I am wondering if I am doing something wrong here, or is this the best I can hope for given the hardware, which is an 8GB GPU on my personal computer? All the delay is for this line of code:
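For comparison only (this is an illustrative sketch, not the pipe-creation code from the post above), a typical memory-friendly setup on an 8GB card looks something like:

```python
import torch
from diffusers import StableDiffusionPipeline

# Illustrative setup only: half-precision weights plus attention slicing are
# the usual first steps to make an 8GB GPU workable.
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")
pipe.enable_attention_slicing()
```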
MasaCtrl: Tuning-free Mutual Self-Attention Control for Consistent Image Synthesis and Editing by Tencent ARC Lab and University of Tokyo. This model uses CompVis/stable-diffusion-v1-4 as the base model and can simultaneously perform image synthesis and editing.
Abstract: Despite the success in large-scale text-to-image generation and text-conditioned image editing, existing methods still struggle to produce consistent generation and editing results. For example, generation approaches usually fail to synthesize multiple images of the same objects/characters but with different views or poses. Meanwhile, existing editing methods either fail to achieve effective complex non-rigid editing while maintaining the overall textures and identity, or require time-consuming fine-tuning to capture the image-specific appearance. In this paper, we develop MasaCtrl, a tuning-free method to achieve consistent image generation and complex non-rigid image editing simultaneously. Specifically, MasaCtrl converts existing self-attention in diffusion models into mutual self-attention, so that it can query correlated local contents and textures from source images for consistency. To further alleviate the query confusion between foreground and background, we propose a mask-guided mutual self-attention strategy, where the mask can be easily extracted from the cross-attention maps. Extensive experiments show that the proposed MasaCtrl can produce impressive results in both consistent image generation and complex non-rigid real image editing.
The model has two modes: (1) simultaneous image synthesis and editing and (2) editing of real or generated images. To use the model in mode (1), simply enter the source_prompt for the initial image and target_prompt for the final / edited image. To use the model in mode (2), upload your source_image and enter your target_prompt to edit the image.
In this session, you will learn how to optimize Stable Diffusion for inference using the Hugging Face Diffusers library and DeepSpeed-Inference. The session will show you how to apply state-of-the-art optimization techniques using DeepSpeed-Inference, and will focus on single-GPU (Ampere generation) inference for Stable Diffusion models. By the end of this session, you will know how to optimize your Hugging Face Stable Diffusion models using DeepSpeed-Inference. We are going to optimize CompVis/stable-diffusion-v1-4 for text-to-image generation.
DeepSpeed-Inference is an extension of the DeepSpeed framework focused on inference workloads. DeepSpeed-Inference combines model-parallelism technologies, such as tensor and pipeline parallelism, with custom optimized CUDA kernels. DeepSpeed provides a seamless inference mode for compatible transformer-based models trained using DeepSpeed, Megatron, and Hugging Face. For a list of compatible models, please see here. As mentioned, DeepSpeed-Inference integrates model-parallelism techniques that allow you to run multi-GPU inference for LLMs, like BLOOM with its 176 billion parameters. If you want to learn more about DeepSpeed-Inference:
Before we can load our model from the Hugging Face Hub, we have to make sure that we accepted the license of CompVis/stable-diffusion-v1-4 to be able to use it. CompVis/stable-diffusion-v1-4 is published under the CreativeML OpenRAIL-M license. You can accept the license by clicking the Agree and access repository button on the model page at: https://huggingface.co/CompVis/stable-diffusion-v1-4.
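Once the license is accepted (and you are logged in via huggingface-cli login), the pipeline can be loaded; a minimal sketch assuming fp16 weights on a single GPU:

```python
import torch
from diffusers import DiffusionPipeline

# Load the gated checkpoint in half precision; requires a prior
# `huggingface-cli login` with an account that accepted the license.
pipeline = DiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")
```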
The next and most important step is to optimize our pipeline for GPU inference. This will be done using the DeepSpeed InferenceEngine. The InferenceEngine is initialized using the init_inference method. We are going to replace the models in our pipeline, including the UNet and the CLIP text encoder, with DeepSpeed-optimized models.
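A sketch of what that call looks like, assuming the pipeline loaded above; the exact arguments vary between DeepSpeed versions, so treat this as illustrative rather than the definitive invocation:

```python
import torch
import deepspeed

# Kernel injection swaps compatible submodules (UNet, CLIP text encoder)
# for DeepSpeed-optimized equivalents such as DSUNet.
ds_pipeline = deepspeed.init_inference(
    model=pipeline,                   # diffusers pipeline from the previous step
    mp_size=1,                        # single-GPU inference
    dtype=torch.half,
    replace_with_kernel_inject=True,
)
```

Printing pipeline.unet afterwards is a quick way to confirm the replacement described next.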
We can now inspect the model graph to see that the vanilla UNet2DConditionModel has been replaced with a DSUNet, which includes the DeepSpeedAttention and triton_flash_attn_kernel modules, custom nn.Modules optimized for inference.
As the last step, we want to take a detailed look at the performance of our optimized pipeline. Applying optimization techniques, like graph optimizations or mixed precision, not only impacts performance (latency) but can also affect the accuracy of the model. So accelerating your model comes with a trade-off.
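A simple way to measure that latency side by side (an illustrative benchmark, assuming the pipeline objects from the previous steps and a fixed prompt):

```python
import time

def measure_latency(pipe, prompt, n_runs=5):
    """Average end-to-end latency of a text-to-image call over a few runs."""
    pipe(prompt)  # warm-up run so CUDA initialization does not skew the numbers
    timings = []
    for _ in range(n_runs):
        start = time.perf_counter()
        pipe(prompt)
        timings.append(time.perf_counter() - start)
    return sum(timings) / len(timings)
```

Running it once on the vanilla pipeline and once on the optimized one gives the latency comparison discussed below.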
We successfully optimized our Stable Diffusion pipeline with DeepSpeed-Inference and managed to decrease our model latency from 4.57s to 2.68s, or 1.7x. Those are good results considering that applying the optimization was as easy as adding one additional call to deepspeed.init_inference. But I have to say that this isn't a plug-and-play process you can transfer to any Transformers model, task, or dataset. Also, make sure to check if your model is compatible with DeepSpeed-Inference.
Diffusion models are some of the most recent disruptive models. Outperforming earlier generative models, they have been popularized by the success of DALL-E 2 and Imagen at generating photorealistic images from text prompts.
The diffusers library contains the Flax implementation of the model, but we also need weights to initialize it. We will use checkpoints from CompVis/stable-diffusion-v1-4, but first we need to accept the terms of the license for this model.
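Loading looks roughly like this sketch, assuming the bf16 branch of the checkpoint; note that Flax pipelines are stateless and return their parameters separately:

```python
import jax.numpy as jnp
from diffusers import FlaxStableDiffusionPipeline

# from_pretrained returns the pipeline plus a separate params pytree that we
# pass explicitly at call time.
pipeline, params = FlaxStableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    revision="bf16",
    dtype=jnp.bfloat16,
)
```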
We need to make sure that JAX is using the TPU backend before proceeding further. Note: the number of TPU cores is displayed; this should be 8 if you are running on a v2 or v3 TPU, or 4 if you are running on a v4 TPU.
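A quick check, for example:

```python
import jax

# Should list TPU devices; expect 8 on a v2/v3 TPU and 4 on a v4 TPU.
print(jax.devices())
print(f"Found {jax.device_count()} JAX devices")
```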
The input prompt is text, so we need to be able to convert it to numbers as expected by the diffusion model. Hence, the following function wraps pipeline.prepare_inputs to properly convert the text to token IDs.
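A sketch of such a wrapper, assuming the pipeline loaded above and sharding the token IDs across the available TPU devices:

```python
import jax
from flax.training.common_utils import shard

def tokenize_prompt(prompt):
    """Convert the prompt text to token IDs and shard them across devices."""
    num_devices = jax.device_count()
    # prepare_inputs tokenizes and pads the prompt to CLIP's context length;
    # we repeat it so every device generates one image.
    prompt_ids = pipeline.prepare_inputs([prompt] * num_devices)
    return shard(prompt_ids)  # adds the leading device axis expected by pmap
```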
The next helper function will run the pipeline with the input prompt text. This will generate a JAX representation of the image corresponding to that prompt. At the end we will convert this JAX array into an actual PIL image.
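And a sketch of that helper, assuming the tokenizer wrapper above and parameters replicated to every device; jit=True lets the pipeline parallelize itself across the TPU cores:

```python
import jax
import numpy as np
from flax.jax_utils import replicate

p_params = replicate(params)  # one copy of the weights per device

def generate(prompt, seed=0):
    """Run the sharded pipeline and return a list of PIL images."""
    prompt_ids = tokenize_prompt(prompt)
    rng = jax.random.split(jax.random.PRNGKey(seed), jax.device_count())
    images = pipeline(prompt_ids, p_params, rng, jit=True).images
    # Collapse the (devices, batch, H, W, C) output before PIL conversion.
    images = images.reshape((images.shape[0] * images.shape[1],) + images.shape[-3:])
    return pipeline.numpy_to_pil(np.asarray(images))
```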
Stable Diffusion is a text-to-image latent diffusion model created by the researchers and engineers from CompVis, Stability AI and LAION. It is trained on 512x512 images from a subset of the LAION-5B database. The model uses a frozen CLIP ViT-L/14 text encoder to condition it on text prompts, combining an 860M-parameter UNet with a 123M-parameter text encoder. See the model card for more information.
General diffusion models are machine learning systems that are trained to denoise random Gaussian noise step by step, to get to a sample of interest, such as an image. Diffusion models have been shown to achieve state-of-the-art results for generating image data. But one downside of diffusion models is that the reverse denoising process is slow. In addition, these models consume a lot of memory because they operate in pixel space, which becomes unreasonably expensive when generating high-resolution images. Therefore, it is challenging to train these models and also use them for inference. OpenVINO brings capabilities to run model inference on Intel hardware and opens the door to the fantastic world of diffusion models for everyone!
The model's capabilities are not limited to text-to-image generation only; it is also able to solve additional tasks, for example text-guided image-to-image generation and inpainting. This tutorial also covers how to run text-guided image-to-image generation using Stable Diffusion.