Image generation with custom objects by training DreamBooth Lora on SSD-1B, Distilled Stable Diffusion XL 1.0

Custom Object Image Generation by Training DreamBooth Lora on SSD-1B

Ramsri Goutham
GoPenAI


Segmind Stable Diffusion Image Generation with Custom Objects

Segmind has open-sourced its latest marvel, the SSD-1B model.

SSD-1B is a distilled version of Stable Diffusion XL 1.0, delivering up to 60% faster inference and fine-tuning while being 50% smaller in size.

Since SSD-1B is compatible with SDXL 1.0, it can be used directly with the Hugging Face Diffusers library for inference and fine-tuning, just like SDXL 1.0. You can find the open-sourced code at Segmind’s SSD-1B repository.

In the previous blog post, we saw how to perform inference with Segmind’s Stable Diffusion (SSD-1B) using the Hugging Face Diffusers library.
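As a quick refresher, SSD-1B loads with the same DiffusionPipeline call you would use for SDXL 1.0. Here is a minimal sketch (see the previous post for a full walkthrough):

from diffusers import DiffusionPipeline
import torch

pipe = DiffusionPipeline.from_pretrained(
    "segmind/SSD-1B", torch_dtype=torch.float16, use_safetensors=True, variant="fp16"
)
pipe.to("cuda")
image = pipe("a photo of an astronaut riding a horse").images[0]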

In this blog post, we will see how to train SSD-1B to do image generation with custom objects!

Training Code

The Colab notebook for this tutorial can be found here.

The custom object I am going to train on is my kid’s bicycle! I took about 6 photos of the bicycle from different angles with my phone and later resized them to be smaller (around 1000 pixels in max width/height). For reference, you can download them from here.

The training images (my kid’s bicycle) captured with my phone.
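If you’d rather do the resizing in Python than with an external tool, a minimal sketch with Pillow looks like this (the raw_photos and resized folder names are just placeholders):

from PIL import Image
import glob, os

os.makedirs("resized", exist_ok=True)
for path in glob.glob("raw_photos/*.jpg"):  # folder with the original phone photos (placeholder)
    img = Image.open(path)
    img.thumbnail((1000, 1000))  # shrink so the longest side is at most ~1000 px, keeping aspect ratio
    img.save(os.path.join("resized", os.path.basename(path)))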

Let’s install the required libraries and download the DreamBooth LoRA training script:

!pip install xformers bitsandbytes transformers accelerate -q
!pip install git+https://github.com/huggingface/diffusers.git -q
!wget https://raw.githubusercontent.com/huggingface/diffusers/main/examples/dreambooth/train_dreambooth_lora_sdxl.py

Now upload the training images onto the Google Colab notebook instance via the Files tab. Create a folder named bicycle and move the images there.
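If you prefer doing this from a notebook cell rather than the Files tab, something like the following works, assuming the uploaded images land in the current working directory:

!mkdir -p bicycle
!mv *.jpg bicycle/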

Visualize the images to confirm that all of them were uploaded correctly.

from PIL import Image
import glob

def image_grid(imgs, rows, cols, resize=256):
    # Paste the images into a single rows x cols grid for a quick visual check
    assert len(imgs) == rows * cols

    if resize is not None:
        imgs = [img.resize((resize, resize)) for img in imgs]
    w, h = imgs[0].size
    grid = Image.new("RGB", size=(cols * w, rows * h))

    for i, img in enumerate(imgs):
        grid.paste(img, box=(i % cols * w, i // cols * h))
    return grid


imgs = [Image.open(path) for path in glob.glob("./bicycle/*.jpg")]
image_grid(imgs, 1, len(imgs))
Visualization of training images

Log in to Hugging Face if you haven’t already, and create an access token with WRITE permissions. Run the following code and enter the token when prompted:

!accelerate config default
!huggingface-cli login

Now start the training with the following code:

!accelerate launch train_dreambooth_lora_sdxl.py \
--pretrained_model_name_or_path="segmind/SSD-1B" \
--pretrained_vae_model_name_or_path="madebyollin/sdxl-vae-fp16-fix" \
--instance_data_dir="bicycle" \
--output_dir="lora-bicycle-SSD-1B" \
--mixed_precision="fp16" \
--instance_prompt="a photo of sks bicycle" \
--center_crop \
--resolution=1024 \
--train_batch_size=1 \
--gradient_accumulation_steps=2 \
--gradient_checkpointing \
--learning_rate=1e-4 \
--lr_scheduler="constant" \
--lr_warmup_steps=0 \
--use_8bit_adam \
--max_train_steps=500 \
--validation_prompt="A photo of sks bicycle on a mountain" \
--validation_epochs=300 \
--num_validation_images=1 \
--seed="0" \
--push_to_hub

Training takes about 20–25 minutes; I am using Colab with a V100 GPU. Sometimes you may encounter CUDA out-of-memory errors. In that case, see if you can get a better GPU (A100) on Colab, or adjust parameters such as the batch size or resolution, restart the runtime, and run it again.
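If memory is tight, the knobs I would reach for first are a lower --resolution, more gradient accumulation, and memory-efficient attention. Here is a hedged variant of the command above; flag availability can vary with the version of train_dreambooth_lora_sdxl.py you downloaded, so check its --help output first:

!accelerate launch train_dreambooth_lora_sdxl.py \
--pretrained_model_name_or_path="segmind/SSD-1B" \
--pretrained_vae_model_name_or_path="madebyollin/sdxl-vae-fp16-fix" \
--instance_data_dir="bicycle" \
--output_dir="lora-bicycle-SSD-1B" \
--mixed_precision="fp16" \
--instance_prompt="a photo of sks bicycle" \
--center_crop \
--resolution=768 \
--train_batch_size=1 \
--gradient_accumulation_steps=4 \
--gradient_checkpointing \
--learning_rate=1e-4 \
--lr_scheduler="constant" \
--lr_warmup_steps=0 \
--use_8bit_adam \
--enable_xformers_memory_efficient_attention \
--max_train_steps=500 \
--seed="0" \
--push_to_hub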

If everything is successful, the LoRA will be uploaded to the Hugging Face Hub. An example from this run can be seen here.

The beauty is that these LoRAs are extremely lightweight (10.9 MB), so you can store, load, and run inference on top of the base model (SSD-1B) easily.

Inference Code

Now that we have trained our model, let us do inference with it.

!pip install ipyplot
import torch
import ipyplot
from diffusers import DiffusionPipeline, AutoencoderKL

# Use the fp16-fixed SDXL VAE to avoid artifacts when decoding in half precision
vae = AutoencoderKL.from_pretrained("madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16)
pipe = DiffusionPipeline.from_pretrained(
    "segmind/SSD-1B",
    vae=vae,
    torch_dtype=torch.float16,
    variant="fp16",
    use_safetensors=True,
)
# Attach the DreamBooth LoRA weights trained above
pipe.load_lora_weights("ramsrigouthamg/lora-bicycle-SSD-1B")

pipe.to("cuda")



# prompt = "A sks bicycle bicycle on a sandy beach, with a backdrop of a setting sun, 4k, cinematic photo, highly detailed"
# prompt = "A sks bicycle against a brick wall covered in vibrant graffiti, wide shot, 4k, cinematic photo, highly detailed"
# prompt = "a sks bicycle near camp fire, under canopy of stars, wide shot, 4k, cinematic photo, highly detailed"
prompt = "a sks bicycle with a light dusting of snow on it, standing against a snowy landscape, 4k, cinematic photo, highly detailed"
# prompt = "A smiling teddy bear riding a sks bicycle, 4k, cinematic photo, highly detailed"

neg_prompt = "ugly, tiling, poorly drawn hands, poorly drawn feet, poorly drawn face, out of frame, extra limbs, disfigured, deformed, body out of frame, blurry, bad anatomy, blurred, watermark, grainy, signature, cut off, draft"

allimages = pipe(prompt=prompt, negative_prompt=neg_prompt, guidance_scale=7.5,
                 num_inference_steps=30, num_images_per_prompt=4).images

# Save each generated image and display them in a grid
for i, image in enumerate(allimages):
    filename = f"image{i+1}.jpg"
    image.save(filename)

ipyplot.plot_images(allimages, img_width=400)

It generates images like these:

Prompt: a sks bicycle with a light dusting of snow on it, standing against a snowy landscape, 4k, cinematic photo, highly detailed

Though not perfect, we got pretty close: the top two images capture the main frame in yellow and the tire frames in green.

You can try a few other prompts by uncommenting one of the alternative prompt lines in the code above.
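If you want to compare these results against the base SSD-1B (or swap in a different LoRA), recent Diffusers releases provide helpers for detaching and re-attaching LoRA weights. A minimal sketch, assuming your installed version supports them:

# Generate the same prompt without the bicycle LoRA, for comparison
pipe.unload_lora_weights()
base_images = pipe(prompt=prompt, negative_prompt=neg_prompt, guidance_scale=7.5,
                   num_inference_steps=30, num_images_per_prompt=2).images

# Re-attach the trained LoRA when you want the custom bicycle back
pipe.load_lora_weights("ramsrigouthamg/lora-bicycle-SSD-1B")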

These are a few more samples generated:

Images generated with Dreambooth LORA training using SSD-1B

As you can see, the results are very interesting! The bicycle I trained on is a bit complex, but with just 6 images we got most of the details right. You can enhance the quality by training on a better GPU with a higher batch size and a greater number of sample images.

The generation quality is even better with more generic objects like cats and dogs. Below are the images I generated by training on a dog breed and then using the prompt: A sks dog in the snow

Images generated by training on a dog breed.

Happy AI exploration, and if you enjoyed this post, feel free to follow me on Twitter for daily AI content!
