AI-Automated Editing of Short-Form Videos — Add B-roll Image Footage in One Click

Tutorial using OpenAI’s ChatGPT API and SDXL API from Segmind.com

Ramsri Goutham
11 min read · Oct 21, 2023
Add B-roll to short-form videos using AI

Introduction

Before delving into the topic of automated video editing, let’s familiarize ourselves with a term that’s central to our discussion: B-roll. Imagine you’re watching a documentary about a chef. The main footage (A-roll) might involve the chef talking about their passion for cooking. But to enhance the story, you might also see shots of sizzling pans, close-ups of ingredients, or scenic views of the marketplace. These supplementary shots that provide a visual break from the primary narrative are what we refer to as B-roll footage. B-roll not only adds depth and context to a video but also makes the story more engaging and visually appealing.

Sample B-roll images as collage generated with AI

Short-form videos have surged in popularity, thanks to platforms like YouTube Shorts and Instagram Reels. Creators are constantly seeking ways to enhance and streamline their production processes. Traditional short-form video editing can require hours of meticulous work: identifying the sections where B-roll should go, finding B-roll footage (images/videos) that fits the narrative, and editing it into the video.

Imagine being able to use AI to identify the spots in the video where B-roll can be inserted and to generate text descriptions of the relevant B-roll footage. With the advancement of text-to-image APIs, one can now automatically generate relevant, fitting B-roll images from those text descriptions alone. That is exactly what we are attempting here: automated B-roll addition to short-form videos in one click!

In this blog post, we will use OpenAI’s ChatGPT API and Segmind’s Stable Diffusion XL (SDXL) to achieve this. Let’s get started!

A sample auto-edited video using the techniques described in this blog can be seen here:

While it is not 100% perfect, it gives you an end-to-end pipeline in which you can swap out individual steps for ones better suited to your use case.

Step 1: The Theory

Given a short-form video, we will first use an open-source speech-to-text library based on OpenAI’s Whisper to extract the transcript along with word-level timestamps for every word spoken in the video.

A sample transcript could be this -

Now coming to SuperMeme, with all these learnings, I have one idea, okay question is for teachers, quizzes, grade, but let me take a step back, how do you get massive virality, one, anyone on the street should be able to understand that, two, only a handful of people should be able to build that, for example just think about this question, which genre of movie is the highest grosses in the world, action, I mean genre as such, most people don’t even think like this, okay romance, this horror, etc they think, but there is one fundamental truth which is action predominantly applies to a broad range of audience and maybe you make 2 billion dollars or 1 .5 billion dollars from that, so it has the highest target audience.

A sample word-level timestamps JSON could be this -

[{"word": " Now", "start": 0.0, "end": 0.38}, {"word": " coming", "start": 0.38, "end": 0.74}, ... ]

With the above information, we will use OpenAI’s GPT-4 to generate a JSON with image descriptions as well as start and end timestamps. A sample can be seen here:

[{'description': 'A teacher engaging a classroom quiz',
  'end': 7.84,
  'start': 4.2},
 {'description': 'Person shouting news in a crowded street',
  'end': 21.32,
  'start': 18.64},
 {'description': 'Unique construction work - architect working on a rare building design',
  'end': 26.52,
  'start': 23.82},
 {'description': 'Different choices of movie categories',
  'end': 31.0,
  'start': 27.04},
 {'description': 'A crowded theatre watching a genre movie',
  'end': 37.94,
  'start': 33.18},
 {'description': 'People interacting less with romance and horror section of the movie categorization on streaming service.',
  'end': 47.24,
  'start': 42.12},
 {'description': 'Multi-billionaire getting champagne popped at him, with fireworks in background.',
  'end': 57.6,
  'start': 54.08}]

We will use Segmind’s Stable Diffusion XL 1.0 API to generate images from the descriptions.

description to image using Segmind’s API

Then we will use the MoviePy library to overlay the B-roll images from above onto the video at the appropriate timestamps, as well as add subtitles to it.

A screenshot from the video after the B-roll and subtitles are added

Step 2: Signup for services and get the API Keys

To get started, you’ll need to sign up for two services: Segmind and OpenAI. We are going to use OpenAI’s ChatGPT (GPT-4) API to extract relevant illustration descriptions from the text and Segmind’s text-to-image API to generate high-quality illustrations from those descriptions.

Get OpenAI’s API Key

Go to platform.openai.com
Sign up or log in.
Click on the top right to go to “View API Keys”.
Create a new secret key if you don’t have one already and save it securely.
We will be using GPT-4 instead of GPT-3.5 as the complexity of the task is higher and the GPT-4 model yields better results.

Note that since we are using GPT-4, you need to add at least $0.50 of credit with your card to start using the GPT-4 model from OpenAI.

Get Segmind’s API Key

Go to Segmind.com and log in/sign up.
Click on the top right to go to the console.
Once in the console, click on the “API keys” tab and “Create New API Key”.
You get a few free credits daily, but for uninterrupted usage, you can go to “billing” and “add credits” by paying with your card.

If you want to know how much each Segmind API call costs, you can go to the corresponding model’s pricing tab. An example of SDXL pricing is shown here.

Step 3: The Code

The Google Colab notebook containing the full code can be found here.

Install the necessary Python libraries and enter your API keys from the above step when prompted!

!pip install faster-whisper==0.9.0
!pip install ffmpeg-python==0.2.0
!pip install --quiet segmind==0.2.3
!pip install --quiet ipyplot
!pip install --quiet git+https://github.com/Zulko/moviepy.git@bc8d1a831d2d1f61abfdf1779e8df95d523947a5
!pip install --quiet imageio==2.25.1
!apt install -qq imagemagick
# Loosen the ImageMagick security policy so MoviePy's TextClip can render text
!sed -i 's/none/read,write/g' /etc/ImageMagick-6/policy.xml

from getpass import getpass
openaikey = getpass('Enter the openai key: ')
segmindkey = getpass('Enter the segmind API key: ')

Now, on your Google Colab instance, upload your short-form video file. Depending on your internet connection, the upload might take a few minutes to finish.

Note that this tutorial assumes the video file is in .mp4 format and uses the standard vertical size of a YouTube Short/Instagram Reel — 1080 × 1920 pixels (width × height). Although the code is parametrized for other vertical resolutions, you may make any necessary changes to the code according to your custom requirements.

The sample video (SaaS.mp4) used in this tutorial can be downloaded here.
Replace the variable ‘video_file’ with the name of your file uploaded to the Colab instance.

Define the extract_audio_from_video function, which extracts an .mp3 file from the uploaded .mp4 file. We will use it in the next step to extract the transcript and word-level timestamps.

# Ideally make sure that the video is 1920 (height) by 1080 (width),
# the standard size for vertical Short videos.
video_file = "SaaS.mp4"

chatgpt_url = "https://api.openai.com/v1/chat/completions"
chatgpt_headers = {
    "content-type": "application/json",
    "Authorization": "Bearer {}".format(openaikey)}

from faster_whisper import WhisperModel
import ffmpeg

def extract_audio_from_video(outvideo):
    """
    Extract audio from a video file and save it as an MP3 file.

    :param outvideo: Path to the video file.
    :return: Path to the generated audio file.
    """
    audiofilename = outvideo.replace(".mp4", '.mp3')

    # Create the ffmpeg input stream
    input_stream = ffmpeg.input(outvideo)

    # Extract the audio stream from the input stream
    audio = input_stream.audio

    # Save the audio stream as an MP3 file
    output_stream = ffmpeg.output(audio, audiofilename)

    # Overwrite the output file if it already exists
    output_stream = ffmpeg.overwrite_output(output_stream)

    ffmpeg.run(output_stream)

    return audiofilename


audiofilename = extract_audio_from_video(video_file)
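
As a quick optional check that the extraction worked, you can play the resulting .mp3 inline in the notebook using IPython’s built-in audio widget (this cell is an optional addition, not part of the pipeline itself):

# Optional: preview the extracted audio inline in Colab
from IPython.display import Audio
Audio(audiofilename)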

Now let’s use the faster_whisper library, which is based on OpenAI’s Whisper, to perform speech-to-text and get word-level timestamps from the audio file (.mp3) extracted above.

from faster_whisper import WhisperModel

model_size = "medium"
model = WhisperModel(model_size)
segments, info = model.transcribe(audiofilename, word_timestamps=True)
segments = list(segments)

wordlevel_info = []

for segment in segments:
    for word in segment.words:
        wordlevel_info.append({'word': word.word, 'start': word.start, 'end': word.end})

Using the above code, we get wordlevel_info. A sample output is shown below:

[{'word': ' Now', 'start': 0.0, 'end': 0.38},
{'word': ' coming', 'start': 0.38, 'end': 0.74},
{'word': ' to', 'start': 0.74, 'end': 1.0},
{'word': ' SuperMeme,', 'start': 1.0, 'end': 1.52},
{'word': ' with', 'start': 1.8, 'end': 2.04},
{'word': ' all', 'start': 2.04, 'end': 2.34},
{'word': ' these', 'start': 2.34, 'end': 2.64},
{'word': ' learnings,', 'start': 2.64, 'end': 3.66},
.......
]

Next, we pass this wordlevel_info along with the transcript to OpenAI’s GPT-4, using a prompt designed to return B-roll image descriptions together with their start and end timestamps. Note that the transcript is just a concatenation of all the individual words in wordlevel_info.

import requests
import json
from pprint import pprint

transcript = " ".join([word.word.strip() for segment in segments for word in segment.words])

def fetch_broll_description(prompt, url, headers):

    # Define the payload for the chat model
    messages = [
        {"role": "system", "content": "You are an expert short form video script writer for Instagram Reels and Youtube shorts."},
        {"role": "user", "content": prompt}
    ]

    chatgpt_payload = {
        "model": "gpt-4",
        "messages": messages,
        "temperature": 1.3,
        "max_tokens": 2000,
        "top_p": 1,
        "stop": ["###"]
    }

    # Make the request to OpenAI's API
    response = requests.post(url, json=chatgpt_payload, headers=headers)
    response_json = response.json()

    print("response ", response_json['choices'][0]['message']['content'])

    # Extract data from the API's response
    output = json.loads(response_json['choices'][0]['message']['content'].strip())
    print("output ", output)

    return output


prompt_prefix = """{}
transcript: {}
------------------
Given this transcript and corresponding word-level timestamp information, generate very relevant stock image descriptions to insert as B-roll images.
The start and end timestamps of the B-roll images should perfectly match with the content that is spoken at that time.
Strictly don't include any exact word or text labels to be depicted in the images.
Don't make the timestamps of different illustrations overlap.
Leave enough time gap between different B-Roll image appearances so that the original footage is also played as necessary.
Strictly output only JSON in the output using the format-""".format(json.dumps(wordlevel_info), transcript)

sample = [
    {"description": "...", "start": "...", "end": "..."},
    {"description": "...", "start": "...", "end": "..."}
]

prompt = prompt_prefix + json.dumps(sample) + """\nMake the start and end timestamps a minimum duration of more than 3 seconds.
Also, place them at the appropriate timestamp position where the relevant context is being spoken in the transcript.\nJSON:"""

print(prompt)

broll_descriptions = fetch_broll_description(prompt, chatgpt_url, chatgpt_headers)
print("broll_descriptions: ", broll_descriptions)

We get the following JSON output directly from GPT-4, corresponding to the B-roll descriptions along with their timestamps.

[
  {
    "description": "A teacher engaging a classroom quiz",
    "start": 4.2,
    "end": 7.84
  },
  {
    "description": "Person shouting news in a crowded street",
    "start": 18.64,
    "end": 21.32
  },
  {
    "description": "Unique construction work - architect working on a rare building design",
    "start": 23.82,
    "end": 26.52
  },
  {
    "description": "Different choices of movie categories",
    "start": 27.04,
    "end": 31.0
  },
  {
    "description": "A crowded theatre watching a genre movie",
    "start": 33.18,
    "end": 37.94
  },
  {
    "description": "People interacting less with romance and horror section of the movie categorization on streaming service.",
    "start": 42.12,
    "end": 47.24
  },
  {
    "description": "Multi-billionaire getting champagne popped at him, with fireworks in background.",
    "start": 54.08,
    "end": 57.6
  }
]
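
LLM output is not guaranteed to respect the prompt’s constraints (non-overlapping clips, timestamps within the video), so a quick sanity check before spending image-generation credits can be worthwhile. The helper below is a hypothetical addition, not part of the original notebook; video_duration can be taken from MoviePy’s VideoFileClip(video_file).duration.

def validate_broll_timestamps(broll, video_duration, min_duration=0.0):
    """Keep only non-overlapping B-roll entries that fall inside the video."""
    cleaned = []
    last_end = 0.0
    # Sort by start time, then drop anything that overlaps or runs past the video
    for item in sorted(broll, key=lambda x: float(x['start'])):
        start, end = float(item['start']), float(item['end'])
        if start >= last_end and end <= video_duration and (end - start) >= min_duration:
            cleaned.append({'description': item['description'], 'start': start, 'end': end})
            last_end = end
    return cleaned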

Next, we use Segmind’s Stable Diffusion XL 1.0 API to convert each of these descriptions into a generated image. For the style parameter, you can experiment with the various available styles at https://www.segmind.com/models/sdxl1.0-txt2img to see which suits your video better!

from segmind import SDXL
model = SDXL(segmindkey)

import os
import io
import requests
from PIL import Image
import random

def generate_images(descriptions, style):
    all_images = []

    num_images = len(descriptions)

    currentseed = random.randint(1, 1000000)
    print("seed ", currentseed)

    negative_prompt = "((deformed)), ((limbs cut off)), ((quotes)), ((extra fingers)), ((deformed hands)), extra limbs, disfigured, blurry, bad anatomy, absent limbs, blurred, watermark, disproportionate, grainy, signature, cut off, missing legs, missing arms, poorly drawn face, bad face, fused face, cloned face, worst face, three crus, extra crus, fused crus, worst feet, three feet, fused feet, fused thigh, three thigh, fused thigh, extra thigh, worst thigh, missing fingers, extra fingers, ugly fingers, long fingers, horn, extra eyes, amputation, disconnected limbs, glitch, low contrast, noisy"

    for i, prompt in enumerate(descriptions):

        final_prompt = "((perfect quality)), 4k, {}, no occlusion, highly detailed,".format(prompt.replace('.', ","))
        img = model.generate(prompt=final_prompt, negative_prompt=negative_prompt, samples=1, style=style, scheduler="UniPC",
                             seed=currentseed, num_inference_steps=30)

        print(f"Image {i + 1}/{num_images} is generated")
        # img will be a PIL image
        all_images.append(img)

    return all_images

style = "base"
# style = "photographic"
descriptions = [item['description'] for item in broll_descriptions]
all_images = generate_images(descriptions, style)
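
Since each API call costs credits, it can be worth persisting the generated B-roll images to disk so you can inspect them (and re-run only the editing step) without regenerating them. This is an optional addition, assuming each entry in all_images is a PIL image as noted in the code comments:

# Optional: save the generated B-roll images for inspection and reuse
for i, img in enumerate(all_images):
    img.save(f"broll_{i}.png")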

The following images are generated by Segmind’s API for the above descriptions from GPT-4.

Images generated using Segmind.com’s SDXL 1.0 API

Next, we use Python’s MoviePy library to overlay these images on top of the original video at their respective timestamps, as well as overlay word-by-word subtitles from the timestamps we extracted earlier with OpenAI’s Whisper.

import cv2
from PIL import Image
import numpy as np
from moviepy.editor import AudioFileClip, concatenate_audioclips, concatenate_videoclips, ImageClip


def create_combined_clips(allimages, b_roll_descriptions, output_resolution=(1080, 1920), fps=24):
    video_clips = []

    # Iterate over the images and descriptions
    for img, item in zip(allimages, b_roll_descriptions):
        img = np.array(img)
        img_resized = cv2.resize(img, (output_resolution[0], output_resolution[0]))

        start, end = item['start'], item['end']
        duration = end - start

        # Blur the image to use as a full-frame background
        blurred_img = cv2.GaussianBlur(img, (0, 0), 30)
        blurred_img = cv2.resize(blurred_img, output_resolution)
        blurred_img = cv2.cvtColor(blurred_img, cv2.COLOR_BGR2RGB)

        # Overlay the original (square) image, vertically centered, on the blurred one
        y_offset = (output_resolution[1] - output_resolution[0]) // 2
        blurred_img[y_offset:y_offset + output_resolution[0], :] = img_resized

        cv2.imwrite("test_blurred_image.jpg", blurred_img)

        video_clip = ImageClip(np.array(blurred_img)).with_position('center').with_duration(end - start)
        video_clip = video_clip.with_start(start)
        video_clips.append(video_clip)

    return video_clips


from moviepy.editor import VideoFileClip, TextClip, CompositeVideoClip

# Function to generate text clips for the subtitles
def generate_text_clip(word, start, end, video):
    txt_clip = (TextClip(word, font_size=80, color='white', font="Nimbus-Sans-Bold", stroke_width=3, stroke_color='black')
                .with_position("center")
                .with_duration(end - start))

    return txt_clip.with_start(start)

# Load the video file
video = VideoFileClip(video_file)

print(video.size)

# Generate a list of B-roll image clips based on their timestamps
clips = create_combined_clips(all_images, broll_descriptions, output_resolution=video.size, fps=24)

add_subtitles = True

if add_subtitles:
    # Generate a list of text clips based on the word-level timestamps
    wordclips = [generate_text_clip(item['word'], item['start'], item['end'], video) for item in wordlevel_info]

    # Overlay the image and text clips on the video
    final_video = CompositeVideoClip([video] + clips + wordclips)
else:
    final_video = CompositeVideoClip([video] + clips)

finalvideoname = "final.mp4"
# Write the result to a file
final_video.write_videofile(finalvideoname, codec="libx264", audio_codec="aac")

With this, we get the final video with the B-roll overlaid!
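
If you are running this in Google Colab, you can pull the rendered file down to your machine with Colab’s file helper (optional):

# Optional (Colab only): download the rendered video to your machine
from google.colab import files
files.download(finalvideoname)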

There is definitely room for further improvement:

  1. Instead of using OpenAI’s Whisper “medium”, you can use “large” or any other speech-to-text API like Deepgram or AssemblyAI to improve transcription accuracy (see the sketch after this list).
  2. You can experiment with a more advanced prompt as well as other LLM APIs like Anthropic’s.
  3. You can use more fine-tuned SDXL 1.0 models like https://www.segmind.com/models/sdxl1.0-realvis or https://www.segmind.com/models/sdxl1.0-timeless for more realistic image generation. Look at this blog for photorealistic image generation APIs.
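
For point 1, swapping the Whisper checkpoint is a one-line change in the transcription cell; for example, “large-v2” is one of the larger checkpoints supported by faster-whisper:

# Swap in a larger Whisper checkpoint for better transcription accuracy
model = WhisperModel("large-v2")
segments, info = model.transcribe(audiofilename, word_timestamps=True)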

Happy AI exploration! If you loved the content, feel free to follow me on Twitter for daily AI content.
